Ethan Wilansky created TIKA-3890:
------------------------------------
Summary: Identifying an efficient approach for getting page count
prior to running an extraction
Key: TIKA-3890
URL: https://issues.apache.org/jira/browse/TIKA-3890
Project: Tika
Issue Type: Improvement
Components: app
Affects Versions: 2.5.0
Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores
Docker container with 5.5GB reserved memory, 6GB limit
Tika config w/ 2GB reserved memory, 5GB limit
Reporter: Ethan Wilansky
Tika does a great job with text extraction until we encounter an Office
document with an unreasonably large number of pages of extractable text, for
example a Word document containing thousands of pages of text. Unfortunately, we
don't have an efficient way to determine the page count before calling the /tika
or /rmeta endpoints. We either get back a record size error, or we have to set
byteArrayMaxOverride to a large number so that the call returns the text or the
metadata containing the page count. In both of those cases (text or metadata),
the call can take significant time to return a result.
For example, this call:
{{curl -T ./8mb.docx -H "Content-Type:
application/vnd.openxmlformats-officedocument.wordprocessingml.document"
http://localhost:9998/rmeta/ignore}}
with the configuration:
{quote}
{{<?xml version="1.0" encoding="UTF-8" standalone="no"?>}}
{{<properties>}}
{{  <parsers>}}
{{    <parser class="org.apache.tika.parser.DefaultParser">}}
{{      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>}}
{{      <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>}}
{{    </parser>}}
{{    <parser class="org.apache.tika.parser.microsoft.OfficeParser">}}
{{      <params>}}
{{        <param name="byteArrayMaxOverride" type="int">175000000</param>}}
{{      </params>}}
{{    </parser>}}
{{  </parsers>}}
{{  <server>}}
{{    <params>}}
{{      <taskTimeoutMillis>120000</taskTimeoutMillis>}}
{{      <forkedJvmArgs>}}
{{        <arg>-Xms2000m</arg>}}
{{        <arg>-Xmx5000m</arg>}}
{{      </forkedJvmArgs>}}
{{    </params>}}
{{  </server>}}
{{</properties>}}
{quote}
returns: {{"xmpTPg:NPages":"14625"}} in ~53 seconds.
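For anyone who wants to reproduce the timing without curl, here is a minimal Python sketch of the same call. It assumes the server from the example is listening on localhost:9998 and that /rmeta returns a JSON list whose first record describes the container document. The helper names ({{page_count_from_rmeta}}, {{fetch_page_count}}) are illustrative only and are not part of Tika.

```python
import json
import urllib.request

# MIME type used in the curl example above.
DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def page_count_from_rmeta(body):
    """Return xmpTPg:NPages from an /rmeta JSON body, or None if it is absent."""
    records = json.loads(body)
    if not records:
        return None
    # Assumption: the first record holds the container document's metadata.
    value = records[0].get("xmpTPg:NPages")
    return int(value) if value is not None else None


def fetch_page_count(path, url="http://localhost:9998/rmeta/ignore", timeout=120):
    """PUT the file to /rmeta/ignore (like curl -T) and return the page count."""
    with open(path, "rb") as f:
        data = f.read()
    request = urllib.request.Request(
        url, data=data, method="PUT", headers={"Content-Type": DOCX_MIME}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return page_count_from_rmeta(response.read())


# Example (requires a running tika-server):
# fetch_page_count("./8mb.docx")
```

This still pays the full parse cost on the server, so it only illustrates the slow path described above.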
Yes, I know this is a huge docx file, and I don't want to process it. If I don't
configure {{byteArrayMaxOverride}}, I get this exception in just over a second:
{{Tried to allocate an array of length 172,983,026, but the maximum length for
this record type is 100,000,000.}}
The exception is the preferred result. With that in mind, can you answer these
questions?
1. Will other extractable file types that don't use the OfficeParser also throw
the same array allocation error for very large text extractions?
2. Is there any way to correlate the reported array length with the number of
lines or pages in the file being parsed?
3. Is there an efficient way to calculate the lines or pages of extractable
content in a file before sending it for extraction? Calling /rmeta with the
/ignore path param doesn't appear to be significantly more efficient than
calling the /tika endpoint or /rmeta without /ignore.
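To make question 3 concrete, the kind of cheap pre-check I have in mind might look like the sketch below, for .docx only. It is not something Tika provides. It assumes the package contains {{docProps/app.xml}} with a {{Pages}} element in the OOXML extended-properties namespace, which holds whatever value the authoring application last saved and so may be missing or stale. The function name {{stored_page_count}} is hypothetical.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Namespace of the OOXML extended properties part (docProps/app.xml).
EXT_PROPS_NS = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"


def stored_page_count(docx_bytes):
    """Return the page count stored in docProps/app.xml, or None if unavailable.

    Only the small properties part is decompressed; the document body is not parsed.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        try:
            app_xml = package.read("docProps/app.xml")
        except KeyError:
            # The properties part is optional, so it may not be present.
            return None
    pages = ET.fromstring(app_xml).find(f"{{{EXT_PROPS_NS}}}Pages")
    if pages is None or not (pages.text or "").strip():
        return None
    try:
        return int(pages.text)
    except ValueError:
        return None


# Example:
# with open("./8mb.docx", "rb") as f:
#     print(stored_page_count(f.read()))
```

A check like this could let a client skip oversized files before calling the server, but I don't know how reliable the stored value is across producers.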
If it's useful, I can share the 8 MB docx file containing 14k pages.