[jira] [Created] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction

Ethan Wilansky (Jira) Wed, 19 Oct 2022 14:21:03 -0700

Ethan Wilansky created TIKA-3890:
------------------------------------

             Summary: Identifying an efficient approach for getting page count 
prior to running an extraction
                 Key: TIKA-3890
                 URL: https://issues.apache.org/jira/browse/TIKA-3890
             Project: Tika
          Issue Type: Improvement
          Components: app
    Affects Versions: 2.5.0
         Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores
Docker container with 5.5GB reserved memory, 6GB limit
Tika config w/ 2GB reserved memory, 5GB limit 
            Reporter: Ethan Wilansky



Tika does a great job with text extraction until we encounter an Office 
document with an unreasonably large number of pages of extractable text, for 
example a Word document containing thousands of pages of text. Unfortunately, we 
don't have an efficient way to determine the page count before calling the /tika 
or /rmeta endpoint. We either get back a record size error, or we set 
byteArrayMaxOverride to a large number so that the call returns the text or the 
metadata containing the page count. Whether we ask for the text or the metadata, 
the call can take significant time to return a result.

For example, this call:
{{curl -T ./8mb.docx -H "Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document" http://localhost:9998/rmeta/ignore}}
with the configuration:
{quote}
{{<?xml version="1.0" encoding="UTF-8" standalone="no"?>}}
{{<properties>}}
{{  <parsers>}}
{{    <parser class="org.apache.tika.parser.DefaultParser">}}
{{      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>}}
{{      <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>}}
{{    </parser>}}
{{    <parser class="org.apache.tika.parser.microsoft.OfficeParser">}}
{{      <params>}}
{{        <param name="byteArrayMaxOverride" type="int">175000000</param>}}
{{      </params>}}
{{    </parser>}}
{{  </parsers>}}
{{  <server>}}
{{    <params>}}
{{      <taskTimeoutMillis>120000</taskTimeoutMillis>}}
{{      <forkedJvmArgs>}}
{{        <arg>-Xms2000m</arg>}}
{{        <arg>-Xmx5000m</arg>}}
{{      </forkedJvmArgs>}}
{{    </params>}}
{{  </server>}}
{{</properties>}}
{quote}
returns: {{"xmpTPg:NPages":"14625"}} in ~53 seconds.
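
As a rough sketch, the measurement above could be scripted as follows. It assumes a tika-server listening on localhost:9998 as in the curl call, and that /rmeta returns a JSON array whose first element is the container document's metadata. The function names are hypothetical and not part of Tika.

```python
import json
import time
import urllib.request

DOCX_MIME = ("application/vnd.openxmlformats-officedocument"
             ".wordprocessingml.document")


def page_count_from_rmeta(body):
    """Return xmpTPg:NPages from an /rmeta JSON body, or None if absent."""
    records = json.loads(body)
    if not records:
        return None
    value = records[0].get("xmpTPg:NPages")
    if isinstance(value, list):  # tolerate a multi-valued field
        value = value[0] if value else None
    return int(value) if value is not None else None


def timed_page_count(path, url="http://localhost:9998/rmeta/ignore"):
    """PUT the file (as curl -T does) and return (pages, elapsed seconds)."""
    with open(path, "rb") as fh:
        data = fh.read()
    request = urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Content-Type": DOCX_MIME, "Accept": "application/json"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(request) as response:
        body = response.read().decode("utf-8")
    return page_count_from_rmeta(body), time.monotonic() - start
```

Calling {{timed_page_count("./8mb.docx")}} against a running server would reproduce the page count and timing reported here. The parsing helper can be exercised on its own without a server.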

Yes, I know this is a huge docx file, and I don't want to process it. If I don't 
configure {{byteArrayMaxOverride}}, I get this exception in just over a second:

{{Tried to allocate an array of length 172,983,026, but the maximum length for 
this record type is 100,000,000.}}

Failing fast with this exception is the preferred result. With that in mind, can 
you answer these questions?
1. Will other extractable file types that don't use the OfficeParser also throw 
the same array allocation error for very large text extractions?
2. Is there any way to correlate the array length reported in the exception to 
the number of lines or pages in the file being parsed?
3. Is there an efficient way to calculate the number of lines or pages of 
extractable content in a file before sending it for extraction? Using /rmeta 
with the /ignore path param doesn't appear to be significantly more efficient 
than calling the /tika endpoint or /rmeta without /ignore.
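
To make question 2 concrete, here is a minimal sketch that pulls the two numbers out of the exception text and relates the requested length to the page count reported for the same file. The helper name is hypothetical, the regular expression assumes the message wording quoted in this report, and a single file is of course not enough to establish a general correlation.

```python
import re

ALLOC_RE = re.compile(
    r"Tried to allocate an array of length ([\d,]+), "
    r"but the maximum length for this record type is ([\d,]+)"
)


def parse_allocation_error(message):
    """Return (requested, limit) as ints, or None if the text does not match."""
    # Logs and emails may wrap the message, so collapse whitespace first.
    match = ALLOC_RE.search(" ".join(message.split()))
    if match is None:
        return None
    requested, limit = (int(group.replace(",", "")) for group in match.groups())
    return requested, limit


message = (
    "Tried to allocate an array of length 172,983,026, but the maximum "
    "length for this record type is 100,000,000."
)
requested, limit = parse_allocation_error(message)

# 14625 is the xmpTPg:NPages value reported for the same file.
bytes_per_page = requested / 14625
print(requested, limit, round(bytes_per_page))
```

For this one document the ratio works out to a little under 12 KB of requested array per page. Whether that ratio holds for other documents is exactly the open question.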

If it's useful, I can share the 8MB docx file containing 14k pages.
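
One possible client-side pre-check for question 3, limited to OOXML files such as docx, is to read the page count that the authoring application stored in docProps/app.xml, without parsing the document body. This is only a sketch under assumptions: the helper name is hypothetical, the stored value can be missing or stale, and it is not confirmed here that this is the same value Tika reports as xmpTPg:NPages.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the OOXML extended-properties part (docProps/app.xml).
EXT_PROPS_NS = (
    "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"
)


def docx_declared_pages(source):
    """Return the <Pages> value from docProps/app.xml, or None if unavailable.

    `source` is a path or a binary file-like object. Only the small app.xml
    entry is read; the main document part is never decompressed.
    """
    with zipfile.ZipFile(source) as archive:
        try:
            raw = archive.read("docProps/app.xml")
        except KeyError:
            return None  # the package has no extended-properties part
    root = ET.fromstring(raw)
    node = root.find(f"{{{EXT_PROPS_NS}}}Pages")
    text = (node.text or "").strip() if node is not None else ""
    return int(text) if text.isdigit() else None
```

A caller could skip or reject a file when the declared count exceeds some threshold and fall back to the Tika call when the function returns None. Since the input is untrusted, a hardened XML parser may be preferable to the standard library one in production.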



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
