Xiaohong Yang created TIKA-3519:
-----------------------------------

             Summary: Wonder if you can add a feature for the Tika parser to
stop reading metadata and body content if a certain amount of memory or body
content has been reached
                 Key: TIKA-3519
                 URL: https://issues.apache.org/jira/browse/TIKA-3519
             Project: Tika
          Issue Type: Wish
          Components: detector
    Affects Versions: 1.26, 1.25
         Environment: Linux
            Reporter: Xiaohong Yang


We use org.apache.tika.parser.AutoDetectParser to get the metadata and body 
content of MS Office files. We encountered the following exception with some 
files:

 

Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an 
array of length 14523048, but 5000000 is the maximum for this record type. If 
the file is not corrupt, please open an issue on bugzilla to request increasing 
the maximum allowable size for this record type. As a temporary workaround, 
consider setting a higher override value with IOUtils.setByteArrayMaxOverride()

 

To resolve the problem, we set byteArrayMaxOverride in the tika-config.xml file 
as follows:

 

              <parser class="org.apache.tika.parser.microsoft.OfficeParser">
                <params>
                  <param name="byteArrayMaxOverride" type="int">20000000</param>
                </params>
              </parser>

 

This allowed us to parse some files that had failed previously, but other files 
still failed, so we increased the value to 200 MB and then to 500 MB.
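
To illustrate how a limit of this kind behaves, here is a minimal, library-independent sketch of a guarded allocation with an optional override. It is not POI's actual code: the names AllocationGuard, setMaxOverride and checkedAllocate are hypothetical. It only shows why a record of length 14523048 is rejected under a maximum of 5000000 but accepted once the override is raised to 20000000.

```java
/** Illustration only: hypothetical names, not the POI implementation. */
final class AllocationGuard {
    // A negative value means "no override configured".
    private static int maxOverride = -1;

    static void setMaxOverride(int value) {
        maxOverride = value;
    }

    /** Allocates a buffer unless the requested length exceeds the effective limit. */
    static byte[] checkedAllocate(long length, int defaultMax) {
        // The override, when set, replaces the per-record default maximum.
        int effectiveMax = maxOverride >= 0 ? maxOverride : defaultMax;
        if (length < 0 || length > effectiveMax) {
            throw new IllegalStateException("Tried to allocate an array of length "
                    + length + ", but " + effectiveMax + " is the maximum");
        }
        return new byte[(int) length];
    }

    public static void main(String[] args) {
        try {
            checkedAllocate(14_523_048L, 5_000_000);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        setMaxOverride(20_000_000);
        byte[] buffer = checkedAllocate(14_523_048L, 5_000_000);
        System.out.println("allocated " + buffer.length + " bytes");
    }
}
```

A guard like this only checks each allocation in isolation. Raising the limit does not bound the total memory used, which fits with the out-of-memory errors we see at very high values.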

 

Some other files may still fail with byteArrayMaxOverride set to 500 MB. So we 
wonder if you can add a feature to the Tika parser that makes it stop reading 
metadata and body content once a certain amount of memory or body content has 
been reached. The parser would return the metadata and body content obtained 
so far, along with a warning message to the caller when this happens. This 
would help us get the metadata and body content from files that require a lot 
of memory.

Without this feature we may not be able to parse some files at all: after we 
set byteArrayMaxOverride to very high values, the failure above no longer 
occurs, but those files fail somewhere else with an out-of-memory error. With 
this feature we would get truncated body content for some files, but that is 
better than getting nothing. In fact, we truncate the body content ourselves 
if it is too large, so we do not mind if the body content is truncated once it 
reaches a certain size.
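
The behaviour we are asking for can be sketched independently of Tika as a collector that keeps text up to a character limit, drops the rest, and records that truncation happened so the caller can report a warning. The names BoundedTextCollector, append, text and isTruncated are hypothetical and are not part of the Tika API. This is a sketch of the idea under those assumptions, not a proposed implementation.

```java
/** Illustration only: hypothetical class, not part of Tika. */
final class BoundedTextCollector {
    private final StringBuilder text = new StringBuilder();
    private final int maxChars;
    private boolean truncated;

    BoundedTextCollector(int maxChars) {
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars must be >= 0");
        }
        this.maxChars = maxChars;
    }

    /** Appends as much of the chunk as fits; returns false if part of it was dropped. */
    boolean append(CharSequence chunk) {
        int remaining = maxChars - text.length();
        if (chunk.length() <= remaining) {
            text.append(chunk);
            return true;
        }
        text.append(chunk, 0, remaining); // keep only the part that still fits
        truncated = true;
        return false;
    }

    /** The content obtained so far. */
    String text() {
        return text.toString();
    }

    /** True if any content was dropped, so the caller can emit a warning. */
    boolean isTruncated() {
        return truncated;
    }

    public static void main(String[] args) {
        BoundedTextCollector body = new BoundedTextCollector(10);
        for (String chunk : new String[] {"Hello, ", "world", "!"}) {
            if (!body.append(chunk)) {
                break; // stop reading once the limit is reached
            }
        }
        System.out.println(body.text());
        if (body.isTruncated()) {
            System.out.println("WARNING: body content was truncated");
        }
    }
}
```

The caller stops feeding content as soon as append returns false, keeps whatever was collected, and checks isTruncated to decide whether to surface a warning.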



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
