[jira] [Commented] (TIKA-3519) Wonder if you can add a feature for Tika parser to stop reading metadata and body content if certain amount of memory or body content has reached

Xiaohong Yang (Jira) Mon, 09 Aug 2021 17:54:07 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396336#comment-17396336
 ]


Xiaohong Yang commented on TIKA-3519:
-------------------------------------

No. We have not. I will try it and let you know. 

Thank you very much.

> Wonder if you can add a feature for Tika parser to stop reading  metadata and 
> body content if certain amount of memory or body content has reached
> --------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3519
>                 URL: https://issues.apache.org/jira/browse/TIKA-3519
>             Project: Tika
>          Issue Type: Wish
>          Components: detector
>    Affects Versions: 1.25, 1.26
>         Environment: Linux
>            Reporter: Xiaohong Yang
>            Priority: Major
>
> We use org.apache.tika.parser.AutoDetectParser to get the metadata and body 
> content of MS Office files. We encountered the following exception with some 
> files:
>  
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an 
> array of length 14523048, but 5000000 is the maximum for this record type. If 
> the file is not corrupt, please open an issue on bugzilla to request 
> increasing the maximum allowable size for this record type. As a temporary 
> workaround, consider setting a higher override value with 
> IOUtils.setByteArrayMaxOverride()
>  
> To resolve the problem we set byteArrayMaxOverride in the tika-config.xml 
> file as follows
>  
>               <parser class="org.apache.tika.parser.microsoft.OfficeParser">
>                      <params>
>                            <param name="byteArrayMaxOverride" type="int">20000000</param>
>                      </params>
>               </parser>
>  
> This helped to parse some files that had failed previously, but other files 
> still failed, so we then increased the value to 200 MB and later to 500 MB.
>  
> Some files may still fail even with byteArrayMaxOverride set to 500 MB. So 
> we wonder if you could add a feature that makes the Tika parser stop reading 
> metadata and body content once a certain amount of memory or body content 
> has been reached. The parser would return the metadata and body content 
> obtained so far, along with a warning message to the caller. This would help 
> us get the metadata and body content from files that require a lot of 
> memory. Without this feature we may not be able to parse some files at all: 
> once byteArrayMaxOverride is set high enough to avoid the failure above, 
> those files fail elsewhere with an out-of-memory error. With this feature we 
> would get truncated body content for some files, but that is better than 
> getting nothing. We already truncate the body content ourselves if it is too 
> large, so we do not mind if it is truncated once it reaches a certain size.
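
For reference, the cut-off behaviour requested above can be sketched with a plain SAX content handler that keeps at most a fixed number of characters and then aborts the parse, leaving the text collected so far available to the caller. This is only an illustration built on the JDK's org.xml.sax classes: CappedTextHandler and LimitReachedException are hypothetical names, not Tika classes. Tika's own BodyContentHandler accepts a write limit in its constructor and follows a similar pattern. Note that a cap like this bounds the extracted text only, not the memory the underlying parser allocates while reading a record.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: a SAX handler that keeps at most `limit` characters. */
public class CappedTextHandler extends DefaultHandler {

    /** Signals that the cap was hit; the text collected so far remains usable. */
    public static class LimitReachedException extends SAXException {
        public LimitReachedException(int limit) {
            super("Write limit of " + limit + " characters reached");
        }
    }

    private final StringBuilder text = new StringBuilder();
    private final int limit;
    private boolean truncated = false;

    public CappedTextHandler(int limit) {
        if (limit < 0) {
            throw new IllegalArgumentException("limit must be >= 0");
        }
        this.limit = limit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        int remaining = limit - text.length();
        if (length <= remaining) {
            text.append(ch, start, length);
            return;
        }
        // Keep only the part that still fits, then abort the parse.
        text.append(ch, start, remaining);
        truncated = true;
        throw new LimitReachedException(limit);
    }

    /** Text collected so far, possibly truncated. */
    public String getText() {
        return text.toString();
    }

    /** True once the cap has been hit. */
    public boolean isTruncated() {
        return truncated;
    }

    public static void main(String[] args) throws SAXException {
        CappedTextHandler handler = new CappedTextHandler(10);
        char[] chunk = "Hello, Tika parser".toCharArray();
        try {
            // A real parser would call characters() many times during a parse.
            handler.characters(chunk, 0, chunk.length);
        } catch (LimitReachedException e) {
            // Treat the cap as a warning, not a failure.
            System.err.println("Warning: " + e.getMessage());
        }
        System.out.println(handler.getText()); // prints "Hello, Tik"
    }
}
```

A caller would catch the limit exception around the parse call, log it as a warning, and then read the partial text from the handler.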



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
