[ https://issues.apache.org/jira/browse/TIKA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396336#comment-17396336 ]
Xiaohong Yang commented on TIKA-3519:
-------------------------------------
No. We have not. I will try it and let you know.
Thank you very much.
> Wonder if you can add a feature for Tika parser to stop reading metadata and
> body content if a certain amount of memory or body content has been reached
> --------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: TIKA-3519
> URL: https://issues.apache.org/jira/browse/TIKA-3519
> Project: Tika
> Issue Type: Wish
> Components: detector
> Affects Versions: 1.25, 1.26
> Environment: Linux
> Reporter: Xiaohong Yang
> Priority: Major
>
> We use org.apache.tika.parser.AutoDetectParser to get the metadata and body
> content of MS Office files. We encountered the following exception with some
> files:
>
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an
> array of length 14523048, but 5000000 is the maximum for this record type. If
> the file is not corrupt, please open an issue on bugzilla to request
> increasing the maximum allowable size for this record type. As a temporary
> workaround, consider setting a higher override value with
> IOUtils.setByteArrayMaxOverride()
>
> To resolve the problem we set byteArrayMaxOverride in the tika-config.xml
> file as follows:
>
> <parser class="org.apache.tika.parser.microsoft.OfficeParser">
>   <params>
>     <param name="byteArrayMaxOverride" type="int">20000000</param>
>   </params>
> </parser>
>
> This helped to parse some files that had failed previously, but other files
> still failed, so we increased the value to 200 MB and then to 500 MB.
>
> Some files may still fail with byteArrayMaxOverride set to 500 MB, so we
> wonder if you can add a feature that lets the Tika parser stop reading
> metadata and body content once a certain amount of memory or body content
> has been reached. The parser would return the metadata and body content
> obtained so far, along with a warning message to the caller. This would let
> us get the metadata and body content from files that require a lot of
> memory. Without this feature we may not be able to parse some files at all:
> once byteArrayMaxOverride is set high enough to avoid the failure above,
> those files fail elsewhere with an out-of-memory error. With this feature we
> would get truncated body content for some files, but that is better than
> getting nothing. We already truncate the body content ourselves if it is too
> large, so we do not mind if it is truncated once it reaches a certain size.
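
For illustration only, here is a minimal sketch of the behaviour requested above: stop collecting body text once a character limit is hit, and return what was gathered so far together with a "truncated" flag. It uses only the JDK's SAX classes, not Tika itself. The names TruncatingExtractSketch, LimitedTextHandler, Result and extract are hypothetical and do not come from Tika or POI. Tika's BodyContentHandler does accept a write limit in its constructor, which may already cover the body-content half of this request, but a limit on characters written would not by itself bound the memory POI allocates while reading a record.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: none of these names exist in Tika or POI.
final class TruncatingExtractSketch {

    /** Collects character data up to maxChars, then aborts the parse. */
    static final class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final int maxChars;
        private boolean limitReached;

        LimitedTextHandler(int maxChars) {
            this.maxChars = maxChars;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            int room = maxChars - text.length();
            if (length > room) {
                // Keep only what still fits, then stop the parser by throwing.
                text.append(ch, start, room);
                limitReached = true;
                throw new SAXException("character limit of " + maxChars + " reached");
            }
            text.append(ch, start, length);
        }

        boolean limitReached() {
            return limitReached;
        }

        String text() {
            return text.toString();
        }
    }

    /** Text gathered so far plus a flag the caller can turn into a warning. */
    static final class Result {
        final String text;
        final boolean truncated;

        Result(String text, boolean truncated) {
            this.text = text;
            this.truncated = truncated;
        }
    }

    static Result extract(String xml, int maxChars) {
        LimitedTextHandler handler = new LimitedTextHandler(maxChars);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // Swallow only the abort we raised ourselves; anything else is a real error.
            if (!handler.limitReached()) {
                throw new IllegalStateException(e);
            }
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return new Result(handler.text(), handler.limitReached());
    }

    public static void main(String[] args) {
        Result r = extract("<doc><p>hello world</p></doc>", 5);
        System.out.println(r.text + " truncated=" + r.truncated);
    }
}
```

The caller checks the flag on the handler rather than the exception type, so the partial text is returned even if the parser wraps the exception on its way out.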