[
https://issues.apache.org/jira/browse/TIKA-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ethan Wilansky closed TIKA-3880.
--------------------------------
> Tika not picking-up setByteArrayMaxOverride from tika-config
> ------------------------------------------------------------
>
> Key: TIKA-3880
> URL: https://issues.apache.org/jira/browse/TIKA-3880
> Project: Tika
> Issue Type: Improvement
> Components: app
> Affects Versions: 2.5.0
> Environment: We are running this through docker on a machine with
> plenty of memory resources allocated to Docker.
> Docker config: 32 GB, 8 processors
> Host machine: 64 GB, 32 processors
> Our docker-compose configuration is derived from:
> [https://github.com/apache/tika-docker/blob/master/docker-compose-tika-customocr.yml]
> We are experienced with Docker and are confident that the issue isn't with
> Docker.
>
> Reporter: Ethan Wilansky
> Priority: Blocker
> Fix For: 2.5.0
>
>
> I have specified this parser parameter in tika-config.xml:
> <properties>
>   <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
>     <params>
>       <param name="byteArrayMaxOverride" type="int">700000000</param>
>     </params>
>   </parser>
> </properties>
>
> I've also verified that the tika-config.xml is being picked up by Tika on
> startup:
> org.apache.tika.server.core.TikaServerProcess Using custom config:
> /tika-config.xml
>
> However, when I encounter a very large docx file, I can clearly see that the
> configuration in tika-config is not being picked up:
>
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an
> array of length 686,679,089, but the maximum length for this record type is
> 100,000,000.
> If the file is not corrupt and not large, please open an issue on bugzilla to
> request
> increasing the maximum allowable size for this record type.
> You can set a higher override value with IOUtils.setByteArrayMaxOverride()
>
> I understand that this is a very large docx file. However, we can handle this
> amount of text extraction and are fine with the time it takes for Tika to
> perform this extraction and the amount of memory required to complete it.
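
One way to rule out a malformed or mistyped config entry is to read the value
back with the JDK's own XML parser. The sketch below does only that: it parses
a config string shaped like the one quoted in the report and returns the
integer value of a named parameter for a given parser class. It does not
reproduce how Tika itself loads tika-config.xml, and the class name
TikaConfigCheck, the method readIntParam, and the SAMPLE constant are
hypothetical names introduced for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: a standalone sanity check, not part of Tika or POI.
class TikaConfigCheck {

    // Sample config mirroring the structure quoted in the report.
    static final String SAMPLE =
          "<properties>\n"
        + "  <parser class=\"org.apache.tika.parser.microsoft.ooxml.OOXMLParser\">\n"
        + "    <params>\n"
        + "      <param name=\"byteArrayMaxOverride\" type=\"int\">700000000</param>\n"
        + "    </params>\n"
        + "  </parser>\n"
        + "</properties>\n";

    // Returns the integer value of <param name=paramName> under the
    // <parser class=parserClass> element, or throws if it is missing.
    static int readIntParam(String xml, String parserClass, String paramName)
            throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Reject DOCTYPE declarations; a plain config file does not need them.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            if (!parserClass.equals(parser.getAttribute("class"))) {
                continue;
            }
            NodeList params = parser.getElementsByTagName("param");
            for (int j = 0; j < params.getLength(); j++) {
                Element param = (Element) params.item(j);
                if (paramName.equals(param.getAttribute("name"))) {
                    return Integer.parseInt(param.getTextContent().trim());
                }
            }
        }
        throw new IllegalArgumentException("param not found: " + paramName);
    }

    public static void main(String[] args) throws Exception {
        int value = readIntParam(
                SAMPLE,
                "org.apache.tika.parser.microsoft.ooxml.OOXMLParser",
                "byteArrayMaxOverride");
        System.out.println("byteArrayMaxOverride = " + value);
    }
}
```

If the file is not well-formed XML, builder.parse throws a SAXParseException,
and a value that does not fit in an int makes Integer.parseInt throw a
NumberFormatException. A clean result only shows that the entry is readable;
whether Tika applies it to POI is a separate question.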
--
This message was sent by Atlassian Jira
(v8.20.10#820010)