[ https://issues.apache.org/jira/browse/TIKA-4474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Manish S N updated TIKA-4474:
-----------------------------
    Summary: Exception on ooxml office files with large entries  (was: Exception on zip-xml based office files with large entries)

> Exception on ooxml office files with large entries
> --------------------------------------------------
>
>                 Key: TIKA-4474
>                 URL: https://issues.apache.org/jira/browse/TIKA-4474
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 3.2.2
>         Environment: OS: Ubuntu 24.04.3 LTS x86_64
> Host: Precision 5560
> Kernel: 6.8.0-71-generic
> Shell: zsh 5.9
> Terminal: kitty
> CPU: 11th Gen Intel i7-11850H (16) @ 4.800GHz
> GPU: Intel TigerLake-H GT1 [UHD Graphics]
> Memory: 12574MiB / 15711MiB
>            Reporter: Manish S N
>            Priority: Major
>              Labels: OOXML, XLSX, tika-parsers
>         Attachments: testRecordFormatExceeded.xlsx
>
>
> When we try to parse an OOXML office file containing an entry that expands to more than 100 MB, we get a RecordFormatException from POI's IOUtils.
> E.g. a large spreadsheet (one such file is attached; the attached Excel file is about 12 MB but has a single sheet that expands to over 300 MB).
> This happens when we use an InputStream-based TikaInputStream, but not when we use a file-based one.
> It is caused by POI IOUtils' 100 MB limit for a single zip entry, which is hit while an OPCPackage is built from the input stream we passed in.
> Exception:
> {code:java}
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@12d40609
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312)
>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:153)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
>     at redacted.for.privacy
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 327,956,216, but the maximum length for this record type is 100,000,000.If the file is not corrupt and not large, please open an issue on bugzilla to request increasing the maximum allowable size for this record type.You can set a higher override value with IOUtils.setByteArrayMaxOverride()
>     at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:622)
>     at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:307)
>     at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:261)
>     at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:235)
>     at org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:93)
>     at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:114)
>     at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:164)
>     at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:455)
>     at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:430)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:127)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:117)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>     ... 75 more
> {code}
> Solution:
> To fix this without overriding the byte-array maximum value and spending even more RAM, we can force spooling the file to disk beforehand for OOXML files, just as we already do for ODF. This keeps the memory load minimal and improves performance as well (see the sketches below).
> The performance test I did for a similar issue also covers MS Office files, and that same issue gives further reasons to move to spooling entirely.
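> For reference, a minimal sketch of the two cases (assuming Tika 3.2.2 on the classpath and the attached testRecordFormatExceeded.xlsx on local disk; the class and variable names here are just for illustration):
> {code:java}
> import java.io.InputStream;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.sax.BodyContentHandler;
>
> public class LargeEntryRepro {
>     public static void main(String[] args) throws Exception {
>         Path xlsx = Paths.get("testRecordFormatExceeded.xlsx");
>         AutoDetectParser parser = new AutoDetectParser();
>
>         // File-based TikaInputStream: the OOXML parser can open the package from
>         // the underlying file, so no single zip entry has to be buffered in memory.
>         try (TikaInputStream tis = TikaInputStream.get(xlsx)) {
>             parser.parse(tis, new BodyContentHandler(-1), new Metadata(), new ParseContext());
>         }
>
>         // Stream-based TikaInputStream: POI buffers each zip entry into a byte array
>         // and throws RecordFormatException once an entry expands past 100,000,000 bytes.
>         try (InputStream raw = Files.newInputStream(xlsx);
>              TikaInputStream tis = TikaInputStream.get(raw)) {
>             parser.parse(tis, new BodyContentHandler(-1), new Metadata(), new ParseContext());
>         }
>     }
> }
> {code}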
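> As a caller-side workaround in the meantime (and essentially what the parser could do itself), the stream can be spooled to a temporary file before the OOXML parser opens the package. This is only a sketch; it assumes OOXMLExtractorFactory prefers the file once the TikaInputStream has one. The alternative named in the exception message, IOUtils.setByteArrayMaxOverride(), only raises the in-memory limit and costs more RAM.
> {code:java}
> import java.io.InputStream;
>
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.sax.BodyContentHandler;
>
> public class SpoolBeforeParse {
>     public static void parse(InputStream raw) throws Exception {
>         try (TikaInputStream tis = TikaInputStream.get(raw)) {
>             // Force the remaining bytes onto disk: getPath() spools the stream to a
>             // temporary file the first time it is called, so the parser can work
>             // file-backed instead of letting POI buffer whole zip entries in memory.
>             tis.getPath();
>
>             new AutoDetectParser().parse(tis, new BodyContentHandler(-1),
>                     new Metadata(), new ParseContext());
>         }
>     }
> }
> {code}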
--
This message was sent by Atlassian Jira
(v8.20.10#820010)