[ https://issues.apache.org/jira/browse/TIKA-4474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Manish S N updated TIKA-4474:
-----------------------------
    Description:
When we try to parse OOXML office files with an entry that expands to more than 100 MB, we get a RecordFormatException from POI's IOUtils.
E.g.: a large spreadsheet (attached is one such file; the attached Excel file is about 12 MB but has a single sheet that expands to over 300 MB).
This happens when we use an InputStream-based TikaInputStream and not when we use a file-based one.
This is caused by POI IOUtils' limit of 100 MB for a zip entry, which is hit while we try to make an OPCPackage out of the input stream we passed.

Exception:
{code:java}
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@12d40609
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:153)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
    at redacted.for.privacy
Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 327,956,216, but the maximum length for this record type is 100,000,000.If the file is not corrupt and not large, please open an issue on bugzilla to request increasing the maximum allowable size for this record type.You can set a higher override value with IOUtils.setByteArrayMaxOverride()
    at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:622)
    at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:307)
    at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:261)
    at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:235)
    at org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:93)
    at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:114)
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:164)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:455)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:430)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:127)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:117)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    ... 75 more
{code}

Solution:
To solve it without having to override that byte array max value and compromising any more RAM, we can force spooling the files beforehand for OOXML files too, just like for ODF. This keeps the load on RAM to a minimum and improves performance too.
[The performance test I did for a similar issue|https://issues.apache.org/jira/browse/TIKA-4459?focusedCommentId=18010803&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-18010803] also covers MS Office files, and the same issue gives reasons to move to spooling entirely.

  was:
When we try to parse OOXML office files with an entry that expands to more than 100 MB, we get a RecordFormatException from POI's IOUtils.
E.g.: a large spreadsheet (attached is one such file; the attached Excel file is about 12 MB but has a single sheet that expands to over 300 MB).
This happens when we use an InputStream-based TikaInputStream and not when we use a file-based one.
This is caused by POI IOUtils' limit of 100 MB for a zip entry, which is hit while we try to make an OPCPackage out of the input stream we passed.

Exception:
{code:java}
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@12d40609
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:153)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
    at redacted.for.privacy
Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 327,956,216, but the maximum length for this record type is 100,000,000.If the file is not corrupt and not large, please open an issue on bugzilla to request increasing the maximum allowable size for this record type.You can set a higher override value with IOUtils.setByteArrayMaxOverride()
    at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:622)
    at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:307)
    at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:261)
    at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:235)
    at org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:93)
    at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:114)
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:164)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:455)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:430)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:127)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:117)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    ... 75 more
{code}

Solution:
To solve it without having to override that byte array max value and compromising any more RAM, we can force spooling the files beforehand for OOXML files too, just like for ODF. This keeps the load on RAM to a minimum and improves performance too.
The performance test I did for a similar issue also covers MS Office files, and the same issue gives reasons to move to spooling entirely.


> Exception on ooxml office files with large entries
> --------------------------------------------------
>
>                 Key: TIKA-4474
>                 URL: https://issues.apache.org/jira/browse/TIKA-4474
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 3.2.2
>         Environment: OS: Ubuntu 24.04.3 LTS x86_64
> Host: Precision 5560
> Kernel: 6.8.0-71-generic
> Shell: zsh 5.9
> Terminal: kitty
> CPU: 11th Gen Intel i7-11850H (16) @ 4.800GHz
> GPU: Intel TigerLake-H GT1 [UHD Graphics]
> Memory: 12574MiB / 15711MiB
>            Reporter: Manish S N
>            Priority: Major
>              Labels: OOXML, XLSX, tika-parsers
>         Attachments: testRecordFormatExceeded.xlsx
>
>
> When we try to parse OOXML office files with an entry that expands to more than 100 MB, we get a RecordFormatException from POI's IOUtils.
> E.g.: a large spreadsheet (attached is one such file; the attached Excel file is about 12 MB but has a single sheet that expands to over 300 MB).
> This happens when we use an InputStream-based TikaInputStream and not when we use a file-based one.
> This is caused by POI IOUtils' limit of 100 MB for a zip entry, which is hit while we try to make an OPCPackage out of the input stream we passed.
>
> Exception:
> {code:java}
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@12d40609
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312)
>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:153)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
>     at redacted.for.privacy
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 327,956,216, but the maximum length for this record type is 100,000,000.If the file is not corrupt and not large, please open an issue on bugzilla to request increasing the maximum allowable size for this record type.You can set a higher override value with IOUtils.setByteArrayMaxOverride()
>     at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:622)
>     at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:307)
>     at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:261)
>     at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:235)
>     at org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:93)
>     at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:114)
>     at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:164)
>     at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:455)
>     at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:430)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:127)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:117)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>     ... 75 more
> {code}
>
> Solution:
> To solve it without having to override that byte array max value and compromising any more RAM, we can force spooling the files beforehand for OOXML files too, just like for ODF. This keeps the load on RAM to a minimum and improves performance too.
> [The performance test I did for a similar issue|https://issues.apache.org/jira/browse/TIKA-4459?focusedCommentId=18010803&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-18010803] also covers MS Office files, and the same issue gives reasons to move to spooling entirely.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
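
The stream-versus-file difference described in this issue can be illustrated with a minimal sketch that uses only java.util.zip. It is not POI's or Tika's code: the class name, the helper methods, and the 1,000,000-byte cap are hypothetical stand-ins (POI's default limit in the trace above is 100,000,000). The stream path buffers each zip entry into a byte array and enforces a cap, while the spooled-file path copies the stream to a temp file and reads entries in small chunks.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class SpoolBeforeParseSketch {

    // Hypothetical cap for the demo; stands in for POI's 100,000,000-byte default.
    static final int MAX_ENTRY_BYTES = 1_000_000;

    /** Builds an in-memory zip with one highly compressible entry of the given size. */
    static byte[] buildZip(int uncompressedSize) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(out)) {
            zos.putNextEntry(new ZipEntry("xl/worksheets/sheet1.xml"));
            zos.write(new byte[uncompressedSize]); // zeros compress very well
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Stream path: each entry is buffered whole into a byte[], so a size cap applies. */
    static long readFromStream(InputStream in) {
        long total = 0;
        try (ZipInputStream zis = new ZipInputStream(in)) {
            while (zis.getNextEntry() != null) {
                // Read at most one byte past the cap to detect oversized entries.
                byte[] data = zis.readNBytes(MAX_ENTRY_BYTES + 1);
                if (data.length > MAX_ENTRY_BYTES) {
                    throw new IllegalStateException(
                            "entry exceeds " + MAX_ENTRY_BYTES + " bytes");
                }
                total += data.length;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    /** File path: spool the stream to a temp file, then read entries in 8 KB chunks. */
    static long readFromSpooledFile(InputStream in) {
        Path tmp = null;
        try {
            tmp = Files.createTempFile("spooled-", ".zip");
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            long total = 0;
            byte[] buf = new byte[8192];
            try (ZipFile zf = new ZipFile(tmp.toFile())) {
                Enumeration<? extends ZipEntry> entries = zf.entries();
                while (entries.hasMoreElements()) {
                    try (InputStream es = zf.getInputStream(entries.nextElement())) {
                        int n;
                        while ((n = es.read(buf)) != -1) {
                            total += n; // count bytes without holding the entry in memory
                        }
                    }
                }
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (tmp != null) {
                tmp.toFile().delete(); // best-effort cleanup of the spool file
            }
        }
    }

    public static void main(String[] args) {
        byte[] zip = buildZip(5_000_000);
        System.out.println("compressed size: " + zip.length + " bytes");
        try {
            readFromStream(new ByteArrayInputStream(zip));
        } catch (IllegalStateException e) {
            System.out.println("stream path failed: " + e.getMessage());
        }
        System.out.println("file path read "
                + readFromSpooledFile(new ByteArrayInputStream(zip)) + " bytes");
    }
}
```

In Tika itself, the analogous caller-side workaround is probably to wrap the stream with TikaInputStream.get(...) and call getPath() on it before parsing, which spools the content to a temporary file; whether that fully avoids the exception on 3.2.2 is an assumption that should be checked against the attached file.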