[ 
https://issues.apache.org/jira/browse/TIKA-4474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manish S N updated TIKA-4474:
-----------------------------
    Summary: Exception on ooxml office files with large entries  (was: 
Exception on zip-xml based office files with large entries)

> Exception on ooxml office files with large entries
> --------------------------------------------------
>
>                 Key: TIKA-4474
>                 URL: https://issues.apache.org/jira/browse/TIKA-4474
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 3.2.2
>         Environment: OS: Ubuntu 24.04.3 LTS x86_64 
> Host: Precision 5560 
> Kernel: 6.8.0-71-generic 
> Shell: zsh 5.9 
> Terminal: kitty 
> CPU: 11th Gen Intel i7-11850H (16) @ 4.800GHz 
> GPU: Intel TigerLake-H GT1 [UHD Graphics] 
> Memory: 12574MiB / 15711MiB 
>            Reporter: Manish S N
>            Priority: Major
>              Labels: OOXML, XLSX, tika-parsers
>         Attachments: testRecordFormatExceeded.xlsx
>
>
> When we parse an OOXML office file containing an entry that expands to more 
> than 100 MB, we get a RecordFormatException from POI's IOUtils.
> Example: a large spreadsheet (one such file is attached; the attached Excel 
> file is about 12 MB but has a single sheet that expands to over 300 MB).
> This happens when we use an InputStream-based TikaInputStream, not when we 
> use a file-based one.
> It is caused by POI IOUtils' limit of 100 MB per zip entry, which is hit when 
> an OPCPackage is built from the input stream we passed.
> Exception:
> {code:java}
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@12d40609
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312)
>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:153)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
>     at redacted.for.privacy
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 327,956,216, but the maximum length for this record type is 100,000,000.
> If the file is not corrupt and not large, please open an issue on bugzilla to request increasing the maximum allowable size for this record type.
> You can set a higher override value with IOUtils.setByteArrayMaxOverride()
>     at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:622)
>     at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:307)
>     at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:261)
>     at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:235)
>     at org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:93)
>     at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:114)
>     at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:164)
>     at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:455)
>     at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:430)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:127)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:117)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>     ... 75 more
>  {code}
> Solution:
> To solve it without overriding that byte array maximum and using even more 
> RAM, we can force spooling of OOXML files to disk beforehand, just as is done 
> for ODF. This keeps the load on RAM to a minimum and should also improve 
> performance.
> The performance test I did for a similar issue also covered MS Office files, 
> and that issue gives further reasons to move to spooling entirely.
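
To illustrate the spooling idea from the report, here is a minimal sketch that uses only the JDK, not Tika or POI. It copies an input stream to a temporary file and then reads the zip through java.util.zip.ZipFile, streaming each entry through a small buffer instead of loading it into one byte array. The class and method names (SpoolingSketch, spoolToTempFile, totalUncompressedBytes, buildSampleZip) are made up for this example and are not part of either library. In Tika itself, the file-based path mentioned in the report would presumably be reached via TikaInputStream.get(Path) or by calling getPath() on the stream before parsing; that has not been verified against 3.2.2 here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration only; none of these names exist in Tika or POI.
final class SpoolingSketch {

    /** Copies the stream to a temp file so the zip can be opened with random access. */
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".zip");
        // createTempFile already created the file, so it must be replaced.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    /** Sums the uncompressed size of all entries using a fixed 8 KiB buffer. */
    static long totalUncompressedBytes(Path zipPath) throws IOException {
        long total = 0;
        byte[] buf = new byte[8192];
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                try (InputStream entryStream = zip.getInputStream(entries.nextElement())) {
                    int n;
                    while ((n = entryStream.read(buf)) != -1) {
                        total += n; // the entry is never held in memory as a whole
                    }
                }
            }
        }
        return total;
    }

    /** Builds an in-memory zip with one highly compressible entry (sample data only). */
    static byte[] buildSampleZip(int entrySize) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("xl/worksheets/sheet1.xml"));
            byte[] chunk = new byte[8192]; // all zeros, compresses very well
            int remaining = entrySize;
            while (remaining > 0) {
                int n = Math.min(chunk.length, remaining);
                zos.write(chunk, 0, n);
                remaining -= n;
            }
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] zipBytes = buildSampleZip(5_000_000);
        Path spooled = spoolToTempFile(new ByteArrayInputStream(zipBytes));
        try {
            System.out.println("compressed bytes:   " + zipBytes.length);
            System.out.println("uncompressed bytes: " + totalUncompressedBytes(spooled));
        } finally {
            Files.deleteIfExists(spooled);
        }
    }
}
```

The trade-off is one extra disk write per document in exchange for bounded heap use, which is the same reasoning the report gives for spooling OOXML files the way ODF files are handled.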



--
This message was sent by Atlassian Jira
(v8.20.10#820010)