Manish S N created TIKA-4474:
--------------------------------

             Summary: Exception on zip-xml based office files with large entries
                 Key: TIKA-4474
                 URL: https://issues.apache.org/jira/browse/TIKA-4474
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 3.2.2
         Environment: OS: Ubuntu 24.04.3 LTS x86_64 
Host: Precision 5560 
Kernel: 6.8.0-71-generic 
Shell: zsh 5.9 
Terminal: kitty 
CPU: 11th Gen Intel i7-11850H (16) @ 4.800GHz 
GPU: Intel TigerLake-H GT1 [UHD Graphics] 
Memory: 12574MiB / 15711MiB 
            Reporter: Manish S N
         Attachments: testRecordFormatExceeded.xlsx

When we try to parse OOXML office files containing an entry that expands to more than 100MB, we get a RecordFormatException from POI's IOUtils.

E.g. a large spreadsheet (one such file is attached; the attached Excel file is about 12MB on disk but has a single sheet that expands to over 300MB).

This happens when we use an InputStream-based TikaInputStream, but not when we use a file-based one.

It is caused by POI IOUtils' 100MB limit on a single zip entry, which is hit while the OPCPackage is built from the input stream we passed in.
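
A minimal reproduction of the two cases, assuming the attached testRecordFormatExceeded.xlsx sits on a local path (the class name and helper method are just for illustration):
{code:java}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class RecordFormatRepro {

    static void parse(TikaInputStream tis) throws Exception {
        new AutoDetectParser().parse(tis, new BodyContentHandler(-1),
                new Metadata(), new ParseContext());
    }

    public static void main(String[] args) throws Exception {
        Path xlsx = Path.of("testRecordFormatExceeded.xlsx");

        // File-backed TikaInputStream: works, POI reads the zip from disk.
        try (TikaInputStream tis = TikaInputStream.get(xlsx)) {
            parse(tis);
        }

        // Stream-backed TikaInputStream: POI buffers each zip entry in memory
        // and throws RecordFormatException once an entry expands past 100MB.
        try (InputStream raw = Files.newInputStream(xlsx);
             TikaInputStream tis = TikaInputStream.get(raw)) {
            parse(tis);
        }
    }
}
{code}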

Exception:
{code:java}
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@12d40609
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:153)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
    at redacted.for.privacy
Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 327,956,216, but the maximum length for this record type is 100,000,000.
If the file is not corrupt and not large, please open an issue on bugzilla to request increasing the maximum allowable size for this record type.
You can set a higher override value with IOUtils.setByteArrayMaxOverride()
    at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:622)
    at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:307)
    at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:261)
    at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:235)
    at org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:93)
    at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:114)
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:164)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:455)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:430)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:127)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:117)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    ... 75 more
{code}
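For reference, the override that the exception message points to is a single process-wide setter in POI. Raising it does let the file parse, but it only trades the exception for a proportionally larger in-memory buffer, which is what the solution below tries to avoid. A hedged sketch (the 400,000,000 figure is only an illustrative value):
{code:java}
import org.apache.poi.util.IOUtils;

public class RaisePoiByteArrayLimit {
    public static void main(String[] args) {
        // Process-wide override of POI's 100,000,000-byte cap on a single
        // record/zip-entry allocation; call it once before parsing. The
        // oversized sheet entry is then still fully materialized on the heap.
        IOUtils.setByteArrayMaxOverride(400_000_000);
    }
}
{code}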
Solution:

To solve this without overriding that byte-array max value and spending even more RAM, we can force spooling the file beforehand for OOXML, just as we already do for ODF. This keeps the memory load to a minimum and improves performance as well.

The performance test I did for a similar issue also covers MS Office files, and that issue gives the same reasons for moving to spooling entirely.
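
A rough sketch of what forcing the spool could look like; this is an assumption about the shape of the change, not the actual OOXMLExtractorFactory code (the helper name is hypothetical, while TikaInputStream.getFile() and OPCPackage.open(File, PackageAccess) are existing APIs):
{code:java}
import java.io.File;
import java.io.InputStream;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.tika.io.TikaInputStream;

// Hypothetical helper: always hand POI a file, never a raw stream.
final class OoxmlOpenHelper {
    static OPCPackage openSpooled(InputStream stream) throws Exception {
        TikaInputStream tis = TikaInputStream.get(stream);
        // Spools a stream-backed TikaInputStream to a temporary file, so
        // ZipPackage can use random-access ZipFile reads instead of
        // buffering every entry in memory via ZipInputStreamZipEntrySource.
        File file = tis.getFile();
        // The caller should close the TikaInputStream once the package has
        // been consumed so the temporary file is cleaned up.
        return OPCPackage.open(file, PackageAccess.READ);
    }
}
{code}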



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
