[ 
https://issues.apache.org/jira/browse/TIKA-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17422245#comment-17422245
 ] 

Tim Allison commented on TIKA-3561:
-----------------------------------

Wait, now I see the problem. I had unzipped the attachment and was processing 
just the xlsx. I'm still able to parse the full thing with -Xmx6g.

The problem is that the PackageParser calls POI on the stream of the embedded 
xlsx, so POI slurps the full thing into memory because it doesn't have the 
"file" available. I wonder if we should modify the PackageParser to cache the 
stream to disk in a temp file if it is above a certain size...or perhaps do the 
caching to disk in the OOXMLParser?
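
To make the "cache to disk above a certain size" idea concrete, here is a minimal stdlib-only sketch. It is not Tika's or POI's code: the class SpoolingSketch, the holder Spooled, the method spool, and the threshold parameter are all hypothetical names chosen for illustration. The idea is to buffer an embedded stream in memory up to a threshold and spill it to a temp file once the threshold is exceeded, so a file-based open becomes possible.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Tika or POI.
final class SpoolingSketch {

    /** Holds either in-memory bytes or a temp file, never both. */
    static final class Spooled implements AutoCloseable {
        final byte[] bytes; // non-null when the stream fit under the threshold
        final Path file;    // non-null when the stream was spilled to disk

        Spooled(byte[] bytes, Path file) {
            this.bytes = bytes;
            this.file = file;
        }

        boolean onDisk() {
            return file != null;
        }

        @Override
        public void close() throws IOException {
            // Remove the temp file, if one was created.
            if (file != null) {
                Files.deleteIfExists(file);
            }
        }
    }

    /**
     * Reads the stream fully. Keeps it in memory while it is at most
     * thresholdBytes long; otherwise copies it to a temp file.
     */
    static Spooled spool(InputStream in, int thresholdBytes) throws IOException {
        ByteArrayOutputStream head = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            head.write(buf, 0, n);
            if (head.size() > thresholdBytes) {
                Path tmp = Files.createTempFile("embedded-", ".tmp");
                try (OutputStream out = Files.newOutputStream(tmp)) {
                    head.writeTo(out); // flush what was buffered so far
                    while ((n = in.read(buf)) != -1) {
                        out.write(buf, 0, n); // stream the rest straight to disk
                    }
                } catch (IOException e) {
                    Files.deleteIfExists(tmp);
                    throw e;
                }
                return new Spooled(null, tmp);
            }
        }
        return new Spooled(head.toByteArray(), null);
    }
}
```

With a helper along these lines, heap use during spooling stays near the threshold plus one copy buffer. A caller could then hand the temp file to a file-based open such as OPCPackage.open(File) instead of the InputStream variant that appears in the stack trace below. Within Tika, TikaInputStream.getFile() already spools a stream to a temp file, so the real change might only need to decide when to call it.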

> Tika throwing java.lang.OutOfMemoryError
> ----------------------------------------
>
>                 Key: TIKA-3561
>                 URL: https://issues.apache.org/jira/browse/TIKA-3561
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.1.0
>            Reporter: Abha
>            Priority: Major
>         Attachments: Item.zip, out.tar.gz
>
>
> Getting a fatal exception when processing the attached document (item.content; 
> the sub-document name is item.xlsx).
> Below is the exception log:
> Caused by: java.lang.OutOfMemoryError: Java heap space
>     at java.io.ByteArrayOutputStream.<init>(ByteArrayOutputStream.java:77)
>     at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:177)
>     at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:149)
>     at org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:47)
>     at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:53)
>     at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:106)
>     at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:307)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:113)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
