[ https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101584#comment-17101584 ]
Tim Allison commented on TIKA-3097:
-----------------------------------
Uncompressed, you're looking at ~150MB for the file. XMLBeans adds quite a bit
of overhead on top of that, but 2GB still sounds excessive. There is a
streaming option for docx and pptx:
https://cwiki.apache.org/confluence/display/TIKA/MSOfficeParsers
I'll take a look in the debugger later today and let you know whether this is
a bug or a feature.
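
For anyone who wants to check the uncompressed figure on their own file: a .docx is a zip archive, so its uncompressed size can be estimated by reading every entry and counting the bytes. Below is a minimal sketch that uses only the JDK. The class name `DocxUncompressedSize` and the method `uncompressedBytes` are made up for illustration and are not part of Tika, and the default path `test.docx` is just the attachment name from this issue.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration only; not part of Tika.
class DocxUncompressedSize {

    // Sums the uncompressed bytes of every zip entry by actually inflating
    // them, because the size recorded in an entry header may be absent.
    static long uncompressedBytes(InputStream in) throws IOException {
        long total = 0;
        byte[] buf = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(in)) {
            while (zip.getNextEntry() != null) {
                int n;
                // read() returns -1 at the end of the current entry
                while ((n = zip.read(buf)) != -1) {
                    total += n;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default: the attachment name from the issue.
        Path docx = Paths.get(args.length > 0 ? args[0] : "test.docx");
        try (InputStream in = Files.newInputStream(docx)) {
            System.out.printf("%s: %d bytes uncompressed%n",
                    docx, uncompressedBytes(in));
        }
    }
}
```

This only measures the raw XML and media inside the package. The heap needed to parse it with a DOM-style parser will be larger than this number, which is why the streaming option may be worth trying.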
> Out of memory while parsing docx
> --------------------------------
>
> Key: TIKA-3097
> URL: https://issues.apache.org/jira/browse/TIKA-3097
> Project: Tika
> Issue Type: Bug
> Components: core, parser
> Affects Versions: 1.24
> Reporter: suchendra
> Priority: Major
> Attachments: test.docx
>
>
> I have written simple Scala code to extract the content from an uploaded docx
> file. The JVM goes OOM when Tika tries to parse the file. I configured the JVM
> heap to 1GB and then tried 2GB, and the same issue occurs both with the jar
> and in my code.
> The file is attached for reference.