[jira] [Commented] (TIKA-3097) Out of memory while parsing docx

Tim Allison (Jira) Wed, 17 Jun 2020 03:17:22 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138315#comment-17138315
 ]


Tim Allison commented on TIKA-3097:
-----------------------------------

Yes, even if the file is read as a stream. IIRC, some parsers only work with 
files because they need random access to the data. For example, if the xlsx 
parser hits sheet1.xml before hitting sharedStrings.xml as it streams the 
zip entries, it’d be out of luck.
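
To make the ordering problem concrete, here is a minimal sketch using only 
java.util.zip. It is not the code Tika or its dependencies actually run; the 
class name, method names and tiny XML payloads are made up for illustration. 
It builds a zip with sheet1.xml stored ahead of sharedStrings.xml, then shows 
that a stream only yields entries in stored order, while a file on disk lets 
you open any entry by name first.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class; names and contents are assumptions for this sketch.
public class ZipOrderSketch {

    // Build a tiny zip in memory with the sheet stored before the shared strings.
    public static byte[] buildZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("xl/worksheets/sheet1.xml"));
            zip.write("<sheet/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("xl/sharedStrings.xml"));
            zip.write("<sst/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Streaming: entries come back in stored order, with no way to skip ahead
    // and return later without buffering what was already read.
    public static List<String> streamOrder(byte[] zipBytes) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry e = in.getNextEntry(); e != null; e = in.getNextEntry()) {
                names.add(e.getName());
            }
        }
        return names;
    }

    // Random access: with a file, any entry can be opened by name, in any order.
    public static String readEntryFromFile(Path zipPath, String name) throws IOException {
        try (ZipFile zip = new ZipFile(zipPath.toFile());
             InputStream in = zip.getInputStream(zip.getEntry(name))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] zipBytes = buildZip();
        System.out.println("stream order: " + streamOrder(zipBytes));

        Path tmp = Files.createTempFile("zip-order-sketch", ".zip");
        try {
            Files.write(tmp, zipBytes);
            // The shared strings can be read first even though they are stored second.
            System.out.println(readEntryFromFile(tmp, "xl/sharedStrings.xml"));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

A streaming parser in this situation has to either buffer the sheet until the 
shared strings arrive or spool the whole input to a temporary file, and both 
options cost memory or disk.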

Even without needing random access, some parsers may choose to build the 
document components in memory for various reasons before we can extract text.

We try to stream where we can, but some file formats are less than helpful for 
streaming, and some of the parsers in our dependencies are not optimized for 
text extraction.

If you find obvious areas for improvement, let us know.

> Out of memory while parsing docx
> --------------------------------
>
>                 Key: TIKA-3097
>                 URL: https://issues.apache.org/jira/browse/TIKA-3097
>             Project: Tika
>          Issue Type: Bug
>          Components: core, parser
>    Affects Versions: 1.24
>            Reporter: suchendra
>            Priority: Major
>         Attachments: Screenshot from 2020-05-07 08-14-25.png, samplefile.txt, 
> test.docx
>
>
> I have written simple Scala code to extract the content from an uploaded docx 
> file. The JVM goes OOM when Tika tries to parse the file. I configured the JVM 
> heap to 1GB and then tried 2GB; the same issue occurs both with the jar and 
> in my code.
> I have attached the file for reference.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
