Tim Allison commented on TIKA-2575:

Is this an XLSX file or XLS?  We often see this kind of memory consumption with 
xlsx because of the shared strings table...or with large docx/pptx because our 
default parser is DOM based.

In many of the handlers, e.g. {{BodyContentHandler}} and 
{{WriteOutContentHandler}}, there is a {{writeLimit}} that caps the number of 
characters written.  However, this may not help in cases where we load large 
chunks of the document into memory before parsing.
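
To make the write-limit idea concrete, here is a minimal sketch that uses only 
the JDK's SAX API. It is not Tika's implementation: {{LimitedTextHandler}} and 
its {{extract}} helper are hypothetical names invented for this illustration. 
With Tika itself you would pass the limit to a handler constructor such as 
{{new BodyContentHandler(writeLimit)}} instead.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects text until a character limit is hit,
// then aborts the parse by throwing a SAXException.
class LimitedTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final int writeLimit;
    private boolean limitReached = false;

    LimitedTextHandler(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        int remaining = writeLimit - text.length();
        if (length > remaining) {
            // Keep what still fits, then stop the parser.
            text.append(ch, start, remaining);
            limitReached = true;
            throw new SAXException("write limit of " + writeLimit + " characters reached");
        }
        text.append(ch, start, length);
    }

    boolean isLimitReached() {
        return limitReached;
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Parses an XML string and returns at most writeLimit characters of text.
    static String extract(String xml, int writeLimit) {
        LimitedTextHandler handler = new LimitedTextHandler(writeLimit);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // A SAXException is expected only when our own limit fired.
            if (!handler.isLimitReached()) {
                throw new IllegalStateException(e);
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.toString();
    }
}
```

Note that a limit like this only bounds what the handler receives. It does 
nothing about memory the parser allocates before it emits any content.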

In general, though, sadly, client code has to be robust against OOM and 
permanent hangs.  How are you calling Tika?
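
As a rough sketch of what guarding against hangs in client code can look like, 
the parse can run on a worker thread with a timeout. {{GuardedParse}} and 
{{runWithTimeout}} are hypothetical names, and the {{Callable}} stands in for 
whatever code actually invokes Tika.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class GuardedParse {
    // Runs a parse task on a worker thread; returns null if it exceeds the timeout.
    static String runWithTimeout(Callable<String> parseTask, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "guarded-parse");
            t.setDaemon(true); // a stuck parse must not keep the JVM alive
            return t;
        });
        Future<String> future = executor.submit(parseTask);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the worker; code that ignores
            // interrupts will keep running in the background.
            future.cancel(true);
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException e) {
            throw new IllegalStateException("parse failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A thread-level timeout shares the heap with the caller, so it does not protect 
against OOM, and it cannot forcibly stop a parse that ignores interrupts. 
Running the parse in a separate process is the more robust option for both.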

> Provide a way to abort tika parses when tika input stream buffer grows past 
> a certain threshold
> -------------------------------------------------------------------------------------------------
>                 Key: TIKA-2575
>                 URL: https://issues.apache.org/jira/browse/TIKA-2575
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Nicholas DiPiazza
>            Priority: Major
> Sometimes, for example, you use Tika to parse an XLS file that isn't really 
> that big, maybe 60 MB, and suddenly the JVM heap usage exceeds 800 MB, which 
> causes an OOM in my case.
> Can we add an "abort threshold" so that the Tika parse halts if the parse 
> output bytes exceed this value?
> Or is it already possible for users to do this themselves by somehow 
> watching the input stream as it grows?
