[
https://issues.apache.org/jira/browse/TIKA-3823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571661#comment-17571661
]
earl commented on TIKA-3823:
----------------------------
[~tallison] Does encryption of the doc file have anything to do with the OOM?
Judging from the stack trace, the doc file seems to be encrypted. Since the
file is on the customer's end, I'm not sure I can test it with a 2.x version.
Anyway, I will try to implement this and let you know. Thanks!
> OutOfMemoryError occurs while parsing a doc file
> ------------------------------------------------
>
> Key: TIKA-3823
> URL: https://issues.apache.org/jira/browse/TIKA-3823
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.23
> Reporter: earl
> Priority: Blocker
> Attachments: Screen Shot 2022-07-26 at 9.32.23 AM.png
>
>
> An OutOfMemoryError occurs while parsing a doc file of size 450 MB; I am not
> sure about the uncompressed size. In the heap dump, the thread that parses
> that file holds a byte array of around 450 MB. The heap size is set to 2 GB,
> yet the issue persists.
> Stack trace:
> {code:java}
> at java.lang.OutOfMemoryError.<init>()V (OutOfMemoryError.java:48)
> at java.util.Arrays.copyOf([BI)[B (Arrays.java:3236)
> at java.io.ByteArrayOutputStream.toByteArray()[B (ByteArrayOutputStream.java:191)
> at org.apache.poi.hwpf.HWPFDocumentCore.getDocumentEntryBytes(Ljava/lang/String;II)[B (HWPFDocumentCore.java:353)
> at org.apache.poi.hwpf.HWPFDocument.<init>(Lorg/apache/poi/poifs/filesystem/DirectoryNode;)V (HWPFDocument.java:214)
> at org.apache.tika.parser.microsoft.WordExtractor.parse(Lorg/apache/poi/poifs/filesystem/DirectoryNode;Lorg/apache/tika/sax/XHTMLContentHandler;)V (WordExtractor.java:156)
> at org.apache.tika.parser.microsoft.OfficeParser.parse(Lorg/apache/poi/poifs/filesystem/DirectoryNode;Lorg/apache/tika/parser/ParseContext;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/sax/XHTMLContentHandler;)V (OfficeParser.java:175)
> at org.apache.tika.parser.microsoft.OfficeParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (OfficeParser.java:131)
> at org.apache.tika.parser.CompositeParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (CompositeParser.java:280)
> at org.apache.tika.parser.CompositeParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (CompositeParser.java:280)
> at org.apache.tika.parser.AutoDetectParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (AutoDetectParser.java:143)
> {code}
> The byte array contains something like
> "....D.d.....................|...L.P.....................................h.."
> followed by some XML data. Please let me know what the issue is and what this
> means.
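
For context on the top frames of the trace: ByteArrayOutputStream.toByteArray()
returns a fresh copy of the stream's internal buffer (the Arrays.copyOf frame),
so at that moment the heap has to hold both the buffer and the copy. The sketch
below only illustrates that JDK behaviour on a small stream; the class name
ToByteArrayCopyDemo and the helpers fill and minimumPeakBytes are hypothetical
and are not part of Tika or POI. The "at least twice the stream size" figure is
a lower bound, since the internal buffer can be larger than the data written to
it, and it says nothing about what else the parser keeps on the heap.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical demo class; not part of Tika or POI.
class ToByteArrayCopyDemo {

    // Writes `size` zero bytes in 4 KB chunks, the way a stream reader would.
    static ByteArrayOutputStream fill(int size) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int remaining = size;
        while (remaining > 0) {
            int n = Math.min(chunk.length, remaining);
            out.write(chunk, 0, n);
            remaining -= n;
        }
        return out;
    }

    // Lower bound on the bytes that are live when toByteArray() returns:
    // the internal buffer (at least streamSize) plus the returned copy.
    static long minimumPeakBytes(long streamSize) {
        return 2L * streamSize;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = fill(8 * 1024 * 1024);

        // Each call allocates a new array; the internal buffer is not handed out.
        byte[] first = out.toByteArray();
        byte[] second = out.toByteArray();
        System.out.println("distinct copies: " + (first != second));
        System.out.println("copy length: " + first.length);

        long mb = 1024L * 1024L;
        System.out.println("minimum peak for a 450 MB stream: "
                + minimumPeakBytes(450 * mb) / mb + " MB");
    }
}
```

Under these assumptions a 450 MB entry needs roughly 900 MB or more of free
heap for this one call, which may explain why a 2 GB heap is not enough once
the rest of the parse is taken into account.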
--
This message was sent by Atlassian Jira
(v8.20.10#820010)