[ https://issues.apache.org/jira/browse/TIKA-3823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571991#comment-17571991 ]
earl commented on TIKA-3823:
----------------------------
That is actually the full stack trace. I traced through the code and I also
cannot find any sign of encryption. What I'm curious about is this function
{code:java}
HWPFDocumentCore.getDocumentEntryBytes(String name, int encryptionOffset, int len)
{code}
and this line
{code:java}
return bos.toByteArray();
{code}
What does the returned byte array contain? Is it the document content or some
metadata of the file?
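For context on the second snippet: the stack trace in the report shows the error being thrown from Arrays.copyOf inside ByteArrayOutputStream.toByteArray(). That method returns a newly allocated copy of the stream's internal buffer, so at that point both the buffer and the copy are live at the same time. The sketch below uses only JDK classes (no POI or Tika code); the class and method names are made up for illustration, and it only demonstrates the copying behaviour, not what getDocumentEntryBytes actually reads.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical demo class, not part of POI or Tika.
public class ToByteArrayCopyDemo {

    /** Writes n zero bytes into a fresh ByteArrayOutputStream in 8 KB chunks. */
    static ByteArrayOutputStream fill(int n) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int remaining = n;
        while (remaining > 0) {
            int len = Math.min(chunk.length, remaining);
            bos.write(chunk, 0, len);
            remaining -= len;
        }
        return bos;
    }

    /**
     * Returns true if two successive toByteArray() calls hand back
     * distinct array objects with identical content, i.e. each call
     * allocates a new copy instead of exposing the internal buffer.
     */
    static boolean returnsFreshCopy(ByteArrayOutputStream bos) {
        byte[] first = bos.toByteArray();
        byte[] second = bos.toByteArray();
        return first != second && Arrays.equals(first, second);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream bos = fill(1 << 20);
        System.out.println("bytes buffered: " + bos.size());
        System.out.println("toByteArray() copies: " + returnsFreshCopy(bos));
    }
}
```

If the stream being read is around 450 MB, this copy alone needs roughly another 450 MB of contiguous heap on top of the internal buffer, which may itself be larger than the data it holds because it grows in steps.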
> OutOfMemoryError occurs while parsing a doc file
> ------------------------------------------------
>
> Key: TIKA-3823
> URL: https://issues.apache.org/jira/browse/TIKA-3823
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.23
> Reporter: earl
> Priority: Blocker
> Attachments: Screen Shot 2022-07-26 at 9.32.23 AM.png
>
>
> An OutOfMemoryError occurs while parsing a doc file of size 450 MB; I am not
> sure about the uncompressed size. In the heap dump, the thread that parses
> that file holds a byte array of around 450 MB. The heap size is set to 2 GB,
> but the issue still persists.
> Stack trace:
> {code:java}
> at java.lang.OutOfMemoryError.<init>()V (OutOfMemoryError.java:48)
> at java.util.Arrays.copyOf([BI)[B (Arrays.java:3236)
> at java.io.ByteArrayOutputStream.toByteArray()[B (ByteArrayOutputStream.java:191)
> at org.apache.poi.hwpf.HWPFDocumentCore.getDocumentEntryBytes(Ljava/lang/String;II)[B (HWPFDocumentCore.java:353)
> at org.apache.poi.hwpf.HWPFDocument.<init>(Lorg/apache/poi/poifs/filesystem/DirectoryNode;)V (HWPFDocument.java:214)
> at org.apache.tika.parser.microsoft.WordExtractor.parse(Lorg/apache/poi/poifs/filesystem/DirectoryNode;Lorg/apache/tika/sax/XHTMLContentHandler;)V (WordExtractor.java:156)
> at org.apache.tika.parser.microsoft.OfficeParser.parse(Lorg/apache/poi/poifs/filesystem/DirectoryNode;Lorg/apache/tika/parser/ParseContext;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/sax/XHTMLContentHandler;)V (OfficeParser.java:175)
> at org.apache.tika.parser.microsoft.OfficeParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (OfficeParser.java:131)
> at org.apache.tika.parser.CompositeParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (CompositeParser.java:280)
> at org.apache.tika.parser.CompositeParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (CompositeParser.java:280)
> at org.apache.tika.parser.AutoDetectParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (AutoDetectParser.java:143)
> {code}
> The byte array contains something like
> "....D.d.....................|...L.P.....................................h.."
> followed by some XML data. Please let me know what the issue is and what this
> data means.