[ https://issues.apache.org/jira/browse/TIKA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824965#comment-17824965 ]

Nick Burch commented on TIKA-4208:
----------------------------------
I would expect the JSON output version to need a bit more memory, as
we have to hold all the content in memory before outputting it, instead of just
streaming the text/html out as we go along. I wouldn't expect it to be 4 GB vs
32 GB though!

Any ideas, anyone? Is it possible we've got an extra layer (or two?) of buffering
above and beyond what we need for the {{-J}} option?
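
To make the buffering-versus-streaming difference concrete, here is a minimal Java sketch. It is not Tika code: BufferingSketch, CountingWriter, BoundedWriter and emit are hypothetical names invented for illustration, and emit only stands in for a parser pushing text into a handler. It shows a sink that streams (memory stays flat), a StringWriter that buffers everything (the pattern at the top of the quoted stack trace), and a bounded buffer that fails fast instead of growing until the VM gives up.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical illustration only; none of these classes exist in Tika.
final class BufferingSketch {

    /** Streaming stand-in: text is "sent on" immediately, only a counter is kept. */
    static final class CountingWriter extends Writer {
        long count;
        @Override public void write(char[] buf, int off, int len) { count += len; }
        @Override public void flush() { }
        @Override public void close() { }
    }

    /** Buffering sink that refuses to grow past a fixed number of chars. */
    static final class BoundedWriter extends Writer {
        private final StringBuilder sb = new StringBuilder();
        private final int limit;
        BoundedWriter(int limit) { this.limit = limit; }
        @Override public void write(char[] buf, int off, int len) throws IOException {
            if (sb.length() + (long) len > limit) {
                throw new IOException("write limit of " + limit + " chars exceeded");
            }
            sb.append(buf, off, len);
        }
        @Override public void flush() { }
        @Override public void close() { }
        @Override public String toString() { return sb.toString(); }
    }

    /** Stand-in for a parser emitting the same chunk of text repeatedly. */
    static void emit(Writer out, int chunks, String chunk) {
        try {
            for (int i = 0; i < chunks; i++) {
                out.write(chunk);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CountingWriter streaming = new CountingWriter();
        emit(streaming, 1_000_000, "0123456789");
        System.out.println("streamed chars: " + streaming.count);

        StringWriter buffered = new StringWriter();   // holds every char until the end
        emit(buffered, 1_000, "0123456789");
        System.out.println("buffered chars: " + buffered.getBuffer().length());

        try {
            emit(new BoundedWriter(100), 1_000, "0123456789");
        } catch (UncheckedIOException e) {
            System.out.println("bounded sink stopped early: " + e.getCause().getMessage());
        }
    }
}
```

The bounded variant is only one possible design choice; it trades complete output for a predictable memory ceiling, and says nothing about whether an extra layer of buffering exists in the real handler chain.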
> OOM error in SAS7BDATParser
> ---------------------------
>
> Key: TIKA-4208
> URL: https://issues.apache.org/jira/browse/TIKA-4208
> Project: Tika
> Issue Type: Bug
> Affects Versions: 3.0.0-BETA
> Reporter: Gregory Lepore
> Priority: Minor
>
> For this ARC file:
> [https://eotarchive.s3.amazonaws.com/crawl-data/EOT-2004/segments/NARA-000/warc/NARA-PEOT-2004-20041019023240-02598-crawling008-c_NARA-PEOT-2004-20041019053819-01693-crawling007.archive.org.arc.gz]
> I'm getting an OOM error:
> Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>     at java.base/java.util.Arrays.copyOf(Arrays.java:3537)
>     at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:228)
>     at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:740)
>     at java.base/java.lang.StringBuffer.append(StringBuffer.java:410)
>     at java.base/java.io.StringWriter.write(StringWriter.java:99)
>     at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:96)
>     at org.apache.tika.sax.ToXMLContentHandler.writeEscaped(ToXMLContentHandler.java:229)
>     at org.apache.tika.sax.ToXMLContentHandler.characters(ToXMLContentHandler.java:154)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>     at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
>     at org.apache.tika.parser.RecursiveParserWrapper$RecursivelySecureContentHandler.characters(RecursiveParserWrapper.java:370)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>     at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>     at org.apache.tika.sax.SafeContentHandler.access$101(SafeContentHandler.java:47)
>     at org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
>     at org.apache.tika.sax.SafeContentHandler$$Lambda$327/0x00007f94a022d1a8.write(Unknown Source)
>     at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106)
>     at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
>     at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
>     at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
>     at org.apache.tika.parser.sas.SAS7BDATParser.parse(SAS7BDATParser.java:146)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:153)
>     at org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
>     at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
>     at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
>     at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:455)
> when extracting JSON with both the app and server version of 3.0.0 BETA.