[ https://issues.apache.org/jira/browse/TIKA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824874#comment-17824874 ]
Nick Burch commented on TIKA-4208:
----------------------------------
How much heap have you allocated?
The error suggests that Tika managed to decode the string in the SAS data file,
but ran out of memory while passing that string through the content handler
stack to produce plain text. Generally, if things are going to break, they
break at the decode step rather than at the output!
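
To illustrate the output-side failure described above, here is a minimal sketch of a text sink that stops accepting characters once a limit is reached, so a single huge decoded string cannot grow the buffer without bound. It uses only java.io. The class name BoundedStringWriter and its limit parameter are hypothetical, made up for this illustration, and are not part of Tika. Tika's own handlers have write-limit options (for example the BodyContentHandler(int writeLimit) constructor), which this sketch does not use.

```java
import java.io.StringWriter;

/** Hypothetical helper (not a Tika class): a StringWriter capped at a fixed number of chars. */
class BoundedStringWriter extends StringWriter {
    private final int limit;    // maximum number of chars to keep
    private boolean truncated;  // true once any input has been dropped

    BoundedStringWriter(int limit) {
        if (limit < 0) {
            throw new IllegalArgumentException("limit must be >= 0");
        }
        this.limit = limit;
    }

    /** Number of chars that can still be stored before hitting the limit. */
    private int room() {
        return Math.max(limit - getBuffer().length(), 0);
    }

    @Override
    public void write(int c) {
        if (room() >= 1) {
            super.write(c);
        } else {
            truncated = true;
        }
    }

    @Override
    public void write(char[] cbuf, int off, int len) {
        int room = room();
        if (len > room) {
            truncated = true;
            len = room;  // keep only what still fits
        }
        if (len > 0) {
            super.write(cbuf, off, len);
        }
    }

    @Override
    public void write(String str, int off, int len) {
        int room = room();
        if (len > room) {
            truncated = true;
            len = room;  // keep only what still fits
        }
        if (len > 0) {
            super.write(str, off, len);
        }
    }

    @Override
    public void write(String str) {
        write(str, 0, str.length());
    }

    boolean isTruncated() {
        return truncated;
    }

    public static void main(String[] args) {
        BoundedStringWriter out = new BoundedStringWriter(5);
        out.write("hello world");
        System.out.println(out + " " + out.isTruncated()); // prints "hello true"
    }
}
```

A cap like this only bounds the final text buffer. It does nothing about memory used earlier, while the parser decodes the string.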
> OOM error in SAS7BDATParser
> ---------------------------
>
> Key: TIKA-4208
> URL: https://issues.apache.org/jira/browse/TIKA-4208
> Project: Tika
> Issue Type: Bug
> Affects Versions: 3.0.0-BETA
> Reporter: Gregory Lepore
> Priority: Minor
>
> For this ARC file:
> [https://eotarchive.s3.amazonaws.com/crawl-data/EOT-2004/segments/NARA-000/warc/NARA-PEOT-2004-20041019023240-02598-crawling008-c_NARA-PEOT-2004-20041019053819-01693-crawling007.archive.org.arc.gz]
> I'm getting an OOM error:
> Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>     at java.base/java.util.Arrays.copyOf(Arrays.java:3537)
>     at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:228)
>     at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:740)
>     at java.base/java.lang.StringBuffer.append(StringBuffer.java:410)
>     at java.base/java.io.StringWriter.write(StringWriter.java:99)
>     at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:96)
>     at org.apache.tika.sax.ToXMLContentHandler.writeEscaped(ToXMLContentHandler.java:229)
>     at org.apache.tika.sax.ToXMLContentHandler.characters(ToXMLContentHandler.java:154)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>     at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
>     at org.apache.tika.parser.RecursiveParserWrapper$RecursivelySecureContentHandler.characters(RecursiveParserWrapper.java:370)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>     at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>     at org.apache.tika.sax.SafeContentHandler.access$101(SafeContentHandler.java:47)
>     at org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
>     at org.apache.tika.sax.SafeContentHandler$$Lambda$327/0x00007f94a022d1a8.write(Unknown Source)
>     at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106)
>     at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
>     at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
>     at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
>     at org.apache.tika.parser.sas.SAS7BDATParser.parse(SAS7BDATParser.java:146)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:153)
>     at org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
>     at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
>     at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
>     at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:455)
> when extracting JSON with both the app and server versions of 3.0.0-BETA.