[
https://issues.apache.org/jira/browse/TIKA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17827507#comment-17827507
]
Tim Allison commented on TIKA-4208:
-----------------------------------
I think you've just run into a monster of a sas7bdat file. I'm able to run
{{java -Xmx6g -jar tika-app-xyx.jar -J -t}} on the file successfully. The
resulting json is 2GB -- there's a lot of compression in the sas7bdat file
because most of the values are 0. The metadata says that it has 685 "pages"
(tables?), 344772 rows and 2120 columns. With recursive JSON, there's not much
of an option but to bump memory, limit the amount that you write to the handler,
or punt on the file altogether.
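
To illustrate the "limit the amount that you write" option in general terms, here is a minimal, JDK-only sketch of a character sink that stops growing once a cap is reached. The class name {{BoundedTextSink}} and its methods are hypothetical and are not part of Tika; Tika has its own write-limit settings for content handlers, and this sketch only shows the idea behind such a cap.

```java
/** Hypothetical helper, not a Tika class: keeps at most `limit` characters. */
public final class BoundedTextSink {
    private final StringBuilder buffer = new StringBuilder();
    private final int limit;
    private boolean truncated = false;

    public BoundedTextSink(int limit) {
        if (limit < 0) {
            throw new IllegalArgumentException("limit must be >= 0");
        }
        this.limit = limit;
    }

    /**
     * Appends as much of the text as still fits.
     * Returns true if all of it was stored, false if some of it was dropped.
     */
    public boolean append(CharSequence text) {
        int remaining = limit - buffer.length();
        if (text.length() <= remaining) {
            buffer.append(text);
            return true;
        }
        // Keep only the part that fits and remember that output was cut off.
        buffer.append(text, 0, remaining);
        truncated = true;
        return false;
    }

    public boolean isTruncated() {
        return truncated;
    }

    public int length() {
        return buffer.length();
    }

    @Override
    public String toString() {
        return buffer.toString();
    }

    public static void main(String[] args) {
        BoundedTextSink sink = new BoundedTextSink(10);
        // Simulate a parser emitting many tiny cell values, as a wide table would.
        for (int i = 0; i < 1000 && sink.append("0\t"); i++) {
            // the loop ends as soon as the sink refuses more text
        }
        System.out.println(sink.length() + " chars kept, truncated=" + sink.isTruncated());
    }
}
```

The caller decides what to do once {{append}} returns false: stop parsing, or keep parsing for metadata and discard the remaining text. Either way the buffer never approaches the JVM's maximum array size, which is the limit reported in the stack trace quoted below.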
If you can find incorrect recursion or incorrect duplication of data or
something wrong with what Tika is doing, please let us know.
Separately, for these "package" files like arcs, if you can't process them all
in memory, you may need to run an initial unraveling step to extract the
embedded files, along the lines of {{java -jar tika-app.xyz.jar -z input.arc}}.
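
Once the embedded files have been extracted, they can be handled one at a time, and the "punt" option becomes a per-file decision. The following is a rough, JDK-only sketch that lists the files in an extraction directory and keeps only those under a size threshold. The class name, method name, and threshold are hypothetical. Note that on-disk size is an imperfect proxy: a heavily compressed file such as this sas7bdat can be small on disk and still produce very large output.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical triage helper for a directory of already-extracted files. */
public final class ExtractedFileTriage {

    /** Returns the regular files directly under dir whose size is at most maxBytes. */
    public static List<Path> filesWithinLimit(Path dir, long maxBytes) {
        List<Path> accepted = new ArrayList<>();
        try (Stream<Path> entries = Files.list(dir)) {
            // Sort for a stable processing order across runs.
            List<Path> all = entries.sorted().collect(Collectors.toList());
            for (Path p : all) {
                if (Files.isRegularFile(p) && Files.size(p) <= maxBytes) {
                    accepted.add(p);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Assumption: args[0] is the directory the extraction step wrote into.
        Path dir = Path.of(args.length > 0 ? args[0] : ".");
        long maxBytes = 100L * 1024 * 1024; // arbitrary 100 MB threshold for this sketch
        for (Path p : filesWithinLimit(dir, maxBytes)) {
            // Each accepted file would then be parsed on its own, ideally in a
            // separate JVM so that one failure does not take down the whole run.
            System.out.println(p);
        }
    }
}
```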
> OOM error in SAS7BDATParser
> ---------------------------
>
> Key: TIKA-4208
> URL: https://issues.apache.org/jira/browse/TIKA-4208
> Project: Tika
> Issue Type: Bug
> Affects Versions: 3.0.0-BETA
> Reporter: Gregory Lepore
> Priority: Minor
> Attachments: table23.sas7bdat.zip
>
>
> For this ARC file:
> [https://eotarchive.s3.amazonaws.com/crawl-data/EOT-2004/segments/NARA-000/warc/NARA-PEOT-2004-20041019023240-02598-crawling008-c_NARA-PEOT-2004-20041019053819-01693-crawling007.archive.org.arc.gz]
> I'm getting an OOM error:
> Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.base/java.util.Arrays.copyOf(Arrays.java:3537)
> at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:228)
> at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:740)
> at java.base/java.lang.StringBuffer.append(StringBuffer.java:410)
> at java.base/java.io.StringWriter.write(StringWriter.java:99)
> at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:96)
> at org.apache.tika.sax.ToXMLContentHandler.writeEscaped(ToXMLContentHandler.java:229)
> at org.apache.tika.sax.ToXMLContentHandler.characters(ToXMLContentHandler.java:154)
> at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
> at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
> at org.apache.tika.parser.RecursiveParserWrapper$RecursivelySecureContentHandler.characters(RecursiveParserWrapper.java:370)
> at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
> at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
> at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
> at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
> at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
> at org.apache.tika.sax.SafeContentHandler.access$101(SafeContentHandler.java:47)
> at org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
> at org.apache.tika.sax.SafeContentHandler$$Lambda$327/0x00007f94a022d1a8.write(Unknown Source)
> at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106)
> at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
> at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
> at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
> at org.apache.tika.parser.sas.SAS7BDATParser.parse(SAS7BDATParser.java:146)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
> at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:153)
> at org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
> at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
> at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
> at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:455)
> when extracting JSON with both the app and server version of 3.0.0 BETA.