[ 
https://issues.apache.org/jira/browse/TIKA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17827507#comment-17827507
 ] 

Tim Allison commented on TIKA-4208:
-----------------------------------

I think you've just run into a monster of a sas7bdat file. I'm able to run 
{{java -Xmx6g -jar tika-app-xyz.jar -J -t}} on the file successfully. The 
resulting JSON is 2GB; the sas7bdat file itself is heavily compressed 
because most of the values are 0. The metadata says that it has 685 "pages" 
(tables?), 344772 rows and 2120 columns. With recursive JSON, there's not much 
of an option but to bump memory, limit the amount that you write to the 
handler, or punt on the file altogether.
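
To make the "limit the amount that you write to the handler" option concrete, here is a minimal plain-Java sketch. It is not Tika code: {{BoundedWriter}} and {{WriteLimitSketch}} are hypothetical names invented for illustration. In Tika itself the equivalent knob is a write limit on the handler, e.g. the {{BodyContentHandler(int writeLimit)}} constructor. The sketch also works through the cell count from the metadata above: at even two characters per cell (a "0" plus a separator) the text is roughly 1.46 billion characters, the same order of magnitude as the largest array a JVM can allocate, before any markup is added.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

/** Hypothetical helper (not part of Tika): a Writer that stops at a character budget. */
final class BoundedWriter extends Writer {
    private final Writer delegate;
    private final long limit;
    private long written;

    BoundedWriter(Writer delegate, long limit) {
        this.delegate = delegate;
        this.limit = limit;
    }

    @Override
    public void write(char[] buf, int off, int len) throws IOException {
        long remaining = limit - written;
        if (len > remaining) {
            // Keep whatever still fits, then signal that the budget is spent.
            delegate.write(buf, off, (int) remaining);
            written = limit;
            throw new IOException("write limit of " + limit + " chars reached");
        }
        delegate.write(buf, off, len);
        written += len;
    }

    @Override
    public void flush() throws IOException {
        delegate.flush();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}

final class WriteLimitSketch {
    /** Number of cells in a table; long arithmetic avoids int overflow. */
    static long cellCount(long rows, long cols) {
        return rows * cols;
    }

    /** Writes tab-separated cells into memory, truncating once the limit is hit. */
    static String capture(String[] cells, long limit) {
        StringWriter sink = new StringWriter();
        try (BoundedWriter out = new BoundedWriter(sink, limit)) {
            for (String cell : cells) {
                out.write(cell);
                out.write('\t');
            }
        } catch (IOException e) {
            // Budget exhausted: fall through and return the truncated text.
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        long cells = cellCount(344772L, 2120L);
        System.out.println("cells: " + cells); // 730916640
        System.out.println("chars at 2 per cell: " + (2 * cells));
        System.out.println("max int: " + Integer.MAX_VALUE);
        System.out.println("truncated length: "
                + capture(new String[] {"0", "0", "0"}, 4).length());
    }
}
```

The point of failing at a fixed budget is that the caller gets a truncated but usable result instead of an {{OutOfMemoryError}} deep inside {{StringBuffer.append}}; whether truncated content is acceptable depends on what you do with the output downstream.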

If you can find incorrect recursion or incorrect duplication of data or 
something wrong with what Tika is doing, please let us know.

Separately, for these "package" files like ARCs, if you can't process them all 
in memory, you may need to run an initial unraveling step to extract the 
embedded files, along the lines of {{java -jar tika-app-xyz.jar -z input.arc}}.

> OOM error in SAS7BDATParser
> ---------------------------
>
>                 Key: TIKA-4208
>                 URL: https://issues.apache.org/jira/browse/TIKA-4208
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 3.0.0-BETA
>            Reporter: Gregory Lepore
>            Priority: Minor
>         Attachments: table23.sas7bdat.zip
>
>
> For this ARC file:
> [https://eotarchive.s3.amazonaws.com/crawl-data/EOT-2004/segments/NARA-000/warc/NARA-PEOT-2004-20041019023240-02598-crawling008-c_NARA-PEOT-2004-20041019053819-01693-crawling007.archive.org.arc.gz]
> I'm getting an OOM error:
> Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>        at java.base/java.util.Arrays.copyOf(Arrays.java:3537)
>        at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:228)
>        at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:740)
>        at java.base/java.lang.StringBuffer.append(StringBuffer.java:410)
>        at java.base/java.io.StringWriter.write(StringWriter.java:99)
>        at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:96)
>        at org.apache.tika.sax.ToXMLContentHandler.writeEscaped(ToXMLContentHandler.java:229)
>        at org.apache.tika.sax.ToXMLContentHandler.characters(ToXMLContentHandler.java:154)
>        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>        at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
>        at org.apache.tika.parser.RecursiveParserWrapper$RecursivelySecureContentHandler.characters(RecursiveParserWrapper.java:370)
>        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>        at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
>        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>        at org.apache.tika.sax.SafeContentHandler.access$101(SafeContentHandler.java:47)
>        at org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
>        at org.apache.tika.sax.SafeContentHandler$$Lambda$327/0x00007f94a022d1a8.write(Unknown Source)
>        at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106)
>        at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
>        at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
>        at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
>        at org.apache.tika.parser.sas.SAS7BDATParser.parse(SAS7BDATParser.java:146)
>        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:153)
>        at org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
>        at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
>        at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
>        at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:455)
> when extracting JSON with both the app and server version of 3.0.0 BETA.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
