[ 
https://issues.apache.org/jira/browse/TIKA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16810925#comment-16810925
 ] 

Tim Allison commented on TIKA-2847:
-----------------------------------

My bag of tricks is empty. You might want to ask on the PDFBox users list. Part 
of the challenge is that the contents are highly repetitive, which means that 
the compression ratio is quite high, so the uncompressed content is much larger 
than the file size suggests. I typically leave 1 GB of heap per thread.
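
As a rough sketch of the 1 GB-per-thread rule of thumb above, the snippet below 
estimates how many parsing threads a given heap can accommodate. The class name 
ParserThreadBudget and the method threadsFor are hypothetical and not part of 
Tika; the 1 GB figure is only the rule of thumb mentioned here, not a measured 
requirement.

```java
final class ParserThreadBudget {
    // Assumption: budget 1 GB of heap per parsing thread (rule of thumb, not a Tika setting).
    static final long BYTES_PER_THREAD = 1024L * 1024L * 1024L;

    /** Returns how many parsing threads fit into the given heap size in bytes. */
    static int threadsFor(long maxHeapBytes) {
        if (maxHeapBytes <= 0) {
            return 0;
        }
        return (int) Math.min(Integer.MAX_VALUE, maxHeapBytes / BYTES_PER_THREAD);
    }

    public static void main(String[] args) {
        // maxMemory() reports the heap limit the JVM will try to use; it can be
        // slightly lower than the -Xmx value, depending on the garbage collector.
        long maxHeap = Runtime.getRuntime().maxMemory();
        int threads = threadsFor(maxHeap);
        System.out.printf("max heap: %d MB, parsing threads: %d%n",
                maxHeap / (1024 * 1024), threads);
        if (threads == 0) {
            System.out.println("Heap is below 1 GB; consider raising -Xmx.");
        }
    }
}
```

Under this budget, a JVM started with -Xmx512m leaves room for no parsing thread 
at all, which is at least consistent with the failure reported below.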

> OutOfMemoryError - tika1.19.1.jar
> ---------------------------------
>
>                 Key: TIKA-2847
>                 URL: https://issues.apache.org/jira/browse/TIKA-2847
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.19.1
>            Reporter: Ashish Tiwari
>            Priority: Major
>         Attachments: testCmplData.docx
>
>
> I am trying to parse a docx file and am getting the error below. The same 
> issue happens if I convert the attached docx file to a PDF. 
> The attached PDF file is 3.7 MB; however, I doubt it is related to the size of 
> the file, as I am able to parse a file above 30 MB without any issues.
> PS: This issue only happens if the JVM is configured with -Xmx512m; if I 
> change the value to 1024m, it starts working fine.
>  
> {code:java}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:842)
> at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
> at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
> at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
> at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
> at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3414)
> at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1272)
> at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1259)
> at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
> at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
> at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:178)
> at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:184)
> at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:138)
> at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:60)
> at org.apache.poi.ooxml.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:228)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:116)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> at org.apache.tika.Tika.parseToString(Tika.java:527)
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
