[ 
https://issues.apache.org/jira/browse/TIKA-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839709#comment-15839709
 ] 

Jorge Spinsanti commented on TIKA-2251:
---------------------------------------

{quote}
Would your preference be to catch+log this exception and continue with 
extraction with null headers?
{quote}
How can I extract the content with null headers?
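
For illustration only, the "catch+log and continue" idea from the quote could look roughly like the sketch below. It uses only the JDK and is not Tika or POI code: the names LenientPartReader, PartLoader and readOrNull are hypothetical, and the loader stands in for whatever code actually reads the header part. Since java.util.zip.ZipException is a subclass of IOException, the exception from the stack trace would be caught by this handler.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of Tika or POI.
final class LenientPartReader {
    private static final Logger LOG =
            Logger.getLogger(LenientPartReader.class.getName());

    /** Hypothetical callback that reads one document part (e.g. a header). */
    interface PartLoader<T> {
        T load() throws IOException;
    }

    /**
     * Runs the loader. If it fails with an IOException (which includes
     * ZipException), the failure is logged and null is returned so that
     * the caller can continue with the rest of the document.
     */
    static <T> T readOrNull(String partName, PartLoader<T> loader) {
        try {
            return loader.load();
        } catch (IOException e) {
            LOG.log(Level.WARNING, "Skipping unreadable part: " + partName, e);
            return null;
        }
    }
}
```

A caller would then treat a null result as "no headers" and still extract the body text, for example by checking the value returned by LenientPartReader.readOrNull(...) before using it.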

Anyway, I have many other files that throw the same error. These files are 
generated for different users, so we have no control over them.
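
To check which part of such a file is damaged, something like the following sketch might help. It is an assumption-laden diagnostic, not how Tika or POI read the file: it treats the .docx as a plain ZIP archive and uses only java.util.zip. The class name ZipEntryCheck and the method unreadableEntries are hypothetical. Every entry is read to the end, and entries that fail with an IOException (such as the ZipException above) are reported.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical diagnostic helper, JDK only.
final class ZipEntryCheck {

    /**
     * Returns "entryName: message" for every entry that cannot be read
     * to the end. An archive that cannot be opened at all still throws.
     */
    static List<String> unreadableEntries(File file) throws IOException {
        List<String> bad = new ArrayList<>();
        byte[] buf = new byte[8192];
        try (ZipFile zip = new ZipFile(file)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                try (InputStream in = zip.getInputStream(entry)) {
                    // Decompress the whole entry and discard the bytes.
                    while (in.read(buf) != -1) {
                        // nothing to do, we only care whether reading fails
                    }
                } catch (IOException e) {
                    bad.add(entry.getName() + ": " + e.getMessage());
                }
            }
        }
        return bad;
    }
}
```

Running ZipEntryCheck.unreadableEntries(new File("ZipException.docx")) on the attached file should list the entries that fail to decompress, if the damage is at the ZIP level as the stack trace suggests.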

OpenOffice opens the file without problems, and if I save the file again, 
keeping Word compatibility, the problem with the file goes away.

Does this explanation help with the problem?


> TIKA-198 due to java.util.zip.ZipException: invalid literal/lengths set
> -----------------------------------------------------------------------
>
>                 Key: TIKA-2251
>                 URL: https://issues.apache.org/jira/browse/TIKA-2251
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Jorge Spinsanti
>         Attachments: ZipException.docx
>
>
> I got an exception when extracting text from a file. See the stack trace 
> below and the attached file to reproduce:
> {code}
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@7f54cc49
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       ... 16 more
> Caused by: java.util.zip.ZipException: invalid literal/lengths set
>       at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>       at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:122)
>       at org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:207)
>       at org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown Source)
>       at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
>       at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
>       at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>       at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>       at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>       at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>       at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>       at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>       at org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:137)
>       at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:115)
>       at org.openxmlformats.schemas.wordprocessingml.x2006.main.HdrDocument$Factory.parse(Unknown Source)
>       at org.apache.poi.xwpf.usermodel.XWPFHeader.onDocumentRead(XWPFHeader.java:108)
>       at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:212)
>       at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
>       at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
>       at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>       at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
>       at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>       at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       ... 23 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
