[ 
https://issues.apache.org/jira/browse/TIKA-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2239.
-------------------------------
    Resolution: Won't Fix

If {{Can't Fix}} were an option, I'd prefer that to {{Won't Fix}}. :)

I get a "corrupted file" error when I try to open this in MS Word, and WinZip 
reports a CRC mismatch in the numbering.xml file.  I think numbering.xml is 
truly corrupt, and there's not much we can do about that.
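
For anyone who wants to check an attachment like this without WinZip, the 
sketch below recomputes each entry's CRC-32 with java.util.zip and compares it 
to the value recorded in the archive. ZipCrcCheck and findCorruptEntries are 
hypothetical names used for illustration, not part of Tika or POI, and the code 
assumes the file can still be opened as a zip at all.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, for illustration only.
final class ZipCrcCheck {

    /** Returns the names of entries whose data does not match the recorded CRC-32. */
    static List<String> findCorruptEntries(Path zipPath) throws IOException {
        List<String> corrupt = new ArrayList<>();
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                CRC32 crc = new CRC32();
                byte[] buf = new byte[8192];
                try (InputStream in = zip.getInputStream(entry)) {
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        crc.update(buf, 0, n);
                    }
                } catch (IOException e) {
                    // Data that cannot even be decompressed is reported as corrupt too.
                    corrupt.add(entry.getName());
                    continue;
                }
                // Compare the recomputed checksum with the one stored in the archive.
                if (crc.getValue() != entry.getCrc()) {
                    corrupt.add(entry.getName());
                }
            }
        }
        return corrupt;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ZipCrcCheck path/to/file.docx
        for (String name : findCorruptEntries(Paths.get(args[0]))) {
            System.out.println("CRC mismatch: " + name);
        }
    }
}
```

A docx is an ordinary zip container, so the attachment can be passed to this 
directly. A mismatch means the damage is in the file itself, not in how 
POI reads it.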

As mentioned, the experimental SAX parser for docx is able to handle the file, 
but only because it currently swallows XML parse exceptions when reading the 
numbering.xml file.
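
To illustrate what "swallowing" means here, the following is a minimal sketch 
of that pattern using only the JDK's DOM parser. It is not the actual Tika 
code: LenientPartReader and parseOptionalPart are hypothetical names, and the 
real SAX extractor may differ in detail.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper, for illustration only.
final class LenientPartReader {

    /**
     * Parses an optional package part (for example numbering.xml).
     * XML parse failures are swallowed and reported as an empty result,
     * so the caller can carry on with the rest of the document.
     */
    static Optional<Document> parseOptionalPart(InputStream in) throws IOException {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            return Optional.of(builder.parse(in));
        } catch (SAXException | ParserConfigurationException e) {
            // Swallow the parse problem: treat the part as missing.
            return Optional.empty();
        }
    }

    public static void main(String[] args) throws IOException {
        // Deliberately malformed XML: <num> is never closed.
        byte[] broken = "<numbering><num></numbering>".getBytes(StandardCharsets.UTF_8);
        Optional<Document> part = parseOptionalPart(new ByteArrayInputStream(broken));
        System.out.println("numbering part usable: " + part.isPresent());
    }
}
```

The trade-off is that a corrupt part silently yields nothing: the caller would 
get the document text without any numbering information, rather than an 
exception that flags the file as damaged.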

> Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-2239
>                 URL: https://issues.apache.org/jira/browse/TIKA-2239
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.14
>            Reporter: Jorge Spinsanti
>         Attachments: tika2239.docx
>
>
> I got an exception while extracting text from a DOCX file, caused by a 
> SAXParseException in Apache POI. See the stack trace:
> {code}
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@51a94303
>       at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1114)
>       at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1050)
>       at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>       at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:199)
>       at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>       at org.eclipse.jetty.server.Server.handle(Server.java:462)
>       at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:281)
>       at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:232)
>       at org.eclipse.jetty.io.AbstractConnection$1.run(AbstractConnection.java:505)
>       at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:607)
>       at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:536)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@51a94303
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       ... 16 more
> Caused by: java.io.IOException: Unable to parse xml bean
>       at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:118)
>       at org.openxmlformats.schemas.wordprocessingml.x2006.main.NumberingDocument$Factory.parse(Unknown Source)
>       at org.apache.poi.xwpf.usermodel.XWPFNumbering.onDocumentRead(XWPFNumbering.java:87)
>       at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:204)
>       at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
>       at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
>       at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>       at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
>       at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>       at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       ... 22 more
> Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 37; The encoding declaration is required in the text declaration.
>       at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
>       at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
>       at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>       at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>       at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>       at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
>       at org.apache.xerces.impl.XMLScanner.scanXMLDeclOrTextDecl(Unknown Source)
>       at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanXMLDeclOrTextDecl(Unknown Source)
>       at org.apache.xerces.impl.XMLDocumentScannerImpl$XMLDeclDispatcher.dispatch(Unknown Source)
>       at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
>       at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>       at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>       at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>       at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>       at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>       at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>       at org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:137)
>       at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:115)
>       ... 32 more
> {code}
> I attached a file to reproduce the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
