[ 
https://issues.apache.org/jira/browse/TIKA-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15548427#comment-15548427
 ] 

Nick Burch commented on TIKA-2107:
----------------------------------

The attached file is an old Word 2 file, which POI does not support, and hence 
neither does Tika. Tika correctly detects it as the old type and uses the 
EmptyParser, as expected. With Tika Apps built from git for both 1.14 and 2.0, 
I get this expected behaviour: no text and no exceptions.
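
For anyone who wants to check a file by hand, the exception in the quoted log 
below compares the first 8 bytes of the file, read as a little-endian long, 
against the OLE2 signature. A minimal sketch of that comparison in plain Java 
follows. The class and method names are hypothetical and this is not how POI 
or Tika implement it; the two signature values are taken from the exception 
message.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of POI or Tika.
public class OleSignatureCheck {
    // Expected OLE2 signature, as printed in the NotOLE2FileException message.
    static final long OLE2_SIGNATURE = 0xE11AB1A1E011CFD0L;

    /** Interprets the first 8 bytes of a file header as a little-endian long. */
    static long readSignature(byte[] header) {
        if (header.length < 8) {
            throw new IllegalArgumentException("need at least 8 header bytes");
        }
        return ByteBuffer.wrap(header, 0, 8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getLong();
    }

    /** True only if the header starts with the OLE2 signature. */
    static boolean isOle2(byte[] header) {
        return header.length >= 8 && readSignature(header) == OLE2_SIGNATURE;
    }

    public static void main(String[] args) {
        // Rebuild the 8 header bytes from the value reported in the log.
        byte[] reported = ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(0x04094031002DA5DBL)
                .array();
        System.out.printf("signature=0x%016X ole2=%b%n",
                readSignature(reported), isOle2(reported));
    }
}
```

To check a real file, read its first 8 bytes into the array instead of 
rebuilding them from the logged value.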

> Old MS Word files give error while indexing
> -------------------------------------------
>
>                 Key: TIKA-2107
>                 URL: https://issues.apache.org/jira/browse/TIKA-2107
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-batch
>    Affects Versions: 1.8, 2.0
>         Environment: ubuntu
>            Reporter: Gaurav
>              Labels: patch
>         Attachments: Tika 2.0 error.jpg, plen281.doc
>
>
> Error while indexing old MS Word files.
> Screenshot of the Tika 2.0 error attached.
> Log of the error with Tika 1.8:
> INFO: meta (application/msword)
> Oct 04, 2016 6:42:30 PM org.apache.tika.server.resource.TikaResource parse
> WARNING: meta: Text extraction failed
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
> org.apache.tika.parser.microsoft.OfficeParser@7260e439
>       at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:287)
>       at 
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:163)
>       at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>       at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>       at 
> org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:238)
>       at 
> org.apache.tika.server.resource.MetadataResource.parseMetadata(MetadataResource.java:134)
>       at 
> org.apache.tika.server.resource.MetadataResource.getMetadata(MetadataResource.java:67)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at 
> org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:181)
>       at 
> org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:97)
>       at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200)
>       at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99)
>       at 
> org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
>       at 
> org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
>       at 
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>       at 
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>       at 
> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
>       at 
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
>       at 
> org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
>       at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
>       at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
>       at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>       at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>       at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>       at org.eclipse.jetty.server.Server.handle(Server.java:370)
>       at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
>       at 
> org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
>       at 
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
>       at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
>       at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
>       at 
> org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
>       at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
>       at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
>       at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>       at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid 
> header signature; read 0x04094031002DA5DB, expected 0xE11AB1A1E011CFD0 - Your 
> file appears not to be a valid OLE2 document
>       at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:167)
>       at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:117)
>       at 
> org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:291)
>       at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:166)
>       at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>       ... 38 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
