[
https://issues.apache.org/jira/browse/TIKA-1953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247552#comment-15247552
]
Tim Allison edited comment on TIKA-1953 at 4/19/16 10:58 AM:
-------------------------------------------------------------
[~chrismattmann], your instincts are correct. I'm able to reproduce this in
pure Java in a unit test, so this isn't a tika-server issue or a Python issue.
The problem is that the RTFParser opens and closes lists roughly as the list
markers show up in the file. If the list markers in the file are corrupt, the
RTFParser passes them through as is. So, if you're using the
ToXMLContentHandler, it will throw the NPE when there's a closing </ul> with
no opening <ul>. If you use the html, text or body handler, there's no problem.
As [~nicholasc] pointed out in the comment on TIKA-1513, we need to make the
RTFParser more robust to corrupt lists in RTF files. This will take some time
to get right.
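
To illustrate the failure mode described above (an end tag arriving with no matching start tag), here is a minimal sketch of a SAX filter that tracks open elements and swallows unmatched end tags before they reach the serializer. It is written in Python with the standard library's xml.sax for brevity and is not Tika's code or the eventual fix; the names BalancingHandler and dropped are hypothetical.

```python
import io
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLGenerator


class BalancingHandler(ContentHandler):
    """Forward SAX events downstream, dropping end tags with no open element."""

    def __init__(self, downstream):
        super().__init__()
        self.downstream = downstream
        self.open_elements = []  # stack of element names currently open
        self.dropped = []        # unmatched end tags that were swallowed

    def startElement(self, name, attrs):
        self.open_elements.append(name)
        self.downstream.startElement(name, attrs)

    def endElement(self, name):
        if name not in self.open_elements:
            # For example, a stray </ul> produced by a corrupt list marker.
            self.dropped.append(name)
            return
        # Close anything still open inside `name`, then `name` itself.
        while self.open_elements:
            top = self.open_elements.pop()
            self.downstream.endElement(top)
            if top == name:
                break

    def characters(self, content):
        self.downstream.characters(content)


# Simulate a parser that emits a closing </ul> without an opening <ul>.
out = io.StringIO()
handler = BalancingHandler(XMLGenerator(out))
handler.startElement("body", {})
handler.endElement("ul")  # unmatched: recorded in handler.dropped, not forwarded
handler.startElement("p", {})
handler.characters("text")
handler.endElement("p")
handler.endElement("body")
print(out.getvalue())
print(handler.dropped)
```

A filter like this only hides the symptom at the output stage; making the parser itself emit balanced list events is the more thorough approach.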
> tika-server NullPointerException while processing rtfs
> ------------------------------------------------------
>
> Key: TIKA-1953
> URL: https://issues.apache.org/jira/browse/TIKA-1953
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.12
> Environment: Python 2.7.11 :: Anaconda 4.0.0 (64-bit)
> Red Hat Enterprise Linux Server release 6.7 (Santiago)
> java version "1.7.0_95"
> OpenJDK Runtime Environment (rhel-2.6.4.0.el6_7-x86_64 u95-b00)
> OpenJDK 64-Bit Server VM (build 24.95-b01, mixed mode)
> Reporter: Ravi
> Assignee: Tim Allison
> Labels: newbie, rtf, tika-python, tika-server, xmlContent,
> Fix For: 1.13
>
> Attachments: officeinstallations3.rtf
>
>
> Setting the xmlContent=True flag appears to cause this warning: tika.py: Warn:
> Tika server returned status: 422
> I start the tika server and then run the following code in the Python
> interpreter from bash:
> import tika
> from tika import parser
> parsed = parser.from_file('/path/to/file.rtf', 'http://localhost:9003', xmlContent=True)
> I get: tika.py: Warn: Tika server returned status: 422
> Note: The parser seems to work fine without the xmlContent=True flag set. I
> get the right output, but setting this flag causes the NullPointerException
> below.
> Looking at the tika-server log, I get the following dump:
> ------------------------------------------------------------------------------
> Apr 15, 2016 2:36:55 PM org.apache.tika.server.resource.TikaResource logRequest
> INFO: rmeta/xml (autodetecting type)
> Apr 15, 2016 2:36:55 PM org.apache.tika.server.resource.TikaResource parse
> WARNING: rmeta/xml: Text extraction failed
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@21f0dbb9
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)
>     at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:281)
>     at org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:138)
>     at org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:119)
>     at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:181)
>     at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:97)
>     at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200)
>     at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99)
>     at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
>     at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
>     at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>     at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>     at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
>     at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
>     at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
>     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>     at org.eclipse.jetty.server.Server.handle(Server.java:370)
>     at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
>     at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
>     at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
>     at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
>     at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
>     at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
>     at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
>     at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>     at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.access$000(ToXMLContentHandler.java:38)
>     at org.apache.tika.sax.ToXMLContentHandler.endElement(ToXMLContentHandler.java:195)
>     at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>     at org.apache.tika.sax.SecureContentHandler.endElement(SecureContentHandler.java:256)
>     at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>     at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>     at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>     at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:273)
>     at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:226)
>     at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:478)
>     at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:439)
>     at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:87)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     ... 38 more
> ------------------------------------------------------------------------------