[ 
https://issues.apache.org/jira/browse/TIKA-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Williams updated TIKA-1845:
-------------------------------
    Description: 
I have some patient letters that are RTF documents.  When I extract the text 
from these documents using tika-server-1.5.jar, it works fine.

However, in tika-server-1.6.jar and later versions (I've tried 1.6, 1.9 and 
1.11), it fails with the stack trace and error shown below.

I can provide a sample RTF that is failing.  I'm not sure how to attach files 
to this issue so here is a link to an Evernote note containing an example RTF 
that fails:
https://www.evernote.com/shard/s66/sh/4a003611-2400-4959-a1cc-2be5b3efe2cf/284a6f2dd3e0a290

I wondered whether the error might be related to the following change, which 
was introduced in 1.6:
  * Made RTFParser's list handling slightly more robust against corrupt
    list metadata (TIKA-1305)

It's possible that there is some issue with the RTF documents, but they are 
real patient letters and they open in Microsoft Word without any problems.

Many thanks
Ian


Steps to reproduce issue
====================

1. HTTP PUT to Tika server using curl:

C:\Downloads\Apache Tika>curl -X PUT --data-binary @test-anonymised-letter.rtf http://localhost:9998/tika --header "Content-Type: application/rtf" --header "Accept: text/plain"

--> this works fine when running tika-server-1.5.jar, but fails with 
tika-server-1.6.jar
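
For anyone who wants to script the reproduction rather than use curl, here is a minimal sketch of the same PUT request in Python, using only the standard library. The URL, headers and file name are taken from the curl command above; the function names are hypothetical, and decoding the response as UTF-8 is an assumption about what the server returns.

```python
import urllib.request

# Assumed endpoint, copied from the curl command above; adjust to your setup.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(rtf_bytes: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build the same PUT request that the curl command sends."""
    return urllib.request.Request(
        url,
        data=rtf_bytes,
        method="PUT",
        headers={
            "Content-Type": "application/rtf",
            "Accept": "text/plain",
        },
    )


def extract_text(path: str, url: str = TIKA_URL) -> str:
    """Send the file to a running tika-server and return the response body."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read(), url)
    # urlopen raises urllib.error.HTTPError if the server answers with an
    # error status.
    with urllib.request.urlopen(request) as response:
        # Assumption: the plain-text response is UTF-8 encoded.
        return response.read().decode("utf-8")
```

Calling extract_text("test-anonymised-letter.rtf") against each tika-server version in turn should make it straightforward to compare the 1.5 behaviour with 1.6 and later.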


2. Screen capture from the server:
INFO: Starting Apache Tika 1.9 server
Feb 01, 2016 2:26:10 PM org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be http://localhost:9998/
Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: jetty-8.y.z-SNAPSHOT
Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Started SelectChannelConnector@localhost:9998
Feb 01, 2016 2:26:10 PM org.apache.tika.server.TikaServerCli main
INFO: Started
Feb 01, 2016 2:26:24 PM org.apache.tika.server.resource.TikaResource logRequest
INFO: tika (application/rtf)
Feb 01, 2016 2:26:25 PM org.apache.tika.server.resource.TikaResource parse
WARNING: tika: Text extraction failed
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@32a6dc
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:163)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:244)
        at org.apache.tika.server.resource.TikaResource$4.write(TikaResource.java:321)
        at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:164)
        at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1363)
        at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:244)
        at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:117)
        at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:80)
        at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
        at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83)
        at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
        at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
        at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
        at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
        at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
        at org.eclipse.jetty.server.Server.handle(Server.java:370)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
        at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:651)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
        at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.NullPointerException
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:113)
        at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
        at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:103)
        at org.apache.tika.parser.rtf.RTFEmbObjHandler.extractObj(RTFEmbObjHandler.java:230)
        at org.apache.tika.parser.rtf.RTFEmbObjHandler.handleCompletedObject(RTFEmbObjHandler.java:198)
        at org.apache.tika.parser.rtf.TextExtractor.processGroupEnd(TextExtractor.java:1357)
        at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:456)
        at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:439)
        at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:86)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
        ... 34 more

Feb 01, 2016 2:26:25 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain


> Unable to extract content from certain RTFs using tika-server versions since 
> 1.5 
> ---------------------------------------------------------------------------------
>
>                 Key: TIKA-1845
>                 URL: https://issues.apache.org/jira/browse/TIKA-1845
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 1.6, 1.9, 1.11
>         Environment: Windows
>            Reporter: Ian Williams
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
