[jira] [Created] (TIKA-2939) Figure out how to allow OCR'ing of large PDFs via tika-server

Tim Allison (Jira) Mon, 09 Sep 2019 03:36:12 -0700

Tim Allison created TIKA-2939:
---------------------------------

             Summary: Figure out how to allow OCR'ing of large PDFs via tika-server
                 Key: TIKA-2939
                 URL: https://issues.apache.org/jira/browse/TIKA-2939
             Project: Tika
          Issue Type: Improvement
          Components: server
            Reporter: Tim Allison



Tesseract can take a long time on large PDFs, which can lead to timeouts in 
JAX-RS and the connection being closed:

{noformat}
Caused by: com.ctc.wstx.exc.WstxIOException: Closed
        at com.ctc.wstx.sw.BaseStreamWriter.flush(BaseStreamWriter.java:262)
        at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:104)
Caused by: org.eclipse.jetty.io.EofException: Closed
        at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:491)
        at org.apache.cxf.transport.http_jetty.JettyHTTPDestination$JettyOutputStream.write(JettyHTTPDestination.java:322)
        at org.apache.cxf.io.AbstractWrappedOutputStream.write(AbstractWrappedOutputStream.java:51)
        at com.ctc.wstx.sw.EncodingXmlWriter.flushBuffer(EncodingXmlWriter.java:742)
        at com.ctc.wstx.sw.EncodingXmlWriter.flush(EncodingXmlWriter.java:176)
        at com.ctc.wstx.sw.BaseStreamWriter.flush(BaseStreamWriter.java:260)
{noformat}

I tried expanding the timeouts on the client side: 
{noformat}
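// Apache HttpClient 4.x (org.apache.http.client.config.RequestConfig).
// TIMEOUT is assumed to be in seconds; the setters take milliseconds.
RequestConfig config = RequestConfig.custom()
        .setConnectTimeout(TIMEOUT * 1000)
        .setConnectionRequestTimeout(TIMEOUT * 1000)
        .setSocketTimeout(TIMEOUT * 1000)
        .build();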
{noformat}

But this doesn't solve the problem.

Can we increase the timeout on the server side, and if so, how? Is there a 
maximum?

If we can't fix the problem with timeouts, we should figure out a way to let 
people select only a few pages for OCR so that clients can iterate through 
large PDFs.
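
As a rough sketch of what that client-side iteration might look like: the 
snippet below only computes the page batches a client would walk through. 
It assumes tika-server would accept some kind of inclusive, 1-based page 
range per request, which is not something this issue confirms exists; 
PageBatches, Range and split are made-up names for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class PageBatches {

    /** An inclusive, 1-based page range (hypothetical helper type). */
    public static final class Range {
        public final int first;
        public final int last;

        Range(int first, int last) {
            this.first = first;
            this.last = last;
        }

        @Override
        public String toString() {
            return first + "-" + last;
        }
    }

    /**
     * Splits pages 1..totalPages into consecutive batches of at most
     * batchSize pages each. The final batch may be shorter.
     */
    public static List<Range> split(int totalPages, int batchSize) {
        if (totalPages < 0 || batchSize < 1) {
            throw new IllegalArgumentException(
                    "totalPages must be >= 0 and batchSize must be >= 1");
        }
        List<Range> ranges = new ArrayList<>();
        // Use long for the loop variable so first + batchSize cannot overflow.
        for (long first = 1; first <= totalPages; first += batchSize) {
            int last = (int) Math.min(first + batchSize - 1, totalPages);
            ranges.add(new Range((int) first, last));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // A 23-page PDF in batches of 10 pages: prints 1-10, 11-20, 21-23.
        for (Range r : split(23, 10)) {
            System.out.println(r);
        }
    }
}
```

Each batch would then be one OCR request, so no single request has to stay 
open for the time it takes to OCR the whole document.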

This issue is different from TIKA-1871 in that the problem isn't chunking the 
large document to get the file to tika-server; rather the problem is the amount 
of time it can take tika-server to run OCR on every page of a large PDF and 
return the full results.




--
This message was sent by Atlassian Jira
(v8.3.2#803003)
