[
https://issues.apache.org/jira/browse/TIKA-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226797#comment-17226797
]
Tim Allison commented on TIKA-3220:
-----------------------------------
I'll clean up the unit tests to make it clearer that the existing behavior is
the "expected" one. Admittedly, this is not optimal. Currently, a thread in the
ForkServer checks "isParsing" and whether the elapsed time is longer than the
allowed time. If so, it calls System.exit(0). So, the server process terminates
itself, and the client gets an IOException while trying to read from its
InputStream.
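To make the mechanism above concrete, here is a minimal sketch of such a watchdog. It is not the actual ForkServer code: the class name ParseWatchdog, its method names, and the injectable onTimeout action are all hypothetical, and the action stands in for the System.exit(0) call so the logic can be exercised without killing the JVM.

```java
/** Hypothetical sketch of a server-side parse watchdog; not Tika's actual code. */
class ParseWatchdog {
    private final long timeoutMillis;
    // In the server described above, this action would effectively be System.exit(0).
    private final Runnable onTimeout;
    private volatile boolean parsing = false;
    private volatile long since = 0L;

    ParseWatchdog(long timeoutMillis, Runnable onTimeout) {
        this.timeoutMillis = timeoutMillis;
        this.onTimeout = onTimeout;
    }

    /** Called when a parse begins; records the start time. */
    void startParsing(long nowMillis) {
        since = nowMillis;
        parsing = true;
    }

    /** Called when a parse ends normally. */
    void finishParsing() {
        parsing = false;
    }

    /** One watchdog tick: fires onTimeout and returns true if an active parse is overdue. */
    boolean check(long nowMillis) {
        if (parsing && nowMillis - since > timeoutMillis) {
            onTimeout.run();
            return true;
        }
        return false;
    }

    /** Polls check() on a daemon thread until the timeout fires or the thread is interrupted. */
    Thread startDaemon(long pollMillis) {
        Thread t = new Thread(() -> {
            try {
                while (!check(System.currentTimeMillis())) {
                    Thread.sleep(pollMillis);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "parse-watchdog");
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

Because the process simply exits, the client side has nothing to read except a closed stream, which is why it can only report an IOException.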
So, where to go from here?
From least work to most work:
1) We can improve the error message to say that the server crashed because of
an OOM or a timeout.
2) We can maybe check the process for an agreed-upon exit value for a timeout
and then throw an exception that says so.
3) Grabbing the content and still shutting down looks difficult from a quick
look at the code... Is this necessary?
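As a rough illustration of option 2, the client could map the dead process's exit value to a more specific exception. Everything named here is an assumption for the sake of the sketch: the exit value 17, the ForkExitClassifier class, and the ParseTimeoutException type do not exist in Tika, and the server would first have to be changed to exit with the agreed-upon value instead of 0.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

/** Hypothetical client-side helper; not part of Tika. */
class ForkExitClassifier {
    /** Assumed exit value the server would use when its timeout watchdog fires. */
    static final int TIMEOUT_EXIT_VALUE = 17;

    /** Hypothetical exception type used only for this illustration. */
    static class ParseTimeoutException extends IOException {
        ParseTimeoutException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Turns an exit value into a more specific exception, keeping the original as the cause. */
    static IOException classify(int exitValue, IOException original) {
        if (exitValue == TIMEOUT_EXIT_VALUE) {
            return new ParseTimeoutException(
                    "Forked parser process exceeded the parse timeout and shut itself down",
                    original);
        }
        return new IOException(
                "Forked parser process exited with value " + exitValue
                        + " (possibly out of memory or another crash)",
                original);
    }

    /** Waits briefly for the process to finish exiting, then classifies its exit value. */
    static IOException classify(Process process, IOException original)
            throws InterruptedException {
        if (process.waitFor(1, TimeUnit.SECONDS)) {
            return classify(process.exitValue(), original);
        }
        // The process is still alive, so the lost connection has some other cause.
        return original;
    }
}
```

The client would call classify(process, e) at the point where it currently catches the "Lost connection" IOException, and rethrow the result.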
> ForkParser displays incorrect message when parse timeout is reached
> -------------------------------------------------------------------
>
> Key: TIKA-3220
> URL: https://issues.apache.org/jira/browse/TIKA-3220
> Project: Tika
> Issue Type: Bug
> Reporter: Nicholas DiPiazza
> Priority: Major
>
> Build this ForkParser example
> https://github.com/nddipiazza/tika-fork-parser-example
> but change the server timeout to be 10 seconds.
> {code}
> forkParser.setServerWaitTimeoutMillis(10000);
> {code}
> Now run it with the following (open licensed xls file)
> https://public.opendatasoft.com/explore/dataset/activite-epidemique-covid-19-departement-france/download/?format=xls&timezone=America/Chicago&lang=en&use_labels_for_header=true
> The purpose of this is to test the timeout feature on the ForkParser.
> {code}
> /home/ndipiazza/lucidworks/tika-fork-parser-example/tika-fork-main/build/dist
> /home/ndipiazza/Downloads/coronavirus-tranche-age-urgences-sosmedecins-dep-france.xls
> {code}
> Expected Result:
> Stop parsing after it reached the max time and either return the bytes so far
> or throw an error with the correct message stating that timeout was exceeded.
> Actual result:
> You get the following error message.
> {code}
> Exception in thread "main" org.apache.tika.exception.TikaException: Could not parse
> at org.apache.tika.client.CollectingParser.parseInternal(CollectingParser.java:104)
> at org.apache.tika.client.CollectingParser.parse(CollectingParser.java:70)
> at org.apache.tika.client.TikaForkExample.main(TikaForkExample.java:49)
> Caused by: org.apache.tika.exception.TikaException: Failed to communicate with a forked parser process. The process has most likely crashed due to some error like running out of memory. A new process will be started for the next parsing request.
> at org.apache.tika.fork.ForkParser.parse(ForkParser.java:275)
> at org.apache.tika.client.CollectingParser.parseInternal(CollectingParser.java:101)
> ... 2 more
> Caused by: java.io.IOException: Lost connection to a forked server process
> at org.apache.tika.fork.ForkClient.waitForResponse(ForkClient.java:284)
> at org.apache.tika.fork.ForkClient.call(ForkClient.java:209)
> at org.apache.tika.fork.ForkParser.parse(ForkParser.java:267)
> ... 3 more
> {code}
> If you increase the timeout, the file parses fine. It is not a memory issue.