Following up on this thread in case anyone stumbles upon it in searches: I am abandoning this tika-fork code and replacing it with a TikaServerPool that pools tika-server JVM instances with --spawnChild enabled, plus a client that fires off /rmeta/text requests to members of this pool selected round robin. This has its own set of quirks, but all in all the results are much more robust for multi-million-document crawls. I am seeing far fewer timeout exceptions.
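A minimal sketch of the client-side idea, not the actual TikaServerPool code: round-robin selection over pool members, with a bounded retry that moves on to the next member when a request fails. The class name `RoundRobinRetry`, its method names, and the backoff policy are all hypothetical, and the request itself is passed in as a function so the sketch assumes nothing about the HTTP client in use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical helper, not the real TikaServerPool: rotates over server base
// URLs and retries a failed request on the next member of the pool.
final class RoundRobinRetry {
    private final List<String> endpoints;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinRetry(List<String> endpoints) {
        if (endpoints.isEmpty()) {
            throw new IllegalArgumentException("need at least one endpoint");
        }
        this.endpoints = new ArrayList<>(endpoints);
    }

    // Picks the next endpoint in rotation. floorMod keeps the index
    // non-negative even after the int counter overflows.
    String nextEndpoint() {
        int i = Math.floorMod(counter.getAndIncrement(), endpoints.size());
        return endpoints.get(i);
    }

    // Runs the request against successive endpoints until one succeeds or
    // maxAttempts is reached, sleeping a little longer after each failure.
    <T> T callWithRetry(Function<String, T> request, int maxAttempts, long backoffMillis) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String endpoint = nextEndpoint();
            try {
                return request.apply(endpoint);
            } catch (RuntimeException e) {
                last = e; // remember the failure and try the next member
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMillis * attempt); // linear backoff (assumed policy)
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and stop retrying
                    break;
                }
            }
        }
        throw new IllegalStateException("request failed after retries", last);
    }
}
```

The function passed to `callWithRetry` is where the real /rmeta/text call to the selected server would go; it should throw a RuntimeException on a timeout or error response so that the next pool member gets a try.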
I had to put a retry around the requests to the tika-server API because the servers sometimes flake out for a period of time and then come back. But that seems to be the end of it.

On Thu, Jun 25, 2020 at 1:10 PM Nicholas DiPiazza <nicholas.dipia...@gmail.com> wrote:

> I need some help with a project I'm trying to port over to be a part of Tika.
>
> I am trying to extend the existing Fork Parser to add a "Fork Parser 2.0"
> which supports connection pools using commons-pool, and supports an
> improved ability to "stop parsing after N characters".
>
> Here is the latest code:
> https://github.com/nddipiazza/tika-fork/tree/2.3.1
>
> When I use this project, it works great on my local environment. When I
> throw it out in the world, I get intermittent errors related to timeouts:
>
> Parse error for input: c:\test\docs\c6f13fe7-40bb-4c64-8cfe-5d748b5c8567.xlsm
> Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Failed to read content from forked Tika parser JVM
>     at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:1.8.0_181]
>     at java.util.concurrent.FutureTask.get(FutureTask.java:206) ~[?:1.8.0_181]
>     at org.apache.tika.client.TikaRunner.parseImpl(TikaRunner.java:124) ~[tika-fork-client-2.3.1.jar:?]
>     at org.apache.tika.client.TikaRunner.parse(TikaRunner.java:58) ~[tika-fork-client-2.3.1.jar:?]
>     at org.apache.tika.client.TikaProcess.parse(TikaProcess.java:185) ~[tika-fork-client-2.3.1.jar:?]
>     at org.apache.tika.client.TikaProcessPool.parse(TikaProcessPool.java:145) ~[tika-fork-client-2.3.1.jar:?]
>     at com.lucidworks.apollo.pipeline.parse.impl.tika.TikaForkParser.parse(TikaForkParser.java:236) ~[lucid-parsing-4.2.2.jar:?]
>     ... 12 more
> Caused by: java.lang.RuntimeException: Failed to read content from forked Tika parser JVM
>     at org.apache.tika.client.TikaRunner.lambda$parseImpl$2(TikaRunner.java:118) ~[tika-fork-client-2.3.1.jar:?]
>     ... 4 more
> Caused by: java.util.concurrent.TimeoutException: Timed out waiting 120000 ms for metadata after content was fully parsed.
>     at org.apache.tika.client.TikaRunner.lambda$parseImpl$2(TikaRunner.java:112) ~[tika-fork-client-2.3.1.jar:?]
>     ... 4 more
>
> And because each timeout requires the tika forked JVM to be killed and
> respawned, this can cause some churning that leads to more timeouts because
> of the amount of time it takes to start up a tika JVM.
>
> Does anyone have any experience with the existing tika parser? I would
> imagine this file contains my main issue:
> https://github.com/nddipiazza/tika-fork/blob/2.3.1/tika-fork-main/src/main/java/org/apache/tika/fork/main/TikaForkMain.java
>
> I'm attempting to use 3 executors independently. And what I'm thinking is
> I should be doing this in a different way that isn't so fragile with
> respect to timeouts.
>
> Does anyone have some time to code review this and tell me what they might
> think is wrong?
>
> -Nicholas DiPiazza