Following up on this thread in case anyone stumbles upon it in searches: I
am abandoning this tika-fork code and replacing it with a TikaServerPool
that pools tika-server JVM instances with --spawnChild enabled, plus a
client that sends /rmeta/text requests to members of this pool, selected
round-robin. This has its own set of quirks, but all in all the results
are much more robust for multi-million-document crawls. I am seeing far
fewer timeout exceptions.
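
For anyone who wants a starting point, here is a minimal sketch of the
round-robin selection part only. It is not the actual TikaServerPool code:
the class name RoundRobinTikaEndpoints and the method nextRmetaTextUri are
made up for illustration, and it assumes each pool member is already
running and reachable at a known base URI.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: hands out tika-server request URIs in round-robin order. */
final class RoundRobinTikaEndpoints {
    private final List<URI> baseUris;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinTikaEndpoints(List<URI> baseUris) {
        if (baseUris.isEmpty()) {
            throw new IllegalArgumentException("pool needs at least one tika-server instance");
        }
        this.baseUris = new ArrayList<>(baseUris); // defensive copy
    }

    /** Returns the /rmeta/text URI of the next pool member. Safe to call from many threads. */
    URI nextRmetaTextUri() {
        // floorMod keeps the index non-negative even after the counter overflows.
        int index = Math.floorMod(counter.getAndIncrement(), baseUris.size());
        return baseUris.get(index).resolve("/rmeta/text");
    }
}
```

Each crawler thread would call nextRmetaTextUri() and send the document to
the returned URI with whatever HTTP client it already uses. Starting,
health-checking and restarting the server JVMs is left out here.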

I had to wrap the requests to the tika-server API in a retry because
they sometimes flake out for a period of time and then come back. But that
seems to be the end of it.
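
A rough sketch of that kind of retry wrapper, in case it helps. The names
TikaRetry and withRetries are hypothetical, and the doubling delay between
attempts is my assumption about a reasonable policy, not a description of
the exact code I am running.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: retries a flaky call with a doubling delay between attempts. */
final class TikaRetry {
    private TikaRetry() {}

    static <T> T withRetries(Callable<T> call, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (InterruptedException e) {
                // Do not retry when the calling thread is being shut down.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay); // give the server time to come back
                    delay *= 2;
                }
            }
        }
        throw last; // every attempt failed; surface the most recent error
    }
}
```

The request itself goes inside the Callable, for example
TikaRetry.withRetries(() -> sendToTika(file), 5, 1000L), where sendToTika
stands in for whatever method performs the HTTP call.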

On Thu, Jun 25, 2020 at 1:10 PM Nicholas DiPiazza <
nicholas.dipia...@gmail.com> wrote:

> I need some help with a project I'm trying to port over to be a part of
> Tika.
>
> I am trying to extend the existing Fork Parser to add a "Fork Parser 2.0"
> which supports connection pools using commons-pool, and supports an
> improved ability to "stop parsing after N characters".
>
> Here is the latest code:
> https://github.com/nddipiazza/tika-fork/tree/2.3.1
>
> When I use this project, it works great on my local environment. When I
> throw it out in the world, I get intermittent errors related to timeouts:
>
> Parse error for input:
> c:\test\docs\c6f13fe7-40bb-4c64-8cfe-5d748b5c8567.xlsm
> Caused by: java.util.concurrent.ExecutionException:
> java.lang.RuntimeException: Failed to read content from forked Tika parser
> JVM
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> ~[?:1.8.0_181]
>   at java.util.concurrent.FutureTask.get(FutureTask.java:206)
> ~[?:1.8.0_181]
>   at org.apache.tika.client.TikaRunner.parseImpl(TikaRunner.java:124)
> ~[tika-fork-client-2.3.1.jar:?]
>   at org.apache.tika.client.TikaRunner.parse(TikaRunner.java:58)
> ~[tika-fork-client-2.3.1.jar:?]
>   at org.apache.tika.client.TikaProcess.parse(TikaProcess.java:185)
> ~[tika-fork-client-2.3.1.jar:?]
>   at
> org.apache.tika.client.TikaProcessPool.parse(TikaProcessPool.java:145)
> ~[tika-fork-client-2.3.1.jar:?]
>   at
> com.lucidworks.apollo.pipeline.parse.impl.tika.TikaForkParser.parse(TikaForkParser.java:236)
> ~[lucid-parsing-4.2.2.jar:?]
>   ... 12 more
> Caused by: java.lang.RuntimeException: Failed to read content from forked
> Tika parser JVM
>   at
> org.apache.tika.client.TikaRunner.lambda$parseImpl$2(TikaRunner.java:118)
> ~[tika-fork-client-2.3.1.jar:?]
>   ... 4 more
> Caused by: java.util.concurrent.TimeoutException: Timed out waiting 120000
> ms for metadata after content was fully parsed.
>   at
> org.apache.tika.client.TikaRunner.lambda$parseImpl$2(TikaRunner.java:112)
> ~[tika-fork-client-2.3.1.jar:?]
>   ... 4 more
>
> And because each timeout requires the tika forked JVM to be killed and
> respawned, this can cause some churning that leads to more timeouts because
> of the amount of time it takes to start up a tika JVM.
>
> Does anyone have any experience with the existing Tika Fork Parser? I
> would imagine this file contains my main issue:
> https://github.com/nddipiazza/tika-fork/blob/2.3.1/tika-fork-main/src/main/java/org/apache/tika/fork/main/TikaForkMain.java
>
> I'm attempting to use 3 executors independently. And what I'm thinking is
> I should be doing this in a different way that isn't so fragile with
> respect to timeouts.
>
> Does anyone have some time to code review this and tell me what they might
> think is wrong?
>
> -Nicholas DiPiazza
>
>
>
