> I had to put a retry around requests to the tika api calls because
> sometimes they flake
Yes. This is an important point. Note that it is not flaking; it is an
intended restart after a catastrophic failure. But, yes, absolutely, clients
should have retry logic.
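
A minimal sketch of what such client-side retry logic could look like, assuming the request to tika-server is wrapped in a Callable that throws IOException while the server is restarting. The class name RetrySketch, the method callWithRetry, and the backoff values are hypothetical and not part of Tika; the HTTP call itself is left to whatever client library is already in use.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical helper (not part of Tika): retries a call with exponential backoff. */
final class RetrySketch {

    private RetrySketch() {}

    /**
     * Runs the call up to maxAttempts times. Only IOException is retried, on the
     * assumption that it signals a refused or reset connection during a restart.
     * Any other exception propagates immediately.
     */
    static <T> T callWithRetry(Callable<T> call, int maxAttempts, long initialBackoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoff = initialBackoffMillis;
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Give the server time to come back before the next attempt.
                    Thread.sleep(backoff);
                    backoff *= 2;
                }
            }
        }
        // All attempts failed; surface the last I/O failure to the caller.
        throw last;
    }
}
```

A caller would pass a lambda that sends the document to the server and returns the response body, for example callWithRetry(() -> sendToServer(file), 5, 1000L), where sendToServer is a placeholder for the existing request code.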
I just updated the wiki to make this point.
Please let us know what else you find.
Cheers,
Tim
On Wed, Jul 8, 2020 at 10:00 PM Nicholas DiPiazza <
[email protected]> wrote:
> Following up on this thread in case anyone stumbles upon it in searches: I
> am abandoning this tika-fork code and replacing it with a TikaServerPool
> that pools tika-server JVM instances with --spawnChild enabled, and a
> client that fires off /rmeta/text requests to round-robin-selected members
> of this pool. This has its own set of quirks, but all in all the results
> are much more robust for multi-million-document crawls. I am finding far
> fewer timeout exceptions.
>
> I had to put a retry around requests to the tika api calls because
> sometimes they flake out for a period of time and then come back, but that
> seems to be the end of it.
>
> On Thu, Jun 25, 2020 at 1:10 PM Nicholas DiPiazza <
> [email protected]> wrote:
>
> > I need some help with a project I'm trying to port over to be a part of Tika.
> >
> > I am trying to extend the existing Fork Parser to add a "Fork Parser 2.0"
> > which supports connection pools using commons-pool, and supports an
> > improved ability to "stop parsing after N characters".
> >
> > Here is the latest code:
> > https://github.com/nddipiazza/tika-fork/tree/2.3.1
> >
> > When I use this project, it works great on my local environment. When I
> > throw it out in the world, I get intermittent errors related to timeouts:
> >
> > Parse error for input: c:\test\docs\c6f13fe7-40bb-4c64-8cfe-5d748b5c8567.xlsm
> > Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Failed to read content from forked Tika parser JVM
> >     at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:1.8.0_181]
> >     at java.util.concurrent.FutureTask.get(FutureTask.java:206) ~[?:1.8.0_181]
> >     at org.apache.tika.client.TikaRunner.parseImpl(TikaRunner.java:124) ~[tika-fork-client-2.3.1.jar:?]
> >     at org.apache.tika.client.TikaRunner.parse(TikaRunner.java:58) ~[tika-fork-client-2.3.1.jar:?]
> >     at org.apache.tika.client.TikaProcess.parse(TikaProcess.java:185) ~[tika-fork-client-2.3.1.jar:?]
> >     at org.apache.tika.client.TikaProcessPool.parse(TikaProcessPool.java:145) ~[tika-fork-client-2.3.1.jar:?]
> >     at com.lucidworks.apollo.pipeline.parse.impl.tika.TikaForkParser.parse(TikaForkParser.java:236) ~[lucid-parsing-4.2.2.jar:?]
> >     ... 12 more
> > Caused by: java.lang.RuntimeException: Failed to read content from forked Tika parser JVM
> >     at org.apache.tika.client.TikaRunner.lambda$parseImpl$2(TikaRunner.java:118) ~[tika-fork-client-2.3.1.jar:?]
> >     ... 4 more
> > Caused by: java.util.concurrent.TimeoutException: Timed out waiting 120000 ms for metadata after content was fully parsed.
> >     at org.apache.tika.client.TikaRunner.lambda$parseImpl$2(TikaRunner.java:112) ~[tika-fork-client-2.3.1.jar:?]
> >     ... 4 more
> >
> > And because each timeout requires the tika forked JVM to be killed and
> > respawned, this can cause some churning that leads to more timeouts
> > because of the amount of time it takes to start up a tika JVM.
> >
> > Does anyone have any experience with the existing tika parser? I would
> > imagine this file contains my main issue:
> >
> > https://github.com/nddipiazza/tika-fork/blob/2.3.1/tika-fork-main/src/main/java/org/apache/tika/fork/main/TikaForkMain.java
> >
> > I'm attempting to use 3 executors independently. And what I'm thinking is
> > I should be doing this in a different way that isn't so fragile with
> > respect to timeouts.
> >
> > Does anyone have some time to code review this and tell me what they
> > might think is wrong?
> >
> > -Nicholas DiPiazza