I need some help with a project I'm trying to port over to be a part of Tika.

I am trying to extend the existing Fork Parser to add a "Fork Parser 2.0"
which supports connection pools using commons-pool, and supports an
improved ability to "stop parsing after N characters".

Here is the latest code: https://github.com/nddipiazza/tika-fork/tree/2.3.1

When I use this project, it works great in my local environment. When I
deploy it out in the world, I get intermittent errors related to timeouts:

Parse error for input: c:\test\docs\c6f13fe7-40bb-4c64-8cfe-5d748b5c8567.xlsm
Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Failed to read content from forked Tika parser JVM
  at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:1.8.0_181]
  at java.util.concurrent.FutureTask.get(FutureTask.java:206) ~[?:1.8.0_181]
  at org.apache.tika.client.TikaRunner.parseImpl(TikaRunner.java:124) ~[tika-fork-client-2.3.1.jar:?]
  at org.apache.tika.client.TikaRunner.parse(TikaRunner.java:58) ~[tika-fork-client-2.3.1.jar:?]
  at org.apache.tika.client.TikaProcess.parse(TikaProcess.java:185) ~[tika-fork-client-2.3.1.jar:?]
  at org.apache.tika.client.TikaProcessPool.parse(TikaProcessPool.java:145) ~[tika-fork-client-2.3.1.jar:?]
  at com.lucidworks.apollo.pipeline.parse.impl.tika.TikaForkParser.parse(TikaForkParser.java:236) ~[lucid-parsing-4.2.2.jar:?]
  ... 12 more
Caused by: java.lang.RuntimeException: Failed to read content from forked Tika parser JVM
  at org.apache.tika.client.TikaRunner.lambda$parseImpl$2(TikaRunner.java:118) ~[tika-fork-client-2.3.1.jar:?]
  ... 4 more
Caused by: java.util.concurrent.TimeoutException: Timed out waiting 120000 ms for metadata after content was fully parsed.
  at org.apache.tika.client.TikaRunner.lambda$parseImpl$2(TikaRunner.java:112) ~[tika-fork-client-2.3.1.jar:?]
  ... 4 more

Because each timeout requires the forked Tika JVM to be killed and
respawned, this can cause churn that leads to even more timeouts, since
starting up a new Tika JVM takes a significant amount of time.
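
One idea I have for reducing that churn is to start replacement workers on
a background thread, so that a request never waits on JVM startup directly.
Below is a minimal sketch of that idea using only the JDK. WarmPool and all
of its method names are hypothetical and are not taken from tika-fork or
commons-pool; the factory stands in for whatever launches a forked Tika JVM.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical sketch: a pool that creates workers off the request path. */
public class WarmPool<T> {
    private final BlockingQueue<T> idle = new LinkedBlockingQueue<>();
    private final ExecutorService spawner = Executors.newSingleThreadExecutor();
    private final Supplier<T> factory;

    public WarmPool(Supplier<T> factory, int size) {
        this.factory = factory;
        for (int i = 0; i < size; i++) {
            spawnAsync();
        }
    }

    /** Creates one worker on the background thread and queues it as idle. */
    private void spawnAsync() {
        // Assumption: the factory does not throw. A real version would need
        // to retry or log here, otherwise the pool silently shrinks.
        spawner.execute(() -> idle.add(factory.get()));
    }

    /** Returns an idle worker, or null if none became available in time. */
    public T borrow(long timeout, TimeUnit unit) throws InterruptedException {
        return idle.poll(timeout, unit);
    }

    /** Returns a healthy worker to the pool. */
    public void giveBack(T worker) {
        idle.add(worker);
    }

    /** Call after killing a timed-out worker; the replacement starts in the background. */
    public void replace() {
        spawnAsync();
    }

    public void close() {
        spawner.shutdownNow();
    }
}
```

With this shape, a caller that hits a timeout kills its worker, calls
replace(), and returns immediately. The next borrow() picks up any worker
that is already warm instead of blocking on a fresh startup.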

Does anyone have any experience with the existing Tika fork parser? I would
imagine this file contains my main issue:
https://github.com/nddipiazza/tika-fork/blob/2.3.1/tika-fork-main/src/main/java/org/apache/tika/fork/main/TikaForkMain.java

I'm attempting to use three executors independently, and I suspect I
should be doing this in a different way that isn't so fragile with respect
to timeouts.
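
To illustrate the kind of alternative I have in mind: instead of giving
each wait its own full timeout, all waits could draw on one shared time
budget per document. This is only a sketch under that assumption, written
against Java 8. SharedDeadline and parseWithBudget are hypothetical names
and do not reflect how TikaRunner or TikaForkMain are currently written.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: one time budget shared by every wait for a document. */
public class SharedDeadline {
    private final long deadlineNanos;

    public SharedDeadline(long budgetMillis) {
        this.deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(budgetMillis);
    }

    /** Time left in the budget, never negative. */
    public long remainingNanos() {
        return Math.max(0L, deadlineNanos - System.nanoTime());
    }

    /** Waits for a future, but only for as long as the shared budget allows. */
    public <T> T await(Future<T> future)
            throws InterruptedException, ExecutionException, TimeoutException {
        return future.get(remainingNanos(), TimeUnit.NANOSECONDS);
    }

    /** Runs the content and metadata reads concurrently under a single deadline. */
    public static String[] parseWithBudget(Callable<String> content,
                                           Callable<String> metadata,
                                           long budgetMillis) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<String> contentFuture = pool.submit(content);
            Future<String> metadataFuture = pool.submit(metadata);
            SharedDeadline deadline = new SharedDeadline(budgetMillis);
            // The second wait only gets whatever time the first one left over.
            return new String[] {
                deadline.await(contentFuture),
                deadline.await(metadataFuture)
            };
        } finally {
            pool.shutdownNow(); // interrupts any task that is still running
        }
    }
}
```

The point of the sketch is that the total wait per document is bounded by
one number, so a slow content read cannot be followed by a second full
timeout while waiting for metadata.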

Does anyone have some time to code review this and tell me what they might
think is wrong?

-Nicholas DiPiazza
