I need some help with a project I'm trying to port over to be a part of Tika. I am trying to extend the existing Fork Parser to add a "Fork Parser 2.0" which supports connection pools using commons-pool, and supports an improved ability to "stop parsing after N characters".
Here is the latest code: https://github.com/nddipiazza/tika-fork/tree/2.3.1

When I use this project, it works great in my local environment. When I deploy it out in the world, I get intermittent errors related to timeouts:

Parse error for input: c:\test\docs\c6f13fe7-40bb-4c64-8cfe-5d748b5c8567.xlsm
Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Failed to read content from forked Tika parser JVM
    at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:1.8.0_181]
    at java.util.concurrent.FutureTask.get(FutureTask.java:206) ~[?:1.8.0_181]
    at org.apache.tika.client.TikaRunner.parseImpl(TikaRunner.java:124) ~[tika-fork-client-2.3.1.jar:?]
    at org.apache.tika.client.TikaRunner.parse(TikaRunner.java:58) ~[tika-fork-client-2.3.1.jar:?]
    at org.apache.tika.client.TikaProcess.parse(TikaProcess.java:185) ~[tika-fork-client-2.3.1.jar:?]
    at org.apache.tika.client.TikaProcessPool.parse(TikaProcessPool.java:145) ~[tika-fork-client-2.3.1.jar:?]
    at com.lucidworks.apollo.pipeline.parse.impl.tika.TikaForkParser.parse(TikaForkParser.java:236) ~[lucid-parsing-4.2.2.jar:?]
    ... 12 more
Caused by: java.lang.RuntimeException: Failed to read content from forked Tika parser JVM
    at org.apache.tika.client.TikaRunner.lambda$parseImpl$2(TikaRunner.java:118) ~[tika-fork-client-2.3.1.jar:?]
    ... 4 more
Caused by: java.util.concurrent.TimeoutException: Timed out waiting 120000 ms for metadata after content was fully parsed.
    at org.apache.tika.client.TikaRunner.lambda$parseImpl$2(TikaRunner.java:112) ~[tika-fork-client-2.3.1.jar:?]
    ... 4 more

And because each timeout requires the forked Tika JVM to be killed and respawned, this can cause churn that leads to more timeouts, because starting up a Tika JVM takes a while. Does anyone have any experience with the existing Tika Fork Parser?
I imagine this file contains my main issue: https://github.com/nddipiazza/tika-fork/blob/2.3.1/tika-fork-main/src/main/java/org/apache/tika/fork/main/TikaForkMain.java

I'm attempting to use 3 executors independently, and I suspect I should be doing this in a different way that isn't so fragile with respect to timeouts. Does anyone have some time to code review this and tell me what they think is wrong?

-Nicholas DiPiazza
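
P.S. To make "a different way that isn't so fragile with respect to timeouts" a bit more concrete, here is a rough sketch of one alternative: submit the pieces of work to a single pool and wait on all of them against one shared deadline, instead of giving each stage its own independent timeout. This is not the code in the repo. The class name SharedDeadline and the method runAll are hypothetical, and it assumes the work done by the three executors can be expressed as plain Callables, which may not match what TikaForkMain actually does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, for illustration only (not part of tika-fork).
class SharedDeadline {

    // Submits every task to the same pool, then waits for the results
    // against ONE overall deadline rather than a separate timeout per task.
    static <T> List<T> runAll(ExecutorService pool, List<Callable<T>> tasks, long timeoutMs)
            throws InterruptedException, ExecutionException, TimeoutException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);

        List<Future<T>> futures = new ArrayList<>();
        for (Callable<T> task : tasks) {
            futures.add(pool.submit(task));
        }

        List<T> results = new ArrayList<>();
        try {
            for (Future<T> future : futures) {
                // Each wait only gets whatever is left of the shared budget.
                long remainingNanos = Math.max(0L, deadline - System.nanoTime());
                results.add(future.get(remainingNanos, TimeUnit.NANOSECONDS));
            }
            return results;
        } finally {
            // On timeout or failure this interrupts whatever is still running;
            // cancelling an already completed future has no effect.
            for (Future<T> future : futures) {
                future.cancel(true);
            }
        }
    }
}
```

A caller would pass, say, a "read content" task and a "read metadata" task together with one total budget. A TimeoutException then means the document as a whole exceeded its budget, rather than that one particular stage was slow after another had already finished.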
