[
https://issues.apache.org/jira/browse/TIKA-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701938#comment-16701938
]
Tim Allison commented on TIKA-2776:
-----------------------------------
Thank you for the follow-up!
To confirm/summarize...
1. I introduced a change in behavior (bug) into legacy server mode in 1.19
(maybe 1.18?) that causes tika-server to return 'not available' forever after
an OOM. The legacy behavior was to ignore OOMs and _hope_ nothing too bad
happened to your JVM. That said, the change of behavior I introduced is bad,
very bad. I've fixed this in 1.20, which should be out in a few weeks.
2. tika-server in -spawnChild mode was restarting the child because you were
getting timeouts. This caused problems with Manifold. You've bumped out the
timeout to ~16 minutes, and you currently don't have any files that take longer
than that...so all appears to work for now.
3. I _think_ we found that {{-spawnChild}} was behaving as designed. To
confirm, we did not find that the parent process shut down, and we did
find that the child restarted within a few seconds. Is this correct?
My opinion/advice:
Depending on the nature of your documents, if you have large enough batches of
crazy enough documents, you will eventually hit an infinite loop, and the child
will time out and restart. So, for now, you've papered over the problem by
bumping out the timeout, but the timeouts will eventually happen. So, what can
we do in Tika, what can Manifold do, and what can you do to help avoid this
eventuality?
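On the client side, one option is to treat a 503 or a reset connection as a
transient failure and retry with backoff while the child restarts. Below is a
minimal sketch of that idea in Python, not what Manifold actually does. The
{{send}} callable, its {{(status, body)}} return shape, and the default
attempt count and delay are all hypothetical and chosen only for illustration.

```python
import time

# Failures treated as transient: the child is restarting or the socket was
# closed. ConnectionResetError is a subclass of ConnectionError.
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)


def parse_with_retry(send, payload, max_attempts=5, base_delay=1.0,
                     sleep=time.sleep):
    """Call send(payload) until it succeeds or max_attempts is exhausted.

    send is a hypothetical callable that posts one file to tika-server and
    returns (status_code, body); it is assumed to raise ConnectionError
    when the socket is reset.
    """
    last_failure = None
    for attempt in range(1, max_attempts + 1):
        try:
            status, body = send(payload)
        except TRANSIENT_ERRORS as exc:
            last_failure = exc
        else:
            if status == 200:
                return body
            if status != 503:
                # Anything other than 503 is not treated as a restart symptom.
                raise RuntimeError(f"non-retryable status {status}")
            last_failure = RuntimeError("503 Service Unavailable")
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_failure
```

A file that genuinely triggers an infinite loop will fail on every attempt, so
the caller would still need to catch the final error and skip or quarantine
that file rather than retry it forever.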
Again, many, many thanks for your patience in getting the logging up and running.
Based on our interaction, I still need to further improve our wiki on logging
with tika-server.
> Tika server child restart
> -------------------------
>
> Key: TIKA-2776
> URL: https://issues.apache.org/jira/browse/TIKA-2776
> Project: Tika
> Issue Type: Bug
> Reporter: Mario Bisonti
> Assignee: Tim Allison
> Priority: Blocker
> Fix For: 2.0.0, 1.20
>
> Attachments: Log.zip, MCF_JOB.png, log4j.xml, log4j_child.xml,
> log4j_child.xml, man_tika.zip, tikalogchild.log
>
>
> Hello.
> I use tika server standalone started with the option:
> java -jar /opt/tika/tika-server-1.19.1.jar -spawnChild
> I use ManifoldCF and Solr to index files using tika server.
> Indexing keeps crashing because I get many errors like:
> Tika down, retrying: Connection reset
> etc.
> I suspect that, when a process is restarted, the client crashes, as mentioned
> here:
> _If the child process is in the process of shutting down, and it gets a new
> request it will return 503 -- Service Unavailable. If the server times out on
> a file, the client will receive an IOException from the closed socket. Note
> that all other files that are being processed will end with an IOException
> from a closed socket when the child process shuts down; e.g. if you send
> three files to tika-server concurrently, and one of them causes a
> catastrophic problem requiring the child to shut down, you won't be able to
> tell which file caused the problems. In the future, we may implement a
> gentler shutdown than we currently have._
> as reported here https://wiki.apache.org/tika/TikaJAXRS
> How can I work around it?
> Thanks a lot
> Mario