[jira] [Commented] (TIKA-2776) Tika server child restart

Tim Allison (JIRA) Mon, 26 Nov 2018 07:16:48 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16699108#comment-16699108
 ]


Tim Allison commented on TIKA-2776:
-----------------------------------

Three cheers for logging, and thank you for your patience in configuring those!

Yes, exactly!  It looks like the child process restarted at 2018-11-26 13:18:26 
({{2018-11-26 13:18:26 INFO  MetadataResource:431 - meta 
(application/vnd.openxmlformats}}) and then processed more files successfully.  
It can take a few seconds for the server to restart.  In the 
{{manifoldcf.log}}, it looks like the initial connectivity dropped at 13:18:25, 
and then there are problems logged through the end of 13:18:26 with worker 
threads not able to reach the server.  This is expected.  Are the clients 
(worker threads 88, 39, 8, 86, 87, 982, 99, 75, 12) able to sleep and retry 
after failed connectivity, or do they just try once and give up?  
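
To illustrate what "sleep and retry" could look like on the client side, here 
is a minimal sketch in Python.  It is not ManifoldCF's actual code; the helper 
name {{call_with_retry}}, the attempt count, and the delays are all 
assumptions chosen for illustration.  It retries on connection-level failures 
(such as "Connection reset") and on HTTP 503, and gives up on anything else.

```python
import time
import urllib.error


def call_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it succeeds or max_attempts is exhausted.

    send is any zero-argument callable that performs one request
    (hypothetical; supply your own). Delays double after each failure:
    base_delay, 2*base_delay, 4*base_delay, ...
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except urllib.error.HTTPError as e:
            # HTTPError must be caught before OSError (it is a subclass).
            # Only 503 (child shutting down or restarting) is worth retrying.
            if e.code != 503 or attempt == max_attempts:
                raise
        except OSError:
            # Covers ConnectionResetError, refused connections, and URLError.
            if attempt == max_attempts:
                raise
        sleep(base_delay * 2 ** (attempt - 1))
```

With the defaults, a client would wait 1, 2, 4, and 8 seconds between its five 
attempts, which should comfortably cover a restart that takes a few seconds.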

As a side note, if you add a header telling tika-server what the file name is, 
that filename will be included in the log message so you can figure out which 
file caused the timeout.  

See: https://wiki.apache.org/tika/TikaJAXRS ... in short, add the header to 
your request:
{{"Content-Disposition: attachment; filename=foo.csv"}}
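
As a rough sketch of how a client might set that header, the Python snippet 
below builds a PUT request to the {{/meta}} endpoint.  The helper names 
{{build_meta_request}} and {{send_request}}, the base URL 
{{http://localhost:9998}}, and the 60-second timeout are assumptions for 
illustration; adjust them to match your deployment.

```python
import urllib.request


def build_meta_request(data, filename, base_url="http://localhost:9998"):
    """Build (but do not send) a PUT request carrying the file name.

    base_url is an assumed tika-server address; filename is passed through
    as-is, so names with spaces or special characters may need quoting.
    """
    return urllib.request.Request(
        url=f"{base_url}/meta",
        data=data,
        method="PUT",
        headers={"Content-Disposition": f"attachment; filename={filename}"},
    )


def send_request(req, timeout=60):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For example, {{send_request(build_meta_request(file_bytes, "foo.csv"))}} would 
send the bytes of {{foo.csv}} along with its name.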

Some reasons for timeouts: the VM is overtaxed and processing is just slow; 
there is an infinite loop in a parser (these are rare, but they can happen); 
or OCR is running, which can take minutes per document (do you have tesseract 
installed?).



> Tika server child restart
> -------------------------
>
>                 Key: TIKA-2776
>                 URL: https://issues.apache.org/jira/browse/TIKA-2776
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Mario Bisonti
>            Assignee: Tim Allison
>            Priority: Blocker
>             Fix For: 2.0.0, 1.20
>
>         Attachments: Log.zip, MCF_JOB.png, log4j.xml, log4j_child.xml, 
> log4j_child.xml, man_tika.zip, tikalogchild.log
>
>
> Hallo.
> I use tika server standalone started with the option:
> java -jar /opt/tika/tika-server-1.19.1.jar -spawnChild
> I use ManifoldCF and Solr to index file using tika server.
> It happens that indexing is continuously crashed because I obtain many:
> Tika down, retrying: Connection reset
> etc.
> I suspect that, when a process is restarted, the client crash as mentioned 
> here:
> _If the child process is in the process of shutting down, and it gets a new 
> request it will return 503 -- Service Unavailable. If the server times out on 
> a file, the client will receive an IOException from the closed socket. Note 
> that all other files that are being processed will end with an IOException 
> from a closed socket when the child process shuts down; e.g. if you send 
> three files to tika-server concurrently, and one of them causes a 
> catastrophic problem requiring the child to shut down, you won't be able to 
> tell which file caused the problems. In the future, we may implement a 
> gentler shutdown than we currently have._
> as reported here https://wiki.apache.org/tika/TikaJAXRS
> How could I workaround it ?
> Thanks a lot
> Mario



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
