Re: 1.20?
+1, it would be nice to get the recent ENVI work released as well, folks.

On 2018/11/20 23:04:29, Tim Allison wrote:
> All,
> POI 4.0.1 will be out shortly with some important bug fixes. What would
> you all think of targeting 1st/2nd week of December for 1.20?
>
> Cheers,
> Tim
[jira] [Comment Edited] (TIKA-2776) Tika server child restart
[ https://issues.apache.org/jira/browse/TIKA-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699108#comment-16699108 ]

Tim Allison edited comment on TIKA-2776 at 11/28/18 2:20 PM:

Three cheers for logging, and thank you for your patience in configuring those!

Yes, exactly! It looks like the child process restarted at 2018-11-26 13:18:26

{{2018-11-26 13:18:26 INFO MetadataResource:431 - meta (application/vnd.openxmlformats}}

and then processed more files successfully.

It can take a few seconds for the server to restart, and it looks in the {{manifoldcf.log}} like the initial connectivity dropped at 13:18:25, and then there are problems logged through the end of 13:18:26 with worker threads not able to reach the server. This is expected. Are the clients (worker threads 88, 39, 8, 86, 87, 982, 99, 75, 12) able to sleep and retry after failed connectivity, or do they just try once and give up?

As a side note, if you add a header telling tika-server what the file name is, that filename will be included in the log message so you can figure out which file caused the timeout. See: https://wiki.apache.org/tika/TikaJAXRS ... in short, add the header to your request:

{{"Content-Disposition: attachment; filename=foo.csv"}}

Some reasons for timeouts: the VM is overtaxed and processing is just slow; an infinite loop in a parser (these are rare but they -can- will happen); OCR can take minutes per document (do you have tesseract installed?).

> Tika server child restart
>
> Key: TIKA-2776
> URL: https://issues.apache.org/jira/browse/TIKA-2776
> Project: Tika
> Issue Type: Bug
> Reporter: Mario Bisonti
> Assignee: Tim Allison
> Priority: Blocker
> Fix For: 2.0.0, 1.20
>
> Attachments: Log.zip, MCF_JOB.png, log4j.xml, log4j_child.xml,
> log4j_child.xml, man_tika.zip, tikalogchild.log
>
> Hello.
> I use tika server standalone, started with the option:
> java -jar /opt/tika/tika-server-1.19.1.jar -spawnChild
> I use ManifoldCF and Solr to index files using tika server.
> Indexing crashes continuously because I get many errors such as:
> Tika down, retrying: Connection reset
> etc.
> I suspect that, when a process is restarted, the client crashes as mentioned
> here:
> _If the child process is in the process of shutting down, and it gets a new
> request it will return 503 -- Service Unavailable. If the server times out on
> a file, the client will receive an IOException from the closed socket. Note
> that all other files that are being processed will end with an IOException
> from a closed socket when the child process shuts down; e.g. if you send
> three files to tika-server concurrently, and one of them causes a
> catastrophic problem requiring the child to shut down, you won't be able to
> tell which file caused the problems. In the future, we may implement a
> gentler shutdown than we currently have._
> as reported here https://wiki.apache.org/tika/TikaJAXRS
> How can I work around it?
> Thanks a lot
> Mario

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
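The Content-Disposition suggestion in the comment above could look roughly like this on the client side. This is only a sketch using the Python standard library: the helper name build_tika_request, the URL http://localhost:9998/tika, and the sample payload are assumptions for illustration and do not come from this thread. Only the header value follows the example given in the comment.

```python
import urllib.request


def build_tika_request(server_url, payload, filename):
    """Build (but do not send) a PUT request that tells tika-server the file name.

    Hypothetical helper: server_url and filename are supplied by the caller.
    """
    # Drop CR/LF so a strange file name cannot inject extra header lines.
    safe_name = filename.replace("\r", "").replace("\n", "")
    request = urllib.request.Request(server_url, data=payload, method="PUT")
    # Same header shape as in the comment: attachment; filename=foo.csv
    request.add_header("Content-Disposition", f"attachment; filename={safe_name}")
    return request


if __name__ == "__main__":
    # Assumed endpoint; adjust host, port, and path to your own tika-server.
    req = build_tika_request("http://localhost:9998/tika", b"a,b\n1,2\n", "foo.csv")
    print(req.get_method(), req.full_url)
    # urllib stores header names capitalized as "Content-disposition".
    print(req.get_header("Content-disposition"))
```

Passing the returned request to urllib.request.urlopen would send it. The sketch stops short of that so it can be tried without a running server.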
[jira] [Commented] (TIKA-2776) Tika server child restart
[ https://issues.apache.org/jira/browse/TIKA-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701938#comment-16701938 ]

Tim Allison commented on TIKA-2776:

Thank you for the follow-up! To confirm/summarize...

1. I introduced a change in behavior (bug) into legacy server mode in 1.19 (maybe 1.18?) that causes tika-server to return 'not available' forever after an OOM. The legacy behavior was to ignore OOMs and _hope_ nothing too bad happened to your JVM. That said, the change of behavior I introduced is bad, very bad. I've fixed this in 1.20, which should be out in a few weeks.

2. tika-server in -spawnChild mode was restarting the child because you were getting timeouts. This caused problems with Manifold. You've bumped out the timeout to ~16 minutes, and you currently don't have any files that take longer than that...so all appears to work for now.

3. I _think_ we found that {{-spawnChild}} was behaving as designed. To confirm, we did not find that the parent process shut down, and we did find that the child restarted within a few seconds. Is this correct?

My opinion/advice: Depending on the nature of your documents, if you have large enough batches of crazy enough documents, you will eventually hit an infinite loop, and the child will time out and restart. So, for now, you've papered over a problem by bumping out the timeout, but the timeouts will eventually happen. So, what can we do in Tika, what can Manifold do, and what can you do to help avoid this eventuality?

Again, many, many thanks for your patience getting the logging up and running. I still need to improve our wiki on logging with tika-server (based on our interaction) even more.
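One thing a client could do about the restarts described above is sleep and retry when the connection drops or the server answers 503, since the child was seen to come back within a few seconds. The following is a rough sketch in Python, not how ManifoldCF actually behaves: the names is_retryable and call_with_retry, the attempt count, and the delays are all assumptions for illustration.

```python
import time
import urllib.error

# Assumption: 503 is the only HTTP status worth retrying (child shutting down).
RETRYABLE_HTTP_STATUS = {503}


def is_retryable(exc):
    """Return True for failures that a child restart could plausibly explain."""
    # HTTPError is a subclass of URLError, so check it first.
    if isinstance(exc, urllib.error.HTTPError):
        return exc.code in RETRYABLE_HTTP_STATUS
    # Connection reset/refused and socket-level failures wrapped by urllib.
    return isinstance(exc, (urllib.error.URLError, ConnectionError, TimeoutError))


def call_with_retry(send, attempts=5, first_delay=2.0, sleep=time.sleep):
    """Call send() until it succeeds, sleeping with doubling delays in between.

    send is any zero-argument callable that performs one request.
    The last failure, or any non-retryable failure, is re-raised to the caller.
    """
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except Exception as exc:
            if attempt == attempts or not is_retryable(exc):
                raise
            sleep(delay)
            delay *= 2
```

A caller would wrap a single request, for example call_with_retry(lambda: urllib.request.urlopen(request)). A file that still fails after the last attempt is a candidate for logging and skipping, so that one problem document does not stall the whole batch.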
[jira] [Commented] (TIKA-2776) Tika server child restart
[ https://issues.apache.org/jira/browse/TIKA-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701864#comment-16701864 ]

Mario Bisonti commented on TIKA-2776:

Hello Tim,

tesseract is not installed.

It seems that with the parameters "-spawnChild -taskTimeoutMillis 100" the child no longer shuts down.

I will report back on how my issue evolves with the client that indexes many documents.