Re: 1.20?

2018-11-28 Thread Lewis John McGibbney
+1, it would be nice to get the recent ENVI work released as well, folks.

On 2018/11/20 23:04:29, Tim Allison  wrote: 
> All,
> POI 4.0.1 will be out shortly with some important bug fixes.  What would
> you all think of targeting 1st/2nd week of December for 1.20?
> 
>  Cheers,
>  Tim
> 


[jira] [Comment Edited] (TIKA-2776) Tika server child restart

2018-11-28 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16699108#comment-16699108
 ]

Tim Allison edited comment on TIKA-2776 at 11/28/18 2:20 PM:
-

Three cheers for logging, and thank you for your patience in configuring those!

Yes, exactly!  It looks like the child process restarted at 2018-11-26 13:18:26 
{{2018-11-26 13:18:26 INFO  MetadataResource:431 - meta 
(application/vnd.openxmlformats}} and then processed more files successfully.  
It can take a few seconds for the server to restart, and it looks in the 
{{manifoldcf.log}} like the initial connectivity dropped at 13:18:25, and then 
there are problems logged through the end of 13:18:26 with worker threads not 
able to reach the server.  This is expected.  Are the clients (worker threads 
88, 39, 8, 86, 87, 982, 99, 75, 12) able to sleep and retry after a failed 
connection, or do they just try once and give up?
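
In case it helps, this is the kind of sleep-and-retry I have in mind.  This is 
an untested sketch; {{callTikaServer()}} is just a placeholder for whatever 
request the worker thread actually makes, not a real ManifoldCF or Tika API:
{code:java}
import java.io.IOException;

public class SleepAndRetry {
    // Placeholder for whatever request the worker thread makes;
    // not a real ManifoldCF or Tika API.
    static void callTikaServer() throws IOException { /* ... */ }

    public static void main(String[] args) throws Exception {
        int maxRetries = 3;
        long backoffMs = 5000L; // give the restarted child a few seconds
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                callTikaServer();
                break; // success
            } catch (IOException e) {
                if (attempt == maxRetries) {
                    throw e; // out of retries; give up for real
                }
                Thread.sleep(backoffMs); // let the child come back up
                backoffMs *= 2;          // simple exponential backoff
            }
        }
    }
}
{code}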

As a side note, if you add a header telling tika-server what the file name is, 
that filename will be included in the log message so you can figure out which 
file caused the timeout.  

See: https://wiki.apache.org/tika/TikaJAXRS ... in short, add the header to 
your request:
{{"Content-Disposition: attachment; filename=foo.csv"}}

Some reasons for timeouts: the VM is overtaxed and processing is just slow; an 
infinite loop in a parser (these are rare, but they -can- will happen); OCR can 
take minutes per document (do you have tesseract installed?).




was (Author: talli...@mitre.org):
Three cheers for logging, and thank you for your patience in configuring those!

Yes, exactly!  It looks like the child process restarted at 2018-11-26 13:18:26 
{{2018-11-26 13:18:26 INFO  MetadataResource:431 - meta 
(application/vnd.openxmlformats}} and then processed more files successfully.  
It can take a few seconds for the server to restart, and it looks in the 
{{manifoldcf.log}} like the initial connectivity dropped at 13:18:25, and then 
there are problems logged through the end of 13:18:26 with worker threads not 
able to reach the server.  This is expected.  Are the clients (worker threads 
88, 39, 8, 86, 87, 982, 99, 75, 12) able to sleep and retry after a failed 
connection, or do they just try once and give up?

As a side note, if you add a header telling tika-server what the file name is, 
that filename will be included in the log message so you can figure out which 
file caused the timeout.  

See: https://wiki.apache.org/tika/TikaJAXRS ... in short, add the header to 
your request:
{{"Content-Disposition: attachment; filename=foo.csv"}}

Some reasons for timeouts: the VM is overtaxed and processing is just slow; an 
infinite loop in a parser (these are rare, but they can happen); OCR can take 
minutes per document (do you have tesseract installed?).



> Tika server child restart
> -
>
> Key: TIKA-2776
> URL: https://issues.apache.org/jira/browse/TIKA-2776
> Project: Tika
>  Issue Type: Bug
>Reporter: Mario Bisonti
>Assignee: Tim Allison
>Priority: Blocker
> Fix For: 2.0.0, 1.20
>
> Attachments: Log.zip, MCF_JOB.png, log4j.xml, log4j_child.xml, 
> log4j_child.xml, man_tika.zip, tikalogchild.log
>
>
> Hello.
> I use tika server standalone, started with the option:
> java -jar /opt/tika/tika-server-1.19.1.jar -spawnChild
> I use ManifoldCF and Solr to index files using tika server.
> It happens that indexing continuously crashes because I get many errors like:
> Tika down, retrying: Connection reset
> etc.
> I suspect that, when a process is restarted, the client crashes, as mentioned 
> here:
> _If the child process is in the process of shutting down, and it gets a new 
> request it will return 503 -- Service Unavailable. If the server times out on 
> a file, the client will receive an IOException from the closed socket. Note 
> that all other files that are being processed will end with an IOException 
> from a closed socket when the child process shuts down; e.g. if you send 
> three files to tika-server concurrently, and one of them causes a 
> catastrophic problem requiring the child to shut down, you won't be able to 
> tell which file caused the problems. In the future, we may implement a 
> gentler shutdown than we currently have._
> as reported here https://wiki.apache.org/tika/TikaJAXRS
> How could I work around it?
> Thanks a lot
> Mario



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2776) Tika server child restart

2018-11-28 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701938#comment-16701938
 ]

Tim Allison commented on TIKA-2776:
---

Thank you for the follow up!

To confirm/summarize...
1. I introduced a change in behavior (bug) into legacy server mode in 1.19 
(maybe 1.18?) that causes tika-server to return 'not available' forever after 
an OOM.  The legacy behavior was to ignore OOMs and _hope_ nothing too bad 
happened to your JVM.  That said, the change of behavior I introduced is bad, 
very bad.  I've fixed this in 1.20, which should be out in a few weeks.
2. tika-server in {{-spawnChild}} mode was restarting the child because you 
were getting timeouts.  This caused problems with ManifoldCF.  You've bumped 
the timeout out to ~16 minutes, and you currently don't have any files that 
take longer than that...so all appears to work for now (see the startup 
example after this list).
3. I _think_ we found that {{-spawnChild}} was behaving as it was designed to 
do.  To confirm: we did not find that the parent process shut down, and we did 
find that the child restarted within a few seconds.  Is this correct?
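
(Startup example for point 2: the task timeout is set when the server is 
launched, so a ~16 minute timeout would look something like the following, 
where 960000 ms = 16 minutes.  The value here is illustrative; use whatever 
your slowest legitimate files need.)
{code}
java -jar /opt/tika/tika-server-1.19.1.jar -spawnChild -taskTimeoutMillis 960000
{code}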

My opinion/advice:
Depending on the nature of your documents, if you have large enough batches of 
crazy enough documents, you will eventually hit an infinite loop, and the child 
will time out and restart.  So, for now, you've wallpapered over a problem by 
bumping out the timeout, but the timeouts will eventually happen.  So, what can 
we do in Tika, what can ManifoldCF do, and what can you do to help avoid this 
eventuality?

Again, many, many thanks for your patience getting the logging up and running.  
I still need to improve our wiki on logging with tika-server (based on our 
interaction) even more.  




[jira] [Commented] (TIKA-2776) Tika server child restart

2018-11-28 Thread Mario Bisonti (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701864#comment-16701864
 ]

Mario Bisonti commented on TIKA-2776:
-

Hello Tim,

tesseract is not installed.

It seems that with the parameters "-spawnChild -taskTimeoutMillis 100" the 
child is no longer shut down.

I will post an update on how my issue with the client that indexes many 
documents evolves.
