[jira] [Updated] (TIKA-2776) Tika server child restart

2018-11-22 Thread Mario Bisonti (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mario Bisonti updated TIKA-2776:

Attachment: Log.zip

> Tika server child restart
> -
>
> Key: TIKA-2776
> URL: https://issues.apache.org/jira/browse/TIKA-2776
> Project: Tika
> Issue Type: Bug
> Reporter: Mario Bisonti
> Assignee: Tim Allison
> Priority: Blocker
> Fix For: 2.0.0, 1.20
>
> Attachments: Log.zip, log4j.xml, log4j_child.xml, log4j_child.xml, 
> tikalogchild.log
>
>
> Hello.
> I run tika-server standalone, started with the -spawnChild option:
> java -jar /opt/tika/tika-server-1.19.1.jar -spawnChild
> I use ManifoldCF and Solr to index files through tika-server.
> Indexing keeps crashing because I get many errors like:
> Tika down, retrying: Connection reset
> etc.
> I suspect that the client crashes when the child process is restarted, as 
> mentioned here:
> _If the child process is in the process of shutting down, and it gets a new 
> request it will return 503 -- Service Unavailable. If the server times out on 
> a file, the client will receive an IOException from the closed socket. Note 
> that all other files that are being processed will end with an IOException 
> from a closed socket when the child process shuts down; e.g. if you send 
> three files to tika-server concurrently, and one of them causes a 
> catastrophic problem requiring the child to shut down, you won't be able to 
> tell which file caused the problems. In the future, we may implement a 
> gentler shutdown than we currently have._
> as documented at https://wiki.apache.org/tika/TikaJAXRS
> How can I work around it?
> Thanks a lot
> Mario
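
One client-side way to tolerate the behaviour quoted above (503 while the child shuts down, an IOException from the closed socket while it restarts) is to retry the request with a growing delay. The following is only a minimal sketch in Python under stated assumptions: `send`, `ServiceUnavailable` and `call_with_retry` are hypothetical names introduced for illustration, not part of Tika or ManifoldCF, and the real connector may handle this differently.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical error a sender raises when the server answers 503."""


def call_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it succeeds or max_attempts is reached.

    send is a hypothetical zero-argument callable that submits one file to
    the server and returns the extracted text. It is assumed to raise
    ServiceUnavailable on a 503 response and OSError (for example
    ConnectionResetError) when the socket is closed underneath it.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except (ServiceUnavailable, OSError):
            if attempt == max_attempts:
                raise  # give up so the caller can record the failed file
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Because concurrent requests all fail when the child shuts down, a retry like this cannot tell which file triggered the restart; a file that still fails after the last attempt is only a candidate and would need to be resubmitted on its own to confirm.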



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2776) Tika server child restart

2018-11-22 Thread Mario Bisonti (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695874#comment-16695874
 ] 

Mario Bisonti commented on TIKA-2776:
-

Hello Tim, I am finally able to generate the log.

Today I started a processing run from my client to parse files with Tika.

It started processing at 8:30 a.m. and stopped at 13:34, as you can see in 
MCF_Client.log.

I see that Tika created the log tikalogchild.log1 and wrote to it at 12:57, 
and then created a new log, tikalogchild.log, at 13:34, which I suppose is 
when the child was restarted.

So I suppose the client crashed because of this restart?

I have attached the three files in Log.zip.

Could you help me understand how to solve this issue?

I am using tika-server-1.20-20181114.215706-48.jar

Thanks a lot.

Mario

[^Log.zip]



[jira] [Comment Edited] (TIKA-2776) Tika server child restart

2018-11-22 Thread Mario Bisonti (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695874#comment-16695874
 ] 

Mario Bisonti edited comment on TIKA-2776 at 11/22/18 12:52 PM:


Hello Tim, I am finally able to generate the log.

Today I started a processing run from my client to parse files with Tika.

It started processing at 8:30 a.m. and stopped at 13:34, as you can see in 
MCF_Client.log.

I see that Tika created the log tikalogchild.log1 and wrote to it at 12:57, 
and then created a new log, tikalogchild.log, at 13:34, which I suppose is 
when the child was restarted.

So I suppose the client crashed because of this restart?

I have attached the three files in Log.zip.

Could you help me understand how to solve this issue?

I am using tika-server-1.20-20181114.215706-48.jar

Thanks a lot.

Mario

[^Log.zip]





[jira] [Commented] (TIKA-2749) OCR on PDFs should "just work" out of the box

2018-11-22 Thread Luis Filipe Nassif (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696034#comment-16696034
 ] 

Luis Filipe Nassif commented on TIKA-2749:
--

I don't do that. I thought you were asking whether doing that would improve 
contrast.

Thanks for pointing out c7atess.

> OCR on PDFs should "just work" out of the box
> -
>
> Key: TIKA-2749
> URL: https://issues.apache.org/jira/browse/TIKA-2749
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> There are now two different ways (with various parameters) to trigger OCR on 
> inline images within PDFs.  The user has to 1) understand that these are 
> available and then 2) elect to turn one of those on.
> I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid 
> strategy between the 2 options.  Users should still be allowed to configure 
> as they wish, of course. 





[jira] [Commented] (TIKA-2749) OCR on PDFs should "just work" out of the box

2018-11-22 Thread Rick Leir (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695964#comment-16695964
 ] 

Rick Leir commented on TIKA-2749:
-

Luis, Tesseract accepts TIFF and JPEG, so why convert it to a PDF? 

Yes, the contrast needs to be adjusted in many cases. This can be automated, 
as shown in the c7atess project. Cheers -- Rick


-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com

