[jira] [Updated] (TIKA-2776) Tika server child restart
[ https://issues.apache.org/jira/browse/TIKA-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mario Bisonti updated TIKA-2776:
    Attachment: Log.zip

> Tika server child restart
> -------------------------
>
>         Key: TIKA-2776
>         URL: https://issues.apache.org/jira/browse/TIKA-2776
>     Project: Tika
>  Issue Type: Bug
>    Reporter: Mario Bisonti
>    Assignee: Tim Allison
>    Priority: Blocker
>     Fix For: 2.0.0, 1.20
>
> Attachments: Log.zip, log4j.xml, log4j_child.xml, log4j_child.xml, tikalogchild.log
>
> Hello.
> I run tika-server standalone, started with the option:
> java -jar /opt/tika/tika-server-1.19.1.jar -spawnChild
> I use ManifoldCF and Solr to index files using tika-server.
> Indexing crashes continuously because I get many errors like:
> Tika down, retrying: Connection reset
> etc.
> I suspect that the client crashes when the child process is restarted, as described here:
> _If the child process is in the process of shutting down, and it gets a new
> request it will return 503 -- Service Unavailable. If the server times out on
> a file, the client will receive an IOException from the closed socket. Note
> that all other files that are being processed will end with an IOException
> from a closed socket when the child process shuts down; e.g. if you send
> three files to tika-server concurrently, and one of them causes a
> catastrophic problem requiring the child to shut down, you won't be able to
> tell which file caused the problems. In the future, we may implement a
> gentler shutdown than we currently have._
> (quoted from https://wiki.apache.org/tika/TikaJAXRS)
> How can I work around it?
> Thanks a lot
> Mario

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
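One possible client-side workaround for the shutdown behavior quoted above is to retry a request when the server answers 503 or the socket is reset, waiting a little longer each time so the child has time to come back. The following is only a minimal sketch, not how ManifoldCF handles it: the helper names `put_to_tika` and `call_with_retry`, the URL `http://localhost:9998/tika`, and the retry counts and delays are all assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request


def put_to_tika(data, url="http://localhost:9998/tika", timeout=300):
    """Send one document to tika-server and return the extracted text.

    The URL and timeout are assumed defaults; adjust them to the deployment.
    """
    request = urllib.request.Request(url, data=data, method="PUT")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def call_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it succeeds or max_attempts is used up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except urllib.error.HTTPError as err:
            # 503 means the child is shutting down; other statuses are not retried.
            if err.code != 503 or attempt == max_attempts:
                raise
        except (urllib.error.URLError, ConnectionError):
            # Closed or reset socket, for example while the child restarts.
            if attempt == max_attempts:
                raise
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        sleep(base_delay * 2 ** (attempt - 1))
```

A caller would wrap each upload, for example `call_with_retry(lambda: put_to_tika(file_bytes))`. Because the quoted text says a crashing file takes down every request in flight, retrying blindly can resend the problem file as well, so a cap on attempts per file is worth keeping.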
[jira] [Commented] (TIKA-2776) Tika server child restart
[ https://issues.apache.org/jira/browse/TIKA-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695874#comment-16695874 ]

Mario Bisonti commented on TIKA-2776:

Hello Tim,

I am finally able to generate the logs.

Today I started a processing run from my client to parse with Tika. Processing started at 8:30 a.m.; see the entries at 13:34 in MCF_Client.log.

I see that Tika created the log tikalogchild.log1 and wrote to it at 12:57, and then created a new log, tikalogchild.log, at 13:34, which I suppose is when the child was restarted.

So I suppose the client crashed because of this restart?

I have attached the three files in Log.zip.

Could you help me understand how to solve this issue?

I am using tika-server-1.20-20181114.215706-48.jar

Thanks a lot.
Mario

[^Log.zip]
[jira] [Comment Edited] (TIKA-2776) Tika server child restart
[ https://issues.apache.org/jira/browse/TIKA-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695874#comment-16695874 ]

Mario Bisonti edited comment on TIKA-2776 at 11/22/18 12:52 PM:

Hello Tim,

I am finally able to generate the logs.

Today I started a processing run from my client to parse with Tika. Processing started at 8:30 a.m.; see the entries at 13:34 in MCF_Client.log.

I see that Tika created the log tikalogchild.log1 and wrote to it at 12:57, and then created a new log, tikalogchild.log, at 13:34, which I suppose is when the child was restarted.

So I suppose the client crashed because of this restart?

I have attached the three files in Log.zip.

Could you help me understand how to solve this issue?

I am using tika-server-1.20-20181114.215706-48.jar

Thanks a lot.
Mario

[^Log.zip]
[jira] [Commented] (TIKA-2749) OCR on PDFs should "just work" out of the box
[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696034#comment-16696034 ]

Luis Filipe Nassif commented on TIKA-2749:

I don't do that. I thought you were asking whether doing that would improve contrast. Thanks for pointing out c7atess.

> OCR on PDFs should "just work" out of the box
> ---------------------------------------------
>
>         Key: TIKA-2749
>         URL: https://issues.apache.org/jira/browse/TIKA-2749
>     Project: Tika
>  Issue Type: Task
>    Reporter: Tim Allison
>    Priority: Major
>
> There are now two different ways (with various parameters) to trigger OCR on
> inline images within PDFs. The user has to 1) understand that these are
> available and then 2) elect to turn one of those on.
> I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid
> strategy between the 2 options. Users should still be allowed to configure
> as they wish, of course.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
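The issue text does not name the two options. As an illustration only, the sketch below assumes they are (a) extracting inline images and running OCR on each, and (b) rendering each page and running OCR on the rendering, and that tika-server in the 1.x line exposes them through the request headers `X-Tika-PDFextractInlineImages` and `X-Tika-PDFOcrStrategy`. The function name `pdf_ocr_headers` and the mode labels are hypothetical, and the header names should be checked against the documentation for the version in use.

```python
def pdf_ocr_headers(mode):
    """Return request headers that opt in to one PDF OCR approach.

    mode is a hypothetical label chosen for this sketch:
      "inline_images" - extract embedded images and OCR each one
      "render_page"   - render each page to an image and OCR that
    """
    if mode == "inline_images":
        return {"X-Tika-PDFextractInlineImages": "true"}
    if mode == "render_page":
        return {"X-Tika-PDFOcrStrategy": "ocr_only"}
    raise ValueError(f"unknown OCR mode: {mode!r}")
```

The point of the sketch is the one the issue makes: under these assumptions neither header is sent by default, so a client gets no OCR on PDFs unless it knows to pick one.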
[jira] [Commented] (TIKA-2749) OCR on PDFs should "just work" out of the box
[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695964#comment-16695964 ]

Rick Leir commented on TIKA-2749:

Luis,

Tesseract accepts TIFF and JPEG, so why convert it to a PDF? Yes, the contrast needs to be adjusted in many cases. This can be automated, as shown in the c7atess project.

Cheers -- Rick

--
Sorry for being brief. Alternate email is rickleir at yahoo dot com
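To make "automated contrast adjustment" concrete, here is a minimal sketch of one common technique, linear (min-max) contrast stretching on grayscale pixel values. This is not taken from c7atess, whose actual method is not described in this thread; the function name `stretch_contrast` and the 0 to 255 output range are assumptions for illustration.

```python
def stretch_contrast(pixels, low=0, high=255):
    """Linearly rescale grayscale values so they span low..high.

    pixels is a flat sequence of integer intensities. Integer arithmetic
    is used so the result is deterministic (values are floored).
    """
    darkest, brightest = min(pixels), max(pixels)
    if brightest == darkest:
        # A flat image has no contrast to stretch; return it unchanged.
        return list(pixels)
    span = brightest - darkest
    return [low + (p - darkest) * (high - low) // span for p in pixels]
```

For example, `stretch_contrast([100, 125, 150])` maps the darkest value to 0 and the brightest to 255. A real preprocessing step before Tesseract would more likely work on image files with an imaging library and clip outliers first, since a single stray black or white pixel defeats a plain min-max stretch.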