[jira] [Commented] (TIKA-1776) tika stop converting at this pdf document

Tim Allison (JIRA) Tue, 20 Oct 2015 05:48:08 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965053#comment-14965053
 ]


Tim Allison commented on TIKA-1776:
-----------------------------------

I'm not able to reproduce this on Windows or RHEL with PDFBox's app or with 
tika-app 1.10 using the same call that you are.

Are you able to reproduce this outside of Ruby?

Tika does hang forever sometimes...very rarely, and we need to fix it when it 
does, but anyone calling Tika needs to be aware of this and protect against it.
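
One way a Ruby caller might protect against such a hang is to give the child process a deadline and kill it when the deadline passes. The following is only a minimal sketch, not Tika code and not the reporter's class: the helper name {{run_with_timeout}} and the timeout values are assumptions chosen for illustration. It uses only Ruby's standard {{open3}} and {{timeout}} libraries, and it drains stdout and stderr in separate threads so that a full stderr pipe cannot block the child.

```ruby
require 'open3'
require 'timeout'

# Hypothetical helper (an assumption for illustration, not part of Tika).
# Runs cmd, given as an argv array, and returns [stdout, stderr, status].
# If the child is still running after `seconds`, it is killed and
# Timeout::Error is raised.
def run_with_timeout(cmd, seconds)
  Open3.popen3(*cmd) do |stdin, stdout, stderr, wait_thr|
    stdin.close
    # Read both pipes concurrently so neither can fill up and stall the child.
    out_reader = Thread.new { stdout.read }
    err_reader = Thread.new { stderr.read }

    unless wait_thr.join(seconds)
      begin
        Process.kill('KILL', wait_thr.pid)
      rescue Errno::ESRCH
        # The child exited on its own just after the deadline.
      end
      wait_thr.join
      out_reader.join
      err_reader.join
      raise Timeout::Error, "killed after #{seconds}s: #{cmd.join(' ')}"
    end

    [out_reader.value, err_reader.value, wait_thr.value]
  end
end
```

With the command line from this issue, a call could look like {{run_with_timeout(['java', '-Djava.awt.headless=true', '-jar', 'tika-app.jar', '--xml', path], 120)}}, where {{path}} is the PDF and the 120-second limit is an arbitrary choice that the caller would tune, rescuing {{Timeout::Error}} to log the file and move on.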

You could try calling the actual tika-batch code via the app: {{java -jar 
tika-app.jar -i <input_dir> -o <output_dir>}}

That should automatically restart the process if it runs into a hang.

> tika stop converting at this pdf document
> -----------------------------------------
>
>                 Key: TIKA-1776
>                 URL: https://issues.apache.org/jira/browse/TIKA-1776
>             Project: Tika
>          Issue Type: Bug
>          Components: batch
>    Affects Versions: 1.10
>         Environment: Intel Core I5 4GB Ram, Notebook
> OS: debian8, x64, Gnome
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> ruby 2.2.3p173 (2015-08-18 revision 51636) [x86_64-linux]
>            Reporter: tranquillo
>
> Hi and thank you all for this great project,
> I use https://github.com/offenesdresden/ratsinfo-scraper to download 
> thousands of PDFs and convert them from PDF to XML. That works pretty well and 
> needs at most 1-2 minutes even for big files. But for over 15 hours now the 
> process has been hanging with CPU load = 0% at one file: 
> http://ratsinfo.dresden.de/getfile.php?id=149624&type=do 
> which is just 5 MB large, but contains text, scans and CAD plans.
> I run "get_xml()" from the following class (located in tika_app.rb):
> -----------------------------
> require 'rubygems'
> require 'stringio'
> require 'open4'
> class TikaApp
>     def initialize(document)
>         filename = File.basename(document)
>         t = Time.now
>         puts t.strftime("%H:%M:%S") + ": analyze #{filename}"
>         @document = document
>         java_cmd = 'java'
>         java_args = '-server -Djava.awt.headless=true'
>         tika_path = "tika-app.jar"
>         @tika_cmd = "#{java_cmd} #{java_args} -jar '#{tika_path}'"
>     end
>     def get_xml
>         run_tika('--xml')
>     end
>     def get_metadata
>         run_tika('--metadata --json')
>     end
>     private
>     def run_tika(option)
>         final_cmd = "#{@tika_cmd} #{option} '#{@document}'"
>         pid, stdin, stdout, stderr = Open4::popen4(final_cmd)
>         stdout_result = stdout.read.strip
>         stderr_result = stderr.read.strip
>         unless strip_stderr(stderr_result).empty?
>         end
>         stdout_result
>     ensure
>         stdin.close
>         stdout.close
>         stderr.close
>     end
>     def strip_stderr(s)
>         s.gsub(/^(info|warn) - .*$/i, '').strip
>     end
> end
> ----------
> The tika command with this function looks like this: 
> java -server -Djava.awt.headless=true -jar 'tika-app.jar' --xml 
> '~/data/00149624.pdf'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
