Re: OCR with tika-server

Mattmann, Chris A (3980) Fri, 03 Oct 2014 10:52:49 -0700

Hi Kevin, just checking back: did you get it working?

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: <Mattmann>, Chris Mattmann <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, October 1, 2014 at 2:13 PM
To: "[email protected]" <[email protected]>
Subject: Re: OCR with tika-server

>What type of image is it, Kevin?
>
>If it’s a TIFF, you need to install tesseract with special libtiff
>parameters. See:
>
>https://gist.github.com/henrik/1967035
>
>
>Can you parse the image file with tesseract by itself, without
>Tika’s tmp image?
>
>-----Original Message-----
>From: <Ramirez>, "Paul M   (398J)" <[email protected]>
>Reply-To: "[email protected]" <[email protected]>
>Date: Wednesday, October 1, 2014 at 1:47 PM
>To: "<[email protected]>" <[email protected]>
>Subject: Re: OCR with tika-server
>
>>Nothing to be embarrassed about at all, Kevin. I actually thought maybe it
>>was just a typo issue and I randomly happened to catch it. I've
>>definitely done that one before myself.
>>
>>Bummed that was not the problem.
>>
>>--Paul
>>
>>On Oct 1, 2014, at 1:00 PM, kevin slote <[email protected]>
>> wrote:
>>
>>> What I wrote there did have a typo in it. (It's not every day you get
>>> to embarrass yourself in front of a bunch of guys from NASA)
>>> 
>>> But that was not what I had in my terminal when I tested it.
>>> 
>>> 
>>> 
>>> The actual PATH was:
>>> 
>>> "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/bin/tesseract"
>>> 
>>> I think what was actually wrong with the path is that I added the
>>> entire path to the tesseract executable, which was in my /usr/bin/
>>> directory, instead of just the directory where tesseract lives. Is
>>> this true?
>>> 
>>> 
>>> 
>>> I deleted the hard coding from TesseractOCRConfig.java and then
>>> printed config.getTesseractPath() to stdout.  This field was empty.
>>> 
>>> However, I have tesseract installed system wide on this ubuntu vm.
>>> 
>>> So the canRun method evaluated as true whether or not the
>>> tesseractPath was configured correctly.
>>> 
>>> 
>>> 
>>> I have been slowly trying to debug this all day.  It looks like tika is
>>> making a tmp file with the .tmp suffix.
>>> 
>>> I commented out some of the code so that the files remained in /tmp/.
>>> 
>>> 
>>> 
>>> It looks like tesseract doesn't like that.
>>> 
>>> I tried to ocr these .tmp files to see if I could isolate what was
>>> going wrong for me.
>>> 
>>> 
>>> 
>>> kslote@ubuntu:~/tika/tika$ tesseract /tmp/apache-tika-7112319184053570698.tmp out
>>> Tesseract Open Source OCR Engine
>>> name_to_image_type:Error:Unrecognized image type:/tmp/apache-tika-7112319184053570698.tmp
>>> IMAGE::read_header:Error:Can't read this image type:/tmp/apache-tika-7112319184053570698.tmp
>>> tesseract:Error:Read of file failed:/tmp/apache-tika-7112319184053570698.tmp
>>> Segmentation fault
>>> 
>>> 
>>> 
>>> The wiki mentions something about getting tesseract to work with
>>> .tiff files.  For whatever reason, the tesseract I have installed only
>>> works for .tiff files.  Would you recommend that I reinstall tesseract
>>> from source?
>>> 
>>> On Tue, Sep 30, 2014 at 7:28 PM, Ramirez, Paul M (398J) <
>>> [email protected]> wrote:
>>> 
>>>> Is that a typo in your path to tesseract?
>>>> 
>>>> /urs/bin/tesseract => /usr/bin/tesseract
>>>> 
>>>> --Paul
>>>> 
>>>>> On Sep 30, 2014, at 1:48 PM, "kevin slote" <[email protected]> wrote:
>>>>> 
>>>>> Unfortunately, that did not do it either.
>>>>> 
>>>>> I did:
>>>>> 
>>>>>  $ export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
>>>>> 
>>>>> Here is the output from printenv
>>>>> 
>>>>> kslote@ubuntu:~/tika/tika$ printenv
>>>>> SHELL=/bin/bash
>>>>> USERNAME=kslote
>>>>> XDG_CONFIG_DIRS=/etc/xdg/xdg-gnome:/etc/xdg
>>>>> DESKTOP_SESSION=gnome
>>>>> PATH=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
>>>>> PWD=/home/kslote/tika/tika
>>>>> HOME=/home/kslote
>>>>> LOGNAME=kslote
>>>>> _=/usr/bin/printenv
>>>>> 
>>>>> 
>>>>> On Tue, Sep 30, 2014 at 4:13 PM, Tyler Palsulich <[email protected]> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> Hmm. Could you try adding tesseract to your PATH? How did you install
>>>>>> Tesseract? You should be able to do a straightforward
>>>>>> `sudo apt-get install tesseract-ocr`. After that, the OCR tests should
>>>>>> pass. We're still running into TIKA-1422, where a mail test fails.
>>>>>> But you can run just the OCR tests with
>>>>>> `mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest -DfailIfNoTests=false`.
>>>>>> 
>>>>>> Let me know if that works for you!
>>>>>> Tyler
>>>>>> 
>>>>>>> On Tue, Sep 30, 2014 at 4:00 PM, kevin slote <[email protected]> wrote:
>>>>>>> 
>>>>>>> I am working on Ubuntu 10.04 and I am having some trouble.
>>>>>>> Tesseract is installed correctly, but after just cloning the repo
>>>>>>> and installing with maven, I am getting some errors.
>>>>>>> 
>>>>>>> This is before I did anything with tesseract installed.
>>>>>>> 
>>>>>>> Failed tests:
>>>>>>>   testPPTXOCR(org.apache.tika.parser.ocr.TesseractOCRTest): Check for the image's text.
>>>>>>>   testDOCXOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>>>>>>>   testPDFOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>>>>>>> 
>>>>>>> Next I hard coded the tesseractPath:
>>>>>>> 
>>>>>>> I went into TesseractOCRConfig.java and hard coded 'tesseractPath'.
>>>>>>> Then all the tests passed and it built successfully, but when I went
>>>>>>> to post some TIFFs to the server, that didn't work. So I tried adding
>>>>>>> some System.out.println("hello world") calls (a little crude, I know)
>>>>>>> inside the unit tests to confirm that tesseract was working
>>>>>>> correctly.  It looks like something happens in the unit test in
>>>>>>> TesseractOCRTest.java on the line that says
>>>>>>> TesseractOCRConfig config = new TesseractOCRConfig();. Printing to
>>>>>>> stdout before that line works, but I get nothing after it. That
>>>>>>> happens before the assumeTrue(canRun(config));. So an exception is
>>>>>>> not getting raised.
>>>>>>> 
>>>>>>> Then once everything is built, ocr does not work.  That was why I
>>>>>>> figured I would ask to see if I missed some sort of configuration
>>>>>>> step in building it.
>>>>>>> 
>>>>>>> Thanks a ton.
>>>>>>> 
>>>>>>> On Tue, Sep 30, 2014 at 2:57 PM, Mattmann, Chris A (3980) <
>>>>>>> [email protected]> wrote:
>>>>>>> 
>>>>>>>> Dear Kevin,
>>>>>>>> 
>>>>>>>> Sure, it already works :) 1.7-SNAPSHOT.
>>>>>>>> 
>>>>>>>> See this wiki page:
>>>>>>>> 
>>>>>>>> https://wiki.apache.org/tika/TikaOCR
>>>>>>>> 
>>>>>>>> I'd be happy to discuss more.
>>>>>>>> 
>>>>>>>> Thanks!
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Chris
>>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: kevin slote <[email protected]>
>>>>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>>>>> Date: Tuesday, September 30, 2014 at 8:52 AM
>>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>>> Subject: OCR with tika-server
>>>>>>>> 
>>>>>>>>> Hello all,
>>>>>>>>> 
>>>>>>>>> I have been testing out the integration of tika with tesseract.
>>>>>>>>> I was wondering if there is a way to get tika-server to run with
>>>>>>>>> tesseract's OCR capabilities?
>>>>>>>>> 
>>>>>>>>> Best
>>>>>>>>> 
>>>>>>>>> Kevin Slote
>>>>>> 
>>>> 
>>
>
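
On the PATH question raised in the thread: PATH is a list of directories
(separated by ':' on Linux), and a command lookup joins each directory with
the command name. An entry that names the executable itself, such as
/usr/bin/tesseract, therefore never matches, while /usr/bin (already in the
list) does. The Java sketch below imitates that lookup to illustrate the
point. The names PathLookup and findOnPath are hypothetical and are not part
of Tika, whose own tesseract detection may work differently.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, not Tika code: mimics how a shell resolves a command.
final class PathLookup {

    /**
     * Treats every PATH entry as a directory and returns the first
     * entry/name combination that is an executable regular file.
     */
    static Optional<Path> findOnPath(String pathValue, String name) {
        for (String entry : pathValue.split(File.pathSeparator)) {
            if (entry.isEmpty()) {
                continue; // skip empty entries in this sketch
            }
            Path candidate = Paths.get(entry, name);
            if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String path = System.getenv("PATH");
        if (path == null) {
            path = "";
        }
        Optional<Path> hit = findOnPath(path, "tesseract");
        System.out.println(hit.isPresent()
                ? "tesseract resolves to " + hit.get()
                : "tesseract was not found in any PATH directory");
    }
}
```

With this lookup, the entry /usr/bin/tesseract is tried as
/usr/bin/tesseract/tesseract, which does not exist, so that entry contributes
nothing. The binary is still found through the /usr/bin entry.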
