Re: OCR with tika-server

Ramirez, Paul M (398J) Wed, 01 Oct 2014 13:48:26 -0700

Nothing to be embarrassed about at all, Kevin. I actually thought maybe it was 
just a typo issue and I randomly happened to catch it. I've definitely done 
that one before myself.
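
On the PATH question quoted below: yes, as far as I know each PATH entry is
treated as a directory to search, so adding /usr/bin/tesseract itself does
nothing useful. Tesseract would still be found through the /usr/bin entry
that was already there, which would also explain why canRun came back true
either way. Here is a rough shell sketch of that lookup. The find_in_path
function and the throwaway directory are made up for illustration and are
not anything from Tika:

```shell
#!/bin/sh
# Hypothetical helper (not part of Tika or any shell): look up a command
# name in a PATH-style string, treating every entry as a directory.
find_in_path() {
    name=$1
    path_value=$2
    old_ifs=$IFS
    IFS=:
    for dir in $path_value; do
        # A match must be a regular, executable file inside the entry.
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Throwaway directory standing in for /usr/bin (assumption for the demo).
demo_dir=$(mktemp -d)
printf '#!/bin/sh\necho fake\n' > "$demo_dir/tesseract"
chmod +x "$demo_dir/tesseract"

# Entry is the executable itself: the lookup finds nothing.
find_in_path tesseract "$demo_dir/tesseract" || echo "not found via file entry"

# Entry is the containing directory: the executable is found.
find_in_path tesseract "$demo_dir"
```

So the entry to add is the directory that holds tesseract, and only if that
directory is not on PATH already.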


Bummed that was not the problem. 
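
On the .tmp files: the name_to_image_type error makes me suspect that your
tesseract build picks the image format from the file name, so it may be
giving up on the .tmp name before it looks at the contents. That is a guess
about your installed version, not something I have confirmed. One cheap way
to check is to copy the temp file to a .tif name and run tesseract on the
copy. A sketch, where with_tif_suffix is a hypothetical helper and the sample
file only stands in for Tika's temp file:

```shell
#!/bin/sh
# Hypothetical helper: copy a .tmp file to a sibling name ending in .tif
# and print the new name. The original file is left untouched.
with_tif_suffix() {
    src=$1
    dst="${src%.tmp}.tif"
    cp "$src" "$dst" && printf '%s\n' "$dst"
}

# Stand-in for a file like the apache-tika-*.tmp one in /tmp (assumption:
# the real file would hold the image bytes Tika extracted).
work_dir=$(mktemp -d)
sample="$work_dir/apache-tika-sample.tmp"
printf 'placeholder bytes, not a real image\n' > "$sample"

copy=$(with_tif_suffix "$sample")
echo "next, try: tesseract $copy out"
```

If the renamed copy still fails, the contents are probably not a TIFF that
this build can read, and the name is not the whole story.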

--Paul

On Oct 1, 2014, at 1:00 PM, kevin slote <[email protected]>
 wrote:

> What I wrote there did have a typo in it. (It's not every day you get to
> embarrass yourself in front of a bunch of guys from NASA)
> 
> But that was not what I had in my terminal when I tested it.
> 
> 
> 
> The actual PATH was:
> 
> 
> 
> 
> "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/bin/tesseract"
> 
> 
> 
> I think what was actually wrong with the path is that I added the entire
> path to the tesseract executable, which was in my /usr/bin/ directory,
> instead of just the directory where tesseract lives.  Is this true?
> 
> 
> 
> I deleted the hard coding from TesseractOCRConfig.java and then printed
> config.getTesseractPath() to stdout.  This field was empty.
> 
> However, I have tesseract installed system wide on this ubuntu vm.
> 
> So the canRun method evaluated as true whether or not the tesseractPath was
> configured correctly.
> 
> 
> 
> I have been slowly trying to debug this all day.  It looks like tika is
> making a tmp file with the .tmp suffix.
> 
> I commented out some of the code so that the files remained in /tmp/.
> 
> 
> 
> It looks like tesseract doesn't like that.
> 
> I tried to OCR these .tmp files to see if I could isolate what was going
> wrong for me.
> 
> 
> 
> kslote@ubuntu:~/tika/tika$ tesseract /tmp/apache-tika-7112319184053570698.tmp out
> 
> Tesseract Open Source OCR Engine
> 
> name_to_image_type:Error:Unrecognized image type:/tmp/apache-tika-7112319184053570698.tmp
> 
> IMAGE::read_header:Error:Can't read this image type:/tmp/apache-tika-7112319184053570698.tmp
> 
> tesseract:Error:Read of file failed:/tmp/apache-tika-7112319184053570698.tmp
> 
> Segmentation fault
> 
> 
> 
> The wiki mentions something about getting tesseract to work with
> .tiff files.  For whatever reason, the tesseract I have installed only
> works for .tiff files.  Would it be recommended that I reinstall tesseract
> from source?
> 
> On Tue, Sep 30, 2014 at 7:28 PM, Ramirez, Paul M (398J) <
> [email protected]> wrote:
> 
>> Is that a typo in your path to tesseract?
>> 
>> /urs/bin/tesseract => /usr/bin/tesseract
>> 
>> --Paul
>> 
>>> On Sep 30, 2014, at 1:48 PM, "kevin slote" <[email protected]> wrote:
>>> 
>>> Unfortunately, that did not do it either.
>>> 
>>> I did:
>>> 
>>>  $ export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
>>> 
>>> Here is the output from printenv
>>> 
>>> kslote@ubuntu:~/tika/tika$ printenv
>>> SHELL=/bin/bash
>>> USERNAME=kslote
>>> XDG_CONFIG_DIRS=/etc/xdg/xdg-gnome:/etc/xdg
>>> DESKTOP_SESSION=gnome
>>> PATH=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
>>> PWD=/home/kslote/tika/tika
>>> HOME=/home/kslote
>>> LOGNAME=kslote
>>> _=/usr/bin/printenv
>>> 
>>> 
>>> On Tue, Sep 30, 2014 at 4:13 PM, Tyler Palsulich <[email protected]>
>>> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> Hmm. Could you try adding tesseract to your PATH? How did you install
>>>> Tesseract? You should be able to do a straightforward `sudo apt-get install
>>>> tesseract-ocr`. After that, the OCR tests should pass. We're still running
>>>> into TIKA-1422, where a mail test fails. But, you can run just the OCR
>>>> tests with `mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest
>>>> -DfailIfNoTests=false`.
>>>> 
>>>> Let me know if that works for you!
>>>> Tyler
>>>> 
>>>>> On Tue, Sep 30, 2014 at 4:00 PM, kevin slote <[email protected]> wrote:
>>>>> 
>>>>> I am working on ubuntu 10.4 and I am having some trouble.
>>>>> Tesseract is installed correctly, but just doing a clone from the repo and
>>>>> installing with maven, I am getting some errors.
>>>>> 
>>>>> This is before I did anything with tesseract installed.
>>>>> 
>>>>> Failed tests:
>>>>> testPPTXOCR(org.apache.tika.parser.ocr.TesseractOCRTest): Check for the image's text.
>>>>> testDOCXOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>>>>> testPDFOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>>>>> 
>>>>> Next I hard coded the tesseractPath:
>>>>> 
>>>>> I went into TesseractOCRConfig.java and hard coded 'tesseractPath.'
>>>>> Then all tests passed and it built successfully, but then I went to post
>>>>> some TIFFs to the server.
>>>>> That didn't work. So I tried adding some System.out.println("hello world")
>>>>> (a little crude I know) inside the unit tests to confirm that tesseract
>>>>> was working correctly.  It looks like something happens in the unit test in
>>>>> TesseractOCRTest.java
>>>>> on the line that says TesseractOCRConfig config = new
>>>>> TesseractOCRConfig();. Printing to stdout before works, but I get nothing
>>>>> after. That happens before the assumeTrue(canRun(config));. So an exception
>>>>> is not getting raised.
>>>>> 
>>>>> Then once everything is built, ocr does not work.  That was why I figured I
>>>>> would ask to see if I missed some sort of configuration step in building
>>>>> it.
>>>>> 
>>>>> Thanks a ton.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Tue, Sep 30, 2014 at 2:57 PM, Mattmann, Chris A (3980) <
>>>>> [email protected]> wrote:
>>>>> 
>>>>>> Dear Kevin,
>>>>>> 
>>>>>> Sure, it already works :) 1.7-SNAPSHOT.
>>>>>> 
>>>>>> See this wiki page:
>>>>>> 
>>>>>> https://wiki.apache.org/tika/TikaOCR
>>>>>> 
>>>>>> I'd be happy to discuss more.
>>>>>> 
>>>>>> Thanks!
>>>>>> 
>>>>>> Cheers,
>>>>>> Chris
>>>>>> 
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> Chris Mattmann, Ph.D.
>>>>>> Chief Architect
>>>>>> Instrument Software and Science Data Systems Section (398)
>>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>> Office: 168-519, Mailstop: 168-527
>>>>>> Email: [email protected]
>>>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> Adjunct Associate Professor, Computer Science Department
>>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: kevin slote <[email protected]>
>>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>>> Date: Tuesday, September 30, 2014 at 8:52 AM
>>>>>> To: "[email protected]" <[email protected]>
>>>>>> Subject: OCR with tika-server
>>>>>> 
>>>>>>> Hello all,
>>>>>>> 
>>>>>>> I have been testing out the integration of tika with tesseract.
>>>>>>> I was wondering if there is  a way to get tika-server to run with
>>>>>>> tesseract's OCR capabilities?
>>>>>>> 
>>>>>>> Best
>>>>>>> 
>>>>>>> Kevin Slote
