Re: OCR with tika-server

Mattmann, Chris A (3980) Fri, 03 Oct 2014 20:04:22 -0700

Kevin, glad it is now fixed for you!

If you get a chance, please feel free to document
this on the wiki:


https://wiki.apache.org/tika/TikaOCR


You can sign up for an account, and then I can grant
you permissions to edit the file. Let me know!

Cheers,
Chris



++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: kevin slote <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, October 3, 2014 at 4:10 PM
To: "[email protected]" <[email protected]>
Subject: Re: OCR with tika-server

>Hi all,
>
>I just confirmed that the problem was that my version of tesseract was too
>old.
>Maybe it would be a good idea to put something in the canRun method at the
>top of the tesseract unit test to also check that the version of tesseract
>is recent enough?
>
>Older versions of tesseract do not have a "-v" or "--version" flag.  So
>maybe use ProcessBuilder to run that command and parse the string to see
>if it returned an error?
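
A rough sketch of the check suggested above, not a definitive implementation. It assumes the version banner looks like "tesseract 3.02.02" and that a binary without the flag prints no such banner. The class and method names are hypothetical and not part of Tika, and the minimum version is a parameter because the thread does not say which release is required.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika.
final class TesseractVersionCheck {

    // Assumed banner format: "tesseract <major>.<minor>...", e.g. "tesseract 3.02.02".
    private static final Pattern VERSION = Pattern.compile("tesseract\\s+(\\d+)\\.(\\d+)");

    /** Returns {major, minor} parsed from the command output, or null if no banner is found. */
    static int[] parseVersion(String output) {
        Matcher m = VERSION.matcher(output);
        if (!m.find()) {
            return null;
        }
        return new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) };
    }

    /** True if the parsed version is at least minMajor.minMinor; a null version never qualifies. */
    static boolean isAtLeast(int[] version, int minMajor, int minMinor) {
        if (version == null) {
            return false;
        }
        if (version[0] != minMajor) {
            return version[0] > minMajor;
        }
        return version[1] >= minMinor;
    }

    /** Runs "<command> -v" and returns stdout and stderr merged into one string. */
    static String runVersionCommand(String command) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command, "-v");
        pb.redirectErrorStream(true); // the banner may be written to stderr
        Process p = pb.start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        p.waitFor();
        return out.toString();
    }

    /** False if the command cannot be started or reports no sufficiently new version. */
    static boolean canRun(String command, int minMajor, int minMinor) {
        try {
            return isAtLeast(parseVersion(runVersionCommand(command)), minMajor, minMinor);
        } catch (IOException e) {
            return false; // e.g. the executable was not found
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // 3.0 is only a placeholder minimum for illustration.
        System.out.println(canRun("tesseract", 3, 0));
    }
}
```

Keeping the parsing separate from the process call means the parsing can be unit tested without tesseract installed.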
>
>Thanks for everyone's help.
>
>On Fri, Oct 3, 2014 at 2:30 PM, kevin slote <[email protected]> wrote:
>
>> Thanks for following up!
>>
>> I was trying to dig deeper before I responded.
>>
>> Tyler,
>>
>> I followed those instructions.  My version of Tesseract does not ocr the
>> google logo because it is not a tiff.  I used imagemagick to convert it to
>> a tif and tesseract returned a
>> "check_legal_image_size:Error:Only 1,2,4,5,6,8 bpp are supported:32"
>> error, which usually means it needs to be re-sized with imagemagick.
>>
>>
>> Chris,
>>
>> I wrote a python wrapper for tesseract that can parse the documents that
>> were in your test-document repository concerning OCR (testOCR.pdf, etc.).
>> It looks like right now, in TesseractOCRParser.java, the command line
>> argument that is passed to the OS points to a .tmp file in /tmp/.
>>
>> So the command that is executed is
>>
>>    "tesseract /tmp/apache-tika-2409864150710514587.tmp
>> /tmp/apache-tika-1277985370508249503.tmp -l eng -psm 1"
>>
>> This is not working for me.  When I grab those .tmp files and try to ocr
>> them from the command line, tesseract gets thrown for a loop.
>>
>> From what I can tell, the tesseract I have installed can only handle
>> .tif files.
>> I can back this up by citing the tesseract page:
>> https://code.google.com/p/tesseract-ocr/wiki/ReadMe
>>
>>  If Tesseract isn't available for your distribution, or you want to use a
>> newer version than they offer, you can compile your own
>> <https://code.google.com/p/tesseract-ocr/wiki/Compiling>. Note that older
>> versions of Tesseract only supported processing .tiff files.
>>
>> So, I think that upgrading tesseract or moving to ubuntu 12 or higher will
>> solve my problems.
>>
>> I will let the listserv know if that fixes it.
>>
>>
>> Kevin Slote
>>
>>
>>
>> On Wed, Oct 1, 2014 at 5:13 PM, Mattmann, Chris A (3980) <
>> [email protected]> wrote:
>>
>>> What type of image is it, Kevin?
>>>
>>> If it’s a TIFF, you need to install tesseract with special lib tiff
>>> parameters. See:
>>>
>>> https://gist.github.com/henrik/1967035
>>>
>>>
>>> Can you parse the image file with tesseract by itself, without
>>> Tika’s tmp image?
>>>
>>> -----Original Message-----
>>> From: <Ramirez>, "Paul M   (398J)" <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Wednesday, October 1, 2014 at 1:47 PM
>>> To: "<[email protected]>" <[email protected]>
>>> Subject: Re: OCR with tika-server
>>>
>>> >Nothing to be embarrassed about at all, Kevin. I actually thought maybe
>>> >it was just a typo issue and I randomly happened to catch that. I've
>>> >definitely done that one before myself.
>>> >
>>> >Bummed that was not the problem.
>>> >
>>> >--Paul
>>> >
>>> >On Oct 1, 2014, at 1:00 PM, kevin slote <[email protected]>
>>> > wrote:
>>> >
>>> >> What I wrote there did have a typo in it. (It's not every day you get
>>> >> to embarrass yourself in front of a bunch of guys from NASA)
>>> >>
>>> >> But that was not what I had in my terminal when I tested it.
>>> >>
>>> >>
>>> >>
>>> >> The actual PATH was:
>>> >>
>>> >> "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/bin/tesseract"
>>> >>
>>> >>
>>> >>
>>> >> I think what was actually wrong with the path is that I added the entire
>>> >> path to the tesseract executable, which was in my /usr/bin/ directory,
>>> >> instead of just the directory where tesseract lives.  Is this true?
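
Entries in PATH are directories that get searched for a command name, so an entry that points at the executable itself does not help locate it. Below is a minimal sketch of that lookup, with hypothetical class and method names (nothing here is taken from Tika):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
final class PathLookup {

    /** Returns the non-empty PATH entries that are not existing directories. */
    static List<String> nonDirectoryEntries(String pathValue) {
        List<String> suspicious = new ArrayList<>();
        for (String entry : pathValue.split(File.pathSeparator)) {
            if (!entry.isEmpty() && !new File(entry).isDirectory()) {
                suspicious.add(entry);
            }
        }
        return suspicious;
    }

    /** Looks for an executable file named `command` in each PATH directory; null if none. */
    static File findOnPath(String command, String pathValue) {
        for (String entry : pathValue.split(File.pathSeparator)) {
            if (entry.isEmpty()) {
                continue; // empty entries are skipped in this sketch
            }
            File candidate = new File(entry, command);
            if (candidate.isFile() && candidate.canExecute()) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String path = System.getenv("PATH");
        if (path != null) {
            System.out.println("tesseract resolves to: " + findOnPath("tesseract", path));
            System.out.println("entries that are not directories: " + nonDirectoryEntries(path));
        }
    }
}
```

With the PATH quoted above, /usr/bin is already listed, so a tesseract binary in /usr/bin would still be found through that entry; the extra /usr/bin/tesseract entry would simply be reported as not being a directory.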
>>> >>
>>> >>
>>> >>
>>> >> I deleted the hard coding from TesseractOCRConfig.java and then
>>> >> printed config.getTesseractPath() to stdout.  This field was empty.
>>> >>
>>> >> However, I have tesseract installed system wide on this ubuntu vm.
>>> >>
>>> >> So the canRun method evaluated as true whether or not the tesseractPath
>>> >> was configured correctly.
>>> >>
>>> >>
>>> >>
>>> >> I have been slowly trying to debug this all day.  It looks like tika is
>>> >> making a tmp file with the .tmp suffix.
>>> >>
>>> >> I commented out some of the code so that the files remained in /tmp/.
>>> >>
>>> >>
>>> >>
>>> >> It looks like tesseract doesn't like that.
>>> >>
>>> >> I tried to ocr these .tmp files to see if I could isolate what was
>>> >> going wrong for me.
>>> >>
>>> >>
>>> >>
>>> >> kslote@ubuntu:~/tika/tika$ tesseract
>>> >> /tmp/apache-tika-7112319184053570698.tmp out
>>> >>
>>> >> Tesseract Open Source OCR Engine
>>> >>
>>> >> name_to_image_type:Error:Unrecognized image
>>> >> type:/tmp/apache-tika-7112319184053570698.tmp
>>> >>
>>> >> IMAGE::read_header:Error:Can't read this image
>>> >> type:/tmp/apache-tika-7112319184053570698.tmp
>>> >>
>>> >> tesseract:Error:Read of file failed:/tmp/apache-tika-7112319184053570698.tmp
>>> >>
>>> >> Segmentation fault
>>> >>
>>> >>
>>> >>
>>> >> On the wiki it mentions something about getting tesseract to work with
>>> >> .tiff files.  For whatever reason, the tesseract I have installed only
>>> >> works for .tiff files.  Would it be recommended that I reinstall
>>> >> tesseract from source?
>>> >>
>>> >> On Tue, Sep 30, 2014 at 7:28 PM, Ramirez, Paul M (398J) <
>>> >> [email protected]> wrote:
>>> >>
>>> >>> Is that a typo in your path to tesseract?
>>> >>>
>>> >>> /urs/bin/tesseract => /usr/bin/tesseract
>>> >>>
>>> >>> --Paul
>>> >>>
>>> >>>> On Sep 30, 2014, at 1:48 PM, "kevin slote" <[email protected]> wrote:
>>> >>>>
>>> >>>> Unfortunately, that did not do it either.
>>> >>>>
>>> >>>> I did:
>>> >>>>
>>> >>>>  $ export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
>>> >>>>
>>> >>>> Here is the output from printenv
>>> >>>>
>>> >>>> kslote@ubuntu:~/tika/tika$ printenv
>>> >>>> SHELL=/bin/bash
>>> >>>> USERNAME=kslote
>>> >>>> XDG_CONFIG_DIRS=/etc/xdg/xdg-gnome:/etc/xdg
>>> >>>> DESKTOP_SESSION=gnome
>>> >>>> PATH=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
>>> >>>> PWD=/home/kslote/tika/tika
>>> >>>> HOME=/home/kslote
>>> >>>> LOGNAME=kslote
>>> >>>> _=/usr/bin/printenv
>>> >>>>
>>> >>>>
>>> >>>> On Tue, Sep 30, 2014 at 4:13 PM, Tyler Palsulich <[email protected]> wrote:
>>> >>>>
>>> >>>>> Hi,
>>> >>>>>
>>> >>>>> Hmm. Could you try adding tesseract to your PATH? How did you install
>>> >>>>> Tesseract? You should be able to do a straightforward
>>> >>>>> `sudo apt-get install tesseract-ocr`. After that, the OCR tests should
>>> >>>>> pass. We're still running into TIKA-1422, where a mail test fails. But,
>>> >>>>> you can run just the OCR tests with
>>> >>>>> `mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest -DfailIfNoTests=false`.
>>> >>>>>
>>> >>>>> Let me know if that works for you!
>>> >>>>> Tyler
>>> >>>>>
>>> >>>>>> On Tue, Sep 30, 2014 at 4:00 PM, kevin slote <[email protected]> wrote:
>>> >>>>>>
>>> >>>>>> I am working on Ubuntu 10.04 and I am having some trouble.
>>> >>>>>> Tesseract is installed correctly, but just doing a clone from the
>>> >>>>>> repo and installing with maven, I am getting some errors.
>>> >>>>>>
>>> >>>>>> This is before I did anything with tesseract installed.
>>> >>>>>>
>>> >>>>>> Failed tests:
>>> >>>>>>   testPPTXOCR(org.apache.tika.parser.ocr.TesseractOCRTest): Check for the image's text.
>>> >>>>>>   testDOCXOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>>> >>>>>>   testPDFOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>>> >>>>>>
>>> >>>>>> Next I hard coded the tesseractPath:
>>> >>>>>>
>>> >>>>>> I went into TesseractOCRConfig.java and hard coded 'tesseractPath'.
>>> >>>>>> Then all tests passed and it built successfully, but then I went to
>>> >>>>>> post some tiffs to the server.
>>> >>>>>> That didn't work. So I tried adding some
>>> >>>>>> System.out.println("hello world") (a little crude I know) inside the
>>> >>>>>> unit tests to confirm that tesseract was working correctly.  It looks
>>> >>>>>> like something happens in the unit test in TesseractOCRTest.java
>>> >>>>>> on the line that says TesseractOCRConfig config = new
>>> >>>>>> TesseractOCRConfig();. Printing to stdout before works, but I get
>>> >>>>>> nothing after. That happens before the assumeTrue(canRun(config));.
>>> >>>>>> So an exception is not getting raised.
>>> >>>>>>
>>> >>>>>> Then once everything is built, ocr does not work.  That was why I
>>> >>>>>> figured I would ask to see if I missed some sort of configuration
>>> >>>>>> step in building it.
>>> >>>>>>
>>> >>>>>> Thanks a ton.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Tue, Sep 30, 2014 at 2:57 PM, Mattmann, Chris A (3980) <
>>> >>>>>> [email protected]> wrote:
>>> >>>>>>
>>> >>>>>>> Dear Kevin,
>>> >>>>>>>
>>> >>>>>>> Sure, it already works :) 1.7-SNAPSHOT.
>>> >>>>>>>
>>> >>>>>>> See this wiki page:
>>> >>>>>>>
>>> >>>>>>> https://wiki.apache.org/tika/TikaOCR
>>> >>>>>>>
>>> >>>>>>> I'd be happy to discuss more.
>>> >>>>>>>
>>> >>>>>>> Thanks!
>>> >>>>>>>
>>> >>>>>>> Cheers,
>>> >>>>>>> Chris
>>> >>>>>>>
>>> >>>>>>> -----Original Message-----
>>> >>>>>>> From: kevin slote <[email protected]>
>>> >>>>>>> Reply-To: "[email protected]" <[email protected]>
>>> >>>>>>> Date: Tuesday, September 30, 2014 at 8:52 AM
>>> >>>>>>> To: "[email protected]" <[email protected]>
>>> >>>>>>> Subject: OCR with tika-server
>>> >>>>>>>
>>> >>>>>>>> Hello all,
>>> >>>>>>>>
>>> >>>>>>>> I have been testing out the integration of tika with tesseract.
>>> >>>>>>>> I was wondering if there is a way to get tika-server to run with
>>> >>>>>>>> tesseract's OCR capabilities?
>>> >>>>>>>>
>>> >>>>>>>> Best
>>> >>>>>>>>
>>> >>>>>>>> Kevin Slote
