Re: OCR with tika-server

kevin slote Wed, 01 Oct 2014 13:01:22 -0700

What I wrote there did have a typo in it. (It's not every day you get to
embarrass yourself in front of a bunch of guys from NASA.)


But that was not what I had in my terminal when I tested it.



The actual PATH was:




"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/bin/tesseract"



I think what was actually wrong with the path is that I added the entire
path to the tesseract executable, which was in my /usr/bin/ directory,
instead of just the directory where tesseract lives.  Is this true?
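
For reference, each PATH entry is a directory that the shell searches for executables, so an entry that points at the executable itself is never used for lookup. The sketch below flags such entries. The helper name check_path_entries and the throwaway demo directory are made up for illustration and are not part of Tika or tesseract.

```shell
# check_path_entries is a hypothetical helper: it prints every entry of a
# colon-separated PATH value that is not an existing directory.
check_path_entries() {
  old_ifs="$IFS"
  IFS=':'
  for entry in $1; do
    if [ -n "$entry" ] && [ ! -d "$entry" ]; then
      echo "not a directory: $entry"
    fi
  done
  IFS="$old_ifs"
}

# Demo with a throwaway directory, so no real tesseract install is needed.
demo_dir="$(mktemp -d)"
touch "$demo_dir/tesseract"          # stand-in for the executable
chmod +x "$demo_dir/tesseract"

# Entry pointing at the executable itself: reported as not a directory.
bad_result="$(check_path_entries "$demo_dir:$demo_dir/tesseract")"
# Entry pointing at the containing directory: nothing reported.
good_result="$(check_path_entries "$demo_dir")"

echo "$bad_result"
```

On a real system, `check_path_entries "$PATH"` lists the suspect entries, and `command -v tesseract` shows which executable, if any, the shell actually resolves.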



I deleted the hard coding from TesseractOCRConfig.java and then printed
config.getTesseractPath() to stdout.  This field was empty.

However, I have tesseract installed system-wide on this Ubuntu VM.

So the canRun method evaluated to true whether or not tesseractPath was
configured correctly.
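
One possible explanation, sketched under the assumption that the check simply runs the configured path followed by the program name (this has not been confirmed against the Tika source): with an empty tesseractPath, the bare name tesseract is resolved through PATH, so a system-wide install satisfies the check. The can_run function below is a hypothetical shell stand-in, not Tika's canRun.

```shell
# Hypothetical stand-in for the availability check; Tika's real canRun may differ.
# It succeeds when "<prefix>tesseract" names something the shell can execute.
can_run() {
  command -v "${1}tesseract" >/dev/null 2>&1
}

# Throwaway fake executable so the demo does not need a real tesseract install.
demo_dir="$(mktemp -d)"
printf '#!/bin/sh\nexit 0\n' > "$demo_dir/tesseract"
chmod +x "$demo_dir/tesseract"

# Empty prefix: the bare name "tesseract" is looked up on PATH.
if ( PATH="$demo_dir:$PATH"; can_run "" ); then
  empty_prefix=found
else
  empty_prefix=missing
fi

# Mistyped prefix: the explicit path does not exist, so the check fails.
if can_run "/urs/bin/"; then
  wrong_prefix=found
else
  wrong_prefix=missing
fi

echo "empty prefix: $empty_prefix, wrong prefix: $wrong_prefix"
```

If that assumption holds, a passing check with an empty prefix says nothing about whether the configured path is right. It only shows that some tesseract is reachable through PATH.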



I have been slowly trying to debug this all day.  It looks like tika is
making a temp file with the .tmp suffix.

I commented out some of the code so that these temp files remained in /tmp/.



It looks like tesseract doesn't like that.

I tried to ocr these .tmp files to see if I could isolate what was going
wrong for me.



kslote@ubuntu:~/tika/tika$ tesseract /tmp/apache-tika-7112319184053570698.tmp out

Tesseract Open Source OCR Engine

name_to_image_type:Error:Unrecognized image type:/tmp/apache-tika-7112319184053570698.tmp

IMAGE::read_header:Error:Can't read this image type:/tmp/apache-tika-7112319184053570698.tmp

tesseract:Error:Read of file failed:/tmp/apache-tika-7112319184053570698.tmp

Segmentation fault



The wiki mentions something about getting tesseract to work with
.tiff files.  For whatever reason, the tesseract I have installed only
works for .tiff files.  Would it be recommended that I reinstall tesseract
from source?
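
A cheaper experiment than rebuilding, sketched under two assumptions: that the temp file really holds TIFF data, and that this tesseract build infers the image type from the file extension (the name_to_image_type error suggests this but does not prove it). The helper copy_with_tif_extension and the demo file name are hypothetical.

```shell
# copy_with_tif_extension is a hypothetical helper, not part of Tika or tesseract.
# It copies a file ending in .tmp to a sibling path ending in .tif and prints
# the new path.
copy_with_tif_extension() {
  src="$1"
  dest="${src%.tmp}.tif"
  cp -- "$src" "$dest" && printf '%s\n' "$dest"
}

# Demo on a throwaway placeholder file, so nothing in /tmp is touched and
# tesseract does not need to be installed to try the copy step.
demo_dir="$(mktemp -d)"
demo_src="$demo_dir/apache-tika-demo.tmp"
printf 'placeholder bytes' > "$demo_src"

demo_dest="$(copy_with_tif_extension "$demo_src")"
echo "$demo_dest"

# With a real temp file left behind by tika, the next step would be:
#   tesseract "$demo_dest" out
```

If tesseract reads the renamed copy, the extension is the likely culprit. If it still fails, the contents of the temp file are worth a closer look.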

On Tue, Sep 30, 2014 at 7:28 PM, Ramirez, Paul M (398J) <
[email protected]> wrote:

> Is that a typo in your path to tesseract?
>
> /urs/bin/tesseract => /usr/bin/tesseract
>
> --Paul
>
> > On Sep 30, 2014, at 1:48 PM, "kevin slote" <[email protected]> wrote:
> >
> > Unfortunately, that did not do it either.
> >
> > I did:
> >
> >   $ export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
> >
> > Here is the output from printenv
> >
> > kslote@ubuntu:~/tika/tika$ printenv
> > SHELL=/bin/bash
> > USERNAME=kslote
> > XDG_CONFIG_DIRS=/etc/xdg/xdg-gnome:/etc/xdg
> > DESKTOP_SESSION=gnome
> > PATH=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
> > PWD=/home/kslote/tika/tika
> > HOME=/home/kslote
> > LOGNAME=kslote
> > _=/usr/bin/printenv
> >
> >
> > On Tue, Sep 30, 2014 at 4:13 PM, Tyler Palsulich <[email protected]>
> > wrote:
> >
> >> Hi,
> >>
> >> Hmm. Could you try adding tesseract to your PATH? How did you install
> >> Tesseract? You should be able to do a straightforward
> >> `sudo apt-get install tesseract-ocr`. After that, the OCR tests should
> >> pass. We're still running into TIKA-1422, where a mail test fails. But
> >> you can run just the OCR tests with
> >> `mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest -DfailIfNoTests=false`.
> >>
> >> Let me know if that works for you!
> >> Tyler
> >>
> >>> On Tue, Sep 30, 2014 at 4:00 PM, kevin slote <[email protected]> wrote:
> >>>
> >>> I am working on Ubuntu 10.04 and I am having some trouble.
> >>> Tesseract is installed correctly, but just doing a clone from the repo
> >>> and installing with maven, I am getting some errors.
> >>>
> >>> This is before I did anything with tesseract installed.
> >>>
> >>> Failed tests:
> >>>  testPPTXOCR(org.apache.tika.parser.ocr.TesseractOCRTest): Check for the image's text.
> >>>  testDOCXOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
> >>>  testPDFOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
> >>>
> >>> Next I hard coded the tesseractPath:
> >>>
> >>> I went into TesseractOCRConfig.java and hard coded 'tesseractPath'.
> >>> Then all tests passed and it built successfully, but then I went to post
> >>> some TIFFs to the server.
> >>> That didn't work. So I tried adding some System.out.println("hello world")
> >>> calls (a little crude, I know) inside the unit tests to confirm that
> >>> tesseract was working correctly.  It looks like something happens in the
> >>> unit test in TesseractOCRTest.java on the line that says
> >>> TesseractOCRConfig config = new TesseractOCRConfig();. Printing to stdout
> >>> before it works, but I get nothing after. That happens before the
> >>> assumeTrue(canRun(config));. So an exception is not getting raised.
> >>>
> >>> Then once everything is built, OCR does not work.  That was why I
> >>> figured I would ask to see if I missed some sort of configuration step
> >>> in building it.
> >>>
> >>> Thanks a ton.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Tue, Sep 30, 2014 at 2:57 PM, Mattmann, Chris A (3980) <
> >>> [email protected]> wrote:
> >>>
> >>>> Dear Kevin,
> >>>>
> >>>> Sure, it already works :) 1.7-SNAPSHOT.
> >>>>
> >>>> See this wiki page:
> >>>>
> >>>> https://wiki.apache.org/tika/TikaOCR
> >>>>
> >>>> I'd be happy to discuss more.
> >>>>
> >>>> Thanks!
> >>>>
> >>>> Cheers,
> >>>> Chris
> >>>>
> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>> Chris Mattmann, Ph.D.
> >>>> Chief Architect
> >>>> Instrument Software and Science Data Systems Section (398)
> >>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >>>> Office: 168-519, Mailstop: 168-527
> >>>> Email: [email protected]
> >>>> WWW:  http://sunset.usc.edu/~mattmann/
> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>> Adjunct Associate Professor, Computer Science Department
> >>>> University of Southern California, Los Angeles, CA 90089 USA
> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> -----Original Message-----
> >>>> From: kevin slote <[email protected]>
> >>>> Reply-To: "[email protected]" <[email protected]>
> >>>> Date: Tuesday, September 30, 2014 at 8:52 AM
> >>>> To: "[email protected]" <[email protected]>
> >>>> Subject: OCR with tika-server
> >>>>
> >>>>> Hello all,
> >>>>>
> >>>>> I have been testing out the integration of tika with tesseract.
> >>>>> I was wondering if there is  a way to get tika-server to run with
> >>>>> tesseract's OCR capabilities?
> >>>>>
> >>>>> Best
> >>>>>
> >>>>> Kevin Slote
> >>
>
