What I wrote there did have a typo in it. (It's not every day you get to embarrass yourself in front of a bunch of guys from NASA.)
But that was not what I had in my terminal when I tested it. The actual PATH was:

"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/bin/tesseract"

I think what was actually wrong with the path is that I added the full path to the tesseract executable, which is in my /usr/bin/ directory, instead of just the directory where tesseract lives. Is this true?

I deleted the hard coding from TesseractOCRConfig.java and then printed config.getTesseractPath() to stdout. This field was empty. However, I have tesseract installed system-wide on this Ubuntu VM, so the canRun method evaluated to true whether or not tesseractPath was configured correctly.

I have been slowly trying to debug this all day. It looks like tika is making temp files with the .tmp extension. I commented out some of the code so that they remained in /tmp/. It looks like tesseract doesn't like that. I tried to OCR these .tmp files to see if I could isolate what was going wrong for me.

kslote@ubuntu:~/tika/tika$ tesseract /tmp/apache-tika-7112319184053570698.tmp out
Tesseract Open Source OCR Engine
name_to_image_type:Error:Unrecognized image type:/tmp/apache-tika-7112319184053570698.tmp
IMAGE::read_header:Error:Can't read this image type:/tmp/apache-tika-7112319184053570698.tmp
tesseract:Error:Read of file failed:/tmp/apache-tika-7112319184053570698.tmp
Segmentation fault

The wiki mentions something about getting tesseract to work with .tiff files. For whatever reason, the tesseract I have installed only works for .tiff files. Would it be recommended that I reinstall tesseract from source?

On Tue, Sep 30, 2014 at 7:28 PM, Ramirez, Paul M (398J) <[email protected]> wrote:

> Is that a typo in your path to tesseract?
>
> /urs/bin/tesseract => /usr/bin/tesseract
>
> --Paul
>
>
> On Sep 30, 2014, at 1:48 PM, "kevin slote" <[email protected]> wrote:
> >
> > Unfortunately, that did not do it either.
> >
> > I did:
> >
> > $export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
> >
> > Here is the output from printenv
> >
> > kslote@ubuntu:~/tika/tika$ printenv
> > SHELL=/bin/bash
> > USERNAME=kslote
> > XDG_CONFIG_DIRS=/etc/xdg/xdg-gnome:/etc/xdg
> > DESKTOP_SESSION=gnome
> > PATH=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
> > PWD=/home/kslote/tika/tika
> > HOME=/home/kslote
> > LOGNAME=kslote
> > _=/usr/bin/printenv
> >
> >
> > On Tue, Sep 30, 2014 at 4:13 PM, Tyler Palsulich <[email protected]>
> > wrote:
> >
> >> Hi,
> >>
> >> Hmm. Could you try adding tesseract to your PATH? How did you install
> >> Tesseract? You should be able to do a straightforward `sudo apt-get install
> >> tesseract-ocr`. After that, the OCR tests should pass. We're still running
> >> into TIKA-1422, where a mail test fails. But, you can run just the OCR
> >> tests with `mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest
> >> -DfailIfNoTests=false`.
> >>
> >> Let me know if that works for you!
> >> Tyler
> >>
> >>> On Tue, Sep 30, 2014 at 4:00 PM, kevin slote <[email protected]> wrote:
> >>>
> >>> I am working on ubuntu 10.4. and I am having some trouble.
> >>> Tesseract is installed correctly, but just doing a clone from the repo and
> >>> installing with maven, I am getting some errors.
> >>>
> >>> This is before I did anything with tesseract installed.
> >>>
> >>> Failed tests: testPPTXOCR(org.apache.tika.parser.ocr.TesseractOCRTest):
> >>> Check for the image's text.
> >>> testDOCXOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
> >>> testPDFOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
> >>>
> >>> Next I hard coded the tesseractPath:
> >>>
> >>> I went into the TesseractOCRConfig.java and hard coded 'tesseractPath.'
> >>> The all tests passed and it built successfully, but then I went to post
> >>> some tiff's to the server.
> >>> That didn't work. So I tried adding some System.out.println("hello world")
> >>> (a little crude I know) inside the unit tests to confirm that tesseract
> >>> was working correctly. It looks like something happens in the unit test in
> >>> TesseractOCRTest.java
> >>> on the line that says TesseractOCRConfig config = new
> >>> TesseractOCRConfig();. Printing to stdout before works, but I get nothing
> >>> after. That happens before the assumeTrue(canRun(config));. So an exception
> >>> is not get raised.
> >>>
> >>> Then once everything is built, ocr does not work. That was why I figured I
> >>> would ask to see if I missed some sort of configuration step in building
> >>> it.
> >>>
> >>> Thanks a ton.
> >>>
> >>>
> >>> On Tue, Sep 30, 2014 at 2:57 PM, Mattmann, Chris A (3980) <
> >>> [email protected]> wrote:
> >>>
> >>>> Dear Kevin,
> >>>>
> >>>> Sure, it already works :) 1.7-SNAPSHOT.
> >>>>
> >>>> See this wiki page:
> >>>>
> >>>> https://wiki.apache.org/tika/TikaOCR
> >>>>
> >>>> I'd be happy to discuss more.
> >>>>
> >>>> Thanks!
> >>>>
> >>>> Cheers,
> >>>> Chris
> >>>>
> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>> Chris Mattmann, Ph.D.
> >>>> Chief Architect
> >>>> Instrument Software and Science Data Systems Section (398)
> >>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >>>> Office: 168-519, Mailstop: 168-527
> >>>> Email: [email protected]
> >>>> WWW: http://sunset.usc.edu/~mattmann/
> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>> Adjunct Associate Professor, Computer Science Department
> >>>> University of Southern California, Los Angeles, CA 90089 USA
> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>
> >>>>
> >>>> -----Original Message-----
> >>>> From: kevin slote <[email protected]>
> >>>> Reply-To: "[email protected]" <[email protected]>
> >>>> Date: Tuesday, September 30, 2014 at 8:52 AM
> >>>> To: "[email protected]" <[email protected]>
> >>>> Subject: OCR with tika-server
> >>>>
> >>>>> Hello all,
> >>>>>
> >>>>> I have been testing out the integration of tika with tesseract.
> >>>>> I was wondering if there is a way to get tika-server to run with
> >>>>> tesseract's OCR capabilities?
> >>>>>
> >>>>> Best
> >>>>>
> >>>>> Kevin Slote
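
On the PATH question in this thread: entries in PATH are directories that get searched for a program name, so appending /usr/bin/tesseract makes the lookup try /usr/bin/tesseract/tesseract, which does not exist. Since /usr/bin is already listed, tesseract should resolve without the extra entry. Below is a minimal Java sketch of that lookup, assuming a Unix-style PATH. The class PathLookup and the method findOnPath are hypothetical names used for illustration only; they are not part of Tika and this is not how Tika's canRun check is implemented.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper for illustration; not part of Tika.
public class PathLookup {

    /**
     * Treats every entry of pathValue as a directory and returns the first
     * entry/executable that is a regular, executable file.
     */
    public static Optional<Path> findOnPath(String pathValue, String executable) {
        for (String entry : pathValue.split(File.pathSeparator)) {
            if (entry.isEmpty()) {
                continue; // skip empty entries such as a trailing ':'
            }
            Path candidate = Paths.get(entry, executable);
            if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String path = System.getenv("PATH");
        // Prints the resolved location, or a message if nothing matched.
        System.out.println(findOnPath(path == null ? "" : path, "tesseract")
                .map(Path::toString)
                .orElse("tesseract not found on PATH"));
    }
}
```

With this sketch, a PATH value that contains only the full path to the executable finds nothing, while a PATH value that contains the executable's parent directory does.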
