Nothing to be embarrassed about at all, Kevin. I actually thought maybe it was just a typo issue and I randomly happened to catch it. I've definitely done that one before myself.
Bummed that was not the problem.

--Paul

On Oct 1, 2014, at 1:00 PM, kevin slote <[email protected]> wrote:

> What I wrote there did have a typo in it. (It's not every day you get to
> embarrass yourself in front of a bunch of guys from NASA)
>
> But that was not what I had in my terminal when I tested it.
>
> The actual PATH was:
>
> "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/bin/tesseract"
>
> I think what was actually wrong with the path is that I added the entire
> path to the tesseract executable, which was in my /usr/bin/ directory,
> instead of just the directory where tesseract lives. Is this true?
>
> I deleted the hard coding from TesseractOCRConfig.java and then printed
> config.getTesseractPath() to stdout. This field was empty.
> However, I have tesseract installed system wide on this ubuntu vm.
> So the canRun method evaluated as true whether or not the tesseractPath was
> configured correctly.
>
> I have been slowly trying to debug this all day. It looks like tika is
> making a tmp file with the .tmp suffix.
> I commented out some of the code so that the files remained in /tmp/.
>
> It looks like tesseract doesn't like that.
> I tried to ocr these .tmp files to see if I could isolate what was going
> wrong for me.
>
> kslote@ubuntu:~/tika/tika$ tesseract /tmp/apache-tika-7112319184053570698.tmp out
> Tesseract Open Source OCR Engine
> name_to_image_type:Error:Unrecognized image type:/tmp/apache-tika-7112319184053570698.tmp
> IMAGE::read_header:Error:Can't read this image type:/tmp/apache-tika-7112319184053570698.tmp
> tesseract:Error:Read of file failed:/tmp/apache-tika-7112319184053570698.tmp
> Segmentation fault
>
> On the wiki it mentions something about getting tesseract to work with
> .tiff files. For whatever reason, the tesseract I have installed only
> works for .tiff files. Would you recommend that I reinstall tesseract
> from the source?
>
> On Tue, Sep 30, 2014 at 7:28 PM, Ramirez, Paul M (398J) <[email protected]> wrote:
>
>> Is that a typo in your path to tesseract?
>>
>> /urs/bin/tesseract => /usr/bin/tesseract
>>
>> --Paul
>>
>>> On Sep 30, 2014, at 1:48 PM, "kevin slote" <[email protected]> wrote:
>>>
>>> Unfortunately, that did not do it either.
>>>
>>> I did:
>>>
>>> $ export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
>>>
>>> Here is the output from printenv
>>>
>>> kslote@ubuntu:~/tika/tika$ printenv
>>> SHELL=/bin/bash
>>> USERNAME=kslote
>>> XDG_CONFIG_DIRS=/etc/xdg/xdg-gnome:/etc/xdg
>>> DESKTOP_SESSION=gnome
>>> PATH=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
>>> PWD=/home/kslote/tika/tika
>>> HOME=/home/kslote
>>> LOGNAME=kslote
>>> _=/usr/bin/printenv
>>>
>>> On Tue, Sep 30, 2014 at 4:13 PM, Tyler Palsulich <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Hmm. Could you try adding tesseract to your PATH? How did you install
>>>> Tesseract? You should be able to do a straightforward
>>>> `sudo apt-get install tesseract-ocr`. After that, the OCR tests should
>>>> pass. We're still running into TIKA-1422, where a mail test fails. But,
>>>> you can run just the OCR tests with
>>>> `mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest -DfailIfNoTests=false`.
>>>>
>>>> Let me know if that works for you!
>>>> Tyler
>>>>
>>>>> On Tue, Sep 30, 2014 at 4:00 PM, kevin slote <[email protected]> wrote:
>>>>>
>>>>> I am working on ubuntu 10.4. and I am having some trouble.
>>>>> Tesseract is installed correctly, but just doing a clone from the repo
>>>>> and installing with maven, I am getting some errors.
>>>>>
>>>>> This is before I did anything with tesseract installed.
>>>>>
>>>>> Failed tests: testPPTXOCR(org.apache.tika.parser.ocr.TesseractOCRTest):
>>>>> Check for the image's text.
>>>>> testDOCXOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>>>>> testPDFOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>>>>>
>>>>> Next I hard coded the tesseractPath:
>>>>>
>>>>> I went into TesseractOCRConfig.java and hard coded 'tesseractPath'.
>>>>> Then all tests passed and it built successfully, but then I went to post
>>>>> some TIFFs to the server.
>>>>> That didn't work. So I tried adding some System.out.println("hello world")
>>>>> (a little crude I know) inside the unit tests to confirm that tesseract
>>>>> was working correctly. It looks like something happens in the unit test
>>>>> in TesseractOCRTest.java on the line that says
>>>>> TesseractOCRConfig config = new TesseractOCRConfig();
>>>>> Printing to stdout before that line works, but I get nothing after it.
>>>>> That happens before the assumeTrue(canRun(config)); call. So an
>>>>> exception is not getting raised.
>>>>>
>>>>> Then once everything is built, ocr does not work. That was why I
>>>>> figured I would ask to see if I missed some sort of configuration step
>>>>> in building it.
>>>>>
>>>>> Thanks a ton.
>>>>>
>>>>> On Tue, Sep 30, 2014 at 2:57 PM, Mattmann, Chris A (3980) <[email protected]> wrote:
>>>>>
>>>>>> Dear Kevin,
>>>>>>
>>>>>> Sure, it already works :) 1.7-SNAPSHOT.
>>>>>>
>>>>>> See this wiki page:
>>>>>>
>>>>>> https://wiki.apache.org/tika/TikaOCR
>>>>>>
>>>>>> I'd be happy to discuss more.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Cheers,
>>>>>> Chris
>>>>>>
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> Chris Mattmann, Ph.D.
>>>>>> Chief Architect
>>>>>> Instrument Software and Science Data Systems Section (398)
>>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>> Office: 168-519, Mailstop: 168-527
>>>>>> Email: [email protected]
>>>>>> WWW: http://sunset.usc.edu/~mattmann/
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> Adjunct Associate Professor, Computer Science Department
>>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: kevin slote <[email protected]>
>>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>>> Date: Tuesday, September 30, 2014 at 8:52 AM
>>>>>> To: "[email protected]" <[email protected]>
>>>>>> Subject: OCR with tika-server
>>>>>>
>>>>>>> Hello all,
>>>>>>>
>>>>>>> I have been testing out the integration of tika with tesseract.
>>>>>>> I was wondering if there is a way to get tika-server to run with
>>>>>>> tesseract's OCR capabilities?
>>>>>>>
>>>>>>> Best
>>>>>>>
>>>>>>> Kevin Slote
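
On the PATH question raised in this thread: PATH entries are directories that the shell searches, so an entry such as /usr/bin/tesseract (a file) or /urs/bin/tesseract (a path that does not exist) contributes nothing to command lookup. The sketch below is only an illustration, not anything taken from Tika; the function names classify_path_entry and report_path are made up for this example. It labels each entry of a PATH-style string so that stray entries stand out.

```shell
# Classify one PATH entry. The shell only searches directories,
# so an entry that names a file or a missing path is never useful.
classify_path_entry() {
  local entry="$1"
  if [ -d "$entry" ]; then
    echo "dir"
  elif [ -e "$entry" ]; then
    echo "not-a-directory"
  else
    echo "missing"
  fi
}

# Print every entry of a colon-separated PATH-style string,
# one per line, prefixed with its classification.
report_path() {
  local path_string="$1" entry
  local IFS=':'
  for entry in $path_string; do
    printf '%s\t%s\n' "$(classify_path_entry "$entry")" "$entry"
  done
}

# Inspect the PATH of the current shell.
report_path "$PATH"
```

With the PATH values quoted above, /usr/bin would be reported as a directory, while the trailing tesseract entry would show up as "not-a-directory" or "missing". Since /usr/bin is already on the PATH, a system-wide tesseract should be found without that extra entry.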
