What type of image is it, Kevin? If it’s a TIFF, you need to install tesseract with special libtiff parameters. See:
https://gist.github.com/henrik/1967035

Can you parse the image file with tesseract by itself, without Tika’s tmp image?

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: <Ramirez>, "Paul M (398J)" <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, October 1, 2014 at 1:47 PM
To: "<[email protected]>" <[email protected]>
Subject: Re: OCR with tika-server

>Nothing to be embarrassed about at all Kevin. I actually thought maybe it
>was just a typo issue and I randomly happened to catch that. I've
>definitely done that one before myself.
>
>Bummed that was not the problem.
>
>--Paul
>
>On Oct 1, 2014, at 1:00 PM, kevin slote <[email protected]> wrote:
>
>> What I wrote there did have a typo in it. (It's not every day you get to
>> embarrass yourself in front of a bunch of guys from NASA)
>>
>> But that was not what I had in my terminal when I tested it.
>>
>> The actual PATH was:
>>
>> "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/bin/tesseract"
>>
>> I think what was actually wrong with the path is that I added the entire
>> path to the tesseract executable, which was in my /usr/bin/ directory,
>> instead of just the directory where tesseract lives. Is this true?
>>
>> I deleted the hard coding from the TesseractOCRConfig.java and then printed
>> config.getTesseractPath() to stdout. This field was empty.
>>
>> However, I have tesseract installed system wide on this ubuntu vm.
>>
>> So the canRun method evaluated as true whether or not the tesseractPath was
>> configured correctly.
>>
>> I have been slowly trying to debug this all day. It looks like tika is
>> making a tmp file with the .tmp suffix.
>>
>> I commented out some of the code so that the files remained in /tmp/.
>>
>> It looks like tesseract doesn't like that.
>>
>> I tried to ocr these .tmp files to see if I could isolate what was going
>> wrong for me.
>>
>> kslote@ubuntu:~/tika/tika$ tesseract /tmp/apache-tika-7112319184053570698.tmp out
>> Tesseract Open Source OCR Engine
>> name_to_image_type:Error:Unrecognized image type:/tmp/apache-tika-7112319184053570698.tmp
>> IMAGE::read_header:Error:Can't read this image type:/tmp/apache-tika-7112319184053570698.tmp
>> tesseract:Error:Read of file failed:/tmp/apache-tika-7112319184053570698.tmp
>> Segmentation fault
>>
>> On the wiki it mentions something about getting tesseract to work with
>> .tiff files. For whatever reason, the tesseract I have installed only
>> works for .tiff files. Would it be recommended that I reinstall tesseract
>> from source?
>>
>> On Tue, Sep 30, 2014 at 7:28 PM, Ramirez, Paul M (398J) <[email protected]> wrote:
>>
>>> Is that a typo in your path to tesseract?
>>>
>>> /urs/bin/tesseract => /usr/bin/tesseract
>>>
>>> --Paul
>>>
>>>> On Sep 30, 2014, at 1:48 PM, "kevin slote" <[email protected]> wrote:
>>>>
>>>> Unfortunately, that did not do it either.
>>>>
>>>> I did:
>>>>
>>>> $ export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
>>>>
>>>> Here is the output from printenv
>>>>
>>>> kslote@ubuntu:~/tika/tika$ printenv
>>>> SHELL=/bin/bash
>>>> USERNAME=kslote
>>>> XDG_CONFIG_DIRS=/etc/xdg/xdg-gnome:/etc/xdg
>>>> DESKTOP_SESSION=gnome
>>>> PATH=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
>>>> PWD=/home/kslote/tika/tika
>>>> HOME=/home/kslote
>>>> LOGNAME=kslote
>>>> _=/usr/bin/printenv
>>>>
>>>> On Tue, Sep 30, 2014 at 4:13 PM, Tyler Palsulich <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Hmm. Could you try adding tesseract to your PATH? How did you install
>>>>> Tesseract? You should be able to do a straightforward
>>>>> `sudo apt-get install tesseract-ocr`. After that, the OCR tests should
>>>>> pass. We're still running into TIKA-1422, where a mail test fails. But,
>>>>> you can run just the OCR tests with
>>>>> `mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest -DfailIfNoTests=false`.
>>>>>
>>>>> Let me know if that works for you!
>>>>> Tyler
>>>>>
>>>>>> On Tue, Sep 30, 2014 at 4:00 PM, kevin slote <[email protected]> wrote:
>>>>>>
>>>>>> I am working on ubuntu 10.4. and I am having some trouble.
>>>>>> Tesseract is installed correctly, but just doing a clone from the repo and
>>>>>> installing with maven, I am getting some errors.
>>>>>>
>>>>>> This is before I did anything with tesseract installed.
>>>>>>
>>>>>> Failed tests: testPPTXOCR(org.apache.tika.parser.ocr.TesseractOCRTest):
>>>>>> Check for the image's text.
>>>>>> testDOCXOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>>>>>> testPDFOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>>>>>>
>>>>>> Next I hard coded the tesseractPath:
>>>>>>
>>>>>> I went into the TesseractOCRConfig.java and hard coded 'tesseractPath.'
>>>>>> Then all tests passed and it built successfully, but then I went to post
>>>>>> some TIFFs to the server.
>>>>>> That didn't work. So I tried adding some System.out.println("hello world")
>>>>>> (a little crude I know) inside the unit tests to confirm that tesseract
>>>>>> was working correctly. It looks like something happens in the unit test in
>>>>>> TesseractOCRTest.java
>>>>>> on the line that says TesseractOCRConfig config = new
>>>>>> TesseractOCRConfig();. Printing to stdout before works, but I get nothing
>>>>>> after. That happens before the assumeTrue(canRun(config));. So an exception
>>>>>> is not being raised.
>>>>>>
>>>>>> Then once everything is built, ocr does not work. That was why I figured I
>>>>>> would ask to see if I missed some sort of configuration step in building
>>>>>> it.
>>>>>>
>>>>>> Thanks a ton.
>>>>>>
>>>>>> On Tue, Sep 30, 2014 at 2:57 PM, Mattmann, Chris A (3980) <[email protected]> wrote:
>>>>>>
>>>>>>> Dear Kevin,
>>>>>>>
>>>>>>> Sure, it already works :) 1.7-SNAPSHOT.
>>>>>>>
>>>>>>> See this wiki page:
>>>>>>>
>>>>>>> https://wiki.apache.org/tika/TikaOCR
>>>>>>>
>>>>>>> I'd be happy to discuss more.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Chris
>>>>>>>
>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>> Chris Mattmann, Ph.D.
>>>>>>> Chief Architect
>>>>>>> Instrument Software and Science Data Systems Section (398)
>>>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>>> Office: 168-519, Mailstop: 168-527
>>>>>>> Email: [email protected]
>>>>>>> WWW: http://sunset.usc.edu/~mattmann/
>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>> Adjunct Associate Professor, Computer Science Department
>>>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: kevin slote <[email protected]>
>>>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>>>> Date: Tuesday, September 30, 2014 at 8:52 AM
>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>> Subject: OCR with tika-server
>>>>>>>
>>>>>>>> Hello all,
>>>>>>>>
>>>>>>>> I have been testing out the integration of tika with tesseract.
>>>>>>>> I was wondering if there is a way to get tika-server to run with
>>>>>>>> tesseract's OCR capabilities?
>>>>>>>>
>>>>>>>> Best
>>>>>>>>
>>>>>>>> Kevin Slote
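
On the PATH question raised in the quoted thread: each PATH entry is a directory that the shell searches for executables, so an entry that names the executable itself (such as /usr/bin/tesseract) contributes nothing to the lookup. Since /usr/bin was already on the PATH, tesseract would still have been found. The following is only a rough sketch of how such entries could be spotted. It assumes a POSIX shell, and the function name `non_dir_path_entries` is made up for illustration; it is not part of Tika or Tesseract.

```shell
# Print every entry of a PATH-style string that is not an existing directory.
# Hypothetical helper for illustration only.
non_dir_path_entries() (
    # The function body runs in a subshell, so the IFS and globbing
    # changes below do not leak into the calling shell.
    IFS=':'
    set -f
    for entry in $1; do
        # Skip empty entries (for example, produced by a doubled colon).
        [ -n "$entry" ] || continue
        # Report anything that is not a directory: files, typos, missing paths.
        [ -d "$entry" ] || printf '%s\n' "$entry"
    done
)

# Check the PATH of the current shell.
non_dir_path_entries "$PATH"
```

Run against a PATH like the one in the thread, this should list /usr/bin/tesseract (a file, not a directory) on a machine where tesseract is installed there, along with any entries that do not exist at all, such as the mistyped /urs/bin/tesseract.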
