Thanks for following up! I was trying to dig deeper before I responded.
Tyler, I followed those instructions. My version of Tesseract does not OCR the Google logo because it is not a TIFF. I used ImageMagick to convert it to a .tif, and tesseract returned this error:

check_legal_image_size:Error:Only 1,2,4,5,6,8 bpp are supported:32

That usually means the image needs to be re-processed with ImageMagick (here, brought down from 32 bpp to a supported bit depth).

Chris, I wrote a Python wrapper for tesseract that can parse the OCR documents in your test-document repository (testOCR.pdf, etc.).

It looks like right now, in TesseractOCRParser.java, the command-line argument that is passed to the OS points to a .tmp file in /tmp/. So the command that is executed is:

tesseract /tmp/apache-tika-2409864150710514587.tmp /tmp/apache-tika-1277985370508249503.tmp -l eng -psm 1

This is not working for me. When I grab those .tmp files and try to OCR them from the command line, tesseract gets thrown for a loop.

From what I can tell, the tesseract I have installed can only handle .tif files. I can back this up by citing the tesseract page:

https://code.google.com/p/tesseract-ocr/wiki/ReadMe

    If Tesseract isn't available for your distribution, or you want to use a
    newer version than they offer, you can compile your own
    <https://code.google.com/p/tesseract-ocr/wiki/Compiling>. Note that older
    versions of Tesseract only supported processing .tiff files.

So, I think that upgrading tesseract or moving to Ubuntu 12 or higher will solve my problems. I will let the listserv know if that fixes it.

Kevin Slote

On Wed, Oct 1, 2014 at 5:13 PM, Mattmann, Chris A (3980) <[email protected]> wrote:

> What type of image is it, Kevin?
>
> If it’s a TIFF, you need to install tesseract with special lib tiff
> parameters. See:
>
> https://gist.github.com/henrik/1967035
>
> Can you parse the image file with tesseract by itself, without
> Tika’s tmp image?
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: <Ramirez>, "Paul M (398J)" <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Wednesday, October 1, 2014 at 1:47 PM
> To: "<[email protected]>" <[email protected]>
> Subject: Re: OCR with tika-server
>
> >Nothing to be embarrassed about at all Kevin. I actually thought maybe it
> >was just a typo issue and I randomly happened to catch that. I've
> >definitely done that one before myself.
> >
> >Bummed that was not the problem.
> >
> >--Paul
> >
> >On Oct 1, 2014, at 1:00 PM, kevin slote <[email protected]> wrote:
> >
> >> What I wrote there did have a typo in it. (It's not every day you get to
> >> embarrass yourself in front of a bunch of guys from NASA)
> >>
> >> But that was not what I had in my terminal when I tested it.
> >>
> >> The actual PATH was:
> >>
> >> "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/bin/tesseract"
> >>
> >> I think what was actually wrong with the path is that I added the entire
> >> path to the tesseract executable, which was in my /usr/bin/ directory,
> >> instead of just the directory where tesseract lives. Is this true?
> >>
> >> I deleted the hard coding from TesseractOCRConfig.java and then printed
> >> config.getTesseractPath() to stdout. This field was empty.
> >>
> >> However, I have tesseract installed system wide on this ubuntu vm.
> >>
> >> So the canRun method evaluated as true whether or not the tesseractPath
> >> was configured correctly.
> >>
> >> I have been slowly trying to debug this all day. It looks like tika is
> >> making a tmp file with the .tmp suffix.
> >>
> >> I commented out some of the code so that the files remained in /tmp/.
> >>
> >> It looks like tesseract doesn't like that.
> >>
> >> I tried to ocr these .tmp files to see if I could isolate what was going
> >> wrong for me.
> >>
> >> kslote@ubuntu:~/tika/tika$ tesseract /tmp/apache-tika-7112319184053570698.tmp out
> >> Tesseract Open Source OCR Engine
> >> name_to_image_type:Error:Unrecognized image type:/tmp/apache-tika-7112319184053570698.tmp
> >> IMAGE::read_header:Error:Can't read this image type:/tmp/apache-tika-7112319184053570698.tmp
> >> tesseract:Error:Read of file failed:/tmp/apache-tika-7112319184053570698.tmp
> >> Segmentation fault
> >>
> >> On the wiki it mentions something about getting tesseract to work with
> >> .tiff files. For whatever reason, the tesseract I have installed only
> >> works for .tiff files. Would it be recommended that I reinstall
> >> tesseract from the source?
> >>
> >> On Tue, Sep 30, 2014 at 7:28 PM, Ramirez, Paul M (398J) <[email protected]> wrote:
> >>
> >>> Is that a typo in your path to tesseract?
> >>>
> >>> /urs/bin/tesseract => /usr/bin/tesseract
> >>>
> >>> --Paul
> >>>
> >>>> On Sep 30, 2014, at 1:48 PM, "kevin slote" <[email protected]> wrote:
> >>>>
> >>>> Unfortunately, that did not do it either.
> >>>>
> >>>> I did:
> >>>>
> >>>> $ export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
> >>>>
> >>>> Here is the output from printenv
> >>>>
> >>>> kslote@ubuntu:~/tika/tika$ printenv
> >>>> SHELL=/bin/bash
> >>>> USERNAME=kslote
> >>>> XDG_CONFIG_DIRS=/etc/xdg/xdg-gnome:/etc/xdg
> >>>> DESKTOP_SESSION=gnome
> >>>> PATH=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
> >>>> PWD=/home/kslote/tika/tika
> >>>> HOME=/home/kslote
> >>>> LOGNAME=kslote
> >>>> _=/usr/bin/printenv
> >>>>
> >>>> On Tue, Sep 30, 2014 at 4:13 PM, Tyler Palsulich <[email protected]> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> Hmm. Could you try adding tesseract to your PATH? How did you install
> >>>>> Tesseract? You should be able to do a straightforward
> >>>>> `sudo apt-get install tesseract-ocr`. After that, the OCR tests should
> >>>>> pass. We're still running into TIKA-1422, where a mail test fails. But
> >>>>> you can run just the OCR tests with
> >>>>> `mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest -DfailIfNoTests=false`.
> >>>>>
> >>>>> Let me know if that works for you!
> >>>>> Tyler
> >>>>>
> >>>>>> On Tue, Sep 30, 2014 at 4:00 PM, kevin slote <[email protected]> wrote:
> >>>>>>
> >>>>>> I am working on Ubuntu 10.04 and I am having some trouble.
> >>>>>> Tesseract is installed correctly, but just doing a clone from the
> >>>>>> repo and installing with maven, I am getting some errors.
> >>>>>>
> >>>>>> This is before I did anything with tesseract installed.
> >>>>>>
> >>>>>> Failed tests:
> >>>>>> testPPTXOCR(org.apache.tika.parser.ocr.TesseractOCRTest):
> >>>>>> Check for the image's text.
> >>>>>> testDOCXOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
> >>>>>> testPDFOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
> >>>>>>
> >>>>>> Next I hard coded the tesseractPath:
> >>>>>>
> >>>>>> I went into TesseractOCRConfig.java and hard coded 'tesseractPath.'
> >>>>>> Then all tests passed and it built successfully, but then I went to
> >>>>>> post some TIFFs to the server.
> >>>>>> That didn't work. So I tried adding some System.out.println("hello
> >>>>>> world") (a little crude I know) inside the unit tests to confirm that
> >>>>>> tesseract was working correctly. It looks like something happens in
> >>>>>> the unit test in TesseractOCRTest.java on the line that says
> >>>>>> TesseractOCRConfig config = new TesseractOCRConfig();. Printing to
> >>>>>> stdout before works, but I get nothing after. That happens before the
> >>>>>> assumeTrue(canRun(config));. So an exception is not getting raised.
> >>>>>>
> >>>>>> Then once everything is built, ocr does not work. That was why I
> >>>>>> figured I would ask to see if I missed some sort of configuration
> >>>>>> step in building it.
> >>>>>>
> >>>>>> Thanks a ton.
> >>>>>>
> >>>>>> On Tue, Sep 30, 2014 at 2:57 PM, Mattmann, Chris A (3980) <[email protected]> wrote:
> >>>>>>
> >>>>>>> Dear Kevin,
> >>>>>>>
> >>>>>>> Sure, it already works :) 1.7-SNAPSHOT.
> >>>>>>>
> >>>>>>> See this wiki page:
> >>>>>>>
> >>>>>>> https://wiki.apache.org/tika/TikaOCR
> >>>>>>>
> >>>>>>> I'd be happy to discuss more.
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Chris
> >>>>>>>
> >>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>> Chris Mattmann, Ph.D.
> >>>>>>> Chief Architect
> >>>>>>> Instrument Software and Science Data Systems Section (398)
> >>>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >>>>>>> Office: 168-519, Mailstop: 168-527
> >>>>>>> Email: [email protected]
> >>>>>>> WWW: http://sunset.usc.edu/~mattmann/
> >>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>> Adjunct Associate Professor, Computer Science Department
> >>>>>>> University of Southern California, Los Angeles, CA 90089 USA
> >>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: kevin slote <[email protected]>
> >>>>>>> Reply-To: "[email protected]" <[email protected]>
> >>>>>>> Date: Tuesday, September 30, 2014 at 8:52 AM
> >>>>>>> To: "[email protected]" <[email protected]>
> >>>>>>> Subject: OCR with tika-server
> >>>>>>>
> >>>>>>>> Hello all,
> >>>>>>>>
> >>>>>>>> I have been testing out the integration of tika with tesseract.
> >>>>>>>> I was wondering if there is a way to get tika-server to run with
> >>>>>>>> tesseract's OCR capabilities?
> >>>>>>>>
> >>>>>>>> Best
> >>>>>>>>
> >>>>>>>> Kevin Slote
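
A note on the .tmp problem described at the top of this message: the name_to_image_type error hints that this older Tesseract build may decide the image type from the file name, so a file ending in .tmp would be rejected whatever it contains. That is an assumption, not something confirmed in this thread. Under that assumption, a minimal Python sketch of a workaround is to sniff the file's leading bytes and copy it to a name with a matching extension before calling tesseract. The helper names (sniff_extension, copy_with_real_extension, tesseract_command) are hypothetical and are not part of Tika or Tesseract; the tesseract arguments are simply the ones quoted in the command above.

```python
import shutil
import tempfile
from pathlib import Path

# Leading "magic" bytes for a few common formats (illustrative, not exhaustive).
MAGIC_TO_EXT = [
    (b"II*\x00", ".tif"),            # little-endian TIFF
    (b"MM\x00*", ".tif"),            # big-endian TIFF
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"GIF8", ".gif"),
    (b"%PDF", ".pdf"),
]


def sniff_extension(path):
    """Guess a file extension from the first bytes of the file, or return None."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    for magic, ext in MAGIC_TO_EXT:
        if head.startswith(magic):
            return ext
    return None


def copy_with_real_extension(tmp_path, dest_dir=None):
    """Copy e.g. apache-tika-123.tmp to apache-tika-123.tif in dest_dir."""
    ext = sniff_extension(tmp_path)
    if ext is None:
        raise ValueError("unrecognised file type: %s" % tmp_path)
    dest = Path(dest_dir) if dest_dir else Path(tempfile.mkdtemp())
    target = dest / (Path(tmp_path).stem + ext)
    shutil.copyfile(tmp_path, target)
    return target


def tesseract_command(image_path, out_base):
    """Build (but do not run) the command line, using the flags quoted in the thread."""
    return ["tesseract", str(image_path), str(out_base), "-l", "eng", "-psm", "1"]
```

If copy_with_real_extension reports .tif but tesseract still refuses the copy, the extension was probably not the problem, and the bit-depth error or the libtiff build Chris mentioned would be the next things to check. If it reports .png or .jpg, that would fit the README's note that older versions only process .tiff files.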
