Hi Kevin just checking back - did you get it working?

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: <Mattmann>, Chris Mattmann <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, October 1, 2014 at 2:13 PM
To: "[email protected]" <[email protected]>
Subject: Re: OCR with tika-server

>What type of image is it, Kevin?
>
>If it’s a TIFF, you need to install tesseract with special libtiff
>parameters. See:
>
>https://gist.github.com/henrik/1967035
>
>Can you parse the image file with tesseract by itself, without
>Tika’s tmp image?
>
>-----Original Message-----
>From: <Ramirez>, "Paul M (398J)" <[email protected]>
>Reply-To: "[email protected]" <[email protected]>
>Date: Wednesday, October 1, 2014 at 1:47 PM
>To: "<[email protected]>" <[email protected]>
>Subject: Re: OCR with tika-server
>
>>Nothing to be embarrassed about at all, Kevin. I actually thought maybe it
>>was just a typo issue and I randomly happened to catch that. I've
>>definitely done that one before myself.
>>
>>Bummed that was not the problem.
>>
>>--Paul
>>
>>On Oct 1, 2014, at 1:00 PM, kevin slote <[email protected]> wrote:
>>
>>> What I wrote there did have a typo in it. (It's not every day you get to
>>> embarrass yourself in front of a bunch of guys from NASA)
>>>
>>> But that was not what I had in my terminal when I tested it.
>>>
>>> The actual PATH was:
>>>
>>> "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/bin/tesseract"
>>>
>>> I think what was actually wrong with the path is that I added the entire
>>> path to the tesseract executable, which was in my /usr/bin/ directory,
>>> instead of just the directory where tesseract lives. Is this true?
>>>
>>> I deleted the hard coding from TesseractOCRConfig.java and then printed
>>> config.getTesseractPath() to stdout. This field was empty.
>>>
>>> However, I have tesseract installed system-wide on this Ubuntu VM.
>>>
>>> So the canRun method evaluated as true whether or not the tesseractPath was
>>> configured correctly.
>>>
>>> I have been slowly trying to debug this all day. It looks like tika is
>>> making a tmp file with the .tmp suffix.
>>>
>>> I commented out some of the code so that the files remained in /tmp/.
>>>
>>> It looks like tesseract doesn't like that.
>>>
>>> I tried to OCR these .tmp files to see if I could isolate what was going
>>> wrong for me.
>>>
>>> kslote@ubuntu:~/tika/tika$ tesseract /tmp/apache-tika-7112319184053570698.tmp out
>>> Tesseract Open Source OCR Engine
>>> name_to_image_type:Error:Unrecognized image type:/tmp/apache-tika-7112319184053570698.tmp
>>> IMAGE::read_header:Error:Can't read this image type:/tmp/apache-tika-7112319184053570698.tmp
>>> tesseract:Error:Read of file failed:/tmp/apache-tika-7112319184053570698.tmp
>>> Segmentation fault
>>>
>>> On the wiki it mentions something about getting tesseract to work with
>>> .tiff files. For whatever reason, the tesseract I have installed only
>>> works for .tiff files. Would it be recommended that I reinstall tesseract
>>> from source?
>>>
>>> On Tue, Sep 30, 2014 at 7:28 PM, Ramirez, Paul M (398J) <[email protected]> wrote:
>>>
>>>> Is that a typo in your path to tesseract?
>>>>
>>>> /urs/bin/tesseract => /usr/bin/tesseract
>>>>
>>>> --Paul
>>>>
>>>>> On Sep 30, 2014, at 1:48 PM, "kevin slote" <[email protected]> wrote:
>>>>>
>>>>> Unfortunately, that did not do it either.
>>>>>
>>>>> I did:
>>>>>
>>>>> $ export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
>>>>>
>>>>> Here is the output from printenv:
>>>>>
>>>>> kslote@ubuntu:~/tika/tika$ printenv
>>>>> SHELL=/bin/bash
>>>>> USERNAME=kslote
>>>>> XDG_CONFIG_DIRS=/etc/xdg/xdg-gnome:/etc/xdg
>>>>> DESKTOP_SESSION=gnome
>>>>> PATH=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
>>>>> PWD=/home/kslote/tika/tika
>>>>> HOME=/home/kslote
>>>>> LOGNAME=kslote
>>>>> _=/usr/bin/printenv
>>>>>
>>>>> On Tue, Sep 30, 2014 at 4:13 PM, Tyler Palsulich <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Hmm. Could you try adding tesseract to your PATH? How did you install
>>>>>> Tesseract? You should be able to do a straightforward
>>>>>> `sudo apt-get install tesseract-ocr`. After that, the OCR tests should
>>>>>> pass. We're still running into TIKA-1422, where a mail test fails. But,
>>>>>> you can run just the OCR tests with
>>>>>> `mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest -DfailIfNoTests=false`.
>>>>>>
>>>>>> Let me know if that works for you!
>>>>>> Tyler
>>>>>>
>>>>>>> On Tue, Sep 30, 2014 at 4:00 PM, kevin slote <[email protected]> wrote:
>>>>>>>
>>>>>>> I am working on Ubuntu 10.4 and I am having some trouble.
>>>>>>> Tesseract is installed correctly, but just doing a clone from the repo and
>>>>>>> installing with maven, I am getting some errors.
>>>>>>>
>>>>>>> This is before I did anything with tesseract installed.
>>>>>>>
>>>>>>> Failed tests: testPPTXOCR(org.apache.tika.parser.ocr.TesseractOCRTest):
>>>>>>> Check for the image's text.
>>>>>>> testDOCXOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>>>>>>> testPDFOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>>>>>>>
>>>>>>> Next I hard coded the tesseractPath:
>>>>>>>
>>>>>>> I went into TesseractOCRConfig.java and hard coded 'tesseractPath'.
>>>>>>> Then all tests passed and it built successfully, but then I went to post
>>>>>>> some TIFFs to the server.
>>>>>>> That didn't work. So I tried adding some System.out.println("hello world")
>>>>>>> (a little crude, I know) inside the unit tests to confirm that tesseract
>>>>>>> was working correctly. It looks like something happens in the unit test in
>>>>>>> TesseractOCRTest.java
>>>>>>> on the line that says TesseractOCRConfig config = new
>>>>>>> TesseractOCRConfig();. Printing to stdout before works, but I get nothing
>>>>>>> after. That happens before the assumeTrue(canRun(config));. So an exception
>>>>>>> is not getting raised.
>>>>>>>
>>>>>>> Then once everything is built, OCR does not work. That was why I figured I
>>>>>>> would ask to see if I missed some sort of configuration step in building
>>>>>>> it.
>>>>>>>
>>>>>>> Thanks a ton.
>>>>>>>
>>>>>>> On Tue, Sep 30, 2014 at 2:57 PM, Mattmann, Chris A (3980) <[email protected]> wrote:
>>>>>>>
>>>>>>>> Dear Kevin,
>>>>>>>>
>>>>>>>> Sure, it already works :) 1.7-SNAPSHOT.
>>>>>>>>
>>>>>>>> See this wiki page:
>>>>>>>>
>>>>>>>> https://wiki.apache.org/tika/TikaOCR
>>>>>>>>
>>>>>>>> I'd be happy to discuss more.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Chris
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: kevin slote <[email protected]>
>>>>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>>>>> Date: Tuesday, September 30, 2014 at 8:52 AM
>>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>>> Subject: OCR with tika-server
>>>>>>>>
>>>>>>>>> Hello all,
>>>>>>>>>
>>>>>>>>> I have been testing out the integration of tika with tesseract.
>>>>>>>>> I was wondering if there is a way to get tika-server to run with
>>>>>>>>> tesseract's OCR capabilities?
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>>
>>>>>>>>> Kevin Slote
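
On the PATH question raised above: PATH is a colon-separated list of directories that the shell searches for executables, so an entry such as /usr/bin/tesseract (a file, not a directory) is never searched. Since /usr/bin was already listed, the extra entry was unnecessary in any case. A minimal sketch for spotting such entries, assuming a POSIX-style shell (sh or bash); the helper name check_path_entries is hypothetical and is not part of Tika or tesseract:

```shell
# Hypothetical helper: print every entry of a colon-separated,
# PATH-style list that is not an existing directory.
check_path_entries() {
    # $1: the list to check, e.g. "$PATH"
    old_ifs=$IFS
    IFS=:
    for entry in $1; do
        if [ ! -d "$entry" ]; then
            echo "not a directory: $entry"
        fi
    done
    IFS=$old_ifs
}

# Check the current PATH.
check_path_entries "$PATH"

# Show which tesseract binary, if any, the shell actually resolves.
command -v tesseract || echo "tesseract not found on PATH"
```

With the PATH quoted in the thread, the helper would be expected to flag the /usr/bin/tesseract entry (and the misspelled /urs/bin/tesseract one), while `command -v tesseract` would still resolve the binary through /usr/bin.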
