Kevin, glad it is now fixed for you! If you get a chance, please feel free to document this on the wiki:
https://wiki.apache.org/tika/TikaOCR

You can sign up for an account, and then I can grant you permissions to edit the file. Let me know!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: kevin slote <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, October 3, 2014 at 4:10 PM
To: "[email protected]" <[email protected]>
Subject: Re: OCR with tika-server

>Hi all,
>
>I just confirmed that the problem was that my version of tesseract was too old.
>Maybe it would be a good idea to put something in the canRun method at the top of the tesseract unit test to also check that the version of tesseract is recent enough?
>
>Older versions of tesseract do not have a "-v" or "--version" flag. So maybe use ProcessBuilder to run that command and parse the string to see if it returned an error?
>
>Thanks for everyone's help.
>
>On Fri, Oct 3, 2014 at 2:30 PM, kevin slote <[email protected]> wrote:
>
>> Thanks for following up!
>>
>> I was trying to dig deeper before I responded.
>>
>> Tyler,
>>
>> I followed those instructions. My version of Tesseract does not ocr the google logo because it is not a tiff. I used imagemagick to convert it to a tif and tesseract returned a "check_legal_image_size:Error:Only 1,2,4,5,6,8 bpp are supported:32" error, which usually means it needs to be re-sized with imagemagick.
>>
>> Chris,
>>
>> I wrote a python wrapper for tesseract that can parse the documents that were in your test-document repository concerning OCR (testOCR.pdf, etc.). It looks like right now, in TesseractOCRParser.java, the command line argument that is passed to the os points to a .tmp file in /tmp/.
>>
>> So the command that is executed is
>>
>> "tesseract /tmp/apache-tika-2409864150710514587.tmp /tmp/apache-tika-1277985370508249503.tmp -l eng -psm 1"
>>
>> This is not working for me. When I grab those .tmp files and try to ocr them from the command line, tesseract gets thrown for a loop.
>>
>> From what I can tell, the tesseract I have installed can only handle .tif files.
>> I can back this up by citing the tesseract page:
>> https://code.google.com/p/tesseract-ocr/wiki/ReadMe
>>
>> If Tesseract isn't available for your distribution, or you want to use a newer version than they offer, you can compile your own <https://code.google.com/p/tesseract-ocr/wiki/Compiling>. Note that older versions of Tesseract only supported processing .tiff files.
>>
>> So, I think that upgrading tesseract or moving to ubuntu 12 or higher will solve my problems.
>>
>> I will let the listserv know if that fixes it.
>>
>> Kevin Slote
>>
>> On Wed, Oct 1, 2014 at 5:13 PM, Mattmann, Chris A (3980) <[email protected]> wrote:
>>
>>> What type of image is it, Kevin?
>>>
>>> If it’s a TIFF, you need to install tesseract with special lib tiff parameters. See:
>>>
>>> https://gist.github.com/henrik/1967035
>>>
>>> Can you parse the image file with tesseract by itself, without Tika’s tmp image?
>>>
>>> -----Original Message-----
>>> From: <Ramirez>, "Paul M (398J)" <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Wednesday, October 1, 2014 at 1:47 PM
>>> To: "<[email protected]>" <[email protected]>
>>> Subject: Re: OCR with tika-server
>>>
>>> >Nothing to be embarrassed about at all Kevin. I actually thought maybe it was just a typo issue and I randomly happened to catch that. I've definitely done that one before myself.
>>> >
>>> >Bummed that was not the problem.
>>> >
>>> >--Paul
>>> >
>>> >On Oct 1, 2014, at 1:00 PM, kevin slote <[email protected]> wrote:
>>> >
>>> >> What I wrote there did have a typo in it. (It's not every day you get to embarrass yourself in front of a bunch of guys from NASA)
>>> >>
>>> >> But that was not what I had in my terminal when I tested it.
>>> >>
>>> >> The actual PATH was:
>>> >>
>>> >> "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/bin/tesseract"
>>> >>
>>> >> I think what was actually wrong with the path is that I added the entire path to the tesseract executable, which was in my /usr/bin/ directory, instead of just the directory where tesseract lives. Is this true?
>>> >>
>>> >> I deleted the hard coding from TesseractOCRConfig.java and then printed config.getTesseractPath() to stdout. This field was empty.
>>> >>
>>> >> However, I have tesseract installed system wide on this ubuntu vm.
>>> >>
>>> >> So the canRun method evaluated as true whether or not the tesseractPath was configured correctly.
>>> >>
>>> >> I have been slowly trying to debug this all day. It looks like tika is making a tmp file with the .tmp suffix.
>>> >>
>>> >> I commented out some of the code so that the files remained in /tmp/.
>>> >>
>>> >> It looks like tesseract doesn't like that.
>>> >>
>>> >> I tried to ocr these .tmp files to see if I could isolate what was going wrong for me.
>>> >>
>>> >> kslote@ubuntu:~/tika/tika$ tesseract /tmp/apache-tika-7112319184053570698.tmp out
>>> >> Tesseract Open Source OCR Engine
>>> >> name_to_image_type:Error:Unrecognized image type:/tmp/apache-tika-7112319184053570698.tmp
>>> >> IMAGE::read_header:Error:Can't read this image type:/tmp/apache-tika-7112319184053570698.tmp
>>> >> tesseract:Error:Read of file failed:/tmp/apache-tika-7112319184053570698.tmp
>>> >> Segmentation fault
>>> >>
>>> >> On the wiki it mentions something about getting tesseract to work with .tiff files. For whatever reason, the tesseract I have installed only works for .tiff files. Would it be recommended that I reinstall tesseract from source?
>>> >>
>>> >> On Tue, Sep 30, 2014 at 7:28 PM, Ramirez, Paul M (398J) <[email protected]> wrote:
>>> >>
>>> >>> Is that a typo in your path to tesseract?
>>> >>>
>>> >>> /urs/bin/tesseract => /usr/bin/tesseract
>>> >>>
>>> >>> --Paul
>>> >>>
>>> >>>> On Sep 30, 2014, at 1:48 PM, "kevin slote" <[email protected]> wrote:
>>> >>>>
>>> >>>> Unfortunately, that did not do it either.
>>> >>>>
>>> >>>> I did:
>>> >>>>
>>> >>>> $export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
>>> >>>>
>>> >>>> Here is the output from printenv
>>> >>>>
>>> >>>> kslote@ubuntu:~/tika/tika$ printenv
>>> >>>> SHELL=/bin/bash
>>> >>>> USERNAME=kslote
>>> >>>> XDG_CONFIG_DIRS=/etc/xdg/xdg-gnome:/etc/xdg
>>> >>>> DESKTOP_SESSION=gnome
>>> >>>> PATH=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
>>> >>>> PWD=/home/kslote/tika/tika
>>> >>>> HOME=/home/kslote
>>> >>>> LOGNAME=kslote
>>> >>>> _=/usr/bin/printenv
>>> >>>>
>>> >>>> On Tue, Sep 30, 2014 at 4:13 PM, Tyler Palsulich <[email protected]> wrote:
>>> >>>>
>>> >>>>> Hi,
>>> >>>>>
>>> >>>>> Hmm. Could you try adding tesseract to your PATH? How did you install Tesseract? You should be able to do a straightforward `sudo apt-get install tesseract-ocr`. After that, the OCR tests should pass. We're still running into TIKA-1422, where a mail test fails. But, you can run just the OCR tests with `mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest -DfailIfNoTests=false`.
>>> >>>>>
>>> >>>>> Let me know if that works for you!
>>> >>>>> Tyler
>>> >>>>>
>>> >>>>>> On Tue, Sep 30, 2014 at 4:00 PM, kevin slote <[email protected]> wrote:
>>> >>>>>>
>>> >>>>>> I am working on Ubuntu 10.04 and I am having some trouble.
>>> >>>>>> Tesseract is installed correctly, but just doing a clone from the repo and installing with maven, I am getting some errors.
>>> >>>>>>
>>> >>>>>> This is before I did anything with tesseract installed.
>>> >>>>>>
>>> >>>>>> Failed tests:
>>> >>>>>> testPPTXOCR(org.apache.tika.parser.ocr.TesseractOCRTest): Check for the image's text.
>>> >>>>>> testDOCXOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>>> >>>>>> testPDFOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
>>> >>>>>>
>>> >>>>>> Next I hard coded the tesseractPath:
>>> >>>>>>
>>> >>>>>> I went into TesseractOCRConfig.java and hard coded 'tesseractPath'.
>>> >>>>>> Then all tests passed and it built successfully, but then I went to post some tiffs to the server.
>>> >>>>>> That didn't work. So I tried adding some System.out.println("hello world") (a little crude I know) inside the unit tests to confirm that tesseract was working correctly. It looks like something happens in the unit test in TesseractOCRTest.java on the line that says TesseractOCRConfig config = new TesseractOCRConfig();. Printing to stdout before works, but I get nothing after. That happens before the assumeTrue(canRun(config));. So an exception is not getting raised.
>>> >>>>>>
>>> >>>>>> Then once everything is built, ocr does not work. That was why I figured I would ask to see if I missed some sort of configuration step in building it.
>>> >>>>>>
>>> >>>>>> Thanks a ton.
>>> >>>>>>
>>> >>>>>> On Tue, Sep 30, 2014 at 2:57 PM, Mattmann, Chris A (3980) <[email protected]> wrote:
>>> >>>>>>
>>> >>>>>>> Dear Kevin,
>>> >>>>>>>
>>> >>>>>>> Sure, it already works :) 1.7-SNAPSHOT.
>>> >>>>>>>
>>> >>>>>>> See this wiki page:
>>> >>>>>>>
>>> >>>>>>> https://wiki.apache.org/tika/TikaOCR
>>> >>>>>>>
>>> >>>>>>> I'd be happy to discuss more.
>>> >>>>>>>
>>> >>>>>>> Thanks!
>>> >>>>>>>
>>> >>>>>>> Cheers,
>>> >>>>>>> Chris
>>> >>>>>>>
>>> >>>>>>> -----Original Message-----
>>> >>>>>>> From: kevin slote <[email protected]>
>>> >>>>>>> Reply-To: "[email protected]" <[email protected]>
>>> >>>>>>> Date: Tuesday, September 30, 2014 at 8:52 AM
>>> >>>>>>> To: "[email protected]" <[email protected]>
>>> >>>>>>> Subject: OCR with tika-server
>>> >>>>>>>
>>> >>>>>>>> Hello all,
>>> >>>>>>>>
>>> >>>>>>>> I have been testing out the integration of tika with tesseract.
>>> >>>>>>>> I was wondering if there is a way to get tika-server to run with tesseract's OCR capabilities?
>>> >>>>>>>>
>>> >>>>>>>> Best
>>> >>>>>>>>
>>> >>>>>>>> Kevin Slote
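
For reference, the version check proposed at the top of this thread (run tesseract with "-v" through ProcessBuilder and parse what comes back) could look roughly like the sketch below. It is only an illustration, not Tika's actual code: the class name TesseractVersionCheck and the methods parseMajorMinor, isAtLeast and isUsable are hypothetical, the "tesseract <major>.<minor>" banner format is an assumption, and the minimum version is left as a parameter because the thread does not say which release is new enough. It also assumes tesseractPath is either empty or a directory ending in a path separator.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika.
final class TesseractVersionCheck {

    // Assumed banner format, e.g. "tesseract 3.02.02".
    private static final Pattern VERSION =
            Pattern.compile("tesseract\\s+(\\d+)\\.(\\d+)");

    /** Returns {major, minor} from the tool's output, or null if no banner is found. */
    public static int[] parseMajorMinor(String output) {
        Matcher m = VERSION.matcher(output);
        if (!m.find()) {
            return null; // e.g. an old build that only prints a usage/error message
        }
        return new int[] {Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2))};
    }

    /** True if version is non-null and at least minMajor.minMinor. */
    public static boolean isAtLeast(int[] version, int minMajor, int minMinor) {
        if (version == null) {
            return false;
        }
        if (version[0] != minMajor) {
            return version[0] > minMajor;
        }
        return version[1] >= minMinor;
    }

    /** Runs "tesseract -v" and reports whether the installed version looks new enough. */
    public static boolean isUsable(String tesseractPath, int minMajor, int minMinor) {
        // Assumption: an empty tesseractPath means "resolve tesseract through PATH".
        ProcessBuilder pb = new ProcessBuilder(tesseractPath + "tesseract", "-v");
        pb.redirectErrorStream(true); // some builds print the version on stderr
        StringBuilder out = new StringBuilder();
        try {
            Process p = pb.start();
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = r.readLine()) != null) {
                    out.append(line).append('\n');
                }
            }
            p.waitFor();
        } catch (IOException e) {
            return false; // executable not found or not runnable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return isAtLeast(parseMajorMinor(out.toString()), minMajor, minMinor);
    }
}
```

A test could then guard itself with something like assumeTrue(TesseractVersionCheck.isUsable(config.getTesseractPath(), 3, 0)), where the 3.0 threshold is only a placeholder. Keeping the parsing separate from the process call means the version logic can be unit tested without tesseract installed.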
