Ok, I am signed up. https://wiki.apache.org/tika/Kevin%20Slote
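On the version check suggested earlier in this thread (run tesseract with "-v" through ProcessBuilder and look at what comes back), here is a minimal sketch of how that could look. It is only an illustration: the class TesseractVersionCheck and its methods are hypothetical and not part of Tika, the banner shape "tesseract 3.02.02" is an assumption about what newer builds print, and the minimum major version is left as a parameter because the thread does not pin one down.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Tika. */
final class TesseractVersionCheck {

    // Assumed banner shape, e.g. "tesseract 3.02.02". Older builds that lack
    // the -v flag are expected to print usage or error text that does not match.
    private static final Pattern VERSION =
            Pattern.compile("tesseract\\s+v?(\\d+)\\.(\\d+)", Pattern.CASE_INSENSITIVE);

    /** Returns the major version found in the banner text, or -1 if none is found. */
    static int parseMajorVersion(String output) {
        if (output == null) {
            return -1;
        }
        Matcher m = VERSION.matcher(output);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** Runs "<executable> -v" and returns stdout and stderr combined, or "" on failure. */
    static String readVersionBanner(String executable) {
        ProcessBuilder pb = new ProcessBuilder(executable, "-v");
        // Merge stderr into stdout so the banner is captured wherever it is written.
        pb.redirectErrorStream(true);
        StringBuilder out = new StringBuilder();
        try {
            Process p = pb.start();
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = r.readLine()) != null) {
                    out.append(line).append('\n');
                }
            }
            p.waitFor();
        } catch (IOException e) {
            // Executable missing or not runnable.
            return "";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "";
        }
        return out.toString();
    }

    /** True only if a version banner was found and its major version is at least minMajor. */
    static boolean isRecentEnough(String executable, int minMajor) {
        return parseMajorVersion(readVersionBanner(executable)) >= minMajor;
    }
}
```

In the unit test this could sit next to the existing canRun check, for example assumeTrue(TesseractVersionCheck.isRecentEnough("tesseract", 3)); the threshold of 3 is an assumption on my part and would need confirming against the versions that actually work.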
On Fri, Oct 3, 2014 at 11:02 PM, Mattmann, Chris A (3980) <[email protected]> wrote:

> Kevin glad it is now fixed with you!
>
> If you get a chance, please feel free to document
> this on the wiki:
>
> https://wiki.apache.org/tika/TikaOCR
>
> You can sign up for an account, and then I can grant
> you permissions to edit the file. Let me know!
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: kevin slote <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Friday, October 3, 2014 at 4:10 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: OCR with tika-server
>
> >Hi all,
> >
> >I just confirmed that the problem was that my version of tesseract was
> >too old.
> >Maybe it would be a good idea to put something in the canRun method at
> >the top of the tesseract unit test to also check that the version of
> >tesseract is relevant?
> >
> >Older versions of tesseract do not have a "-v" or "--version" flag. So
> >maybe use ProcessBuilder to run that command and parse the string to see
> >if it returned an error?
> >
> >Thanks for everyone's help.
> >
> >On Fri, Oct 3, 2014 at 2:30 PM, kevin slote <[email protected]> wrote:
> >
> >> Thanks for following up!
> >>
> >> I was trying to dig deeper before I responded.
> >>
> >> Tyler,
> >>
> >> I followed those instructions.
> >> My version of Tesseract does not OCR the Google logo because it is not
> >> a TIFF. I used ImageMagick to convert it to a .tif, and tesseract
> >> returned the error
> >> "check_legal_image_size:Error:Only 1,2,4,5,6,8 bpp are supported:32",
> >> which usually means it needs to be re-sized with ImageMagick.
> >>
> >> Chris,
> >>
> >> I wrote a Python wrapper for tesseract that can parse the documents that
> >> were in your test-document repository concerning OCR (testOCR.pdf, etc.).
> >> It looks like right now, in TesseractOCRParser.java, the command line
> >> argument that is passed to the OS points to a .tmp file in /tmp/.
> >>
> >> So the command that is executed is
> >>
> >> "tesseract /tmp/apache-tika-2409864150710514587.tmp /tmp/apache-tika-1277985370508249503.tmp -l eng -psm 1"
> >>
> >> This is not working for me. When I grab those .tmp files and try to OCR
> >> them from the command line, tesseract gets thrown for a loop.
> >>
> >> From what I can tell, the tesseract I have installed can only handle
> >> .tif files.
> >> I can back this up by citing the tesseract page:
> >> https://code.google.com/p/tesseract-ocr/wiki/ReadMe
> >>
> >> If Tesseract isn't available for your distribution, or you want to use a
> >> newer version than they offer, you can compile your own
> >> <https://code.google.com/p/tesseract-ocr/wiki/Compiling>. Note that older
> >> versions of Tesseract only supported processing .tiff files.
> >>
> >> So, I think that upgrading tesseract or moving to Ubuntu 12 or higher
> >> will solve my problems.
> >>
> >> I will let the listserv know if that fixes it.
> >>
> >> Kevin Slote
> >>
> >> On Wed, Oct 1, 2014 at 5:13 PM, Mattmann, Chris A (3980) <[email protected]> wrote:
> >>
> >>> What type of image is it, Kevin?
> >>>
> >>> If it’s a TIFF, you need to install tesseract with special lib tiff
> >>> parameters.
> >>> See:
> >>>
> >>> https://gist.github.com/henrik/1967035
> >>>
> >>> Can you parse the image file with tesseract by itself, without
> >>> Tika’s tmp image?
> >>>
> >>> -----Original Message-----
> >>> From: <Ramirez>, "Paul M (398J)" <[email protected]>
> >>> Reply-To: "[email protected]" <[email protected]>
> >>> Date: Wednesday, October 1, 2014 at 1:47 PM
> >>> To: "<[email protected]>" <[email protected]>
> >>> Subject: Re: OCR with tika-server
> >>>
> >>> >Nothing to be embarrassed about at all, Kevin. I actually thought maybe
> >>> >it was just a typo issue and I randomly happened to catch that. I've
> >>> >definitely done that one before myself.
> >>> >
> >>> >Bummed that was not the problem.
> >>> >
> >>> >--Paul
> >>> >
> >>> >On Oct 1, 2014, at 1:00 PM, kevin slote <[email protected]> wrote:
> >>> >
> >>> >> What I wrote there did have a typo in it. (It's not every day you get
> >>> >> to embarrass yourself in front of a bunch of guys from NASA)
> >>> >>
> >>> >> But that was not what I had in my terminal when I tested it.
> >>> >>
> >>> >> The actual PATH was:
> >>> >>
> >>> >> "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/bin/tesseract"
> >>> >>
> >>> >> I think what was actually wrong with the path is that I added the
> >>> >> entire path to the tesseract executable, which was in my /usr/bin/
> >>> >> directory, instead of just the directory where tesseract lives. Is
> >>> >> this true?
> >>> >>
> >>> >> I deleted the hard coding from TesseractOCRConfig.java and then
> >>> >> printed config.getTesseractPath() to stdout. This field was empty.
> >>> >>
> >>> >> However, I have tesseract installed system wide on this Ubuntu VM.
> >>> >>
> >>> >> So the canRun method evaluated as true whether or not the
> >>> >> tesseractPath was configured correctly.
> >>> >>
> >>> >> I have been slowly trying to debug this all day. It looks like tika
> >>> >> is making a tmp file with the .tmp suffix.
> >>> >>
> >>> >> I commented out some of the code so that the files remained in /tmp/.
> >>> >>
> >>> >> It looks like tesseract doesn't like that.
> >>> >>
> >>> >> I tried to OCR these .tmp files to see if I could isolate what was
> >>> >> going wrong for me.
> >>> >>
> >>> >> kslote@ubuntu:~/tika/tika$ tesseract /tmp/apache-tika-7112319184053570698.tmp out
> >>> >> Tesseract Open Source OCR Engine
> >>> >> name_to_image_type:Error:Unrecognized image type:/tmp/apache-tika-7112319184053570698.tmp
> >>> >> IMAGE::read_header:Error:Can't read this image type:/tmp/apache-tika-7112319184053570698.tmp
> >>> >> tesseract:Error:Read of file failed:/tmp/apache-tika-7112319184053570698.tmp
> >>> >> Segmentation fault
> >>> >>
> >>> >> On the wiki it mentions something about getting tesseract to work
> >>> >> with .tiff files. For whatever reason, the tesseract I have installed
> >>> >> only works for .tiff files. Would it be recommended that I reinstall
> >>> >> tesseract from source?
> >>> >>
> >>> >> On Tue, Sep 30, 2014 at 7:28 PM, Ramirez, Paul M (398J) <[email protected]> wrote:
> >>> >>
> >>> >>> Is that a typo in your path to tesseract?
> >>> >>>
> >>> >>> /urs/bin/tesseract => /usr/bin/tesseract
> >>> >>>
> >>> >>> --Paul
> >>> >>>
> >>> >>>> On Sep 30, 2014, at 1:48 PM, "kevin slote" <[email protected]> wrote:
> >>> >>>>
> >>> >>>> Unfortunately, that did not do it either.
> >>> >>>>
> >>> >>>> I did:
> >>> >>>>
> >>> >>>> $export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
> >>> >>>>
> >>> >>>> Here is the output from printenv
> >>> >>>>
> >>> >>>> kslote@ubuntu:~/tika/tika$ printenv
> >>> >>>> SHELL=/bin/bash
> >>> >>>> USERNAME=kslote
> >>> >>>> XDG_CONFIG_DIRS=/etc/xdg/xdg-gnome:/etc/xdg
> >>> >>>> DESKTOP_SESSION=gnome
> >>> >>>> PATH=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
> >>> >>>> PWD=/home/kslote/tika/tika
> >>> >>>> HOME=/home/kslote
> >>> >>>> LOGNAME=kslote
> >>> >>>> _=/usr/bin/printenv
> >>> >>>>
> >>> >>>> On Tue, Sep 30, 2014 at 4:13 PM, Tyler Palsulich <[email protected]> wrote:
> >>> >>>>
> >>> >>>>> Hi,
> >>> >>>>>
> >>> >>>>> Hmm. Could you try adding tesseract to your PATH? How did you
> >>> >>>>> install Tesseract? You should be able to do a straightforward
> >>> >>>>> `sudo apt-get install tesseract-ocr`. After that, the OCR tests
> >>> >>>>> should pass. We're still running into TIKA-1422, where a mail test
> >>> >>>>> fails. But, you can run just the OCR tests with
> >>> >>>>> `mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest -DfailIfNoTests=false`.
> >>> >>>>>
> >>> >>>>> Let me know if that works for you!
> >>> >>>>> Tyler
> >>> >>>>>
> >>> >>>>>> On Tue, Sep 30, 2014 at 4:00 PM, kevin slote <[email protected]> wrote:
> >>> >>>>>>
> >>> >>>>>> I am working on ubuntu 10.4. and I am having some trouble.
> >>> >>>>>> Tesseract is installed correctly, but just doing a clone from the
> >>> >>>>>> repo and installing with maven, I am getting some errors.
> >>> >>>>>>
> >>> >>>>>> This is before I did anything with tesseract installed.
> >>> >>>>>>
> >>> >>>>>> Failed tests:
> >>> >>>>>> testPPTXOCR(org.apache.tika.parser.ocr.TesseractOCRTest): Check for the image's text.
> >>> >>>>>> testDOCXOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
> >>> >>>>>> testPDFOCR(org.apache.tika.parser.ocr.TesseractOCRTest)
> >>> >>>>>>
> >>> >>>>>> Next I hard coded the tesseractPath:
> >>> >>>>>>
> >>> >>>>>> I went into TesseractOCRConfig.java and hard coded
> >>> >>>>>> 'tesseractPath'. Then all tests passed and it built successfully,
> >>> >>>>>> but then I went to post some TIFFs to the server.
> >>> >>>>>> That didn't work. So I tried adding some
> >>> >>>>>> System.out.println("hello world") (a little crude, I know) inside
> >>> >>>>>> the unit tests to confirm that tesseract was working correctly.
> >>> >>>>>> It looks like something happens in the unit test in
> >>> >>>>>> TesseractOCRTest.java on the line that says
> >>> >>>>>> TesseractOCRConfig config = new TesseractOCRConfig();.
> >>> >>>>>> Printing to stdout before works, but I get nothing after. That
> >>> >>>>>> happens before the assumeTrue(canRun(config));. So an exception
> >>> >>>>>> is not getting raised.
> >>> >>>>>>
> >>> >>>>>> Then once everything is built, OCR does not work. That was why I
> >>> >>>>>> figured I would ask to see if I missed some sort of configuration
> >>> >>>>>> step in building it.
> >>> >>>>>>
> >>> >>>>>> Thanks a ton.
> >>> >>>>>>
> >>> >>>>>> On Tue, Sep 30, 2014 at 2:57 PM, Mattmann, Chris A (3980) <[email protected]> wrote:
> >>> >>>>>>
> >>> >>>>>>> Dear Kevin,
> >>> >>>>>>>
> >>> >>>>>>> Sure, it already works :) 1.7-SNAPSHOT.
> >>> >>>>>>>
> >>> >>>>>>> See this wiki page:
> >>> >>>>>>>
> >>> >>>>>>> https://wiki.apache.org/tika/TikaOCR
> >>> >>>>>>>
> >>> >>>>>>> I'd be happy to discuss more.
> >>> >>>>>>>
> >>> >>>>>>> Thanks!
> >>> >>>>>>>
> >>> >>>>>>> Cheers,
> >>> >>>>>>> Chris
> >>> >>>>>>>
> >>> >>>>>>> -----Original Message-----
> >>> >>>>>>> From: kevin slote <[email protected]>
> >>> >>>>>>> Reply-To: "[email protected]" <[email protected]>
> >>> >>>>>>> Date: Tuesday, September 30, 2014 at 8:52 AM
> >>> >>>>>>> To: "[email protected]" <[email protected]>
> >>> >>>>>>> Subject: OCR with tika-server
> >>> >>>>>>>
> >>> >>>>>>>> Hello all,
> >>> >>>>>>>>
> >>> >>>>>>>> I have been testing out the integration of tika with tesseract.
> >>> >>>>>>>> I was wondering if there is a way to get tika-server to run with
> >>> >>>>>>>> tesseract's OCR capabilities?
> >>> >>>>>>>>
> >>> >>>>>>>> Best
> >>> >>>>>>>>
> >>> >>>>>>>> Kevin Slote
