Hi Chris,

I agree with Oleg. Tesseract is free, but it requires training to produce any respectable OCR output. I also found that Tesseract had memory leaks (circa Sept. 2010).
Aside: I noticed that Tesseract has neither pre-compiled builds nor a Java API.

On Wed, Nov 30, 2011 at 9:51 AM, Mattmann, Chris A (388J) <[email protected]> wrote:

> Hi Oleg,
>
> Thanks for the FYI, Oleg and the heads up on what needs to improve
> here.
>
> Cheers,
> Chris
>
> On Nov 29, 2011, at 11:10 PM, Oleg Tikhonov wrote:
>
>> Hi Chris,
>> I was playing with it recently.
>> One of the big issues with tesseract is a tough process of the preparing
>> training set for multiple fonts and languages.
>> In addition, we also have to add an option for image preprocessing (skewing
>> + filtering etc).
>>
>>
>> BR,
>> Oleg
>>
>> On Wed, Nov 30, 2011 at 8:59 AM, Mattmann, Chris A (388J) <
>> [email protected]> wrote:
>>
>>> Hey Guys,
>>>
>>> FYI: http://code.google.com/p/tesseract-ocr/
>>>
>>> I was pointed at this library by someone recently asking me if Tika
>>> was interested in integrating with this library. It's ALv2 licensed, and
>>> seems pretty interesting. I'm going to check it out, but just
>>> wanted to give everyone a heads up.
>>>
>>> Cheers,
>>> Chris
>>>
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Senior Computer Scientist
>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 171-266B, Mailstop: 171-246
>>> Email: [email protected]
>>> WWW: http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Assistant Professor, Computer Science Department
>>> University of Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>
>>>
>
>

--
Sincerely,
Albert Law
Senior Software Engineer
Logik.com
