You could also look at Cuneiform OCR... I think the easiest way to integrate engines like these into Tika is to let the user specify an external script that Tika calls and that returns the recognized text.
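To make the external-script idea concrete, here is a rough sketch of what the calling side might look like. Everything in it is an assumption for illustration: `ExternalOcr` and `recognize` are hypothetical names and not part of Tika, and the convention that the script takes the image path as its last argument and prints the recognized text to stdout is just one possible contract.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not a Tika class): runs a user-configured OCR command on one image. */
final class ExternalOcr {

    /**
     * Runs the configured command with the image path appended as the last
     * argument and returns whatever the command wrote to stdout.
     * Assumed contract: exit code 0 means success, stdout is UTF-8 text.
     */
    static String recognize(List<String> command, Path image, long timeoutSeconds)
            throws IOException, InterruptedException {
        List<String> argv = new ArrayList<>(command);
        argv.add(image.toString());

        // Send stdout to a temp file so a chatty process cannot block on a full pipe.
        Path out = Files.createTempFile("ocr-", ".txt");
        try {
            Process process = new ProcessBuilder(argv)
                    .redirectOutput(out.toFile())
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start();

            // Give up on engines that hang instead of stalling the caller forever.
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("OCR command timed out: " + argv);
            }
            if (process.exitValue() != 0) {
                throw new IOException("OCR command failed with exit code " + process.exitValue());
            }
            return new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(out);
        }
    }
}
```

With something like this, the user would only have to supply a small wrapper script for Tesseract, Cuneiform, or any other engine, and the Java side would never need to know which engine is behind it. Any preprocessing (deskewing, filtering) could live in the same script.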
On Wed, Nov 30, 2011 at 10:48 PM, Albert Law (Logik) <[email protected]> wrote:
> Hi Chris,
>
> I agree with Oleg. Tesseract is free but requires training to get any
> respectable OCR output. Lastly, I found that Tesseract had memory
> leaks (circa Sept. 2010).
>
> Aside: I noticed Tesseract doesn't have pre-compiled builds nor a Java API.
>
> On Wed, Nov 30, 2011 at 9:51 AM, Mattmann, Chris A (388J)
> <[email protected]> wrote:
>> Hi Oleg,
>>
>> Thanks for the FYI, Oleg and the heads up on what needs to improve
>> here.
>>
>> Cheers,
>> Chris
>>
>> On Nov 29, 2011, at 11:10 PM, Oleg Tikhonov wrote:
>>
>>> Hi Chris,
>>> I was playing with it recently.
>>> One of the big issues with tesseract is a tough process of the preparing
>>> training set for multiple fonts and languages.
>>> In addition, we also have to add an option for image preprocessing (skewing
>>> + filtering etc).
>>>
>>>
>>> BR,
>>> Oleg
>>>
>>> On Wed, Nov 30, 2011 at 8:59 AM, Mattmann, Chris A (388J) <
>>> [email protected]> wrote:
>>>
>>>> Hey Guys,
>>>>
>>>> FYI: http://code.google.com/p/tesseract-ocr/
>>>>
>>>> I was pointed at this library by someone recently asking me if Tika
>>>> was interested in integrating with this library. It's ALv2 licensed, and
>>>> seems pretty interesting. I'm going to check it out, but just
>>>> wanted to give everyone a heads up.
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Chris Mattmann, Ph.D.
>>>> Senior Computer Scientist
>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>> Office: 171-266B, Mailstop: 171-246
>>>> Email: [email protected]
>>>> WWW: http://sunset.usc.edu/~mattmann/
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Adjunct Assistant Professor, Computer Science Department
>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> --
>
> Sincerely,
> Albert Law
> Senior Software Engineer
> Logik.com

--
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
Skype: alex.ott
