You could also look at Cuneiform OCR... I think the easiest way to
integrate them into Tika is to let the user specify an external script
that Tika calls and that returns the recognized text.
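
To make the idea concrete, here is a rough Java sketch of the calling side.
It is not Tika code: the class name ExternalOcr, the method recognize and the
contract "script receives the image path as its last argument and prints the
recognized text to stdout as UTF-8" are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: runs a user-configured OCR command
// and returns whatever the command writes to standard output.
final class ExternalOcr {

    private ExternalOcr() {
    }

    // command: the user-specified script plus any fixed arguments.
    // image:   the file to recognize; its absolute path is appended as the
    //          last argument (an assumed contract, see the lead-in).
    static String recognize(List<String> command, Path image)
            throws IOException, InterruptedException {
        List<String> argv = new ArrayList<>(command);
        argv.add(image.toAbsolutePath().toString());

        // Discard stderr so a chatty script cannot block on a full pipe.
        Process process = new ProcessBuilder(argv)
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();

        // The script is not expected to read from stdin.
        process.getOutputStream().close();

        // Read stdout until the script closes it, then collect the exit code.
        byte[] output = process.getInputStream().readAllBytes();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("OCR command exited with code " + exitCode);
        }
        return new String(output, StandardCharsets.UTF_8);
    }
}
```

With something like this, any engine (Cuneiform, Tesseract, ...) could be
plugged in through a small wrapper script that prints text to stdout. A real
integration would also want a timeout, which this sketch leaves out.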

On Wed, Nov 30, 2011 at 10:48 PM, Albert Law (Logik) <[email protected]> wrote:
> Hi Chris,
>
> I agree with Oleg.  Tesseract is free but requires training to get any
> respectable OCR output.  Lastly, I found that Tesseract had memory
> leaks (circa Sept. 2010).
>
> Aside: I noticed Tesseract has neither pre-compiled builds nor a Java API.
>
> On Wed, Nov 30, 2011 at 9:51 AM, Mattmann, Chris A (388J)
> <[email protected]> wrote:
>> Hi Oleg,
>>
>> Thanks for the FYI, Oleg, and for the heads-up on what needs to improve
>> here.
>>
>> Cheers,
>> Chris
>>
>> On Nov 29, 2011, at 11:10 PM, Oleg Tikhonov wrote:
>>
>>> Hi Chris,
>>> I was playing with it recently.
>>> One of the big issues with Tesseract is the tough process of preparing
>>> a training set for multiple fonts and languages.
>>> In addition, we would also have to add an option for image preprocessing
>>> (skewing + filtering etc.).
>>>
>>>
>>> BR,
>>> Oleg
>>>
>>> On Wed, Nov 30, 2011 at 8:59 AM, Mattmann, Chris A (388J) <
>>> [email protected]> wrote:
>>>
>>>> Hey Guys,
>>>>
>>>> FYI: http://code.google.com/p/tesseract-ocr/
>>>>
>>>> I was pointed at this library by someone recently asking me if Tika
>>>> was interested in integrating with this library. It's ALv2 licensed, and
>>>> seems pretty interesting. I'm going to check it out, but just
>>>> wanted to give everyone a heads up.
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Chris Mattmann, Ph.D.
>>>> Senior Computer Scientist
>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>> Office: 171-266B, Mailstop: 171-246
>>>> Email: [email protected]
>>>> WWW:   http://sunset.usc.edu/~mattmann/
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Adjunct Assistant Professor, Computer Science Department
>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>
>>>>
>>
>
>
>
> --
>
> Sincerely,
> Albert Law
> Senior Software Engineer
> Logik.com



-- 
With best wishes,                    Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
Skype: alex.ott
