Dear friends!
For a few years our group has been developing an open-source OCR (optical
character recognition) and translation system for Asian languages.
The key features of the OCR system include:
1. Stream OCR processing. During the first stage of the project, we recognized
300 000 pages of the Tibetan Canon in Tibetan for the TBRC Digital Library
(www.tbrc.org). We used a Mac Pro server, which processed all 280 volumes with
a single OCR set.
2. Tibetan spell checker and online dictionary with 250 000 words and a
6.5 million word list.
3. Multilingual support. At present, the main focus of the project is OCR for
Tibetan, Sinhala, Sanskrit, and Kannada.
4. High accuracy. The system uses dictionary control at all stages of OCR
processing. Its Grammar Corrector can use a statistical dictionary containing
20-30 million phrases (the Tibetan dictionary now includes 8.5 million). For
Tibetan books, the current recognition result is 1 error per 1000 characters.
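For readers who want to compare figures such as "1 error per 1000 characters",
one common way to measure this is the character error rate: the edit distance
between the reference text and the OCR output, divided by the length of the
reference. The sketch below is only an illustration in Python; the function
names are hypothetical and it is not taken from the ocrlib code, which may
count errors differently.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    # previous[j] holds the distance between the processed prefix of
    # `reference` and the first j characters of `hypothesis`.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Errors per reference character (assumed definition, see lead-in)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

With this definition, one wrong character in a 1000-character reference gives
a rate of 0.001, which corresponds to 1 error per 1000 characters.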
At the current stage of the project:
We have a grammar analysis module for Tibetan and Sanskrit. It includes a
corpus and full-text fuzzy search (1 second for a 1 GB corpus).
It needs to be integrated with HFST and Apertium.
At St. Petersburg State University we received a letter about the Apertium
project and Google Summer of Code, with a recommendation to join your project.
How can we combine our efforts?
Sincerely yours, Alex
http://code.google.com/p/ocrlib/
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff