Does anyone know of any libraries that can transliterate Bengali to
English? There are tools to do the reverse. I need this to get past the
last remaining roadblock in OCR.
The thing is, Tesseract-OCR uses a data structure called a directed
acyclic word graph (DAWG) to store dictionaries for lookup. After
recognition, the OCR system matches its output against the entries in
this DAWG file. Unfortunately, the data structure is not suited to
complex scripts like ours:
<http://groups.google.com/group/tesseract-ocr/browse_thread/thread/5495c4e348a4b272/a6dcfe5d92babb35?lnk=gst&q=dawg%2Bwieghts#a6dcfe5d92babb35>.
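
For anyone not familiar with the lookup step, here is a rough sketch of
the idea. It assumes a plain trie (a DAWG is essentially a trie whose
common suffixes are merged) with edges keyed by Unicode code point. This
is not Tesseract's actual code or file format; the class and method
names are made up for illustration.

```python
class TrieNode:
    """One node of the word graph; edges are keyed by a single code point."""

    def __init__(self):
        self.children = {}
        self.is_word = False


class Trie:
    """Simplified stand-in for a DAWG: same lookup, no suffix sharing."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            # Create the edge for this code point if it does not exist yet.
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        # Reaching a node is not enough; it must mark the end of a word.
        return node.is_word


if __name__ == "__main__":
    dictionary = Trie()
    dictionary.insert("\u09a8\u09be\u09ae")  # a three-code-point Bengali word
    print(dictionary.contains("\u09a8\u09be\u09ae"))
```

Note that a consonant plus a vowel sign is two code points here, so one
visible Bengali glyph can span several edges in the graph.
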
There are two solutions: 1) I figure out a suitable data structure that
handles Indic scripts and implement it. 2) I transliterate the entire
dictionary and the OCR output to English (26 letters instead of the
500-odd for Bengali) and then match. I think this should work.
Any suggestions?
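
To make option 2 concrete, here is a minimal sketch of what I have in
mind. The romanization table below is my own ad hoc, deliberately
incomplete mapping (not a standard scheme and not an existing library),
and the function names are hypothetical. It gives each consonant an
inherent "a" unless a vowel sign or hasanta follows.

```python
# Ad hoc, incomplete mapping for illustration only (an assumption, not a standard).
CONSONANTS = {
    "\u0995": "k", "\u0996": "kh", "\u0997": "g", "\u09a4": "t", "\u09a8": "n",
    "\u09ac": "b", "\u09ae": "m", "\u09b0": "r", "\u09b2": "l", "\u09b8": "s",
}
VOWELS = {
    "\u0985": "a", "\u0986": "aa", "\u0987": "i",
    "\u0989": "u", "\u098f": "e", "\u0993": "o",
}
VOWEL_SIGNS = {
    "\u09be": "aa", "\u09bf": "i", "\u09c1": "u", "\u09c7": "e", "\u09cb": "o",
}
HASANTA = "\u09cd"  # suppresses the inherent vowel of the preceding consonant


def transliterate(word):
    """Map a Bengali word to lowercase ASCII; unknown code points become '?'."""
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            if nxt in VOWEL_SIGNS:
                out.append(VOWEL_SIGNS[nxt])  # vowel sign replaces inherent vowel
                i += 1
            elif nxt == HASANTA:
                i += 1                        # conjunct: no vowel at all
            else:
                out.append("a")               # inherent vowel
        elif ch in VOWELS:
            out.append(VOWELS[ch])
        else:
            out.append("?")
        i += 1
    return "".join(out)


def build_index(bengali_words):
    """Transliterate the whole dictionary once, keeping the original words."""
    index = {}
    for w in bengali_words:
        index.setdefault(transliterate(w), []).append(w)
    return index


def lookup(index, ocr_word):
    """Return the dictionary words whose transliteration matches the OCR output."""
    return index.get(transliterate(ocr_word), [])
```

The index keeps a list per key because a 26-letter target is lossy: two
different Bengali words can collapse to the same ASCII string (for
example "k" followed by "h" versus "kh"), so a real mapping would have
to be chosen to avoid or at least tolerate such collisions.
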

[1] http://hacking-tesseract.blogspot.com/
[2] http://code.google.com/p/tesseract-ocr


-- 
Be Intelligent, Use GNU/Linux

http://debayanin.googlepages.com/
http://debayan.wordpress.com
http://lug.nitdgp.ac.in

_______________________________________________
Bengalinux-core mailing list
Bengalinux-core@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bengalinux-core