[Ankur-core] [X-Post]Bengali to English transliteration anyone?
Does anyone know of a library that can transliterate Bengali to English? There are tools that do the reverse. I need this to clear the last remaining roadblock in OCR.

Tesseract-OCR stores its dictionaries for lookup in a data structure called a directed acyclic word graph (DAWG). After OCR has been performed, the OCR system matches the output against the entries in this DAWG file. Unfortunately, the data structure is not suited to complex scripts like ours: http://groups.google.com/group/tesseract-ocr/browse_thread/thread/5495c4e348a4b272/a6dcfe5d92babb35?lnk=gstq=dawg%2Bwieghts#a6dcfe5d92babb35

There are two solutions:

1) I figure out a suitable data structure that handles Indic scripts and implement it.
2) I transliterate the entire dictionary and the OCR output to English (26 characters instead of the 500-odd for Bengali) and then match.

I think this should work. Any suggestions?

[1] http://hacking-tesseract.blogspot.com/
[2] http://code.google.com/p/tesseract-ocr

--
Be Intelligent, Use GNU/Linux
http://debayanin.googlepages.com/
http://debayan.wordpress.com
http://lug.nitdgp.ac.in

--
Stay on top of everything new and different, both inside and around Java (TM) technology - register by April 22, and save $200 on the JavaOne (SM) conference, June 2-5, 2009, San Francisco. 300 plus technical and hands-on sessions. Register today. Use priority code J9JMT32. http://p.sf.net/sfu/p
___
Bengalinux-core mailing list
Bengalinux-core@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bengalinux-core
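A rough sketch of what option 2 above could look like, in Python. Everything here is an assumption made for illustration: the mapping tables are a tiny, made-up subset (not ISO 15919 or any other standard), the names `transliterate`, `build_index` and `match` are hypothetical, and a plain dict stands in for Tesseract's DAWG file. The idea is only to show the dictionary and the OCR output being pushed through the same transliteration before they are compared.

```python
# Hypothetical, deliberately incomplete Bengali -> ASCII tables.
# This is NOT a standard scheme; it only illustrates the idea.
CONSONANTS = {
    "\u0995": "k",  # ka
    "\u0997": "g",  # ga
    "\u09a4": "t",  # ta
    "\u09a8": "n",  # na
    "\u09ac": "b",  # ba
    "\u09ae": "m",  # ma
    "\u09b0": "r",  # ra
    "\u09b2": "l",  # la
}
VOWEL_SIGNS = {  # dependent vowel signs, which replace the inherent vowel
    "\u09be": "aa",
    "\u09bf": "i",
    "\u09c1": "u",
    "\u09c7": "e",
    "\u09cb": "o",
}
OTHER = {  # independent vowels and the anusvara
    "\u0985": "a",
    "\u0986": "aa",
    "\u0987": "i",
    "\u0982": "ng",
}
VIRAMA = "\u09cd"  # hasanta: suppresses the inherent vowel


def transliterate(text):
    """Map Bengali text to ASCII, writing the inherent vowel as 'a'."""
    out = []
    pending = False  # True while a consonant still owes its inherent vowel
    for ch in text:
        if ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
            pending = False
        elif ch == VIRAMA:
            pending = False
        else:
            if pending:
                out.append("a")
            pending = ch in CONSONANTS
            # Characters missing from the tables pass through unchanged.
            out.append(CONSONANTS.get(ch) or OTHER.get(ch, ch))
    if pending:
        out.append("a")
    return "".join(out)


def build_index(words):
    """Transliterate every dictionary word once, keeping the originals."""
    index = {}
    for word in words:
        index.setdefault(transliterate(word), []).append(word)
    return index


def match(index, ocr_word):
    """Return the dictionary words whose transliteration equals the OCR output's."""
    return index.get(transliterate(ocr_word), [])


kolom = "\u0995\u09b2\u09ae"               # three bare consonants
bangla = "\u09ac\u09be\u0982\u09b2\u09be"  # consonants with vowel signs and anusvara
index = build_index([kolom, bangla])
print(transliterate(kolom))   # kalama
print(transliterate(bangla))  # baanglaa
print(match(index, bangla) == [bangla])  # True
```

Note that such a scheme is lossy unless the table is designed carefully: two different Bengali words can end up with the same ASCII key, which is why the index keeps a list of originals per key.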
Re: [Ankur-core] [X-Post]Bengali to English transliteration anyone?
On Sun, Apr 19, 2009 at 7:39 PM, Debayan Banerjee debaya...@gmail.com wrote:
> Does anyone know of any libraries that can transliterate Bengali to English?
> 2) I transliterate the entire dictionary and the OCR output to English (26 characters instead of the 500-odd for Bengali) and then match. I think this should work.

I believe there is an ISO standard for doing this. Take a look at ISO 15919:2001. A saner explanation is at http://homepage.ntlworld.com/stone-catend/trind.htm :-)

There is a script that does this for Devanagari: https://www.dealloc.org/~mublin/iso15919.py.html

However, its output is not restricted to the 26 characters, though the set is definitely smaller than 500 :-)

-sdg-

--
Sayamindu Dasgupta
[http://sayamindu.randomink.org/ramblings]
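If the 26-character limit really matters, one possible follow-up step is to strip the diacritics from the romanized text afterwards. The sketch below is an assumption on my part, not something the linked script is known to do: `strip_diacritics` is a hypothetical helper, and the sample strings are only illustrative romanizations. It relies on Python's standard `unicodedata` module to decompose each letter and drop the combining marks.

```python
import unicodedata


def strip_diacritics(romanized):
    """Reduce a romanized string to base letters by dropping combining marks."""
    # NFD splits a precomposed letter such as U+0101 into 'a' + U+0304.
    decomposed = unicodedata.normalize("NFD", romanized)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# 'b', a-macron, m-dot-above, 'l', a-macron
print(strip_diacritics("b\u0101\u1e41l\u0101"))  # bamla

# The price of the smaller alphabet: distinct spellings can collapse together.
print(strip_diacritics("k\u0101la") == strip_diacritics("kala"))  # True
```

Letters that have no canonical decomposition are left untouched, so this alone does not guarantee pure ASCII output, and the collisions mean a match on the stripped form may need a second check against the original words.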
Re: [Ankur-core] [X-Post]Bengali to English transliteration anyone?
On 4/19/09, Debayan Banerjee debaya...@gmail.com wrote:
> Does anyone know of any libraries that can transliterate Bengali to English?
> Any suggestions?

Take a look at the uni2rb.py script in http://bocra.svn.sourceforge.net/viewvc/bocra/bocra/trunk/src/python/

-Deepayan