
[Ankur-core] [X-Post]Bengali to English transliteration anyone?

2009-04-19 Thread Debayan Banerjee

Does anyone know of any libraries that can transliterate Bengali to
English? There are tools that do the reverse. I need this to clear the last
remaining roadblock in OCR.
The thing is, Tesseract-OCR uses a data structure called a
directed acyclic word graph (DAWG) to store dictionaries for lookup. After
OCR has been performed, the OCR system matches the output against the entries
in this DAWG file. Unfortunately, the data structure is not suited
to complex scripts like ours:
http://groups.google.com/group/tesseract-ocr/browse_thread/thread/5495c4e348a4b272/a6dcfe5d92babb35?lnk=gstq=dawg%2Bwieghts#a6dcfe5d92babb35
There are two solutions: 1) I figure out a suitable data structure that
handles Indic scripts and implement it, or 2) I transliterate the entire
dictionary and the OCR output to English (26 characters instead of the
500-odd for Bengali) and then match. I think this should work.
Any suggestions?
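
As a rough illustration of option 2 (a sketch only, not code from Tesseract
or from any library named in this thread): the idea is to map every Bengali
code point to an ASCII token, apply the same mapping to the dictionary and to
the OCR output, and compare the results. The table BN_TO_ASCII, the helper
names and the sample words below are all hypothetical, and a plain Python set
stands in for the DAWG.

```python
# Hypothetical, deliberately tiny mapping from Bengali code points to ASCII.
# A real table would have to cover the whole Bengali block (U+0980-U+09FF)
# and stay one-to-one so that distinct words never collide.
BN_TO_ASCII = {
    "\u0995": "k",  # BENGALI LETTER KA
    "\u09B2": "l",  # BENGALI LETTER LA
    "\u09AE": "m",  # BENGALI LETTER MA
    "\u09BE": "A",  # BENGALI VOWEL SIGN AA
    "\u09BF": "i",  # BENGALI VOWEL SIGN I
    "\u09CD": "_",  # BENGALI SIGN VIRAMA (hasanta)
}


def to_ascii(word):
    """Map a Bengali string to ASCII, one code point at a time.

    Code points missing from the table are written as a hex escape such as
    <0996>, so the output is still unambiguous.
    """
    return "".join(BN_TO_ASCII.get(ch, "<%04X>" % ord(ch)) for ch in word)


# Sample dictionary (assumed words): kalam, kala, ma.
BENGALI_WORDS = ["\u0995\u09B2\u09AE", "\u0995\u09B2\u09BE", "\u09AE\u09BE"]

# Transliterate the dictionary once; a set stands in for the DAWG here.
ASCII_DICTIONARY = {to_ascii(w) for w in BENGALI_WORDS}


def is_known(ocr_word):
    """Return True if the transliterated OCR word is in the dictionary."""
    return to_ascii(ocr_word) in ASCII_DICTIONARY


if __name__ == "__main__":
    print(to_ascii("\u0995\u09B2\u09AE"))
    print(is_known("\u0995\u09B2\u09BE"))
    print(is_known("\u0995\u09AE"))
```

Because the mapping works per code point rather than per sound, it can be
inverted after matching to get the Bengali text back.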

[1] http://hacking-tesseract.blogspot.com/
[2] http://code.google.com/p/tesseract-ocr

--
Be Intelligent, Use GNU/Linux

http://debayanin.googlepages.com/
http://debayan.wordpress.com
http://lug.nitdgp.ac.in

___
Bengalinux-core mailing list
Bengalinux-core@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bengalinux-core

Re: [Ankur-core] [X-Post]Bengali to English transliteration anyone?

2009-04-19 Thread Sayamindu Dasgupta

On Sun, Apr 19, 2009 at 7:39 PM, Debayan Banerjee debaya...@gmail.com wrote:
> Does anyone know of any libraries that can transliterate Bengali to
> English? There are tools that do the reverse. I need this to clear the last
> remaining roadblock in OCR.
> The thing is, Tesseract-OCR uses a data structure called a
> directed acyclic word graph (DAWG) to store dictionaries for lookup. After
> OCR has been performed, the OCR system matches the output against the entries
> in this DAWG file. Unfortunately, the data structure is not suited
> to complex scripts like ours:
> http://groups.google.com/group/tesseract-ocr/browse_thread/thread/5495c4e348a4b272/a6dcfe5d92babb35?lnk=gstq=dawg%2Bwieghts#a6dcfe5d92babb35
> There are two solutions: 1) I figure out a suitable data structure that
> handles Indic scripts and implement it, or 2) I transliterate the entire
> dictionary and the OCR output to English (26 characters instead of the
> 500-odd for Bengali) and then match. I think this should work.

I believe there is an ISO standard for doing this. Take a look at ISO 15919:2001.

A saner explanation is at http://homepage.ntlworld.com/stone-catend/trind.htm :-)

There is a script that does this for Devanagari:
https://www.dealloc.org/~mublin/iso15919.py.html

However, this is not restricted to the 26 characters, though it definitely
uses fewer than 500 :-)
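
To make the point about the character count concrete, here is a minimal
sketch in the spirit of ISO 15919 (not the script linked above, and not a
complete or authoritative implementation). The tables cover only a handful
of Bengali letters, and the values should be checked against the standard
before use. The function name romanize and the sample words are assumptions
made for illustration.

```python
# Small, illustrative subset of consonants (values without the inherent vowel).
CONSONANTS = {
    "\u0995": "k",       # BENGALI LETTER KA
    "\u09A4": "t",       # BENGALI LETTER TA
    "\u09A6": "d",       # BENGALI LETTER DA
    "\u09A8": "n",       # BENGALI LETTER NA
    "\u09B2": "l",       # BENGALI LETTER LA
    "\u09B6": "\u015B",  # BENGALI LETTER SHA -> s with acute
}

# Dependent vowel signs replace the inherent vowel "a".
VOWEL_SIGNS = {
    "\u09BE": "\u0101",  # VOWEL SIGN AA -> a with macron
    "\u09BF": "i",       # VOWEL SIGN I
    "\u09C0": "\u012B",  # VOWEL SIGN II -> i with macron
}

VIRAMA = "\u09CD"  # suppresses the inherent vowel


def romanize(word):
    """Romanize a word made only of the consonants and signs listed above."""
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch not in CONSONANTS:
            raise ValueError("unsupported character U+%04X" % ord(ch))
        out.append(CONSONANTS[ch])
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if nxt in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[nxt])  # explicit vowel sign
            i += 2
        elif nxt == VIRAMA:
            i += 2                        # virama: no vowel at all
        else:
            out.append("a")               # inherent vowel
            i += 1
    return "".join(out)


if __name__ == "__main__":
    # KA LA AA; NA DA II; SHA AA NA VIRAMA TA I
    for w in ["\u0995\u09B2\u09BE", "\u09A8\u09A6\u09C0",
              "\u09B6\u09BE\u09A8\u09CD\u09A4\u09BF"]:
        print(romanize(w))
```

The output uses letters with diacritics, so a dictionary built this way
needs more than the 26 plain letters, but still far fewer symbols than the
full set of Bengali conjuncts.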

-sdg-

--
Sayamindu Dasgupta
[http://sayamindu.randomink.org/ramblings]

Re: [Ankur-core] [X-Post]Bengali to English transliteration anyone?

2009-04-19 Thread Deepayan Sarkar

On 4/19/09, Debayan Banerjee debaya...@gmail.com wrote:
> Does anyone know of any libraries that can transliterate Bengali to
> English? There are tools that do the reverse. I need this to clear the last
> remaining roadblock in OCR.
> The thing is, Tesseract-OCR uses a data structure called a
> directed acyclic word graph (DAWG) to store dictionaries for lookup. After
> OCR has been performed, the OCR system matches the output against the entries
> in this DAWG file. Unfortunately, the data structure is not suited
> to complex scripts like ours:
> http://groups.google.com/group/tesseract-ocr/browse_thread/thread/5495c4e348a4b272/a6dcfe5d92babb35?lnk=gstq=dawg%2Bwieghts#a6dcfe5d92babb35
> There are two solutions: 1) I figure out a suitable data structure that
> handles Indic scripts and implement it, or 2) I transliterate the entire
> dictionary and the OCR output to English (26 characters instead of the
> 500-odd for Bengali) and then match. I think this should work.
> Any suggestions?

Take a look at the uni2rb.py script in

http://bocra.svn.sourceforge.net/viewvc/bocra/bocra/trunk/src/python/

-Deepayan
