[Ankur-core] [X-Post]Bengali to English transliteration anyone?

2009-04-19 Thread Debayan Banerjee
Does anyone know of any libraries that can transliterate Bengali to
English? There are tools to do the reverse. I need this to solve the
last remaining roadblock in OCR.
The thing is, Tesseract-OCR uses a data structure called a directed
acyclic word graph (DAWG) to store dictionaries for lookup. After OCR
has been performed, the OCR system matches the output against the
entries in this DAWG file. Unfortunately, the data structure is not
suited to complex scripts like ours:
http://groups.google.com/group/tesseract-ocr/browse_thread/thread/5495c4e348a4b272/a6dcfe5d92babb35?lnk=gstq=dawg%2Bwieghts#a6dcfe5d92babb35
There are two solutions: 1) I figure out a suitable data structure that
handles Indic scripts and implement it. 2) I transliterate the entire
dictionary and the OCR output to English (26 characters instead of the
500-odd for Bengali) and then match. I think this should work.
Any suggestions?
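
A minimal sketch of option 2, assuming a hand-written mapping table. The
names to_ascii and is_known, the tiny tables, and the ASCII spellings are
all hypothetical and cover only a handful of letters; this is not
Tesseract code and not a standard romanization. It only illustrates
transliterating both the dictionary and the OCR output, then matching in
ASCII.

```python
# Hypothetical, deliberately tiny tables: a real one would cover the whole
# Bengali Unicode block. The ASCII spellings are ad hoc, not a standard.
CONSONANTS = {
    "\u0995": "k", "\u09a4": "t", "\u09a8": "n", "\u09ac": "b",
    "\u09ae": "m", "\u09b0": "r", "\u09b2": "l",
}
VOWEL_SIGNS = {  # dependent vowel signs that replace the inherent "a"
    "\u09be": "aa", "\u09bf": "i", "\u09c1": "u", "\u09c7": "e", "\u09cb": "o",
}
OTHER = {  # independent vowels and the anusvara
    "\u0985": "a", "\u0986": "aa", "\u0987": "i", "\u0982": "ng",
}
HASANTA = "\u09cd"  # suppresses the inherent vowel, forming conjuncts


def to_ascii(word):
    """Transliterate one Bengali word to ASCII using the tables above."""
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            if nxt == HASANTA:
                i += 1  # no vowel: the consonant joins the next one
            elif nxt in VOWEL_SIGNS:
                out.append(VOWEL_SIGNS[nxt])
                i += 1
            else:
                out.append("a")  # inherent vowel
        elif ch in OTHER:
            out.append(OTHER[ch])
        else:
            raise ValueError(f"no mapping for U+{ord(ch):04X}")
        i += 1
    return "".join(out)


# Transliterate the dictionary once, then match OCR output against it.
bengali_words = [
    "\u0995\u09b2\u09ae",              # three bare consonants
    "\u09ac\u09be\u0982\u09b2\u09be",  # uses vowel signs and anusvara
    "\u09b0\u0995\u09cd\u09a4",        # contains a conjunct
]
ascii_dictionary = {to_ascii(w) for w in bengali_words}


def is_known(ocr_word):
    """Return True if the transliterated OCR word is in the dictionary."""
    try:
        return to_ascii(ocr_word) in ascii_dictionary
    except ValueError:
        return False
```

A scheme this simple is lossy (for example, it never drops a word-final
inherent vowel, and distinct spellings can collide), which may or may not
matter for dictionary matching.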

[1] http://hacking-tesseract.blogspot.com/
[2] http://code.google.com/p/tesseract-ocr


-- 
Be Intelligent, Use GNU/Linux

http://debayanin.googlepages.com/
http://debayan.wordpress.com
http://lug.nitdgp.ac.in

--
Stay on top of everything new and different, both inside and 
around Java (TM) technology - register by April 22, and save
$200 on the JavaOne (SM) conference, June 2-5, 2009, San Francisco.
300 plus technical and hands-on sessions. Register today. 
Use priority code J9JMT32. http://p.sf.net/sfu/p
___
Bengalinux-core mailing list
Bengalinux-core@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bengalinux-core


Re: [Ankur-core] [X-Post]Bengali to English transliteration anyone?

2009-04-19 Thread Sayamindu Dasgupta
On Sun, Apr 19, 2009 at 7:39 PM, Debayan Banerjee debaya...@gmail.com wrote:
> Does anyone know of any libraries that can transliterate Bengali to
> English? There are tools to do the reverse. I need this to solve the
> last remaining roadblock in OCR.
> The thing is, Tesseract-OCR uses a data structure called a directed
> acyclic word graph (DAWG) to store dictionaries for lookup. After OCR
> has been performed, the OCR system matches the output against the
> entries in this DAWG file. Unfortunately, the data structure is not
> suited to complex scripts like ours:
> http://groups.google.com/group/tesseract-ocr/browse_thread/thread/5495c4e348a4b272/a6dcfe5d92babb35?lnk=gstq=dawg%2Bwieghts#a6dcfe5d92babb35
> There are two solutions: 1) I figure out a suitable data structure that
> handles Indic scripts and implement it. 2) I transliterate the entire
> dictionary and the OCR output to English (26 characters instead of the
> 500-odd for Bengali) and then match. I think this should work.


I believe there is an ISO standard for doing this. Take a look at ISO 15919:2001.

A saner explanation is at http://homepage.ntlworld.com/stone-catend/trind.htm :-)

There is a script that does this for Devanagari:
https://www.dealloc.org/~mublin/iso15919.py.html

However, the output is not restricted to the 26 characters of English,
though it definitely uses fewer than 500 :-)
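
To illustrate the point about the character count, here is a rough
sketch. The table holds only a handful of ISO 15919 letter values, not
the full standard, and the names ISO15919_SAMPLE, extra_letters, and
fold_to_ascii are made up for this example. It shows that the romanized
output needs a few Latin letters with diacritics, and that folding them
down to plain ASCII is possible but merges distinct letters.

```python
import unicodedata

# A few Bengali letters with their ISO 15919 values (sample only).
ISO15919_SAMPLE = {
    "\u0995": "ka",
    "\u0986": "\u0101",    # a with macron
    "\u0988": "\u012b",    # i with macron
    "\u099f": "\u1e6da",   # t with dot below + a
    "\u09a3": "\u1e47a",   # n with dot below + a
    "\u09b6": "\u015ba",   # s with acute + a
    "\u09b7": "\u1e63a",   # s with dot below + a
}


def extra_letters(table):
    """Return the non-ASCII characters used by the romanized values."""
    return sorted({c for value in table.values() for c in value if ord(c) > 127})


def fold_to_ascii(text):
    """Strip diacritics by decomposing and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Letters beyond plain a-z that this small sample already needs.
print(extra_letters(ISO15919_SAMPLE))

# Folding to ASCII loses information: two different sibilants collide.
print(fold_to_ascii(ISO15919_SAMPLE["\u09b6"]), fold_to_ascii(ISO15919_SAMPLE["\u09b7"]))
```

Whether such collisions are acceptable depends on how strict the
dictionary match needs to be.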

-sdg-





--
Sayamindu Dasgupta
[http://sayamindu.randomink.org/ramblings]



Re: [Ankur-core] [X-Post]Bengali to English transliteration anyone?

2009-04-19 Thread Deepayan Sarkar
On 4/19/09, Debayan Banerjee debaya...@gmail.com wrote:
> Does anyone know of any libraries that can transliterate Bengali to
> English? There are tools to do the reverse. I need this to solve the
> last remaining roadblock in OCR.
> The thing is, Tesseract-OCR uses a data structure called a directed
> acyclic word graph (DAWG) to store dictionaries for lookup. After OCR
> has been performed, the OCR system matches the output against the
> entries in this DAWG file. Unfortunately, the data structure is not
> suited to complex scripts like ours:
> http://groups.google.com/group/tesseract-ocr/browse_thread/thread/5495c4e348a4b272/a6dcfe5d92babb35?lnk=gstq=dawg%2Bwieghts#a6dcfe5d92babb35
> There are two solutions: 1) I figure out a suitable data structure that
> handles Indic scripts and implement it. 2) I transliterate the entire
> dictionary and the OCR output to English (26 characters instead of the
> 500-odd for Bengali) and then match. I think this should work.
> Any suggestions?

Take a look at the uni2rb.py script in

http://bocra.svn.sourceforge.net/viewvc/bocra/bocra/trunk/src/python/

-Deepayan
