2009/11/27 Bibek Paudel <eternalyo...@gmail.com>:
> On Fri, Nov 27, 2009 at 4:59 PM, Debayan Banerjee <debaya...@gmail.com> wrote:
>> 2009/11/27 Bibek Paudel <eternalyo...@gmail.com>:
>>> On Fri, Nov 27, 2009 at 4:15 PM, Debayan Banerjee <debaya...@gmail.com> 
>>> wrote:
>>
>>> Great job Debayan ! Congrats and well done.
>>> What languages does it work for currently, apart from Bengali?
>>
>> It can support all languages except Chinese and Arabic.
>> I just need training data for each language, including Nepali.
>> I need a text file containing all the possible glyphs in your script,
>> one per line. I also need a comprehensive word list for your
>> language. That's all I need.
>
> Wow, that's awesome, could you point me to some sample training data
> so that I can provide you with necessary training data in languages
> like Nepali? This is an exciting development, and I'm all eager to
> help. Thanks again.

I just need a word list, like the ones found here:
<http://smc.org.in/silpa/modules/spellchecker/dicts/>.
I also need all the individual glyphs in your script. That includes all
possible symbols: consonants, vowel signs, conjuncts, digits and
punctuation. I need these symbols in a file, one per line.

There is something you have to be careful about, though. There may be
consonant+vowel combinations that fit in a single rectangular box, like কু
<"ku" in Bengali>. Now কু = ক + ু . We cannot train ু separately
because we will never find this symbol isolated in an image. Hence we
need to train every consonant + ু combination as its own symbol. So what
I need from you is a list of cases like this, where consonant + vowel
produces a glyph whose parts overlap vertically. To make myself clearer:
কা <ka> is also consonant + vowel, but ক and া do not overlap on a
vertical axis and can be trained separately. In কি <ki>, however, ক and ি
overlap vertically and need to be trained together, as a single symbol.
The reason is that the Tesseract segmenter was built for English and
only boxes rectangles.
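
To illustrate the rule above, here is a minimal sketch in Python. It is
not the actual generator mentioned in this thread; the function name and
the tiny sample character sets are assumptions made up for the example.
Vowel signs that sit beside the consonant are listed on their own, while
signs that share the consonant's box are only emitted combined with each
consonant.

```python
# Illustrative sketch only: names and sample sets are hypothetical.
CONSONANTS = ["\u0995", "\u0996", "\u0997"]   # Bengali ka, kha, ga (sample, not the full set)
SEPARABLE_SIGNS = ["\u09be"]                  # aa-sign: sits beside the consonant
OVERLAPPING_SIGNS = ["\u09bf", "\u09c1"]      # i-sign and u-sign: share the consonant's box


def build_glyph_list(consonants, separable_signs, overlapping_signs):
    """Return the symbols to train, one list entry per boxed glyph."""
    glyphs = list(consonants)          # bare consonants are trained alone
    glyphs.extend(separable_signs)     # separable signs are trained alone too
    for consonant in consonants:
        for sign in overlapping_signs:
            # Overlapping signs never appear isolated, so pair them up.
            glyphs.append(consonant + sign)
    return glyphs


glyphs = build_glyph_list(CONSONANTS, SEPARABLE_SIGNS, OVERLAPPING_SIGNS)
print("\n".join(glyphs))
```

With three consonants, one separable sign and two overlapping signs, this
yields ten entries; the overlapping signs never appear on a line alone.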

If you have understood the above, just send me all the possible glyphs,
following the above rules.
Or you could simply send me lists of consonants, vowels, digits and
punctuation marks, and tell me the special rules that exist between
consonants and vowels in your language. I have an automated training
data generator that can be fed these rules, and it generates
training data on the fly.
I will upload some Bengali data for you to see in a short while.
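
For anyone preparing the glyph file by hand, a small script along these
lines could write it in the requested one-per-line layout. This is only
a sketch: the function name, the output file name and the choice of
UTF-8 are assumptions, and the Devanagari sample is a handful of
characters, not a complete Nepali set.

```python
import os
import tempfile


def write_glyph_file(path, glyphs):
    """Write glyphs to `path`, one per line, skipping blanks and duplicates."""
    seen = set()
    unique = []
    for glyph in glyphs:
        glyph = glyph.strip()
        if glyph and glyph not in seen:
            seen.add(glyph)
            unique.append(glyph)
    # UTF-8 is an assumption; use whatever encoding the recipient expects.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("".join(g + "\n" for g in unique))
    return unique


# Devanagari sample: ka, kha, digit one, danda, and ka repeated on purpose.
sample = ["\u0915", "\u0916", "\u0967", "\u0964", "\u0915"]
out_path = os.path.join(tempfile.gettempdir(), "nepali_glyphs.txt")  # hypothetical name
written = write_glyph_file(out_path, sample)
```

The repeated entry is dropped, so the file ends up with four lines in
the order the glyphs were first given.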

>
> Bibek
>
>>>
>>
>>
>> --
>> Regards,
>> Debayan Banerjee
>>
>



-- 
Regards,
Debayan Banerjee

_______________________________________________
IndLinux-group mailing list
IndLinux-group@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/indlinux-group
