2009/11/27 Bibek Paudel <eternalyo...@gmail.com>:
> On Fri, Nov 27, 2009 at 4:59 PM, Debayan Banerjee <debaya...@gmail.com> wrote:
>> 2009/11/27 Bibek Paudel <eternalyo...@gmail.com>:
>>> On Fri, Nov 27, 2009 at 4:15 PM, Debayan Banerjee <debaya...@gmail.com>
>>> wrote:
>>
>>> Great job Debayan! Congrats and well done.
>>> What languages does it work for currently, apart from Bengali?
>>
>> It can support all languages, except Chinese and Arabic.
>> I just need training data for the languages, including Nepali.
>> I need a text file containing all the possible glyphs in your script,
>> one per line. I also need a comprehensive word list for your
>> languages. That's all I need.
>
> Wow, that's awesome, could you point me to some sample training data
> so that I can provide you with necessary training data in languages
> like Nepali? This is an exciting development, and I'm all eager to
> help. Thanks again.

I just need a word list, like the ones found here
<http://smc.org.in/silpa/modules/spellchecker/dicts/>.

I also need all the individual glyphs in your script. That includes
every possible symbol: consonants, vowel signs, conjuncts, digits and
punctuation. I need these symbols in a file, one per line.

There is one thing you have to be careful about, though. Some
consonant + vowel combinations fit in a single rectangular box, like
কু ("ku" in Bengali). Now কু = ক + ু . We cannot train ু separately,
because we will never find this symbol isolated in an image. Hence we
need to train every consonant + ু combination as a symbol of its own.
So what I need from you is a list of cases like this, where
consonant + vowel produces a glyph whose parts overlap vertically.

To make myself clearer: কা ("ka") is a consonant + vowel too, but ক
and া do not overlap on the vertical axis and can be trained
separately. For কি ("ki"), however, ক and ি overlap vertically and
need to be trained together, as a single symbol. The reason is that the
Tesseract segmenter is built for English and only boxes rectangles.

If you have understood the above, just send me all the possible glyphs
following these rules. Or you could simply send me a list of
consonants, vowels, numbers and punctuation marks, and tell me the
special rules that exist between consonants and vowels in your
language. I have an automated training data generator that can be fed
these rules, and it generates training data on the fly.

I will upload some Bengali data for you to see in a short while.

>
> Bibek

-- 
Regards,
Debayan Banerjee
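
For illustration, here is a minimal Python sketch of the consonant +
vowel rule described above. It is not the actual training data
generator mentioned in this mail. The function names
(build_glyph_list, format_glyph_file), the two-consonant sample and
the set of vowel signs treated as overlapping are assumptions, based
only on the কু / কা / কি examples given here.

```python
# Hypothetical sketch; none of these names come from the real generator.

def build_glyph_list(consonants, vowel_signs, overlapping_signs, extras=()):
    """Return the symbols to train, one entry per glyph.

    Vowel signs listed in overlapping_signs overlap the consonant
    vertically, so they are emitted only fused with each consonant and
    never on their own. All other vowel signs are emitted standalone.
    """
    glyphs = list(consonants)
    for sign in vowel_signs:
        if sign in overlapping_signs:
            # Cannot be boxed separately: train consonant + sign as one symbol.
            glyphs.extend(c + sign for c in consonants)
        else:
            # Sits beside the consonant: can be trained as its own symbol.
            glyphs.append(sign)
    glyphs.extend(extras)  # digits, punctuation, conjuncts, ...
    return glyphs


def format_glyph_file(glyphs):
    """Format the glyph list as file content, one glyph per line."""
    return "\n".join(glyphs) + "\n"


# Tiny Bengali sample (assumed, deliberately incomplete).
consonants = ["\u0995", "\u0996"]              # KA, KHA
vowel_signs = ["\u09be", "\u09bf", "\u09c1"]   # signs AA, I, U
overlapping = {"\u09bf", "\u09c1"}             # I and U overlap the consonant
extras = ["\u09e6", "\u0964"]                  # digit zero, danda

glyphs = build_glyph_list(consonants, vowel_signs, overlapping, extras)
print(format_glyph_file(glyphs), end="")
```

With a full consonant and vowel sign inventory for a script, the same
two lists plus the set of overlapping signs would be enough to produce
the one-glyph-per-line file requested above. Conjuncts would still have
to be listed explicitly, here through the extras argument.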
_______________________________________________
IndLinux-group mailing list
IndLinux-group@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/indlinux-group