I have read these mails. Oops, I don't have that much knowledge about MT, corpora and so on, but I am more than ready to do any volunteer work to make this happen. I have a good command of Malayalam and English, so how would this translation actually work? Show me the path and I will walk it.
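For anyone else starting from zero: the block view Jaganadh gives in the quoted mail below (source sentence -> parsing -> target-language syntactic pattern -> text generation) might be sketched roughly like this in Python. Everything in it is an assumption made for illustration: the three-word `LEXICON`, the function names, and the single subject-verb-object pattern are hypothetical and not taken from any of the systems mentioned in the thread. It also skips Sandhi splitting and morphological analysis entirely, which the thread identifies as the hard parts.

```python
# Toy lexicon (hypothetical): English word -> (POS tag, Malayalam word).
LEXICON = {
    "i": ("PRON", "ഞാൻ"),
    "read": ("VERB", "വായിക്കുന്നു"),
    "books": ("NOUN", "പുസ്തകങ്ങൾ"),
}


def parse(sentence):
    """Step 1: look up each source word and attach its POS tag."""
    tagged = []
    for word in sentence.lower().split():
        if word not in LEXICON:
            raise KeyError(f"no lexicon entry for {word!r}")
        tagged.append((word, LEXICON[word][0]))
    return tagged


def to_target_pattern(tagged):
    """Step 2: reorder English subject-verb-object into Malayalam
    subject-object-verb. Only this one pattern is handled."""
    tags = [tag for _, tag in tagged]
    if tags != ["PRON", "VERB", "NOUN"]:
        raise ValueError(f"unsupported pattern: {tags}")
    subject, verb, obj = tagged
    return [subject, obj, verb]


def generate(reordered):
    """Step 3: emit the target-language word for each source word."""
    return " ".join(LEXICON[word][1] for word, _ in reordered)


def translate(sentence):
    """Run the three steps in order."""
    return generate(to_target_pattern(parse(sentence)))


if __name__ == "__main__":
    print(translate("I read books"))
```

Calling `translate("I read books")` looks up the three words, moves the verb to the end, and joins the Malayalam entries. Any word outside the toy lexicon, or any other sentence shape, raises an error, which is a small-scale version of the "toy examples" problem Rajeev raises in the thread.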
On Mon, Jul 27, 2009 at 12:23 PM, JAGANADH G<[email protected]> wrote:
>
>
> On Mon, Jul 27, 2009 at 11:33 AM, jinesh kj <[email protected]> wrote:
>>
>> hi all,
>>
>> Machine Translation is one of the toughest language computing problems,
>> and newer ideas and thoughts are coming up every year. The Ministry of
>> Communications and Information Technology is spending a lot of money on
>> the project (along with some other projects). An M.T. system for
>> Malayalam is being developed by Tamil University, Thanjavur. From what I
>> understand, they are using a corpus-based approach, tailored for a set
>> of sentences rather than a generic algorithm.
>
> Yes, I know this. The Thanjavur people are working on Tamil <-> Malayalam
> machine translation. They are customizing the Anusaaraka approach
> developed by the Akshar Bharati group. That system is a language
> acquisition system rather than MT (in the original developers' view).
> The system's algorithm has its own advantages and limitations. A group
> of C-DAC people are also involved in English to Indian languages
> (including Malayalam). I don't know whether any of these systems are
> open or not, which is why I was not mentioning the names.
>
>>
>> When I talked to a friend, he pointed out some things, like that we need
>> to think of the deviations from base grammar rules when designing a
>> system for real translation. I think whatever we do, the translation
>> process will remain the same (remove all agglutination, identify key
>> words and their POS, and translate using that information). Sandhi
>> splitting and POS tagging are the important steps to tackle, in my view.
>
> More clearly: Source language sentence -> Parsing (for pattern
> identification) -> Convert to target language syntactic pattern -> Target
> language text generation. This is the broad block view of an MT system.
> Whether a POS tagger should be there depends on your design.
> The harder part in Indian language to Indian language translation (from
> my experience) is morphological analysis as well as Sandhi splitting.
> Some sort of heuristics is required for Sandhi splitting. Computing
> Kerala Paniniyam will not solve the problem. Even for Sanskrit, extensive
> Sandhi rules exist, but people engaged in Sanskrit computing call it a
> baffling problem. A Sandhi splitter is a required component of a
> morphological analyzer, and the Sandhi splitter in turn requires a
> morphological analyzer (a kind of deadlock).
>>
>> Maybe Jagan, Santhosh, Rajeev and all can add more to this. From what I
>> understand, a normal rule-based system won't work that well for
>> Malayalam, since rules are not much followed in the normal writing
>> scheme (both are right kind of approach).
>
> If somebody is really interested, we can build a small system within one
> year. I will share the plan within a day or two.
>
>>
>> cheers
>>
>> Jinesh K J
>>
>> On Mon, Jul 27, 2009 at 10:26 AM, JAGANADH G <[email protected]> wrote:
>>>
>>> If you are really interested, drop me a mail. Are you familiar with
>>> Perl programming?
>>>
>>> On Sun, Jul 26, 2009 at 10:29 PM, Varewoolf <[email protected]> wrote:
>>>>
>>>> So what might be the next step?
>>>>
>>>> On Sat, Jul 25, 2009 at 10:31 AM, JAGANADH G<[email protected]> wrote:
>>>> >
>>>> >
>>>> > On Sat, Jul 25, 2009 at 12:41 AM, Rajeev J Sebastian
>>>> > <[email protected]> wrote:
>>>> >>
>>>> >> On Fri, Jul 24, 2009 at 7:02 PM, JAGANADH G<[email protected]>
>>>> >> wrote:
>>>> >> >
>>>> >> >
>>>> >> > On Fri, Jul 24, 2009 at 5:29 PM, Rajeev J Sebastian
>>>> >> > <[email protected]> wrote:
>>>> >> >>
>>>> >> >> On Fri, Jul 24, 2009 at 5:19 PM, Varewoolf<[email protected]>
>>>> >> >> wrote:
>>>> >> >> >
>>>> >> >> > I am very much interested in making this happen... I have
>>>> >> >> > always been interested in linguistics...
>>>> >> >> > Can anybody tell me what things we need primarily?
>>>> >> >>
>>>> >> >> How about ...
>>>> >> >>
>>>> >> >> 1) 50+ years of research (actually, 2000 if you consider Panini)
>>>> >> >
>>>> >> > It is history ?
>>>> >> > If you can work hard, you can drop the zero from it.
>>>> >>
>>>> >> Huh ?
>>>> >>
>>>> >> >>
>>>> >> >> 2) Extremely large corpus ... if you want to make a practical
>>>> >> >> system
>>>> >> >
>>>> >> > Only if you adopt a corpus-based model. That is not going to be
>>>> >> > practical right now in the case of English to Malayalam
>>>> >> > translation.
>>>> >>
>>>> >> It is not practical to make *anything* without a corpus. Even if you
>>>> >> use a non-corpus based methodology to perform translation, you still
>>>> >> need a large corpus to *validate* that your method works for more
>>>> >> than toy examples. This is the biggest problem that faces any NLP
>>>> >> work for Indic languages, and one that some glorified institutions
>>>> >> in India neither build up nor share, most probably because all
>>>> >> their systems are capable of is translating toy examples.
>>>> >
>>>> > I know that there are non-free systems under development which are
>>>> > more advanced than the Google Translate service (English-Hindi). But
>>>> > when they will release them, I don't know.
>>>> >
>>>> >>
>>>> >> >>
>>>> >> >> 3) Large and talented team good in computational linguistics
>>>> >> >
>>>> >> > Where is it? We can build this up.
>>>> >>
>>>> >> Best of Luck.
>>>> >>
>>>> >> >>
>>>> >> >> 4) a very practical theory that can model language effectively for
>>>> >> >> your purposes (seriously lacking for even small use cases in even
>>>> >> >> major languages)
>>>> >> >
>>>> >> > A perfect grammar for Malayalam is required, especially in syntax
>>>> >> > and morphology. Malayalam really lacks such studies.
>>>> >>
>>>> >> I don't think any language has such an in-depth model that could be
>>>> >> used for generic MT. There are, of course, special case models ...
>>>> >> which can be used for special cases.
>>>> >
>>>> > The Sanskrit grammar is a perfect model.
>>>> >
>>>> >>
>>>> >> >>
>>>> >> >> 5) since you want to do MT, you need one more theory to handle the
>>>> >> >> target language ... maybe even an IL model if you go that route
>>>> >> >> instead of direct translation.
>>>> >> >
>>>> >> > First of all we need a good English to Malayalam dictionary in
>>>> >> > e-format, which gives the exact meaning, POS, etc. Not like one
>>>> >> > that just says Science - ശാസ്ത്രം, തര്ക്കശാസ്ത്രം.
>>>> >>
>>>> >> A POS-tagged dataset is just one component of a complete corpus.
>>>> >
>>>> > A POS-tagged corpus is a variety of corpus.
>>>> >
>>>> >>
>>>> >> Regards
>>>> >> Rajeev J Sebastian
>>>> >>
>>>> >
>>>> > --
>>>> > **********************************
>>>> > JAGANADH G
>>>> > http://jaganadhg.freeflux.net/blog
>>>> >
>>
>> --
>> My Feelings,Expressions-
>> http://logbookofanobserver.blogspot.com
>>
>> My scribblings-
>> http://logbookofanobserver.wordpress.com
>>
>> SMC : My computer, My language http://smc.org.in
>> സ്വതന്ത്ര മലയാളം കമ്പ്യൂട്ടിങ്ങ്, എന്റെ കമ്പ്യൂട്ടറിന് എന്റെ ഭാഷ
>>

--~--~---------~--~----~------------~-------~--~----~
"Freedom is the only law".
"Freedom Unplugged"
http://www.ilug-tvm.org
You received this message because you are subscribed to the Google
Groups "ilug-tvm" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For details visit the website: www.ilug-tvm.org or the google group page:
http://groups.google.com/group/ilug-tvm?hl=en
-~----------~----~----~----~------~----~------~--~---
