Re: [smc-discuss] Re: [fsug-tvm] Re: english malayalam translator

JAGANADH G Sun, 26 Jul 2009 23:53:46 -0700

On Mon, Jul 27, 2009 at 11:33 AM, jinesh kj <[email protected]> wrote:


> hi all,
>
> Machine Translation is one of the toughest language computing problems, and
> new ideas and approaches come up every year. The Ministry of Communication and
> Information Technology is spending a lot of money on the project (along with
> some other projects). An M.T. system for Malayalam is being developed by Tamil
> University, Thanjavur. From what I understand, they are using a corpus-based
> approach, tailored for a set of sentences rather than a generic algorithm.
>
Yes, I know about this. The Thanjavur people are working on Tamil <-> Malayalam
machine translation. They are customizing the Anusaaraka approach developed by
the Akshar Bharati group. That system is a language acquisition system rather
than an MT system (in the original developers' view). The system's algorithm
has its own advantages and limitations. A group of C-DAC people is also
involved in English to Indian languages translation (including Malayalam). I
don't know whether any of these systems are open or not, which is why I did
not mention their names.


>
> When I talked to a friend, he pointed out a few things, such as the need to
> think about deviations from the base grammar rules when designing a system for
> real translation. I think whatever we do, the translation process will remain
> the same (remove all agglutination, identify key words and their POS, and use
> that information to translate). Sandhi splitting and POS tagging are the
> important steps to tackle, in my view.
>
More clearly: source language sentence -> parsing (for pattern identification)
-> conversion to the target language syntactic pattern -> target language text
generation. This is the broad block view of an MT system. Whether a POS tagger
should be there depends on your design.
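
The block view above can be sketched in a few lines of Python. This is only a
toy illustration under loud assumptions, not any of the systems mentioned in
this thread: the function names, the three-word "subject verb object" parser,
and the romanized three-entry lexicon are all hypothetical, and no morphology
is handled at all.

```python
from dataclasses import dataclass

ARTICLES = {"a", "an", "the"}


@dataclass
class Clause:
    subject: str
    verb: str
    obj: str


def parse(sentence: str) -> Clause:
    """Pattern identification: accept only 'subject verb object' sentences."""
    words = [w for w in sentence.lower().rstrip(".").split() if w not in ARTICLES]
    if len(words) != 3:
        raise ValueError("toy parser handles only subject-verb-object sentences")
    return Clause(*words)


def transfer(clause: Clause) -> list:
    """Map the English SVO pattern to the SOV order used by Malayalam."""
    return [clause.subject, clause.obj, clause.verb]


def generate(ordered: list, lexicon: dict) -> str:
    """Target text generation by word-for-word lookup (no morphology)."""
    return " ".join(lexicon[word] for word in ordered)


def translate(sentence: str, lexicon: dict) -> str:
    return generate(transfer(parse(sentence)), lexicon)


# Hypothetical romanized toy lexicon, for illustration only.
TOY_LEXICON = {"child": "kutti", "book": "pustakam", "reads": "vaayikkunnu"}

print(translate("The child reads a book.", TOY_LEXICON))
```

Each stage is a separate function so that a real parser, transfer grammar, or
morphological generator could replace the toy one without touching the others.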
The harder part in Indian language to Indian language translation (from my
experience) is morphological analysis as well as Sandhi splitting. Some sort of
heuristics is required for Sandhi splitting. Computing Kerala Paniniyam alone
will not solve the problem. Even for Sanskrit, where extensive Sandhi rules
exist, people engaged in Sanskrit computing call it a baffling problem. A
Sandhi splitter is a required component of a morphological analyzer, and the
Sandhi splitter in turn requires a morphological analyzer (a kind of deadlock).
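
One common way around that deadlock is to let the splitter over-generate
candidate splits and let the analyzer filter them. The Python sketch below
illustrates the idea under stated assumptions: the three undo-rules are heavily
simplified romanized approximations (glide insertion, consonant doubling, and
elision of a final "u"), not the rule set of Kerala Paniniyam, and a plain word
set stands in for a real morphological analyzer. All names are hypothetical.

```python
VOWELS = set("aeiou")


def candidate_splits(word: str):
    """Yield (left, right) pairs by undoing simplified sandhi rules."""
    for i in range(1, len(word)):
        # Plain concatenation: no sound change at the junction.
        yield word[:i], word[i:]
        has_next = i + 1 < len(word)
        # Glide insertion: vowel + vowel surfaces with a 'y' in between.
        if has_next and word[i] == "y" and word[i - 1] in VOWELS and word[i + 1] in VOWELS:
            yield word[:i], word[i + 1:]
        # Doubling: the first consonant of the second word is geminated.
        if has_next and word[i] == word[i + 1] and word[i] not in VOWELS:
            yield word[:i], word[i + 1:]
        # Elision: a final 'u' is dropped before a following vowel.
        if word[i] in VOWELS:
            yield word[:i] + "u", word[i:]


def split_sandhi(word: str, lexicon: set) -> list:
    """Keep only the candidates whose two parts the 'analyzer' accepts."""
    results = []
    for left, right in candidate_splits(word):
        if left in lexicon and right in lexicon and (left, right) not in results:
            results.append((left, right))
    return results


# Tiny romanized lexicon standing in for a morphological analyzer.
LEXICON = {"mazha", "alla", "kandu", "illa", "thaamara", "kulam"}

for surface in ("mazhayalla", "kandilla", "thaamarakkulam"):
    print(surface, "->", split_sandhi(surface, LEXICON))
```

The splitter never decides on its own; it only proposes, and the lexicon check
decides. A real system would replace the set lookup with a call to the
analyzer and would need ranking when several candidates survive.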

>
> Maybe Jagan, Santhosh, Rajeev and all can add more to this. From what I
> understand, a normal rule-based system won't work that well for Malayalam,
> since rules are not much followed in the normal writing scheme (both are
> the right kind of approach).
>
If somebody is really interested, we can build a small system within one year.
I will share the plan within a day or two.


>
> cheers
>
> Jinesh K J
>
>
> On Mon, Jul 27, 2009 at 10:26 AM, JAGANADH G <[email protected]> wrote:
>
>> If you are really interested, drop me a mail. Are you familiar with Perl
>> programming?
>>
>>
>> On Sun, Jul 26, 2009 at 10:29 PM, Varewoolf <[email protected]> wrote:
>>
>>>
>>> So what might be the next step?
>>>
>>> On Sat, Jul 25, 2009 at 10:31 AM, JAGANADH G<[email protected]> wrote:
>>> >
>>> >
>>> > On Sat, Jul 25, 2009 at 12:41 AM, Rajeev J Sebastian
>>> > <[email protected]> wrote:
>>> >>
>>> >> On Fri, Jul 24, 2009 at 7:02 PM, JAGANADH G<[email protected]> wrote:
>>> >> >
>>> >> >
>>> >> > On Fri, Jul 24, 2009 at 5:29 PM, Rajeev J Sebastian
>>> >> > <[email protected]> wrote:
>>> >> >>
>>> >> >> On Fri, Jul 24, 2009 at 5:19 PM, Varewoolf<[email protected]> wrote:
>>> >> >> >
>>> >> >> > I am very much interested in making this happen... I am always
>>> >> >> > interested in linguistics...
>>> >> >> > Can anybody tell me what things we need primarily?
>>> >> >>
>>> >> >> How about ...
>>> >> >>
>>> >> >> 1) 50+ years of research (actually, 2000 if you consider Panini)
>>> >> >
>>> >> > Is it history? If you work hard, you can remove a zero from it.
>>> >>
>>> >> Huh ?
>>> >>
>>> >> >>
>>> >> >> 2) Extremely large corpus ... if you want to make a practical system
>>> >> >
>>> >> > Only if you adopt a corpus-based model. That is not going to be
>>> >> > practical right now in the case of English to Malayalam translation.
>>> >>
>>> >> It is not practical to make *anything* without a corpus. Even if you
>>> >> use a non-corpus based methodology to perform translation, you still
>>> >> need a large corpus to *validate* that your method works for more than
>>> >> toy examples. This is the biggest problem that faces any NLP work for
>>> >> Indic languages, and one that some glorified institutions in India
>>> >> neither build up nor share, most probably because all their systems
>>> >> are capable of is translating toy examples.
>>> >
>>> > I know that there are non-free systems under development which are more
>>> > advanced than the Google Translate service (English-Hindi). But I don't
>>> > know when they will release them.
>>> >
>>> >>
>>> >> >>
>>> >> >> 3) Large and talented team good in computational linguistics
>>> >> >
>>> >> > Where is it? We can build this up.
>>> >>
>>> >> Best of Luck.
>>> >>
>>> >> >>
>>> >> >> 4) a very practical theory that can model language effectively for
>>> >> >> your purposes (seriously lacking for even small use cases in even
>>> >> >> major languages)
>>> >> >
>>> >> > A perfect grammar for Malayalam is required, especially in syntax and
>>> >> > morphology. Malayalam really lacks such studies.
>>> >>
>>> >> I don't think any language has such an in-depth model that could be
>>> >> used for generic MT. There are of course, special case models ...
>>> >> which can be used for special cases.
>>> >
>>> > The Sanskrit grammar is a perfect model.
>>> >
>>> >>
>>> >> >>
>>> >> >> 5) since you want to do MT, you need one more theory to handle the
>>> >> >> target language ... maybe even an IL model if you go that route
>>> >> >> instead of direct translation.
>>> >> >
>>> >> > First of all we need a good English to Malayalam dictionary in
>>> >> > e-format, which gives the exact meaning, POS, etc. Not like one that
>>> >> > just says Science - ശാസ്ത്രം, തര്‍ക്കശാസ്ത്രം and the like.
>>> >>
>>> >> POS tagged dataset is just one component of a complete corpus.
>>> >
>>> > POS Tagged corpus is a variety of corpus.
>>> >
>>> >>
>>> >> Regards
>>> >> Rajeev J Sebastian
>>> >>
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > **********************************
>>> > JAGANADH G
>>> > http://jaganadhg.freeflux.net/blog
>>> >
>>> > >
>>> >
>>>
>>>
>>>
>>
>>
>> --
>> **********************************
>> JAGANADH G
>> http://jaganadhg.freeflux.net/blog
>>
>>
>>
>
>
> --
> My Feelings,Expressions-
> http://logbookofanobserver.blogspot.com
>
> My scribblings-
> http://logbookofanobserver.wordpress.com
>
> SMC : My computer, My language http://smc.org.in
> സ്വതന്ത്ര മലയാളം കമ്പ്യൂട്ടിങ്ങ്, എന്റെ കമ്പ്യൂട്ടറിന് എന്റെ ഭാഷ
>
> >
>


-- 
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog

