Well, we can definitely see that en-eo has a bad time translating
Turkish:

@*berryhuckle *yok *ben *sana *faceten ĉeıyıım *sen *yine *çok *güzel
*çıkm.ı*şsın *ben *yine *kötü *çıkm.ı*şım *ama *olsun *fotoğ*rafımızın
*olması *iyi *bişeyy :))

:)

F.

On Mon, 17 Mar 2014 at 14:18 -0400, Saurabh Hota wrote:
> Hi
> I have gone through the archives, and Akshay has a good data set of
> shortened words which can be used to learn which vowels are dropped.
> We also have to note that abbreviations and shortened forms are
> different, e.g. brb -> be right back versus bday -> birthday, so we
> have to handle them separately, and to do that we first have to
> classify them (a rough sketch of that split is below).
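>
> A rough sketch of that split, assuming a plain dictionary lookup; the
> word list and the names here are only illustrative, nothing is decided:
>
>     # Fixed abbreviations are expanded by lookup; anything else that is
>     # not in the dictionary is a candidate shortened form for the
>     # vowel-dropping model.
>     ABBREVIATIONS = {"brb": "be right back", "lol": "laughing out loud"}
>
>     def classify_nsw(token, dictionary):
>         word = token.lower()
>         if word in dictionary:
>             return "standard"
>         if word in ABBREVIATIONS:
>             return "abbreviation"
>         return "shortened"   # e.g. "bday", handled by the trained model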
> 
> 
> For translation I have just written a bash script:
>     while IFS= read -r line; do echo "$line" | apertium en-eo; done < Tweets
> 
> 
> Tweets and their translations.
> 
> 
> On Mon, Mar 17, 2014 at 6:15 AM, Francis Tyers <fty...@prompsit.com>
> wrote:
>         On Mon, 17 Mar 2014 at 13:46 +0530, Saurabh Hota wrote:
>         > Hi Francis
>         >
>         >
>         > I am Saurabh, a fourth-year undergraduate student majoring in
>         > Computer Science at the Indian Institute of Technology. I am
>         > interested in working on improving support for non-standard
>         > words (NSW).
>         >
>         >
>         > I have read some papers and have a vast collection of general
>         > tweets, from which I have observed that classification of
>         > non-standard words is important. For example, the top-level
>         > classification could be numerical versus alphabetical NSW, and
>         > these can be classified further and then handled separately.
>         >
>         >
>         > Below I am listing how to handle some of them (easy to hard):
>         > 1. Emoticons can be handled easily as they are very limited.
>         > 2. Repeated letters in a word (e.g. byeee) can be normalized by
>         >     reducing letters which occur three or more times to one or
>         >     two occurrences and checking whether the result is present
>         >     in the dictionary (a sketch of this follows the list).
>         > 3. Shortened words are difficult to handle, e.g. insti ->
>         >     institute, ur -> your. We can handle these if we assume
>         >     that the corpus also contains the exact standard word of
>         >     the shortened word. E.g. one sentence is
>         >          Institute building is far away.
>         >     and another sentence is
>         >          Today Insti building is being renovated
>         >     so from this we know that Insti might be a shortened form
>         >     of Institute with some probability, which can be calculated
>         >     assuming an n-gram model. This is an unsupervised method (a
>         >     rough sketch of the candidate generation is below).
>         >
>         >
>         >     Another way to handle this is: if we can get a training set
>         >     containing exact words and their shortened forms, then we
>         >     can train a model (the training method can be decided later,
>         >     but an easy choice is naive Bayes) of which letters are
>         >     generally dropped. Then for each word we can find its
>         >     probable shortened form.
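>         >
>         > A rough sketch of point 2; the three-repeat cut-off and the
>         > function name are only illustrative, nothing is decided yet:
>         >
>         >     import re
>         >
>         >     def squeeze_repeats(token, dictionary):
>         >         # Collapse runs of 3+ identical letters first to two,
>         >         # then to one, and keep the first variant found in the
>         >         # dictionary, e.g. "byeee" -> "bye".
>         >         for repl in (r"\1\1", r"\1"):
>         >             candidate = re.sub(r"(\w)\1{2,}", repl, token)
>         >             if candidate.lower() in dictionary:
>         >                 return candidate
>         >         return token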
>         >
>         >
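>         > And a rough sketch of the candidate generation from point 3: it
>         > only keeps dictionary words that contain the shortened token as
>         > a subsequence and ranks them by corpus frequency, a crude
>         > stand-in for the n-gram scoring above (names are illustrative):
>         >
>         >     from collections import Counter
>         >
>         >     def is_subsequence(short, word):
>         >         # "ur" is a subsequence of "your", "insti" of "institute"
>         >         letters = iter(word)
>         >         return all(ch in letters for ch in short)
>         >
>         >     def best_expansion(short, corpus_tokens):
>         >         # Rank candidate expansions by corpus frequency; fall
>         >         # back to the token itself when nothing matches.
>         >         counts = Counter(corpus_tokens)
>         >         candidates = [w for w in counts if is_subsequence(short, w)]
>         >         return max(candidates, key=counts.__getitem__, default=short)
>         >
>         >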
>         > There are many other types of NSW, but for now I am focused on
>         > the above ones. So, Sir, can you review these ideas and give
>         > some suggestions?
>         >
>         
>         
>         Hello, that sounds quite good! I recommend you take a look at
>         the mailing list archive to see what Karan and Akshay have come
>         up with, and come back to us with what you think.
>         
>         Also, which language did you translate the tweets to with
>         Apertium?
>         
>         F.
>         
>         
>         
>         
> 
> 



