Re: [Apertium-stuff] Improving support for non-standard text input

Francis Tyers Mon, 17 Mar 2014 03:16:30 -0700

On Mon, 17 Mar 2014 at 13:46 +0530, Saurabh Hota wrote:
> Hi Francis
> 
> 
> I am Saurabh, a fourth-year undergraduate student majoring in Computer
> Science at Indian Institute of Technology. I am interested in working on
> improving support for non-standard words (NSW).
> 
> 
> I have read some papers and have a large collection of general tweets.
> From these I have observed that classifying non-standard words is
> important. For example, the top-level classification could be numerical
> versus alphabetical NSW; these can be classified further, and each class
> can then be handled separately.
> 
> 
> Below I list how to handle some of them (easy to hard):
> 1. Emoticons can be handled easily, as there is only a limited set of them.
> 2. Repeated letters in a word (e.g. byeee) can be normalized by reducing
>    any letter that occurs more than 3 times to 1 or 2 occurrences and
>    checking whether the result is present in the dictionary.
> 3. Shortened words are difficult to handle, e.g. insti -> institution,
>    ur -> your.
>    We can handle this if we assume that the corpus also contains the
>    standard form of the shortened word. E.g. one sentence is
>         Institute building is far away.
>    and another sentence is
>         Today Insti building is been renovated
>    From this we know that Insti might be a shortened form of Institute
>    with some probability, which can be calculated assuming an n-gram
>    model. This is an unsupervised method.
> 
> 
>    Another way to handle this: if we can get a training set containing
>    standard words and their shortened forms, then we can train a model
>    (the training method can be decided later, but an easy choice is
>    naive Bayes) to learn which letters are generally dropped.
>    Then for each word we can find its probable shortened forms.
> 
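
The repeated-letter idea in item 2 could be sketched roughly as follows.
This is only an illustration: LEXICON is a made-up stand-in for a real
dictionary lookup, and the threshold of three or more repeats is an
assumption chosen so that the byeee example is covered.

```python
import re
from itertools import product

# Hypothetical stand-in for a real dictionary lookup.
LEXICON = {"bye", "good", "cool", "see"}

# Matches one run of identical characters, e.g. "eee" in "byeee".
RUN = re.compile(r"(.)\1*")


def candidates(word):
    """Yield variants of `word` with each run of 3+ letters cut to 2 or 1."""
    options = []
    for match in RUN.finditer(word):
        run = match.group(0)
        if len(run) >= 3:
            # Assumption: try the double letter first, then the single one.
            options.append([run[0] * 2, run[0]])
        else:
            options.append([run])
    for combo in product(*options):
        yield "".join(combo)


def normalise(word, lexicon=LEXICON):
    """Return the first candidate found in the lexicon, else the word itself."""
    for cand in candidates(word.lower()):
        if cand in lexicon:
            return cand
    return word


print(normalise("byeee"))
print(normalise("goooood"))
```

Words with several elongated runs produce 2^k candidates for k runs, which
should stay small for tweet-length tokens.
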
> 
> There are many other types of NSW, but for now I am focused on the ones above.
> So, could you please review these ideas and give some suggestions?
>
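
For the unsupervised idea in item 3, a very rough sketch might look like
the code below. It is not a real n-gram probability model: it only counts
neighbouring words that a shortened form and a longer word have in common,
and it assumes the short form's letters appear in order in the full word.
All function names and the toy lexicon are made up for illustration.

```python
from collections import Counter, defaultdict


def contexts(sentences):
    """Map each lower-cased token to a Counter of its left/right neighbours."""
    ctx = defaultdict(Counter)
    for sent in sentences:
        toks = ["<s>"] + sent.lower().split() + ["</s>"]
        for i in range(1, len(toks) - 1):
            ctx[toks[i]][("L", toks[i - 1])] += 1
            ctx[toks[i]][("R", toks[i + 1])] += 1
    return ctx


def is_subsequence(short, full):
    """True if the letters of `short` occur in `full` in the same order."""
    remaining = iter(full)
    return all(ch in remaining for ch in short)


def expansions(short, ctx, lexicon):
    """Rank longer lexicon words that share neighbouring words with `short`."""
    short = short.lower()
    short_ctx = ctx.get(short, Counter())
    scores = {}
    for word in lexicon:
        if len(word) <= len(short) or not is_subsequence(short, word):
            continue
        # Counter intersection keeps the minimum count of each shared context.
        shared = sum((short_ctx & ctx.get(word, Counter())).values())
        if shared:
            scores[word] = shared
    return sorted(scores.items(), key=lambda kv: -kv[1])


corpus = [
    "Institute building is far away.",
    "Today Insti building is been renovated",
]
lexicon = {"institute", "institution", "instant", "building"}
print(expansions("insti", contexts(corpus), lexicon))
```

With the two example sentences from the mail, only "institute" shares a
context ("building" on the right) with "insti", so it is the only candidate
returned. A real system would need proper smoothing and a much larger corpus.
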


Hello, that sounds quite good! I recommend you take a look at the
mailing list archive to see what Karan and Akshay have come up with, and
come back to us with what you think.

Also, which language did you translate the tweets to with Apertium?

F.



_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
