Hi Francis

I am Saurabh, a fourth-year undergraduate student majoring in Computer
Science at the Indian Institute of Technology. I am interested in working
on improving support for non-standard words (NSW).

I have read some papers and have a large collection of general tweets,
from which I have observed that classifying non-standard words is
important. For example, the top-level classification could be numerical
versus alphabetical NSWs; these classes can be subdivided further and
then each handled separately.
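
To make the top-level split concrete, here is a rough sketch in Python.
The function name and the rule "any digit makes a token numerical" are my
own assumptions for illustration, not something taken from the papers.

```python
def classify_nsw(token: str) -> str:
    """Coarse top-level class of a token.

    Assumption (hypothetical rule): a token containing any digit is
    'numerical', a purely alphabetic token is 'alphabetical', and
    everything else (e.g. emoticons) is 'other'.
    """
    if any(ch.isdigit() for ch in token):
        return "numerical"
    if token.isalpha():
        return "alphabetical"
    return "other"
```

Each class could then be passed to its own, more specific handler.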

Below I am listing how to handle some of them (easy to hard):
1. Emoticons can be handled easily as they are very limited.
2. Repeated letters in a word (e.g. byeee) can be normalized by reducing
    any letter that occurs more than 3 times to 1 or 2 occurrences and
    checking whether the result is present in the dictionary.
3. Shortened words are difficult to handle, e.g. insti -> institution,
    ur -> your.
    We can handle these if we assume that the corpus also contains the
    standard form of each shortened word. E.g. one sentence is
          *Institute building is far away.*
    and another sentence is
          *Today Insti building is been renovated*
    From this we know that *Insti* might be a shortened form of
    *Institute* with some probability, which can be calculated assuming
    an n-gram model. This is an unsupervised method.

    Another way to handle this: if we can get a training set containing
    standard words and their shortened forms, then we can train a model
    (the training method can be decided later, but an easy choice would
    be naive Bayes) to learn which letters are generally dropped. Then
    for each word we can find its probable shortened forms.
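
Points 2 and 3 (the unsupervised variant) could be sketched roughly as
below. This is only an illustration under my own assumptions: the
function names are made up, `min_run` defaults to 3 so that the "byeee"
example is covered (use 4 for strictly "more than 3"), a shortened form
is assumed to be a prefix of the standard word (so "ur" -> "your" is not
covered), and a simple shared-neighbour score stands in for a real
n-gram probability.

```python
import itertools
import re
from collections import Counter


def collapse_candidates(word, min_run=3):
    """Yield variants of `word` with each long letter run cut to 1 or 2 letters."""
    options = []
    for ch, group in itertools.groupby(word):
        n = len(list(group))
        # Long runs may become one or two letters; short runs stay as they are.
        options.append([ch, ch * 2] if n >= min_run else [ch * n])
    for combo in itertools.product(*options):
        yield "".join(combo)


def normalize_repeats(word, dictionary, min_run=3):
    """Return the first collapsed variant found in `dictionary`, else `word`."""
    for cand in collapse_candidates(word.lower(), min_run):
        if cand in dictionary:
            return cand
    return word


def context_overlap(short, sentences):
    """Score corpus words that start with `short` by shared left/right neighbours.

    The score is the fraction of the short form's neighbour counts that the
    candidate also has. It is a crude stand-in for an n-gram probability.
    """
    neighbours = {}
    for sent in sentences:
        toks = ["<s>"] + re.findall(r"[a-z']+", sent.lower()) + ["</s>"]
        for prev, word, nxt in zip(toks, toks[1:], toks[2:]):
            ctx = neighbours.setdefault(word, Counter())
            ctx[("L", prev)] += 1
            ctx[("R", nxt)] += 1
    short = short.lower()
    short_ctx = neighbours.get(short, Counter())
    total = sum(short_ctx.values())
    scores = {}
    for word, ctx in neighbours.items():
        # Assumption: the shortened form is a prefix of the standard word.
        if word != short and word.startswith(short) and total:
            scores[word] = sum((short_ctx & ctx).values()) / total
    return scores
```

With the two example sentences above, `context_overlap("Insti", ...)`
gives "institute" a score of 0.5, because both words are followed by
"building" but preceded by different tokens.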

There are many other types of NSW, but for now I am focusing on the
ones above. So, Sir, could you review these ideas and give some
suggestions?

Thank you
Saurabh
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
