Hello Francis,
As asked in the coding challenge, I have prepared a corpus of 100 sentences
containing non-standard text (from chat data and Twitter statuses).
Sample data:
https://docs.google.com/document/d/1fGFO6V-lKcvqgzaQRfxEfLWGXF6AxqTIODKdap9IL1c/edit?usp=sharing
I translated the corpus with the Apertium en-es translator and analyzed
the output.
Sample translation:
https://docs.google.com/document/d/1Mn83zon-gsGXbeIqRREglF6kHN10GRLWfYvrh3UG4XU/edit?usp=sharing
I have concluded that the following non-standard features are affecting
translation quality.
*1) Single-character words*
Example: r -> are, d -> the, m -> am, etc.
Proposal: Generally no more than 26 such cases are possible (one per
letter), so each can be mapped directly to its full word.
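
A minimal sketch of such a mapping in Python, assuming the input is
tokenized on whitespace. The table holds only the three examples above,
and the names SINGLE_CHAR_WORDS and expand_single_chars are hypothetical:

```python
# Lookup table limited to the examples in this mail; a real table
# would need more entries and some handling of ambiguous letters.
SINGLE_CHAR_WORDS = {"r": "are", "d": "the", "m": "am"}

def expand_single_chars(text: str) -> str:
    """Replace known single-character tokens with their full words."""
    tokens = text.split()
    return " ".join(SINGLE_CHAR_WORDS.get(tok.lower(), tok) for tok in tokens)

print(expand_single_chars("they r in d park"))  # they are in the park
```

Real words such as "a" and "I" are left out of the table on purpose, so
they pass through unchanged.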
*2) Extended words*
Example: lovveee -> love, byeeeee -> bye
Proposal: English words almost never contain three or more identical
consecutive characters, so such runs can be trimmed.
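
A rough sketch of the trimming step in Python. The function name
trim_runs and the keep parameter are hypothetical; how many characters to
keep per run (one or two) is an open choice that a dictionary lookup
would probably have to settle:

```python
import re

def trim_runs(word: str, keep: int = 2) -> str:
    """Shorten every run of three or more identical characters to `keep`."""
    return re.sub(r"(.)\1{2,}", lambda m: m.group(1) * keep, word)

print(trim_runs("byeeeee", keep=1))  # bye
print(trim_runs("coool", keep=2))    # cool
print(trim_runs("lovveee", keep=1))  # lovve
```

The last case shows that the rule alone does not recover "love": the
doubled "v" is a run of only two, so it would still need the spelling
correction step.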
*3) Smileys*
Example: :), ;), <3
Proposal: The set of smileys is limited, so each one can be replaced with
the emotion it expresses by means of a lookup map.
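
A small sketch of such a map in Python, assuming smileys appear as
separate whitespace-delimited tokens. The bracketed labels and the names
EMOTICON_LABELS and replace_emoticons are placeholders chosen for
illustration, not an agreed tag set:

```python
# Labels are illustrative assumptions; the smileys are the examples above.
EMOTICON_LABELS = {":)": "[happy]", ";)": "[wink]", "<3": "[love]"}

def replace_emoticons(text: str) -> str:
    """Swap each known smiley token for its emotion label."""
    tokens = text.split()
    return " ".join(EMOTICON_LABELS.get(tok, tok) for tok in tokens)

print(replace_emoticons("miss you <3"))  # miss you [love]
```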
*4) Dropped vowels*
Example: Bt -> But, Tht -> That, Lv -> Love
Proposal: Use a phonetic dictionary.
Vowels are dropped to shorten the word while keeping the pronunciation
roughly the same, so we can use a phonetic dictionary to map each word to
its shortened variants.
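
As a simplified stand-in for a real phonetic dictionary, the sketch below
indexes a vocabulary by its consonant skeleton (the word with a, e, i, o, u
removed). This is a swapped-in technique, not a phonetic encoding, and the
tiny vocabulary and the names skeleton and build_index are hypothetical:

```python
from collections import defaultdict

VOWELS = set("aeiou")

def skeleton(word: str) -> str:
    """Lower-case the word and drop its vowels."""
    return "".join(ch for ch in word.lower() if ch not in VOWELS)

def build_index(vocab):
    """Map each consonant skeleton to the vocabulary words that share it."""
    index = defaultdict(list)
    for word in vocab:
        index[skeleton(word)].append(word)
    return index

index = build_index(["but", "that", "love", "live"])
print(index.get(skeleton("Bt")))  # ['but']
print(index.get(skeleton("Lv")))  # ['love', 'live']
```

The second lookup returns two candidates, which suggests that context or
word frequency would be needed to choose between them.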
*5) Spelling errors (most difficult to correct)*
Example: Beautyful -> beautiful, lov -> love
Proposal: An FSM can be built from a dictionary, and each misspelled word
replaced with the dictionary word at minimal distance from it (for
example, edit distance).
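
To illustrate the minimal-distance idea, here is a sketch that swaps the
FSM for a brute-force scan over a word list using Levenshtein edit
distance. The choice of distance, the tiny vocabulary, and the names
edit_distance and closest_word are assumptions for illustration:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def closest_word(word: str, vocab):
    """Return the vocabulary word with the smallest edit distance."""
    return min(vocab, key=lambda w: edit_distance(word.lower(), w))

vocab = ["beautiful", "love", "bye"]
print(closest_word("Beautyful", vocab))  # beautiful
print(closest_word("lov", vocab))        # love
```

A linear scan like this would be too slow for a full dictionary, which is
where a finite-state approach would help.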
*6) Hashtags*
Examples: #MeganSoHot, #IndiaWin
Proposal: Most hashtags follow a pattern in which each capital letter
starts a new word, so they can be split at the capitals.
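
A short sketch of the capital-letter split in Python. The function name
split_hashtag is hypothetical, and the pattern assumes ASCII letters and
digits only:

```python
import re

def split_hashtag(tag: str):
    """Split a CamelCase hashtag into words at each capital letter."""
    return re.findall(r"[A-Z][a-z]*|[a-z]+|\d+", tag.lstrip("#"))

print(split_hashtag("#MeganSoHot"))  # ['Megan', 'So', 'Hot']
print(split_hashtag("#IndiaWin"))    # ['India', 'Win']
```

All-lowercase tags come back as a single word, and acronyms are split
letter by letter, so those cases would need separate handling.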
Abbreviations and numbers also cause problems sometimes, and they are
hard to handle. I would suggest recognizing them and transliterating
them instead.
While doing a literature survey, I went through the following articles.
http://www.cs.columbia.edu/~julia/papers/sproatetal01.pdf
People also tend to misspell words, so this will also involve a spell
checker, along with a word list that is maintained and extended over time.
https://docs.google.com/viewer?url=patentimages.storage.googleapis.com/pdfs/US5604897.pdf
Am I thinking in the right direction?
It would be great if you could guide me further.
Also, could I get commit rights? I have implemented a basic script that
pre-processes the input and handles the easy cases, as well as a basic
cleaning script that true-cases the data, since casing also caused
problems in the translation.
Hoping for a reply soon,
Karan Singla
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff