Re: [Apertium-stuff] Reg : GSOC - "Improving support for non-standard text input"

Francis Tyers Tue, 11 Mar 2014 02:46:27 -0700

On Tue, 11 Mar 2014 at 07:45 +0530, karan singla wrote:
> Hello Francis,
> 
> As asked in the coding challenge, I have prepared a corpus of 100
> sentences containing non-standard text (from chat data and twitter
> status).
> 
> Sample data : 
> https://docs.google.com/document/d/1fGFO6V-lKcvqgzaQRfxEfLWGXF6AxqTIODKdap9IL1c/edit?usp=sharing
> 
> I ran these sentences through the Apertium en-es translator and
> analyzed the output.
> 
> Sample Translation
> https://docs.google.com/document/d/1Mn83zon-gsGXbeIqRREglF6kHN10GRLWfYvrh3UG4XU/edit?usp=sharing
> 
> 
> I have concluded that the following non-standard features are
> affecting translation quality.
> 1) Single character words
> Example: r -> are, d-> the, m->am etc.
> Proposal: At most 26 such cases are possible, so each can be mapped
> to its full word.


Could you think of any ambiguities here ?

> 2) Extended words
> Example:  lovveee ->love, byeeeee->bye
> Proposal: In standard spelling the same character does not occur
> three or more times in a row, so trim such runs.

This is a nice idea! But you would still end up with 'byee' and
'lovvee'. Example from Portuguese:

  SR store - Nossssaaa!! Ta muito chique hein!!!!! | Facebook

Fixed: "Nossa! Está  muito chique    hein!" 
        Wow!   It.is very  cool/chic no/hey! 

Here "nossa" is an interjection, but it could also be a possessive
adjective "a nossa casa" (our house)

So, trimming >2 letters would work, but in many cases you will be left
with two where you might need 1.
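
A minimal sketch of that trimming step, to make the leftover problem
concrete. The function name trim_runs is just for illustration, and
collapsing to exactly two characters is the assumption taken from the
proposal above:

```python
import re

# A run of three or more identical characters (one char, then 2+ repeats).
_LONG_RUN = re.compile(r"(.)\1{2,}")


def trim_runs(word: str) -> str:
    """Collapse every run of 3+ identical characters down to two."""
    return _LONG_RUN.sub(r"\1\1", word)


for w in ["byeeeee", "lovveee", "Nossssaaa"]:
    print(w, "->", trim_runs(w))
```

Note that "lovveee" keeps its double "v", because a run of two is never
touched, and "byeeeee" still ends in "ee".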

> 3) Smileys 
> Example: :) , ;), <3
> Proposals: Can be replaced with the emotion by creating a map as they
> are limited. 

Good idea

> 4) Vowels Drop
> Example: Bt->But, Tht -> That, Lv->Love
> Proposal: Using Phonetic Dictionary 
> Vowels are dropped to make the word short keeping the pronunciation
> same. So we can use a phonetic dictionary and map each word with its
> trimmed variations.
> 
> 5) Spelling Error { Most difficult to correct }
> Example: Beautyful->beautiful, lov->love
> Proposal: An FSM can be built from a dictionary, and each such word
> can be replaced with the dictionary word at minimal distance from it.

Can you think of a way of estimating confidence for replacements ?
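
One possible starting point, as a rough sketch only: normalise a string
similarity to the range 0..1 and treat it as a crude confidence score,
discarding anything under a threshold. This uses difflib from the Python
standard library; the word list, the function name rank_candidates and
the 0.8 cutoff are all made up for illustration, and a real score would
probably also want word frequency or context.

```python
from difflib import SequenceMatcher
from typing import Iterable, List, Tuple


def rank_candidates(word: str, vocab: Iterable[str],
                    cutoff: float = 0.8) -> List[Tuple[str, float]]:
    """Return (candidate, score) pairs with score >= cutoff, best first.

    The score is SequenceMatcher.ratio(), a similarity between 0 and 1.
    It is only a stand-in for a proper confidence estimate.
    """
    scored = []
    for cand in vocab:
        score = SequenceMatcher(None, word.lower(), cand.lower()).ratio()
        if score >= cutoff:
            scored.append((cand, score))
    # Highest similarity first; ties broken alphabetically for stability.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


# Hypothetical toy vocabulary, not a real dictionary.
VOCAB = ["beautiful", "bountiful", "love", "live", "that", "but"]

for w in ["Beautyful", "lov"]:
    print(w, rank_candidates(w, VOCAB))
```

If the best candidate scores well above the runner-up, the replacement
is relatively safe; if two candidates score about the same, it is
probably better to keep both and decide later.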

> 6) Hash Tags
> Examples: #MeganSoHot, #IndiaWin
> Proposal: These words most of the time follow a pattern where each
> capital character separates a new word. 

These could also be IRC channel names; they could probably mostly be
handled with a regular expression.
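
Something along these lines, as a sketch: split the tag body on capital
letters, and leave it as a single token when there is nothing to split
on (which is what an all-lowercase IRC channel name looks like). The
patterns and the name split_hashtag are illustrative, not taken from any
existing Apertium code.

```python
import re
from typing import List

# A whole token of the form "#" followed by word characters.
_HASHTAG = re.compile(r"#(\w+)")
# Pieces of a CamelCase string: an acronym, a (capitalised) word, or digits.
_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_hashtag(token: str) -> List[str]:
    """Split '#MeganSoHot' into ['Megan', 'So', 'Hot'].

    Tokens that are not hashtags are returned unchanged as a
    one-element list.
    """
    m = _HASHTAG.fullmatch(token)
    if m is None:
        return [token]
    body = m.group(1)
    return _PART.findall(body) or [body]


for t in ["#MeganSoHot", "#IndiaWin", "#apertium", "hello"]:
    print(t, "->", split_hashtag(t))
```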

> 
> Abbreviations and numbers also make things difficult sometimes, but
> they are hard to handle. I would suggest that we recognize those and
> transliterate them.
> 
> 
> While doing a literature survey, I went through the following articles.
> 
> http://www.cs.columbia.edu/~julia/papers/sproatetal01.pdf
> 
> People also tend to misspell words, so this will also involve a
> spell checker, plus maintaining its word list and adding words to it
> over time.
> 
> https://docs.google.com/viewer?url=patentimages.storage.googleapis.com/pdfs/US5604897.pdf
> 
> 
> Am I thinking in the right direction ?

Yes, you are definitely thinking in the right direction. This is great
work.

I'm beginning to think that the way to solve the problem is in two
stages... the first stage will ambiguate the input:

^Nossssaaa/nossaa/nossa/nosa/nosaa$
^!!/!!$
^Ta/está/tá/ta$
^muito/muito$
^chique/chique$ 
^hein/hein$
^!!!!!/!!!!!$

Format: ^original/candidate1/candidate2/candidate3$

Then in a second stage we can trim down the possibilities, with either a
statistical model or rules, or both. What do you think ? So one way to
trim them would be just to pass the possibilities through the
morphological dictionary, so that would trim "non-words" (e.g. nossaa,
nosa, nosaa from the above).

> It will be great, if you could guide me more. 

Perhaps this would be a nice easy thing to implement to complete the
coding challenge.

Write a program that takes input and produces candidates in which every
run of three or more identical letters is reduced to one or two letters,
and then write a second program which checks those candidates against a
morphological dictionary.

Input program 1:

Nossssaaa
!!
Ta
muito
chique
hein
!!!!!

Output program 1 / input program 2:

^Nossssaaa/nossaa/nossa/nosa/nosaa$
^!!/!!$
^Ta/está/tá/ta$
^muito/muito$
^chique/chique$ 
^hein/hein$
^!!!!!/!!!!!$
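
A rough sketch of what program 1 could look like. It only handles the
repeated-letter case, so it would not produce "está" or "tá" for "Ta";
that needs some other resource. Lowercasing the candidates, the function
names, and the order in which candidates come out are assumptions of
this sketch.

```python
import itertools
import re
from typing import List

# Any maximal run of one repeated character.
_RUN = re.compile(r"(.)\1*", re.DOTALL)


def candidates(token: str) -> List[str]:
    """Lowercased variants with each run of 3+ letters cut to 2 or 1."""
    options = []
    for m in _RUN.finditer(token.lower()):
        run, ch = m.group(0), m.group(1)
        if ch.isalpha() and len(run) >= 3:
            options.append((ch * 2, ch))   # keep two letters, or one
        else:
            options.append((run,))         # short runs and punctuation stay
    result: List[str] = []
    for combo in itertools.product(*options):
        cand = "".join(combo)
        if cand not in result:
            result.append(cand)
    return result


def ambiguate(token: str) -> str:
    """Format one token as ^original/candidate1/candidate2$."""
    return "^" + "/".join([token] + candidates(token)) + "$"


for tok in ["Nossssaaa", "!!", "Ta", "muito", "chique", "hein", "!!!!!"]:
    print(ambiguate(tok))
```

Each long run doubles the number of candidates, so a token with many
stretched letters could blow up; a cap on the number of runs expanded
might be needed in practice.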

Output program 2:

^Nossssaaa/nossa$
^!!/!!$
^Ta/ta$
^muito/muito$
^chique/chique$ 
^hein/hein$
^!!!!!/!!!!!$
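
And a sketch of program 2 under a big simplifying assumption: the
morphological dictionary is replaced by a plain Python set of known
surface forms (LEXICON below is a made-up toy list). A real version
would presumably ask the language pair's compiled analyser whether each
candidate is a known word. Keeping all candidates when none are
recognised is also a choice of this sketch, so that punctuation
survives.

```python
from typing import List, Set, Tuple


def parse(line: str) -> Tuple[str, List[str]]:
    """Split '^orig/c1/c2$' into ('orig', ['c1', 'c2'])."""
    body = line.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("not in ^original/candidate$ format: %r" % line)
    original, *cands = body[1:-1].split("/")
    return original, cands


def prune(line: str, lexicon: Set[str]) -> str:
    """Drop candidates that are not in the lexicon."""
    original, cands = parse(line)
    kept = [c for c in cands if c in lexicon]
    if not kept:
        # Nothing recognised (e.g. punctuation): keep the candidates as is.
        kept = cands
    return "^" + "/".join([original] + kept) + "$"


# Toy stand-in for a morphological dictionary (hypothetical contents).
LEXICON = {"nossa", "ta", "muito", "chique", "hein"}

for line in ["^Nossssaaa/nossaa/nossa/nosa/nosaa$", "^!!/!!$",
             "^Ta/está/tá/ta$", "^muito/muito$"]:
    print(prune(line, LEXICON))
```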

> Also can I get commit rights, as I have implemented a basic script to
> pre-process the input that can handle easy cases, also a basic
> cleaning script to make the data true-cased as that also caused a
> problem in the translation.

What is your SF username ?

Fran



