---------- Forwarded message ----------
From: "karan singla" <[email protected]>
Date: 11-Mar-2014 8:40 PM
Subject: Re: Reg : GSOC - "Improving support for non-standard text input"
To: <[email protected]>
Cc:
Hello,
Surely, there will be ambiguities for the "single-char" words, but I still
need to go through a big corpus. I think those cases can be handled
using a simple n-gram language model (see the sketch further down).
Regarding misspelled words (which have some characters missing), as I
suggested, we can build an FSM and return the word which has the least edit
distance from the misspelled word,
like: lov ==> love
Do you have a better way in mind?
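Here is a minimal sketch of that idea, using a brute-force dictionary scan
in place of a real FSM ("dic" is assumed to be a list of valid words, as in
corrected.py below):

def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def closest_word(word, dic):
    # return the dictionary word with the least edit distance
    return min(dic, key=lambda w: edit_distance(word, w))

# closest_word("lov", ["love", "live", "lot"]) ==> "love"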
And as you suggested for extended words, I wrote that script
(extended_words.py <input file>). But I haven't seen real words in which a
letter is repeated more than 2 times consecutively, so my script works like
this:
sample input:
whyyyyy dooo yyyoouuu loveee meee
sample output:
whyyyyy\whyy\why
dooo\doo\do
yyyoouuu\yyoouu\yyoou\yyouu\yyou\yoouu\yoou\youu\you
loveee\lovee\love
meee\mee\me
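By the way, the pre-trimming step in these scripts (collapsing any run of 3
or more identical letters down to exactly 2 before generating variants) can
also be written as one regular expression:

import re

def pretrim(word):
    # collapse any run of 3+ identical letters down to exactly 2
    return re.sub(r'(.)\1{2,}', r'\1\1', word)

# pretrim("whyyyyy")  ==> "whyy"
# pretrim("yyyoouuu") ==> "yyoouu"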
I also wrote another basic script ( corrected.py <dictionary> <input> )
that checks whether any of the candidate corrections is in the dictionary;
it also maps those "single_char" words to their full forms.
sample input :
y dooo yyyoouuu loooveee meee
sample output :
why do you love me
(you will need a dictionary file to run it, one word per line), e.g.:
I
am
going
...
There can be cases in which multiple corrections are valid,
e.g.: meeeettttttt ===> both "met" and "meet" can come out,
so they can be disambiguated based on a language model.
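For instance, a minimal sketch of that step with a bigram model; the count
tables here are hypothetical and would really be estimated from a large
corpus:

bigram_count = {("we", "met"): 40, ("we", "meet"): 25}  # hypothetical counts
unigram_count = {"we": 100}
V = 10000  # assumed vocabulary size, for add-one smoothing

def bigram_prob(prev_word, word):
    # P(word | prev_word) with add-one smoothing
    return (bigram_count.get((prev_word, word), 0) + 1) / \
           float(unigram_count.get(prev_word, 0) + V)

def pick_candidate(prev_word, candidates):
    # keep the candidate the language model prefers
    return max(candidates, key=lambda c: bigram_prob(prev_word, c))

# pick_candidate("we", ["met", "meet"]) ==> "met"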
Also, I found a POS tagger for tweets which gives POS tags for non-standard
words; can we include such tools as part of the module? They can be really
helpful. For e.g.: "i loveee gng 2 gym" - a rule can be made to change "2"
to "to" when it follows a verb, because we can't really correct ALL the
misspelled words; there is far too much noise in data we haven't seen.
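A minimal sketch of such a rule; the (word, tag) input format and the tag
names ("V" for verb, etc.) are assumptions about what the tweet tagger
emits:

def apply_2_to_rule(tagged):
    # rewrite "2" as "to" when it directly follows a verb
    out = []
    for i, (word, tag) in enumerate(tagged):
        if word == "2" and i > 0 and tagged[i - 1][1] == "V":
            out.append(("to", tag))
        else:
            out.append((word, tag))
    return out

# apply_2_to_rule([("i", "O"), ("loveee", "V"), ("gng", "V"),
#                  ("2", "NUM"), ("gym", "N")])
# ==> [("i", "O"), ("loveee", "V"), ("gng", "V"), ("to", "NUM"), ("gym", "N")]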
My SourceForge username is "ksingla025".
Hoping for a reply soon.
Regards,
Karan
On Tue, Mar 11, 2014 at 3:15 PM, Francis Tyers <[email protected]> wrote:
> On Tue, 11 Mar 2014 at 07:45 +0530, karan singla wrote:
> > Hello Francis,
> >
> > As asked in the coding challenge, I have prepared a corpus of 100
> > sentences containing non-standard text (from chat data and twitter
> > status).
> >
> > Sample data :
> >
> https://docs.google.com/document/d/1fGFO6V-lKcvqgzaQRfxEfLWGXF6AxqTIODKdap9IL1c/edit?usp=sharing
> >
> > I have used the Apertium en-es translator and analyzed the output.
> >
> > Sample Translation
> >
> https://docs.google.com/document/d/1Mn83zon-gsGXbeIqRREglF6kHN10GRLWfYvrh3UG4XU/edit?usp=sharing
> >
> >
> > I have concluded that the following non-standard features are
> > affecting translation quality.
> > 1) Single character words
> > Example: r -> are, d-> the, m->am etc.
> > Proposal: Generally, not more than 26 such cases are possible, so they
> > can be mapped to the original words.
>
> Could you think of any ambiguities here ?
>
> > 2) Extended words
> > Example: lovveee ->love, byeeeee->bye
> > Proposal: three or more of the same character never occur together in
> > real words, so trim them.
>
> This is a nice idea! But you would still end up with 'byee' and
> 'lovvee'. Example from Portuguese:
>
> SR store - Nossssaaa!! Ta muito chique hein!!!!! | Facebook
>
> Fixed: "Nossa! Está muito chique hein!"
> Wow! It.is very cool/chic no/hey!
>
> Here "nossa" is an interjection, but it could also be a possessive
> adjective "a nossa casa" (our house)
>
> So, trimming >2 letters would work, but in many cases you will be left
> with two where you might need 1.
>
> > 3) Smileys
> > Example: :) , ;), <3
> > Proposal: They can be replaced with the corresponding emotion by
> > creating a map, as they are limited in number.
>
> Good idea
>
> > 4) Vowels Drop
> > Example: Bt->But, Tht -> That, Lv->Love
> > Proposal: Use a phonetic dictionary.
> > Vowels are dropped to make the word short while keeping the
> > pronunciation the same, so we can use a phonetic dictionary and map
> > each word to its trimmed variations.
> >
> > 5) Spelling Error { Most difficult to correct }
> > Example: Beautyful->beautiful, lov->love
> > Proposal: An FSM can be created using a dictionary, and such words can
> > be replaced with the words to which they have minimal distance.
>
> Can you think of a way of estimating confidence for replacements ?
>
> > 6) Hash Tags
> > Examples: #MeganSoHot, #IndiaWin
> > Proposal: These words most of the time follow a pattern where each
> > capital letter starts a new word.
>
> These could also be IRC channel names; they could mostly be taken care
> of with a regular expression, probably.
>
> >
> > Abbreviations and numbers also make things difficult sometimes, but
> > they are hard to handle. I would suggest that recognizing them and
> > transliterating them would be better.
> >
> >
> > While doing a literature survey, I went through the following articles.
> >
> > http://www.cs.columbia.edu/~julia/papers/sproatetal01.pdf
> >
> > People also tend to use wrong spellings, so it will also involve a
> > spell checker, and then maintaining that list and continually adding
> > words to it.
> >
> >
> https://docs.google.com/viewer?url=patentimages.storage.googleapis.com/pdfs/US5604897.pdf
> >
> >
> > Am I thinking in the right direction?
>
> Yes, you are definitely thinking in the right direction. This is great
> work.
>
> I'm beginning to think that the way to solve the problem is in two
> stages... the first stage will ambiguate the input:
>
> ^Nossssaaa/nossaa/nossa/nosa/nosaa$
> ^!!/!!$
> ^Ta/está/tá/ta$
> ^muito/muito$
> ^chique/chique$
> ^hein/hein$
> ^!!!!!/!!!!!$
>
> Format: ^original/candidate1/candidate2/candidate3$
>
> Then in a second stage we can trim down the possibilities, with either a
> statistical model or rules, or both. What do you think ? So one way to
> trim them would be just to pass the possibilities through the
> morphological dictionary, so that would trim "non-words" (e.g. nossaa,
> nosa, nosaa from the above).
>
> > It will be great, if you could guide me more.
>
> Perhaps this would be a nice easy thing to implement to complete the
> coding challenge.
>
> Write a program that takes input and produces candidates where runs of 3
> or more identical letters are reduced to 1 or 2 letters, and then write a
> second program which checks them against a morphological dictionary.
>
> Input program 1:
>
> Nossssaaa
> !!
> Ta
> muito
> chique
> hein
> !!!!!
>
> Output program 1 / input program 2:
>
> ^Nossssaaa/nossaa/nossa/nosa/nosaa$
> ^!!/!!$
> ^Ta/está/tá/ta$
> ^muito/muito$
> ^chique/chique$
> ^hein/hein$
> ^!!!!!/!!!!!$
>
> Output program 2:
>
> ^Nossssaaa/nossa$
> ^!!/!!$
> ^Ta/ta$
> ^muito/muito$
> ^chique/chique$
> ^hein/hein$
> ^!!!!!/!!!!!$
>
> > Also, can I get commit rights? I have implemented a basic script to
> > pre-process the input that can handle easy cases, and also a basic
> > cleaning script to make the data true-cased, as that also caused a
> > problem in the translation.
>
> What is your SF username ?
>
> Fran
>
>
>
import sys

# corrected.py <dictionary> <input>
# Maps "single_char" words to their full forms, trims extended words,
# generates 1-/2-letter variants, and prints those found in the dictionary.

single_char = {"b": "bee", "B": "bee", "c": "see", "C": "see",
               "d": "the", "D": "the", "k": "okay", "K": "okay",
               "m": "am", "M": "am", "qs": "question", "Qs": "question",
               "r": "are", "R": "are", "s": "ass", "S": "ass",
               "u": "you", "U": "you", "v": "we", "V": "we",
               "x": "times", "X": "times", "y": "why", "Y": "why"}

array = []

# load the dictionary, one word per line
with open(sys.argv[1]) as inp:
    dic = set(line.strip() for line in inp)

def compute(string, pos, temp):
    # recursively expand a pre-trimmed word: for every doubled letter,
    # branch into keeping both letters or keeping just one
    if pos >= len(string):
        array.append(temp)
        return
    if pos == len(string) - 1:
        array.append(temp + string[pos])
        return
    if string[pos] != string[pos + 1]:
        compute(string, pos + 1, temp + string[pos])
    else:
        compute(string, pos + 2, temp + string[pos] + string[pos + 1])
        compute(string, pos + 2, temp + string[pos])

fp = open(sys.argv[2])
for line in fp:
    for word in line.split():
        # map "single_char" abbreviations first
        if word in single_char:
            word = single_char[word]
        # collapse every run of 2+ identical letters to exactly 2
        trimmed = ""
        count = 1
        for k in range(1, len(word)):
            if word[k] == word[k - 1]:
                count += 1
            else:
                trimmed += word[k - 1] * min(count, 2)
                count = 1
        trimmed += word[-1] * min(count, 2)
        compute(trimmed, 0, "")
        # keep candidates found in the dictionary; fall back to the
        # original token if none of them is a dictionary word
        matches = [w for w in array if w in dic]
        for w in matches:
            print(w)
        if not matches:
            print(word)
        del array[:]
import sys

# extended_words.py <input file>
# For every word, prints the original followed by all of its 1-/2-letter
# variants, separated by backslashes.

array = []

def compute(string, pos, temp):
    # recursively expand a pre-trimmed word: for every doubled letter,
    # branch into keeping both letters or keeping just one
    if pos >= len(string):
        array.append(temp)
        return
    if pos == len(string) - 1:
        array.append(temp + string[pos])
        return
    if string[pos] != string[pos + 1]:
        compute(string, pos + 1, temp + string[pos])
    else:
        compute(string, pos + 2, temp + string[pos] + string[pos + 1])
        compute(string, pos + 2, temp + string[pos])

fp = open(sys.argv[1])
for line in fp:
    for word in line.split():
        # collapse every run of 2+ identical letters to exactly 2
        trimmed = ""
        count = 1
        for k in range(1, len(word)):
            if word[k] == word[k - 1]:
                count += 1
            else:
                trimmed += word[k - 1] * min(count, 2)
                count = 1
        trimmed += word[-1] * min(count, 2)
        compute(trimmed, 0, "")
        if not array:
            print(word)
        else:
            print(word + "\\" + "\\".join(array))
        del array[:]