I ran into a section on the wiki about testvoc which outlines a similar
procedure for assessing quality, so I will budget time in the next 24
hours to learn how to identify the holes in coverage with that tool
(http://wiki.apertium.eu/index.php/Session_7). Also, I appreciate the
follow-up on my battle to update the dictionaries. I need to dive into
that again, and into testvoc as well, if possible tonight my time (PST).
I will formally submit the application tomorrow, since I am not sure I
will have internet access through at least part of Friday. So if the
timeline looks too rough, or downright unintelligible, when reviewed, I
hope I get time to readjust it.
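
From the wiki page, my rough understanding of testvoc (untested on my
end, and the file names below are guesses for this pair) is to expand a
monolingual dictionary and push every surface form through the
translator, then grep for the debug marks:

$ lt-expand apertium-en-es.es.dix | cut -d':' -f1 | sort -u \
    | apertium -d . es-en | grep '[*@#]'

Any '*' (unknown to the analyser), '@' (missing from the bilingual
dictionary) or '#' (ungenerable form) in the output points at a hole
somewhere in the pipeline.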
Here are my stats from a short file (~200 words) that I post-edited and
compared with the raw MT version. Earlier, I must have been running
apertium-eval-translator incorrectly on each set of four files. I have
not found time to post-edit them all, but for my short 200-word file the
numbers are looking more reasonable:
Test file: 'en-target2'
Reference file: 'en-target2-posted'
Statistics about input files
-------------------------------------------------------
Number of words in reference: 187
Number of words in test: 188
Number of unknown words (marked with a star) in test: 14
*Percentage of unknown words: 7.45 %*
Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 66
*Word error rate (WER): 35.29 %*
Number of position-independent correct words: 137
Position-independent word error rate (PER): 27.27 %
Results when unknown-word marks (stars) are not removed
-------------------------------------------------------
Edit distance: 55
*Word Error Rate (WER): 29.41 %*
Number of position-independent correct words: 148
Position-independent word error rate (PER): 21.39 %
Statistics about the translation of unknown words
-------------------------------------------------------
Number of unknown words which were free rides: -11
Percentage of unknown words that were free rides: -78.57 %
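
For the record, the invocation I used this time (assuming the Perl
script from the wiki is on the path) was roughly:

$ apertium-eval-translator -test en-target2 -ref en-target2-posted

The negative free-ride counts at the end look like a quirk of the
script rather than of the data, as far as I can tell.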
And thank you, Xavi, for the link you sent moments ago.
On 18 March 2014 02:24, Francis Tyers <[email protected]> wrote:
> On Tue, 18 Mar 2014 at 01:08 -0700, Alex Aruj wrote:
> > Hello, I was still unable to see the updates to the dictionaries
> > taking full effect even after trying the '-d . es-en' solution, but I
> > will try running lt-comp again, checking the lr and rl directionality
> > and the automorf and autogen .bin files.
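
Noting for myself: the recompilation to retry presumably looks something
like this, with lr building the analyser and rl the generator (the .dix
file names are my guesses for this pair):

$ lt-comp lr apertium-en-es.es.dix es-en.automorf.bin  # Spanish analyser
$ lt-comp rl apertium-en-es.en.dix es-en.autogen.bin   # English generator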
> >
> >
> > I have shared the part of the GSoC proposal that I think is most
> > directly relevant to the task. I would like some feedback on it if
> > anyone has time. If any ideas about the project are misguided, please
> > suggest alternatives. The formatting options are a little wacky in MS
> > Word on Windows 8; I will certainly adjust them later.
>
> Comments:
>
> I think it might be more convincing if you showed the existing coverage
> on a range of corpora, and showed estimates of how many words you would
> have to add in order to reach the targets you've given yourself. I would
> like to see a week-by-week plan.
>
> Procedure:
>
> 1) Calculate coverage over the whole corpus.
> 2) Get the number of known tokens / total tokens.
> 3) Find out how many more tokens you need to add in order to increase
> coverage by 1%.
> 4) Make a frequency list of unknown words.
> 5) Starting at the top of the list, count down the number of words and
> the token count. This way you should be able to find how many tokens
> (surface forms) you need to cover in order to increase coverage by 1%.
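
A rough shell sketch of steps 1-5 as I understand them (corpus.txt and
the .bin name are placeholders; the sed line, borrowed from the wiki,
splits the analyser output into one token per line, and '/*' marks
unknowns):

$ cat corpus.txt | apertium-destxt | lt-proc es-en.automorf.bin \
    | sed 's/\$\W*\^/$\n^/g' > analysed.txt
$ total=$(wc -l < analysed.txt)
$ unknown=$(grep -c '/\*' analysed.txt)
$ echo "scale=2; 100*($total-$unknown)/$total" | bc   # coverage in %
$ grep -o '/\*[^$]*' analysed.txt | sort | uniq -c | sort -nr | head -50

Walking down that frequency list and summing the counts until the
running total reaches total/100 tokens gives the number of surface
forms that one extra point of coverage would cost.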
>
> You seem to be confusing error rate with coverage. That en-es has a
> coverage of 94% does not surprise me; that it has an error rate of 6%
> does. This would mean that you would only need to change (postedit) 6
> words in 100 in order to get an adequate translation. I suspect it is
> much higher :)
>
> Have you done the evaluation of your 4 texts for WER yet?
>
> Fran
>
> PS. I fixed the problem with 'nueve':
>
> $ echo "son las nueve y todavĂa me da palo salir de la cama" | apertium
> es-en
> They are the nine and still gives me stick go out of the bed
>
>
--
Alex