Re: [Apertium-stuff] [GSoC] Adopt an unreleased language pair (Bangla-English)

Rafi Kamal Fri, 28 Mar 2014 10:03:42 -0700

Hi Francis and Zaher,
Sorry, I was busy with assignments and projects, so I couldn't find enough 
time to finish the coding challenge (this is the 12th week of my current 
semester, so I'm under a lot of academic pressure :( ).

However, I've finished translating four 5000-word articles and post-edited 
them to create reference translations. I've also evaluated the WER and PWER 
of the current bn-en translation using apertium-eval-translator. I've created 
a wiki page with the details: 
http://wiki.apertium.org/wiki/User:Rafi_kamal/Coding_Challange_GSoC_2014
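
For reference, here is a minimal sketch of how the two metrics can be
computed for one sentence pair. It assumes whitespace tokenisation and one
common definition of position-independent error rate; the function names are
illustrative, and apertium-eval-translator may differ in details such as
tokenisation or unknown-word handling.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # word missing from hyp
                           cur[j - 1] + 1,            # extra word in hyp
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate: like WER, but word order is ignored."""
    if not ref:
        raise ValueError("reference must not be empty")
    # Bag-of-words overlap between reference and hypothesis.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


if __name__ == "__main__":
    reference = "the cat sat on the mat".split()
    hypothesis = "the cat on the mat sat".split()
    print(f"WER = {wer(reference, hypothesis):.3f}")
    print(f"PER = {per(reference, hypothesis):.3f}")
```

A reordered but otherwise correct translation is penalised by WER and not by
this PER, which is why the two numbers are usually reported together.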

Regards

On Thursday, March 20, 2014 8:37 PM, Francis Tyers <[email protected]> wrote:

On Thu, 20 Mar 2014 at 04:49 -0700, Rafi Kamal wrote:
> Hi
> 
> 
> I came to know from Zaher and Ragib that the en-bn PoS tagger is giving
> wrong output for some inputs. Some examples of wrong tagger output are
> here: 
> http://wiki.apertium.org/wiki/Bengali_and_English/Issues#Wrong_Tagger_Output
> 
> 
> I think I should work on the PoS tagger first, because without fixing
> it, adding transfer rules or updating dictionaries won't help much.

Updating dictionaries will help, and transfer rules probably will too.

> I've talked to Unhammer on IRC about the tagger. He suggested that I
> train the tagger to improve its quality. I've read the wiki articles on
> tagger training and unsupervised tagger training. Now I have a few
> questions:
>      1. Where can I find the tag definition file? According to the
>         wiki, it should be in the language pair directory. But
>         find . -name '*.tsx' doesn't return any match.

It probably doesn't have one written yet. 
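
To double-check, the search can be repeated without involving shell globbing
(an unquoted *.tsx can be expanded by the shell before find sees it). A
minimal sketch; find_tsx is a hypothetical helper name, not part of Apertium:

```python
from pathlib import Path


def find_tsx(root="."):
    """Return every *.tsx file below root, sorted by path."""
    return sorted(p for p in Path(root).rglob("*.tsx") if p.is_file())


if __name__ == "__main__":
    hits = find_tsx()
    if not hits:
        print("no .tsx file found below the current directory")
    for path in hits:
        print(path)
```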

>      2. I've downloaded the bnwiki dump, unzipped it and run the
>         WikiExtractor.py script on it. But I don't think I'm getting
>         the correct output. The script filters out all the body text
>         from the dump and preserves only some of the titles. Here are
>         the first 100 lines of the script output:
>         http://apertium.codepad.org/HGLeBM2K
>         I can write an extractor for Bangla myself; I just need to be
>         sure I'm not doing anything wrong.

The way the script functions is not intuitive. Look for a file called
"bn.crp.txt" or something like that. It will contain the body texts. I
think you might be able to specify an output file too.
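
If the output turns out to be wrapped in <doc ...> ... </doc> markers, as
some versions of WikiExtractor.py produce, the plain text can be recovered
with something like the sketch below. The marker format is an assumption
about the script version in use, and body_lines is a hypothetical helper.

```python
import re

# Assumed opening marker, e.g. <doc id="1" url="..." title="...">
DOC_OPEN = re.compile(r"<doc\b[^>]*>")


def body_lines(lines):
    """Yield non-empty text lines, dropping <doc ...> and </doc> markers."""
    for line in lines:
        text = line.strip()
        if not text or DOC_OPEN.fullmatch(text) or text == "</doc>":
            continue
        yield text


if __name__ == "__main__":
    sample = [
        '<doc id="1" url="http://example.org/1" title="ঢাকা">\n',
        "ঢাকা\n",
        "\n",
        "ঢাকা বাংলাদেশের রাজধানী।\n",
        "</doc>\n",
    ]
    for text in body_lines(sample):
        print(text)
```

When reading the real files, open them with encoding="utf-8" so the Bangla
text is decoded correctly.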

>      3. The wiki page focuses on creating an entirely new .prob file.
>         Since a .prob file already exists, is there any way I can
>         just update it through training? (I guess
>         apertium-tagger-trainer can do it, but it works only with
>         Apertium 1.)

This sounds quite time-consuming. I would start with improvements that
are more time-effective.

Fran

