Hi Francis and Zaher
Sorry, I've been busy with assignments and projects, so I couldn't find enough
time to finish the coding challenge (this is the 12th week of my current
semester, so I'm under a lot of academic pressure :( ).
However, I've finished translating four 5000-word articles and post-edited
them to create reference translations. I've also evaluated the WER and
PWER of the current bn-en translation using apertium-eval-translator. I've
created a wiki page with the details:
http://wiki.apertium.org/wiki/User:Rafi_kamal/Coding_Challange_GSoC_2014
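For reference, here is a minimal sketch of how WER and a position-independent variant can be computed for one sentence pair. It is only an illustration under my own assumptions: the function names are made up, tokenisation is plain whitespace splitting, and apertium-eval-translator may differ in details.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate: compares bags of words only."""
    if not ref:
        raise ValueError("reference must not be empty")
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


# Illustrative sentence pair (not taken from the evaluated articles).
reference = "the cat sat on the mat".split()
hypothesis = "the cat on the mat sat".split()
print(wer(reference, hypothesis), per(reference, hypothesis))
```

Since the position-independent score ignores word order, it cannot be higher than WER when the two sentences have the same length, so the gap between the two gives a rough idea of how many errors are reordering errors.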
Regards
On Thursday, March 20, 2014 8:37 PM, Francis Tyers <[email protected]> wrote:
On Thu, 20 Mar 2014 at 04:49 -0700, Rafi Kamal wrote:
> Hi
>
>
> I learned from Zaher and Ragib that the en-bn PoS tagger is giving
> wrong output for some inputs. Some examples of wrong tagger output are
> here:
> http://wiki.apertium.org/wiki/Bengali_and_English/Issues#Wrong_Tagger_Output.
>
>
> I think I should work on the PoS tagger first, because without fixing
> it, adding transfer rules or updating dictionaries won't help much.
Updating dictionaries will help, and transfer rules probably will too.
> I've talked to Unhammer on IRC about the tagger. He suggested that I
> train the tagger to improve its quality. I've read the wiki articles on
> tagger training and unsupervised tagger training. Now I have a few
> questions:
> 1. Where can I find the tag definition file? According to the
> wiki, it should be in the language pair directory. But find .
> -name '*.tsx' doesn't return any match.
It probably doesn't have one written yet.
> 1. I've downloaded the bnwiki dump, unzipped it and run the
> WikiExtractor.py script on it. But I don't think I'm getting the
> correct output. The script filters out all the body text from the
> dump and preserves only some of the titles. Here are the first
> 100 lines of the script output:
> http://apertium.codepad.org/HGLeBM2K. I can write an extractor
> for Bangla myself; I just need to be sure that I'm not doing
> anything wrong.
The way the script functions is not intuitive. Look for a file called
"bn.crp.txt" or something like that. It will contain the body texts. I
think you might be able to specify an output file too.
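If a hand-written extractor does turn out to be necessary, it need not be long. The following is a minimal sketch, assuming the standard MediaWiki XML export layout (each <page> has a <title> and a <text> inside its <revision>) and a bz2-compressed dump; the function names and file paths are hypothetical, and the output is still raw wikitext, so templates and link markup would need a further cleaning pass.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace: '{http://...}page' becomes 'page'."""
    return tag.rsplit('}', 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML dump.

    `stream` is any binary file-like object; the dump is parsed
    incrementally so it never has to fit in memory at once.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text or ""
            title, text = None, None
            elem.clear()  # release the finished page's subtree


def dump_to_corpus(dump_path, out_path):
    """Write the wikitext of every non-empty page to one corpus file.

    Assumes a .bz2 dump; for an already unzipped dump, replace
    bz2.open(...) with open(dump_path, "rb").
    """
    with bz2.open(dump_path, "rb") as dump, \
            open(out_path, "w", encoding="utf-8") as out:
        for _title, text in iter_pages(dump):
            if text.strip():
                out.write(text + "\n")
```

Calling dump_to_corpus with the path of the downloaded dump and an output file name of your choice would produce one raw corpus file; printing the first few titles from iter_pages is a quick way to check that body text is actually coming through.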
> 1. The wiki page focuses on creating an entirely new .prob file.
> As there is already a .prob file, is there any way I
> can just update it by training? (I
> guess apertium-tagger-trainer can do it, but it works only
> with Apertium 1).
This sounds quite time-consuming. I would work on more time-effective
improvements to start out with.
Fran