Hi Francis and Zaher
Lately I was too busy at the university to spend time on the coding
challenge. I started again yesterday. Now I'm reviewing the translation of my
first article and adding new entries to the Bangla and English monolingual
dictionaries. So far the improvements are:
Article   WER      PER
1         -0.81%   -2.74%
2         -0.90%   -2.71%
3         -0.16%   -0.63%
4         +0.19%   -0.14%
I hope I'll be able to achieve greater improvements after entering all the
unknown words into the dictionary and then writing appropriate transfer rules.
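
For anyone who wants to sanity-check WER and PER figures like these, here is a
minimal Python sketch of how the two metrics are commonly computed for one
sentence pair. It is not the code of apertium-eval-translator: the function
names are made up for illustration, the PER formula is one common bag-of-words
formulation, and the tool's own tokenization and normalization may differ.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a reference word
                cur[j - 1] + 1,          # insert a hypothesis word
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate (assumed bag-of-words formulation)."""
    if not ref:
        raise ValueError("reference must not be empty")
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


if __name__ == "__main__":
    reference = "the cat sat on the mat".split()
    hypothesis = "cat the sat on the mat".split()
    # Only the word order differs here, so PER is 0 while WER is not.
    print("WER:", wer(reference, hypothesis))
    print("PER:", per(reference, hypothesis))
```

Because PER ignores word order, it is never higher than WER under these
definitions, so a gap between the two mostly reflects reordering errors.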
On Friday, March 28, 2014 10:22 PM, Rafi Kamal <[email protected]> wrote:
Hi Francis and Zaher
Sorry I was busy with assignments and projects, so I couldn't manage enough
time to finish the coding challenge (this is the 12th week of my current
semester, so I'm under a lot of academic pressure :( )
However, I've finished translating four 5000-word articles and then
post-edited them to create reference translations. Besides, I've evaluated the
WER and PWER of the current bn-en translation using apertium-eval-translator.
I've created a wiki page where I've put the details:
http://wiki.apertium.org/wiki/User:Rafi_kamal/Coding_Challange_GSoC_2014
Regards
On Thursday, March 20, 2014 8:37 PM, Francis Tyers <[email protected]> wrote:
On Thu, 20 Mar 2014 at 04:49 -0700, Rafi Kamal wrote:
> Hi
>
>
> I came to know from Zaher and Ragib that the en-bn PoS tagger is giving
> wrong output for some inputs. Some examples of wrong tagger output are
> here:
> http://wiki.apertium.org/wiki/Bengali_and_English/Issues#Wrong_Tagger_Output.
>
>
> I think I should work on the PoS tagger first, because without fixing
> it, adding transfer rules or updating dictionaries won't help much.
Updating dictionaries will help, and transfer rules probably will too.
> I've talked to Unhammer on IRC about the tagger. He suggested that I
> train the tagger to improve its quality. I've read the wiki articles on
> tagger training and unsupervised tagger training. Now I have a few
> questions:
> 1. Where can I find the tag definition file? According to the
> wiki, it should be in the language pair directory. But
> find . -name '*.tsx' doesn't return any match.
It probably doesn't have one written yet.
> 2. I've downloaded the bnwiki dump, unzipped it and run the
> WikiExtractor.py script on it. But I don't think I'm getting the
> correct output. The script filters out all the body text from the
> dump and preserves only some of the titles. Here are the first
> 100 lines of the script output:
> http://apertium.codepad.org/HGLeBM2K. I can write an extractor
> for Bangla myself; I just need to be sure that I'm not doing
> anything wrong.
The way the script functions is not intuitive. Look for a file called
"bn.crp.txt" or something like that. It will contain the body texts. I
think you might be able to specify an output file too.
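
If the script's output stays confusing, writing a small extractor is not much
work. Below is a minimal Python sketch, assuming the usual MediaWiki XML export
layout (page elements that contain a title and a revision with a text element).
The name iter_pages and the inline SAMPLE dump are hypothetical, and the text
it yields is raw wikitext, so markup would still have to be stripped before
using it as a tagger training corpus.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a real dump (hypothetical content, for illustration only).
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">'
    b"<page><title>A</title><revision><text>hello</text></revision></page>"
    b"<page><title>B</title><revision><text></text></revision></page>"
    b"</mediawiki>"
)


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) for every non-empty page in a dump.

    `source` is a filename or a binary file object. Pages are streamed
    one at a time, so the whole dump never has to fit in memory.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            if title is not None and text:
                yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's subtree


if __name__ == "__main__":
    for page_title, body in iter_pages(io.BytesIO(SAMPLE)):
        print(page_title, len(body))
```

For a real dump, pass the path of the unzipped XML file to iter_pages instead
of the in-memory sample.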
> 3. The wiki page focuses on creating an entirely new .prob file.
> As there is already a .prob file, is there any way I
> can just update it by training? (I guess apertium-tagger-trainer
> can do it, but it works only with Apertium 1).
This sounds quite time-consuming. I would start with improvements that
take less time.
Fran
------------------------------------------------------------------------------
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff