Hi
I learned from Zaher and Ragib that the en-bn PoS tagger is giving wrong
output for some inputs. Some examples of wrong tagger output are here:
http://wiki.apertium.org/wiki/Bengali_and_English/Issues#Wrong_Tagger_Output.
I think I should work on the PoS tagger first, because without fixing it,
adding transfer rules or updating dictionaries won't help much.
I've talked to Unhammer on IRC about the tagger. He suggested that I train the
tagger to improve its quality. I've read the wiki articles on tagger training and
unsupervised tagger training. Now I have a few questions:
1. Where can I find the tag definition file? According to the wiki, it
should be in the language pair directory. But find . -name *.tsx doesn't return
any match.
2. I've downloaded the bnwiki dump, unzipped it and run the
WikiExtractor.py script on it. But I don't think I'm getting the correct output.
The script filters out all the body text from the dump and preserves only some of
the titles. Here are the first 100 lines of the script output:
http://apertium.codepad.org/HGLeBM2K. I can write an extractor for Bangla
myself; I just need to be sure that I'm not doing anything wrong.
3. The wiki page focuses on creating an entirely new .prob file. Since
a .prob file already exists, is there any way I can just update it by
training? (I guess apertium-tagger-trainer can do it, but it works only with
Apertium 1).
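Regarding question 2, here is a minimal sketch of the kind of extractor I could
write myself. It assumes the dump follows the standard MediaWiki XML export
layout (page > title and page > revision > text). The names iter_pages and
local_name are my own and are not part of WikiExtractor.py. It returns the raw
wikitext and does not strip any markup:

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export.

    `source` is a file name or a binary file object. If a page holds
    several revisions, the text of the last one is returned (assumption:
    the articles dump carries one revision per page).
    """
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release the finished page; dumps are large
```

Looping over iter_pages() with the path of the unzipped dump would give one
(title, text) pair per article, which could then be cleaned and written out as
plain text for tagger training.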
Regards
Rafi
On Wednesday, March 19, 2014 3:51 PM, Rafi Kamal <[email protected]> wrote:
Hello Zaher and Francis
Thanks for your responses. I've already started the coding challenge and I'll
create a detailed proposal soon. I'll keep you posted on the progress.
Regards
Rafi Kamal
On Wednesday, March 19, 2014 1:42 PM, Francis Tyers <[email protected]> wrote:
On Wed, 19 March 2014 at 12:02 +0600, Abu Zaher wrote:
> Hi There,
>
>
> I'm glad to see that you are interested in working in the
> Bengali-English Pair.
>
> On 19 March 2014 01:25, Rafi Kamal <[email protected]> wrote:
> Hi
>
>
> I'm Rafi, a third-year undergraduate student studying Computer
> Science & Engineering at Bangladesh University of Engineering
> & Technology. I'm interested in the project "Adopt an
> unreleased language pair" for the Bangla-English language pair.
>
>
> I'm interested in this project because, as a native
> speaker of Bangla, I strongly feel the need for a good
> Bangla-English machine translation system. Bangla is one of
> the most spoken languages in the world with about 220 million
> native speakers, but the only open source machine translation
> system available is Apertium's, which is currently in
> the incubator stage [1].
>
>
> My goal is to bring the Bangla-English language pair up to
> release quality. To do this, I will
> 1. Add new transfer rules
> 2. Add more words to the dictionary
> 3. Improve handling of Bengali enclitics
> 4. Handle Bangladeshi Bengali and Indian Bengali variants
> properly
>
>
> You are correct in saying that these are the improvements we'd like to
> see in Bengali-English language pair. However, I think targeting all
> of them for GSoC might be a bit impractical given the timeframe.
>
>
> And you need to consider a few more things, e.g. during the last GSoC,
> adding transfer rules was a bit hampered by the incorrect English PoS
> tagging, if I remember correctly.
>
>
> I'd like to see more specific and realistic goals, with timeframes for
> them. We can discuss it on IRC and/or Gmail if you want. I'll try to
> stay online.
>
> Skills:
> I have good knowledge of Java, C++, Python and Android
> application development. I have several apps on Google Play, one
> of which is an open source English to Bangla dictionary named
> Ridmik Dictionary [2]. Currently it's one of the most popular
> English - Bangla dictionaries for Android (more than 30,000
> downloads last year). Some of my other projects have been
> uploaded to my GitHub profile [3].
>
>
> I'm a full time Linux user and I have a good understanding of
> tools like sed, awk, grep etc. I've taken the Theory of
> Computation and Compiler courses at my university, where I
> learned about finite automata, regular expressions, parse
> trees etc. Bangla is my native language and English is my
> second language, so I have good knowledge of the grammatical rules
> of both languages.
>
>
> What I've done so far:
> I've successfully compiled Apertium and the Bangla-English
> language pair from their sources. I've gone through the wiki
> pages to understand how the system works. Besides,
> I've added several entries to the dictionaries and also
> tested them. Currently I'm working on the coding challenge
> posted on the wiki page (translating the story "Where is
> James"). I hope I can finish it in the next couple of days.
>
>
> I look forward to any suggestions.
>
>
> [1] http://wiki.apertium.org/wiki/Apertium-bn-en
> [2] https://play.google.com/store/apps/details?id=buet.rafi.dictionary
> [3] https://github.com/rafi-kamal
>
>
> Regards
> Rafi Kamal
Hi Rafi, note that this pair has had two GSoC projects already, so if we
want to adopt it for a third one, your proposal and your coding
challenge should be pretty convincing.
I'm not saying that you shouldn't do it, but you should really convince
us that you have what it takes to bring this pair to
releasable quality.
One idea might be to do the coding challenge from this task [1].
Fran
[1] http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Make_a_language_pair_state-of-the-art
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book
today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff