Re: [Apertium-stuff] [GSoC] Adopt an unreleased language pair (Bangla-English)

Rafi Kamal Thu, 20 Mar 2014 09:42:07 -0700

That worked perfectly :) Thank you very much. I'll work on the tagger later;
first I'll work on the coding challenge, as you suggested.

On Thursday, March 20, 2014 10:05 PM, Francis Tyers <[email protected]> wrote:

On Thu, 20 Mar 2014 at 08:59 -0700, Rafi Kamal wrote:
> Hi Francis
> 
> 
> Thanks for your reply. According to the wiki page, I have to follow
> these steps one by one:
> 
>      1. Download the wiki dump (I've downloaded it from here).
>      2. Extract it. (I've used: bzcat
>         bnwiki-20140305-pages-articles-multistream.xml.bz2 >
>         bnwiki.xml)
>      3. Run the script (I've tried: python3 WikiExtractor.py --infn
>         bnwiki.xml > bnwiki.txt)
>      4. Filter the output to generate the corpus (cat bnwiki.txt| grep
>         -v "''" | grep -v http | grep -v "#" | grep -v "@" | grep -e
>         '................................................' | sort -fiu
>         | sort -R | nl -s ". " > bnwiki.crp.txt )
> Can you please tell me whether I am doing something wrong here?

You are using it wrong. Run it directly on the compressed dump:

python3 WikiExtractor.py --infn bnwiki-20140305-pages-articles-multistream.xml.bz2

This creates the corpus file in the directory where you run the command. I
don't know what the file is called, so you will have to find out. You
could do a before/after ls

$ ls > old
$ python3 WikiExtractor.py ...
$ ls > new
$ diff old new 

to find out the file name.
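
The same before/after comparison could also be done from Python. This is only
a rough sketch: the function name created_files is made up for illustration,
and it only looks at the top level of the directory, so it would miss a file
written into a subdirectory that already existed.

```python
import os
import subprocess


def created_files(command, directory="."):
    """Run `command` in `directory` and return the entries that appeared there.

    `command` is a list of arguments, for example the WikiExtractor
    invocation from this thread. The helper itself is hypothetical.
    """
    before = set(os.listdir(directory))      # same idea as `ls > old`
    subprocess.run(command, cwd=directory, check=True)
    after = set(os.listdir(directory))       # same idea as `ls > new`
    # Unlike `diff old new`, this reports only new names, not removed ones.
    return sorted(after - before)
```

You would call it with the extractor command as a list, something like
created_files(["python3", "WikiExtractor.py", "--infn", dump_path]), where
dump_path is the path to the .bz2 dump.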

> As you can see, bnwiki.crp.txt is just a filtered version of
> bnwiki.txt. Since bnwiki.txt doesn't contain any body text of
> Wikipedia articles, neither does bnwiki.crp.txt. Anyway, as you've
> suggested that fixing the PoS tagger will be time-consuming, I might try
> it later and focus on transfer rules first :)
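
For reference, the filtering step quoted above (bnwiki.txt to bnwiki.crp.txt)
can be approximated in Python as below. This is a sketch and not what the wiki
page prescribes: build_corpus is a hypothetical name, min_length stands for the
number of dots in the grep pattern, casefold() only approximates what
sort -fiu does, and a plain shuffle stands in for sort -R.

```python
import random


def build_corpus(lines, min_length, seed=None):
    """Filter, deduplicate, shuffle and number lines, roughly like the pipeline."""
    banned = ("''", "http", "#", "@")        # the four `grep -v` filters
    seen = set()
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if any(token in line for token in banned):
            continue
        if len(line) < min_length:           # the long run of dots in `grep -e`
            continue
        key = line.casefold()                # case-insensitive uniqueness
        if key in seen:
            continue
        seen.add(key)
        kept.append(line)
    random.Random(seed).shuffle(kept)        # stands in for `sort -R`
    # Number the lines in the style of `nl -s ". "` (width 6, right-aligned).
    return [f"{n:6d}. {line}" for n, line in enumerate(kept, start=1)]
```

Passing an open file (opened with encoding="utf-8") as lines works, since the
function only iterates over it. Note that this does nothing useful if the
input has no article text in it, which is the actual problem here.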

I would suggest the following way of approaching the coding challenge:

1) Add the missing vocabulary from the development articles to all of
the dictionaries.
2) Go through the development articles sentence by sentence, and work on
lexical selection, disambiguation, then transfer in turn.

F.

_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff