Dear Dennis,
Chunk LMs were implemented in the IRSTLM toolkit to be used within Moses.
Methods for using chunks in a standalone fashion are available, but they
are not used in any way by the executables of the toolkit (e.g. "compile-lm").
If you want to use them in Moses, as described in the on-line
documentation, you have to define a word-to-chunk map and pass it to
Moses through the configuration file. Looking at your example, you
should have a map like this:
FIELD -1
a NP(
b NP+
c NP)
d VP(
e VP+
f VP)
g PP(
h PP+
i PP)
(please see the on-line manual for the meaning of the header "FIELD
-1"); save it in a file (e.g. "map") and add it as an additional field at
the end of the line of the config file where the chunk LM is specified:
1 0 3 corp.blm map
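Just as a sketch for concreteness (your actual moses.ini will of course
contain other sections as well), assuming the LM is declared in the usual
old-style [lmodel-file] section, the relevant part would look like this:
[lmodel-file]
1 0 3 corp.blm map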
This way, when a translation hypothesis such as
a b b b c d f
has to be scored by the chunk LM, the score actually provided will be
the one for the chunk sequence of the mapped sequence NP( NP+
NP+ NP+ NP) VP( VP), that is NP VP.
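Just to make the mapping concrete, here is a rough Python sketch of what
happens conceptually; this is not the actual IRSTLM code, and the
collapsing rule is only inferred from the example above:
-------------------------------------------------
# word-to-chunk map, same content as the "map" file above
word2chunk = {
    "a": "NP(", "b": "NP+", "c": "NP)",
    "d": "VP(", "e": "VP+", "f": "VP)",
    "g": "PP(", "h": "PP+", "i": "PP)",
}

def chunk_sequence(tokens):
    # 1) map every word to its microtag
    micro = [word2chunk.get(t, t) for t in tokens]
    # 2) collapse each run X( X+ ... X) into the single chunk label X
    chunks = []
    for tag in micro:
        if tag.endswith("("):           # chunk-opening microtag starts a new chunk
            chunks.append(tag[:-1])
        elif tag.endswith(("+", ")")):  # continuation/closing microtags add nothing
            continue
        else:
            chunks.append(tag)          # anything else is kept as-is
    return chunks

print(chunk_sequence("a b b b c d f".split()))   # -> ['NP', 'VP']
-------------------------------------------------
The chunk LM is then queried on that collapsed sequence rather than on
the microtags themselves.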
That's what we implemented some years ago. Since we have not used that
code for a long time, we are going to check right now whether it still
works as we designed it, or whether later updates have affected it.
I'll give you feedback as soon as possible.
Mauro
Dennis Mehay wrote:
> Hello all,
>
> I'm trying to train up an asynchronous ("chunk-based") factored PMT
> model with Moses + IRST LM. Trouble is, I'm not sure IRST LM is
> reassembling microtags into chunks (e.g., a candidate with "NP( NP+
> NP+ NP)" should become just "NP" before LM scoring during decoding).
>
> The reason I'm not sure is that I trained up a little dummy LM using a
> tiny corpus of chunks (displayed below) as follows:
>
> -------------------------------------------------
> $ more corp
> NP VP ADVP
> NP VP NP NP
> NP VP NP
> NP VP PP
> $ ngt -i=corp -n=3 -o=corp.www -b=yes
> $ tlm -tr=corp.www -n=3 -lm=wb -o=corp.lm
> $ compile-lm corp.lm corp.blm
> $ more evalcorp
> NP( NP+ NP) VP( VP+ VP+ VP+ VP) PP( PP+ PP)
> $ cat evalcorp | add-start-end.sh | compile-lm corp.blm --eval=/dev/stdin
> -------------------------------------------------
>
> and I get:
>
> -------------------------------------------------
> inpfile: corp.blm
> dub: 10000000
> Reading corp.blm...
> blmt
> loadbin()
> loading 6 1-grams
> loading 9 2-grams
> loading 3 3-grams
> done
> OOV code is 5
> creating cache for storing prob, state and statesize of ngrams
> Start Eval
> OOV code: 5
> %% Nw=13 PP=35896110.99 PPwp=35896068.13 Nbo=12 Noov=11 OOV=84.62%
> prob_and_state_cache() ngramcache stats: entries=3 acc=11 hits=8
> ht.used= 6402408 mp.used= 56000008 mp.wasted= 55999840
> lmtable class statistics
> levels 3
> lev 1 entries 6 used mem 0.00Mb
> lev 2 entries 9 used mem 0.00Mb
> lev 3 entries 3 used mem 0.00Mb
> total allocated mem 0.00Mb
> total number of get and binary search calls
> level 1 get: 5 bsearch: 0
> level 2 get: 4 bsearch: 7
> level 3 get: 3 bsearch: 0
> deleting cache for storing prob, state and statesize of ngrams
> -------------------------------------------------
>
> Notice that all of the microtags are treated as OOV terms (i.e., not
> mapped to the chunks they describe).
> For what it's worth, the ARPA format file looks fine:
>
> -------------------------------------------------
> $ more corp.lm
>
> \data\
> ngram 1= 6
> ngram 2= 9
> ngram 3= 3
>
>
> \1-grams:
> -1.09691 <s> -0.39794
> -0.49485 NP -0.653212
> -0.69897 VP -0.367977
> -1.09691 ADVP -0.30103
> -1.09691 PP -0.30103
> -0.619789 <unk>
>
> \2-grams:
> -0.364516 <s> <s>
> -0.484126 <s> NP
> -0.393141 NP NP -0.221849
> -0.31079 NP VP -0.146128
> -0.373806 VP NP -0.477121
> -0.751676 VP ADVP
> -0.751676 VP PP
> -0.180456 ADVP NP
> -0.267606 PP <s>
>
> \3-grams:
> -0.159058 NP NP VP
> -0.230804 NP VP NP
> -0.0961065 VP NP NP
> \end\
> -------------------------------------------------
>
> Also, when I run tiny tests in Moses (i.e., train up on a tiny
> parallel corpus, train up a LM using IRST LM, etc.), I get more
> garbled results than when I don't use the chunk-based LM. I suspect
> this is due to IRST LM's treating each microtag as an <unk>, so that
> the chunk-based LM confounds rather than helps fluency.
> Asking Moses to "-report-all-factors" doesn't confirm anything either,
> as the factors would just be the microtags and nothing would confirm
> that they are or are not being reassembled into chunks internally.
>
> If I'm missing something (this mysterious mapping file, perhaps?),
> someone please let me know.
>
> Thanks.
>
> Best,
> Dennis
--
Mauro Cettolo
FBK - Ricerca Scientifica e Tecnologica
Via Sommarive 18
38123 Povo (Trento), Italy
Phone: (+39) 0461-314551
E-mail: [email protected]
URL: http://hlt.fbk.eu/people/cettolo
And which one is my Homeland? a hundred, a hundred thousand, none,
because in order to hang the flags, men often get hanged
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support