Hello all,

I'm trying to train up an asynchronous ("chunk-based") factored SMT model
with Moses + IRST LM.  Trouble is, I'm not sure IRST LM is reassembling
microtags into chunks (e.g., a candidate sequence "NP( NP+ NP+ NP)" should
collapse to just "NP" before LM scoring during decoding).

The reason I'm not sure is that I trained up a little dummy LM on a tiny
corpus of chunks (displayed below) as follows:

-------------------------------------------------
$ more corp
NP VP ADVP
NP VP NP NP
NP VP NP
NP VP PP
$ ngt -i=corp -n=3 -o=corp.www -b=yes
$ tlm -tr=corp.www -n=3 -lm=wb -o=corp.lm
$ compile-lm corp.lm corp.blm
$ more evalcorp
NP( NP+ NP) VP( VP+ VP+ VP+ VP) PP( PP+ PP)
$ cat evalcorp | add-start-end.sh | compile-lm corp.blm --eval=/dev/stdin
-------------------------------------------------

and I get:

-------------------------------------------------
inpfile: corp.blm
dub: 10000000
Reading corp.blm...
blmt
loadbin()
loading 6 1-grams
loading 9 2-grams
loading 3 3-grams
done
OOV code is 5
creating cache for storing prob, state and statesize of ngrams
Start Eval
OOV code: 5
%% Nw=13 PP=35896110.99 PPwp=35896068.13 Nbo=12 Noov=11 OOV=84.62%
prob_and_state_cache() ngramcache stats: entries=3 acc=11 hits=8 ht.used=
6402408 mp.used= 56000008 mp.wasted= 55999840
lmtable class statistics
levels 3
lev 1 entries 6 used mem 0.00Mb
lev 2 entries 9 used mem 0.00Mb
lev 3 entries 3 used mem 0.00Mb
total allocated mem 0.00Mb
total number of get and binary search calls
level 1 get: 5 bsearch: 0
level 2 get: 4 bsearch: 7
level 3 get: 3 bsearch: 0
deleting cache for storing prob, state and statesize of ngrams
-------------------------------------------------

Notice that all of the microtags are treated as OOV terms (i.e., not mapped
to the chunks they describe).
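
As a sanity check on the mapping itself, the collapse can be simulated
outside IRST LM with a one-line sed substitution (a sketch, assuming the
microtags follow exactly the "X(" / "X+" / "X)" pattern shown above;
"evalcorp" is the file from the transcript):

```shell
# Recreate the one-line evaluation corpus from the transcript above.
printf 'NP( NP+ NP) VP( VP+ VP+ VP+ VP) PP( PP+ PP)\n' > evalcorp

# Collapse each microtag run like "NP( NP+ NP+ NP)" back into its bare
# chunk label: "X(" opens a chunk, "X+" continues it, "X)" closes it.
# The \1 backreferences keep all three tag shapes tied to one label.
sed -E 's/([A-Z]+)\(( \1\+)* \1\)/\1/g' evalcorp
# prints: NP VP PP
```

Piping the collapsed output through add-start-end.sh and compile-lm
--eval should then report no OOVs, since NP, VP, and PP are all in the
unigram table; if so, the LM itself is sound and the problem is purely
that the microtag-to-chunk mapping is not being applied before scoring.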
For what it's worth, the ARPA format file looks fine:

-------------------------------------------------
$ more corp.lm

\data\
ngram  1=         6
ngram  2=         9
ngram  3=         3


\1-grams:
-1.09691    <s>    -0.39794
-0.49485    NP    -0.653212
-0.69897    VP    -0.367977
-1.09691    ADVP    -0.30103
-1.09691    PP    -0.30103
-0.619789    <unk>

\2-grams:
-0.364516    <s> <s>
-0.484126    <s> NP
-0.393141    NP NP    -0.221849
-0.31079    NP VP    -0.146128
-0.373806    VP NP    -0.477121
-0.751676    VP ADVP
-0.751676    VP PP
-0.180456    ADVP NP
-0.267606    PP <s>

\3-grams:
-0.159058    NP NP VP
-0.230804    NP VP NP
-0.0961065    VP NP NP
\end\
-------------------------------------------------

Also, when I run tiny end-to-end tests in Moses (i.e., train up on a tiny
parallel corpus, train up an LM using IRST LM, etc.), I get more garbled
output than when I don't use the chunk-based LM at all.  I suspect this is
because IRST LM treats each microtag as <unk>, so the chunk-based LM
confounds rather than helps fluency.
Running Moses with "-report-all-factors" doesn't settle it either: the
reported factors are just the microtags, so there's no way to tell from the
output whether they are being reassembled into chunks internally.

If I'm missing something (this mysterious mapping file, perhaps?), someone
please let me know.

Thanks.

Best,
Dennis
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
