Dear Dennis,
Chunk LMs were implemented in the IRSTLM toolkit to be used within Moses.
Methods for using chunks in a standalone fashion are available, but they
are not used in any way by the executables of the toolkit (e.g. "compile-lm").
If you want to use them in Moses, as described in the on-line
documentation, you have to define a word-to-chunk map and pass it to
Moses through the configuration file. Looking at your example, you
should have a map like this:
FIELD -1
a NP(
b NP+
c NP)
d VP(
e VP+
f VP)
g PP(
h PP+
i PP)
(please see the on-line manual for the meaning of the header "FIELD
-1"); save it in a file (e.g. "map") and add it as an additional field at
the end of the line of the config file where the chunk LM is specified:
1 0 3 corp.blm map
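Just as a sketch for concreteness (your actual moses.ini will of course
contain other sections as well), assuming the LM is declared in the usual
old-style [lmodel-file] section, the relevant part would look like this:
[lmodel-file]
1 0 3 corp.blm map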
This way, when a translation hypothesis such as
a b b b c d f
has to be scored by the chunk LM, the score actually provided will be
the one for the chunk sequence of the mapped sequence NP( NP+
NP+ NP+ NP) VP( VP), that is NP VP.
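Just to make the mapping concrete, here is a rough Python sketch of what
happens conceptually; this is not the actual IRSTLM code, and the
collapsing rule is only inferred from the example above:
-------------------------------------------------
# word-to-chunk map, same content as the "map" file above
word2chunk = {
    "a": "NP(", "b": "NP+", "c": "NP)",
    "d": "VP(", "e": "VP+", "f": "VP)",
    "g": "PP(", "h": "PP+", "i": "PP)",
}

def chunk_sequence(tokens):
    # 1) map every word to its microtag
    micro = [word2chunk.get(t, t) for t in tokens]
    # 2) collapse each run X( X+ ... X) into the single chunk label X
    chunks = []
    for tag in micro:
        if tag.endswith("("):           # chunk-opening microtag starts a new chunk
            chunks.append(tag[:-1])
        elif tag.endswith(("+", ")")):  # continuation/closing microtags add nothing
            continue
        else:
            chunks.append(tag)          # anything else is kept as-is
    return chunks

print(chunk_sequence("a b b b c d f".split()))   # -> ['NP', 'VP']
-------------------------------------------------
The chunk LM is then queried on that collapsed sequence rather than on
the microtags themselves.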
That's what we implemented some years ago. Since we have not used that
code for a long time, we are going to check right now whether it still
works as we designed it, or whether later updates have affected it.
I'll give you feedback as soon as possible.
Mauro
Dennis Mehay wrote:
> Hello all,
>
> I'm trying to train up an asynchronous ("chunk-based") factored PMT
> model with Moses + IRST LM. Trouble is, I'm not sure IRST LM is
> reassembling microtags into chunks (e.g., a candidate with "NP( NP+
> NP+ NP)" should become just "NP" before LM scoring during decoding).
>
> The reason I'm not sure is that I trained up a little dummy LM using a
> tiny corpus of chunks (displayed below) as follows:
>
> -------------------------------------------------
> $ more corp
> NP VP ADVP
> NP VP NP NP
> NP VP NP
> NP VP PP
> $ ngt -i=corp -n=3 -o=corp.www -b=yes
> $ tlm -tr=corp.www -n=3 -lm=wb -o=corp.lm
> $ compile-lm corp.lm corp.blm
> $ more evalcorp
> NP( NP+ NP) VP( VP+ VP+ VP+ VP) PP( PP+ PP)
> $ cat evalcorp | add-start-end.sh | compile-lm corp.blm --eval=/dev/stdin
> -------------------------------------------------
>
> and I get:
>
> -------------------------------------------------
> inpfile: corp.blm
> dub: 10000000
> Reading corp.blm...
> blmt
> loadbin()
> loading 6 1-grams
> loading 9 2-grams
> loading 3 3-grams
> done
> OOV code is 5
> creating cache for storing prob, state and statesize of ngrams
> Start Eval
> OOV code: 5
> %% Nw=13 PP=35896110.99 PPwp=35896068.13 Nbo=12 Noov=11 OOV=84.62%
> prob_and_state_cache() ngramcache stats: entries=3 acc=11 hits=8
> ht.used= 6402408 mp.used= 56000008 mp.wasted= 55999840
> lmtable class statistics
> levels 3
> lev 1 entries 6 used mem 0.00Mb
> lev 2 entries 9 used mem 0.00Mb
> lev 3 entries 3 used mem 0.00Mb
> total allocated mem 0.00Mb
> total number of get and binary search calls
> level 1 get: 5 bsearch: 0
> level 2 get: 4 bsearch: 7
> level 3 get: 3 bsearch: 0
> deleting cache for storing prob, state and statesize of ngrams
> -------------------------------------------------
>
> Notice that all of the microtags are treated as OOV terms (i.e., not
> mapped to the chunks they describe).
> For what it's worth, the ARPA format file looks fine:
>
> -------------------------------------------------
> $ more corp.lm
>
> \data\
> ngram 1= 6
> ngram 2= 9
> ngram 3= 3
>
>
> \1-grams:
> -1.09691 <s> -0.39794
> -0.49485 NP -0.653212
> -0.69897 VP -0.367977
> -1.09691 ADVP -0.30103
> -1.09691 PP -0.30103
> -0.619789 <unk>
>
> \2-grams:
> -0.364516 <s> <s>
> -0.484126 <s> NP
> -0.393141 NP NP -0.221849
> -0.31079 NP VP -0.146128
> -0.373806 VP NP -0.477121
> -0.751676 VP ADVP
> -0.751676 VP PP
> -0.180456 ADVP NP
> -0.267606 PP <s>
>
> \3-grams:
> -0.159058 NP NP VP
> -0.230804 NP VP NP
> -0.0961065 VP NP NP
> \end\
> -------------------------------------------------
>
> Also, when I run tiny tests in Moses (i.e., train up on a tiny
> parallel corpus, train up a LM using IRST LM, etc.), I get more
> garbled results than when I don't use the chunk-based LM. I suspect
> this is due to IRST LM's treating each microtag as an <unk>, so that
> the chunk-based LM confounds rather than helps fluency.
> Asking Moses to "-report-all-factors" doesn't confirm anything either,
> as the factors would just be the microtags and nothing would confirm
> that they are or are not being reassembled into chunks internally.
>
> If I'm missing something (this mysterious mapping file, perhaps?),
> someone please let me know.
>
> Thanks.
>
> Best,
> Dennis
--
Mauro Cettolo
FBK - Ricerca Scientifica e Tecnologica
Via Sommarive 18
38123 Povo (Trento), Italy
Phone: (+39) 0461-314551
E-mail: [email protected]
URL: http://hlt.fbk.eu/people/cettolo
And which one is my Homeland? a hundred, a hundred thousand, none,
because in order to hang the flags, men often get hanged
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support