Hi Mauro (or others in the know),

One more thing: does this mapping file have to be a 1-to-1 map, or can the same word be mapped to more than one microtag?
E.g., given the corpus:

---------------------------------------
This|NP( little|NP+ example|NP) is|VP( boring|VP)
This|NP is|VP a|NP( little|NP+ example|NP) too|ADVP
---------------------------------------

we would have:

---------------------------------------
this  NP(
this  NP
...[etc.]
is    VP
is    VP(
...[etc.]
---------------------------------------

Is that possible under the current implementation? (I've put a small sketch of what I mean below the quoted thread.)

--D.N.

On Mon, Jan 31, 2011 at 5:38 AM, Mauro Cettolo <[email protected]> wrote:
> Dear Dennis,
>
> Chunk LMs were implemented in the IRSTLM toolkit to be used within Moses.
> Methods for using chunks in a standalone fashion are available, but they are
> not used in any way by the executables of the toolkit (e.g. "compile-lm").
> If you want to use them in Moses, as described in the on-line documentation,
> you have to define a word-to-chunk map and pass it to Moses through the
> configuration file. Looking at your example, you should have a map like this:
>
> FIELD -1
> a NP(
> b NP+
> c NP)
> d VP(
> e VP+
> f VP)
> g PP(
> h PP+
> i PP)
>
> [please find the meaning of the header "FIELD -1" in the on-line manual],
> save it in a file (e.g. "map") and add it as the fourth field in the line of
> the config file where the chunk LM is specified:
>
> 1 0 3 corp.blm map
>
> This way, when a translation hypothesis such as
>
> a b b b c d f
>
> has to be scored by the chunk LM, the score actually provided will be the one
> corresponding to the chunk sequence of the mapped sequence
> NP( NP+ NP+ NP+ NP) VP( VP), that is, NP VP.
>
> That's what we implemented some years ago. Since we have not used that code
> for a long time, we are going to check right now whether it is still working
> as we designed it, or whether some later updates have affected it. I'll give
> you feedback as soon as possible.
>
> Mauro
>
> Dennis Mehay wrote:
>
>> Hello all,
>>
>> I'm trying to train up an asynchronous ("chunk-based") factored PMT model
>> with Moses + IRST LM. The trouble is, I'm not sure IRST LM is reassembling
>> microtags into chunks (e.g., a candidate with "NP( NP+ NP+ NP)" should
>> become just "NP" before LM scoring during decoding).
>>
>> The reason I'm not sure is that I trained up a little dummy LM using a
>> tiny corpus of chunks (displayed below) as follows:
>>
>> -------------------------------------------------
>> $ more corp
>> NP VP ADVP
>> NP VP NP NP
>> NP VP NP
>> NP VP PP
>> $ ngt -i=corp -n=3 -o=corp.www -b=yes
>> $ tlm -tr=corp.www -n=3 -lm=wb -o=corp.lm
>> $ compile-lm corp.lm corp.blm
>> $ more evalcorp
>> NP( NP+ NP) VP( VP+ VP+ VP+ VP) PP( PP+ PP)
>> $ cat evalcorp | add-start-end.sh | compile-lm corp.blm --eval=/dev/stdin
>> -------------------------------------------------
>>
>> and I get:
>>
>> -------------------------------------------------
>> inpfile: corp.blm
>> dub: 10000000
>> Reading corp.blm...
>> blmt
>> loadbin()
>> loading 6 1-grams
>> loading 9 2-grams
>> loading 3 3-grams
>> done
>> OOV code is 5
>> creating cache for storing prob, state and statesize of ngrams
>> Start Eval
>> OOV code: 5
>> %% Nw=13 PP=35896110.99 PPwp=35896068.13 Nbo=12 Noov=11 OOV=84.62%
>> prob_and_state_cache() ngramcache stats: entries=3 acc=11 hits=8 ht.used= 6402408 mp.used= 56000008 mp.wasted= 55999840
>> lmtable class statistics
>> levels 3
>> lev 1 entries 6 used mem 0.00Mb
>> lev 2 entries 9 used mem 0.00Mb
>> lev 3 entries 3 used mem 0.00Mb
>> total allocated mem 0.00Mb
>> total number of get and binary search calls
>> level 1 get: 5 bsearch: 0
>> level 2 get: 4 bsearch: 7
>> level 3 get: 3 bsearch: 0
>> deleting cache for storing prob, state and statesize of ngrams
>> -------------------------------------------------
>>
>> Notice that all of the microtags are treated as OOV terms (i.e., not
>> mapped to the chunks they describe).
>> For what it's worth, the ARPA format file looks fine:
>>
>> -------------------------------------------------
>> $ more corp.lm
>>
>> \data\
>> ngram 1= 6
>> ngram 2= 9
>> ngram 3= 3
>>
>> \1-grams:
>> -1.09691 <s> -0.39794
>> -0.49485 NP -0.653212
>> -0.69897 VP -0.367977
>> -1.09691 ADVP -0.30103
>> -1.09691 PP -0.30103
>> -0.619789 <unk>
>>
>> \2-grams:
>> -0.364516 <s> <s>
>> -0.484126 <s> NP
>> -0.393141 NP NP -0.221849
>> -0.31079 NP VP -0.146128
>> -0.373806 VP NP -0.477121
>> -0.751676 VP ADVP
>> -0.751676 VP PP
>> -0.180456 ADVP NP
>> -0.267606 PP <s>
>>
>> \3-grams:
>> -0.159058 NP NP VP
>> -0.230804 NP VP NP
>> -0.0961065 VP NP NP
>> \end\
>> -------------------------------------------------
>>
>> Also, when I run tiny tests in Moses (i.e., train up on a tiny parallel
>> corpus, train up a LM using IRST LM, etc.), I get more garbled results
>> than when I don't use the chunk-based LM. I suspect this is because IRST
>> LM treats each microtag as an <unk>, so that the chunk-based LM confounds
>> rather than helps fluency.
>> Asking Moses to "-report-all-factors" doesn't confirm anything either, as
>> the factors would just be the microtags, and nothing would confirm whether
>> or not they are being reassembled into chunks internally.
>>
>> If I'm missing something (this mysterious mapping file, perhaps?), someone
>> please let me know.
>>
>> Thanks.
>>
>> Best,
>> Dennis
>
>
> --
> Mauro Cettolo
> FBK - Ricerca Scientifica e Tecnologica
> Via Sommarive 18
> 38123 Povo (Trento), Italy
> Phone: (+39) 0461-314551
> E-mail: [email protected]
> URL: http://hlt.fbk.eu/people/cettolo
>
> "And which one is my Homeland? A hundred, a hundred thousand, none,
> because to hang the flags, men are often hanged."
>
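P.S. To make sure I've understood the mechanism Mauro describes (and to make my 1-to-many question concrete), here is a toy sketch in plain Python. This is NOT the actual IRSTLM code; the map and the collapsing rule are just my reading of his explanation, and the names (WORD2TAG, collapse) are mine:

-------------------------------------------------
# Toy sketch (not the IRSTLM implementation): map words to microtags with a
# 1-to-1 map like the one Mauro shows, then collapse a microtag run such as
# "NP( NP+ NP+ NP)" into the single chunk label "NP" before LM scoring.

# Mauro's example map (the "FIELD -1" header is omitted here).
WORD2TAG = {
    "a": "NP(", "b": "NP+", "c": "NP)",
    "d": "VP(", "e": "VP+", "f": "VP)",
    "g": "PP(", "h": "PP+", "i": "PP)",
}

def collapse(microtags):
    """Collapse a microtag sequence into chunk labels.

    "NP( NP+ NP+ NP)" -> "NP"; a bare tag like "ADVP" passes through unchanged.
    """
    chunks = []
    for tag in microtags:
        if tag.endswith("("):            # chunk-opening microtag starts a new chunk
            chunks.append(tag[:-1])
        elif tag.endswith(("+", ")")):   # continuation/closing microtags add nothing new
            continue
        else:                            # single-token chunk
            chunks.append(tag)
    return chunks

hypothesis = "a b b b c d f".split()
microtags = [WORD2TAG[w] for w in hypothesis]
print(" ".join(microtags))             # NP( NP+ NP+ NP+ NP) VP( VP)
print(" ".join(collapse(microtags)))   # NP VP
-------------------------------------------------

With a 1-to-1 dictionary like WORD2TAG above, a given word can only ever receive one microtag, so "this" could not be NP( in one sentence and NP in another, which is exactly what my question at the top is about.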
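Also, re-reading my own compile-lm experiment quoted above in light of Mauro's remark that the standalone executables never apply the map: if I collapse the microtags into chunks myself before evaluation, the OOVs should disappear, since the collapsed line is just "NP VP PP", which occurs in the training corpus. A quick sanity-check script, again only a sketch under my assumptions about the microtag notation (the output file name evalcorp.chunks is made up):

-------------------------------------------------
# Sketch: rewrite evalcorp's microtags as chunk labels so that the chunk-level
# corp.blm can be evaluated with the standalone compile-lm, which does not
# apply the word-to-chunk map on its own.
def collapse_line(line):
    chunks = []
    for tag in line.split():
        if tag.endswith("("):               # "NP(" opens a chunk -> emit "NP"
            chunks.append(tag[:-1])
        elif not tag.endswith(("+", ")")):  # keep bare single-tag chunks as-is
            chunks.append(tag)
    return " ".join(chunks)

with open("evalcorp") as src, open("evalcorp.chunks", "w") as dst:
    for line in src:
        # "NP( NP+ NP) VP( VP+ VP+ VP+ VP) PP( PP+ PP)" -> "NP VP PP"
        dst.write(collapse_line(line) + "\n")
-------------------------------------------------

Then "cat evalcorp.chunks | add-start-end.sh | compile-lm corp.blm --eval=/dev/stdin" should, if I understand things correctly, report no OOVs and a sane perplexity, which would at least confirm that the chunk LM itself is fine and that my problem is only about getting Moses to apply the map.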
_______________________________________________ Moses-support mailing list [email protected] http://mailman.mit.edu/mailman/listinfo/moses-support
