Hi Mauro (or others in the know),

One more thing: does this mapping file have to be a 1-to-1 map, or can the same word be mapped to more than one microtag?
E.g., given the corpus:

---------------------------------------
This|NP( little|NP+ example|NP) is|VP( boring|VP)
This|NP is|VP a|NP( little|NP+ example|NP) too|ADVP
---------------------------------------

we would have:

---------------------------------------
this  NP(
this  NP
...[etc.]
is    VP
is    VP(
...[etc.]
---------------------------------------

Is that possible under the current implementation? (I've put a small sketch of what I mean below the quoted thread.)

--D.N.

On Mon, Jan 31, 2011 at 5:38 AM, Mauro Cettolo <[email protected]> wrote:
> Dear Dennis,
>
> Chunk LMs were implemented in the IRSTLM toolkit to be used within Moses.
> Methods for using chunks in a standalone fashion are available, but they are
> not used in any way by the executables of the toolkit (e.g. "compile-lm").
> If you want to use them in Moses, as described in the on-line documentation,
> you have to define a word-to-chunk map and pass it to Moses through the
> configuration file. Looking at your example, you should have a map like this:
>
> FIELD -1
> a NP(
> b NP+
> c NP)
> d VP(
> e VP+
> f VP)
> g PP(
> h PP+
> i PP)
>
> [please find the meaning of the header "FIELD -1" in the on-line manual],
> save it in a file (e.g. "map") and add it as the fourth field in the line of
> the config file where the chunk LM is specified:
>
> 1 0 3 corp.blm map
>
> This way, when a translation hypothesis such as
>
> a b b b c d f
>
> has to be scored by the chunk LM, the score actually provided will be the one
> corresponding to the chunk sequence of the mapped sequence
> NP( NP+ NP+ NP+ NP) VP( VP), that is, NP VP.
>
> That's what we implemented some years ago. Since we have not used that code
> for a long time, we are going to check right now whether it is still working
> as we designed it, or whether some later updates have affected it. I'll give
> you feedback as soon as possible.
>
> Mauro
>
> Dennis Mehay wrote:
>
>> Hello all,
>>
>> I'm trying to train up an asynchronous ("chunk-based") factored PMT model
>> with Moses + IRST LM. The trouble is, I'm not sure IRST LM is reassembling
>> microtags into chunks (e.g., a candidate with "NP( NP+ NP+ NP)" should
>> become just "NP" before LM scoring during decoding).
>>
>> The reason I'm not sure is that I trained up a little dummy LM using a
>> tiny corpus of chunks (displayed below) as follows:
>>
>> -------------------------------------------------
>> $ more corp
>> NP VP ADVP
>> NP VP NP NP
>> NP VP NP
>> NP VP PP
>> $ ngt -i=corp -n=3 -o=corp.www -b=yes
>> $ tlm -tr=corp.www -n=3 -lm=wb -o=corp.lm
>> $ compile-lm corp.lm corp.blm
>> $ more evalcorp
>> NP( NP+ NP) VP( VP+ VP+ VP+ VP) PP( PP+ PP)
>> $ cat evalcorp | add-start-end.sh | compile-lm corp.blm --eval=/dev/stdin
>> -------------------------------------------------
>>
>> and I get:
>>
>> -------------------------------------------------
>> inpfile: corp.blm
>> dub: 10000000
>> Reading corp.blm...
>> blmt
>> loadbin()
>> loading 6 1-grams
>> loading 9 2-grams
>> loading 3 3-grams
>> done
>> OOV code is 5
>> creating cache for storing prob, state and statesize of ngrams
>> Start Eval
>> OOV code: 5
>> %% Nw=13 PP=35896110.99 PPwp=35896068.13 Nbo=12 Noov=11 OOV=84.62%
>> prob_and_state_cache() ngramcache stats: entries=3 acc=11 hits=8 ht.used= 6402408 mp.used= 56000008 mp.wasted= 55999840
>> lmtable class statistics
>> levels 3
>> lev 1 entries 6 used mem 0.00Mb
>> lev 2 entries 9 used mem 0.00Mb
>> lev 3 entries 3 used mem 0.00Mb
>> total allocated mem 0.00Mb
>> total number of get and binary search calls
>> level 1 get: 5 bsearch: 0
>> level 2 get: 4 bsearch: 7
>> level 3 get: 3 bsearch: 0
>> deleting cache for storing prob, state and statesize of ngrams
>> -------------------------------------------------
>>
>> Notice that all of the microtags are treated as OOV terms (i.e., not
>> mapped to the chunks they describe).
>> For what it's worth, the ARPA format file looks fine:
>>
>> -------------------------------------------------
>> $ more corp.lm
>>
>> \data\
>> ngram 1= 6
>> ngram 2= 9
>> ngram 3= 3
>>
>> \1-grams:
>> -1.09691 <s> -0.39794
>> -0.49485 NP -0.653212
>> -0.69897 VP -0.367977
>> -1.09691 ADVP -0.30103
>> -1.09691 PP -0.30103
>> -0.619789 <unk>
>>
>> \2-grams:
>> -0.364516 <s> <s>
>> -0.484126 <s> NP
>> -0.393141 NP NP -0.221849
>> -0.31079 NP VP -0.146128
>> -0.373806 VP NP -0.477121
>> -0.751676 VP ADVP
>> -0.751676 VP PP
>> -0.180456 ADVP NP
>> -0.267606 PP <s>
>>
>> \3-grams:
>> -0.159058 NP NP VP
>> -0.230804 NP VP NP
>> -0.0961065 VP NP NP
>> \end\
>> -------------------------------------------------
>>
>> Also, when I run tiny tests in Moses (i.e., train up on a tiny parallel
>> corpus, train up a LM using IRST LM, etc.), I get more garbled results
>> than when I don't use the chunk-based LM. I suspect this is because IRST
>> LM treats each microtag as an <unk>, so that the chunk-based LM confounds
>> rather than helps fluency.
>> Asking Moses to "-report-all-factors" doesn't confirm anything either, as
>> the factors would just be the microtags, and nothing would confirm whether
>> or not they are being reassembled into chunks internally.
>>
>> If I'm missing something (this mysterious mapping file, perhaps?), someone
>> please let me know.
>>
>> Thanks.
>>
>> Best,
>> Dennis
>
>
> --
> Mauro Cettolo
> FBK - Ricerca Scientifica e Tecnologica
> Via Sommarive 18
> 38123 Povo (Trento), Italy
> Phone: (+39) 0461-314551
> E-mail: [email protected]
> URL: http://hlt.fbk.eu/people/cettolo
>
> "And which one is my Homeland? A hundred, a hundred thousand, none,
> because to hang the flags, men are often hanged."
>
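P.S. To make sure I've understood the mechanism Mauro describes (and to make my 1-to-many question concrete), here is a toy sketch in plain Python. This is NOT the actual IRSTLM code; the map and the collapsing rule are just my reading of his explanation, and the names (WORD2TAG, collapse) are mine:

-------------------------------------------------
# Toy sketch (not the IRSTLM implementation): map words to microtags with a
# 1-to-1 map like the one Mauro shows, then collapse a microtag run such as
# "NP( NP+ NP+ NP)" into the single chunk label "NP" before LM scoring.

# Mauro's example map (the "FIELD -1" header is omitted here).
WORD2TAG = {
    "a": "NP(", "b": "NP+", "c": "NP)",
    "d": "VP(", "e": "VP+", "f": "VP)",
    "g": "PP(", "h": "PP+", "i": "PP)",
}

def collapse(microtags):
    """Collapse a microtag sequence into chunk labels.

    "NP( NP+ NP+ NP)" -> "NP"; a bare tag like "ADVP" passes through unchanged.
    """
    chunks = []
    for tag in microtags:
        if tag.endswith("("):            # chunk-opening microtag starts a new chunk
            chunks.append(tag[:-1])
        elif tag.endswith(("+", ")")):   # continuation/closing microtags add nothing new
            continue
        else:                            # single-token chunk
            chunks.append(tag)
    return chunks

hypothesis = "a b b b c d f".split()
microtags = [WORD2TAG[w] for w in hypothesis]
print(" ".join(microtags))             # NP( NP+ NP+ NP+ NP) VP( VP)
print(" ".join(collapse(microtags)))   # NP VP
-------------------------------------------------

With a 1-to-1 dictionary like WORD2TAG above, a given word can only ever receive one microtag, so "this" could not be NP( in one sentence and NP in another, which is exactly what my question at the top is about.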
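Also, re-reading my own compile-lm experiment quoted above in light of Mauro's remark that the standalone executables never apply the map: if I collapse the microtags into chunks myself before evaluation, the OOVs should disappear, since the collapsed line is just "NP VP PP", which occurs in the training corpus. A quick sanity-check script, again only a sketch under my assumptions about the microtag notation (the output file name evalcorp.chunks is made up):

-------------------------------------------------
# Sketch: rewrite evalcorp's microtags as chunk labels so that the chunk-level
# corp.blm can be evaluated with the standalone compile-lm, which does not
# apply the word-to-chunk map on its own.
def collapse_line(line):
    chunks = []
    for tag in line.split():
        if tag.endswith("("):               # "NP(" opens a chunk -> emit "NP"
            chunks.append(tag[:-1])
        elif not tag.endswith(("+", ")")):  # keep bare single-tag chunks as-is
            chunks.append(tag)
    return " ".join(chunks)

with open("evalcorp") as src, open("evalcorp.chunks", "w") as dst:
    for line in src:
        # "NP( NP+ NP) VP( VP+ VP+ VP+ VP) PP( PP+ PP)" -> "NP VP PP"
        dst.write(collapse_line(line) + "\n")
-------------------------------------------------

Then "cat evalcorp.chunks | add-start-end.sh | compile-lm corp.blm --eval=/dev/stdin" should, if I understand things correctly, report no OOVs and a sane perplexity, which would at least confirm that the chunk LM itself is fine and that my problem is only about getting Moses to apply the map.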
_______________________________________________ Moses-support mailing list [email protected] http://mailman.mit.edu/mailman/listinfo/moses-support
