Hello Mauro,

Thanks very much for the clarification.  It *was* the mapping file, after all.
I was under the impression that IRSTLM did this chunk reassembly automatically
all the time, but I really only want to use it for Moses, so it's just as well.

Also, thanks for the offer to test whether this functionality still works.

Best,
D.N.

On Mon, Jan 31, 2011 at 5:38 AM, Mauro Cettolo <[email protected]> wrote:

> Dear Dennis,
>
> chunk LMs were implemented in IRSTLM toolkit to be used within Moses.
> Methods for using chunks in a standalone fashion are available, but not used
> in any way by the executables of the toolkit (e.g. "compile-lm"). If you
> want to use them in Moses, as written in the on-line documentation, you have
> to define a word-to-chunk map and pass it to Moses through the configuration
> file. Looking at your example, you should have a map like this:
>
>
> FIELD -1
> a NP(
> b NP+
> c NP)
> d VP(
> e VP+
> f VP)
> g PP(
> h PP+
> i PP)
>
> [please find in the on-line manual the meaning of the header "FIELD -1"],
> save it in a file (e.g. "map") and append it as an extra field in the line of
> the config file where the chunk LM is specified:
>
> 1 0 3 corp.blm map
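>
> For concreteness, here is a minimal sketch of the relevant moses.ini
> fragment, assuming the standard "type factor order filename" layout of
> [lmodel-file] entries (type 1 = IRSTLM, factor 0, order 3, as in your
> setup; the file names are just examples):
>
> [lmodel-file]
> 1 0 3 corp.blm map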
>
> This way, when a translation hypothesis like, for example,
>
> a b b b c d f
>
> has to be scored by the chunk LM, the score actually provided will be that
> of the chunk sequence underlying the mapped sequence NP( NP+ NP+ NP+ NP)
> VP( VP), that is, NP VP.
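>
> (Not the toolkit's actual code, just an illustration: the collapsing the
> map implies can be sketched in a few lines of Python. The dictionary below
> is the "map" file from above; a tag ending in "(" opens a chunk, while
> "+" and ")" tags extend and close it.)
>
> # hypothetical sketch of the microtag-to-chunk collapsing described above
> word2tag = {"a": "NP(", "b": "NP+", "c": "NP)",
>             "d": "VP(", "e": "VP+", "f": "VP)",
>             "g": "PP(", "h": "PP+", "i": "PP)"}
>
> def chunks(words):
>     out = []
>     for w in words:
>         tag = word2tag[w]
>         if tag.endswith("("):    # "X(" starts a new chunk labelled X
>             out.append(tag[:-1])
>         # "X+" and "X)" only continue/close the current chunk
>     return out
>
> print(chunks("a b b b c d f".split()))   # prints ['NP', 'VP']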
>
> That's what we implemented some years ago. Since it has been a long time
> since we used that code, we are going to check right now whether it is still
> working as we designed it, or whether some later updates have affected it.
> I'll give you feedback as soon as possible.
>
> Mauro
>
> Dennis Mehay wrote:
>
>> Hello all,
>>
>> I'm trying to train up an asynchronous ("chunk-based") factored PMT model
>> with Moses + IRST LM.  Trouble is, I'm not sure IRST LM is reassembling
>> microtags into chunks (e.g., a candidate with "NP( NP+ NP+ NP)" should
>> become just "NP" before LM scoring during decoding).
>>
>> The reason I'm not sure is that I trained up a little dummy LM using a
>> tiny corpus of chunks (displayed below) as follows:
>>
>> -------------------------------------------------
>> $ more corp
>> NP VP ADVP
>> NP VP NP NP
>> NP VP NP
>> NP VP PP
>> $ ngt -i=corp -n=3 -o=corp.www -b=yes       # collect 3-gram counts (binary table)
>> $ tlm -tr=corp.www -n=3 -lm=wb -o=corp.lm   # estimate a Witten-Bell-smoothed 3-gram LM
>> $ compile-lm corp.lm corp.blm               # compile the ARPA LM into binary format
>> $ more evalcorp
>> NP( NP+ NP) VP( VP+ VP+ VP+ VP) PP( PP+ PP)
>> $ cat evalcorp | add-start-end.sh | compile-lm corp.blm --eval=/dev/stdin   # add <s>/</s>, then score
>> -------------------------------------------------
>>
>> and I get:
>>
>> -------------------------------------------------
>> inpfile: corp.blm
>> dub: 10000000
>> Reading corp.blm...
>> blmt
>> loadbin()
>> loading 6 1-grams
>> loading 9 2-grams
>> loading 3 3-grams
>> done
>> OOV code is 5
>> creating cache for storing prob, state and statesize of ngrams
>> Start Eval
>> OOV code: 5
>> %% Nw=13 PP=35896110.99 PPwp=35896068.13 Nbo=12 Noov=11 OOV=84.62%
>> prob_and_state_cache() ngramcache stats: entries=3 acc=11 hits=8 ht.used=
>> 6402408 mp.used= 56000008 mp.wasted= 55999840
>> lmtable class statistics
>> levels 3
>> lev 1 entries 6 used mem 0.00Mb
>> lev 2 entries 9 used mem 0.00Mb
>> lev 3 entries 3 used mem 0.00Mb
>> total allocated mem 0.00Mb
>> total number of get and binary search calls
>> level 1 get: 5 bsearch: 0
>> level 2 get: 4 bsearch: 7
>> level 3 get: 3 bsearch: 0
>> deleting cache for storing prob, state and statesize of ngrams
>> -------------------------------------------------
>>
>> Notice that all of the microtags are treated as OOV terms (Noov=11, one for
>> each of the 11 microtags in the input), i.e., they are not being mapped to
>> the chunks they describe.
>> For what it's worth, the ARPA format file looks fine:
>>
>> -------------------------------------------------
>> $ more corp.lm
>>
>> \data\
>> ngram  1=         6
>> ngram  2=         9
>> ngram  3=         3
>>
>>
>> \1-grams:
>> -1.09691    <s>    -0.39794
>> -0.49485    NP    -0.653212
>> -0.69897    VP    -0.367977
>> -1.09691    ADVP    -0.30103
>> -1.09691    PP    -0.30103
>> -0.619789    <unk>
>>
>> \2-grams:
>> -0.364516    <s> <s>
>> -0.484126    <s> NP
>> -0.393141    NP NP    -0.221849
>> -0.31079    NP VP    -0.146128
>> -0.373806    VP NP    -0.477121
>> -0.751676    VP ADVP
>> -0.751676    VP PP
>> -0.180456    ADVP NP
>> -0.267606    PP <s>
>>
>> \3-grams:
>> -0.159058    NP NP VP
>> -0.230804    NP VP NP
>> -0.0961065    VP NP NP
>> \end\
>> -------------------------------------------------
>>
>> Also, when I run tiny tests in Moses (i.e., train up on a tiny parallel
>> corpus, train up a LM using IRST LM, etc.), I get more garbled results than
>> when I don't use the chunk-based LM.  I suspect this is due to IRST LM's
>> treating each microtag as an <unk>, so that the chunk-based LM confounds
>> rather than helps fluency.
>> Asking Moses to "-report-all-factors" doesn't settle the question either:
>> the reported factors would just be the microtags, so the output cannot show
>> whether they are or are not being reassembled into chunks internally.
>>
>> If I'm missing something (this mysterious mapping file, perhaps?), someone
>> please let me know.
>>
>> Thanks.
>>
>> Best,
>> Dennis
>
>
> --
> Mauro Cettolo
> FBK - Ricerca Scientifica e Tecnologica
> Via Sommarive 18
> 38123 Povo (Trento), Italy
> Phone: (+39) 0461-314551
> E-mail: [email protected]
> URL: http://hlt.fbk.eu/people/cettolo
>
> And which one is my Homeland? a hundred, a hundred thousand, none,
> because to hoist the flags it is often the men who get hanged
>
>
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
