hi raphael

the .dat files are the binary files for the phrase table. they were 
binarized from the included text phrase table
   rule-table
the source code for the phrase-based binary implementation was a little 
too unwieldy to extend to a syntax model, so we wrote our own.

in the ini file, do this for the text format:
    6 0 0 1 syntax-model/rule-table
or this for the binary format:
    2 0 0 1 syntax-model/rules

i'm not familiar with the perl scripts. Phil Williams might be along 
later as he knows more about them.

I know that to extract syntax/hiero rules, the scripts have to call 
extract-rules instead of extract. The other parts of the training 
pipeline are identical.

So for instance,

1. to create hiero grammar, run something like
    extract-rules corpus.1.0-0.en corpus.1.0-0.de \
        aligned.1.grow-diag-final-and extract.hiero \
        --MaxSymbolsSource 5 --Hierarchical
This extracts rules that look like this:
    Musharrafs [X][X] Akt ? [X] ||| Musharraf 's [X][X] Act ? [X] ||| 0-0 0-1 2-3 3-4 1-2 ||| 0.0666667
in the SCFG rewrite world, this means
    X --> Musharrafs X Akt ?  |||  Musharraf 's X Act ?

2. to extract rules with grammar from a TARGET syntax tree:
    extract nc.truecased.1.en.0000 nc.truecased.1.de.0000 \
        aligned.1.grow-diag-final-and.0000 extract.both.0000 \
        --MaxSymbolsSource 5 --Hierarchical \
        --GlueGrammar $DIR/glue.both.0000 --TargetSyntax \
        --OnlyDirect --NonTermConsecSource --MaxNonTerm 3 \
        --MinHoleSource 1 --AllowOnlyUnalignedWords --MinWords 0

this creates rules like:
    Musharrafs [X][JJ] Akt ? [X] ||| Musharraf 's [X][JJ] Act ? [NPB] ||| 0-0 0-1 2-3 3-4 1-2 ||| 0.015873
which means
      NPB --> Musharrafs JJ Akt ?  |||  Musharraf 's JJ Act ?
in this case, the target side of your aligned corpus has to be in 
the tree format that you saw. I don't know which script converts parser 
output to the tree format. there are so many parsers that you may 
have to write your own.
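
To make the rule notation in steps 1 and 2 concrete, here is a small Python sketch that reads one line of the text rule table and prints it in the SCFG rewrite form used above. It is not part of Moses. The field layout (source ||| target ||| alignment ||| score, with the last token on each side being the bracketed left-hand-side label, and [A][B] pairs shown by their second label) is my reading of the two example rules in this mail, so treat it as an assumption. The names Rule, parse_rule and as_scfg are made up for the illustration.

```python
import re
from typing import List, NamedTuple, Tuple

# [X][JJ]: a non-terminal written as a pair of labels (assumed layout).
NONTERM = re.compile(r"^\[([^\]]+)\]\[([^\]]+)\]$")
# [NPB]: a single bracketed label, expected as the last token of each side.
LHS = re.compile(r"^\[([^\]]+)\]$")


class Rule(NamedTuple):
    lhs: str                          # label taken from the target side
    source: List[str]
    target: List[str]
    alignment: List[Tuple[int, int]]  # index pairs, kept exactly as written
    score: float


def _split_side(side: str) -> Tuple[List[str], str]:
    """Return (symbols, lhs_label) for one side of a rule."""
    tokens = side.split()
    lhs = LHS.match(tokens[-1])
    if lhs is None:
        raise ValueError("expected a trailing [LABEL] in: " + side)
    symbols = []
    for tok in tokens[:-1]:
        nt = NONTERM.match(tok)
        # Show a non-terminal by its second label, as in the examples above.
        symbols.append(nt.group(2) if nt else tok)
    return symbols, lhs.group(1)


def parse_rule(line: str) -> Rule:
    fields = [f.strip() for f in line.split("|||")]
    source, _ = _split_side(fields[0])
    target, target_lhs = _split_side(fields[1])
    alignment = [tuple(int(i) for i in pair.split("-"))
                 for pair in fields[2].split()]
    return Rule(target_lhs, source, target, alignment, float(fields[3]))


def as_scfg(rule: Rule) -> str:
    return "{} --> {}  |||  {}".format(
        rule.lhs, " ".join(rule.source), " ".join(rule.target))


if __name__ == "__main__":
    line = ("Musharrafs [X][JJ] Akt ? [X] ||| Musharraf 's [X][JJ] Act ? [NPB] "
            "||| 0-0 0-1 2-3 3-4 1-2 ||| 0.015873")
    print(as_scfg(parse_rule(line)))
```

Run on the hiero rule from step 1 instead, the same code gives a rule whose left-hand side and non-terminal are both X.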

However, GIZA++ doesn't know anything about trees; it uses the plain 
tokenized corpus, without tree markup, as always. That's probably why 
mkcls choked.
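
If you need a plain-text copy of a tree-annotated corpus for the alignment step, something like the following Python sketch would do. It assumes the <tree label="..."> markup from your dummy corpus and simply drops the tags. I have not checked whether the training script already does this for you, and the function name is made up for the example.

```python
import re

# Matches opening tags like <tree label="NP"> and closing </tree> tags.
TREE_TAG = re.compile(r"</?tree\b[^>]*>")


def strip_tree_markup(line: str) -> str:
    """Drop the tree tags and collapse runs of whitespace to single spaces."""
    return " ".join(TREE_TAG.sub(" ", line).split())


if __name__ == "__main__":
    annotated = ('<tree label="PN">  das</tree>  <tree label="V">  ist</tree>  '
                 '<tree label="NP">  <tree label="DET">  ein</tree>  '
                 '<tree label="ADJ">  kleines </tree>  '
                 '<tree label="NN">  haus</tree>  </tree>')
    print(strip_tree_markup(annotated))
```

Apply it line by line to both sides of the corpus so that the sentence pairs stay parallel.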



On 04/05/2010 13:01, Raphael Payen wrote:
> Hi
>
> I am interested to try using syntax models, and I have read the
> "syntax tutorial" section in the manual, but I don't really understand
> how it works. I guess it would be easier with an example, but I don't
> understand neither how to use the files in the sample models archive
> (what are the .dat files in the "rules" directory ? If I want to train
> my own model, I must provide a syntactically annotated parallel
> corpus. So, if I start from just a parallel corpus, I'll need to use
> for example first a POS tagger, then a Collins parser, then the
> wrapper script provided, and then call train-model.perl with
> --{source,target}-syntax ?
>
> I tried with a dummy corpus containing just this:
> <tree label="PN">  das</tree>  <tree label="V">  ist</tree>  <tree
> label="NP">  <tree label="DET">  ein</tree>  <tree label="ADJ">  kleines
> </tree>  <tree label="NN">  haus</tree>  </tree>
> (and similar in english)
>
> I called train-model.perl like this:
> train-model.perl --corpus testfile -f de -e en -lm
> 0:3:europarl.srilm.gz --source-syntax --target-syntax
> and got this error:
> mkcls: StatVar.cpp:116: double StatVar::quantil(double): Assertion
> `index>=0&&index<n' failed
> Obviously there's something I'm doing wrong, but I don't know what.
>
> By the way, train-model.perl is only in branches/mt3_chart, not in trunk ?
>
> So, to summarize all this, if someone could expand the syntax
> tutorial, and include an example of a simple training, I'd be
> grateful.
>
> Best regards,
>
>    
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
