Hi Raphael,
The .dat files are the binary files for the phrase table. They were
binarized from the included text phrase table,
rule-table
The source code for the phrase-based binary implementation was a little too
unwieldy to extend to a syntax model, so we wrote our own.
in the ini file, do this for the text format:
6 0 0 1 syntax-model/rule-table
or this for the binary format:
2 0 0 1 syntax-model/rules
I'm not familiar with the perl scripts. Phil Williams might be along
later, as he knows more about them.
I know that to extract syntax/hiero rules, the scripts have to call
extract-rules instead of extract. The other parts of the training
pipeline are identical.
So for instance,
1. To create a hiero grammar, run something like
extract-rules corpus.1.0-0.en corpus.1.0-0.de
aligned.1.grow-diag-final-and extract.hiero --MaxSymbolsSource
5 --Hierarchical
This extracts rules that look like this:
Musharrafs [X][X] Akt ? [X] ||| Musharraf 's [X][X] Act ? [X] |||
0-0 0-1 2-3 3-4 1-2 ||| 0.0666667
in the SCFG rewrite world, this means
X --> Musharrafs X Akt ? ||| Musharraf 's X Act ?
2. To extract rules for a grammar with syntax trees on the TARGET side:
extract-rules nc.truecased.1.en.0000 nc.truecased.1.de.0000
aligned.1.grow-diag-final-and.0000 extract.both.0000 --MaxSymbolsSource
5 --Hierarchical --GlueGrammar $DIR/glue.both.0000 --TargetSyntax
--OnlyDirect --NonTermConsecSource --MaxNonTerm 3 --MinHoleSource 1
--AllowOnlyUnalignedWords --MinWords 0
this creates rules like:
Musharrafs [X][JJ] Akt ? [X] ||| Musharraf 's [X][JJ] Act ? [NPB]
||| 0-0 0-1 2-3 3-4 1-2 ||| 0.015873
which means
NPB --> Musharrafs JJ Akt ? ||| Musharraf 's JJ Act ?
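If it helps to check a rule table by eye, here is a rough Python sketch that turns a rule line into the rewrite form used above. It assumes the layout of the two example rules (source ||| target ||| alignment ||| score, with the left-hand-side label as the last token on each side). to_rewrite is a made-up helper for illustration, not part of Moses.

```python
import re

# A non-terminal pair such as [X][X] or [X][JJ]: source label, then target label.
NONTERM = re.compile(r"^\[([^\]]+)\]\[([^\]]+)\]$")


def to_rewrite(rule_line):
    """Turn one rule-table line into 'LHS --> source ||| target'."""
    fields = [f.strip() for f in rule_line.split("|||")]
    source, target = fields[0].split(), fields[1].split()

    # The last token on the target side is the left-hand-side label,
    # e.g. [X] for hiero rules or [NPB] for target-syntax rules.
    lhs = target[-1].strip("[]")

    def render(tokens):
        out = []
        for tok in tokens:
            m = NONTERM.match(tok)
            # Show the target-side label of a non-terminal; keep words as-is.
            out.append(m.group(2) if m else tok)
        return " ".join(out)

    return "%s --> %s ||| %s" % (lhs, render(source[:-1]), render(target[:-1]))


hiero = ("Musharrafs [X][X] Akt ? [X] ||| Musharraf 's [X][X] Act ? [X] ||| "
         "0-0 0-1 2-3 3-4 1-2 ||| 0.0666667")
print(to_rewrite(hiero))
# X --> Musharrafs X Akt ? ||| Musharraf 's X Act ?
```

The alignment and score fields are ignored here, since they don't affect the rewrite reading.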
In this case, the target side of your aligned corpus has to be in
the tree format that you saw. I don't know which script converts parser
output to the tree format. There are so many parsers that you may
have to write your own.
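If you do end up writing your own converter, something along these lines might be a starting point. This is only a sketch: it assumes your parser prints Penn-Treebank-style brackets, one sentence per line, and that the target format is the <tree label="..."> markup from your message. brackets_to_tree_markup is a hypothetical name, and the sketch does not escape special characters such as & or < inside words.

```python
import re


def brackets_to_tree_markup(parse):
    """Convert '(NP (DET ein) (NN haus))' into <tree label="..."> markup."""
    # Tokens are '(', ')', or a run of non-space, non-parenthesis characters.
    tokens = re.findall(r"\(|\)|[^\s()]+", parse)
    out = []
    depth = 0
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            # The token right after '(' is the constituent label.
            if i + 1 >= len(tokens) or tokens[i + 1] in ("(", ")"):
                raise ValueError("missing label after '('")
            out.append('<tree label="%s">' % tokens[i + 1])
            depth += 1
            i += 2
        elif tok == ")":
            if depth == 0:
                raise ValueError("unbalanced ')'")
            out.append("</tree>")
            depth -= 1
            i += 1
        else:
            out.append(tok)  # an ordinary word
            i += 1
    if depth != 0:
        raise ValueError("unbalanced '('")
    return " ".join(out)


print(brackets_to_tree_markup("(NP (DET ein) (ADJ kleines) (NN haus))"))
```

Each line of parser output becomes one line of annotated corpus, so the sentence alignment with the other side of the corpus is preserved.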
However, GIZA++ doesn't know anything about trees; it uses the plain old
tokenized corpus, without any tree markup, as always. That's probably why
mkcls choked.
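If the annotated file is the only copy of the corpus you have, the plain text for GIZA++ and mkcls can be recovered by stripping the markup. A minimal sketch, assuming the <tree label="..."> format from your message; strip_tree_markup is a made-up helper, not an existing Moses script.

```python
import re

# Matches opening tags like <tree label="NP"> and closing </tree> tags.
TREE_TAG = re.compile(r"</?tree\b[^>]*>")


def strip_tree_markup(line):
    """Remove tree tags and collapse the remaining whitespace."""
    return " ".join(TREE_TAG.sub(" ", line).split())


annotated = ('<tree label="PN"> das</tree> <tree label="V"> ist</tree> '
             '<tree label="NP"> <tree label="DET"> ein</tree> '
             '<tree label="ADJ"> kleines </tree> '
             '<tree label="NN"> haus</tree> </tree>')
print(strip_tree_markup(annotated))
# das ist ein kleines haus
```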
On 04/05/2010 13:01, Raphael Payen wrote:
> Hi
>
> I am interested to try using syntax models, and I have read the
> "syntax tutorial" section in the manual, but I don't really understand
> how it works. I guess it would be easier with an example, but I don't
> understand neither how to use the files in the sample models archive
> (what are the .dat files in the "rules" directory ? If I want to train
> my own model, I must provide a syntactically annotated parallel
> corpus. So, if I start from just a parallel corpus, I'll need to use
> for example first a POS tagger, then a Collins parser, then the
> wrapper script provided, and then call train-model.perl with
> --{source,target}-syntax ?
>
> I tried with a dummy corpus containing just this:
> <tree label="PN"> das</tree> <tree label="V"> ist</tree> <tree
> label="NP"> <tree label="DET"> ein</tree> <tree label="ADJ"> kleines
> </tree> <tree label="NN"> haus</tree> </tree>
> (and similar in english)
>
> I called train-model.perl like this:
> train-model.perl --corpus testfile -f de -e en -lm
> 0:3:europarl.srilm.gz --source-syntax --target-syntax
> and got this error:
> mkcls: StatVar.cpp:116: double StatVar::quantil(double): Assertion
> `index>=0&&index<n' failed
> Obviously there's something I'm doing wrong, but I don't know what.
>
> By the way, train-model.perl is only in branches/mt3_chart, not in trunk ?
>
> So, to summarize all this, if someone could expand the syntax
> tutorial, and include an example of a simple training, I'd be
> grateful.
>
> Best regards,
>
>
_______________________________________________
Moses-support mailing list
http://mailman.mit.edu/mailman/listinfo/moses-support