Hi Francis,
On 18 March 2013 14:38, Francis Tyers <[email protected]> wrote:
> Hello all,
>
> I'm trying to get lattice input to Moses to work for morpheme
> segmentation for Finnish->English MT. I'm using the description here[1]
> and have the following questions:
>
> 1) Do the weights on outgoing arcs have to add up to 1.0 ? In some places
> it says weight, and in others probability.
>
No. They're just scores, so they don't have to add up to 1.0.
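To illustrate, here is a small sketch of my own (not taken from the Moses docs) that reads a lattice in PLF and sums the scores on the arcs leaving each node. PLF is written as a Python-style nested tuple, so `ast.literal_eval` can parse it. The lattice itself and the helper name `outgoing_score_sums` are made up for this example, and it says nothing about how the decoder combines the scores with its feature weights.

```python
import ast

# Made-up lattice in PLF: one tuple per node, each arc is (label, score, distance).
PLF = "((('siirtoja', 0.5, 3), ('siirto', 0.75, 1),), (('>j', 1.0, 1),), (('>a', 1.0, 1),),)"

def outgoing_score_sums(plf_text):
    """Return the sum of the arc scores leaving each node of a PLF lattice."""
    lattice = ast.literal_eval(plf_text)
    return [sum(score for _label, score, _distance in node) for node in lattice]

sums = outgoing_score_sums(PLF)
print(sums)
```

The first node's arcs sum to more than 1.0 here, which is fine if the values are treated as plain scores rather than probabilities.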
>
> 2) For the multiline example, it is important that there be a preceding
> space on the line before the first '(', but it's not mentioned in the
> documentation -- could it be added ? The code in question seems to be in
> parsePCN() where it returns error if in[c++] is not '(', so if you have
> "(" instead of " (" in the first line, the checkplf program returns a
> "there appears to be no path to the goal" error. This does not seem to
> be a problem in the single-line format, providing there are no extra
> spaces.
>
> 3) How does training work ? Should the training data include all the
> possible segmentations ? e.g. If I have a sentence (surface forms) in
> Finnish:
>
> Näitä siirtoja nopeutettiin tuntuvasti vuonna 1998 .
> Redeployment was stepped up in 1998 .
>
> Should I include:
>
> Näitä siirto >j >a nopeutettiin tuntuvasti vuote >na 1998 .
> Näitä siirtoja nopeutettiin tuntuvasti vuote >na 1998 .
> Näitä siirto >j >a nopeutettiin tuntuvasti vuonna 1998 .
> [etc.]
>
You can, but each segmentation would be given the same weighting, and that
weighting would be the same as for a parallel sentence that has only 1
possible segmentation, so an ambiguous sentence ends up counting several
times. Ideally, the ambiguous sentence should be downweighted.
You can change the extract program to do this downweighting. (It might
already have been done in the last few months.)
You also have to give it word alignments for each possible segmentation.
I'm not sure what the best way to run GIZA++ for this is.
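As a rough sketch of the downweighting idea, assuming you have a list of candidate segmentations per surface word: enumerate every combination and give each variant a weight of 1/N, so the N variants together count as one sentence. The function name `weighted_segmentations`, the input layout, and the 1/N scheme are my own assumptions for illustration. I haven't checked what weight format (if any) the extract program accepts, and each variant would still need its own word alignment.

```python
from itertools import product

def weighted_segmentations(alternatives):
    """alternatives: one list of candidate segmentations per surface word.
    Returns (segmented_sentence, weight) pairs whose weights sum to 1."""
    variants = [" ".join(choice) for choice in product(*alternatives)]
    weight = 1.0 / len(variants)
    return [(variant, weight) for variant in variants]

# Hypothetical analyses for the example sentence; '>' marks a suffix boundary.
alternatives = [
    ["Näitä"],
    ["siirtoja", "siirto >j >a"],
    ["nopeutettiin"],
    ["tuntuvasti"],
    ["vuonna", "vuote >na"],
    ["1998"],
    ["."],
]

pairs = weighted_segmentations(alternatives)
for sentence, weight in pairs:
    print(f"{weight:.2f}\t{sentence}")
```

A sentence with a single segmentation keeps weight 1.0 under this scheme, while the example above yields four variants at 0.25 each.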
> (where '>' indicates a suffix morpheme boundary). I read Dyer et al.
> (2008) paper, and what I'd like to do is similar to the Arabic setup,
> but how the training corpus was processed is not clear (at least to
> me). :)
>
cdec might have more support for lattices, since that was Chris Dyer's thesis topic.
> Thanks in advance for any help!
>
> Fran
>
> 1. http://www.statmt.org/moses/?n=Moses.WordLattices
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu