Hi Vera
I think the situation you describe can arise even without unaligned
words. Suppose each side is a two-word sentence and the alignment
points are (0,0), (0,1) and (1,0); I think the usual symmetrisation
algorithm can produce this. Then you would extract the phrase pair made
up of the two 2-word phrases, but no phrase pair with a 1-word phrase on
either side, because none of those is consistent with the alignment
points (see below for an example).
You still get lexical weights for the translation of word-0 to word-0,
though, since there is an alignment point there.
cheers - Barry
[hyperion]bhaddow: cat c.en
a b
[hyperion]bhaddow: cat c.fr
A B
[hyperion]bhaddow: cat c.align
0-0 1-0 0-1
[hyperion]bhaddow: ~/moses.new/bin/extract c.en c.fr c.align e 5
PhraseExtract v1.4, written by Philipp Koehn
phrase extraction from an aligned parallel corpus
[hyperion]bhaddow: cat e
A B ||| a b ||| 0-0 1-0 0-1
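
The extraction result above can be reproduced with a small Python sketch of the usual consistency check. This is not the Moses extractor: the function name `extract_phrase_pairs` is made up for illustration, and the sketch only takes the minimal target span for each source span, without the extension over unaligned boundary words that the real tool performs (the toy example has no unaligned words, so that does not matter here).

```python
def extract_phrase_pairs(src, tgt, links, max_len=5):
    """Return phrase pairs consistent with the word alignment.

    links is a set of (src_index, tgt_index) alignment points.
    Hypothetical helper for illustration only.
    """
    pairs = []
    for s1 in range(len(src)):
        for s2 in range(s1, min(s1 + max_len, len(src))):
            # Target positions linked to any word inside the source span.
            tgt_pos = [t for (s, t) in links if s1 <= s <= s2]
            if not tgt_pos:
                continue  # require at least one alignment point
            t1, t2 = min(tgt_pos), max(tgt_pos)
            if t2 - t1 + 1 > max_len:
                continue
            # Consistency: no word in the target span may be linked
            # to a source word outside the source span.
            if any(t1 <= t <= t2 and not (s1 <= s <= s2) for (s, t) in links):
                continue
            pairs.append((" ".join(src[s1:s2 + 1]), " ".join(tgt[t1:t2 + 1])))
    return pairs


print(extract_phrase_pairs(["A", "B"], ["a", "b"], {(0, 0), (1, 0), (0, 1)}))
# prints [('A B', 'a b')]
```

With a purely diagonal alignment {(0,0), (1,1)} the same function also returns the two 1-word pairs, which is the difference between the two cases.
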
[hyperion]bhaddow: ~/moses.new/scripts/training/get-lexical.perl c.en c.fr c.align c
(c.en,c.fr,c)
FILE: c.fr
FILE: c.en
FILE: c.align
!
Saved: c.f2e and c.e2f
[hyperion]bhaddow: cat c.e2f
a A 0.5000000
a B 1.0000000
b A 0.5000000
[hyperion]bhaddow: cat c.f2e
A a 0.5000000
B a 0.5000000
A b 1.0000000
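
The numbers in c.e2f and c.f2e above look like plain relative frequencies over the alignment points. The following Python sketch reproduces them under that assumption: each line is read as "w1 w2 count(w1,w2)/count(w2)". The helper name `lexical_table` is hypothetical, NULL handling for unaligned words is left out, and reading "i-j" as (c.en index, c.fr index) is an assumption; for this toy alignment either index order gives the same set of links.

```python
from collections import Counter


def lexical_table(links):
    """Relative frequency count(w1, w2) / count(w2) for each linked pair.

    Hypothetical helper; links is a list of (w1, w2) word pairs,
    one per alignment point.
    """
    pair_counts = Counter(links)
    cond_counts = Counter(w2 for (_, w2) in links)
    return {(w1, w2): c / cond_counts[w2] for (w1, w2), c in pair_counts.items()}


en = "a b".split()
fr = "A B".split()
points = [(0, 0), (1, 0), (0, 1)]           # contents of c.align
links = [(en[i], fr[j]) for i, j in points]  # assumed index order

e2f = lexical_table(links)
f2e = lexical_table([(f, e) for (e, f) in links])

for (w1, w2), p in sorted(e2f.items()):
    print(f"{w1} {w2} {p:.7f}")
```

The loop prints the same three entries as c.e2f; `f2e` holds the values shown in c.f2e. Every word pair that shares an alignment point gets an entry, whether or not a 1-word phrase pair can be extracted for it.
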
On 27/11/14 16:15, Matthias Huck wrote:
> Hi Vera,
>
> It's odd that the lexical translation model contains such an entry if
> the pair is always unaligned. Maybe you used a different word alignment
> when you extracted the lexicon model?
>
> You should manually have a look at your word alignment in order to check
> whether it has reasonable quality. There's a visualization tool called
> "Picaro" in Moses:
>
> $ moses/contrib/picaro/picaro.py -a1 model/aligned.1.grow-diag-final-and -f model/aligned.1.0.de -e model/aligned.1.0.en
>
> In order to find out whether the symmetrization heuristic is an issue
> for you, you can compare the standard and inverse GIZA alignments with
> the symmetrized alignment.
>
> Ways to experiment with word alignment quality are for instance:
>
> - Choosing a different symmetrization heuristic
> - Modifying the GIZA settings, e.g. by training with a different number
> of EM iterations or a different sequence of IBM/HMM models
> - Using some other method for training word alignments, e.g. a
> discriminative word aligner
>
> Also, if the amount of parallel training data is small, you shouldn't be
> surprised if you are not able to train reliable models.
>
> Cheers,
> Matthias
>
>
> On Thu, 2014-11-27 at 14:45 +0100, Vera Aleksic, Linguatec GmbH wrote:
>> Hi,
>>
>> I have one more question:
>> In the lex.e2f file there is a translation Gitarre->guitar:
>>
>> Gitarre guitar 0.4000000
>> Gitarre using 0.0000284
>> Gitarre ; 0.0000017
>>
>> Why has it not become part of the phrase table?
>>
>> Thanks again!
>> Vera
>>
>> -----Original Message-----
>> From: Vera Aleksic, Linguatec GmbH
>> Sent: Thursday, 27 November 2014 09:42
>> To: 'Matthias Huck'; Raj Dabre
>> Subject: RE: [Moses-support] Unknown single words that are part of phrases
>>
>> Hi,
>> Thank you for your answers.
>> @Raj, one-word translations do not exist; I have searched for them. If the
>> grow-diag method is the likely cause of such phenomena, are there any
>> better alternatives?
>> @Matthias, you are right, the pair Gitarre-guitar is always unaligned, but I
>> do not really understand why. Why is "guitar" in the example below aligned
>> to "Musikinstrument" and "Gitarre", and not to "Gitarre" only? I assume
>> that decomposing "Musik + Instrument" would help? How else could I improve
>> the word alignment quality?
>> Thanks!
>> Best,
>> Vera
>>
>> für ein Musikinstrument wie eine elektrische Gitarre ,
>> NULL ({ }) for ({ 1 }) a ({ 2 }) musical ({ }) instrument ({ }) , ({ }) such ({ }) as ({ 4 }) an ({ 5 }) electric ({ 6 }) guitar ({ 3 7 }) ; ({ 8 })
>>
>> -----Original Message-----
>> From: Matthias Huck [mailto:[email protected]]
>> Sent: Wednesday, 26 November 2014 17:54
>> To: Raj Dabre
>> Cc: Vera Aleksic, Linguatec GmbH; moses-support
>> Subject: Re: [Moses-support] Unknown single words that are part of phrases
>>
>> Hi,
>>
>> Supposedly your phrase table does not contain an entry "Gitarre ||| guitar"
>> because this word pair is always unaligned in your training data. You could
>> try to improve your word alignment quality.
>>
>> Alternatively, you could implement a procedure in the manner of the "forced
>> single word heuristic" as described in:
>> D. Stein, D. Vilar, S. Peitz, M. Freitag, M. Huck, and H. Ney. A Guide to
>> Jane, an Open Source Hierarchical Translation Toolkit. The Prague Bulletin
>> of Mathematical Linguistics, number 95, pages 5-18, Prague, Czech Republic,
>> April 2011.
>> http://ufal.mff.cuni.cz/pbml/95/art-stein-vilar-ney-jane.pdf
>> (see Fig. 1c).
>>
>> But the latter would rather be a workaround.
>>
>> Cheers,
>> Matthias
>>
>>
>> On Thu, 2014-11-27 at 01:18 +0900, Raj Dabre wrote:
>>> Hello,
>>>
>>>
>>> If I am not mistaken, this is most likely due to the grow(-diag) heuristic
>>> applied to the word alignments (both directions) before phrase extraction.
>>>
>>> Also, one-word translations should exist, though not always; search
>>> for them.
>>>
>>>
>>>
>>> Regards.
>>>
>>>
>>> On Thu, Nov 27, 2014 at 12:53 AM, Vera Aleksic, Linguatec GmbH
>>> <[email protected]> wrote:
>>> Hi,
>>>
>>> I have often observed that some words do not appear as single-word
>>> translations in the phrase table, although they occur in the training
>>> corpus and in multiword phrases.
>>> An example:
>>> The German-English translation for "Gitarre" is unknown, i.e. there is
>>> no single-word entry for "Gitarre" in the phrase table, although other
>>> phrases containing this word exist (see below).
>>> How is this possible?
>>> Thanks and best regards,
>>> Vera
>>>
>>>
>>> Gitarre , ||| guitar ; ||| 1 0.0284465 1 0.0654272 2.718 ||| ||| 1 1
>>> Gitarre darstellt , unter Beanspruchung ||| guitar using ||| 0.25 2.7351e-11 1 0.0625119 2.718 ||| ||| 4 1
>>> Gitarre darstellt , unter ||| guitar using ||| 0.25 1.18917e-05 1 0.0625119 2.718 ||| ||| 4 1
>>> Gitarre darstellt , ||| guitar using ||| 0.25 0.00569228 1 0.0625119 2.718 ||| ||| 4 1
>>> Gitarre darstellt ||| guitar using ||| 0.25 0.0400028 1 0.0625119 2.718 ||| ||| 4 1
>>> Kopfplatte einer Gitarre darstellt , ||| head of a guitar using ||| 0.5 4.23407e-08 1 0.00471281 2.718 ||| ||| 2 1
>>> Kopfplatte einer Gitarre darstellt ||| head of a guitar using ||| 0.5 2.97552e-07 1 0.00471281 2.718 ||| ||| 2 1
>>> eine elektrische Gitarre , ||| an electric guitar ; ||| 1 0.00107982 1 0.00163632 2.718 ||| ||| 1 1
>>> einer Gitarre darstellt , unter ||| of a guitar using ||| 0.333333 6.4754e-07 1 0.00471281 2.718 ||| ||| 3 1
>>> einer Gitarre darstellt , ||| of a guitar using ||| 0.333333 0.000309961 1 0.00471281 2.718 ||| ||| 3 1
>>> einer Gitarre darstellt ||| of a guitar using ||| 0.333333 0.00217827 1 0.00471281 2.718 ||| ||| 3 1
>>> elektrische Gitarre , ||| electric guitar ; ||| 1 0.005661 1 0.0142097 2.718 ||| ||| 1 1
>>> wie eine elektrische Gitarre , ||| as an electric guitar ; ||| 1 0.000177339 1 0.000809485 2.718 ||| ||| 1 1
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> [email protected]
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>>
>>>
>>> --
>>> Raj Dabre.
>>> Research Student,
>>>
>>> Graduate School of Informatics,
>>> Kyoto University.
>>> CSE MTech, IITB., 2011-2014
>>>
>>>
>>
>>
>> --
>> The University of Edinburgh is a charitable body, registered in Scotland,
>> with registration number SC005336.
>>
>>
>
>