Re: [Moses-support] eos marker

Tom Hoar Tue, 12 Feb 2013 09:12:39 -0800

 

Yes, John. The solutions look the same. It's very possible I got the
idea from your earlier writings.


We don't always use this technique. When we do, we rarely find
extraneous markers. We use the markers more regularly in the recaser
data, where they help force first-word casing. I think extraneous
markers occur in fewer than 1 in 15,000 segments. Maybe it's the data,
because surely the choice of tokens wouldn't matter.
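
In case it helps anyone checking their own output, here is a minimal
Python sketch of the cleanup step. The function names are made up for
illustration, it assumes whitespace-tokenized output with the two
musical-symbol markers as standalone tokens, and it is not part of Moses.

```python
# Markers discussed in this thread; helper names below are hypothetical.
SOS = "\U0001D179"  # MUSICAL SYMBOL BEGIN PHRASE
EOS = "\U0001D17A"  # MUSICAL SYMBOL END PHRASE


def has_stray_marker(line):
    """Return True if SOS is anywhere but first, or EOS anywhere but last."""
    toks = line.split()
    last = len(toks) - 1
    for i, tok in enumerate(toks):
        if tok == SOS and i != 0:
            return True
        if tok == EOS and i != last:
            return True
    return False


def strip_markers(line):
    """Drop every marker token, wherever it appears in the line."""
    return " ".join(tok for tok in line.split() if tok not in (SOS, EOS))
```

Counting the lines flagged by has_stray_marker over a decoded test set
is one way to estimate how often the problem occurs in a given setup.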

On 2013-02-12 23:28, John D. Burger wrote:

> This sounds like our workaround. Just to make sure I understand, Tom,
> it sounds like you add your own extra markers to everything, both for
> alignment and language modeling, so the parallel files look like this
> (using <ss> and </ss> instead of your music symbols):
> 
> <ss> das ist ein kleines haus . </ss>
> <ss> this is a small house . </ss>
> 
> and the language modeling files look like this:
> 
> <s> <ss> I like ice cream . </ss> </s>
> 
> And finally the input to the decoder looks like this:
> 
> <ss> alles klar , herr kommissar ? </ss>
> 
> So then Moses adds its own markers internally to the input, and
> blithely treats <ss> and </ss> as ordinary tokens. Is this your setup?
> We did find that our markers occasionally get inserted mid-sentence,
> and a post-process is necessary to remove them. Just out of curiosity,
> does that ever happen for you?
> 
> Thanks.
> 
> - JB
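
A rough Python sketch of the three line formats described above,
assuming plain whitespace-tokenized segments. The helper names (wrap,
lm_line) are hypothetical, and <ss> and </ss> are only the stand-in
markers used in this message, not anything Moses defines.

```python
# Stand-in markers from the message above; swap in any rare tokens.
SS_OPEN = "<ss>"
SS_CLOSE = "</ss>"


def wrap(segment, sos=SS_OPEN, eos=SS_CLOSE):
    """Line for the parallel corpus or decoder input: custom markers only."""
    return f"{sos} {segment.strip()} {eos}"


def lm_line(segment):
    """LM training line as shown above: custom markers inside <s> ... </s>."""
    return f"<s> {wrap(segment)} </s>"


print(wrap("das ist ein kleines haus ."))
print(lm_line("I like ice cream ."))
```

Whether the explicit <s> and </s> are needed in the language modeling
file depends on the toolkit, so treat lm_line as matching the example
above rather than as a general requirement.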
> 
> On Feb 12, 2013, at 10:53, Tom Hoar wrote:
> 
>> Based on last year's eos marker discussions, we started using
>> alternate sos/eos markers in both parallel and lm corpora. We settled
>> on two obscure Unicode characters, U+1D179 Musical Symbol Begin Phrase
>> and U+1D17A Musical Symbol End Phrase. As in standard corpus
>> preparation, the parallel corpora do not use <s></s> and the lm
>> corpora do. We've seen significant improvement in results without
>> needing to reorder the placement of <s></s> tags.
>>
>> On 2013-02-11 00:43, Kenneth Heafield wrote:
>> 
>>> On 02/10/13 17:21, John Joseph Morgan wrote:
>>> 
>>>> Hello all, My understanding is that an end of sentence marker is
>>>> inserted by the decoder at some point in the decoding process to
>>>> give the complete sentence higher probability than shorter segments
>>>> of the sentence. Is this correct?
>>> No. Inserting the eos marker gives the complete sentence lower
>>> probability. p(</s> | foo bar .) < 1. It's inserted to model the end
>>> of sentence.
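
A toy illustration of this point in Python. The probabilities are
invented, not taken from any real language model, and the function names
are hypothetical; the sketch only shows that multiplying the prefix
score by p(</s> | ...) < 1 can never raise it.

```python
import math


def prefix_prob(cond_probs):
    """Chain-rule probability of the words seen so far, without </s>."""
    return math.prod(cond_probs)


def sentence_prob(cond_probs, p_eos):
    """Probability of the complete sentence: prefix times p(</s> | prefix)."""
    return prefix_prob(cond_probs) * p_eos


# Invented values for p(foo | <s>), p(bar | <s> foo), p(. | <s> foo bar)
toy_probs = [0.1, 0.2, 0.5]
toy_p_eos = 0.8  # invented value for p(</s> | foo bar .)

# The complete sentence never scores higher than its own prefix.
print(sentence_prob(toy_probs, toy_p_eos) < prefix_prob(toy_probs))
```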

>>> 
>>>> If so, can the decoder be configured to not insert the eos marker?
>>>> srilm's ngram-count has a -no-eos option, is there a similar option
>>>> for the decoder?
>>> There is no command line option to disable </s>.
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
