Hi,

        Just to clarify corpus preparation for LM estimation:

SRILM & KenLM: do not add <s> or </s> to the corpus.  They are added 
internally (unless you disable it).  So in this case, the LM file should 
look like this:

<ss> this is a small house . </ss>

IRSTLM: decorate the corpus with <s> and </s> before building, possibly 
by using their script.

<s> <ss> I like ice cream . </ss> </s>
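
As a rough illustration of the two conventions above, here is a minimal Python sketch. The function name `decorate` and its `toolkit` argument are made up for this example and are not part of any of these toolkits; it assumes one tokenized sentence per line and the <ss> / </ss> markers used in this thread.

```python
def decorate(line, toolkit, sos="<ss>", eos="</ss>"):
    """Wrap one tokenized sentence in the custom markers.

    Hypothetical helper: <s> and </s> are added only for IRSTLM,
    because SRILM and KenLM add them internally by default.
    """
    wrapped = [sos] + line.split() + [eos]
    if toolkit == "irstlm":
        wrapped = ["<s>"] + wrapped + ["</s>"]
    elif toolkit not in ("srilm", "kenlm"):
        raise ValueError("unknown toolkit: " + toolkit)
    return " ".join(wrapped)


if __name__ == "__main__":
    print(decorate("this is a small house .", "kenlm"))
    print(decorate("I like ice cream .", "irstlm"))
```

IRSTLM's own script remains the safer choice for the <s> and </s> step; the sketch only shows where the markers end up.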

        For what it's worth, cdec has the option to not add <s> and </s> 
(expecting that your grammar will supply them).  It also has harsh 
penalties if these symbols do not appear or appear in the wrong places.
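
The post-processing step John mentions below (removing custom markers that the decoder leaves in the output, including stray ones mid-sentence) could be sketched roughly as follows. This is only an assumption about what such a filter might look like, not anyone's actual script; `strip_markers` is a hypothetical name and the marker set is the <ss> / </ss> pair from this thread.

```python
MARKERS = frozenset({"<ss>", "</ss>"})


def strip_markers(line, markers=MARKERS):
    """Drop every marker token from a tokenized output line.

    Hypothetical helper: removes the markers wherever they occur,
    at the sentence boundaries as well as mid-sentence.
    """
    return " ".join(tok for tok in line.split() if tok not in markers)


if __name__ == "__main__":
    print(strip_markers("<ss> all clear , <ss> inspector ? </ss>"))
```

Matching whole tokens rather than substrings avoids touching words that merely contain the marker text.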

Kenneth

On 02/12/13 17:11, Tom Hoar wrote:
> Yes, John. The solutions look the same. It's very possible I got the
> idea from your earlier writings.
>
> We don't always use this technique. When we do use it, we've rarely
> found extraneous markers. We use the markers more regularly in the
> recaser data, where they help force first-word casing. I think stray
> markers occur in fewer than 1 in 15,000 segments. Maybe it's the data,
> because surely the choice of tokens wouldn't matter.
>
> On 2013-02-12 23:28, John D. Burger wrote:
>
>> This sounds like our workaround.  Just to make sure I understand, Tom, it 
>> sounds like you add your own extra markers to everything, both for alignment 
>> and language modeling, so the parallel files look like this (using <ss>
>> and </ss> instead of your music symbols):
>>
>>    <ss> das ist ein kleines haus . </ss>
>>    <ss> this is a small house . </ss>
>>
>> and the language modeling files look like this:
>>
>>    <s> <ss> I like ice cream . </ss> </s>
>>
>> And finally the input to the decoder looks like this:
>>
>>    <ss> alles klar , herr kommissar ? </ss>
>>
>> So then Moses adds its own markers internally to the input, and blithely 
>> treats <ss> and </ss> as ordinary tokens.  Is this your setup?  We did find 
>> that our markers occasionally get inserted mid-sentence, and a post-process 
>> is necessary to remove them.  Just out of curiosity, does that ever happen 
>> for you?
>>
>> Thanks.
>>
>> - JB
>>
>> On Feb 12, 2013, at 10:53 , Tom Hoar wrote:
>>> Based on last year's eos marker discussions, we started using
>>> alternate sos/eos markers in both parallel and LM corpora. We settled
>>> on two obscure Unicode characters, U+1D179 Musical Symbol Begin Phrase
>>> and U+1D17A Musical Symbol End Phrase. As in standard corpus
>>> preparation, the parallel corpora do not use <s></s> and the LM corpora
>>> do. We've seen significant improvement in results without the need
>>> to reorder the placement of <s></s> tags.
>>>
>>> On 2013-02-11 00:43, Kenneth Heafield wrote:
>>>> On 02/10/13 17:21, John Joseph Morgan wrote:
>>>>> Hello all, My understanding is that an end-of-sentence marker is
>>>>> inserted by the decoder at some point in the decoding process to
>>>>> give the complete sentence higher probability than shorter segments
>>>>> of the sentence. Is this correct?
>>>> No. Inserting the eos marker gives the complete sentence lower
>>>> probability. p(</s> | foo bar .) < 1. It's inserted to model the end
>>>> of sentence.
>>>>> If so, can the decoder be configured to not insert the eos marker?
>>>>> srilm's ngram-count has a -no-eos option, is there a similar option
>>>>> for the decoder?
>>>> There is no command line option to disable </s>.
>>>>> What are the relevant files where this is coded?
>>>> For phrase-based KenLM, moses/LM/Ken.cpp:255. For phrase-based with
>>>> other lms, moses/LM/Implementation.cpp near 171. For syntax, see
>>>> moses/Sentence.cpp near 187 but beware that </s> controls when the
>>>> glue rule applies.
>>>>> Thanks, John
>>>>> _______________________________________________
>>>>> Moses-support mailing list
>>>>> [email protected] <mailto:[email protected]>
>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support