Hi,
Just to clarify corpus preparation for LM estimation:
SRILM & KenLM: do not add <s> or </s> to the corpus. They are added
internally (unless you disable that). So in this case, the LM training
corpus should look like this:
<ss> this is a small house . </ss>
IRSTLM: decorate the corpus with <s> and </s> before building, possibly
by using their script.
<s> <ss> I like ice cream . </ss> </s>
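
A rough sketch of the two conventions above, in Python. The helper name
decorate and the toolkit table are made up for illustration and are not
part of SRILM, KenLM, or IRSTLM; <ss> and </ss> stand in for whatever
custom markers you use.

```python
# Hypothetical helper, not shipped with any of the toolkits named above.
# True means the toolkit adds <s> and </s> itself by default.
ADDS_BOUNDARIES_INTERNALLY = {"srilm": True, "kenlm": True, "irstlm": False}


def decorate(line, toolkit, sos="<ss>", eos="</ss>"):
    """Wrap one tokenized sentence in custom markers.

    <s> and </s> are added only for toolkits that do not insert
    them on their own.
    """
    tokens = line.split()
    wrapped = [sos] + tokens + [eos]
    if not ADDS_BOUNDARIES_INTERNALLY[toolkit]:
        wrapped = ["<s>"] + wrapped + ["</s>"]
    # Join with single spaces so every marker is a separate token.
    return " ".join(wrapped)
```

Every marker comes out separated by a space, which is what makes the LM
toolkit see it as a token of its own.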
For what it's worth, cdec has the option to not add <s> and </s>
(expecting that your grammar will supply them). It also has harsh
penalties if these symbols do not appear or appear in the wrong places.
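
If you supply the symbols yourself, it may be worth checking their
placement before decoding. The following is only a sketch of such a
check, assuming whitespace-tokenized lines; the function name
boundary_problems is hypothetical and this is not cdec's own logic.

```python
def boundary_problems(line, sos="<s>", eos="</s>"):
    """Return a list of problems with boundary-symbol placement in one line."""
    tokens = line.split()
    problems = []
    if not tokens or tokens[0] != sos:
        problems.append("missing " + sos + " at start")
    if not tokens or tokens[-1] != eos:
        problems.append("missing " + eos + " at end")
    # Any boundary symbol strictly inside the sentence is misplaced.
    for i, tok in enumerate(tokens[1:-1], start=1):
        if tok in (sos, eos):
            problems.append("stray " + tok + " at position " + str(i))
    return problems
```

An empty list means the line has <s> first, </s> last, and neither
symbol anywhere in between.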
Kenneth
On 02/12/13 17:11, Tom Hoar wrote:
> Yes, John. The solutions look the same. It's very possible I got the
> idea from your earlier writings.
>
> We don't always use this technique. When we do use the markers, we rarely
> find extraneous ones. We use them more regularly in the recaser data,
> where they help force first-word casing. I think extraneous markers occur
> in fewer than 1 in 15,000 segments. Maybe it's the data, because surely
> the choice of tokens wouldn't matter.
>
> On 2013-02-12 23:28, John D. Burger wrote:
>
>> This sounds like our workaround. Just to make sure I understand, Tom, it
>> sounds like you add your own extra markers to everything, both for alignment
>> and language modeling, so the parallel files look like this (using<ss>
>> and</ss> instead of your music symbols):
>>
>> <ss> das ist ein kleines haus . </ss>
>> <ss> this is a small house . </ss>
>>
>> and the language modeling files look like this:
>>
>> <s> <ss> I like ice cream . </ss> </s>
>>
>> And finally the input to the decoder looks like this:
>>
>> <ss> alles klar , herr kommissar ? </ss>
>>
>> So then Moses adds its own markers internally to the input, and blithely
>> treats <ss> and </ss> as ordinary tokens. Is this your setup? We did find
>> that our markers occasionally get inserted mid-sentence, and a post-process
>> is necessary to remove them. Just out of curiosity, does that ever happen
>> for you?
>>
>> Thanks.
>>
>> - JB
>>
>> On Feb 12, 2013, at 10:53 , Tom Hoar wrote:
>>> Based on last year's eos marker discussions, we started using
>>> alternate sos/eos markers in both parallel and lm corpora. We settled
>>> on two obscure Unicode characters, U+1D179 Musical Symbol Begin Phrase
>>> and U+1D17A Musical Symbol End Phrase. As in standard corpus
>>> preparation, the parallel corpora do not use <s></s> and the lm corpora
>>> do. We've seen significant improvement in results without needing
>>> to reorder the placement of <s></s> tags.
>>>
>>> On 2013-02-11 00:43, Kenneth Heafield wrote:
>>>> On 02/10/13 17:21, John Joseph Morgan wrote:
>>>>> Hello all, My understanding is that an end-of-sentence marker is
>>>>> inserted by the decoder at some point in the decoding process to
>>>>> give the complete sentence higher probability than shorter segments
>>>>> of the sentence. Is this correct?
>>>> No. Inserting the eos marker gives the complete sentence lower
>>>> probability. p(</s> | foo bar .) < 1. It's inserted to model the end
>>>> of sentence.
>>>>> If so, can the decoder be configured to not insert the eos marker?
>>>>> srilm's ngram-count has a -no-eos option, is there a similar option
>>>>> for the decoder?
>>>> There is no command line option to disable </s>.
>>>>> What are the relevant files where this is coded?
>>>> For phrase-based KenLM, moses/LM/Ken.cpp:255. For phrase-based with
>>>> other lms, moses/LM/Implementation.cpp near 171. For syntax, see
>>>> moses/Sentence.cpp near 187 but beware that </s> controls when the
>>>> glue rule applies.
>>>>> Thanks, John
>>>>> _______________________________________________
>>>>> Moses-support mailing list
>>>>> [email protected]
>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support