Yes, John. The solutions look the same. It's very possible I got the idea from your earlier writings.
We don't always use this technique. When we do, we rarely find extraneous markers. We use it more regularly in the recaser data, where it helps force first-word casing. I think stray markers occur in fewer than 1 in 15,000 segments. Maybe the difference is in the data, because surely the choice of tokens wouldn't matter.

On 2013-02-12 23:28, John D. Burger wrote:
> This sounds like our workaround. Just to make sure I understand, Tom, it sounds like you add your own extra markers to everything, both for alignment and language modeling, so the parallel files look like this (using <ss> and </ss> instead of your music symbols):
>
> <ss> das ist ein kleines haus . </ss>
> <ss> this is a small house . </ss>
>
> and the language modeling files look like this:
>
> <s> <ss> I like ice cream . </ss> </s>
>
> And finally the input to the decoder looks like this:
>
> <ss> alles klar , herr kommissar ? </ss>
>
> So then Moses adds its own markers internally to the input, and blithely treats <ss> and </ss> as ordinary tokens. Is this your setup? We did find that our markers occasionally get inserted mid-sentence, and a post-process is necessary to remove them. Just out of curiosity, does that ever happen for you?
>
> Thanks.
>
> - JB
>
> On Feb 12, 2013, at 10:53 , Tom Hoar wrote:
>
>> Based on last year's eos marker discussions, we started using alternate sos/eos markers in both parallel and lm corpora. We settled on two obscure Unicode characters, U+1D179 Musical Symbol Begin Phrase and U+1D17A Musical Symbol End Phrase. As in standard corpus preparation, the parallel corpora do not use <s></s> and the lm corpora do. We've seen significant improvement in results without needing to reorder the placement of <s></s> tags.
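For illustration, the marker handling described above can be sketched in a few lines of Python. This is only a sketch under assumptions, not anyone's actual pipeline: the function names wrap_parallel, wrap_lm and strip_markers are hypothetical, the text is assumed to be already tokenized with single spaces, and the LM format simply follows the <s> <ss> ... </ss> </s> layout given in the thread.

```python
# Custom sentence markers named in the thread.
SOS = "\U0001D179"  # U+1D179 MUSICAL SYMBOL BEGIN PHRASE
EOS = "\U0001D17A"  # U+1D17A MUSICAL SYMBOL END PHRASE


def wrap_parallel(segment: str) -> str:
    """Wrap one tokenized segment for the parallel corpus or decoder input."""
    return f"{SOS} {segment.strip()} {EOS}"


def wrap_lm(sentence: str) -> str:
    """Wrap one tokenized sentence for LM training, keeping <s> and </s> outside."""
    return f"<s> {wrap_parallel(sentence)} </s>"


def strip_markers(output: str) -> str:
    """Post-process decoder output: drop every custom marker token,
    including any that ended up mid-sentence."""
    tokens = [t for t in output.split() if t not in (SOS, EOS)]
    return " ".join(tokens)


if __name__ == "__main__":
    print(wrap_parallel("das ist ein kleines haus ."))
    print(wrap_lm("I like ice cream ."))
    print(strip_markers(f"{SOS} this is {EOS} a small house . {EOS}"))
```

Because strip_markers filters on whole tokens, it removes a stray mid-sentence marker the same way it removes the leading and trailing ones, which is the post-process John mentions.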
>> On 2013-02-11 00:43, Kenneth Heafield wrote:
>>
>>> On 02/10/13 17:21, John Joseph Morgan wrote:
>>>
>>>> Hello all, My understanding is that an end-of-sentence marker is inserted by the decoder at some point in the decoding process to give the complete sentence higher probability than shorter segments of the sentence. Is this correct?
>>> No. Inserting the eos marker gives the complete sentence lower probability. p(</s> | foo bar .) < 1. It's inserted to model the end of sentence.
>>>
>>>> If so, can the decoder be configured to not insert the eos marker? srilm's ngram-count has a -no-eos option, is there a similar option for the decoder?
>>> There is no command line option to disable </s>.
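Kenneth's point that p(</s> | foo bar .) < 1 lowers the score of the complete sentence can be checked with a toy chain-rule calculation. The numbers below are made up purely for illustration and do not come from any real language model; log_prob is a hypothetical helper.

```python
import math

# Hypothetical conditional probabilities, for illustration only:
# p(foo | <s>), p(bar | <s> foo), p(. | foo bar)
cond_probs = [0.2, 0.1, 0.5]
# Hypothetical p(</s> | foo bar .); any value below 1 shows the same effect.
p_eos = 0.8


def log_prob(probs):
    """Chain rule in log10 space: sum the log of each conditional probability."""
    return sum(math.log10(p) for p in probs)


without_eos = log_prob(cond_probs)
with_eos = log_prob(cond_probs + [p_eos])

# Multiplying by a factor below 1 adds a negative log term,
# so the score with </s> is strictly lower.
print(with_eos < without_eos)
```

The extra factor does not reward finishing the sentence; it is simply the model's estimate of how likely the sentence is to end at that point.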
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
