This sounds like our workaround. Just to make sure I understand, Tom: you add your own extra markers to everything, both for alignment and language modeling, so the parallel files look like this (using <ss> and </ss> instead of your music symbols):

<ss> das ist ein kleines haus . </ss>
<ss> this is a small house . </ss>

and the language modeling files look like this:

<s> <ss> I like ice cream . </ss> </s>

And finally the input to the decoder looks like this:

<ss> alles klar , herr kommissar ? </ss>

So then Moses adds its own markers internally to the input, and blithely treats <ss> and </ss> as ordinary tokens. Is this your setup?

We did find that our markers occasionally get inserted mid-sentence, and a post-process is necessary to remove them. Just out of curiosity, does that ever happen for you?

Thanks.

- JB

On Feb 12, 2013, at 10:53 , Tom Hoar wrote:

> Based on last year's eos marker discussions, we started using alternate
> sos/eos markers in both parallel and lm corpora. We settled on two obscure
> UTF-8 characters U+1D179 Musical Symbol Begin Phrase and U+1D17A Musical
> Symbol End Phrase. As in standard corpus preparation, the parallel corpora
> do not use <s></s> and the lm corpora do. We've seen significant improvement
> in results without the need to reorder the placement of <s></s> tags.
>
> On 2013-02-11 00:43, Kenneth Heafield wrote:
>
>> On 02/10/13 17:21, John Joseph Morgan wrote:
>>> Hello all, My understanding is that an end of sentence marker is inserted
>>> by the decoder at some point in the decoding process to give the complete
>>> sentence higher probability than shorter segments of the sentence. Is this
>>> correct?
>> No. Inserting the eos marker gives the complete sentence lower
>> probability. p(</s> | foo bar .) < 1. It's inserted to model the end
>> of sentence.
>>
>>> If so, can the decoder be configured to not insert the eos marker? srilm's
>>> ngram-count has a -no-eos option, is there a similar option for the decoder?
>> There is no command line option to disable </s>.
>>> What are the relevant files where this is coded?
>> For phrase-based KenLM, moses/LM/Ken.cpp:255. For phrase-based with
>> other lms, moses/LM/Implementation.cpp near 171. For syntax, see
>> moses/Sentence.cpp near 187 but beware that </s> controls when the glue
>> rule applies.
>>
>>> Thanks, John

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
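
For anyone following the thread, the marker workaround and the post-processing step mentioned above could be sketched roughly as follows. This is only an illustration under assumptions: the helper names wrap_sentence and strip_markers are hypothetical and not part of Moses, and <ss> / </ss> are the stand-in markers used in this message rather than the actual musical symbols.

```python
# Stand-in markers from the thread; a real setup might instead use
# U+1D179 and U+1D17A. These constants are assumptions for illustration.
SOS = "<ss>"
EOS = "</ss>"


def wrap_sentence(sentence: str) -> str:
    """Surround one tokenized sentence with the custom start/end markers."""
    return f"{SOS} {sentence.strip()} {EOS}"


def strip_markers(decoder_output: str) -> str:
    """Drop every custom marker token from decoder output.

    This removes the expected markers at the edges as well as any stray
    marker that ended up mid-sentence.
    """
    tokens = [t for t in decoder_output.split() if t not in (SOS, EOS)]
    return " ".join(tokens)


if __name__ == "__main__":
    print(wrap_sentence("alles klar , herr kommissar ?"))
    print(strip_markers("<ss> this is </ss> a small house . </ss>"))
```

The same wrap_sentence call would be applied to both sides of the parallel corpus and to the decoder input, while strip_markers runs on the decoder output only.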
