Hi, it seems best to record the location of the XML markup, strip it out before translation, and re-insert it into the output afterwards. The exact position of the markup in the output can be determined from the phrase and word alignment of the translation.
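
The strip-and-re-insert idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not how Moses does it internally: the function names `strip_markup` and `reinsert_markup` are hypothetical, the tags are assumed to be well-formed and attribute-free, and the word alignment is assumed to be available as (source index, target index) pairs. The target sentence in the example is made up.

```python
import re

# Matches a simple opening or closing tag such as <num> or </num>.
TAG = re.compile(r"</?\w+>")


def strip_markup(sentence):
    """Return (words, spans); spans are (tag, start, end) over word indices, end exclusive."""
    # Put spaces around tags so "<num>19</num>" splits into three pieces.
    pieces = TAG.sub(lambda m: f" {m.group(0)} ", sentence).split()
    words, spans, open_tags = [], [], []
    for piece in pieces:
        if TAG.fullmatch(piece):
            if piece[1] == "/":
                name, start = open_tags.pop()  # assumes well-formed markup
                spans.append((name, start, len(words)))
            else:
                open_tags.append((piece[1:-1], len(words)))
        else:
            words.append(piece)
    return words, spans


def reinsert_markup(target_words, spans, alignment):
    """Wrap the target words aligned to each tagged source span in the same tag."""
    opens, closes = {}, {}
    for name, start, end in spans:
        targets = sorted(t for s, t in alignment if start <= s < end)
        if not targets:
            continue  # nothing aligned to this span, so the tag is dropped
        opens.setdefault(targets[0], []).insert(0, name)
        closes.setdefault(targets[-1], []).append(name)
    out = []
    for i, word in enumerate(target_words):
        prefix = "".join(f"<{n}>" for n in opens.get(i, []))
        suffix = "".join(f"</{n}>" for n in closes.get(i, []))
        out.append(prefix + word + suffix)
    return " ".join(out)


words, spans = strip_markup("it costs <num>19</num> dollars")
# Made-up reordered output and alignment, purely for illustration.
target = ["19", "Dollar", "kostet", "es"]
alignment = {(0, 3), (1, 2), (2, 0), (3, 1)}
print(reinsert_markup(target, spans, alignment))
```

The decoder only ever sees the plain words, so the tags cannot affect the language model score. If the aligned target words are not contiguous, this sketch simply wraps everything from the first to the last aligned word, which may or may not be what you want.
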

You could also just leave the tags in, but since "<num>19</num>" is treated as a single token, you may want to insert spaces around the tags. Even then, the tags may get reshuffled by arbitrary preferences of the language model.

-phi

On Sat, Dec 17, 2011 at 2:38 AM, somayeh bakhshaei <[email protected]> wrote:
> Hello,
>
> We intend to add XML tags to our corpus, but we are not sure how the Moses
> decoder and SRILM use these tags in the training and decoding phases.
>
> For example, if we tag 19 in the main corpus like this:
> 19 ---> <num>19</num>
>
> How must the LM be built on this tagged corpus using SRILM?
> Does SRILM consider <num> or <num>19</num> as a token?
>
> Also, in the decoding phase:
> How does Moses pass the tagged tokens to the LM?
> For example, if the test set is tagged like this:
> <num>19</num>
> Does it pass just <num> or the whole of it as <num>19</num>?
>
>
> ---------------------
> Best Regards,
> S.Bakhshaei
>
> After All you will come ....
> And will spread light on the dark desolate world!
> O' Kind Father! We will be waiting for your affectionate hands ...
>
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
