You shouldn't keep them: the & and ; would be tokenized as separate tokens and pollute your sentences.
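To illustrate, here is a minimal Python sketch of the problem and one possible fix. It assumes the corpus contains an entity such as `&apos;` (a hypothetical example, since the exact entities in the original corpus are not shown here), and `naive_tokenize` is only a stand-in for a real tokenizer, not the Moses one. The decoding uses `html.unescape` from the Python standard library.

```python
import html
import re

def naive_tokenize(text):
    # Hypothetical stand-in tokenizer: runs of word characters,
    # or any single non-space symbol, each become one token.
    return re.findall(r"\w+|[^\w\s]", text)

# Assumed raw corpus line with an undecoded HTML entity.
raw = "ain &apos;t nobody happy ."

# Without decoding, the entity is split into three junk tokens.
print(naive_tokenize(raw))
# ['ain', '&', 'apos', ';', 't', 'nobody', 'happy', '.']

# Decode entities first, then tokenize.
clean = html.unescape(raw)
print(clean)
# ain 't nobody happy .
print(naive_tokenize(clean))
# ['ain', "'", 't', 'nobody', 'happy', '.']
```

Whatever tool is used, the decoding step should run on both sides of the parallel corpus before tokenization, so that source and target stay consistent.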
There are tools to convert them; I think there is at least a Perl module for it. Search for "HTML decoding". Also, they are called HTML entities, not tags.

On Wed, Jul 24, 2013 at 2:16 PM, Cyrine NASRI <[email protected]> wrote:
> Hello,
>
> I use a training corpus to build my translation system.
>
> But I found in this corpus some HTML tags, for instance:
>
> "and i 'm going to start with this one : if momma ain 't happy ,
> ain 't nobody happy ."
>
> Should I eliminate them or keep them?
>
> Thank you in advance for your replies
>
> Best
> --
> Cyrine NASRI
> Ph.D. Student in Computer Science
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
