Text to be translated needs to be in the same format as the data used
for training and decoding. Typically, this means:

- tokenising
- lower-casing

but there is nothing in the framework which forces you to do this. For
example, you might want to preserve case information.

Best practice will depend on the volume of material you have. If you
have a lot of data, then it makes sense to keep as much of the original
format (information) as possible. Whenever the text is transformed, you
run the risk of throwing information away, and reconstructing it might
introduce extra errors. But if you do not have much data, or you suspect
that it contains noise, then cleaning etc. might yield good results.

Miles

On 6 August 2010 14:36, Gary Daine <[email protected]> wrote:
> I have a very basic-sounding question, but I've not been able to find
> any reference in the documentation.
>
> Since Moses is trained on tokenized, lowercased corpora, is it necessary
> to tokenize and lowercase the text to be translated as well (and do the
> reverse to the output)?
>
> TIA
>
> Gary
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
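For reference, Moses ships Perl scripts for exactly this (e.g. `tokenizer.perl` and `lowercase.perl` under the `scripts/` directory, with a detokenizer and recaser for the output side). The sketch below is not those scripts; it is a minimal, illustrative Python version of the round trip — a crude punctuation tokeniser plus lowercasing on the way in, and detokenisation plus sentence-initial recasing on the way out — just to make the symmetry concrete. The function names and the regexes are my own assumptions, not part of Moses.

```python
import re

def preprocess(line):
    """Tokenise (split punctuation off words) and lowercase, mirroring
    the normalisation applied to the training corpus."""
    line = re.sub(r'([.,!?;:()"])', r' \1 ', line)   # space out punctuation
    line = re.sub(r'\s+', ' ', line).strip()         # collapse extra spaces
    return line.lower()

def postprocess(line):
    """Reverse the transformation on the decoder output: reattach
    punctuation and restore sentence-initial case (a real recaser
    would use a statistical model instead)."""
    line = re.sub(r' ([.,!?;:])', r'\1', line)
    return line[:1].upper() + line[1:] if line else line

print(preprocess("Hello, world!"))     # hello , world !
print(postprocess("hello , world !"))  # Hello, world!
```

The point of the example is only that whatever mapping you apply before decoding, you need an inverse (or a trained truecaser/detokenizer) afterwards — and every lossy step, like lowercasing, is information you may not be able to restore perfectly.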
