> 5. I thought building a bi-lingual corpus was having
> a text file with a sentence in the source language
> along with the sentence in the target language.
Yes. The basis of a parallel corpus is a set of sentence pairs: one sentence in the source language, the other in the target language, each a translation of the other. Everything else is extra, though potentially useful.

Tokenization is an optional pre- or post-processing step you can do if you have the tools. Parsing is useful for factored translation with the Moses decoder, but it is not strictly necessary; you can work just fine with plain text. Dictionaries and lexicons can also be useful but, again, are not necessary.

> 7. How big the corpus should be to get relatively accurate results?

It depends on the translation task (domain-specific or not, length of sentences, complexity of the languages, etc.). A simpler answer is: as large as possible, always :)

Cheers,
~amittai

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
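To illustrate the sentence-pair layout described in the reply above, here is a minimal Python sketch. It assumes the pairs start out in a single tab-separated text file (source sentence, tab, target sentence), which is an assumption made for the example and not something the thread specifies. Moses training typically takes the corpus as two line-aligned files, one per language, so the sketch splits the pairs that way. The function names and any file names are hypothetical.

```python
def read_pairs(lines):
    """Yield (source, target) pairs from tab-separated lines.

    Lines that do not contain exactly one tab, or that have an empty
    side, are skipped so the two output files stay aligned.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue
        source, target = parts[0].strip(), parts[1].strip()
        if source and target:
            yield source, target


def write_parallel(pairs, source_path, target_path):
    """Write pairs to two line-aligned files; return the number written.

    Line i of source_path is the translation of line i of target_path.
    """
    count = 0
    with open(source_path, "w", encoding="utf-8") as src, \
            open(target_path, "w", encoding="utf-8") as tgt:
        for source, target in pairs:
            src.write(source + "\n")
            tgt.write(target + "\n")
            count += 1
    return count
```

For example, calling `write_parallel(read_pairs(open("pairs.tsv", encoding="utf-8")), "corpus.en", "corpus.fr")` would produce one file per language, with matching line numbers for each pair. The file names here are placeholders only.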
