> 5. I thought building a bi-lingual corpus was having
>   a text file with a sentence in the source language
>   along with the sentence in the target language.

Yes. At its core, a parallel corpus is a set of sentence pairs: one
sentence in the source language, the other in the target language,
each a translation of the other. Everything else is extra, though
potentially useful.
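
As a rough sketch of what that looks like on disk: Moses training
commonly works from two line-aligned files, one sentence per line,
where line N of one file is the translation of line N of the other.
The Python below assumes your pairs start out in a single
tab-separated file (that input format is my assumption, not a
requirement), and the names split_pairs and write_aligned are made up
for illustration.

```python
from typing import Iterable, List, Tuple


def split_pairs(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split 'source<TAB>target' lines into two line-aligned lists."""
    source: List[str] = []
    target: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines so the two sides stay aligned
        parts = line.split("\t")
        if len(parts) != 2 or not parts[0].strip() or not parts[1].strip():
            raise ValueError(f"expected one source<TAB>target pair: {line!r}")
        source.append(parts[0].strip())
        target.append(parts[1].strip())
    return source, target


def write_aligned(source: List[str], target: List[str],
                  src_path: str, tgt_path: str) -> None:
    """Write the two sides to separate files, one sentence per line."""
    if len(source) != len(target):
        raise ValueError("source and target must have the same length")
    with open(src_path, "w", encoding="utf-8") as src_file:
        src_file.write("".join(s + "\n" for s in source))
    with open(tgt_path, "w", encoding="utf-8") as tgt_file:
        tgt_file.write("".join(t + "\n" for t in target))
```

For example, split_pairs(open("pairs.tsv", encoding="utf-8")) followed
by write_aligned(src, tgt, "corpus.en", "corpus.fr") would give you
the two files (the file names here are placeholders).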

Tokenization is an optional preprocessing step (with detokenization
as the matching post-processing step) that you can apply if you have
the tools. Parsing is useful for factored translation with the Moses
decoder, but it is not strictly necessary; you can work just fine
with basic text. Dictionaries and lexicons can also be useful, but
again, they are not necessary.
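
To give a feel for what tokenization does, here is a deliberately
naive sketch in Python. It is not the tokenizer that ships with
Moses, just a stand-in that splits punctuation away from words; the
name naive_tokenize and the regular expression are my own choices,
and a real tokenizer handles abbreviations, numbers, and
language-specific rules that this ignores.

```python
import re

# A token is either a run of word characters or a single character
# that is neither a word character nor whitespace (i.e. punctuation).
_TOKEN = re.compile(r"\w+|[^\w\s]")


def naive_tokenize(sentence: str) -> str:
    """Return the sentence with tokens separated by single spaces."""
    return " ".join(_TOKEN.findall(sentence))
```

Whatever tokenizer you use, apply the same one to training data and
to the text you later translate.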

> 7. How big the corpus should be to get relatively accurate results?

It depends on the translation task (domain-specific or not, sentence
length, complexity of the languages, and so on).

A simpler answer is: as large as possible, always :)

Cheers,
~amittai
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
