It depends on how much a priori knowledge you have about the languages. At the moment people tend to fall into two camps: those who want to use statistical engines and those who want to go for rule-based engines. According to one person there is some activity to include rules in statistical engines and vice versa, but it still needs a lot of work.
Identifying a language isn't that difficult in itself; most search engines are quite good at that. Many engines can even be told to interpret the text according to a specific language, so the problem is basically non-existent for us. Still, because our articles have a lot of text that isn't part of a single language, and in addition there is also specialized markup, some kind of parsing should be done before the translation engine starts processing the text.

After some discussions last winter I am quite sure a rule-based engine works best for small languages, but that a working solution should use some kind of self-learning mechanism to refine the translation, or at least to identify errors. Our idea was to use statistics to identify cases where existing rules failed, and let people define the new rules. Failing rules would be detected by checking which translated sentences got changed afterwards. Actually it is a bit more difficult than this,.. ;)

And no, I'm not a linguist...

John

>>> One of the most important things that is needed for adding languages to a
>>> technology like this is having a sufficiently sized corpus.
>> Yes, that was basically my main question: What is sufficient? How many
>> pages or MB of text? At least the order of magnitude.
>>
>> Marcus Buck

_______________________________________________
foundation-l mailing list
foundation-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
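The error-detection loop John describes (flag rules whose output keeps getting post-edited afterwards) could be sketched roughly as follows. This is only an illustration, not anyone's actual implementation: the record format, the idea of tagging each sentence with the rule ids that fired, and the `flag_failing_rules` helper are all assumptions made up for the example.

```python
from collections import Counter

def flag_failing_rules(records, threshold=0.5):
    """Return the ids of rules whose output was post-edited in more than
    `threshold` of the sentences they touched.

    `records` is an iterable of (machine_output, post_edited, rule_ids)
    tuples, where `rule_ids` lists the (hypothetical) translation rules
    that fired when producing that sentence."""
    fired = Counter()   # how many sentences each rule touched
    edited = Counter()  # how many of those were later changed by a human
    for machine_out, post_edit, rule_ids in records:
        changed = machine_out.strip() != post_edit.strip()
        for rule in rule_ids:
            fired[rule] += 1
            if changed:
                edited[rule] += 1
    return [rule for rule in fired
            if edited[rule] / fired[rule] > threshold]

# Toy data: "adj-noun-order" produced output that was edited both times,
# while "plural-s" output was left untouched.
records = [
    ("the house red", "the red house", ["adj-noun-order"]),
    ("a car blue",    "a blue car",    ["adj-noun-order"]),
    ("two cats",      "two cats",      ["plural-s"]),
]
print(flag_failing_rules(records))  # → ['adj-noun-order']
```

A real system would of course need fuzzier change detection (post-edits unrelated to any rule, sentence alignment after heavy rewrites), which is presumably part of the "a bit more difficult than this" above.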