Hi Petro, I’m a bit late to the discussion, but I’d nevertheless like to add my thoughts, especially as you hint at a possible solution yourself:
> I'm wondering if Polish data could be used with copious amounts of
> regex to get a dramatic BLEU score improvement.

Indeed, 3300 Lemko-English sentence pairs is not a lot, but you’re in the comfortable position that Lemko is closely related to two official EU languages, Polish and Slovak (Ukrainian might also help, but I don’t know much about the data situation there). This essentially gives you two options:

1. Go for a classical pivot approach: train a Lemko => Polish system and a Polish => English system, and feed the output of the former into the latter. The first step could be done at the character level, which requires less parallel data (this system would basically learn the “copious amounts of regex” you’re referring to). See for example Jörg Tiedemann: Character-based pivot translation for under-resourced languages and domains, EACL 2012. This approach does require some Lemko-Polish (or Lemko-Slovak) parallel data though, which you may not have.

2. Use a “domain-adaptation” approach, where you’d start by building a Polish-English MT system and gradually mix in some Lemko data during the training process. With this approach you wouldn’t need any Lemko-Polish data, but it might be a bit trickier to get working, as the Lemko data will be heavily outnumbered by the Polish data.

These ideas can easily be combined with Amittai’s suggestions, as they would basically give you a better seed model to start from.

Oh, and if you happen to be interested in morphological tagging for Lemko, you might want to have a look at this: Yves Scherrer & Achim Rabus: Multi-source morphosyntactic tagging for Spoken Rusyn, VarDial workshop, EACL 2017.

Best of luck in your endeavors, and apologies for the shameless self-promotion :D

Yves
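
P.S. To make the "gradually mix in" idea from option 2 a bit more concrete, here is a minimal sketch of one possible sampling schedule. It is not tied to Moses or any other toolkit. The function names (`lemko_share`, `mixed_batch`), the linear ramp, and the 50% ceiling are all assumptions of mine for illustration, not tuned values. The Lemko pairs are drawn with replacement, which is one simple way to keep the small corpus from being drowned out by the Polish data.

```python
import random


def lemko_share(step, total_steps, start=0.0, end=0.5):
    """Fraction of each batch drawn from Lemko data (hypothetical linear ramp)."""
    if total_steps <= 1:
        return end
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * progress


def mixed_batch(polish_pairs, lemko_pairs, step, total_steps,
                batch_size=8, rng=None):
    """Draw one training batch mixing Polish-English and Lemko-English pairs.

    Both corpora are sampled with replacement, so the small Lemko corpus
    is effectively oversampled as its share grows.
    """
    if rng is None:
        rng = random.Random(0)
    n_lemko = round(batch_size * lemko_share(step, total_steps))
    batch = [rng.choice(lemko_pairs) for _ in range(n_lemko)]
    batch += [rng.choice(polish_pairs) for _ in range(batch_size - n_lemko)]
    rng.shuffle(batch)
    return batch


if __name__ == "__main__":
    # Placeholder pairs only; real data would be (source, target) sentences.
    polish = [(f"pl_src_{i}", f"en_tgt_{i}") for i in range(1000)]
    lemko = [(f"lem_src_{i}", f"en_tgt_{i}") for i in range(30)]
    for step in (0, 5, 9):
        batch = mixed_batch(polish, lemko, step, total_steps=10)
        n = sum(1 for src, _ in batch if src.startswith("lem_"))
        print(step, n, len(batch))
```

Whether a linear ramp is better than, say, a final fine-tuning pass on Lemko data alone is something you would have to check on a held-out Lemko test set.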
