Hi Petro,

I’m a bit late to the discussion, but I’d nevertheless like to add my thoughts, 
especially as you hint at a possible solution yourself:

> I'm wondering if Polish data could be used with copious amounts of
> regex to get a dramatic BLEU score improvement.

Indeed, 3300 Lemko-English sentence pairs is not a lot, but you’re in the 
comfortable position that Lemko is closely related to two official EU 
languages, Polish and Slovak (Ukrainian might also help, but I don’t know a lot 
about the data situation there). With this, you have essentially two options:

1. Go for a classical pivot approach, by training a Lemko => Polish system and 
a Polish => English system and feeding the output of the former to the latter. 
The first step could be done on the character level, requiring less parallel 
data (this system would basically learn the “copious amounts of regex” you’re 
referring to). See for example Jörg Tiedemann: Character-based pivot 
translation for under-resourced languages and domains, EACL 2012. This approach 
requires some Lemko-Polish (or Slovak) parallel data though, which you may not 
have.

2. Use a “domain-adaptation” approach, where you’d start by creating a 
Polish-English MT system and gradually mix in some Lemko data during the 
training process. In this approach, you wouldn’t need any Lemko-Polish data, 
but it might be a bit trickier to get it working as the Lemko data will be 
outnumbered by the Polish data.
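
To make the two options a bit more concrete, here is a minimal Python sketch 
of the data preparation each would need. It is an illustration under stated 
assumptions, not a tested recipe: the function names, the "_" placeholder for 
spaces and the lemko_share parameter are all hypothetical, and the mixing 
function does plain static oversampling (calling it with a growing lemko_share 
from one training stage to the next would approximate the "gradual" mixing).

```python
import random


def to_char_level(sentence, space_symbol="_"):
    """Option 1: split a sentence into space-separated characters so that a
    word-based toolkit treats every character as a token.
    Assumes space_symbol does not occur in the data itself."""
    normalized = " ".join(sentence.split())  # collapse runs of whitespace
    return " ".join(space_symbol if ch == " " else ch for ch in normalized)


def from_char_level(char_sentence, space_symbol="_"):
    """Undo to_char_level on the output of the character-level system."""
    return "".join(" " if tok == space_symbol else tok
                   for tok in char_sentence.split())


def mix_corpora(polish_pairs, lemko_pairs, lemko_share=0.2, seed=0):
    """Option 2: oversample the small Lemko-English corpus so that it makes
    up roughly lemko_share of the mixed training data.
    Both arguments are lists of (source, target) sentence pairs."""
    if not 0 < lemko_share < 1:
        raise ValueError("lemko_share must be strictly between 0 and 1")
    if not lemko_pairs:
        raise ValueError("lemko_pairs must not be empty")
    rng = random.Random(seed)
    # Number of Lemko pairs needed to reach the requested share.
    target = round(lemko_share * len(polish_pairs) / (1 - lemko_share))
    target = max(target, len(lemko_pairs))  # never throw Lemko data away
    full_copies, remainder = divmod(target, len(lemko_pairs))
    sampled = list(lemko_pairs) * full_copies + rng.sample(list(lemko_pairs), remainder)
    mixed = list(polish_pairs) + sampled
    rng.shuffle(mixed)
    return mixed
```

For option 1, both sides of the Lemko-Polish data would go through 
to_char_level before training, and the system output through from_char_level 
before being passed on to the Polish => English system. For option 2, the 
right value of lemko_share is an open question and would have to be tuned on 
held-out Lemko data.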

These ideas can easily be combined with Amittai’s suggestions, as they would 
basically create a better SEED model to get started.

Oh, and if you happen to be interested in morphological tagging for Lemko, you 
might want to have a look at this:
Yves Scherrer & Achim Rabus: Multi-source morphosyntactic tagging for Spoken 
Rusyn, VarDial workshop, EACL 2017.

Best of luck in your endeavors, and apologies for the shameless self-promotion 
:D
Yves


_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
