Hi Amittai,

Thanks for all your help. I've already implemented a lot of your suggestions and hope to implement more this month or next. The output looks a lot better after including some language models as you suggested. That dinged the internal BLEU score a bit, but it's worth it.

I've implemented the suggested Train/Tune/Test ratio of 80%:10%:10%, assembled a (monolingual) English corpus, and look forward to doing cross-validation with different tune/test subsets.

I see where you're coming from with finding a monolingual English corpus of exactly the kind of data we want the MT system to be good at, and assembling a monolingual English corpus of plausible data. I'll have to confer with the community to see where the most urgent need is.

I've starred and watched your repo https://github.com/amittai/cynical and look forward to taking it for a spin.

Yes, looking forward to getting an off-the-shelf system or even phone apps off the ground. I'm wondering whether Polish data could be used, with copious amounts of regex, to get a dramatic BLEU score improvement.

Thanks so much again for helping to revitalize this endangered low-resource language.

Best regards,
Petro
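[Editor's note: the 80%:10%:10% split discussed in this thread can be sketched as follows. This is an illustrative example only — the segment pairs and seed value are made up, and a real pipeline would write the three subsets out to the train/tune/test files Moses expects.]

```python
import random

def train_tune_test_split(pairs, seed=42):
    """Shuffle aligned segment pairs and split them 80/10/10.

    For the 3,387 segments mentioned in the thread this yields
    2,709 train, 338 tune, and 340 test segments.
    """
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # fixed seed => reproducible split
    n = len(pairs)
    n_train = int(n * 0.8)
    n_tune = int(n * 0.1)
    train = pairs[:n_train]
    tune = pairs[n_train:n_train + n_tune]
    test = pairs[n_train + n_tune:]
    return train, tune, test

# Re-running with different seeds gives the different tune/test
# subsets needed for the cross-validation amittai suggests.
pairs = [(f"lemko {i}", f"english {i}") for i in range(3387)]
train, tune, test = train_tune_test_split(pairs)
print(len(train), len(tune), len(test))  # -> 2709 338 340
```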
On 21 March 2018 at 02:19, amittai <[email protected]> wrote:

> Hi --
>
> For what it's worth, those are wonderful goals, and I hope you succeed.
> The silence on the mailing list is a large number of people not wanting to
> be the one pointing out that you are in dire need of more data ;)
> But it sounds like you know that, and want to build what you can with what
> you've got. Here are my opinions; others might have different takes.
>
> -- I'd say a Train/Tune/Test ratio of 80%:10%:10% is textbook. This
> means a tuning and test set of about 300-ish lines. That's tiny,
> but we used test sets that size around 2005, so there is good
> precedent. If you can afford it, try some cross-validation with
> different tune/test subsets.
>
> -- What to do with the budget? I'd spend 90% of the money on getting
> more bilingual data, and then build an off-the-shelf MT system with
> the rest.
>
> -- Not sure that many internal system settings can compensate for a
> fundamental lack of data. I'd make sure my pre- and post-processing setup
> made my output look as fluent as possible. MT systems can make very
> consistent mistakes in the output, and some of them can be patched up with
> regexes.
>
> Point #2 raises the question of _which_ data to have translated... If it
> were me building a Lemko--EN system, I'd do it like this:
>
> 1. Find a (monolingual) English corpus of exactly the kind of data I want
> my MT system to be good at. (We're being realistic, right? Lemko
> translations of the entire internet will have to wait until after we have
> good Lemko translations of e.g. government forms, street signs, or tourist
> phrases.) Accuracy is more important than size.
> Let's call this the REPRESENTATIVE (REPR) corpus.
>
> 2. Assemble a (monolingual) English corpus of plausible data, meaning
> sentences that look like they might be helpful (i.e. not the UN corpus) and
> that I could pay to have (some of) them translated.
> If nothing else, this can just be corpus #1 (REPR), but I'd make it as
> large as I could without extra effort.
> Call this the UNADAPTED or AVAILABLE (AVAIL) corpus.
>
> 3. Put my bilingual Lemko--EN data in a small pile, and call it the SEED.
> Maybe pat it on the head, too, and tell it I'm working to find some
> friends. This is the data I already have translated.
>
> I want to eventually be able to bilingually model the REPR corpus (by
> training a system). I can't do that, and I can't use my Lemko data to
> figure out how, either. What I *can* do is use my English data to figure
> out:
>
> What sentences from AVAIL should I add to SEED in order to better model
> REPR?
>
> Monolingually, this means:
> "I want to build a LM on {SEED plus some data}, and I want the LM to have
> the lowest possible perplexity on REPR. Which sentences should I add to
> SEED from AVAIL in order to do that?"
>
> The English sentences I move from AVAIL to SEED in order to better model
> REPR are precisely the sentences that I should pay to have translated.
> This is because these are the sentences in AVAIL with the most information
> about the REPR corpus that is not already in SEED.
>
> I've written a tool that can do this:
> https://github.com/amittai/cynical
>
> There might be other tools, and they might be better, but I'm not aware of
> them. It'll output the sentences in AVAIL, but in order of how useful they
> are to me. I'd go down the list, and translate as many as I could afford.
> If at some point in the future I got more money, I could continue
> bootstrapping by re-running the algorithm with the larger SEED corpus
> containing all my translated data.
>
> "Cynical selection" was originally intended for regular domain adaptation
> stuff, but it can also do the monolingual corpus-growing that you might
> want. Documentation is mostly inside the code at the moment.
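[Editor's note: the monolingual selection idea described above can be approximated with a much simpler criterion than cynical selection proper — for instance the classic cross-entropy-difference ("Moore–Lewis") score, sketched here with add-one-smoothed unigram language models. All corpora and names below are illustrative; for the real incremental algorithm, use amittai's tool.]

```python
import math
from collections import Counter

def unigram_logprob(sentence, counts, total, vocab):
    """Per-token log-probability under an add-one-smoothed unigram LM."""
    words = sentence.split()
    lp = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return lp / max(len(words), 1)

def rank_avail(repr_corpus, seed_corpus, avail_corpus):
    """Rank AVAIL sentences: high score = looks like REPR, unlike SEED.

    A rough stand-in for cynical selection: translate from the top of
    the returned list until the budget runs out.
    """
    def stats(corpus):
        c = Counter(w for s in corpus for w in s.split())
        return c, sum(c.values())

    repr_c, repr_n = stats(repr_corpus)
    seed_c, seed_n = stats(seed_corpus)
    vocab = len(repr_c | seed_c) + 1  # shared vocab size for smoothing

    def score(s):
        # High when REPR models s well and SEED models it poorly,
        # i.e. s carries REPR-like information not already in SEED.
        return (unigram_logprob(s, repr_c, repr_n, vocab)
                - unigram_logprob(s, seed_c, seed_n, vocab))

    return sorted(avail_corpus, key=score, reverse=True)
```

Sentences scoring highest are those whose words are frequent in REPR but not yet covered by SEED — exactly the "most informative" sentences the email describes paying to have translated first.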
> For now, to run it, edit the bash wrapper script to point to your files
> etc. and then just hit 'bash amittai-cynical-wrapper.sh'
>
> I think these settings might be useful:
> task_distribution_file="representative_sentences.en"
> unadapted_distribution_file="all_plausible_data.en"
> seed_corpus_file="bilingual_data.en"
> available_corpus_file=$unadapted_distribution_file
> batchmode=0 ## disable it!
> numlines=50000 ## stop after 50k lines, or whatever you think your budget
> allows for
>
> It can be quite memory intensive if AVAIL is large. If you have hardware
> constraints, try playing with the following settings:
> mincount=20 ## if REPR is really big, increase mincount.
> and
> $save_memory=1 in the selection script itself.
>
> If you (or anyone) run into difficulties, just open a GitHub issue here:
> https://github.com/amittai/cynical/issues
> and I'd be more than happy to help debug, clarify, walk through steps, etc.
>
> Cheers,
> ~amittai
>
>
> On 2018-03-20 19:06, Aileen Joan Vicente wrote:
>>
>> Would love to hear inputs from others. I am working on a low-resource
>> Chavacano corpus too.
>>
>> On Wed, Mar 21, 2018 at 1:29 AM, Petro ORYNYCZ-GLEASON
>> <[email protected]> wrote:
>>
>>> Dear Colleagues,
>>> We are using Moses to revitalize Lemko, an endangered low-resource
>>> language. We have 70,000 Lemko words in 3,387 segments perfectly
>>> translated into native English and perfectly aligned.
>>> Current BLEU score is about 0.10.
>>> As far as hardware goes, we're using the cloud: Amazon EC2 p2.xlarge
>>> (1 GPU, 4 vCPUs, 61 GiB RAM).
>>> Questions:
>>> - How should we divide our precious 3,387 bilingual segments into
>>> training, tuning, and testing data? What ratio is ideal?
>>> - Considering that at this point, bilingual content is much dearer to
>>> us than processing power (Amazon AWS costs us USD 0.90 per hour, while
>>> translation costs us USD 0.15 per word), how do we make the most of
>>> what we've got?
>>> - Is there anything we could do, other than the default settings, that
>>> might lead to a large improvement in the BLEU score?
>>>
>>> Current training model:
>>> ~/workspace/mosesdecoder/scripts/training/train-model.perl \
>>> --parallel --mgiza-cpus 4 \
>>> -root-dir train \
>>> --corpus ~/corpus/train.ru-en.clean \
>>> --f ru --e en \
>>> --alignment grow-diag-final-and \
>>> --reordering msd-bidirectional-fe \
>>> --lm 0:3:/home/ubuntu/lm/train.ru-en.blm.en:8 \
>>> -external-bin-dir ~/workspace/bin/training-tools/mgizapp
>>>
>>> Current tuning model:
>>> ~/workspace/mosesdecoder/scripts/training/mert-moses.pl \
>>> ~/corpus/tune.ru-en.true.ru ~/corpus/tune.ru-en.true.en \
>>> ~/workspace/mosesdecoder/bin/moses ~/working/train/model/moses.ini \
>>> --mertdir ~/workspace/mosesdecoder/bin/ \
>>> --decoder-flags="-threads 4"
>>>
>>> Thanks for your help!
>>> _______________________________________________
>>> Moses-support mailing list
>>> [email protected]
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
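[Editor's note: on the regex post-processing both correspondents mention — patching up the consistent mistakes MT output tends to make — a minimal pass might look like the sketch below. The rules shown are invented placeholders; real rules would be harvested by eyeballing systematic errors in the Lemko–English output.]

```python
import re

# Hypothetical post-edit rules, applied in order: (pattern, replacement).
POST_EDIT_RULES = [
    (re.compile(r"\s+([,.;:!?])"), r"\1"),            # no space before punctuation
    (re.compile(r"\ba\s+([aeiouAEIOU])"), r"an \1"),  # "a" -> "an" before a vowel
    (re.compile(r"\s{2,}"), " "),                     # collapse runs of spaces
]

def post_edit(line):
    """Apply every rule in order to one line of MT output."""
    for pattern, repl in POST_EDIT_RULES:
        line = pattern.sub(repl, line)
    return line.strip()

print(post_edit("this is a  example , yes"))  # -> this is an example, yes
```

Because the rules run on every output line, even a handful of accurate patterns for consistent decoder mistakes can noticeably improve fluency (and, as noted above, sometimes at the cost of a small internal BLEU change).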
