Hi Amittai,
Thanks for all your help. I've already implemented a lot of your
suggestions and hope to implement more this month or next.
The output looks a lot better after including some language models as
you suggested. That dinged the internal BLEU score a bit, but it's
worth it.
I've implemented the suggested Train/Tune/Test ratio of 80%:10%:10% as
well as a (monolingual) English corpus, and I look forward to doing
cross-validation with different tune/test subsets.
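For concreteness, the 80%:10%:10% split over 3,387 segments can be sketched in Python. The data below is a toy stand-in; a real script would read the two parallel files and zip them line by line so the sides stay aligned through the shuffle:

```python
import random

# Toy stand-in for the 3,387 aligned Lemko--English segments.
pairs = [(f"lemko {i}", f"english {i}") for i in range(3387)]

rng = random.Random(42)  # fixed seed so the split is reproducible
rng.shuffle(pairs)

n = len(pairs)
n_tune = n_test = n // 10                  # ~10% each for tune and test
train = pairs[: n - n_tune - n_test]       # ~80% for training
tune = pairs[n - n_tune - n_test : n - n_test]
test = pairs[n - n_test :]

print(len(train), len(tune), len(test))    # 2711 338 338
```

Cross-validation then amounts to repeating this with different seeds (or rotating which 10% slices serve as tune and test) and averaging the resulting BLEU scores.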
I see where you're coming from on finding a monolingual English corpus
of exactly the kind of data we want the MT system to be good at, and on
assembling a monolingual English corpus of plausible data. I'll have to
confer with the community to see where the most urgent need is. I've
starred and watched your repo https://github.com/amittai/cynical and
look forward to taking it for a spin.
Yes, looking forward to getting an off-the-shelf system or even phone
apps off the ground.
I'm wondering if Polish data could be used with copious amounts of
regex to get a dramatic BLEU score improvement.
Thanks so much again for helping to revitalize this endangered
low-resource language.
Best regards,
Petro

On 21 March 2018 at 02:19, amittai <[email protected]> wrote:
> Hi --
>
> For what it's worth, those are wonderful goals, and I hope you succeed.
> The silence on the mailing list is a large number of people not wanting to
> be the one pointing out that you are in dire need of more data ;)
> But, it sounds like you know that, and want to build what you can with what
> you've got. Here are my opinions; others might have different takes.
>
> -- I'd say a Train/Tune/Test ratio of 80%:10%:10% is textbook. This
>    means a tuning and test set of about 300-ish lines. That's tiny,
>    but we used test sets that size around 2005, so there is good
>    precedent. If you can afford it, try some cross-validation with
>    different tune/test subsets.
>
> -- What to do with the budget? I'd spend 90% of the money on getting
>    more bilingual data, and then build an off-the-shelf MT system with
>    the rest.
>
> -- Not sure that many internal system settings can compensate for a
>    fundamental lack of data. I'd make sure my pre- and post-processing
>    setup made my output look as fluent as possible. MT systems can make
>    very consistent mistakes in the output, and some of them can be
>    patched up with regexes.
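A minimal sketch of that regex patch-up idea. The rules below are invented examples for illustration, not patterns collected from real Lemko--EN output; in practice you would harvest them by eyeballing recurring mistakes in your decoder's translations:

```python
import re

# Illustrative post-edit rules (hypothetical -- collect real ones from
# your own MT output). Each entry is (compiled pattern, replacement).
POSTEDIT_RULES = [
    (re.compile(r"\s+([,.;:!?])"), r"\1"),  # drop space before punctuation
    (re.compile(r"\bi\b"), "I"),            # capitalize the pronoun "i"
    (re.compile(r" {2,}"), " "),            # collapse runs of spaces
]

def postedit(line: str) -> str:
    """Apply each rule in order to one line of MT output."""
    for pattern, replacement in POSTEDIT_RULES:
        line = pattern.sub(replacement, line)
    return line.strip()

print(postedit("hello , i am  here ."))  # hello, I am here.
```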
>
> Point #2 raises the question of _which_ data to have translated... If it
> were me building a Lemko--EN system, I'd do it like this:
>
> 1. Find a (monolingual) English corpus of exactly the kind of data I want my
> MT system to be good at. (We're being realistic, right? Lemko translations
> of the entire internet will have to wait until after we have good Lemko
> translations of e.g. government forms, street signs, or tourist phrases).
> Accuracy is more important than size.
> Let's call this the REPRESENTATIVE (REPR) corpus.
>
> 2. Assemble a (monolingual) English corpus of plausible data, meaning
> sentences that look like they might be helpful (i.e. not the UN corpus) and
> that I could pay to have (some of) them translated. If nothing else, this
> can just be corpus #1 (REPR), but I'd make it as large as I could without
> extra effort.
> Let's call this the UNADAPTED or AVAILABLE (AVAIL) corpus.
>
> 3. Put my bilingual Lemko--EN data in a small pile, and call it the SEED.
> Maybe pat it on the head, too, and tell it I'm working to find some friends.
> This is the data I already have translated.
>
> I want to eventually be able to bilingually model the REPR corpus (by
> training a system). I can't do that, and I can't use my Lemko data to figure
> out how, either. What I *can* do is use my English data to figure out:
>
> What sentences from AVAIL should I add to SEED in order to better model
> REPR?
>
> Monolingually, this means:
> "I want to build a LM on {SEED plus some data}, and I want the LM to have
> the lowest possible perplexity on REPR. Which sentences should I add to SEED
> from AVAIL in order to do that?"
>
> The English sentences I move from AVAIL to SEED in order to better model
> REPR are precisely the sentences that I should pay to have translated.  This
> is because these are the sentences in AVAIL with the most information about
> the REPR corpus that is not already in SEED.
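As a toy illustration of that selection criterion -- not the cynical algorithm itself, but a simpler Moore--Lewis-style cross-entropy-difference ranking with add-one-smoothed unigram models, on made-up corpora -- sentences that look like REPR and unlike SEED score lowest and would be sent for translation first:

```python
import math
from collections import Counter

def unigram_logprob(sentences):
    """Add-one-smoothed unigram log-probability function for a corpus."""
    counts = Counter(w for s in sentences for w in s.split())
    total, vocab = sum(counts.values()), len(counts) + 1
    return lambda w: math.log((counts[w] + 1) / (total + vocab))

def score(sentence, lp_repr, lp_seed):
    # Per-word cross-entropy difference: lower = more REPR-like and
    # less SEED-like, i.e. more informative to have translated.
    words = sentence.split()
    return sum(lp_seed(w) - lp_repr(w) for w in words) / len(words)

# Made-up corpora standing in for real files.
REPR = ["where is the bus stop", "how much is a ticket"]
SEED = ["the committee adopted the resolution"]
AVAIL = [
    "the resolution was adopted unanimously",
    "where can i buy a bus ticket",
    "how much does the ticket cost",
]

lp_repr = unigram_logprob(REPR)
lp_seed = unigram_logprob(SEED)
ranked = sorted(AVAIL, key=lambda s: score(s, lp_repr, lp_seed))
for s in ranked:  # most useful to translate first
    print(s)
```

The real tool is smarter than this static ranking (it updates its model as sentences move from AVAIL to SEED), but the intuition is the same.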
>
> I've written a tool that can do this:
>     https://github.com/amittai/cynical
>
> There might be other tools, and they might be better, but I'm not aware of
> them. It'll output the sentences in AVAIL, but in order of how useful they
> are to me. I'd go down the list, and translate as many as I could afford. If
> at some point in the future I got more money, I could continue bootstrapping
> by re-running the algorithm with the larger SEED corpus containing all my
> translated data.
>
> "Cynical selection" was originally intended for regular domain adaptation
> stuff, but it can also do the monolingual corpus-growing that you might
> want. Documentation is mostly inside the code at the moment. For now, to run
> it, edit the bash wrapper script to point to your files etc and then just
> hit 'bash amittai-cynical-wrapper.sh'
>
> I think these settings might be useful:
>   task_distribution_file="representative_sentences.en"
>   unadapted_distribution_file="all_plausible_data.en"
>   seed_corpus_file="bilingual_data.en"
>   available_corpus_file=$unadapted_distribution_file
>   batchmode=0     ## disable it!
>   numlines=50000  ## stop after 50k lines, or whatever your budget allows for
>
> It can be quite memory intensive if AVAIL is large. If you have hardware
> constraints, try playing with the following settings:
>     mincount=20  ## if REPR is really big, increase mincount.
> and
>     $save_memory=1 in the selection script itself.
>
> If you (or anyone) run into difficulties, just open a github issue here:
>     https://github.com/amittai/cynical/issues
> and I'd be more than happy to help debug, clarify, walk through steps, etc.
>
> Cheers,
> ~amittai
>
>
> On 2018-03-20 19:06, Aileen Joan Vicente wrote:
>>
>> Would love to hear inputs from others. I am working on a low-resource
>> Chavacano corpus too.
>>
>> On Wed, Mar 21, 2018 at 1:29 AM, Petro ORYNYCZ-GLEASON
>> <[email protected]> wrote:
>>
>>> Dear Colleagues,
>>> We are using Moses to revitalize Lemko, an endangered low-resource
>>> language. We have 70,000 Lemko words in 3,387 segments perfectly
>>> translated into native English and perfectly aligned.
>>> Current BLEU score is about 0.10.
>>> As far as hardware goes, we're using the cloud: Amazon EC2 p2.xlarge
>>> (1 GPU, 4 vCPUs, 61 GiB RAM).
>>> Questions:
>>> - How should we divide our precious 3,387 bilingual segments into
>>> training, tuning, and testing data? What ratio is ideal?
>>> - Considering that at this point, bilingual content is much dearer
>>> to us than processing power (Amazon AWS costs us USD 0.90 per hour,
>>> while translation costs us USD 0.15 per word), how do we make the
>>> most of what we've got?
>>> - Is there anything we could do other than the default settings that
>>> might lead to a large improvement in the BLEU score?
>>>
>>> Current training model:
>>> ~/workspace/mosesdecoder/scripts/training/train-model.perl \
>>> --parallel --mgiza-cpus 4 \
>>> -root-dir train \
>>> --corpus ~/corpus/train.ru-en.clean \
>>> --f ru --e en \
>>> --alignment grow-diag-final-and \
>>> --reordering msd-bidirectional-fe \
>>> --lm 0:3:/home/ubuntu/lm/train.ru-en.blm.en:8 \
>>> -external-bin-dir ~/workspace/bin/training-tools/mgizapp
>>>
>>> Current tuning model:
>>> ~/workspace/mosesdecoder/scripts/training/mert-moses.pl \
>>> ~/corpus/tune.ru-en.true.ru ~/corpus/tune.ru-en.true.en \
>>> ~/workspace/mosesdecoder/bin/moses ~/working/train/model/moses.ini
>>> --mertdir ~/workspace/mosesdecoder/bin/ \
>>> --decoder-flags="-threads 4"
>>>
>>> Thanks for your help!
>>> _______________________________________________
>>> Moses-support mailing list
>>> [email protected]
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>>
>>
>>