Oh yeah, good point. What I also used to do when translating large files is to cut them to a maximum length, say 200-300 tokens, with a Perl/Python script before the text goes into Moses. So my recipe is:

* Split the input into files of about 10,000 lines each.
* Loop over the files with GNU parallel and a single-threaded or 2-4-thread Moses, letting GNU parallel handle the parallelization.
* Pipe each file through a script that cuts the lines to a length of 200-300 tokens before piping it into Moses.

That usually solves things. Another advantage is that you can easily check whether there was an intermediate crash and only retranslate the missing piece; see the sketch below.
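For concreteness, here is a minimal sketch of that recipe, with the truncation script in Python. The file names (cut_tokens.py, corpus.tok, chunk_*) and the moses.ini path are hypothetical; adjust the chunk size and thread counts to your machine:

    # cut_tokens.py -- truncate each line to at most MAX_TOKENS
    # whitespace-separated tokens before it reaches Moses.
    import sys

    MAX_TOKENS = 250  # somewhere in the 200-300 range mentioned above

    for line in sys.stdin:
        tokens = line.split()
        print(" ".join(tokens[:MAX_TOKENS]))

And the surrounding loop, e.g. 14 jobs x 4 threads each on a 56-core box:

    split -l 10000 corpus.tok chunk_
    ls chunk_* | parallel -j 14 "python cut_tokens.py < {} | moses -f moses.ini -threads 4 > {}.out"

Comparing wc -l for each chunk_* file against its .out is then enough to spot a chunk that crashed halfway and to retranslate only that piece.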
On 12.12.2017 at 14:10, Barry Haddow wrote:
> Hi
>
> Yes, that's true. From Liling's description it sounds like a
> pathologically long sentence is causing Moses to blow up. However, he
> states that it happens on random lines -- could it be that, with so
> many threads, the amount of data translated before the crash varies,
> but it is the same problem line each time?
>
> cheers - Barry
>
> On 12/12/17 09:24, Marcin Junczys-Dowmunt wrote:
>> Hi,
>> I think the important part is that Liling actually manages to translate
>> several tens of thousands of sentences before that happens. A quick fix
>> would be to break your corpus into pieces of 10K sentences each and loop
>> over the files. I have generally had bad experiences trying to translate
>> large batches of text with Moses.
>>
>> Is it still trying to load the entire corpus into memory? It used to do
>> that.
>>
>> On 12.12.2017 at 10:16, Barry Haddow wrote:
>>> Hi Liling
>>>
>>> The short answer is that you need to prune/filter your phrase table
>>> before creating the compact phrase table. I don't mean "filter model
>>> given input", because that won't make much difference if you have a
>>> very large input; I mean getting rid of rare translations that won't
>>> be used anyway.
>>>
>>> The compact phrase table does not do pruning, so the pruning ends up
>>> being done in memory: if you have 750,000 translations of the full
>>> stop in your model, they all get loaded into memory before Moses
>>> selects the top 20.
>>>
>>> You can use prunePhraseTable from Moses (which, bizarrely, needs to
>>> load a phrase table in order to parse the config file, last time I
>>> looked). You could also apply Johnson / entropic pruning -- whatever
>>> works for you.
>>>
>>> cheers - Barry
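(To illustrate the kind of pruning Barry describes -- keeping only the top N translations per source phrase -- here is a minimal sketch over a plain-text phrase table. It assumes the table is sorted by source phrase, as the training pipeline produces it, and that the third score field is the direct phrase probability p(e|f); check the score order in your own table first. The script name prune_top_n.py is made up:

    # prune_top_n.py -- keep the top N translations per source phrase,
    # ranked by the direct phrase probability (assumed to be score 3).
    import sys

    N = 20
    current_src, entries = None, []

    def flush(entries):
        # highest-probability translations first, keep at most N
        entries.sort(key=lambda e: e[0], reverse=True)
        for _, line in entries[:N]:
            sys.stdout.write(line)

    for line in sys.stdin:
        fields = line.split(" ||| ")
        src, score = fields[0], float(fields[2].split()[2])
        if src != current_src:
            flush(entries)
            current_src, entries = src, []
        entries.append((score, line))
    flush(entries)

Something like zcat phrase-table.gz | python prune_top_n.py | gzip > phrase-table.pruned.gz, then rebuild the compact table from the pruned file.)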
>>> On 11/12/17 09:20, liling tan wrote:
>>>> Dear Moses community/developers,
>>>>
>>>> I have a question on how to handle large models created using Moses.
>>>>
>>>> I have a vanilla phrase-based model with:
>>>>
>>>> * PhraseDictionary num-features=4 input-factor=0 output-factor=0
>>>> * LexicalReordering num-features=6 input-factor=0 output-factor=0
>>>> * KENLM order=5 factor=0
>>>>
>>>> The size of the model is:
>>>>
>>>> * compressed phrase table: 5.4GB
>>>> * compressed reordering table: 1.9GB
>>>> * quantized LM: 600MB
>>>>
>>>> I'm running on a single 56-core machine with 256GB RAM. Whenever I
>>>> decode, I use the -threads 56 parameter.
>>>>
>>>> It takes really long to load the table, and after loading, decoding
>>>> breaks inconsistently at different lines. I notice that the RAM goes
>>>> into swap before it breaks.
>>>>
>>>> I've tried the compact phrase table and get:
>>>>
>>>> * a 3.2GB .minphr
>>>> * a 1.5GB .minlexr
>>>>
>>>> The same kind of random breakage happens when the RAM goes into swap
>>>> after loading the phrase table.
>>>>
>>>> Strangely, it still manages to decode ~500K sentences before it breaks.
>>>>
>>>> Then I tried the on-disk phrase table, which is around 37GB
>>>> uncompressed. Using the on-disk PT didn't cause breakage, but the
>>>> decoding time increased significantly; now it can only decode 15K
>>>> sentences per hour.
>>>>
>>>> The setup is a little different from the usual one, where we have a
>>>> train/dev/test split. Currently, my task is to decode the train set.
>>>> I've tried filtering the table against the train set with
>>>> filter-model-given-input.pl, but the size of the compressed table
>>>> didn't really decrease much.
>>>>
>>>> The entire training set is made up of 5M sentence pairs, and it's
>>>> taking 3+ days just to decode ~1.5M sentences with the on-disk PT.
>>>>
>>>> My questions are:
>>>>
>>>> - Are there best practices with regard to deploying large Moses models?
>>>> - Why does the 5+GB phrase table take up >250GB RAM when decoding?
>>>> - How else should I filter/compress the phrase table?
>>>> - Is it normal to decode only ~500K sentences a day given the machine
>>>>   specs and the model size?
>>>>
>>>> I understand that I could split the train set into two, train two
>>>> models and cross-decode, but if the training size is 10M sentence
>>>> pairs, we'll face the same issues.
>>>>
>>>> Thank you for reading the long post, and thanks in advance for any
>>>> answers, discussion and enlightenment on this issue =)
>>>>
>>>> Regards,
>>>> Liling

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
