Oh yeah, good point. What I also used to do when translating large files 
is to cut the lines to a maximum length, say 200-300 tokens, with a 
perl/python script before they go into moses. So my recipe is:

* Split the corpus into files with about 10,000 lines each
* loop over the files with gnu parallel and a single-thread or 2-4 thread 
moses, letting gnu parallel handle the parallelization
* pipe each file through a script that cuts the lines to a length of 
200-300 tokens before piping into moses (see the sketch below)
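
Roughly like this (a sketch only; corpus.src, moses.ini, the 250-token 
cut-off and the 28x2 job/thread split are placeholders to adapt to your 
machine):

    # 1. Split the corpus into ~10,000-line chunks: part.aa, part.ab, ...
    split -l 10000 corpus.src part.

    # 2. Cut each line to at most 250 tokens, then decode the chunk with
    #    a 2-thread moses; gnu parallel runs 28 chunks at a time.
    ls part.?? | parallel -j 28 \
      'perl -lane "print join(q{ }, splice(@F, 0, 250))" {} | moses -f moses.ini -threads 2 > {}.out'

    # 3. Reassemble in order once all chunks are done.
    cat part.??.out > corpus.out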

That usually solves things. Another advantage is that you can easily 
check whether there was an intermediate crash and only retranslate the 
missing pieces.
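
The crash check itself can be as simple as comparing line counts. A 
minimal sketch, using the same placeholder names as above:

    # Redo only the chunks whose output is missing or has fewer lines
    # than the (truncated) input; finished chunks are left alone.
    for f in part.??; do
      if [ ! -f "$f.out" ] || [ $(wc -l < "$f.out") -lt $(wc -l < "$f") ]; then
        perl -lane 'print join(q{ }, splice(@F, 0, 250))' "$f" \
          | moses -f moses.ini -threads 2 > "$f.out"
      fi
    done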

On 12.12.2017 at 14:10, Barry Haddow wrote:
> Hi
>
> Yes, that's true. From Liling's description it sounds like a 
> pathologically long sentence is causing Moses to blow up. However, he 
> states that it happens on random lines -- could it be that there are 
> so many threads that the amount of data translated is random, but it's 
> the same problem line each time?
>
> cheers - Barry
>
> On 12/12/17 09:24, Marcin Junczys-Dowmunt wrote:
>> Hi,
>> I think the important part is that Liling actually manages to translate
>> several tens of thousands of sentences before that happens. A quick fix
>> would be to break your corpus into pieces of 10K sentences each and loop
>> over the files. I have generally had bad experiences with trying to
>> translate large batches of text with moses.
>>
>> Is it still trying to load the entire corpus into memory? It used to do 
>> that.
>>
>> On 12.12.2017 at 10:16, Barry Haddow wrote:
>>> Hi Liling
>>>
>>> The short answer is you need to prune/filter your phrase table
>>> prior to creating the compact phrase table. I don't mean "filter model
>>> given input", because that won't make much difference if you have a
>>> very large input, I mean getting rid of rare translations which won't
>>> be used anyway.
>>>
>>> The compact phrase table does not do pruning, so the pruning ends up
>>> being done in memory: if you have 750,000 translations of the
>>> full-stop in your model, they all get loaded into memory before Moses
>>> selects the top 20.
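>>>
>>> (For the record: a table-limit in the ini, e.g. something like
>>>
>>>     PhraseDictionaryCompact ... table-limit=20
>>>
>>> only caps what the decoder keeps after the candidates have been
>>> loaded, so it does not avoid that memory spike.)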
>>>
>>> You can use prunePhraseTable from Moses (which bizarrely needs to load
>>> a phrase table in order to parse the config file, last time I looked).
>>> You could also apply Johnson / entropic pruning, whatever works for 
>>> you,
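>>>
>>> If you just want something quick and crude before compacting, you can
>>> prune the text phrase table directly. A rough sketch (it assumes the
>>> gzipped table is grouped by source phrase, as it normally is, and it
>>> keeps the first 20 entries per source rather than the 20 best-scoring
>>> ones, so the proper tools above will do better):
>>>
>>>     zcat phrase-table.gz \
>>>       | awk -F' \\|\\|\\| ' '{ if ($1 != prev) { n = 0; prev = $1 }
>>>                                if (++n <= 20) print }' \
>>>       | gzip -c > phrase-table.pruned.gz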
>>>
>>> cheers - Barry
>>>
>>> On 11/12/17 09:20, liling tan wrote:
>>>> Dear Moses community/developers,
>>>>
>>>> I have a question on how to handle large models created using moses.
>>>>
>>>> I've a vanilla phrase-based model with
>>>>
>>>>    * PhraseDictionary num-features=4 input-factor=0 output-factor=0
>>>>    * LexicalReordering num-features=6 input-factor=0 output-factor=0
>>>>    * KENLM order=5 factor=0
>>>>
>>>> The size of the model is:
>>>>
>>>>    * compressed phrase table is 5.4GB,
>>>>    * compressed reordering table is 1.9GB and
>>>>    * quantized LM is 600MB
>>>>
>>>>
>>>> I'm running on a single 56-core machine with 256GB RAM. Whenever I'm
>>>> decoding I use the -threads 56 parameter.
>>>>
>>>> It takes really long to load the table, and after loading it breaks
>>>> inconsistently at different lines when decoding. I notice that the
>>>> RAM goes into swap before it breaks.
>>>>
>>>> I've tried the compact phrase table and got a
>>>>
>>>>    * 3.2GB .minphr
>>>>    * 1.5GB .minlexr
>>>>
>>>> And the same kind of random breakage happens when RAM goes into swap
>>>> after loading the phrase-table.
>>>>
>>>> Strangely, it still manages to decode ~500K sentences before it breaks.
>>>>
>>>> Then I've tried the ondisk phrase table, which is around 37GB
>>>> uncompressed. Using the ondisk PT didn't cause breakage, but the
>>>> decoding time increased significantly: now it can only decode 15K
>>>> sentences in an hour.
>>>>
>>>> The setup is a little different from the normal train/dev/test
>>>> split: currently, my task is to decode the train set. I've tried
>>>> filtering the table against the train set with
>>>> filter-model-given-input.pl, but the size of the compressed table
>>>> didn't really decrease much.
>>>>
>>>> The entire training set is made up of 5M sentence pairs and it's
>>>> taking 3+ days just to decode ~1.5M sentences with the ondisk PT.
>>>>
>>>>
>>>> My questions are:
>>>>
>>>>   - Are there best practices with regard to deploying large Moses 
>>>> models?
>>>>   - Why does the 5+GB phrase table take up > 250GB RAM when decoding?
>>>>   - How else should I filter/compress the phrase table?
>>>>   - Is it normal to decode only ~500K sentences a day given the machine
>>>> specs and the model size?
>>>>
>>>> I understand that I could split the train set into two, train two
>>>> models and cross-decode, but if the training size is 10M sentence
>>>> pairs, we'll face the same issues.
>>>>
>>>> Thank you for reading the long post and thank you in advance for any
>>>> answers, discussions and enlightenment on this issue =)
>>>>
>>>> Regards,
>>>> Liling


_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
