Re: [Moses-support] MT training on a laptop

Hieu Hoang Thu, 23 Feb 2012 04:41:50 -0800

Thanks, guys. It seems a mixture of things causes this problem. If I manage to fix it, I'll check it in.

Ventsislav, if you ever want to commit your modifications, you're most welcome to.

On 22/02/2012 20:41, "Венцислав Жечев (Ventsislav Zhechev)" wrote:

Hieu Hoang <hieuhoang@...> writes:

>
> hi all
>
> does anyone have experience running the training pipeline on a laptop?
> It seems very slow to me, especially some parts of the GIZA++ alignment
> (and possibly later stages too). It seems to be crawling due to IO waits on
> a GIZA process called snt2cooc.out. This doesn't happen when running on
> larger servers. Has anyone else encountered this problem?
>
> I'm using a MacBook 2.4GHz dual core, OSX 10.7.3, 240GB disk (5400
> rpm), 4GB RAM.
>

Hi Hieu,

I’ve been training Moses engines successfully on Mac laptops for many years now, but I must concur that the standard Moses training workflow uses way too much disk IO.

Because of that, I’ve made significant modifications to the overall workflow to minimise unnecessary disk IO. These include using bzip2 for intermediary files, using gzip to compress the temp files for ‘sort’, and using named pipes to avoid storing temporary data that is only used once. The overall reduction in disk IO was about an order of magnitude in my experiments. These modifications were made to the Moses codebase from around June 2010 and I haven’t updated the codebase since. However, last year I recompiled many of the tools using clang and LLVM in Xcode.
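
As a rough illustration of two of these ideas (bzip2 for intermediary files, and a named pipe for data that is only read once), here is a minimal Python sketch. It is not the modified Moses scripts, which are not shown in this thread; the function names, the file names and the line-counting consumer are all hypothetical, and it assumes a Unix-like system where os.mkfifo is available.

```python
import bz2
import os
import tempfile
import threading


def write_compressed(path, lines):
    """Store an intermediary file as bzip2 instead of plain text."""
    with bz2.open(path, "wt", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")


def stream_through_fifo(src_path, consume):
    """Decompress src_path into a named pipe and hand the pipe's path to
    `consume`, so the decompressed text is never written to disk.

    `consume` is a hypothetical callback standing in for a tool that
    expects a file name; it must open the path it is given for reading.
    """
    workdir = tempfile.mkdtemp()
    fifo = os.path.join(workdir, "stage.fifo")
    os.mkfifo(fifo)

    def producer():
        # Opening the FIFO for writing blocks until the consumer opens it
        # for reading, so the producer runs in its own thread.
        with open(fifo, "w", encoding="utf-8") as sink, \
                bz2.open(src_path, "rt", encoding="utf-8") as src:
            for line in src:
                sink.write(line)

    worker = threading.Thread(target=producer)
    worker.start()
    try:
        return consume(fifo)
    finally:
        worker.join()
        os.remove(fifo)
        os.rmdir(workdir)


def count_lines(path):
    """Toy consumer: counts the lines it reads from the given path."""
    with open(path, encoding="utf-8") as stream:
        return sum(1 for _ in stream)
```

Calling stream_through_fifo("corpus.en.bz2", count_lines) would then read the compressed file once and pass the text straight to the consumer. In a real pipeline the consumer would more likely be an external tool launched with the pipe's path as its input file.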

Currently, I’m using an early 2011 15" MacBook Pro 2.4GHz with 16GB RAM for training. I’ve got two 750GB 7200rpm HDDs and I’m running the trainings on the non-boot HDD under Mac OS X 10.7.3. Another optimisation that could be done on Mac OS X is to put the folder where you perform the training in the Spotlight exclusion list, so that the indexer doesn’t sweat over your temporary training data. The average size of the data sets that I use for training is about 3.5M segments for a number of different languages, with the maximum being over 7.5M segments for EN–JA.

If I remember correctly, snt2cooc.out was not the worst offender for disk IO, but I cannot say for sure now. The way the output of the ‘export’ script is handled is not particularly efficient either…

Cheers,

Ventzi

–––––––
Dr. Ventsislav Zhechev
Computational Linguist

Language Technologies
Localisation Services
Autodesk Development Sàrl
Neuchâtel, Switzerland

http://VentsislavZhechev.eu
tel: +41 32 723 9122
fax: +41 32 723 9399
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support

