Thanks, guys. It seems a mixture of things causes this problem.
If I manage to fix it, I'll check it in.
Ventsislav, if you ever want to commit your modifications, you're
most welcome to.
On 22/02/2012 20:41, "Венцислав Жечев (Ventsislav Zhechev)" wrote:
Hieu Hoang <hieuhoang@...> writes:
>
> hi all
>
> does anyone have experience running the training pipeline on a laptop?
> It seems very slow to me, especially some parts of the GIZA++ alignment
> (and possibly later stages too). It seems to be crawling due to IO waits on
> a GIZA process called snt2cooc.out. This doesn't happen when running on
> larger servers. Has anyone else encountered this problem?
>
> I'm using a MacBook 2.4GHz dual core, OSX 10.7.3, 240GB disk (5400
> spin), 4GB ram.
>
Hi Hieu,
I’ve been training Moses engines successfully on Mac laptops for
many years now, but I must concur that the standard Moses training
workflow uses way too much disk IO.
Because of that, I’ve made significant modifications to the
overall workflow to minimise unnecessary disk IO. These include using
bzip2 for intermediary files, gzip to compress the temp files for
‘sort’, and named pipes to avoid storing temporary data that is only
used once. The overall reduction in disk IO was about an order of
magnitude in my experiments. These modifications were made to the
Moses codebase from around June 2010 and I haven’t updated the
codebase since. However, last year I recompiled many of the tools
using clang and LLVM in Xcode.
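As a rough illustration of the named-pipe and bzip2 ideas mentioned above, here is a minimal Python sketch. It is not the modified Moses workflow itself, which is not shown in this thread. The function names (`produce`, `consume_to_bz2`, `run_demo`) and file names are hypothetical, and it assumes a Unix-like system where `os.mkfifo` is available. One step writes its output into a named pipe while the next step reads from it and stores the result bzip2-compressed, so the uncompressed intermediate data is never written to disk.

```python
import bz2
import os
import tempfile
import threading


def produce(fifo_path, lines):
    # Writer side: stands in for a pipeline step whose output is used only once.
    # Opening the FIFO blocks until a reader opens the other end.
    with open(fifo_path, "w", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")


def consume_to_bz2(fifo_path, bz2_path):
    # Reader side: streams from the pipe straight into a bzip2-compressed file.
    count = 0
    with open(fifo_path, "r", encoding="utf-8") as src, \
            bz2.open(bz2_path, "wt", encoding="utf-8") as dst:
        for line in src:
            dst.write(line)
            count += 1
    return count


def run_demo(lines):
    # Hypothetical driver: wires the two steps together through a named pipe.
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "step.fifo")
    bz2_path = os.path.join(workdir, "step.out.bz2")
    os.mkfifo(fifo_path)  # Unix-only; assumed available here

    writer = threading.Thread(target=produce, args=(fifo_path, lines))
    writer.start()
    count = consume_to_bz2(fifo_path, bz2_path)
    writer.join()
    os.remove(fifo_path)

    # Read the compressed result back to check that nothing was lost.
    with bz2.open(bz2_path, "rt", encoding="utf-8") as fh:
        roundtrip = fh.read().splitlines()
    return count, roundtrip


if __name__ == "__main__":
    n, data = run_demo(["das haus ||| the house", "ein buch ||| a book"])
    print(n, data)
```

In a real pipeline the writer and reader would typically be separate processes rather than threads; the thread here only keeps the sketch self-contained.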
Currently, I’m using an early 2011 15" MacBook Pro 2.4GHz with
16GB RAM for training. I’ve got two 750GB 7200rpm HDDs and I’m
running the trainings on the non-boot HDD under Mac OS X 10.7.3.
Another optimisation that could be done on Mac OS X is to put the
folder where you perform the training in the Spotlight exclusion
list, so that the indexer doesn’t sweat over your temporary
training data.
The average size of the data sets that I use for training is about
3.5M segments for a number of different languages, with the maximum
being over 7.5M segments for EN–JA.
If I remember correctly, snt2cooc.out was not the worst
offender for disk IO, but I cannot say for sure now. The way the
output of the ‘export’ script is handled is not particularly
efficient either…
Cheers,
Ventzi
–––––––
Dr. Ventsislav Zhechev
Computational Linguist
Language Technologies
Localisation Services
Autodesk Development Sàrl
Neuchâtel, Switzerland
http://VentsislavZhechev.eu
tel: +41 32 723 9122
fax: +41 32 723 9399

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support