Looks like it's reading a line from each of three files then writing a line to stdout. They're all C++ streams. Sounds like a recipe for disk seeks. Try making the buffer bigger:
const size_t kBufSize = 1ULL << 25;  /* 32 MiB */
char *t1_buffer = new char[kBufSize];
t1.rdbuf()->pubsetbuf(t1_buffer, kBufSize);
...
delete [] t1_buffer;

Kenneth

On 02/22/2012 03:41 PM, "Венцислав Жечев (Ventsislav Zhechev)" wrote:
> Hieu Hoang <hieuhoang@...> writes:
>
> > hi all
> >
> > does anyone have experience running the training pipeline on a laptop?
> > It seems very slow to me, especially some parts of the GIZA++ alignment
> > (and possibly later stages too). It seems to be crawling due to IO waits
> > on a GIZA process called snt2cooc.out. This doesn't happen when running
> > on larger servers. Has anyone else encountered this problem?
> >
> > I'm using a MacBook 2.4GHz dual core, OS X 10.7.3, 240GB disk
> > (5400rpm), 4GB RAM.
>
> Hi Hieu,
>
> I’ve been training Moses engines successfully on Mac laptops for many
> years now, but I must concur that the standard Moses training workflow
> uses far too much disk IO.
>
> Because of that, I’ve made significant modifications to the overall
> workflow to minimise unnecessary disk IO: using bzip2 for intermediary
> files, gzip to compress the temp files for ‘sort’, and named pipes to
> avoid storing temporary data that is only used once. The overall
> reduction in disk IO was about an order of magnitude in my experiments.
> These modifications were made on the Moses codebase from around June
> 2010 and I haven’t updated the codebase since. However, last year I
> recompiled many of the tools using clang and LLVM in Xcode.
>
> Currently, I’m using an early 2011 15" MacBook Pro 2.4GHz with 16GB RAM
> for training. I’ve got two 750GB 7200rpm HDDs and I run the trainings
> on the non-boot HDD under Mac OS X 10.7.3. Another optimisation that
> can be done on Mac OS X is to add the folder where you perform the
> training to the Spotlight exclusion list, so that the indexer doesn’t
> sweat over your temporary training data.
> The average size of the data sets that I use for training is about
> 3.5M segments for a number of different languages, with the maximum
> being over 7.5M segments for EN–JA.
>
> If I remember correctly, snt2cooc.out was not the worst offender for
> disk IO, but I cannot say for sure now. The way the output of the
> ‘export’ script is handled is not particularly efficient either…
>
> Cheers,
> Ventzi
>
> –––––––
> Dr. Ventsislav Zhechev
> Computational Linguist
>
> Language Technologies
> Localisation Services
> Autodesk Development Sàrl
> Neuchâtel, Switzerland
>
> http://VentsislavZhechev.eu
> tel: +41 32 723 9122
> fax: +41 32 723 9399
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
