Looks like it's reading a line from each of three files then writing a
line to stdout. They're all C++ streams. Sounds like a recipe for disk
seeks. Try making the buffers bigger:

const size_t kBufSize = 1ULL << 25;

char *t1_buffer = new char[kBufSize];
t1.rdbuf()->pubsetbuf(t1_buffer, kBufSize);
...

delete [] t1_buffer;

Kenneth

On 02/22/2012 03:41 PM, "Венцислав Жечев (Ventsislav Zhechev)" wrote:
> Hieu Hoang <hieuhoang@...> writes:
>
>  >
>  > hi all
>  >
>  > does anyone have experience running the training pipeline on a laptop?
>  > It seems very slow to me, especially some parts of the GIZA++ alignment
>  > (and possibly later stages too). Seems to be crawling due to IO waits on
>  > a GIZA process called snt2cooc.out. This doesn't happen when running on
>  > larger servers. Has anyone else encountered this problem?
>  >
>  > I'm using a MacBook 2.4GHz dual core, OS X 10.7.3, 240GB disk (5400
>  > rpm), 4GB RAM.
>  >
>
> Hi Hieu,
>
> I’ve been training Moses engines successfully on Mac laptops for many
> years now, but I must concur that the standard Moses training workflow
> uses way too much disk IO.
>
> Because of that, I’ve made significant modifications to the overall
> workflow to minimise unnecessary disk IO, including using bzip2 for
> intermediary files, gzip to compress the temp files for ‘sort’, and
> named pipes to avoid storing temporary data that is only used once. The
> overall reduction in disk IO was about an order of magnitude in my
> experiments.
> These modifications were done on the Moses codebase from around June
> 2010 and I haven’t updated the codebase since. However, last year I
> recompiled many of the tools using clang and LLVM in Xcode.
>
> Currently, I’m using an early 2011 15" MacBook Pro 2.4GHz with 16GB RAM
> for training. I’ve got two 750GB 7200rpm HDDs and I’m running the
> trainings on the non-boot HDD under Mac OS X 10.7.3. Another
> optimisation that could be done on Mac OS X is to put the folder where
> you perform the training in the Spotlight exclusion list, so that the
> indexer doesn’t sweat over your temporary training data. The average
> size of the data sets that I use for training is about 3.5M segments for
> a number of different languages, with the maximum being over 7.5M
> segments for EN–JA.
>
> If I remember correctly, snt2cooc.out was not the worst offender for
> disk IO, but I cannot say for sure now. The way the output of the
> ‘export’ script is handled is not particularly efficient either…
>
>
> Cheers,
>
> Ventzi
>
> –––––––
> Dr. Ventsislav Zhechev
> Computational Linguist
>
> Language Technologies
> Localisation Services
> Autodesk Development Sàrl
> Neuchâtel, Switzerland
>
> http://VentsislavZhechev.eu
> tel: +41 32 723 9122
> fax: +41 32 723 9399
>
>
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
