On Tue, 2011-02-22 at 05:50 +1100, Tony Theodore wrote:
> On 22 February 2011 01:33, Felipe Barriga Richards
> <fbarr...@algometrics.cl> wrote:
> > Hi everyone,
> >
> > I've developed a parser that reads csv files and saves them to an
> > hdf5 file (in Python, using PyTables). The initial problem was that
> > PyTables doesn't support simultaneous writes (from multiple
> > threads/processes), so I decided to create multiple processes: each
> > one reads a whole csv file into an array, then locks the hdf5
> > object, writes to the table, and releases it.
> 
> Reading more may not help if you can't write it at the same rate. I
> usually append each row as it's read, then flush at the end:
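
(A rough sketch of that row-by-row pattern -- `tbl` is an already-created
PyTables table, and the two-column csv layout and column names are just
placeholders:)

    import csv

    with open('data.csv') as f:
        row = tbl.row                      # buffered row used for appends
        for ts, value in csv.reader(f):    # assumes a two-column csv
            row['ts'] = int(ts)            # fill each column of the buffer
            row['value'] = float(value)
            row.append()                   # queued; PyTables writes in chunks
        tbl.flush()                        # flush the remaining buffered rows once
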
The current method is better than having only one thread: while one
thread is writing the hdf5 file, the others are processing the next
csv files. If I do it line by line it takes an eternity.
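
Roughly, each worker does something like this (an illustrative sketch,
not the real code; file and table names are placeholders, and it assumes
out.h5 already contains a /data table whose description matches the csv
columns):

    import numpy as np
    import tables

    def worker(csv_path, lock):
        # parse the whole csv into a structured array first (this part
        # runs in parallel across processes)
        data = np.genfromtxt(csv_path, delimiter=',', names=True)
        with lock:                                    # one writer at a time
            h5 = tables.openFile('out.h5', mode='a')  # PyTables 2.x-style API
            h5.root.data.append(data)                 # one bulk append per csv
            h5.flush()
            h5.close()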

> > The problem with that is that writing the array to the table is
> > quite slow. Is there any option to speed this up?
> 
> Try writing to a different disk than the one you're reading from. On
> an average machine you can saturate your read bandwidth and write to
> an SSD (even over USB). Not sure about high-end machines.
> 
The problem is not the disk: I'm using an iSCSI array connected over
two gigabit ethernet links, so I can easily achieve 100 MiB/s. Also, I
read the csv files from a local hdd and the output goes to the iSCSI
array.

> > Should I do more pre-processing in the threads (like building a
> > special kind of array) so they keep the hdf5 file locked for less
> > time and avoid/reduce the bottleneck?
> 
> I think multiple processes/threads will only increase the bottleneck.
> Doing more pre-processing on a single read thread may help, but then
> you have to know your incoming data will always have a similar
> profile.
The bottleneck is the write method of PyTables. I have 16 CPUs, and
when the other processes are done (all csv files read), there is only
one CPU working at full load, writing the damn hdf5 file :(
I don't have a swapping problem either.
> 
> > Will pytables pro improve this situation?
> 
> Pytables Pro mostly optimises indexing and reading (hdf files); write
> performance is basically disk speed, chunking, and compression. Try to
> increase the compression level until your read/write io is balanced -
> then you'll also have the benefit of faster querying on the hdf file.
> I haven't played much with block sizes; there's much to read about
> here:
> 
> http://www.pytables.org/docs/manual/ch05.html
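
(For reference, compression is set through a Filters object when the
file or table is created -- a minimal sketch with example values, using
the 2.x-style names:)

    import tables

    filters = tables.Filters(complevel=5, complib='blosc')  # example values; 'zlib' also works
    h5 = tables.openFile('out.h5', mode='w', filters=filters)
    # tables created in this file inherit the filters unless overridden,
    # so appended chunks get compressed before they hit the disk
    h5.close()
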
I know that it improves queries, but I wasn't sure about the writing
speed.


Can writing a NumPy array (or another kind of array) improve the speed
versus writing a standard one?
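
For example, is there a real difference between these two calls?
(Illustrative snippet; `tbl`, the dtype and the values are made up.)

    import numpy as np

    # appending a plain list of Python tuples
    tbl.append([(1, 2.5), (2, 3.5)])

    # appending a structured numpy array that already matches the table's dtype
    arr = np.array([(1, 2.5), (2, 3.5)], dtype=[('ts', 'i8'), ('value', 'f8')])
    tbl.append(arr)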

> 
> Cheers,
> 
> Tony
Thanks Tony!

