Hello Francesc,

On Mon, 2011-02-21 at 22:17 +0100, Francesc Alted wrote:
> Hello Felipe,
> 
> 2011/2/21, Felipe Barriga Richards <fbarr...@algometrics.cl>:
> > On Tue, 2011-02-22 at 05:50 +1100, Tony Theodore wrote:
> >> On 22 February 2011 01:33, Felipe Barriga Richards
> >> <fbarr...@algometrics.cl> wrote:
> >> > Hi everyone,
> >> >
> >> > I've developed a parser that reads CSV files and saves them to an HDF5
> >> > file (in Python, using PyTables). The initial problem was that PyTables
> >> > doesn't support simultaneous writes (from multiple threads/processes),
> >> > so I decided to create multiple processes, each of which reads a whole
> >> > CSV file into an array, then locks the HDF5 object, writes the table,
> >> > and releases it.
> >>
> >> Reading more may not help if you can't write it out at the same rate. I
> >> usually append each row as it's read, then flush at the end:
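
(For reference, Tony's row-at-a-time approach would look roughly like
this; the schema and file names are made up, not my real ones:)

    import csv
    import tables

    # Hypothetical two-column layout; adjust to the real CSV schema.
    class Tick(tables.IsDescription):
        symbol = tables.StringCol(8)
        price = tables.Float64Col()

    fileh = tables.openFile('out.h5', mode='w')
    table = fileh.createTable('/', 'ticks', Tick)
    row = table.row
    for rec in csv.reader(open('input.csv')):
        row['symbol'] = rec[0]
        row['price'] = float(rec[1])
        row.append()        # buffered in memory, not yet on disk
    table.flush()           # write everything out once at the end
    fileh.close()
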
> > The current method is better than having only one thread: while one
> > thread is writing the HDF5 file, the others are processing the next
> > CSV files. If I do it line by line, it takes an eternity.
> 
> Hmm, can you show us how you are writing into the file?  I'd say that
> Tony is correct about using one single thread: it is the best route to
> maximum performance (leaving aside that simultaneous writes from
> different threads are not supported by either PyTables or HDF5).
Basically a producer-consumer model, with multiple csv -> array
producers and one array -> hdf5 consumer.
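
For the archives, roughly the shape of it (file names, dtype and schema
are made up for illustration):

    import multiprocessing as mp
    import numpy as np
    import tables

    DTYPE = np.dtype([('symbol', 'S8'), ('price', 'f8')])

    def produce(csv_path, queue):
        # Producer: parse one whole CSV into a structured array, hand it off.
        queue.put(np.loadtxt(csv_path, delimiter=',', dtype=DTYPE))

    if __name__ == '__main__':
        paths = ['a.csv', 'b.csv', 'c.csv']   # hypothetical inputs
        queue = mp.Queue()
        workers = [mp.Process(target=produce, args=(p, queue)) for p in paths]
        for w in workers:
            w.start()

        # Consumer: the only process that ever touches the HDF5 file,
        # so nothing needs to lock on the PyTables side.
        fileh = tables.openFile('out.h5', mode='w')
        table = fileh.createTable('/', 'ticks', DTYPE)
        for _ in paths:
            table.append(queue.get())   # one bulk write per finished CSV
        table.flush()
        fileh.close()
        for w in workers:
            w.join()
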
I've solved the problem in a quick and dirty way: using more servers to
process the data :P. As it's a one-time thing, I consider the problem
closed.

Thanks for all the help anyway.

Regards,

> 
> >> > The problem with that is that writing the array to the table is quite
> >> > slow. Is there any option to speed this up?
> >>
> >> Try writing to a different disk than the one you're reading from. On an
> >> average machine you can saturate your read bandwidth and write to an
> >> SSD (even over USB). High-end machines ???
> >>
> > The problem is not the disk: I'm using an iSCSI array connected over
> > two Gigabit Ethernet links, so I can easily achieve 100 MiB/s. Also, I
> > read the CSV files from the local HDD and the output goes to the iSCSI
> > array.
> >
> >> > Should I do more pre-processing in the threads (e.g. build a special
> >> > kind of array) so that they hold the lock on the HDF5 file for less
> >> > time, to avoid/reduce the bottleneck?
> >>
> >> I think multiple processes/threads will only make the bottleneck worse.
> >> Doing more pre-processing on a single read thread may help, but then
> >> you have to know your incoming data will always have a similar
> >> profile.
> > The bottleneck is the write method of PyTables. I have 16 CPUs, and
> > when the other processes are done (all CSV files read), there is only
> > one CPU working at full load, writing the damn HDF5 file :(
> > I don't have a swapping problem either.
> >>
> >> > Will PyTables Pro improve this situation?
> >>
> >> PyTables Pro mostly optimises indexing and reading (of HDF5 files);
> >> write performance is basically down to disk speed, chunking, and
> >> compression. Try increasing the compression level until your read/write
> >> I/O is balanced - then you'll also get the benefit of faster querying
> >> on the HDF5 file. I haven't played much with block sizes; there's much
> >> to read about them here:
> >>
> >> http://www.pytables.org/docs/manual/ch05.html
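
(Side note on the block sizes: chunking can at least be steered when the
table is created. A minimal sketch; the figure below is a guess, not a
recommendation:)

    import tables

    fileh = tables.openFile('out.h5', mode='w')
    # expectedrows lets PyTables pick a sensible chunkshape up front;
    # a chunkshape argument can also force it explicitly.
    table = fileh.createTable('/', 'ticks',
                              {'price': tables.Float64Col()},
                              expectedrows=50 * 1000 * 1000)
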
> > I know it improves queries, but I wasn't sure about the writing
> > speed.
> >
> >
> > Can writing a NumPy array, or some other kind of array, improve the
> > speed versus writing a standard one?
> 
> Perhaps, but you should first show us how you are saving right now.
> If what you want is to saturate your CPUs while writing, it would also
> help to use compression (for example, Blosc can use all your cores
> during the compression process).
> 
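
A sketch of the compression suggestion (complevel=5 is an arbitrary
starting point; the idea is to tune it until read/write I/O balances):

    import tables

    # Blosc is multi-threaded, so compression can soak up the idle
    # cores even though a single process does all the HDF5 writing.
    filters = tables.Filters(complevel=5, complib='blosc')
    fileh = tables.openFile('out.h5', mode='w', filters=filters)
    table = fileh.createTable('/', 'ticks',
                              {'price': tables.Float64Col()},
                              filters=filters)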


