Hello Felipe,

2011/2/21, Felipe Barriga Richards <fbarr...@algometrics.cl>:
> On Tue, 2011-02-22 at 05:50 +1100, Tony Theodore wrote:
>> On 22 February 2011 01:33, Felipe Barriga Richards
>> <fbarr...@algometrics.cl> wrote:
>> > Hi everyone,
>> >
>> > I've developed a parser that reads csv files and saves them to an hdf5
>> > file (in Python, using pytables). The initial problem was that pytables
>> > doesn't support simultaneous writes (from multiple threads/processes),
>> > so I decided to create multiple processes so that each of them reads a
>> > whole csv file into an array and then locks the hdf5 object, writes the
>> > table and releases it.
>>
>> Reading more may not help if you can't write it at the same rate, I
>> usually append each row as it's read, then flush at the end:
>
> The current method is better than having only one thread. While one thread
> is writing the hdf5 file, the others are processing the next csv files.
> If I do it line by line it takes an eternity.
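For reference, a minimal sketch of the append-each-row, flush-at-the-end
pattern Tony describes (the table description, file names and csv layout
below are invented for the example):

import csv
import tables

# Hypothetical row layout -- adapt to your real csv columns.
class Tick(tables.IsDescription):
    timestamp = tables.Int64Col()
    price = tables.Float64Col()
    volume = tables.Int64Col()

h5 = tables.openFile("out.h5", mode="w")
table = h5.createTable("/", "ticks", Tick)
row = table.row

for ts, price, vol in csv.reader(open("input.csv", "rb")):
    row["timestamp"] = int(ts)
    row["price"] = float(price)
    row["volume"] = int(vol)
    row.append()       # rows are buffered in memory by PyTables
table.flush()          # one flush at the end
h5.close()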
Hmm, can you show us how you are writing into the file?  I'd say that Tony
is correct about using one single thread: it is the best route to maximum
performance (leaving aside that simultaneous writes from different threads
are not supported by either PyTables or HDF5).

>> > The problem with that is that writing the array to the table is quite
>> > slow. Is there any option to speed this up?
>>
>> Try to write to a different disk than you're reading from, on an
>> average machine, you can saturate your read bandwidth and write to an
>> SSD (even over USB). High end machines ???
>
> The problem is not the disk, I'm using an iSCSI array connected by two
> gigabit Ethernet links, so I can easily achieve 100 MiB/s. Also, I read
> the csv files from the local hdd and the output goes to iSCSI.
>
>> > Do more pre-processing on the threads (like a special kind of array) so
>> > they keep the hdf5 file locked for less time and avoid/reduce the
>> > bottleneck?
>>
>> I think multiple processes/threads will only increase the bottleneck.
>> Doing more pre-processing on a single read thread may help, but then
>> you have to know your incoming data will always have a similar
>> profile.
>
> The bottleneck is the write method of pytables. I have 16 CPUs, and when
> the other processes are done (they have read all the csv files), there is
> only one CPU working at full load writing the damn hdf5 file :(
> I don't have a swapping problem either.
>
>> > Will pytables pro improve this situation?
>>
>> PyTables Pro mostly optimises indexing and reading (hdf files); write
>> performance is basically disk speed, chunking, and compression. Try to
>> increase the compression level till your read/write io is balanced -
>> then you'll also have the benefit of faster querying on the hdf file.
>> I haven't played much with block sizes, there's much to read about here:
>>
>> http://www.pytables.org/docs/manual/ch05.html
>
> I know that it improves queries, but I wasn't sure about the writing
> speed.
>
> Can writing a NumPy array (or some other kind of array) improve the speed
> versus writing a standard one?

Perhaps, but you should first show us how you are saving right now.

If what you want is to saturate your CPUs for writing, it would also help
to use compression (for example, Blosc can use all your cores during the
compression process).

--
Francesc Alted
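For illustration, a minimal sketch of the single-writer setup suggested
above -- appending whole NumPy structured arrays to a Blosc-compressed
table (the dtype, file names and sizes are invented for the example):

import numpy as np
import tables

# Invented dtype -- match it to your real csv columns.
dtype = np.dtype([("timestamp", np.int64),
                  ("price", np.float64),
                  ("volume", np.int64)])

# Blosc can use several cores while compressing the chunks.
filters = tables.Filters(complevel=5, complib="blosc")

h5 = tables.openFile("out.h5", mode="w")
table = h5.createTable("/", "ticks", dtype, filters=filters,
                       expectedrows=10000000)

# Each reader process would hand back one structured array per csv file;
# the single writer appends it in one call instead of row by row.
chunk = np.zeros(100000, dtype=dtype)   # stand-in for a parsed csv file
table.append(chunk)
table.flush()
h5.close()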