On 22 February 2011 01:33, Felipe Barriga Richards <fbarr...@algometrics.cl> wrote:
> Hi everyone,
>
> I've developed a parser that reads CSV files and saves them into an HDF5
> file (in Python, using PyTables). The initial problem was that PyTables
> doesn't support simultaneous writes (from multiple threads/processes), so
> I decided to create multiple processes: each one reads a whole CSV file
> into an array, then locks the HDF5 object, writes the table, and
> releases it.
Reading more may not help if you can't write it at the same rate. I
usually append each row as it's read, then flush at the end:

    import csv
    from tables import IsDescription, StringCol, Float64Col, openFile

    class sales(IsDescription):
        customer = StringCol(30)
        item = StringCol(30)
        value = Float64Col()
        ...

    f = openFile("sales.h5", mode="w")
    sales_yyyymm = f.createTable("/", "sales_yyyymm", sales)

    reader = csv.DictReader(open("sales_yyyymm.csv"))
    row = sales_yyyymm.row  # one row buffer, reused for every record
    for r in reader:
        row["customer"] = r["customer"]
        ....
        row.append()        # append and flush are method calls
    sales_yyyymm.flush()
    f.close()

> The problem with that is that writing the array to the table is quite
> slow. Is there any option to speed this up?

Try writing to a different disk than the one you're reading from; on an
average machine you can saturate your read bandwidth and still keep up
writing to an SSD (even over USB). High-end machines are another story.

> Do more pre-processing on the threads (like a special kind of array) so
> they keep the hdf5 file locked for less time and avoid/reduce the
> bottleneck?

I think multiple processes/threads will only worsen the bottleneck. Doing
more pre-processing on a single read thread may help, but then you have to
know your incoming data will always have a similar profile.

> Will pytables pro improve this situation?

PyTables Pro mostly optimises indexing and reading of HDF5 files; write
performance comes down to disk speed, chunking, and compression. Try
increasing the compression level until your read/write I/O is balanced;
then you'll also get the benefit of faster querying on the HDF5 file. I
haven't played much with block sizes; there's much to read about here:
http://www.pytables.org/docs/manual/ch05.html

Cheers,
Tony
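[Editorial aside: the compression trade-off Tony describes can be seen with
plain zlib, the compressor PyTables uses by default. This standalone sketch
is not PyTables-specific; the data and levels are illustrative only. A
higher complevel costs more CPU per write but shrinks the bytes that must
hit the disk, which is how you balance read and write I/O.]

    import zlib

    # Highly repetitive data, like a typical CSV column.
    data = b"customer,item,value\n" * 100000

    fast = zlib.compress(data, 1)   # low CPU cost, larger output
    small = zlib.compress(data, 9)  # high CPU cost, smaller output

    # Prints the raw size and the two compressed sizes;
    # level 9 output is no larger than level 1 output.
    print(len(data), len(fast), len(small))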
_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users