On 22 February 2011 01:33, Felipe Barriga Richards <fbarr...@algometrics.cl> wrote:
> Hi everyone,
>
> I've developed a parser that reads CSV files and saves them into an HDF5
> file (in Python, using PyTables). The initial problem was that PyTables
> doesn't support simultaneous writes (from multiple threads/processes), so
> I decided to create multiple processes: each one reads a whole CSV file
> into an array, then locks the HDF5 object, writes the table, and
> releases it.
Reading more may not help if you can't write it at the same rate. I
usually append each row as it's read, then flush at the end:

    import csv
    from tables import IsDescription, StringCol, Float64Col, openFile

    class sales(IsDescription):
        customer = StringCol(30)
        item = StringCol(30)
        value = Float64Col()
        ...

    f = openFile("sales.h5", mode="w")
    sales_yyyymm = f.createTable("/", "sales_yyyymm", sales)

    reader = csv.DictReader(open("sales_yyyymm.csv"))
    row = sales_yyyymm.row  # one row buffer, reused for every record
    for r in reader:
        row["customer"] = r["customer"]
        ....
        row.append()        # append and flush are method calls
    sales_yyyymm.flush()
    f.close()

> The problem with that is that writing the array to the table is quite
> slow. Is there any option to speed this up?

Try writing to a different disk than the one you're reading from; on an
average machine you can saturate your read bandwidth and still keep up
writing to an SSD (even over USB). High-end machines are another story.

> Do more pre-processing on the threads (like a special kind of array) so
> they keep the hdf5 file locked for less time and avoid/reduce the
> bottleneck?

I think multiple processes/threads will only worsen the bottleneck. Doing
more pre-processing on a single read thread may help, but then you have to
know your incoming data will always have a similar profile.

> Will pytables pro improve this situation?

PyTables Pro mostly optimises indexing and reading of HDF5 files; write
performance comes down to disk speed, chunking, and compression. Try
increasing the compression level until your read/write I/O is balanced;
then you'll also get the benefit of faster querying on the HDF5 file. I
haven't played much with block sizes; there's much to read about here:
http://www.pytables.org/docs/manual/ch05.html

Cheers,
Tony
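[Editorial aside: the compression trade-off Tony describes can be seen with
plain zlib, the compressor PyTables uses by default. This standalone sketch
is not PyTables-specific; the data and levels are illustrative only. A
higher complevel costs more CPU per write but shrinks the bytes that must
hit the disk, which is how you balance read and write I/O.]

    import zlib

    # Highly repetitive data, like a typical CSV column.
    data = b"customer,item,value\n" * 100000

    fast = zlib.compress(data, 1)   # low CPU cost, larger output
    small = zlib.compress(data, 9)  # high CPU cost, smaller output

    # Prints the raw size and the two compressed sizes;
    # level 9 output is no larger than level 1 output.
    print(len(data), len(fast), len(small))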
_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users