Hello Dhananjaya,

I see what is going on.  Yes, you are correct that PyTables is not optimized
for creating new datasets, but neither is the C version of HDF5.  The real
point of these libraries is to store large amounts of well-structured data,
not large numbers of small datasets.  (It looks like Francesc beat me to
this.)

Playing around with some timing on my machine, I found the following:

Your code:  12.7 seconds

2000 datasets per group of length 10000 (same number of points): 2.96
seconds

Change to Array, rather than EArray: 14.91 seconds (unexpected, but probably
an initialization issue)

Change EArray chunksize to 10: 41.8 seconds


Clearly, this demonstrates Francesc's point and mine.  Also, I would argue
that even for 30,000 datasets, 12.7 seconds is really not all that bad.  If
you want the full benefit of HDF5, I would figure out a way to represent your
data such that I/O is improved.  By thinking about the problem in a slightly
different way you'll see gains in both the C and Python implementations.
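
If it helps make "represent your data differently" concrete, here is a
minimal sketch (my own illustration, not code from this thread; the names
`small_arrays` and `block` and the dataset sizes are assumptions taken from
your script) of collapsing each group's many small arrays into one 2-D block:

```python
import numpy

# Hypothetical restructuring sketch: instead of 10,000 separate length-2000
# EArrays per group, pack the same samples into one 2-D block per condition
# group.  Sizes mirror the script in this thread.
n_datasets, n_points = 10000, 2000
small_arrays = [numpy.arange(n_points, dtype=numpy.float32)
                for _ in range(n_datasets)]

# One (n_datasets, n_points) array; row i holds what used to be dataset "i".
block = numpy.vstack(small_arrays)
print(block.shape)  # (10000, 2000)

# With PyTables, each condition group would then hold a single extendable
# array rather than 10,000 of them -- untested sketch, in the same old-style
# API your script uses:
#   arr = h5file.createEArray(get_gp, 'samples', atom, (0, n_points),
#                             "all samples", chunkshape=(64, n_points))
#   arr.append(block)
```

One big append amortizes the per-dataset creation cost that dominates your
current timings, and a 2-D chunkshape is just one possible choice here.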

Be Well
Anthony

On Tue, Mar 15, 2011 at 10:21 PM, Dhananjaya <dhanush...@yahoo.com> wrote:

>
> Hi Anthony,
>
> Thanks for your reply. My comparison is straightforward. Right now we are
> using C code and native C HDF5 (ver 1.6.5) routines to create datasets and
> groups and organize our sampling data in an HDF5 file. In this respect we
> tend to create numerous tables and datasets (single dimension) in the HDF5
> file. After going through the documentation of PyTables, I wrote a simple
> Python script to create 3 groups and around 10,000 datasets in each group.
>     I am not expecting PyTables to be as fast as C, but what I am seeing is
> a huge difference on the performance front. Here is the script:
>
> import warnings
> warnings.simplefilter("ignore")
> import numpy
> from tables import *
> import time
>
> tstart = time.clock()
> h5file = openFile('test5.h5', mode='w')
> groups = h5file.createGroup("/", 'Parameters_test', 'Parameters Group')  # create groups
> atom = Float32Atom()
> cond_name = [ "cond1", "cond2", "cond3" ]
> #filters = Filters(complevel=1, complib='zlib' )
> data_array = numpy.arange(2000)  # dummy data array
> for counter in range(len(cond_name)):
>     get_gp = h5file.createGroup( groups, str(cond_name[counter]), str(cond_name[counter]) )
>     for i in range(10000):
>         arr = h5file.createEArray(get_gp, str(i), atom, (0,), "E array", chunkshape=(1000,) )
>         arr.append(data_array)
> h5file.close()
> print time.clock() - tstart, "seconds"
>
> Regards
> Dhananjaya
>
>
>
> ------------------------------------------------------------------------------
> Colocation vs. Managed Hosting
> A question and answer guide to determining the best fit
> for your organization - today and in the future.
> http://p.sf.net/sfu/internap-sfd2d
> _______________________________________________
> Pytables-users mailing list
> Pytables-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/pytables-users
>
