Hi Anthony,

On 30 October 2012 22:52, Anthony Scopatz wrote:
> Hi Andrea,
>
> Your problem is twofold.
>
> 1. Your timing wasn't reporting the time per data set, but rather the
> cumulative time spent writing all of the data sets so far. You need to put
> the start time inside the loop to get the time per data set.
>
> 2. Your larger problem was that you were writing too many times. Generally
> it is faster to write fewer, bigger sets of data than to perform a lot of
> small write operations. Since you had data set opening and writing in a
> doubly nested loop, it is not surprising that you were getting terrible
> performance. You were basically maximizing HDF5 overhead ;). Using
> slicing I removed the outermost loop and saw timings like the following:
>
> H5 file creation time: 7.406
>
> Saving results for table: 0.0105440616608
> Saving results for table: 0.0158948898315
> Saving results for table: 0.0164661407471
> Saving results for table: 0.00654292106628
> Saving results for table: 0.00676298141479
> Saving results for table: 0.00664114952087
> Saving results for table: 0.0066990852356
> Saving results for table: 0.00687289237976
> Saving results for table: 0.00664210319519
> Saving results for table: 0.0157809257507
> Saving results for table: 0.0141618251801
> Saving results for table: 0.00796294212341
>
> Please see the attached version, at around line 82. Additionally, if you
> need to focus on performance I would recommend reading
> http://pytables.github.com/usersguide/optimization.html. PyTables can be
> blazingly fast when implemented correctly. I would highly recommend looking
> into compression.
>
> I hope this helps!
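Just to check that I am reading your fix correctly, the pattern would be
roughly the one below: the timer wraps exactly one write, and each write
moves a whole 600x7 block in a single sliced assignment. The names, shapes
and the CArray layout here are my own guesses, not your attached script:

import time
import numpy as np
import tables as tb

N_OBJECTS, N_STEPS, N_COLS = 1200, 600, 7

h5file = tb.open_file("results.h5", mode="w")
carray = h5file.create_carray(h5file.root, "timeseries",
                              atom=tb.Float64Atom(),
                              shape=(N_OBJECTS, N_STEPS, N_COLS))

results = np.random.random((N_OBJECTS, N_STEPS, N_COLS))

for i in range(N_OBJECTS):
    start = time.time()           # start the clock *inside* the loop
    carray[i, :, :] = results[i]  # one sliced write per object, no inner loop
    print("Saving results for table: %s" % (time.time() - start))

h5file.close()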
Thank you for your answer; indeed, I was timing it wrongly (I really need to
get some sleep...). However, although I understand the need to "write fewer,
bigger" chunks, I am not sure I can actually do that in my situation. Let me
explain:

1. I have a GUI which starts a number of parallel processes (up to 16,
   depending on a user selection);

2. These processes do the actual computation/simulations - so, if I have
   1,000 simulations to run and 8 parallel processes, each process gets 125
   simulations (each of which holds 1,200 "objects" with a 600x7 timeseries
   matrix per object).

If I wrote out the results only at the end, I would have to find a way to
share the 1,200 "objects" matrices across all the parallel processes (and I
am not sure whether PyTables will complain when multiple concurrent
processes try to access the same underlying HDF5 file). Or I could create
one HDF5 file per process (a rough sketch of that idea is at the end of this
message), but given the nature of the simulation I am running, every
"object" in the 1,200-"objects" pool would then need to keep a reference to
a 125x600x7 matrix (assuming 1,000 simulations and 8 processes) around in
memory *OR* I would need to write the results to the HDF5 file after every
simulation. Although we have extremely powerful PCs at work, I am not sure
this is the right way to go...

As always, I am open to all suggestions on how to improve my approach.

Thank you again for your quick and enlightening answer.

Andrea.

"Imagination Is The Only Weapon In The War Against Reality."
http://www.infinity77.net

# ------------------------------------------------------------- #
def ask_mailing_list_support(email):

    if mention_platform_and_version() and include_sample_app():
        send_message(email)
    else:
        install_malware()
        erase_hard_drives()
# ------------------------------------------------------------- #
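P.S. This is the one-HDF5-file-per-process idea mentioned above, as a
minimal sketch only: it is simplified to one 600x7 matrix per simulation,
and run_simulation / worker and the file names are made up for
illustration, not code from my application. Each process buffers its own
batch and does a single append into its own file, so no two processes ever
touch the same HDF5 file:

import multiprocessing as mp
import numpy as np
import tables as tb

N_STEPS, N_COLS = 600, 7

def run_simulation(sim_id):
    # Stand-in for the real computation: one 600x7 timeseries per simulation.
    return np.random.random((N_STEPS, N_COLS))

def worker(proc_id, sim_ids):
    # Each process owns its own file, so there is no concurrent HDF5 access.
    h5file = tb.open_file("results_%02d.h5" % proc_id, mode="w")
    earray = h5file.create_earray(h5file.root, "timeseries",
                                  atom=tb.Float64Atom(),
                                  shape=(0, N_STEPS, N_COLS),
                                  expectedrows=len(sim_ids))
    batch = [run_simulation(s) for s in sim_ids]  # accumulate in memory
    earray.append(np.array(batch))                # one big write per process
    h5file.close()

if __name__ == "__main__":
    n_procs, n_sims = 8, 1000
    chunks = np.array_split(np.arange(n_sims), n_procs)
    procs = [mp.Process(target=worker, args=(i, chunk))
             for i, chunk in enumerate(chunks)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

To keep memory bounded, the append could of course be done every N
simulations instead of once at the end, which is a middle ground between
writing after every simulation and holding everything in memory.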