On Tue, Oct 30, 2012 at 6:20 PM, Andrea Gavana <andrea.gav...@gmail.com>wrote:

> Hi Anthony,
>
> On 30 October 2012 22:52, Anthony Scopatz wrote:
> > Hi Andrea,
> >
> > Your problem is twofold.
> >
> > 1. Your timing wasn't reporting the time per data set, but rather the
> > total time elapsed since writing began.  You need to put the start time
> > inside the loop to get the time per data set.
> >
> > 2. Your larger problem was that you were writing too many times.
> > Generally it is faster to write a few big blocks of data than to perform
> > a lot of small write operations.  Since you had data set opening and
> > writing inside a doubly nested loop, it is not surprising that you were
> > getting terrible performance.  You were basically maximizing HDF5
> > overhead ;).  Using slicing I removed the outermost loop and saw timings
> > like the following:
> >
> > H5 file creation time: 7.406
> >
> > Saving results for table: 0.0105440616608
> > Saving results for table: 0.0158948898315
> > Saving results for table: 0.0164661407471
> > Saving results for table: 0.00654292106628
> > Saving results for table: 0.00676298141479
> > Saving results for table: 0.00664114952087
> > Saving results for table: 0.0066990852356
> > Saving results for table: 0.00687289237976
> > Saving results for table: 0.00664210319519
> > Saving results for table: 0.0157809257507
> > Saving results for table: 0.0141618251801
> > Saving results for table: 0.00796294212341
> >
> > Please see the attached version, at around line 82.  Additionally, if
> > you need to focus on performance I would recommend reading
> > http://pytables.github.com/usersguide/optimization.html.  PyTables can
> > be blazingly fast when used correctly.  I would also highly recommend
> > looking into compression.
> >
> > I hope this helps!
>
> Thank you for your answer; indeed, I was timing it wrongly (I really
> need to go to sleep...). However, although I understand the need to
> "write fewer" times, I am not sure I can actually do that in my
> situation. Let me explain:
>
> 1. I have a GUI which starts a number of parallel processes (up to 16,
> depending on a user selection);
> 2. These processes actually do the computation/simulations - so, if I
> have 1,000 simulations to run and 8 parallel processes, each process
> gets 125 simulations (each of which holds 1,200 "objects" with a 600x7
> timeseries matrix per object).
>

Well, you can at least change the order of the loops and see if that helps.
That is, rather than doing:

for i in xrange(n):      # n = number of simulations
    for p in table:
        ...

Do the following instead:

for p in table:
    for i in xrange(n):
        ...

I don't believe that this will help too much, though, since you are still
writing every element individually.
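
To illustrate the "write bigger blocks" idea more concretely, here is a
minimal sketch (the file name, node name, and array shape are just
illustrative, and the API spelling is the modern PyTables one, not
necessarily what is in the attached script):

import numpy as np
import tables

with tables.open_file("results.h5", mode="w") as h5file:
    # One chunked array for all objects instead of many tiny writes.
    arr = h5file.create_carray(h5file.root, "results",
                               tables.Float64Atom(),
                               shape=(1200, 600, 7))
    for p in range(1200):
        block = np.random.rand(600, 7)   # placeholder for one object's 600x7 timeseries
        arr[p, :, :] = block             # a single sliced write per object

One sliced write per object replaces 600*7 element-wise writes, which is
where most of the HDF5 overhead was going.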


>
> If I had to write out the results only at the end, it would mean
> finding a way to share the 1,200 "objects" matrices across all the
> parallel processes (and I am not sure whether PyTables is going to
> complain when multiple concurrent processes try to access the same
> underlying HDF5 file).
>

Reading in parallel works pretty well.  Writing causes more headaches
but can be done.
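
For the reading side, something along these lines usually works (a rough
sketch with made-up file and node names; each worker opens its own
read-only handle and nobody writes to the file at the same time):

import multiprocessing
import tables

def read_one(idx):
    # Each process gets its own file handle; concurrent read-only access
    # is generally fine as long as no process is writing to the file.
    with tables.open_file("simulations.h5", mode="r") as h5f:
        return h5f.root.results[idx]

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=4)
    rows = pool.map(read_one, range(8))
    pool.close()
    pool.join()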


> Or I could create one HDF5 file per process, but given the nature of
> the simulation I am running, every "object" in the 1,200 "objects"
> pool would need to keep a reference to a 125x600x7 matrix (assuming
> 1,000 simulations and 8 processes) around in memory *OR* I would need
> to write the results to the HDF5 file for every simulation. Although
> we have extremely powerful PCs at work, I am not sure that is the
> right way to go...
>
> As always, I am open to all suggestions on how to improve my approach.
>

My basic suggestion is to have all of your processes produce results which
are then aggregated by a single master process.  This master is the only
one with write access to the HDF5 file, which will let you create larger
arrays and minimize the number of writes that you do.

You'll probably want to take a look at this example:
https://github.com/PyTables/PyTables/blob/develop/examples/multiprocess_access_queues.py

I think that there might be a page in the docs about it now too...

But I think that this is the strategy that you want to pursue.  Multiple
compute processes, one write process.
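
In case it helps, here is a very rough sketch of that pattern (all names,
shapes, and counts below are made up for illustration):

import multiprocessing
import numpy as np
import tables

def worker(n_sims, queue):
    # Compute process: never touches the HDF5 file, only produces results.
    for _ in range(n_sims):
        queue.put(np.random.rand(600, 7))    # one simulation result
    queue.put(None)                          # sentinel: this worker is done

def writer(n_workers, queue):
    # The single process with write access to the file.
    with tables.open_file("results.h5", mode="w") as h5f:
        earr = h5f.create_earray(h5f.root, "results",
                                 tables.Float64Atom(), shape=(0, 600, 7))
        done = 0
        while done < n_workers:
            item = queue.get()
            if item is None:
                done += 1
            else:
                earr.append(item[np.newaxis, ...])   # one append per result

if __name__ == "__main__":
    q = multiprocessing.Queue()
    workers = [multiprocessing.Process(target=worker, args=(125, q))
               for _ in range(8)]
    wproc = multiprocessing.Process(target=writer, args=(8, q))
    for p in workers:
        p.start()
    wproc.start()
    for p in workers:
        p.join()
    wproc.join()

You could also batch several results together before appending to cut the
number of writes down even further.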


>
> Thank you again for your quick and enlightening answer.
>

No problem!
Be Well
Anthony



>
> Andrea.
>
> "Imagination Is The Only Weapon In The War Against Reality."
> http://www.infinity77.net
>
> # ------------------------------------------------------------- #
> def ask_mailing_list_support(email):
>
>     if mention_platform_and_version() and include_sample_app():
>         send_message(email)
>     else:
>         install_malware()
>         erase_hard_drives()
> # ------------------------------------------------------------- #
>
>