Hi Anthony & All,

On 30 October 2012 23:31, Anthony Scopatz wrote:
> On Tue, Oct 30, 2012 at 6:20 PM, Andrea Gavana <andrea.gav...@gmail.com> wrote:
>>
>> Hi Anthony,
>>
>> On 30 October 2012 22:52, Anthony Scopatz wrote:
>> > Hi Andrea,
>> >
>> > Your problem is twofold.
>> >
>> > 1. Your timing wasn't reporting the time per data set, but rather the
>> > total time since writing all data sets. You need to put the start time
>> > inside the loop to get the time per data set.
>> >
>> > 2. Your larger problem was that you were writing too many times.
>> > Generally it is faster to write fewer, bigger sets of data than to
>> > perform a lot of small write operations. Since you had data set
>> > opening and writing in a doubly nested loop, it is not surprising that
>> > you were getting terrible performance. You were basically maximizing
>> > HDF5 overhead ;). Using slicing I removed the outermost loop and saw
>> > timings like the following:
>> >
>> > H5 file creation time: 7.406
>> >
>> > Saving results for table: 0.0105440616608
>> > Saving results for table: 0.0158948898315
>> > Saving results for table: 0.0164661407471
>> > Saving results for table: 0.00654292106628
>> > Saving results for table: 0.00676298141479
>> > Saving results for table: 0.00664114952087
>> > Saving results for table: 0.0066990852356
>> > Saving results for table: 0.00687289237976
>> > Saving results for table: 0.00664210319519
>> > Saving results for table: 0.0157809257507
>> > Saving results for table: 0.0141618251801
>> > Saving results for table: 0.00796294212341
>> >
>> > Please see the attached version, at around line 82. Additionally, if
>> > you need to focus on performance I would recommend reading
>> > http://pytables.github.com/usersguide/optimization.html. PyTables can
>> > be blazingly fast when implemented correctly. I would highly recommend
>> > looking into compression.
>> >
>> > I hope this helps!
>>
>> Thank you for your answer; indeed, I was timing it wrongly (I really
>> need to go to sleep...). However, although I understand the need to
>> "write fewer" times, I am not sure I can actually do that in my
>> situation. Let me explain:
>>
>> 1. I have a GUI which starts a number of parallel processes (up to 16,
>> depending on a user selection);
>> 2. These processes actually do the computation/simulations - so, if I
>> have 1,000 simulations to run and 8 parallel processes, each process
>> gets 125 simulations (each of which holds 1,200 "objects" with a 600x7
>> timeseries matrix per object).
>
> Well, you can at least change the order of the loops and see if that
> helps. That is, rather than doing:
>
>     for i in xrange(NUM_SIM):
>         for p in table:
>             ...
>
> do the following instead:
>
>     for p in table:
>         for i in xrange(NUM_SIM):
>             ...
>
> I don't believe that this will help too much since you are still
> writing every element individually.
>
>>
>> If I had to write out the results only at the end, it would mean
>> finding a way to share the 1,200 "objects" matrices across all the
>> parallel processes (and I am not sure whether PyTables is going to
>> complain when multiple concurrent processes try to access the same
>> underlying HDF5 file).
>
> Reading in parallel works pretty well. Writing causes more headaches
> but can be done.
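Just to check that I am reading the "fewer, bigger writes" advice
correctly, here is a minimal sketch of the two patterns as I understand
them. The array name and sizes are assumptions loosely matching my
script, and compute_row() is a made-up placeholder, not a real function:

    import numpy
    import tables

    NUM_SIM, NUM_DATES = 10, 600   # sizes assumed from my script

    h5file = tables.openFile("simulations.h5", mode="w")
    results = h5file.createCArray(h5file.root, "results",
                                  tables.Float64Atom(),
                                  shape=(NUM_SIM, NUM_DATES, 7))

    # Slow: one tiny write per (simulation, date) pair - the doubly
    # nested loop that maximizes HDF5 overhead.
    # for i in xrange(NUM_SIM):
    #     for j in xrange(NUM_DATES):
    #         results[i, j, :] = compute_row(i, j)   # hypothetical

    # Fast: build the whole block in memory, then write it with a
    # single sliced assignment.
    block = numpy.random.random(size=(NUM_SIM, NUM_DATES, 7))
    results[:, :, :] = block

    h5file.close()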
>>
>> Or I could create one HDF5 file per process, but given the nature of
>> the simulation I am running, every "object" in the 1,200-object pool
>> would need to keep a reference to a 125x600x7 matrix (assuming 1,000
>> simulations and 8 processes) around in memory *OR* I would need to
>> write the results to the HDF5 file for every simulation. Although we
>> have extremely powerful PCs at work, I am not sure this is the right
>> way to go...
>>
>> As always, I am open to all suggestions on how to improve my approach.
>
> My basic suggestion is to have all of your processes produce results
> which are then aggregated by a single master process. This master is
> the only one which has write access to the HDF5 file; this will allow
> you to create larger arrays and minimize the number of writes that
> you do.
>
> You'll probably want to take a look at this example:
> https://github.com/PyTables/PyTables/blob/develop/examples/multiprocess_access_queues.py
>
> I think that there might be a page in the docs about it now too...
>
> But I think that this is the strategy that you want to pursue:
> multiple compute processes, one write process.
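This (multiple compute processes, one writer) sounds doable. Before
touching the real code, I put together a minimal sketch of the pattern
as I understand it: workers push finished result blocks onto a
multiprocessing.Queue, and only the master process ever opens the file.
All names and sizes here are made up for the example:

    import multiprocessing
    import numpy
    import tables

    NUM_PROCS, SIMS_PER_PROC = 4, 25   # made-up workload split

    def worker(proc_id, queue):
        # Compute processes never open the HDF5 file; they only ship
        # finished result blocks back to the writer via the queue.
        block = numpy.random.random(size=(SIMS_PER_PROC, 600, 7))
        queue.put((proc_id, block))

    if __name__ == "__main__":
        queue = multiprocessing.Queue()
        procs = [multiprocessing.Process(target=worker, args=(i, queue))
                 for i in xrange(NUM_PROCS)]
        for proc in procs:
            proc.start()

        # The master is the only process with write access to the file.
        h5file = tables.openFile("simulations.h5", mode="w")
        results = h5file.createCArray(h5file.root, "results",
                                      tables.Float64Atom(),
                                      shape=(NUM_PROCS * SIMS_PER_PROC, 600, 7))
        for _ in xrange(NUM_PROCS):
            proc_id, block = queue.get()
            start = proc_id * SIMS_PER_PROC
            results[start:start + SIMS_PER_PROC] = block  # one big write per worker

        for proc in procs:
            proc.join()
        h5file.close()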
Thank you for all your suggestions. I managed to slightly modify the
script you attached, and I am also experimenting with compression.
However, in the newly attached script the underlying table is not
modified; i.e., this assignment:

    for p in table:
        p['results'][:NUM_SIM, :, :] = numpy.random.random(size=(NUM_SIM, len(ALL_DATES), 7))

    table.flush()

seems to be doing nothing (i.e., printing out the 'results' attribute
for an object class prints a matrix full of zeros instead of random
numbers...). Also, on my PC at work, the file creation time is
tremendously slow (76 seconds for 100 simulations - a 1.9 GB file).

In order to understand what's going on, I set the number of simulations
back to 10 (NUM_SIM=10), but I am still getting only zeros out of the
table. This is what my script is printing out:

    H5 file creation time: 7.652

    Saving results for table: 1.03400015831

    Results (should be random...)

    Object name   : KB0001
    Object results:
    [[[ 0.  0.  0. ...,  0.  0.  0.]
      [ 0.  0.  0. ...,  0.  0.  0.]
      ...,
      [ 0.  0.  0. ...,  0.  0.  0.]]

     ...,

     [[ 0.  0.  0. ...,  0.  0.  0.]
      [ 0.  0.  0. ...,  0.  0.  0.]
      ...,
      [ 0.  0.  0. ...,  0.  0.  0.]]]

I am on Windows Vista, Python 2.7.2 64-bit from EPD 7.1-2, PyTables
version '2.3b1.devpro'.

Any suggestion is really appreciated. Thank you in advance.

Andrea.

"Imagination Is The Only Weapon In The War Against Reality."
http://www.infinity77.net

# ------------------------------------------------------------- #
def ask_mailing_list_support(email):

    if mention_platform_and_version() and include_sample_app():
        send_message(email)
    else:
        install_malware()
        erase_hard_drives()
# ------------------------------------------------------------- #
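P.S. While writing this up I started wondering whether the problem is
that p['results'] returns a copy of the row's data rather than a view,
so that modifying a slice of it never touches the table. If that is
right, the fix would be to assign the whole array back to the row and
call Row.update(). A minimal sketch of what I mean (untested, and I may
well be misreading the API):

    for p in table:
        new_block = numpy.random.random(size=(NUM_SIM, len(ALL_DATES), 7))
        results = p['results']              # a copy of the field, not a view
        results[:NUM_SIM, :, :] = new_block
        p['results'] = results              # assign the modified array back
        p.update()                          # persist the change to this row
    table.flush()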
[Attachment: pytables_test2.py]