On 08/03/2013, at 2:51 AM, Anthony Scopatz wrote:

> Hey Tim,
>
> Awesome dataset! And neat image!
>
> As per your request, a couple of minor things I noticed were that you
> probably don't need to do the sanity check each time (great for debugging,
> but not needed always), you are using masked arrays which, while sometimes
> convenient, are generally slower than creating an array, a mask, and applying
> the mask to the array, and you seem to be downcasting from float64 to float32
> for some reason that I am not entirely clear on (size, speed?).
>
> To the more major question of write performance, one thing that you could try
> is compression. You might want to do some timing studies to find the best
> compressor and level. Performance here can vary a lot based on how similar
> your data is (and how close similar data is to each other). If you have got
> a bunch of zeros and only a few real data points, even zlib 1 is going to be
> blazing fast compared to writing all those zeros out explicitly.
>
> Another thing you could try is switching to EArray and using the
> append() method. This might save PyTables, numpy, hdf5, etc. from having to
> check that the shape of "sst_node[qual_indices]" is actually the same as the
> data you are giving it. Additionally, dumping a block of memory to the file
> directly (via append()) is generally faster than having to resolve fancy
> indexes (which are notoriously the slow part of even numpy).
>
> Lastly, as a general comment, you seem to be doing a lot of stuff in the
> innermost loop -- including writing to disk. I would look at how you could
> restructure this to move as much as possible out of this loop. Your data
> seems to be about 12 GB for a year, so this is probably too big to build up
> the full sst array completely in memory prior to writing. That is, unless
> you have a computer much bigger than my laptop ;). But issuing one fat write
> command is probably going to be faster than making 365 of them.
>
> Happy hacking!
> Be Well
> Anthony
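[Editor's note: below is a minimal sketch of the EArray-plus-compression approach Anthony
suggests, appending one whole day per call instead of fancy-indexed assignments. The grid
dimensions, node name, and the daily_sst() helper are hypothetical stand-ins for the real
data pipeline, and the compressor/level are just a starting point for the timing studies
he mentions.]

    import numpy as np
    import tables as tb

    NLAT, NLON = 4320, 8640                    # hypothetical grid size

    def daily_sst(day):
        # hypothetical helper: would read one day's SST grid from netCDF,
        # fill the masked values, and return a (NLAT, NLON) float32 array
        return np.zeros((NLAT, NLON), dtype=np.float32)

    with tb.open_file("sst_year.h5", mode="w") as h5:
        filters = tb.Filters(complevel=1, complib="zlib")    # worth timing other levels/libs
        sst = h5.create_earray(h5.root, "sst",
                               atom=tb.Float32Atom(),
                               shape=(0, NLAT, NLON),        # extendable along the day axis
                               filters=filters,
                               expectedrows=365)
        for day in range(365):
            data = daily_sst(day)
            sst.append(data[np.newaxis, ...])                # one contiguous block per day

With compression enabled, mostly-empty grids like the placeholder above compress to almost
nothing; real data is where benchmarking compressor and level pays off.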
Thanks Anthony for being so responsive and for touching on a number of points.

The netCDF library gives me a masked array, so I have to explicitly transform that into a
regular numpy array. I've looked under the covers and have seen that the ma masked-array
implementation is all pure Python, so there is a performance penalty. I'm not up to speed
yet on where the numpy.na masking implementation is (started a new job here).

I tried to do an implementation in memory (except for the final write) and found that I
have about 2 GB of indices when I extract the quality indices. Simply using those indices,
memory usage grows to over 64 GB and I eventually run out of memory and start churning
away in swap.

For the moment, I have pulled down the latest git master and am using the new in-memory
HDF feature (sketched below). This seems to give me better performance and is code-wise
pretty simple, so for the moment it's good enough.

Cheers and thanks again,
Tim

BTW I viewed your SciPy tutorial. Good stuff!
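[Editor's note: here is a minimal sketch of the in-memory approach Tim describes, assuming
the H5FD_CORE driver support in the PyTables development branch, where the HDF5 file image
is built in RAM and only written to disk when the file is closed. The node name, grid size,
and daily_sst() helper are again hypothetical.]

    import numpy as np
    import tables as tb

    NLAT, NLON = 4320, 8640                    # hypothetical grid size

    def daily_sst(day):
        # hypothetical helper standing in for the real netCDF read + quality masking
        return np.zeros((NLAT, NLON), dtype=np.float32)

    with tb.open_file("sst_year.h5", mode="w",
                      driver="H5FD_CORE",                  # build the HDF5 file in memory
                      driver_core_backing_store=1) as h5:  # flush the image to disk on close
        sst = h5.create_carray(h5.root, "sst",
                               atom=tb.Float32Atom(),
                               shape=(365, NLAT, NLON))
        for day in range(365):
            sst[day, :, :] = daily_sst(day)    # writes hit the in-memory image, not the disk
    # on exit, the whole file image is written to sst_year.h5 in one pass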