On 08/03/2013, at 2:51 AM, Anthony Scopatz wrote:

> Hey Tim, 
> 
> Awesome dataset! And neat image!
> 
> As per your request, a couple of minor things I noticed:
> 
> - You probably don't need to do the sanity check every time (great for 
>   debugging, but not always needed).
> - You are using masked arrays, which, while sometimes convenient, are 
>   generally slower than creating an array and a mask separately and 
>   applying the mask yourself (see the sketch below).
> - You seem to be downcasting from float64 to float32 for some reason 
>   that I am not entirely clear on (size? speed?).
> 
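> On the masked-array point, a minimal sketch of the array-plus-mask 
> approach (the data here is made up):
> 
>     import numpy as np
> 
>     data = np.random.rand(10)      # stand-in for one day of SST values
>     mask = data < 0.5              # e.g. a quality mask
>     good = data[mask]              # plain ndarray, no np.ma overhead
> 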
> To the more major question of write performance, one thing that you could try 
> is compression.  You might want to do some timing studies to find the best 
> compressor and level. Performance here can vary a lot based on how similar 
> your data is (and how close similar values are to each other).  If you have 
> a bunch of zeros and only a few real data points, even zlib at level 1 is 
> going to be blazing fast compared to writing all those zeros out explicitly.  
> 
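> For instance (file name, node name, and shape are all made up):
> 
>     import tables
> 
>     # zlib at level 1; 'blosc' and 'lzo' are worth timing as well
>     filters = tables.Filters(complevel=1, complib='zlib')
>     h5 = tables.open_file('sst.h5', mode='w')
>     sst = h5.create_carray(h5.root, 'sst', tables.Float32Atom(),
>                            shape=(365, 180, 360), filters=filters)
>     h5.close()
> 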
> Another thing you could try is switching to EArray and using the append() 
> method.  This might save PyTables, numpy, HDF5, etc. from having to check 
> that the shape of "sst_node[qual_indices]" is actually the same as the 
> data you are giving it.  Additionally, dumping a block of memory to the 
> file directly (via append()) is generally faster than having to resolve 
> fancy indexing (which is notoriously slow even in numpy).
> 
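> A minimal sketch (again, names and shape made up):
> 
>     import numpy as np
>     import tables
> 
>     h5 = tables.open_file('sst.h5', mode='w')
>     # a 0 in the first dimension makes the array extendable along it
>     sst = h5.create_earray(h5.root, 'sst', tables.Float32Atom(),
>                            shape=(0, 180, 360))
>     one_day = np.zeros((1, 180, 360), dtype=np.float32)
>     sst.append(one_day)            # one contiguous block write
>     h5.close()
> 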
> Lastly, as a general comment, you seem to be doing a lot of stuff in the 
> innermost loop -- including writing to disk.  I would look at how you could 
> restructure this to move as much as possible out of this loop.  Your data 
> seems to be about 12 GB for a year, so this is probably too big to build up 
> the full sst array completely in memory prior to writing.  That is, unless 
> you have a computer much bigger than my laptop ;).  But issuing one fat write 
> command is probably going to be faster than making 365 of them.  
> 
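> Something along these lines, reusing the sst EArray from the sketch 
> above (read_one_day is a stand-in for however you load each day):
> 
>     import numpy as np
> 
>     def read_one_day(day):         # stub for the real per-day reader
>         return np.zeros((1, 180, 360), np.float32)
> 
>     batch = []
>     for day in range(365):
>         batch.append(read_one_day(day))
>         if len(batch) == 30:       # ~monthly writes instead of 365 daily ones
>             sst.append(np.concatenate(batch))
>             batch = []
>     if batch:
>         sst.append(np.concatenate(batch))
> 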
> Happy hacking!
> Be Well
> Anthony
> 


Thanks, Anthony, for being so responsive and for touching on a number of points.

The netCDF library gives me a masked array, so I have to explicitly transform 
that into a regular numpy array. I've looked under the covers and seen that 
the numpy.ma masked-array implementation is all pure Python, so there is a 
performance drawback. I'm not up to speed yet on where the numpy.na masking 
implementation stands (I've just started a new job here).
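
Roughly what I'm doing to peel the masked array apart up front (file and 
variable names made up):

    from netCDF4 import Dataset
    import numpy as np

    nc = Dataset('sst_day001.nc')
    sst_ma = nc.variables['sst'][0]       # netCDF4 hands back a masked array
    mask = np.ma.getmaskarray(sst_ma)     # the mask as a boolean ndarray
    sst = np.ma.filled(sst_ma, np.nan)    # plain ndarray, filled under the mask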

I tried to do an implementation in memory (everything except the final write) 
and found that I end up with about 2 GB of indices when I extract the quality 
indices. Simply using those indices, memory usage grows to over 64 GB and I 
eventually run out of memory and start churning away in swap.

For the moment, I have pulled down the latest git master and am using the new 
in-memory HDF5 feature. This seems to give me better performance and is 
code-wise pretty simple, so for now it's good enough.
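
For reference, this is roughly all it takes (file name made up):

    import tables

    # the CORE driver keeps the whole file in memory and flushes it to
    # disk in one go when the file is closed
    h5 = tables.open_file('sst.h5', mode='w', driver='H5FD_CORE')
    # ... create and fill arrays as usual ...
    h5.close()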

Cheers and thanks again, Tim

BTW, I watched your SciPy tutorial. Good stuff!


