Re: [Pytables-users] Writing to CArray

Anthony Scopatz Mon, 11 Mar 2013 12:16:35 -0700

On Sun, Mar 10, 2013 at 8:47 PM, Tim Burgess <timburg...@mac.com> wrote:


>
> On 08/03/2013, at 2:51 AM, Anthony Scopatz wrote:
>
> > Hey Tim,
> >
> > Awesome dataset! And neat image!
> >
> > As per your request, here are a couple of minor things I noticed. You
> probably don't need to do the sanity check each time (great for debugging,
> but not always needed). You are using masked arrays, which, while sometimes
> convenient, are generally slower than creating an array and a mask and
> applying the mask to the array. And you seem to be downcasting from float64
> to float32 for some reason that I am not entirely clear on (size, speed?).
> >
> > To the more major question of write performance, one thing that you
> could try is compression.  You might want to do some timing studies to find
> the best compressor and level. Performance here can vary a lot based on how
> similar your data is (and how close together the similar values are).  If you
> have got a bunch of zeros and only a few real data points, even zlib 1 is
> going to be blazing fast compared to writing all those zeros out explicitly.
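
A minimal sketch of the compression suggestion above, assuming the PyTables 3.x method names (open_file, create_carray). The file name, node name, and grid shape are hypothetical stand-ins, not taken from Tim's script. It creates a CArray with zlib level 1 and writes one mostly-zero day into it:

```python
import os
import tempfile

import numpy as np
import tables

# Light, fast compression; complevel and complib are the knobs to time.
filters = tables.Filters(complevel=1, complib="zlib")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sst_demo.h5")  # hypothetical file name
    with tables.open_file(path, mode="w") as h5:
        sst = h5.create_carray(
            h5.root, "sst",
            atom=tables.Float32Atom(),
            shape=(4, 180, 360),  # hypothetical (day, lat, lon) grid
            filters=filters,
        )

        # One day that is almost entirely zeros, with a single real value.
        day = np.zeros((180, 360), dtype=np.float32)
        day[90, 180] = 287.5
        sst[0, :, :] = day

        first_day = sst[0, :, :]
        complib = sst.filters.complib
        complevel = sst.filters.complevel
```

Swapping complib for another compressor available in your build (for example "blosc") and varying complevel inside a timing loop would be one way to run the timing study mentioned above.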
> >
> > Another thing you could try doing is switching to EArray and using the
> append() method.  This might save PyTables, numpy, hdf5, etc from having to
> check that the shape of "sst_node[qual_indices]" is actually the same as
> the data you are giving it.  Additionally dumping a block of memory to the
> file directly (via append()) is generally faster than having to resolve
> fancy indexes (which are notoriously the slow part of even numpy).
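
A rough sketch of the EArray idea, again assuming PyTables 3.x naming and made-up file, node, and shape values. It assumes each day's grid can be fully assembled in memory before it is appended, which may or may not fit the original workflow:

```python
import os
import tempfile

import numpy as np
import tables

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sst_append_demo.h5")  # hypothetical file name
    with tables.open_file(path, mode="w") as h5:
        # The 0 in the shape marks the extendable (time) dimension.
        sst = h5.create_earray(
            h5.root, "sst",
            atom=tables.Float32Atom(),
            shape=(0, 180, 360),  # hypothetical (day, lat, lon) grid
            expectedrows=365,
        )

        for day_index in range(3):
            # Stand-in for one processed day; the leading axis of length 1
            # is the row being appended.
            day = np.full((1, 180, 360), float(day_index), dtype=np.float32)
            sst.append(day)

        n_days = int(sst.nrows)
        last_value = float(sst[2, 0, 0])
```

Each append() hands PyTables a contiguous block whose trailing dimensions already match the array, so no fancy index has to be resolved on the write path.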
> >
> > Lastly, as a general comment, you seem to be doing a lot of stuff in the
> innermost loop -- including writing to disk.  I would look at how you
> could restructure this to move as much as possible out of this loop.  Your
> data seems to be about 12 GB for a year, so this is probably too big to
> build up the full sst array completely in memory prior to writing.  That
> is, unless you have a computer much bigger than my laptop ;).  But issuing
> one fat write command is probably going to be faster than making 365 of
> them.
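
One possible middle ground between 365 small writes and a single 12 GB write is to buffer a block of days in memory and write each block with one slice assignment. The sketch below uses tiny, hypothetical sizes and a placeholder for the per-day processing; BLOCK would be tuned to the memory actually available:

```python
import os
import tempfile

import numpy as np
import tables

N_DAYS, NY, NX = 8, 4, 5  # tiny hypothetical sizes
BLOCK = 4                 # days buffered per write

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sst_block_demo.h5")  # hypothetical file name
    with tables.open_file(path, mode="w") as h5:
        sst = h5.create_carray(
            h5.root, "sst",
            atom=tables.Float32Atom(),
            shape=(N_DAYS, NY, NX),
        )

        buffer = np.empty((BLOCK, NY, NX), dtype=np.float32)
        writes = 0
        for start in range(0, N_DAYS, BLOCK):
            stop = min(start + BLOCK, N_DAYS)
            for offset, day in enumerate(range(start, stop)):
                buffer[offset] = day  # placeholder for processing one day
            # One contiguous write per block instead of one per day.
            sst[start:stop, :, :] = buffer[:stop - start]
            writes += 1

        total = float(sst[:, :, :].sum())
```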
> >
> > Happy hacking!
> > Be Well
> > Anthony
> >
>
>
> Thanks Anthony for being so responsive and touching on a number of points.
>
> The netCDF library gives me a masked array so I have to explicitly
> transform that into a regular numpy array.
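
For reference, a small NumPy-only sketch of splitting a masked array into a plain data array plus a boolean mask. The values and the -999.0 fill marker are invented for illustration; a real netCDF variable would supply its own:

```python
import numpy as np

# Hypothetical stand-in for one grid as returned by a netCDF reader.
raw = np.array([[271.5, -999.0], [280.25, 285.0]], dtype=np.float64)
fill_value = -999.0  # assumed missing-data marker

masked = np.ma.masked_equal(raw, fill_value)

# Split once into plain ndarrays, then work with those.
data = np.ma.getdata(masked)       # underlying values as an ndarray
mask = np.ma.getmaskarray(masked)  # boolean ndarray, True where masked

# Plain boolean indexing gives the same result as the masked mean.
valid_mean = data[~mask].mean()
```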


Ahh interesting.  Depending on the netCDF version the file was made with,
you should be able to read the file directly from PyTables.  You could thus
directly get a normal numpy array.  This *should* be possible, but I have
never tried it ;)


> I've looked under the covers and have seen that the ma masked
> implementation is all pure Python and so there is a performance drawback.
> I'm not up to speed yet on where the numpy.na masking implementation is
> (started a new job here).
>
> I tried to do an implementation in memory (except for the final write) and
> found that I have about 2GB of indices when I extract the quality indices.
> Simply using those indexes, memory usage grows to over 64GB and I
> eventually run out of memory and start churning away in swap.
>
> For the moment, I have pulled down the latest git master and am using the
> new in-memory HDF feature. This seems to give me better performance and is
> code-wise pretty simple so for the moment, it's good enough.
>

Awesome! I am glad that this is working for you.
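
For anyone reading along, a minimal sketch of what the in-memory feature looks like, assuming the PyTables 3.x open_file signature with the HDF5 core driver. The file name and array contents are placeholders:

```python
import numpy as np
import tables

# driver="H5FD_CORE" keeps the whole HDF5 file in memory.
# driver_core_backing_store=0 means nothing is written to disk on close;
# setting it to 1 would save the image to the named file instead.
h5 = tables.open_file(
    "in_memory_demo.h5",  # placeholder name
    mode="w",
    driver="H5FD_CORE",
    driver_core_backing_store=0,
)

arr = h5.create_array(h5.root, "values", np.arange(6, dtype=np.float32))
arr[2] = 42.0  # updates happen in memory only
values = arr.read()

h5.close()
```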


> Cheers and thanks again, Tim
>
> BTW I viewed your SciPy tutorial. Good stuff!
>

Thanks!


>
>
>
>
> ------------------------------------------------------------------------------
> Symantec Endpoint Protection 12 positioned as A LEADER in The Forrester
> Wave(TM): Endpoint Security, Q1 2013 and "remains a good choice" in the
> endpoint security space. For insight on selecting the right partner to
> tackle endpoint security challenges, access the full report.
> http://p.sf.net/sfu/symantec-dev2dev
> _______________________________________________
> Pytables-users mailing list
> Pytables-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/pytables-users
>
