Hey Tim,

Awesome dataset! And neat image!

As per your request, a few minor things I noticed: you probably don't need
to run the sanity check on every iteration (great for debugging, but not
needed all the time); you are using masked arrays, which, while sometimes
convenient, are generally slower than creating a plain array plus a
separate mask and applying the mask yourself; and you seem to be
downcasting from float64 to float32 for a reason I'm not entirely clear on
(size? speed?).
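
For the masked array point, here is a rough sketch of what I mean, assuming
kelvin64 is the masked array you get back from the netCDF read (and keeping
float64 unless you really need the smaller footprint):

    import numpy as np

    # pull the raw data and the boolean mask out of the masked array once,
    # then work with plain ndarrays from there on
    sst = kelvin64.data.astype(np.float64) - 273.15   # plain ndarray copy
    bad = np.ma.getmaskarray(kelvin64)                # plain boolean mask
    sst[bad] = np.nan                                 # apply the mask explicitly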

To the more major question of write performance, one thing you could try is
compression: http://pytables.github.com/usersguide/optimization.html#compression-issues
You might want to do some timing studies to find the best compressor and
level.  Performance here can vary a lot depending on how similar your data
is (and how close similar values sit to each other).  If your chunks are
mostly fill values with only a few real data points, even zlib at level 1
is going to be blazing fast compared to writing all of that out explicitly.
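
For example, here is a minimal sketch of plugging a Filters instance into
your existing createCArray call (zlib level 1 shown, but blosc and lzo are
worth timing as well):

    import tables

    # shuffle + light zlib: cheap to try, and often a big win on sparse data
    filters = tables.Filters(complevel=1, complib='zlib', shuffle=True)
    sst_node = h5f.createCArray(h5f.root, 'sst', atom, shape,
                                chunkshape=(1, 1080, 240), filters=filters)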

Another thing you could try is switching to EArray and using the append()
method.  This might save PyTables, numpy, hdf5, etc. from having to check
that the shape of "sst_node[qual_indices]" actually matches the data you
are giving it.  Additionally, dumping a block of memory to the file
directly (via append()) is generally faster than resolving fancy indexing,
which is notoriously the slow part even in plain numpy.
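
Something along these lines -- a sketch only, where day_slab is a buffer
name I've made up, and I'm assuming the days arrive in order (your
files.sort() takes care of that):

    import numpy as np
    import tables

    # extendable along the day axis (length 0 in the first dimension)
    atom = tables.Float32Atom(dflt=np.nan)
    sst_node = h5f.createEArray(h5f.root, 'sst', atom, (0, 4320, 8640),
                                chunkshape=(1, 1080, 240),
                                expectedrows=NUMDAYS)

    # inside the day loop: build the whole day slice in memory, then dump it
    day_slab = np.empty((1, 4320, 8640), dtype=np.float32)
    day_slab.fill(np.nan)
    day_slab[0, qual_indices[1], qual_indices[2]] = \
        sst[0, qual_indices[1], qual_indices[2]]
    sst_node.append(day_slab)

With this approach you also wouldn't need to overwrite qual_indices[0] with
the day number, since the day is implied by the append order.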

Lastly, as a general comment, you seem to be doing a lot of work in the
innermost loop -- including the writes to disk.  I would look at how you
could restructure things to move as much as possible out of that loop.
Your full sst array works out to roughly 365 x 4320 x 8640 x 4 bytes, or
about 54 GB, so it is probably too big to build up completely in memory
prior to writing -- that would be tight even on your 64 GB machine, and it
is certainly bigger than my laptop ;).  But issuing a few fat write
commands is probably going to be faster than making 365 small ones, so
buffering a batch of days and appending them in one go is a reasonable
middle ground.
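
Roughly, and again just a hypothetical sketch: build_day_slab stands in for
your per-file processing (read, mask, convert to a (1, 4320, 8640) float32
slab), and it assumes the EArray/append() setup from above.  Thirty days of
float32 at 4320 x 8640 is a bit over 4 GB, so tune the batch size to taste:

    import numpy as np

    BATCH = 30   # ~30 x 4320 x 8640 x 4 bytes, a bit over 4 GB in memory
    buffered = []
    for filename in files:
        buffered.append(build_day_slab(filename))  # hypothetical helper
        if len(buffered) == BATCH:
            # one fat append instead of 30 separate writes
            sst_node.append(np.concatenate(buffered))
            buffered = []
    if buffered:
        sst_node.append(np.concatenate(buffered))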

Happy hacking!
Be Well
Anthony


On Wed, Mar 6, 2013 at 11:25 PM, Tim Burgess <timburg...@mac.com> wrote:

> I'm producing a large chunked HDF5 file using CArray and want to check
> that the performance I'm getting is what would normally be expected.
>
> The source data is a large annual satellite dataset - 365 days x 4320
> latitude by 8640 longitude of 32-bit floats. I'm only interested in pixels
> of a certain quality so I am iterating over the source data (which is in
> daily files) and then determining the indices of all quality pixels in that
> day. There are usually about 2 million quality pixels in a day.
>
> I then set the equivalent CArray locations to the value of the quality
> pixels. As you can see in the code below, the source numpy array is a 1 x
> 4320 x 8640. So for addressing the CArray, I simply take the first index
> and set it to the current day to map indices to the 365 x 4320 x 8640
> CArray.
>
> I've tried a couple of different chunkshapes. As I will be reading the HDF
> sequentially day by day and as the data comes from a polar-orbit, I'm using
> a 1 x 1080 x 240 chunk to try and optimize for chunks that will have no
> data (and therefore reduce the total filesize). You can see an image of an
> example day at
>
>
> http://data.nodc.noaa.gov/pathfinder/Version5.2/browse_images/2011/sea_surface_temperature/20110101001735-NODC-L3C_GHRSST-SSTskin-AVHRR_Pathfinder-PFV5.2_NOAA19_G_2011001_night-v02.0-fv01.0-sea_surface_temperature.png
>
>
> Producing a single day takes about 2.5 minutes on a Linux (Ubuntu 12.04)
> machine with two SSDs in RAID 0. The system has 64GB of RAM but I don't
> think memory is a constraint here.
> Looking at a profile, most of that 2.5 minutes is spent in _g_writeCoords
> in tables.hdf5Extension.Array.
>
> Here's the pertinent code:
>
>     for year in range(2011, 2012):
>
>         # create dataset and add global attrs
>         annualfile_path = ('%sPF4km/V5.2/hdf/annual/PF52-%d-c1080x240-test.h5'
>                            % (crwdir, year))
>         print 'Creating ' + annualfile_path
>
>         with tables.openFile(annualfile_path, 'w',
>                              title=('Pathfinder V5.2 %d' % year)) as h5f:
>
>             # write lat lons
>             lat_node = h5f.createArray('/', 'lat', lats, title='latitude')
>             lon_node = h5f.createArray('/', 'lon', lons, title='longitude')
>
>             # glob all the region summaries in a year
>             files = [glob.glob('%sPF4km/V5.2/%d/*night*' % (crwdir, year))[0]]
>             print 'Found %d days' % len(files)
>             files.sort()
>
>             # create a 365 x 4320 x 8640 array
>             shape = (NUMDAYS, 4320, 8640)
>             atom = tables.Float32Atom(dflt=np.nan)
>             # we chunk into daily slices and then further chunk days
>             sst_node = h5f.createCArray(h5f.root, 'sst', atom, shape,
>                                         chunkshape=(1, 1080, 240))
>
>             for filename in files:
>
>                 # get day
>                 day = int(filename[-25:-22])
>                 print 'Processing %d day %d' % (year, day)
>
>                 ds = Dataset(filename)
>                 kelvin64 = ds.variables['sea_surface_temperature'][:]
>                 qual = ds.variables['quality_level'][:]
>                 ds.close()
>                 # convert sst to single precision with nan as mask
>                 kelvin32 = np.ma.filled(kelvin64, fill_value=np.nan).astype(np.float32)
>                 sst = kelvin32 - 273.15
>
>                 # find all quality >= 4 locations
>                 qual_indices = np.where(np.ma.filled(qual) >= 4)
>                 print 'Found %d quality pixels' % len(qual_indices[0])
>
>                 # qual_indices is a 3-tuple of index arrays, so set the first
>                 # (day) index to the current day and write to sst_node
>                 qual_indices[0].flags.writeable = True
>                 qual_indices[0][:] = day
>
>                 sst_node[qual_indices] = sst[0, qual_indices[1], qual_indices[2]]
>
>                 # sanity check that max values are the same in sst_node as
>                 # source sst data
>                 print 'sst max %4.1f node max %4.1f' % (
>                     np.nanmax(sst[0, qual_indices[1], qual_indices[2]]),
>                     np.nanmax(sst_node[day]))
>
> I'd value any comments on this :-)
> Thanks,
>
> Tim Burgess
>
>
>