Hi,
Here are some patches Carabos might consider for PyTables 1.4. I've
developed these against PyTables 1.4alpha (20060505) to assist in
experimenting with different HDF chunk sizes and read buffer strategies
for analysing very large datasets. My main data set comes in 270MB
monthly files with 61 fields, 244 bytes/row, 4,135,482 rows, which is
962MB uncompressed and 272MB compressed with LZO. By trying different
file storage strategies (uncompressed, zlib, LZO), different HDF chunk
sizes (32, 64, 128, 256, 512 rows), different read buffer sizes (1024 up
to 256k rows) and different numbers of HDF chunks to read at once
(4, 16, 64), I've been able to increase row read speeds from 250k rows/sec
up to 900k rows/sec, and can process data at sustained rates of over 500k
rows/sec on my year-old Thinkpad (i.e. it's not very fast, no Core Duo,
etc.). The most successful approach seems to be optimising how PyTables'
read buffers and HDF file chunking interact with WinXP's disk caching,
and then, once the data is in memory, processing it with numarray
functions to minimise the number of intermediate variables Python has to
create and destroy.
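To give a flavour of what I mean by the numarray side, here is a minimal
sketch of the per-buffer processing (the field names are invented for
illustration, not taken from my real data):
------ example: vectorised processing of one buffer ------
import numarray

def monthly_total(buf):
    # buf is one RecArray block as yielded by the iterator in Change #1.
    # 'price' and 'qty' are hypothetical field names.
    price = buf.field('price')          # views into the buffer, no copies
    qty = buf.field('qty')
    # One vectorised pass instead of a Python loop over 4 million rows,
    # so very few intermediate Python objects get created and destroyed.
    return numarray.sum(price * qty)
----------------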
The changes described below involve both Python and Pyrex files. I'm no
C programmer (the last time I'd compiled a line of C code was 10 years
ago), and even though I've always been too scared to build Python source
distributions on Windows, it took just 2 hours to download Pyrex,
Mingw32, the sources for HDF, LZO, ZLIB and PyTables, make my code
changes to PyTables' files, recompile PyTables into a binary installer,
reinstall PyTables, and have the test scripts run successfully. The
whole process was ridiculously easy and, to my great surprise, everything
worked first time (after correcting a "redefine ssize_t" error in HDF).
Now that I know what I am doing, it takes less than 30 seconds to recompile
PyTables and reinstall. So many thanks to Greg Ewing for Pyrex and the
Carabos team for PyTables, which together make such a powerful and
simple set of tools!
Cheers
Stephen
============================================
Change #1: An iterator that returns a table's data into a numarray
RecArray buffer, reusing the same buffer for each call
This eliminates the duplication of creating a buffer to read in data and
then copying it somewhere else to analyse. By simply passing the
destination numarray array to _read_records(), the data is read straight
from disk into the numarray array.
This patch defines a new iterator, Table.iterbuffer(), in Table.py,
which yields the same numarray array on each iteration, filled with a
succession of slices from the on-disk table. A patch is also needed to
_read_records() in TableExtension.pyx so that it honours a buffer offset
in the destination array.
------ new function Table.iterbuffer() in Table.py ------
def iterbuffer(table, buffer_rows=32768, read_rows=4096):
    """
    High-speed iteration over a table in blocks of buffer_rows at a time.
    This initially creates a numarray RecArray buffer of length buffer_rows
    and returns it filled with new data on each iteration.
    Reading from disk seems to be faster if the buffer is filled in smaller
    chunks (perhaps it makes the OS's caching more predictable), so rows are
    read read_rows at a time until the buffer is full or the data is finished.
    # SS 2006-0506
    """
    # Work out sensible values for the buffer and the sub-buffer reads
    if not buffer_rows or buffer_rows <= 0 or not read_rows or read_rows <= 0:
        raise StopIteration
    if read_rows > buffer_rows:
        read_rows = buffer_rows
    buffer = numarray.records.array(None, shape=buffer_rows,
                                    formats=table.description._v_nestedFormats,
                                    names=table.description._v_nestedNames)
    total_rows = table.nrows
    tpos = 0  # Current position in disk table
    bpos = 0  # Current position in memory buffer
    while 1:
        if bpos + read_rows > buffer_rows:
            # No room for another chunk, so return buffer data
            yield buffer[:bpos]
            bpos = 0
        # Read another read_rows rows of data into the buffer
        # NB: Passing the buffer offset by bpos only works if
        # TableExtension.pyx is patched to recognise the buffer's offset.
        # Otherwise _read_records ignores the offset and reads all data
        # into buffer[0:read_rows]
        num_read = table._read_records(tpos, read_rows, buffer[bpos:])
        tpos += num_read
        bpos += num_read
        if num_read < read_rows or tpos >= total_rows:
            # At the end of the file so return any data left in the buffer
            yield buffer[:bpos]
            raise StopIteration
----------------
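For context, here is roughly how I drive the iterator from a script (the
file and field names are placeholders):
------ example: using Table.iterbuffer() ------
import numarray
import tables

h5file = tables.openFile("monthly.h5", mode="r")   # hypothetical file
table = h5file.root.data                           # hypothetical node

total = 0.0
for buf in table.iterbuffer(buffer_rows=131072, read_rows=8192):
    # buf is the same RecArray object each time; copy anything you
    # need to keep before the next iteration overwrites it.
    total = total + numarray.sum(buf.field('price'))
h5file.close()
----------------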
The iterator is supported by a patch to _read_records() to consider
whether the read buffer has an offset. Any offset is currently ignored,
so trying to read data into buffer[100:200] will actually write it into
buffer[0:100]. This is easily fixed by adding one line of code:
------ _read_records() in TableExtension.pyx ------
def _read_records(self, hsize_t start, hsize_t nrecords, object recarr):
    cdef long buflen
    cdef void *rbuf
    cdef int ret

    # Correct the number of records to read, if needed
    if (start + nrecords) > self.totalrecords:
        nrecords = self.totalrecords - start
    # Get the pointer to the buffer data area
    buflen = NA_getBufferPtrAndSize(recarr._data, 1, &rbuf)
    # SS 2006-0506 - Correct the offset
    rbuf = <void *>(<char *>rbuf + recarr._byteoffset)
    # Read the records from disk
    Py_BEGIN_ALLOW_THREADS
    ret = H5TBOread_records(self.dataset_id, self.type_id, start,
                            nrecords, rbuf)
    Py_END_ALLOW_THREADS
    if ret < 0:
        raise HDF5ExtError("Problems reading records.")
    # Convert some HDF5 types to Numarray after reading.
    self._convertTypes(recarr, nrecords, 1)
    return nrecords
-----------------
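For anyone wondering why that one line matters: slicing a numarray
RecArray returns a view onto the same memory, and the view records how
far into the shared buffer it starts in its _byteoffset attribute.
Without the correction, rbuf always points at the start of the
underlying buffer. A rough illustration (the single-column format string
is only illustrative and may need adjusting to numarray's exact format
spelling):
------ example: why the byte offset matters ------
import numarray.records

# A 1000-row, single-field RecArray (hypothetical layout, 4 bytes/row)
buf = numarray.records.array(None, shape=1000,
                             formats=['Int32'], names=['x'])
view = buf[100:]
# view shares buf's memory but starts 100 records in, so data destined
# for buf[100:200] has to be written at rbuf + view._byteoffset,
# not at rbuf itself.
print view._byteoffset   # 100 rows * 4 bytes/row = 400 (illustrative)
----------------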
============================================
Change #2: Read the HDF chunksize parameter when opening a file
PyTables currently uses a rough-and-ready heuristic, calcBufferSize() in
utils.py, to work out reasonably efficient sizes for the read buffer and
the HDF chunksize. In Table.py, both _g_create() and _g_open() have code
like:

    # Compute some values for buffering and I/O parameters
    (self._v_maxTuples, self._v_chunksize) = \
        calcBufferSize(self.rowsize, self._v_expectedrows)

so _v_chunksize is not actually read from the file.
I wanted more control over these parameters, so I modified _g_open() in
Table.py and _getInfo() in TableExtension.pyx to read self._v_chunksize
from the HDF file rather than assuming the value returned by
calcBufferSize() is appropriate.
------ _getInfo() in TableExtension.pyx ------
def _getInfo(self):
    "Get info from a table on disk."
    cdef hid_t space_id
    cdef size_t type_size
    cdef hsize_t dims[1]
    cdef hid_t plist
    cdef H5D_layout_t layout

    # Open the dataset
    self.dataset_id = H5Dopen(self.parent_id, self.name)
    <snip>
    # SS 2006-0506 Get chunksize in the file
    plist = H5Dget_create_plist(self.dataset_id)
    H5Pget_chunk(plist, 1, dims)
    self.chunksize = dims[0]
    self._v_chunksize = self.chunksize
    H5Pclose(plist)
    <snip>
---------------------------------------------
------ _g_open() in Table.py ------
def _g_open(self):
    """Opens a table from disk and reads its metadata.
    Creates a user description on the fly to ease access to
    the actual data.
    """
    # Get table info
    # SS 2006-0506 _getInfo now fills in self._v_chunksize, so the
    # assignment from calcBufferSize() further down has been taken out
    self._v_objectID, description = self._getInfo()
    <snip>
    # Compute buffer size
    # SS 2006-0506 Took out assignment to self._v_chunksize
    # as this is now done in _getInfo()
    (self._v_maxTuples, dummy) = \
        calcBufferSize(self.rowsize, self.nrows)
    <snip>
------------------------------------
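With both halves of this change in place, the chunk size reported on an
opened table reflects what is actually stored in the HDF5 file rather
than the calcBufferSize() guess. A quick (hypothetical) check from the
interactive prompt:
------ example: checking the chunksize of an existing file ------
import tables

h5file = tables.openFile("monthly.h5", mode="r")   # placeholder names
table = h5file.root.data
print table._v_chunksize   # chunk size (in rows) read via H5Pget_chunk()
print table._v_maxTuples   # read buffer size, still from calcBufferSize()
h5file.close()
----------------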
============================================
Change #3: Allow direct control over the HDF chunksize when creating a file
Finally, I needed a way to create HDF files with different chunksizes
than those selected by calcBufferSize(). This involves:
(i) modifying createTable(), openTable() and copyFile() in File.py to
accept keyword arguments for passing in a chunk_rows parameter;
(ii) modifying _g_create() and _g_copyWithStats() in Table.py to use a
chunk_rows keyword argument in place of the value supplied by
calcBufferSize();
(iii) adding **kwargs to the argument list and to the final (super) call
of __init__(), _g_copy() and _f_copy() in Group.py, Node.py and Leaf.py.
This enables the chunk_rows keyword in (i) to flow through to the
table-creating functions in (ii). I think I've added more of these
**kwargs than are strictly necessary, but everything seems to work just
fine.
------ createTable() in File.py ------
def createTable(self, where, name, description, title="",
                filters=None, expectedrows=10000,
                buffer_rows=None, chunk_rows=None,  # SS 2006-0506
                compress=None, complib=None):       # Deprecated
    """<snip>"""
    parentNode = self.getNode(where)  # Does the parent node exist?
    fprops = _checkFilters(filters, compress, complib)
    return Table(parentNode, name,
                 description=description, title=title,
                 filters=fprops, expectedrows=expectedrows,
                 buffer_rows=buffer_rows, chunk_rows=chunk_rows,  # SS 2006-0506
                 )
---------------------------------------
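As an example of how the patched createTable() gets called (the
description class and the numbers below are placeholders, not part of
the patch):
------ example: creating a table with an explicit chunksize ------
import tables

class Tick(tables.IsDescription):
    # Hypothetical row layout for illustration only
    price = tables.Float64Col()
    qty   = tables.Int32Col()

h5file = tables.openFile("monthly.h5", mode="w")
table = h5file.createTable("/", "data", Tick, title="ticks",
                           filters=tables.Filters(complevel=1, complib="lzo"),
                           expectedrows=4135482,
                           chunk_rows=256,      # HDF chunk size in rows
                           buffer_rows=65536)   # table buffer size in rows
h5file.close()
----------------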
------ Table.__init__() in Table.py ------
def __init__(self, parentNode, name,
             description=None, title="", filters=None,
             expectedrows=EXPECTED_ROWS_TABLE,
             log=True,
             **kwargs):  # SS 2006-0506
    """<snip>"""
    <snip>
    # SS 2006-0506 - Modified to read _v_maxTuples and _v_chunksize
    # from the buffer_rows and chunk_rows arguments to Table()
    self._v_maxTuples = kwargs.get('buffer_rows', None)
    """The number of rows that fit in the table buffer."""
    self._v_chunksize = kwargs.get('chunk_rows', None)
    """The HDF5 chunk size."""
    <snip>
-----------------------------------
------ _g_create() in Table.py ------
def _g_create(self):
    """Create a new table on disk."""
    <snip>
    # Compute some values for buffering and I/O parameters
    # SS 2006-0506 Only uses calcBufferSize if default
    # parameters not supplied to File.createTable()
    (calc_mt, calc_cs) = calcBufferSize(self.rowsize,
                                        self._v_expectedrows)
    if self._v_maxTuples is None:
        self._v_maxTuples = calc_mt
    if self._v_chunksize is None:
        self._v_chunksize = calc_cs
    <snip>
-------------------------------------
------ _g_copyWithStats() in Table.py ------
# SS 2006-0506 Added **kwargs
def _g_copyWithStats(self, group, name, start, stop, step,
                     title, filters, log, **kwargs):
    "Private part of Leaf.copy() for each kind of leaf"
    # Build the new Table object
    <snip>
    object = Table(
        group, name, description, title=title, filters=filters,
        expectedrows=self.nrows, log=log,
        **kwargs  # SS 2006-0506
        )
----------------------------------
------ _g_copy() in Leaf.py ------
# SS 2006-0506 - Added **kwargs to _g_copy()
def _g_copy(self, newParent, newName, recursive, log, **kwargs):
    # Compute default arguments.
    <snip>
    # Create a copy of the object.
    (newNode, bytes) = self._g_copyWithStats(newParent, newName,
                                             start, stop, step, title,
                                             filters, log,
                                             **kwargs)  # SS 2006-0506
    <snip>
----------------------------------
------ _g_copy() in Group.py ------
def _g_copy(self, newParent, newName, recursive, log, **kwargs):
    # Compute default arguments.
    <snip>
    # Create a copy of the object.
    # SS 2006-0506 - Add kwargs for passing parameters
    newNode = Group(newParent, newName,
                    title, new=True, filters=filters, log=log, **kwargs)
    <snip>
------------------------------------
------__init__() in Group.py ------
# SS 2006-0506 - Add **kwargs at start and end of __init__()
def __init__(self, parentNode, name,
             title="", new=False, filters=None,
             log=True, **kwargs):
    """Create the basic structures to keep group information.
    <snip>
    """
    <snip>
    # Finally, set up this object as a node.
    super(Group, self).__init__(parentNode, name, log, **kwargs)
-----------------------------------
------__init__() in Leaf.py ------
# SS 2006-0506 - Add **kwargs at start and end of __init__()
class Leaf(Node):
    <snip>
    def __init__(self, parentNode, name,
                 new=False, filters=None,
                 log=True,
                 **kwargs):  # SS 2006-0506
        <snip>
        super(Leaf, self).__init__(parentNode, name, log, **kwargs)
-----------------------------------
Node.__init__() complained about extra keyword arguments, so I added
**kwargs there as well. It doesn't actually do anything useful, so perhaps
some PyTables expert could work out where else I've got too many kwargs.
------ __init__() in Node.py ------
# SS 2006-0506 - Add **kwargs at start and end of __init__()
class Node(object):
    <snip>
    def __init__(self, parentNode, name, log=True, **kwargs):
        <snip>
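The point of all this **kwargs plumbing is that the new keywords can also
be passed when copying, which makes it easy to rewrite an existing file
with a different chunk layout and measure the effect. Something like the
following, using the patched copyFile() from (i) (file names and sizes
are placeholders):
------ example: re-chunking an existing file via copyFile() ------
import tables

src = tables.openFile("monthly.h5", mode="r")
# chunk_rows/buffer_rows travel through the **kwargs plumbing down to
# Table._g_create() for the tables written into the destination file.
src.copyFile("monthly_chunk512.h5", overwrite=True,
             chunk_rows=512, buffer_rows=131072)
src.close()
----------------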