Hi,
Here are some patches Carabos might consider for PyTables 1.4. I've
developed these against PyTables 1.4alpha (20060505) to assist in
experimenting with different HDF chunk sizes and read buffer strategies
for analysing very large datasets. My main data set comes in 270MB
monthly files with 61 fields, 244 bytes/row, 4,135,482 rows, which is
962MB uncompressed and 272MB compressed with LZO. By trying different
file storage strategies (uncompressed, zlib, LZO), different HDF chunk
sizes (32, 64, 128, 256, 512 rows), different read buffer sizes (1024 up
to 256k rows) and different numbers of HDF chunks to read at once
(4, 16, 64), I've been able to increase row read speeds from 250k rows/sec
up to 900k rows/sec, and can process data at sustained rates of over 500k
rows/sec on my year-old Thinkpad (i.e. it's not very fast, no Core Duo,
etc.). The most successful approach seems to be optimising how PyTables'
read buffers and HDF file chunking interact with WinXP's disk caching,
and then, once the data is in memory, processing it with numarray
functions to minimise the number of intermediate variables Python has to
create and destroy.
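To give a flavour of what I mean by the numarray side, here is a minimal
sketch of the per-buffer processing (the field names are invented for
illustration, not taken from my real data):
------ example: vectorised processing of one buffer ------
import numarray

def monthly_total(buf):
    # buf is one RecArray block as yielded by the iterator in Change #1.
    # 'price' and 'qty' are hypothetical field names.
    price = buf.field('price')          # views into the buffer, no copies
    qty = buf.field('qty')
    # One vectorised pass instead of a Python loop over 4 million rows,
    # so very few intermediate Python objects get created and destroyed.
    return numarray.sum(price * qty)
----------------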
The changes described below involve both Python and Pyrex files. I'm no
C programmer (the last time I'd compiled a line of C code was 10 years
ago), and even though I've always been too scared to build Python source
distributions on Windows, it took just 2 hours to download Pyrex,
Mingw32, the sources for HDF, LZO, ZLIB and PyTables, make my code
changes to PyTables' files, recompile PyTables into a binary installer,
reinstall PyTables, and have the test scripts run successfully. The
whole process was ridiculously easy and, to my great surprise, everything
worked first time (after correcting a "redefine ssize_t" error in HDF).
Now that I know what I am doing, it takes less than 30 seconds to recompile
PyTables and reinstall. So many thanks to Greg Ewing for Pyrex and the
Carabos team for PyTables, which together make such a powerful and
simple set of tools!
Cheers
Stephen
============================================
Change #1: An iterator that returns a table's data into a numarray
RecArray buffer, reusing the same buffer for each call
This eliminates the duplication of creating a buffer to read in data and
then copying it somewhere else to analyse. By simply passing the
destination numarray array to _read_records(), the data is read straight
from disk into the numarray array.
This patch defines a new iterator, Table.iterbuffer(), in Table.py,
which yields the same numarray array on each iteration, filled with a
succession of slices from the on-disk table. A patch is also needed to
_read_records() in TableExtension.pyx so that it honours a buffer offset
in the destination array.
------ new function Table.iterbuffer() in Table.py ------
def iterbuffer(table, buffer_rows=32768, read_rows=4096):
    """
    High-speed iteration over a table in blocks of buffer_rows at a time.
    This initially creates a numarray RecArray buffer of length buffer_rows
    and returns it filled with new data on each iteration.
    Reading from disk seems to be faster if the buffer is filled in smaller
    chunks (perhaps it makes the OS's caching more predictable), so rows are
    read read_rows at a time until the buffer is full or the data is finished.
    # SS 2006-0506
    """
    # Work out sensible values for the buffer and the sub-buffer reads
    if not buffer_rows or buffer_rows <= 0 or not read_rows or read_rows <= 0:
        raise StopIteration
    if read_rows > buffer_rows:
        read_rows = buffer_rows
    buffer = numarray.records.array(None, shape=buffer_rows,
                                    formats=table.description._v_nestedFormats,
                                    names=table.description._v_nestedNames)
    total_rows = table.nrows
    tpos = 0  # Current position in disk table
    bpos = 0  # Current position in memory buffer
    while 1:
        if bpos + read_rows > buffer_rows:
            # No room for another chunk, so return buffer data
            yield buffer[:bpos]
            bpos = 0
        # Read another read_rows rows of data into the buffer
        # NB: Passing the buffer offset by bpos only works if
        # TableExtension.pyx is patched to recognise the buffer's offset.
        # Otherwise _read_records ignores the offset and reads all data
        # into buffer[0:read_rows]
        num_read = table._read_records(tpos, read_rows, buffer[bpos:])
        tpos += num_read
        bpos += num_read
        if num_read < read_rows or tpos >= total_rows:
            # At the end of the file so return any data left in the buffer
            yield buffer[:bpos]
            raise StopIteration
----------------
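For context, here is roughly how I drive the iterator from a script (the
file and field names are placeholders):
------ example: using Table.iterbuffer() ------
import numarray
import tables

h5file = tables.openFile("monthly.h5", mode="r")   # hypothetical file
table = h5file.root.data                           # hypothetical node

total = 0.0
for buf in table.iterbuffer(buffer_rows=131072, read_rows=8192):
    # buf is the same RecArray object each time; copy anything you
    # need to keep before the next iteration overwrites it.
    total = total + numarray.sum(buf.field('price'))
h5file.close()
----------------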
The iterator is supported by a patch to _read_records() to consider
whether the read buffer has an offset. Any offset is currently ignored,
so trying to read data into buffer[100:200] will actually write it into
buffer[0:100]. This is easily fixed by adding one line of code:
------ _read_records() in TableExtension.pyx ------
def _read_records(self, hsize_t start, hsize_t nrecords, object recarr):
    cdef long buflen
    cdef void *rbuf
    cdef int ret

    # Correct the number of records to read, if needed
    if (start + nrecords) > self.totalrecords:
        nrecords = self.totalrecords - start
    # Get the pointer to the buffer data area
    buflen = NA_getBufferPtrAndSize(recarr._data, 1, &rbuf)
    # SS 2006-0506 - Correct the offset
    rbuf = <void *>(<char *>rbuf + recarr._byteoffset)
    # Read the records from disk
    Py_BEGIN_ALLOW_THREADS
    ret = H5TBOread_records(self.dataset_id, self.type_id, start,
                            nrecords, rbuf)
    Py_END_ALLOW_THREADS
    if ret < 0:
        raise HDF5ExtError("Problems reading records.")
    # Convert some HDF5 types to Numarray after reading.
    self._convertTypes(recarr, nrecords, 1)
    return nrecords
-----------------
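For anyone wondering why that one line matters: slicing a numarray
RecArray returns a view onto the same memory, and the view records how
far into the shared buffer it starts in its _byteoffset attribute.
Without the correction, rbuf always points at the start of the
underlying buffer. A rough illustration (the single-column format string
is only illustrative and may need adjusting to numarray's exact format
spelling):
------ example: why the byte offset matters ------
import numarray.records

# A 1000-row, single-field RecArray (hypothetical layout, 4 bytes/row)
buf = numarray.records.array(None, shape=1000,
                             formats=['Int32'], names=['x'])
view = buf[100:]
# view shares buf's memory but starts 100 records in, so data destined
# for buf[100:200] has to be written at rbuf + view._byteoffset,
# not at rbuf itself.
print view._byteoffset   # 100 rows * 4 bytes/row = 400 (illustrative)
----------------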
============================================
Change #2: Read the HDF chunksize parameter when opening a file
PyTables currently uses a rough-and-ready heuristic, calcBufferSize() in
utils.py, to work out reasonably efficient sizes for the read buffer and
the HDF chunksize. In Table.py, both _g_create() and _g_open() have code
like:

    # Compute some values for buffering and I/O parameters
    (self._v_maxTuples, self._v_chunksize) = \
        calcBufferSize(self.rowsize, self._v_expectedrows)

so _v_chunksize is not actually read from the file.
I wanted more control over these parameters, so I modified _g_open() in
Table.py and _getInfo() in TableExtension.pyx to read self._v_chunksize
from the HDF file rather than assuming the value returned by
calcBufferSize() is appropriate.
------ _getInfo() in TableExtension.pyx ------
def _getInfo(self):
    "Get info from a table on disk."
    cdef hid_t space_id
    cdef size_t type_size
    cdef hsize_t dims[1]
    cdef hid_t plist
    cdef H5D_layout_t layout

    # Open the dataset
    self.dataset_id = H5Dopen(self.parent_id, self.name)
    <snip>
    # SS 2006-0506 Get chunksize in the file
    plist = H5Dget_create_plist(self.dataset_id)
    H5Pget_chunk(plist, 1, dims)
    self.chunksize = dims[0]
    self._v_chunksize = self.chunksize
    H5Pclose(plist)
    <snip>
---------------------------------------------
------ _g_open() in Table.py ------
def _g_open(self):
    """Opens a table from disk and reads its metadata.
    Creates a user description on the fly to ease access to
    the actual data.
    """
    # Get table info
    # SS 2006-0506 _getInfo now fills in self._v_chunksize, so the
    # assignment from calcBufferSize() further down has been taken out
    self._v_objectID, description = self._getInfo()
    <snip>
    # Compute buffer size
    # SS 2006-0506 Took out assignment to self._v_chunksize
    # as this is now done in _getInfo()
    (self._v_maxTuples, dummy) = \
        calcBufferSize(self.rowsize, self.nrows)
    <snip>
------------------------------------
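With both halves of this change in place, the chunk size reported on an
opened table reflects what is actually stored in the HDF5 file rather
than the calcBufferSize() guess. A quick (hypothetical) check from the
interactive prompt:
------ example: checking the chunksize of an existing file ------
import tables

h5file = tables.openFile("monthly.h5", mode="r")   # placeholder names
table = h5file.root.data
print table._v_chunksize   # chunk size (in rows) read via H5Pget_chunk()
print table._v_maxTuples   # read buffer size, still from calcBufferSize()
h5file.close()
----------------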
============================================
Change #3: Allow direct control over the HDF chunksize when creating a file
Finally, I needed a way to create HDF files with different chunksizes
than those selected by calcBufferSize(). This involves:
(i) modifying createTable(), openTable() and copyFile() in File.py to
accept keyword arguments for passing in a chunk_rows parameter;
(ii) modifying _g_create() and _g_copyWithStats() in Table.py to use a
chunk_rows keyword argument in place of the value supplied by
calcBufferSize();
(iii) adding **kwargs to the argument list and to the final (super) call
of __init__(), _g_copy() and _f_copy() in Group.py, Node.py and Leaf.py.
This enables the chunk_rows keyword in (i) to flow through to the
table-creating functions in (ii). I think I've added more of these
**kwargs than are strictly necessary, but everything seems to work just
fine.
------ createTable() in File.py ------
def createTable(self, where, name, description, title="",
                filters=None, expectedrows=10000,
                buffer_rows=None, chunk_rows=None,  # SS 2006-0506
                compress=None, complib=None):       # Deprecated
    """<snip>"""
    parentNode = self.getNode(where)  # Does the parent node exist?
    fprops = _checkFilters(filters, compress, complib)
    return Table(parentNode, name,
                 description=description, title=title,
                 filters=fprops, expectedrows=expectedrows,
                 buffer_rows=buffer_rows, chunk_rows=chunk_rows,  # SS 2006-0506
                 )
---------------------------------------
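As an example of how the patched createTable() gets called (the
description class and the numbers below are placeholders, not part of
the patch):
------ example: creating a table with an explicit chunksize ------
import tables

class Tick(tables.IsDescription):
    # Hypothetical row layout for illustration only
    price = tables.Float64Col()
    qty   = tables.Int32Col()

h5file = tables.openFile("monthly.h5", mode="w")
table = h5file.createTable("/", "data", Tick, title="ticks",
                           filters=tables.Filters(complevel=1, complib="lzo"),
                           expectedrows=4135482,
                           chunk_rows=256,      # HDF chunk size in rows
                           buffer_rows=65536)   # table buffer size in rows
h5file.close()
----------------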
------ Table.__init__() in Table.py ------
def __init__(self, parentNode, name,
             description=None, title="", filters=None,
             expectedrows=EXPECTED_ROWS_TABLE,
             log=True,
             **kwargs):  # SS 2006-0506
    """<snip>"""
    <snip>
    # SS 2006-0506 - Modified to read _v_maxTuples and _v_chunksize
    # from the buffer_rows and chunk_rows arguments to Table()
    self._v_maxTuples = kwargs.get('buffer_rows', None)
    """The number of rows that fit in the table buffer."""
    self._v_chunksize = kwargs.get('chunk_rows', None)
    """The HDF5 chunk size."""
    <snip>
-----------------------------------
------ _g_create() in Table.py ------
def _g_create(self):
    """Create a new table on disk."""
    <snip>
    # Compute some values for buffering and I/O parameters
    # SS 2006-0506 Only uses calcBufferSize if default
    # parameters not supplied to File.createTable()
    (calc_mt, calc_cs) = calcBufferSize(self.rowsize,
                                        self._v_expectedrows)
    if self._v_maxTuples is None:
        self._v_maxTuples = calc_mt
    if self._v_chunksize is None:
        self._v_chunksize = calc_cs
    <snip>
-------------------------------------
------ _g_copyWithStats() in Table.py ------
# SS 2006-0506 Added **kwargs
def _g_copyWithStats(self, group, name, start, stop, step,
                     title, filters, log, **kwargs):
    "Private part of Leaf.copy() for each kind of leaf"
    # Build the new Table object
    <snip>
    object = Table(
        group, name, description, title=title, filters=filters,
        expectedrows=self.nrows, log=log,
        **kwargs  # SS 2006-0506
        )
----------------------------------
------ _g_copy() in Leaf.py ------
# SS 2006-0506 - Added **kwargs to _g_copy()
def _g_copy(self, newParent, newName, recursive, log, **kwargs):
    # Compute default arguments.
    <snip>
    # Create a copy of the object.
    (newNode, bytes) = self._g_copyWithStats(newParent, newName,
                                             start, stop, step, title,
                                             filters, log,
                                             **kwargs)  # SS 2006-0506
    <snip>
----------------------------------
------ _g_copy() in Group.py ------
def _g_copy(self, newParent, newName, recursive, log, **kwargs):
    # Compute default arguments.
    <snip>
    # Create a copy of the object.
    # SS 2006-0506 - Add kwargs for passing parameters
    newNode = Group(newParent, newName,
                    title, new=True, filters=filters, log=log, **kwargs)
    <snip>
------------------------------------
------__init__() in Group.py ------
# SS 2006-0506 - Add **kwargs at start and end of __init__()
def __init__(self, parentNode, name,
             title="", new=False, filters=None,
             log=True, **kwargs):
    """Create the basic structures to keep group information.
    <snip>
    """
    <snip>
    # Finally, set up this object as a node.
    super(Group, self).__init__(parentNode, name, log, **kwargs)
-----------------------------------
------__init__() in Leaf.py ------
# SS 2006-0506 - Add **kwargs at start and end of __init__()
class Leaf(Node):
    <snip>
    def __init__(self, parentNode, name,
                 new=False, filters=None,
                 log=True,
                 **kwargs):  # SS 2006-0506
        <snip>
        super(Leaf, self).__init__(parentNode, name, log, **kwargs)
-----------------------------------
Node.__init__() complained about extra keyword arguments, so I added
**kwargs there as well. It doesn't actually do anything useful, so perhaps
some PyTables expert could work out where else I've got too many kwargs.
------ __init__() in Node.py ------
# SS 2006-0506 - Add **kwargs at start and end of __init__()
class Node(object):
    <snip>
    def __init__(self, parentNode, name, log=True, **kwargs):
        <snip>
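The point of all this **kwargs plumbing is that the new keywords can also
be passed when copying, which makes it easy to rewrite an existing file
with a different chunk layout and measure the effect. Something like the
following, using the patched copyFile() from (i) (file names and sizes
are placeholders):
------ example: re-chunking an existing file via copyFile() ------
import tables

src = tables.openFile("monthly.h5", mode="r")
# chunk_rows/buffer_rows travel through the **kwargs plumbing down to
# Table._g_create() for the tables written into the destination file.
src.copyFile("monthly_chunk512.h5", overwrite=True,
             chunk_rows=512, buffer_rows=131072)
src.close()
----------------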