PEP: <unassigned>
Title: Adding data-type objects to the standard library
Version: $Revision: $
Last-Modified: $Date: $
Author: Travis Oliphant <[EMAIL PROTECTED]>
Status: Draft
Type: Standards Track
Created: 05-Sep-2006
Python-Version: 2.6
Abstract
This PEP proposes adapting the data-type objects from NumPy for
inclusion in standard Python, to provide a consistent and standard
way to discuss the format of binary data.
Rationale
Many applications, across several domains, need to interpret
binary data in terms of fundamental data-types such as integers,
floating-point values, and complex floating-point values. A common
object that carries this information about binary data would
benefit many people. The creation of data-type objects in NumPy to
carry the
load of describing what each element of the array contains
represents an evolution of a solution that began with the
PyArray_Descr structure in Python's own array object. These
data-type objects can represent arbitrary byte data. Currently,
such information is usually constructed using strings and
character codes, which is unwieldy when a data-type consists of
nested structures.
Proposal
Add a PyDatatypeObject in Python (adapted from NumPy's dtype
object which evolved from the PyArray_Descr structure in Python's
array module) that holds information about a data-type. This object
will allow packages to exchange information about binary data in
a uniform way (see the extended buffer protocol PEP for an application
to exchanging information about array data).
Specification
The datatype is an object that specifies how a certain block of
memory should be interpreted as a basic data-type. In addition to
being able to describe basic data-types, the data-type object can
describe a data-type that is itself an array of other data-types
as well as a data-type that contains arbitrary "fields" (structure
members) which are located at specific offsets. In its most basic
form, however, a data-type is of a particular kind (bit, bool,
int, uint, float, complex, object, string, unicode, void) and size.
Datatype objects can be created using either a type-object, a
string, a tuple, a list, or a dictionary according to the following
constructors:
Type-object:
For a select set of type-objects, a data-type object describing that
basic type can be created:
Examples:
>>> datatype(float)
datatype('float64')
>>> datatype(int)
datatype('int32') # on 32-bit platform (64 if c-long is 64-bits)
Tuple-object
A tuple of length 2 can be used to specify a data-type that is
an array of another kind of basic data-type (this array always
describes a C-contiguous array).
Examples:
>>> datatype((int, 5))
datatype(('int32', (5,)))
# describes a 5*4=20-byte block of memory laid out as
# a[0], a[1], a[2], a[3], a[4]
>>> datatype((float, (3,2)))
datatype(('float64', (3,2)))
# describes a 3*2*8=48-byte block of memory that should be
# interpreted as 6 doubles laid out as arr[0,0], arr[0,1],
# ... arr[2,0], arr[2,1]
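The C-contiguous layout described above can be checked with a short sketch. The helper name c_contiguous_offsets is hypothetical and not part of this proposal; it only illustrates how a shape and an item size determine byte offsets when the last index varies fastest.

```python
from itertools import product

def c_contiguous_offsets(shape, itemsize):
    """Map each index tuple to its byte offset in a C-contiguous block.

    Hypothetical helper for illustration only.
    """
    # Build strides from the last axis backwards: the last index varies fastest.
    strides = []
    step = itemsize
    for extent in reversed(shape):
        strides.insert(0, step)
        step *= extent
    offsets = {}
    for index in product(*(range(n) for n in shape)):
        offsets[index] = sum(i * s for i, s in zip(index, strides))
    # After the loop, step holds the total size of the block in bytes.
    return offsets, step

# A (3, 2) array of 8-byte doubles, as in the example above.
offsets, total = c_contiguous_offsets((3, 2), 8)
print(total)            # 48
print(offsets[(0, 1)])  # 8
print(offsets[(2, 1)])  # 40
```

The last element, arr[2,1], starts at byte 40 and ends at byte 48, which agrees with the 3*2*8 total.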
String-object:
The basic format is '%s%s%s%d' % (endian, shape, kind, itemsize)
kind : one of the basic array kinds given below.
itemsize : the number of bytes (or bits for 't' kind) for
this data-type.
endian : either '', '=' (native), '|' (doesn't matter),
'>' (big-endian) or '<' (little-endian).
shape : either '', or a shape-tuple describing a data-type that
is an array of the given shape.
A string can also be a comma-separated sequence of basic
formats. The result will be a data-type with default field
names: 'f0', 'f1', ..., 'fn'.
Examples:
>>> datatype('u4')
datatype('uint32')
>>> datatype('f4')
datatype('float32')
>>> datatype('(3,2)f4')
datatype(('float32', (3,2)))
>>> datatype('(5,)i4, (3,2)f4, S5')
datatype([('f0', '<i4', (5,)), ('f1', '<f4', (3, 2)), ('f2', '|S5')])
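As a rough illustration of the string form, the sketch below splits a format string into its (endian, shape, kind, itemsize) parts and assigns the default field names. The names parse_basic and parse_fields are hypothetical, and the regular expressions are one possible reading of the grammar above, not the reference implementation's parser.

```python
import re

# Hypothetical pattern for one basic format: endian, shape, kind, itemsize.
_BASIC = re.compile(r"^([=|<>]?)(\([\d,\s]*\))?([tbiufcSUOV])(\d+)$")

def parse_basic(fmt):
    """Split one basic format string into (endian, shape, kind, itemsize)."""
    match = _BASIC.match(fmt.strip())
    if match is None:
        raise ValueError("not a basic format: %r" % fmt)
    endian, shape, kind, itemsize = match.groups()
    dims = ()
    if shape:
        # "(5,)" -> (5,) and "(3,2)" -> (3, 2)
        dims = tuple(int(d) for d in shape[1:-1].split(",") if d.strip())
    return endian, dims, kind, int(itemsize)

def parse_fields(spec):
    """Parse a comma-separated sequence, naming fields 'f0', 'f1', ..."""
    # Split only on commas that are not inside a shape tuple.
    parts = re.split(r",\s*(?![^()]*\))", spec)
    return [("f%d" % i, parse_basic(part)) for i, part in enumerate(parts)]

print(parse_basic("(3,2)f4"))
# ('', (3, 2), 'f', 4)
print(parse_fields("(5,)i4, (3,2)f4, S5"))
# [('f0', ('', (5,), 'i', 4)), ('f1', ('', (3, 2), 'f', 4)), ('f2', ('', (), 'S', 5))]
```

An empty endian component is returned unchanged here; resolving it to the platform's native order is left out of the sketch.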
List-object:
A list should be a list of tuples where each tuple describes a
field. Each tuple should contain (name, datatype{, shape}) or
((meta-info, name), datatype{, shape}) in order to specify the
data-type.
This list must fully specify the data-type (no memory holes). If
you would like a data-type with memory holes where the
compiler would place them, pass the keyword align=1 to the
constructor. This results in unnamed fields of Void kind of
the correct size interspersed where needed.
Examples:
datatype([( ([1,2],'coords'), 'f4', (3,6)), ('address', 'S30')])
A data-type that could represent the structure
float coords[3*6] /* Has [1,2] associated with this field */
char address[30]
datatype([( 'simple', 'i4'), ('nested', [('name', 'S30'),
('addr', 'S45'),
('amount', 'i4')])])
Can represent the memory layout of
struct {
    int simple;
    struct nested {
        char name[30];
        char addr[45];
        int amount;
    } nested;
};
There is no formal limit to the nesting that is possible.
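For comparison, the same hole-free layout can be written today with the character codes of the standard struct module. This is only a size check under the assumption of 4-byte ints and no padding; it shows that the flat code string carries sizes but none of the field names or nesting that the list form records.

```python
import struct

# '<' selects standard sizes with no padding, matching the rule that a
# list must fully specify the data-type (no memory holes).
nested_size = struct.calcsize("<30s45si")   # name, addr, amount
total_size = struct.calcsize("<i30s45si")   # simple, then the nested fields

print(nested_size)  # 79
print(total_size)   # 83
```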
datatype('i2, i4, i1, f8', align=1)
datatype([('f0', '<i2'), ('', '|V2'), ('f1', '<i4'),
('f2', '|i1'), ('', '|V3'), ('f3', '<f8')])
# Notice the padding bytes placed in the structure to make sure
# f1 and f3 are aligned correctly for a 32-bit system.
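The padding in this example can be reproduced with a small sketch. The helper pad_fields and the alignment table are hypothetical; the table assumes a 32-bit platform on which 8-byte doubles are aligned to 4 bytes, which is what the output above implies.

```python
def pad_fields(fields, alignments):
    """Insert unnamed Void fields so each field starts at an aligned offset.

    fields: list of (name, typestr, size_in_bytes)
    alignments: mapping from typestr to required alignment in bytes
    Hypothetical helper for illustration only.
    """
    result = []
    offset = 0
    for name, typestr, size in fields:
        align = alignments[typestr]
        gap = -offset % align  # bytes needed to reach the next aligned offset
        if gap:
            result.append(("", "|V%d" % gap))
            offset += gap
        result.append((name, typestr))
        offset += size
    return result

# Assumed alignments for a 32-bit platform (doubles aligned to 4 bytes).
ALIGN_32 = {"<i2": 2, "<i4": 4, "|i1": 1, "<f8": 4}
fields = [("f0", "<i2", 2), ("f1", "<i4", 4), ("f2", "|i1", 1), ("f3", "<f8", 8)]
layout = pad_fields(fields, ALIGN_32)
print(layout)
# [('f0', '<i2'), ('', '|V2'), ('f1', '<i4'), ('f2', '|i1'), ('', '|V3'), ('f3', '<f8')]
```

On a platform where doubles align to 8 bytes, the second hole would be 7 bytes rather than 3, which is why the alignment table is passed in as a parameter.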
Dictionary-object:
Sometimes, you are only concerned about a few fields in a larger
memory structure. The dictionary object allows specification of
a data-type with fields using a dictionary with names as keys and
tuples as values. The value tuples are
(data-type, offset{, meta-info}). The offset is the offset in
bytes (or bits when data-type is 't') from the beginning of the
structure to the field data-type.
Example:
datatype({'f3' : ('f8', 12), 'f2': ('i1', 8)})
datatype([('', '|V8'), ('f2', '|i1'), ('', '|V3'), ('f3', '<f8')])
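A minimal sketch of how the gaps between explicitly placed fields turn into unnamed Void fields follows. The helper fill_holes is hypothetical, takes the field sizes as explicit input, and assumes the fields do not overlap.

```python
def fill_holes(fields):
    """Turn {name: (typestr, size, offset)} into a descr-like list.

    Hypothetical helper for illustration only; assumes fields do not overlap.
    """
    result = []
    position = 0
    # Visit the fields in offset order, as the names attribute would list them.
    for name, (typestr, size, offset) in sorted(
            fields.items(), key=lambda item: item[1][2]):
        if offset > position:
            # Unclaimed bytes before this field become an unnamed Void field.
            result.append(("", "|V%d" % (offset - position)))
        result.append((name, typestr))
        position = offset + size
    return result

layout = fill_holes({"f3": ("<f8", 8, 12), "f2": ("|i1", 1, 8)})
print(layout)
# [('', '|V8'), ('f2', '|i1'), ('', '|V3'), ('f3', '<f8')]
```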
Attributes
byteorder -- returns the byte-order of this data-type
isnative -- returns True if this data-type is in the native byte-order
for the platform.
descr -- returns a description of this data-type as a list of
tuples (name or (name, meta), datatype{, shape})
itemsize -- returns the total size of the data-type.
kind -- returns the basic "kind" of the data-type. The basic kinds
are:
't' - bit,
'b' - bool,
'i' - signed integer,
'u' - unsigned integer,
'f' - floating point,
'c' - complex floating point,
'S' - string (fixed-length sequence of char),
'U' - fixed length sequence of UCS4,
'O' - pointer to PyObject,
'V' - Void (anything else).
names -- returns a list of names (keys to the fields dictionary) in
offset-order.
fields -- returns a read-only dictionary indicating the fields or
None if this data-type has no fields. The dictionary
is keyed by the field name and each entry contains
a tuple of (data-type, offset{, meta-object}). The
offset indicates the byte-offset (or bit-offset for 't')
from the beginning of the data-type to the data-type
indicated.
hasobject -- returns True if this data-type is an "object" data-type
or has "object" fields.
name -- returns a description of the data-type made up of its name
and bit-width (for example, 'float64').
base -- returns self unless this data-type is an array of some
other data-type and then it returns that basic
data-type.
shape -- returns the shape of this data-type (for data-types
that are arrays of other data-types) or () if there
is no array.
str -- returns the type-string of this data-type which is the
basic kind followed by the number of bytes (or bits
for 't')
alignment -- returns alignment needed for this data-type on platform
as determined by the compiler.
Methods
newbyteorder ({endian})
create a new data-type with byte-order changed in any and all
fields (including deeply nested ones), to {endian}. If endian is
not given, then swap all byte-orders.
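The "swap all byte-orders" case can be sketched over a descr-style list of tuples. The helper swap_descr is hypothetical; it assumes every type-string carries an explicit leading byte-order character and it leaves '|' (doesn't matter) untouched.

```python
# Only explicit big- and little-endian markers are exchanged.
SWAP = {"<": ">", ">": "<"}

def swap_descr(descr):
    """Return a copy of descr with every byte-order swapped, at any depth.

    Hypothetical helper for illustration only.
    """
    if isinstance(descr, str):
        return SWAP.get(descr[0], descr[0]) + descr[1:]
    swapped = []
    for field in descr:
        name, dtype = field[0], field[1]
        # Recurse into nested field lists; keep an optional shape unchanged.
        swapped.append((name, swap_descr(dtype)) + tuple(field[2:]))
    return swapped

descr = [("simple", "<i4"),
         ("nested", [("name", "|S30"), ("amount", "<i4")]),
         ("grid", "<f4", (3, 2))]
print(swap_descr(descr))
# [('simple', '>i4'), ('nested', [('name', '|S30'), ('amount', '>i4')]), ('grid', '>f4', (3, 2))]
```

Handling '=' (native) would additionally require consulting the platform's byte order, which this sketch omits.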
__len__(self)
equivalent to len(self.fields)
__getitem__(self, name)
get the field named [name]. Equivalent to self.fields[name].
C-functions : These are function pointers attached in a C-structure
connected with the data-type object that perform specific
functions.
setitem (PyObject *datatype, void *data, PyObject *obj)
set a Python object into memory of this data-type
at the given memory location.
getitem (PyObject *datatype, void *data)
get a Python object from memory of this data-type.
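The intent of these two function pointers can be illustrated in pure Python with the standard struct module. This is only an analogy: the Python functions below take a struct format string and a byte offset in place of the PyObject *datatype and void *data arguments, and they are not the proposed C signatures.

```python
import struct

def setitem(fmt, memory, offset, obj):
    """Store a Python object into memory at offset, encoded per fmt."""
    struct.pack_into(fmt, memory, offset, obj)

def getitem(fmt, memory, offset):
    """Build a Python object from the bytes at offset, decoded per fmt."""
    return struct.unpack_from(fmt, memory, offset)[0]

memory = bytearray(8)             # an 8-byte block of zeroed memory
setitem("<i", memory, 4, 1025)    # little-endian 4-byte int at byte 4
value = getitem("<i", memory, 4)

print(bytes(memory[4:8]))  # b'\x01\x04\x00\x00'
print(value)               # 1025
```

Reading the same four bytes with a big-endian format would give a different integer, which is why the byte-order is part of the data-type.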
Implementation
A reference implementation (with more features than are proposed
here) is available in NumPy and will be adapted if this PEP is
accepted.
Questions:
There should probably be a limited C-API so that data-type objects
can be returned and sent through the extended buffer protocol (see
extended buffer protocol PEP).
Should bit-fields be handled by re-interpreting the offsets as
bit-values, use some other mechanism for handling the offset, or
should they be unsupported?
NumPy supports "string" and "unicode" data-types. The unicode
data-type in NumPy always means UCS4 (but it is translated
back and forth to Python unicode scalars as needed for narrow
builds). With Python 3.0 looming, we should probably support
different encodings as data-types and drop the string type for a
bytes type. Some help in understanding what to do here is
appreciated.
Copyright
This PEP is placed in the public domain.