[Python-Dev] PEP: Adding data-type objects to Python

Travis E. Oliphant Fri, 27 Oct 2006 13:17:07 -0700

PEP: <unassigned>
Title: Adding data-type objects to the standard library
Version: $Revision: $
Last-Modified: $Date:  $
Author: Travis Oliphant <[EMAIL PROTECTED]>
Status: Draft
Type: Standards Track
Created: 05-Sep-2006
Python-Version: 2.6


Abstract

    This PEP proposes adapting the data-type objects from NumPy for
    inclusion in standard Python, to provide a consistent and standard
    way to discuss the format of binary data. 

Rationale

    There are many situations crossing multiple areas where an
    interpretation is needed of binary data in terms of fundamental
    data-types such as integers, floating-point, and complex
    floating-point values.  Having a common object that carries
    information about binary data would be beneficial to many
    people. The creation of data-type objects in NumPy to carry the
    load of describing what each element of the array contains
    represents an evolution of a solution that began with the
    PyArray_Descr structure in Python's own array object.  These
    data-type objects can represent arbitrary byte data.  Currently
    such information is usually constructed using strings and
    character codes which is unwieldy when a data-type consists of
    nested structures.

Proposal

    Add a PyDatatypeObject in Python (adapted from NumPy's dtype
    object which evolved from the PyArray_Descr structure in Python's
    array module) that holds information about a data-type.  This object
    will allow packages to exchange information about binary data in
    a uniform way (see the extended buffer protocol PEP for an application
    to exchanging information about array data). 

Specification

    The datatype is an object that specifies how a certain block of
    memory should be interpreted as a basic data-type. In addition to
    being able to describe basic data-types, the data-type object can
    describe a data-type that is itself an array of other data-types
    as well as a data-type that contains arbitrary "fields" (structure
    members) which are located at specific offsets. In its most basic
    form, however, a data-type is of a particular kind (bit, bool,
    int, uint, float, complex, object, string, unicode, void) and size.

    Datatype objects can be created using either a type-object, a
    string, a tuple, a list, or a dictionary according to the following
    constructors:

    Type-object: 

      For a select set of type-objects a data-type object describing that
      basic type can be described:

      Examples: 

      >>> datatype(float)
      datatype('float64')
      
      >>> datatype(int)
      datatype('int32')  # on 32-bit platform (64 if c-long is 64-bits)

    Tuple-object
   
      A tuple of length 2 can be used to specify a data-type that is
      an array of another kind of basic data-type (this array always
      describes a C-contiguous array).

      Examples: 

      >>> datatype((int, 5))
      datatype(('int32', (5,)))
      # describes a 5*4=20-byte block of memory laid out as 
      #  a[0], a[1], a[2], a[3], a[4]

      >>> datatype((float, (3,2))
      datatype(('float64', (3,2))   
      # describes a 3*2*8=48 byte block of memory that should be
      # interpreted as 6 doubles laid out as arr[0,0], arr[0,1],
      # ... a[2,0], a[1,2]


    String-object:
 
      The basic format is '%s%s%s%d' % (endian, shape, kind, itemsize) 

         kind     : one of the basic array kinds given below. 
         
         itemsize : the nubmer of bytes (or bits for 't' kind) for 
                     this data-type.  

         endian   : either '', '=' (native), '|' (doesn't matter),
                     '>' (big-endian) or '<' (little-endian).

         shape    : either '', or a shape-tuple describing a data-type that
                     is an array of the given shape.

      A string can also be a comma-separated sequence of basic
      formats. The result will be a data-type with default field
      names: 'f0', 'f1', ..., 'fn'.

      Examples: 

      >>> datatype('u4')
      datatype('uint32')

      >>> datatype('f4')
      datatype('float32')

      >>> datatype('(3,2)f4')
      datatype(('float32', (3,2))

      >>> datatype('(5,)i4, (3,2)f4, S5')
      datatype([('f0', '<i4', (5,)), ('f1', '<f4', (3, 2)), ('f2', '|S5')])


    List-object:

      A list should be a list of tuples where each tuple describes a
      field. Each tuple should contain (name, datatype{, shape}) or
      ((meta-info, name), datatype{, shape}) in order to specify the
      data-type. 

      This list must fully specify the data-type (no memory holes). If
      would would like to return a data-type with memory holes where the
      compiler would place them, then pass the keyword align=1 to this
      construction.  This will result in un-named fields of Void kind of
      the correct size interspersed where needed.

      Examples: 

      datatype([( ([1,2],'coords'), 'f4', (3,6)), ('address', 'S30')])

      A data-type that could represent the structure
 
      float coords[3*6]   /* Has [1,2] associated with this field */
      char  address[30]
      

      datatype([( 'simple', 'i4'), ('nested', [('name', 'S30'), 
                                               ('addr', 'S45'),    
                                               ('amount', 'i4')])])

      Can represent the memory layout of 

      struct {
      int  simple;
      struct nested {
           char name[30];
           char addr[45];
           int  amount;
      }

      There is no formal limit to the nesting that is possible. 

      
      datatype('i2, i4, i1, f8', align=1)
      datatype([('f0', '<i2'), ('', '|V2'), ('f1', '<i4'), 
                ('f2', '|i1'), ('', '|V3'), ('f3', '<f8')])

      # Notice the padding bytes placed in the structure to make sure
      #  f1 and f8 are aligned correctly for the 32-bit system. 


    Dictionary-object: 

      Sometimes, you are only concerned about a few fields in a larger
      memory structure.  The dictionary object allows specification of  
      a data-type with fields using a dictionary with names as keys and
      tuples as values.  The value tuples are 
      (data-type, offset{, meta-info}).  The offset is the offset in   
      bytes (or bits when data-type is 't') from the beginning of the 
      structure to the field data-type.

      Example:

      datatype({'f3' : ('f8', 12), 'f2': ('i1', 8)})
      type([('', '|V8'), ('f2', '|i1'), ('', '|V3'), ('f3', '<f8')])


  Attributes

     byteorder --  returns the byte-order of this data-type

     isnative  --  returns True if this data-type is in correct byte-order
                      for the platform.

     descr     --  returns an description of this data-type as a list of 
                     tuples (name or (name, meta), datatype{, shape})

     itemsize  --  returns the total size of the data-type. 

     kind      --  returns the basic "kind" of the data-type. The basic kinds
                     are:
                       't' - bit, 
                       'b' - bool, 
                       'i' - signed integer, 
                       'u' - unsigned integer,
                       'f' - floating point,                  
                       'c' - complex floating point, 
                       'S' - string (fixed-length sequence of char),
                       'U' - fixed length sequence of UCS4,
                       'O' - pointer to PyObject,
                       'V' - Void (anything else).

     names     --  returns a list of names (keys to the fields dictionary) in
                     offset-order.

     fields    --  returns a read-only dictionary indicating the fields or 
                     None if this data-type has no fields.  The dictionary
                     is keyed by the field name and each entry contains
                     a tuple of (data-type, offset{, meta-object}). The
                     offset indicates the byte-offset (or bit-offset for 't')
                     from the beginning of the data-type to the data-type 
                     indicated.

     hasobject --  returns True if this data-type is an "object" data-type
                      or  has "object" fields. 

     name      --  returns a 'name'-bitwidth description of data-type.

     base      --  returns self unless this data-type is an array of some
                      other data-type and then it returns that basic
                      data-type.                  

     shape     --  returns the shape of this data-type (for data-types
                      that are arrays of other data-types) or () if there
                      is no array. 

     str       --  returns the type-string of this data-type which is the
                      basic kind followed by the number of bytes (or bits 
                      for 't')

     alignment --  returns alignment needed for this data-type on platform
                     as determined by the compiler. 
     
  Methods

     newbyteorder ({endian})

       create a new data-type with byte-order changed in any and all
       fields (including deeply nested ones), to {endian}.  If endian is
       not given, then swap all byte-orders.

     __len__(self)  
  
         equivalent to len(self.field)

     __getitem__(self, name)
  
         get the field named [name].  Equivalent to self.field[name].
     

  C-functions :  These are function pointers attached in a C-structure
                  connected with the data-type object that perform specific 
                  functions.
 
     setitem  (PyObject *datatype, void *data, PyObject *obj)

          set a Python object into memory of this data-type
              at the given memory location.

     getitem  (PyObject *datatype, void *data) 
     
          get a Python object from memory of this data-type.


Implementation

    A reference implementation (with more features than are proposed
    here) is available in NumPy and will be adapted if this PEP is
    accepted.

Questions:

    There should probably be a limited C-API so that data-type objects
    can be returned and sent through the extended buffer protocol (see 
    extended buffer protocol PEP). 

    Should bit-fields be handled by re-interpreting the offsets as
    bit-values, use some other mechanism for handling the offset, or
    should they be unsupported?

    NumPy supports "string" and "unicode" data-types.  The unicode
    data-type in NumPy always means UCS4 (but it is translated
    back-and forth to Python unicode scalars as needed for narrow
    builds).  With Python 3.0 looming, we should probably support
    different encodings as data-types and drop the string type for a
    bytes type.  Some help in understanding what to do here is
    appreciated.


Copyright

    This PEP is placed in the public domain

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

[Python-Dev] PEP: Adding data-type objects to Python

Reply via email to