[Python-3000] Revised PEP for buffer protocol

Travis E. Oliphant Tue, 20 Mar 2007 02:11:43 -0800

Attached is my revised PEP for the buffer protocol after incorporating suggestions from Greg Ewing. It is as simple as I can make it and still share what I think needs to be sharable. Suggestions are welcome. I will provide and maintain code to implement the PEP when the basic idea of the PEP is accepted.


Thanks,

-Travis

PEP: <unassigned>
Title: Revising the buffer protocol
Version: $Revision: $
Last-Modified: $Date:  $
Author: Travis Oliphant <[EMAIL PROTECTED]>
Status: Draft
Type: Standards Track
Created: 28-Aug-2006
Python-Version: 3000

Abstract

   This PEP proposes re-designing the buffer API (PyBufferProcs
   function pointers) to improve the way Python allows memory sharing
   in Python 3.0.

   In particular, it is proposed that the multiple-segment and
   character buffer portions of the buffer API are eliminated and
   additional function pointers are provided to allow sharing any
   multi-dimensional nature of the memory and what data-format the
   memory contains.

Rationale

   The buffer protocol allows different Python types to exchange a
   pointer to a sequence of internal buffers.  This functionality is
   *extremely* useful for sharing large segments of memory between
   different high-level objects, but it's too limited and has issues.

    1. There is the little (never?) used "sequence-of-segments" option
       (bf_getsegcount)

    2. There is the apparently redundant character-buffer option
       (bf_getcharbuffer)

    3. There is no way for a consumer to tell the buffer-API-exporting
       object it is "finished" with its view of the memory and
       therefore no way for the exporting object to be sure that it is
       safe to reallocate the pointer to the memory that it owns (for
       example, the array object reallocating its memory after sharing
       it with the buffer object which held the original pointer led
       to the infamous buffer-object problem).

    4. Memory is just a pointer with a length. There is no way to
       describe what is "in" the memory (float, int, C-structure, etc.)

    5. There is no shape information provided for the memory.  But,
       several array-like Python types could make use of a standard
       way to describe the shape-interpretation of the memory
       (wxPython, GTK, pyQT, CVXOPT, PyVox, Audio and Video
       Libraries, ctypes, NumPy, data-base interfaces, etc.)

    There are two widely used libraries that use the concept of
    discontiguous memory: PIL and NumPy.  Their view of discontiguous
    arrays is a bit different, though.  NumPy uses the notion of
    constant striding in each dimension as its basic concept of an
    array. In this way a simple sub-region of a larger array can be
    described without copying the data.  Strided memory is also a common
    way to describe data in many computing libraries (such as the BLAS
    and LAPACK).
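
    The constant-stride idea can be illustrated with a small sketch.
    This is not NumPy code: a flat Python list stands in for the memory
    block, strides are counted in elements rather than bytes, and the
    helper name strided_get is hypothetical.

```python
# Illustrative only: a flat list stands in for a block of memory, and
# strides are counted in elements rather than bytes to keep this short.

def strided_get(memory, offset, strides, index):
    """Return the element at a multi-dimensional index."""
    position = offset + sum(i * s for i, s in zip(index, strides))
    return memory[position]

# A 4x6 array stored C-contiguously: element (r, c) lives at r*6 + c.
memory = list(range(24))
full_strides = (6, 1)

# The 2x3 sub-region starting at row 1, column 2 shares the same memory.
# Only the starting offset and the shape change; the strides stay (6, 1).
sub_offset = 1 * 6 + 2
sub_shape = (2, 3)
sub_region = [
    [strided_get(memory, sub_offset, full_strides, (r, c))
     for c in range(sub_shape[1])]
    for r in range(sub_shape[0])
]
print(sub_region)
```

    No element is copied out of memory until it is read, which is why a
    sub-region can be described by an offset, a shape and strides alone.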

    The PIL uses a more opaque memory representation. Sometimes an
    image is contained in a contiguous segment of memory, but
    sometimes it is contained in an array of pointers to the
    contiguous segments (usually lines) of the image.  This allows the
    image to not be loaded entirely into memory but still managed
    abstractly as if it were.  I believe the PIL is where the idea of
    multiple buffer segments in the original buffer interface came
    from.

    The buffer interface should allow discontiguous memory areas to
    share standard striding information.  However, consumers that do
    not want to deal with strided memory should also be able to
    request a contiguous segment easily.    


Proposal Overview

   * Eliminate the char-buffer and multiple-segment sections of the
     buffer-protocol.

   * Unify the read/write versions of getting the buffer.

   * Add a new function to the interface that should be called when
     the consumer object is "done" with the view.

   * Add a new memory_view object that is returned from the 
     buffer interface getbuffer call.  This memory_view object
     contains

   * Add a new function to allow the interface to describe what is in
     memory (unifying what is currently done in struct and array)

   * Add a new function to allow the protocol to share shape and 
     stride information

   * Fix all objects in the core and the standard library to conform
     to the new interface

   * Extend the struct module to handle more format specifiers

Specification

    Change the PyBufferProcs structure to

    typedef struct {
         getbufferproc bf_getbuffer;
         releasebufferproc bf_releasebuffer;
    } PyBufferProcs;

    typedef PyObject *(*getbufferproc)(PyObject *obj, void **buf,
                                       Py_ssize_t *len, int *writeable,
                                       char **format, int *ndims,
                                       Py_ssize_t **shape,
                                       Py_ssize_t **strides);
 
      Return a pointer to memory in *buf and the length of that memory
      buffer (in bytes) in *len.  The next arguments are optional.  
      NULL is returned on failure.  On success an object-specific
      view is returned (which may just be a borrowed reference to obj).
      This view should be passed to bf_releasebuffer when the consumer
      is done with the view. 

      writeable -- address of an integer variable to hold 
                     whether or not the memory is writeable.
                     If this is NULL, then you must assume the memory 
                     is read-only.
      format    -- address of a format-string (following extended struct
                     syntax) indicating what is in each element of
                     memory.  The number of elements is len / itemsize,
                     where itemsize is the number of bytes implied by the
                     format.  NULL if not needed, in which case
                     format == "B" (unsigned bytes) is assumed.  The
                     memory for this string must not be freed by the
                     consumer; it is managed by the exporter.
      ndims     -- address of a variable storing the number of dimensions
                     or NULL if not needed.  If shape and/or strides are given
                     then this must be non-NULL.  If this variable is
                     not provided then it is assumed that *ndims == 1.
      shape     -- address of a Py_ssize_t* variable that will be filled
                     with a pointer to an array of Py_ssize_t of length *ndims
                     indicating the shape of the memory as an N-D array.
                     Ignored if this is NULL.  Note that
                     ((*shape)[0] * ... * (*shape)[*ndims-1])*itemsize == len.
                     If this variable is not provided then it is assumed that
                     (*shape)[0] == len / itemsize.
      strides   -- address of a Py_ssize_t* variable that will be filled
                     with a pointer to an array of Py_ssize_t of length *ndims
                     indicating the number of bytes to skip to get to the next
                     element in each dimension.  If this is NULL, then
                     the memory is assumed to be C-style contiguous with
                     the last dimension varying the fastest.  An
                     error should be raised if this is not accurate and
                     strides are not requested.  This variable may be
                     set to NULL when called if memory is C-style
                     contiguous.
                
      This view object should be used in the other API call and 
      does not need to be decref'd.  It should be "released" if the
      interface exporter provides the bf_releasebuffer function.
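
      The defaults described above for omitted arguments can be sketched
      as follows.  This is written in Python rather than C purely for
      illustration; the helper name describe_buffer is hypothetical and
      is not part of the proposed API.  It relies only on the existing
      struct.calcsize to obtain the itemsize.

```python
import struct

def describe_buffer(length, fmt=None, shape=None, strides=None):
    """Fill in the defaults the text describes for omitted arguments."""
    if fmt is None:
        fmt = "B"                      # unsigned bytes
    itemsize = struct.calcsize(fmt)
    if shape is None:
        shape = (length // itemsize,)  # treated as one-dimensional
    if strides is None:
        # C-style contiguous: the last dimension varies fastest.
        strides = []
        step = itemsize
        for extent in reversed(shape):
            strides.append(step)
            step *= extent
        strides = tuple(reversed(strides))
    # Check the invariant shape[0] * ... * shape[ndims-1] * itemsize == len.
    count = 1
    for extent in shape:
        count *= extent
    if count * itemsize != length:
        raise ValueError("shape and format do not match the buffer length")
    return fmt, itemsize, tuple(shape), strides
```

      For a 96-byte buffer of doubles viewed as a 3x4 array, the sketch
      yields byte strides of (32, 8): 8 bytes to the next column and 32
      bytes to the next row.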

    typedef int (*releasebufferproc)(PyObject *view);

      This function is called (if defined by the exporting object)
      when a view of memory previously acquired from the object is no
      longer needed.  It is up to the exporter of the API to make sure
      all views have been released before re-allocating any previously
      shared memory.  It is up to consumers of the API to call this
      function on the object whose view is obtained when it is no
      longer needed.   Any format string, shape array or strides array
      returned through the interface should also not be referenced after 
      the releasebuffer call is made. 
      A -1 is returned on error and 0 on success.

    Both of these routines are optional for a type object.


New C-API calls are proposed

   int 
   PyObject_CheckBuffer(PyObject *obj)

      Return 1 if the getbuffer function is available, otherwise 0.

   PyObject * 
   PyObject_GetBuffer(PyObject *obj, void **buf, Py_ssize_t *len,
                      int *writeable, char **format, int *ndims,
                      Py_ssize_t **shape, Py_ssize_t **strides)

      Get the buffer and optional information variables about the buffer.
      Return an object-specific view object (which may be simply a 
      borrowed reference to the object itself). 
      
   int
   PyObject_ReleaseBuffer(PyObject *view)
      
      Call this function to tell obj that you are done with your "view".
      This doesn't do anything if the object doesn't implement a release
      function.  Only call this after a previous PyObject_GetBuffer has
      succeeded and when you will not be needing or referring to the
      memory (or the format, shape, and strides memory used in the view;
      if you will use these for a longer period of time, make copies).
      Returns -1 on error.
      
   int PyObject_SizeFromFormat(char *)
      Return the implied itemsize of the data-format area from a struct-style
      description. 
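
      For format codes that the struct module already understands, the
      existing struct.calcsize gives a feel for what the proposed
      PyObject_SizeFromFormat would return.  This is only an analogy:
      the extended codes proposed below are not part of the existing
      struct syntax, and the variable names are illustrative.

```python
import struct

# '=' selects standard sizes with no alignment padding, so these
# results do not depend on the platform.
rgb_pixel = struct.calcsize("=BBB")   # three unsigned bytes
record = struct.calcsize("=iHBB")     # int, unsigned short, two bytes
doubles = struct.calcsize("=16d")     # sixteen doubles

print(rgb_pixel, record, doubles)
```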


Additions to the struct string-syntax

   The struct string-syntax is missing some characters to fully
   implement data-format descriptions already available elsewhere (in
   ctypes and NumPy for example).  Here are the proposed additions:

   Character         Description
   =============================================================
   't'               bit (number before states how many bits)
   '?'               platform _Bool type
   'g'               long double  
   'Z'               complex (whatever the next specifier is)
   'c'               ucs-1 (latin-1) encoding 
   'u'               ucs-2 
   'w'               ucs-4 
   'O'               pointer to Python Object 
   'T{}'             structure (detailed layout inside {}) 
   '(k1,k2,...,kn)'  multi-dimensional array of whatever follows 
   ':name:'          optional name of the preceding element 
   '&'               specific pointer (prefix before another character) 
   'X{}'             pointer to a function (optional function 
                                             signature inside {})
   ' '               ignored (allow better readability)

   The struct module will be changed to understand these as well and
   return appropriate Python objects on unpacking.  Unpacking a
   long-double will return a ctypes long_double.  Unpacking 'u' or
   'w' will return Python unicode.  Unpacking a multi-dimensional
   array will return a list of lists.  Unpacking a pointer will
   return a ctypes pointer object.  Unpacking a bit will return a
   Python Bool.  Spaces in the struct-string syntax will be ignored.

   Endian-specification ('=','>','<') is also allowed inside the
   string so that it can change if needed.  The previously-specified
   endian string is enforced until changed.  The default endian is '='.

   According to the struct-module, a number can precede a character
   code to specify how many of that type there are.  The
   (k1,k2,...,kn) extension also allows specifying if the data is
   supposed to be viewed as a (C-style contiguous, last-dimension
   varies the fastest) multi-dimensional array of a particular format.
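
   As a rough sketch of "unpacking a multi-dimensional array returns a
   list of lists", the following uses only the existing struct syntax
   (a repeat count before the code) and folds the flat result by hand.
   The helper name unpack_array is hypothetical, and the shape is
   passed separately because today's struct module does not parse
   '(k1,k2,...,kn)'.

```python
import struct

def unpack_array(shape, code, data):
    """Unpack a C-contiguous array of one struct code into nested lists."""
    count = 1
    for extent in shape:
        count *= extent
    flat = list(struct.unpack("=%d%s" % (count, code), data))
    # Fold the flat list one dimension at a time, innermost first.
    for extent in reversed(shape[1:]):
        flat = [flat[i:i + extent] for i in range(0, len(flat), extent)]
    return flat

data = struct.pack("=6d", 0.0, 1.0, 2.0, 3.0, 4.0, 5.0)
rows = unpack_array((2, 3), "d", data)
print(rows)
```

   Because the layout is C-style contiguous, the last dimension is
   folded first, so each innermost list holds consecutive elements.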

   Functions should be added to ctypes to create a ctypes object from
   a struct description, and long-double and ucs-2 types should be
   added to ctypes.

Examples of Data-Format Descriptions

   Here are some examples of C-structures and how they would be
   represented using the struct-style syntax:

   float                     
      'f'
   complex double
      'Zd'
   RGB Pixel data
      'BBB' or 'B:r: B:g: B:b:'
   Mixed endian (weird but possible)
      '>i:big: <i:little:'
   Nested structure
      struct {
             int ival;
             struct {
                 unsigned short sval;
                 unsigned char bval;
                 unsigned char cval;
             } sub;
      }
      'i:ival: T{H:sval: B:bval: B:cval:}:sub:'
   Nested array
      struct {
             int ival;
             double data[16*4];
      }
      'i:ival: (16,4)d:data:'
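
   For comparison, the last two examples can already be written with
   ctypes, which is one of the existing data-format descriptions the
   extended syntax is meant to match.  The class names below are
   illustrative only, and the sizes assume native alignment.

```python
import ctypes

class Sub(ctypes.Structure):
    # T{H:sval: B:bval: B:cval:}
    _fields_ = [("sval", ctypes.c_ushort),
                ("bval", ctypes.c_ubyte),
                ("cval", ctypes.c_ubyte)]

class Nested(ctypes.Structure):
    # i:ival: T{...}:sub:
    _fields_ = [("ival", ctypes.c_int),
                ("sub", Sub)]

class WithArray(ctypes.Structure):
    # i:ival: (16,4)d:data:  (sixteen rows of four doubles)
    _fields_ = [("ival", ctypes.c_int),
                ("data", (ctypes.c_double * 4) * 16)]

record = WithArray()
print(ctypes.sizeof(Sub), Nested.sub.offset,
      len(record.data), len(record.data[0]))
```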

Code to be affected

   All objects and modules in Python that export or consume the old
   buffer interface will be modified.  Here is a partial list.
     
   * buffer object
   * bytes object
   * string object
   * array module
   * struct module
   * mmap module
   * ctypes module

   * anything else using the buffer API



Issues and Details


   The proposed locking mechanism relies entirely on the objects
   implementing the buffer interface to do their own thing.  Ideally
   an object that implements the buffer interface should keep at least
   a number indicating how many views are extant.  If there are views
   to a memory location, then any subsequent reallocation should fail
   and raise an error.
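
   A minimal sketch of such a counter is shown below, in Python for
   brevity.  The class and method names are hypothetical stand-ins for
   an exporter's bf_getbuffer and bf_releasebuffer slots, and the choice
   of BufferError as the exception is an assumption.

```python
class Exporter:
    """Toy exporter that refuses to reallocate while views are held."""

    def __init__(self, size):
        self._memory = bytearray(size)
        self._views = 0               # number of unreleased views

    def getbuffer(self):
        self._views += 1
        return self._memory

    def releasebuffer(self):
        if self._views == 0:
            raise RuntimeError("release without a matching getbuffer")
        self._views -= 1

    def resize(self, size):
        if self._views:
            raise BufferError("cannot reallocate: views are still held")
        self._memory = bytearray(size)

exporter = Exporter(8)
view = exporter.getbuffer()
try:
    exporter.resize(16)               # fails: one view is outstanding
except BufferError as exc:
    print(exc)
exporter.releasebuffer()
exporter.resize(16)                   # succeeds once the view is released
```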

   The sharing of strided memory is new and can be seen as a
   modification of the multiple-segment interface.  It is motivated by
   NumPy.  NumPy objects should be able to share their strided memory
   with code that understands how to manage strided memory because
   strided memory is very common when interfacing with compute libraries.

   Currently the struct module does not allow specification of nested
   structures.  It seems that specifying a nested structure should be
   supported, since several ways of viewing memory areas (e.g. ctypes
   and NumPy) already allow this.

   Memory management of the format string and the shape and strides
   array is always the responsibility of the exporting object and can
   be shared between different views. If the consuming object needs to
   keep these memory areas longer than the view is held, then it must
   copy them to its own memory. 


Copyright

   This PEP is placed in the public domain.

