Re: [Python-Dev] Extended Buffer Interface/Protocol

Travis Oliphant Wed, 21 Mar 2007 13:22:44 -0800

Attached is the PEP.

:PEP: XXX
:Title: Revising the buffer protocol
:Version: $Revision: $
:Last-Modified: $Date:  $
:Author: Travis Oliphant <[EMAIL PROTECTED]>
:Status: Draft
:Type: Standards Track
:Content-Type: text/x-rst
:Created: 28-Aug-2006
:Python-Version: 3000


Abstract
========

This PEP proposes re-designing the buffer API (PyBufferProcs
function pointers) to improve the way Python allows memory sharing
in Python 3.0.

In particular, it is proposed that the multiple-segment and
character buffer portions of the buffer API be eliminated and
additional function pointers be provided to allow sharing any
multi-dimensional nature of the memory and what data-format the
memory contains.

Rationale
=========

The buffer protocol allows different Python types to exchange a
pointer to a sequence of internal buffers.  This functionality is
*extremely* useful for sharing large segments of memory between
different high-level objects, but it is too limited and has the
following problems:

1. There is the little-used (never used?) "sequence-of-segments"
   option (bf_getsegcount).

2. There is the apparently redundant character-buffer option
   (bf_getcharbuffer).

3. There is no way for a consumer to tell the buffer-API-exporting
   object it is "finished" with its view of the memory and
   therefore no way for the exporting object to be sure that it is
   safe to reallocate the pointer to the memory that it owns (for
   example, the array object reallocating its memory after sharing
   it with the buffer object which held the original pointer led
   to the infamous buffer-object problem).

4. Memory is just a pointer with a length. There is no way to
   describe what is "in" the memory (float, int, C-structure, etc.).

5. There is no shape information provided for the memory.  However,
   several array-like Python types could make use of a standard
   way to describe the shape-interpretation of the memory
   (wxPython, GTK, PyQt, CVXOPT, PyVox, audio and video
   libraries, ctypes, NumPy, database interfaces, etc.).

6. There is no way to share discontiguous memory (except through
   the sequence of segments notion).  

   There are two widely used libraries that use the concept of
   discontiguous memory: PIL and NumPy.  Their views of discontiguous
   arrays are different, though.  This buffer interface allows
   sharing of either memory model.  An exporter will use only one
   approach, and consumers may support discontiguous arrays of
   either type however they choose.

   NumPy uses the notion of constant striding in each dimension as its
   basic concept of an array.  With this concept, a simple sub-region
   of a larger array can be described without copying the data.
   Thus, stride information is the additional information that must be
   shared.

   The PIL uses a more opaque memory representation. Sometimes an
   image is contained in a contiguous segment of memory, but sometimes
   it is contained in an array of pointers to the contiguous segments
   (usually lines) of the image.  The PIL is where the idea of multiple
   buffer segments in the original buffer interface came from. 
  

   NumPy's strided memory model is used more often in computational
   libraries and, because it is so simple, it makes sense to support
   memory sharing using this model.  The PIL memory model is often used
   in C code where a 2-d array can then be accessed using double
   pointer indirection, e.g. ``image[i][j]``.

   The buffer interface should allow the object to export either of these
   memory models.  Consumers are free to either require contiguous memory
   or write code to handle either memory model.  
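
The two memory models above can be contrasted with a short sketch.  It
is illustrative only: ``strided_get`` is a hypothetical helper, and a
Python list of separate ``bytes`` objects merely stands in for a C
array of pointers to image lines.

```python
# A 3x4 array of unsigned bytes stored row by row in one flat buffer.
flat = bytes(range(12))
strides = (4, 1)  # bytes to skip per step in each dimension


def strided_get(buf, offset, strides, index):
    """Read one byte at an N-d index from a strided buffer (hypothetical helper)."""
    pos = offset + sum(i * s for i, s in zip(index, strides))
    return buf[pos]


# Element (1, 2) of the full array.
full = strided_get(flat, 0, strides, (1, 2))

# A 2x2 sub-region starting at (1, 1): same buffer and same strides,
# only the starting offset and the shape change.  Nothing is copied.
sub_offset = 1 * strides[0] + 1 * strides[1]
sub = [[strided_get(flat, sub_offset, strides, (i, j)) for j in range(2)]
       for i in range(2)]

# PIL-style model: one separately allocated segment per image line,
# modelled here as a list of independent bytes objects.
lines = [bytes([0, 1, 2, 3]), bytes([4, 5, 6, 7]), bytes([8, 9, 10, 11])]
pil = lines[1][2]  # the Python analogue of image[i][j] in C
```

In the strided model a consumer needs only the base pointer, the shape
and the strides; in the array-of-pointers model it must follow one
pointer per line.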

Proposal Overview
=================

* Eliminate the char-buffer and multiple-segment sections of the
  buffer-protocol.

* Unify the read/write versions of getting the buffer.

* Add a new function to the interface that should be called when
  the consumer object is "done" with the view.

* Add a new variable to allow the interface to describe what is in
  memory (unifying what is currently done in struct and array).

* Add a new variable to allow the protocol to share shape information.

* Add a new variable for sharing stride information.

* Add a new mechanism for sharing arrays of arrays.

* Fix all objects in the core and the standard library to conform
  to the new interface

* Extend the struct module to handle more format specifiers

Specification
=============

Change the PyBufferProcs structure to

::

    typedef struct {
         getbufferproc bf_getbuffer;
         releasebufferproc bf_releasebuffer;
    } PyBufferProcs;

::

    typedef PyObject *(*getbufferproc)(PyObject *obj, void **buf,
                                       Py_ssize_t *len, int *writeable,
                                       char **format, int *ndims,
                                       Py_ssize_t **shape,
                                       Py_ssize_t **strides,
                                       void **segments);

All variables except the first are optional.  Use NULL for all
un-needed variables.  Thus, this function can be called to get only
the desired information from an object. NULL is returned on failure.
On success an object-specific view is returned (which may just be a
borrowed reference to obj).  This view should be passed to
bf_releasebuffer when the consumer is done with the view.

buf
    address of a variable in which a pointer to the start of the
    memory for the object is returned (in ``*buf``).

len
    address of a ``Py_ssize_t`` variable to hold the total bytes
    of memory the object uses.  This should be the same
    as the product of the shape array multiplied by the
    number of bytes per item of memory.

writeable
    address of an integer variable to hold whether or not the memory
    is writeable. If this is NULL, then you must assume the memory
    is read-only.

format
    address of a format-string (following extended struct
    syntax) indicating what is in each element of
    memory.  The number of elements is len / itemsize,
    where itemsize is the number of bytes implied by the format.
    May be NULL if not needed, in which case the format is assumed
    to be "B" (unsigned bytes).  The memory for this string must not
    be freed by the consumer; it is managed by the exporter.

ndims
    address of a variable storing the number of dimensions
    or NULL if not needed.  If shape and/or strides are given
    then this must be non-NULL.  If this variable is
    not provided then it is assumed that ``*ndims == 1``.

shape
    address of a ``Py_ssize_t*`` variable that will be filled
    with a pointer to an array of ``Py_ssize_t`` of length ``*ndims``
    indicating the shape of the memory as an N-D array.
    Ignored if this is NULL.  Note that
    ``((*shape)[0] * ... * (*shape)[*ndims-1]) * itemsize == len``.
    If this variable is not provided then it is assumed that
    ``(*shape)[0] == len / itemsize``.


strides 
    address of a ``Py_ssize_t*`` variable that will be filled with a
    pointer to an array of ``Py_ssize_t`` of length ``*ndims``
    indicating the number of bytes to skip to get to the next element
    in each dimension.  If this is NULL, then the memory is assumed to
    be C-style contiguous with the last dimension varying the fastest.
    An error should be raised if this is not accurate and strides are
    not requested.  This variable may be set to NULL (with no error
    set) if memory is actually C-style contiguous.


segments
    address of a variable used for the array-of-pointers-style memory
    model.  Only one of strides or segments can be used (the other one
    must be NULL).  If the object does not support this kind of memory
    model and it is requested, then an error should be raised and
    ``*segments`` set to NULL.  The segments variable should be recast
    to a pointer-to-a-pointer-to-a-pointer-...-to-a-pointer depending
    on the output of ndims.

    Thus, if ndims is 2, segments should be cast to ``(<type> ***)``
    so that ``(*segments)[i][j]`` refers to the (i,j)th element
    of the array.  If ndims is 3, segments should be cast to
    ``(<type> ****)`` so that ``(*segments)[i][j][k]`` refers to the
    (i,j,k)th element of the array.
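
The relationship between ``len``, ``shape`` and the default C-style
contiguous strides can be checked with a short sketch.  The helper
``c_contiguous_strides`` is hypothetical, and ``memoryview`` is used
only as a convenient existing object that reports a shape and strides;
it is not the C-level signature proposed here.  The sketch assumes a
Python version providing ``math.prod`` and ``memoryview.cast``.

```python
import struct
from math import prod


def c_contiguous_strides(shape, itemsize):
    """Strides in bytes when the last dimension varies the fastest."""
    strides = []
    step = itemsize
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))


itemsize = struct.calcsize("d")   # bytes implied by the format
shape = (2, 3)
length = prod(shape) * itemsize   # the "len" invariant from the text
strides = c_contiguous_strides(shape, itemsize)

# Cross-check against a real buffer viewed as a 2x3 array of doubles.
view = memoryview(bytes(length)).cast("d", shape=list(shape))
```

A consumer that receives no strides array may assume exactly this
layout.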


The view object is passed to the other API calls and does not need
to be decref'd.  It should be "released" if the interface exporter
provides the bf_releasebuffer function.  Otherwise, it may be
discarded.  The view object is exporter-specific.


``typedef int (*releasebufferproc)(PyObject *view)``
    This function is called (if defined by the exporting object)
    when a view of memory previously acquired from the object is no
    longer needed.  It is up to the exporter of the API to make sure
    all views have been released before re-allocating any previously
    shared memory.  It is up to consumers of the API to call this
    function on the object whose view is obtained when it is no
    longer needed.   Any format string, shape array or strides array
    returned through the interface should also not be referenced after
    the releasebuffer call is made.
    A -1 is returned on error and 0 on success.

    Both of these routines are optional for a type object.
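
The acquire/release contract can be illustrated from the consumer's
side.  This is a sketch that uses ``bytearray`` and ``memoryview`` as
stand-ins for an exporter and its view; it assumes a Python version in
which ``memoryview`` has a ``release()`` method, and it does not use
the C calls proposed in this document.

```python
data = bytearray(b"abcd")
view = memoryview(data)

# Copy anything that must outlive the view before releasing it.
shape = tuple(view.shape)
fmt = str(view.format)

# While the view is held, the exporter refuses to reallocate.
try:
    data.extend(b"ef")
    resized_while_exported = True
except BufferError:
    resized_while_exported = False

# Releasing the view tells the exporter it is safe to resize again.
view.release()
data.extend(b"ef")

# The released view (and its shape, format, strides) is now off limits.
try:
    view.shape
    usable_after_release = True
except ValueError:
    usable_after_release = False
```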


New C-API calls are proposed
============================

::

    int PyObject_CheckBuffer(PyObject *obj)

Return 1 if the getbuffer function is available, otherwise 0.

::

    PyObject * PyObject_GetBuffer(PyObject *obj, void **buf,
                                  Py_ssize_t *len, int *writeable,
                                  char **format, int *ndims,
                                  Py_ssize_t **shape, Py_ssize_t **strides,
                                  void **segments)

Get the buffer and optional information variables about the buffer.
Return an object-specific view object (which may be simply a
borrowed reference to the object itself).

::

    int PyObject_ReleaseBuffer(PyObject *view)

Call this function to tell obj that you are done with your "view".
This does nothing if the object does not implement a release function.
Only call this after a previous PyObject_GetBuffer has succeeded and when
you will no longer need or refer to the memory, or to the format, shape,
and strides memory used in the view.  If you need these for a longer
period of time, make copies first.
Returns -1 on error.

::

    int PyObject_SizeFromFormat(char *)

Return the implied itemsize of the data-format area from a struct-style
description.
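
For format strings that the struct module already understands, the
implied itemsize is what ``struct.calcsize`` reports.  The wrapper
``size_from_format`` below is a hypothetical Python stand-in for the
proposed C call and handles none of the new format characters.

```python
import struct


def size_from_format(fmt):
    """Hypothetical stand-in: itemsize implied by a struct-style format."""
    return struct.calcsize(fmt)


# "<" selects standard sizes with no padding, so results are portable.
sizes = {f: size_from_format(f) for f in ("B", "<i", "<d", "<HBB", "<3B")}
```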

::

    int PyObject_GetContiguous(PyObject *obj, void **buf, Py_ssize_t *len)

Return a contiguous chunk of memory representing the buffer.  If a
copy is made then return 1.  If no copy was needed return 0.  If an
error occurred in probing the buffer interface, then return -1.  The
contiguous chunk of memory is pointed to by ``*buf`` and the length
of that memory is ``*len``.  The buffer is C-style contiguous
meaning the last dimension varies the fastest. 
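
The copy-only-if-needed behaviour might look as follows in Python.
This is a sketch: ``get_contiguous`` is a hypothetical function that
mirrors the 0/1 return convention described above, and it assumes a
``memoryview`` with the ``c_contiguous`` attribute.

```python
def get_contiguous(obj):
    """Return (copied, buffer): copied is 1 if a copy was made, else 0."""
    view = memoryview(obj)
    if view.c_contiguous:
        return 0, view            # memory can be shared as-is
    return 1, view.tobytes()      # gather the elements into a new block


whole = bytes(range(10))
copied_a, buf_a = get_contiguous(whole)

# Every second byte: a strided, discontiguous view of the same memory.
every_other = memoryview(whole)[::2]
copied_b, buf_b = get_contiguous(every_other)
```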

:: 

    int PyObject_CopyToObject(PyObject *obj, void *buf, Py_ssize_t len)

Copy ``len`` bytes from the contiguous chunk of memory pointed to by
``buf`` into the buffer exported by obj.  Return 0 on success; on
failure, return -1 and raise an error.  If the object does not have a
writeable buffer, then an error is raised.  The data is copied into an
array in C-style contiguous fashion, meaning the last dimension varies
the fastest.

The last two C-API calls allow a standard way of getting data in and out
of Python objects no matter how it is actually stored.  These calls use
the buffer interface to perform their work. 
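
The reverse direction can be sketched the same way.  The function
``copy_to_object`` is hypothetical; it raises Python exceptions where
the C call would return -1, and it only handles exporters whose memory
is C-style contiguous.

```python
def copy_to_object(obj, data):
    """Copy the bytes in data into the writeable buffer exported by obj."""
    with memoryview(obj) as view:
        if view.readonly:
            raise TypeError("object does not export a writeable buffer")
        if view.nbytes != len(data):
            raise ValueError("source and destination sizes differ")
        # View the destination as raw bytes, whatever its item format.
        with view.cast("B") as raw:
            raw[:] = data
    return 0


target = bytearray(4)
status = copy_to_object(target, b"\x01\x02\x03\x04")

# A read-only exporter such as bytes must be rejected.
try:
    copy_to_object(b"abcd", b"wxyz")
    rejected = False
except TypeError:
    rejected = True
```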



Additions to the struct string-syntax
=====================================

The struct string-syntax is missing some characters to fully
implement data-format descriptions already available elsewhere (in
ctypes and NumPy for example).  Here are the proposed additions:


================  ===========
Character         Description
================  ===========
't'               bit (number before states how many bits)
'?'               platform _Bool type
'g'               long double  
'Z'               complex (whatever the next specifier is)
'c'               ucs-1 (latin-1) encoding 
'u'               ucs-2 
'w'               ucs-4 
'O'               pointer to Python Object 
'T{}'             structure (detailed layout inside {}) 
'(k1,k2,...,kn)'  multi-dimensional array of whatever follows 
':name:'          optional name of the preceding element 
'&'               specific pointer (prefix before another character) 
'X{}'             pointer to a function (optional function 
                                         signature inside {})
' '               ignored (allow better readability)
================  ===========

The struct module will be changed to understand these as well and
return appropriate Python objects on unpacking.  Unpacking a
long double will return a decimal object.  Unpacking 'u' or
'w' will return Python unicode.  Unpacking a multi-dimensional
array will return a list of lists.  Unpacking a pointer will
return a ctypes pointer object.  Unpacking a bit will return a
Python bool.  Spaces in the struct-string syntax will be ignored.
Unpacking a named object will return a Python class with attributes
having those names.

Endian-specification ('=','>','<') is also allowed inside the
string so that it can change if needed.  The previously-specified
endian string is enforced until changed.  The default endian is '='.

According to the struct-module, a number can precede a character
code to specify how many of that type there are.  The
(k1,k2,...,kn) extension also allows specifying if the data is
supposed to be viewed as a (C-style contiguous, last-dimension
varies the fastest) multi-dimensional array of a particular format.
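
The behaviour these extensions aim for can be approximated with the
existing struct syntax.  The sketch below is an emulation, not the
proposed syntax: it assumes a struct module that accepts a byte-order
character only at the start of the format, so the mixed-endian case is
built from two calls and the ``(2,3)`` array is regrouped by hand.

```python
import struct

# Byte order applies to everything that follows it in the format.
big = struct.pack(">i", 1)
little = struct.pack("<i", 1)

# Emulates the proposed '>i <i': one big-endian and one little-endian int.
mixed = struct.pack(">i", 1) + struct.pack("<i", 1)

# A count before a code repeats that code.
triple = struct.unpack("<3B", b"\x01\x02\x03")

# Emulates the proposed '(2,3)B': unpack six items, then regroup them
# row by row (last dimension varies the fastest) into a list of lists.
flat = struct.unpack("<6B", bytes(range(6)))
grid = [list(flat[row * 3:(row + 1) * 3]) for row in range(2)]
```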

Functions should be added to ctypes to create a ctypes object from
a struct description, and long double and ucs-2 types should be
added to ctypes.

Examples of Data-Format Descriptions
====================================

Here are some examples of C-structures and how they would be
represented using the struct-style syntax:

float
    'f'
complex double
    'Zd'
RGB Pixel data
    'BBB' or 'B:r: B:g: B:b:'
Mixed endian (weird but possible)
    '>i:big: <i:little:'
Nested structure
    ::

        struct {
             int ival;
             struct {
                 unsigned short sval;
                 unsigned char bval;
                 unsigned char cval;
             } sub;
        }
        'i:ival: T{H:sval: B:bval: B:cval:}:sub:'
Nested array
    ::

        struct {
             int ival;
             double data[16*4];
        }
        'i:ival: (16,4)d:data:'
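
Several of these layouts can already be approximated with existing
struct codes by dropping the names and flattening the nesting.  This is
a sketch under stated assumptions: ``'Zd'`` is taken to mean a (real,
imaginary) pair of doubles, and ``<`` is used so that sizes are
standard and unpadded, which may differ from the layout a C compiler
chooses for the same struct.

```python
import struct

# RGB pixel data: 'BBB'
pixel = struct.pack("<BBB", 255, 128, 0)

# complex double: 'Zd' approximated as two doubles (assumption).
z = 1.5 - 2.0j
packed_z = struct.pack("<dd", z.real, z.imag)

# Nested structure 'i:ival: T{H:sval: B:bval: B:cval:}:sub:' flattened.
nested = struct.pack("<iHBB", 7, 500, 1, 2)
ival, sval, bval, cval = struct.unpack("<iHBB", nested)

# Nested array 'i:ival: (16,4)d:data:' has one int plus 16*4 doubles.
record_size = struct.calcsize("<i64d")
```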

Code to be affected
===================

All objects and modules in Python that export or consume the old
buffer interface will be modified.  Here is a partial list.

* buffer object
* bytes object
* string object
* array module
* struct module
* mmap module
* ctypes module

* Anything else using the buffer API


Issues and Details
==================

The proposed locking mechanism relies entirely on the objects
implementing the buffer interface to do their own thing.  Ideally, an
object that implements the buffer interface and can re-allocate
memory should store in its structure at least a number indicating how
many views are extant.  If there are still un-released views to a
memory location, then any subsequent reallocation should fail and
raise an error.
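
The view-counting scheme described above could be sketched as follows.
``CountingExporter`` and its method names are hypothetical and exist
only to illustrate the bookkeeping; a real exporter would implement
this in C behind bf_getbuffer and bf_releasebuffer.

```python
class CountingExporter:
    """Hypothetical exporter that tracks how many views are extant."""

    def __init__(self, size):
        self._data = bytearray(size)
        self._views = 0

    def get_buffer(self):
        # Hand out the memory and remember that a view now exists.
        self._views += 1
        return self._data

    def release_buffer(self):
        if self._views == 0:
            raise RuntimeError("no view to release")
        self._views -= 1

    def resize(self, new_size):
        # Reallocation must fail while any view is un-released.
        if self._views:
            raise BufferError("cannot resize: views are still held")
        new = bytearray(new_size)
        keep = min(new_size, len(self._data))
        new[:keep] = self._data[:keep]
        self._data = new

    def __len__(self):
        return len(self._data)


exporter = CountingExporter(4)
exporter.get_buffer()
try:
    exporter.resize(8)
    blocked = False
except BufferError:
    blocked = True

exporter.release_buffer()
exporter.resize(8)
```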

The sharing of strided memory is new and can be seen as a
modification of the multiple-segment interface.  It is motivated by
NumPy.  NumPy objects should be able to share their strided memory
with code that understands how to manage strided memory because
strided memory is very common when interfacing with compute libraries.

Currently the struct module does not allow specification of nested
structures.  The requested modifications to struct allow nested
structures to be specified, because several ways of viewing memory
areas (e.g. ctypes and NumPy) already allow this.

Memory management of the format string and the shape and strides
arrays is always the responsibility of the exporting object, and
these can be shared between different views. If the consuming object
needs to
keep these memory areas longer than the view is held, then it must
copy them to its own memory.


Copyright
=========

This PEP is placed in the public domain.
