On 9/18/07, Jim Jewett <[EMAIL PROTECTED]> wrote:
> On 9/18/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> > On 9/18/07, Jim Jewett <[EMAIL PROTECTED]> wrote:
> > > On 9/18/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
>
> > > > There's no UTF-8 in Python's internal string encoding.
>
> > > (At least as of a few days ago)
>
> > > In Python 3 there is; strings are unicode. A PyUnicodeObject object
> > > has two encodings that you can grab from a pointer (which means
> > > they have to be there; you don't have time to generate them like
> > > you would with a function pointer).
>
> > Incorrect. The pointer can be NULL.
>
> I had missed that comment, but I do see it now; thank you.
>
> > The API for getting the UTF-8 encoding is a function
>
> Thank you. But given that defenc is now always UTF-8, won't exposing
> it in the public typedef then just be an attractive nuisance?

*ALL* fields of the struct def are strictly internal.

> > (moreover a function whose name starts with _Py).
>
> That I still don't see.

I am talking about _PyUnicode_AsDefaultEncoding(). (Which you
shouldn't be calling. :-)

> http://svn.python.org/view/python/branches/py3k/Include/unicodeobject.h?rev=57656&view=markup
>
> PyAPI_FUNC(PyObject*) PyUnicode_AsUTF8String(
>     PyObject *unicode           /* Unicode object */
>     );
>
> PyAPI_FUNC(PyObject*) PyUnicode_EncodeUTF8(
>     const Py_UNICODE *data,     /* Unicode char buffer */
>     Py_ssize_t length,          /* number of Py_UNICODE chars to encode */
>     const char *errors          /* error handling */
>     );
>
>
> Later, the same file shows me:
>
> /* --- Unicode Type ------------------------------------------------------- */
>
> typedef struct {
>     PyObject_HEAD
>     Py_ssize_t length;          /* Length of raw Unicode data in buffer */
>     Py_UNICODE *str;            /* Raw Unicode buffer */
>     long hash;                  /* Hash value; -1 if not set */
>     int state;                  /* != 0 if interned. In this case the two
>                                  * references from the dictionary to this object
>                                  * are *not* counted in ob_refcnt. */
>     PyObject *defenc;           /* (Default) Encoded version as Python
>                                    string, or NULL; this is used for
>                                    implementing the buffer protocol */
> } PyUnicodeObject;
>
>
> I would be happier with:
>
> typedef struct {
>     PyObject_VAR_HEAD           /* Length in code points, not chars */
> } PyUnicodeObject;
>
> And, in unicodeobject.c (*not* in a public header)
>
> typedef struct {
>     PyUnicodeObject ob_unicodehead;
>     Py_UNICODE *str;            /* Raw Unicode buffer */
>     long hash;                  /* Hash value; -1 if not set */
>     int state;                  /* != 0 if interned. In this case the two
>                                  * references from the dictionary to this object
>                                  * are *not* counted in ob_refcnt. */
>     PyObject *defenc;           /* (Default) Encoded version as Python
>                                    string, or NULL; this is used for
>                                    implementing the buffer protocol */
> } _PyDefaultUnicodeObject;
>
> As this would allow 3rd parties to create implementations specialized
> for (and saving space on) smaller alphabets, without breaking C
> extensions that stick to the public header files. (Moving hash or
> even state to the public header might be OK too, but they seemed to
> get ignored for subclasses anyhow.)

That is not a supported use case.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000
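
For readers trying to picture the layout discussed in this thread, here is a minimal stand-alone sketch in plain C. It is not CPython code and does not use Python.h. Every name in it (DemoStr, DemoStrImpl, demo_new, demo_as_utf8, demo_free) is hypothetical and invented for illustration. It assumes a "smaller alphabet" implementation that stores one byte per code point (the Latin-1 range), and it shows two points from the exchange: a small public head struct with the real fields kept in a private struct, and a cached encoding pointer that stays NULL until an accessor function fills it in. As the reply in the thread says, this is not a supported use case for PyUnicodeObject itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* "Public header" part (hypothetical): the only fields callers may rely on. */
typedef struct {
    size_t length;            /* length in code points */
} DemoStr;

/* "Private" part (hypothetical): would live only in the implementing .c file. */
typedef struct {
    DemoStr head;             /* must be the first member so the pointers alias */
    unsigned char *data;      /* assumed narrow storage: one byte per code point */
    char *utf8_cache;         /* cached UTF-8 form, or NULL until first requested */
} DemoStrImpl;

/* Create a string from n Latin-1 bytes; returns NULL on allocation failure. */
DemoStr *demo_new(const unsigned char *latin1, size_t n)
{
    DemoStrImpl *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->data = malloc(n ? n : 1);
    if (s->data == NULL) {
        free(s);
        return NULL;
    }
    memcpy(s->data, latin1, n);
    s->head.length = n;
    s->utf8_cache = NULL;     /* the pointer can be NULL: nothing is encoded yet */
    return &s->head;
}

/* Accessor: callers go through this function instead of reading the field.
 * The cache is built on the first call and reused afterwards. */
const char *demo_as_utf8(DemoStr *pub)
{
    DemoStrImpl *s = (DemoStrImpl *)pub;  /* valid because head is first */
    assert(pub != NULL);
    if (s->utf8_cache == NULL) {
        /* A code point below 0x100 needs at most two UTF-8 bytes. */
        char *out = malloc(2 * pub->length + 1);
        size_t i, j = 0;
        if (out == NULL)
            return NULL;
        for (i = 0; i < pub->length; i++) {
            unsigned char c = s->data[i];
            if (c < 0x80) {
                out[j++] = (char)c;
            } else {
                out[j++] = (char)(0xC0 | (c >> 6));
                out[j++] = (char)(0x80 | (c & 0x3F));
            }
        }
        out[j] = '\0';
        s->utf8_cache = out;
    }
    return s->utf8_cache;
}

/* Release the string together with its cached encoding, if any. */
void demo_free(DemoStr *pub)
{
    DemoStrImpl *s = (DemoStrImpl *)pub;
    if (s == NULL)
        return;
    free(s->utf8_cache);
    free(s->data);
    free(s);
}
```

A caller would create a value with demo_new(), read only length from the public struct, and obtain the encoded bytes through demo_as_utf8(). Because the cache field is hidden behind the accessor, code that sticks to the public struct never observes the NULL state.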