Re: [Python-3000] string C API

Jim Jewett Tue, 03 Oct 2006 08:14:06 -0700

On 10/3/06, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> Jim Jewett schrieb:


> > The problem isn't the hash; it is the equality.  Which encoding do you
> > keep interned?

When I wrote this, I had been assuming that UCS4(string) and
UCS2(string) would be completely unrelated objects.  With more
thought, I realized that might not make sense.  Today, we special-case
the (compile-time) internal Python encoding and the default encoding.

In Python 3, a string object might look like

#define PyObject_str_HEAD   \
    PyObject_VAR_HEAD   \
    long ob_shash;   \
    PyObject *cache;

with a typical concrete implementation looking like

typedef struct {
    PyObject_str_HEAD
    PyObject *encoding;  /* concrete method implementation, not just codecs */
    char data[1];        /* encoded text, co-located with the header;
                            element type and length are placeholders */
} PyAbstractUnicodeObject;

cache is a (weak?) mapping from each encoding to this string in that
encoding, so two strings with the same cache pointer are equal.
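
To make the shared-cache idea concrete, here is a rough Python model of it.
This is only a sketch: EncodedText, recode, and the use of codec names as
cache keys are hypothetical names for illustration, not part of any proposed
API, and the real thing would live in C.

```python
import weakref


class EncodedText:
    """Hypothetical model of one concrete string: raw bytes plus an encoding name."""

    def __init__(self, data, encoding, cache=None):
        self.data = data
        self.encoding = encoding
        # cache maps encoding name -> the EncodedText holding this same text
        # in that encoding. Values are weak, so unused recodings can go away.
        self.cache = cache if cache is not None else weakref.WeakValueDictionary()
        self.cache[encoding] = self

    def recode(self, encoding):
        """Return this text in another encoding, sharing the same cache."""
        cached = self.cache.get(encoding)
        if cached is not None:
            return cached
        text = self.data.decode(self.encoding)
        return EncodedText(text.encode(encoding), encoding, self.cache)

    def __eq__(self, other):
        if not isinstance(other, EncodedText):
            return NotImplemented
        if self.cache is other.cache:
            return True  # same cache pointer: same text, whatever the encoding
        # Slow path for unrelated objects: compare in one common encoding.
        return self.recode("utf-8").data == other.recode("utf-8").data

    def __hash__(self):
        # The hash depends only on the text, never on the encoding.
        return hash(self.data.decode(self.encoding))


a = EncodedText("h\u00e9llo".encode("utf-16-le"), "utf-16-le")
b = a.recode("utf-32-le")   # shares a.cache, so a == b is a pointer check
```

The fast path in __eq__ is the point: once two objects are known to hang off
the same cache, no recoding is needed to compare them.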

This particular implementation would

  +  take the same overhead memory as today's unicode (adding cache
     and encoding, taking out pointers to str and defenc)

  +  co-locate the header and data (which Guido has indicated is
     important)

  -  be slightly worse (by object headers) when exactly two encodings
     are needed, and those two are the compile-time internal format
     and the default encoding

  +  be much better if several encodings are needed, or if recoding
     is never needed

A different concrete implementation for UTF-8 (one that admitted to
being a subclass of unicode/str) could drop the *encoding field, since
the encoding would be implied by the type.


> >> What about never recoding?  The benefit of the latin-1/ucs-2/ucs-4
> >> method I previously described is that each of the encodings offer a
> >> minimal representation of the code points that the text object contains.

> > There may be some thrashing as

> >     s+= (larger char)

> [ Martin's size-independent code ]

oops ... I had been thinking about surrogates, but Josiah's proposal
chooses the size explicitly to avoid them.
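
For reference, a small Python sketch of how I understand the
latin-1/ucs-2/ucs-4 scheme: pick the narrowest fixed-width form that holds
the largest code point, so surrogates never come up. The function names are
made up, and "utf-16-le" and "utf-32-le" are only stand-ins for UCS-2 and
UCS-4 (the input is assumed to contain no lone surrogates).

```python
# Bytes per code point for each fixed-width form used in this sketch.
WIDTH = {"latin-1": 1, "utf-16-le": 2, "utf-32-le": 4}


def narrowest_encoding(text):
    """Pick the smallest fixed-width encoding that holds every code point."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        return "latin-1"
    if highest < 0x10000:
        return "utf-16-le"   # stand-in for UCS-2: BMP only, no surrogate pairs
    return "utf-32-le"       # stand-in for UCS-4


def widths_while_appending(pieces):
    """Record the per-character width after each s += piece."""
    s = ""
    widths = []
    for piece in pieces:
        s += piece
        widths.append(WIDTH[narrowest_encoding(s)])
    return widths


# Each wider character forces the whole string into a wider form.
history = widths_while_appending(["abc", "\u00e9", "\u20ac", "\U0001F600"])
```

The width only ever grows under +=, so the recoding cost here is paid at most
twice per string (1 to 2 bytes, then 2 to 4).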

> > It is easy enough to answer why not for each specific case, but I'm
> > not *certain* that it is the right answer -- so why not leave it up to
> > implementors if they want to do more than the basic three?

> Not sure what implementors you are talking about: anybody who wants
> to clone Python is free to do whatever they want. We *are* the
> implementors of CPython, and if we don't want to do more, then
> we just don't want it.

I meant implementors of concrete string types.

Python is normally pretty good about duck typing, but str is a
notorious exception.

One of the major objections to the Path class was that it had to
inherit from string in order to actually work.  If people used (the
equivalent of) PyUnicode_AsUnicode instead of ->str, this wouldn't
have been so hard.
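
A Python-level analogy, as a sketch only (this Path class and both helper
functions are hypothetical, and the real problem is in C code that reaches
into the struct): code that insists on the concrete type rejects a
string-like object, while code that goes through an accessor accepts it.

```python
class Path:
    """Hypothetical path type that does NOT inherit from str."""

    def __init__(self, *parts):
        self.parts = parts

    def __str__(self):
        return "/".join(self.parts)


def strict_basename(name):
    # Analogous to reaching into the concrete layout: only a real str works.
    if not isinstance(name, str):
        raise TypeError("expected str, got %s" % type(name).__name__)
    return name.rsplit("/", 1)[-1]


def duck_basename(name):
    # Analogous to going through an accessor: anything that can hand
    # over its text is accepted.
    return str(name).rsplit("/", 1)[-1]


p = Path("usr", "lib", "python")
result = duck_basename(p)   # works without Path subclassing str
```

With strict_basename, the only way to make Path usable is to subclass str,
which is exactly the objection described above.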

I also expect that the number of concrete types in the core itself may
increase if adding one is easy.  I don't think any single person
would care enough to maintain all of UCS4, UCS2, Latin-1, Latin-2,
UTF-8, and NSString versions, but it wouldn't surprise me if each of
those found someone who cared enough to maintain it.

-jJ