Hello,

I'm sending this here because it involves both cffi and PyPy.

For the last few years I have been trying to find the most efficient way to pass utf8 strings between PyPy and C code using cffi.

Right now, when PyPy receives a utf8 string (from a C function), it has to make 2 copies:

1. convert the cdata string to a PyPy byte string via ffi.string
2. decode that byte string into a unicode string

When PyPy sends a utf8 string, it also makes 2 copies:

1. encode the PyPy unicode string into a utf8-encoded byte string
2. copy the byte string into a cdata string
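
To make the two round trips above concrete, here is a minimal sketch. It assumes only that cffi is importable; the char[] buffer is a stand-in for a string returned by a real C function, and the variable names are just for illustration.

```python
from cffi import FFI

ffi = FFI()

# Stand-in for a NUL-terminated utf8 string handed back by a C function.
c_buf = ffi.new("char[]", "héllo".encode("utf-8"))

# Receiving: copy 1 (cdata -> byte string), copy 2 (byte string -> unicode).
raw = ffi.string(c_buf)
text = raw.decode("utf-8")

# Sending: copy 1 (unicode -> utf8 byte string), copy 2 (byte string -> cdata).
encoded = text.encode("utf-8")
c_out = ffi.new("char[]", encoded)
```

Each direction allocates an intermediate byte string that is thrown away immediately afterwards.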

From what I understand, there is a cffi optimization for Windows unicode (via set_unicode): on Windows platforms, when using the native Windows unicode strings, cffi avoids one of the copies in both of the above cases.

On Linux, where the default unicode format for C libraries nowadays is utf8, there is no such optimization, so we have to do both copies every time a string is passed.

PyPy at some point was moving towards using utf8 strings internally, but I don't know whether this is still the plan. Using utf8 strings internally would eliminate one of the two copies on Linux (utf8 encoding/decoding would become a no-op).

All of the above is the current status of cffi and PyPy string handling as I understand it. My proposal to reduce the string copies to a minimum is this:

1. If PyPy doesn't go towards using utf8 strings internally, maybe we need a special C type that marks a string as utf8, so that PyPy/cffi converts to and from it automatically. Something like "wchar_t" on Windows, but denoting a utf8 string. cffi could define a special type ("__utf8char_t"?) for these strings.

Alternatively, an encoding parameter could be added to ffi.string, so that it does both the cdata conversion and the decoding in one step.

2. If PyPy does go towards using utf8 strings internally, then it could call C functions that neither mutate the PyPy strings nor store pointers to them by passing the strings directly. This could be accomplished with a cffi annotation for this kind of non-mutating C function.
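
As a rough illustration of the ffi.string alternative in point 1, this is what the call could look like from the user's side. The helper name string_with_encoding is hypothetical and not part of cffi; as written it still performs both copies, so it only shows the shape of the interface, not the optimization itself.

```python
from cffi import FFI

ffi = FFI()


def string_with_encoding(ffi, cdata, encoding="utf-8"):
    """Hypothetical stand-in for an ffi.string() with an encoding parameter.

    A real implementation inside cffi/PyPy could decode straight from the
    C buffer; this version still goes through an intermediate byte string.
    """
    return ffi.string(cdata).decode(encoding)


# Stand-in for a utf8 string returned by a C function.
c_buf = ffi.new("char[]", "naïve".encode("utf-8"))
decoded = string_with_encoding(ffi, c_buf)
```

The point is that the caller asks for a unicode string in one call, which leaves cffi free to skip the intermediate byte string where it can.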

The above ideas are based on my understanding of the current status and future direction of PyPy. If I have misunderstood something, I would be glad to be set right :).

Kind regards,

l.

_______________________________________________
pypy-dev mailing list
pypy-dev@python.org
https://mail.python.org/mailman/listinfo/pypy-dev
