Hi,

2015-03-17 18:27 GMT+01:00 Eleytherios Stamatogiannakis <est...@gmail.com>:
> Hello,
>
> I'm sending the following here as they involve both cffi and PyPy.
>
> For the last few years i have been trying to find the most efficient way
> to pass UTF8 strings between PyPy and C code using cffi.
>
> Right now when PyPy receives a utf8 string (from a C function) it has to
> do 2 copies:
>
> 1. convert the cdata string to a pypy byte string via ffi.string
> 2. convert ffi.string to a unicode string
>
> When pypy sends a utf8 string it also does 2 copies:
>
> 1. convert pypy unicode string to utf8-encoded byte string
> 2. copy the byte string into a cdata string.
>
> From what i understand, there is a cffi optimization dealing with windows
> unicode (via set_unicode) where on windows platforms and when using the
> native windows unicode strings, cffi avoids doing one of the copies in both
> of above cases.
>
> On linux where the default unicode format for C libraries nowadays is
> UTF8, there is no such optimization, so we have to do the two copies in all
> string passing.
>
> PyPy at some point was going towards using utf8 string internally, but i
> don't know if this is still the plan or not. Using utf8 strings would
> optimize away one of the two copies on the linux platform (utf8
> encoding/decoding would become a nop operator).
>
> All of the above is the current status of cffi and pypy string handling as
> i understand it. So my proposal to reduce the string copies to a minimum is
> this:
>
> 1. If PyPy doesn't go towards using utf8 strings internally, maybe we need
> some special C type that denotes that the string is utf8 and pypy/cffi
> should do the conversion from-to it automatically. Something like "wchar_t"
> in windows but denoting a utf8 string. CFFI can define a special type
> ("__utf8char_t"?) for these strings.
>

This is a first step towards SWIG's typemaps:
http://www.swig.org/Doc3.0/Typemaps.html#Typemaps_nn4
That's also something I wanted to have in other projects: automatic conversion to PYTHON_HANDLE, for example.
But typemaps are a tough thing, and they would likely differ between CPython and PyPy.
Armin, what do you think?

> Alternatively, an encoding parameter could be added in ffi.string, so that
> it'll do both the cdata and encoding conversions in one step.
>
> 2. If PyPy does go towards using utf8 string internally. Then it could
> call C functions that do not mutate the pypy strings and do not store
> pointers to them, by passing the strings directly. This could be
> accomplished by using a cffi annotation for these kind of
> non-string-mutating C functions.
>

Even if utf8 is used internally (it is already the case in the py3k branch, as a cached attribute), I'm not sure I would like functions like strlen() to silently accept unicode strings...

> Above ideas are based on my understanding of the current status and the
> future directions of PyPy. If i have misunderstood something i would be
> glad to be set right :).
>
> Kind regards,
>
> l.
>
> _______________________________________________
> pypy-dev mailing list
> pypy-dev@python.org
> https://mail.python.org/mailman/listinfo/pypy-dev
>

--
Amaury Forgeot d'Arc
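
For reference, the two copies per direction described in the quoted message can be sketched with the existing cffi calls ffi.string and ffi.new. This is only a minimal illustration, assuming cffi is installed; the helper names receive_utf8 and send_utf8 are hypothetical and are not part of cffi.

```python
from cffi import FFI

ffi = FFI()


def receive_utf8(cdata):
    """C -> Python (hypothetical helper): two copies."""
    raw = ffi.string(cdata)        # copy 1: NUL-terminated char data -> byte string
    return raw.decode("utf-8")     # copy 2: utf8 byte string -> unicode string


def send_utf8(text):
    """Python -> C (hypothetical helper): two copies."""
    encoded = text.encode("utf-8")       # copy 1: unicode string -> utf8 byte string
    return ffi.new("char[]", encoded)    # copy 2: byte string -> cdata buffer (NUL added)


# Round trip through a cdata buffer; the buffer stays alive as long as `buf` does.
buf = send_utf8("καλημέρα")
assert receive_utf8(buf) == "καλημέρα"
```

A single-step conversion, as proposed in the quoted message, would collapse each helper into one call; nothing here implements that.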