Re: [pypy-dev] UTF8 string passing in cffi and PyPy internal string optimizations

Amaury Forgeot d'Arc Wed, 18 Mar 2015 07:49:46 -0700

Hi,

2015-03-17 18:27 GMT+01:00 Eleytherios Stamatogiannakis <est...@gmail.com>:


> Hello,
>
> I'm sending the following here as they involve both cffi and PyPy.
>
> For the last few years I have been trying to find the most efficient way
> to pass UTF8 strings between PyPy and C code using cffi.
>
> Right now when PyPy receives a utf8 string (from a C function) it has to
> do 2 copies:
>
> 1. convert the cdata string to a pypy byte string via ffi.string
> 2. convert ffi.string to a unicode string
>
> When pypy sends a utf8 string it also does 2 copies:
>
> 1. convert pypy unicode string to utf8-encoded byte string
> 2. copy the byte string into a cdata string.
>
> From what I understand, there is a cffi optimization dealing with windows
> unicode (via set_unicode) where on windows platforms and when using the
> native windows unicode strings, cffi avoids doing one of the copies in both
> of the above cases.
>
> On linux where the default unicode format for C libraries nowadays is
> UTF8, there is no such optimization, so we have to do the two copies in all
> string passing.
>
> PyPy at some point was going towards using utf8 strings internally, but I
> don't know if this is still the plan or not. Using utf8 strings would
> optimize away one of the two copies on the linux platform (utf8
> encoding/decoding would become a no-op).
>
> All of the above is the current status of cffi and pypy string handling as
> I understand it. So my proposal to reduce the string copies to a minimum is
> this:
>
> 1. If PyPy doesn't go towards using utf8 strings internally, maybe we need
> some special C type that denotes that the string is utf8 and pypy/cffi
> should do the conversion from-to it automatically. Something like "wchar_t"
> in windows but denoting a utf8 string. CFFI can define a special type
> ("__utf8char_t"?) for these strings.
>

This is a first step towards SWIG's typemaps:
http://www.swig.org/Doc3.0/Typemaps.html#Typemaps_nn4

That's also something I wanted to have in another project: automatic
conversion to PYTHON_HANDLE, for example.

But typemaps are a tough thing, and they would likely differ between
CPython and PyPy.
Armin, what do you think?

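For reference, the two copies per direction described in the quoted message can be sketched as below. This is only an illustration, not code from the thread: `receive_utf8` and `send_utf8` are hypothetical helper names, and `ffi` is assumed to be a `cffi.FFI()` instance. The only cffi calls used are `ffi.string()` and `ffi.new("char[]", ...)`.

```python
def receive_utf8(ffi, cdata):
    """C -> Python: two copies with the current cffi API."""
    raw = ffi.string(cdata)       # copy 1: char* cdata -> byte string
    return raw.decode("utf-8")    # copy 2: byte string -> unicode string


def send_utf8(ffi, text):
    """Python -> C: two copies with the current cffi API."""
    encoded = text.encode("utf-8")       # copy 1: unicode -> UTF-8 bytes
    return ffi.new("char[]", encoded)    # copy 2: bytes -> NUL-terminated cdata
```

The cdata returned by `send_utf8` owns its memory, so the caller has to keep a reference to it for as long as the C side uses the pointer.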

> Alternatively, an encoding parameter could be added in ffi.string, so that
> it'll do both the cdata and encoding conversions in one step.
>
> 2. If PyPy does go towards using utf8 strings internally, then it could
> call C functions that do not mutate the pypy strings and do not store
> pointers to them, by passing the strings directly. This could be
> accomplished by using a cffi annotation for this kind of
> non-string-mutating C function.
>

Even if utf8 is used internally (it is already the case in the py3k branch,
as a cached attribute), I'm not sure I would like functions like strlen() to
silently accept unicode strings...

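One possible reading of this concern (an interpretation, not something spelled out in the thread) is that a byte-oriented C function would report UTF-8 byte counts, which differ from the character count Python shows for the same string. A minimal pure-Python sketch, where `lengths` is a hypothetical helper and no C call is actually made:

```python
def lengths(text):
    """Return (code points, UTF-8 bytes) for a unicode string.

    The second value is what a byte-counting C function such as strlen()
    would see if it were handed the UTF-8 buffer, assuming the string
    contains no embedded NUL character.
    """
    encoded = text.encode("utf-8")
    return len(text), len(encoded)


code_points, utf8_bytes = lengths(u"h\u00e9llo")
print(code_points, utf8_bytes)
```

With an explicit `.encode("utf-8")` at the call site, the caller can see that the C function receives bytes rather than characters.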

>
> The above ideas are based on my understanding of the current status and the
> future directions of PyPy. If I have misunderstood something I would be
> glad to be set right :).
>
> Kind regards,
>
> l.
>
> _______________________________________________
> pypy-dev mailing list
> pypy-dev@python.org
> https://mail.python.org/mailman/listinfo/pypy-dev
>



-- 
Amaury Forgeot d'Arc

