Just to clarify the discussion, here is what I'm proposing. It is still in flux, and simplified due to memory issues; the simplification makes it less attractive in that one does not get to choose the encoding used (it would always be UTF-8 in Py3).

Without directive(s) (as it is now):

    char* <-> bytes

With the directive(s) (which can be applied locally or globally):

    char* <-> str
    unicode/bytes -> char* would also work (for Py2/Py3 respectively)
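
For concreteness, here is a sketch of what the directive would enable for a char*-typed argument (the directive spelling is invented for illustration; nothing is settled):

    # cython: c_string_conversion=str   # hypothetical spelling

    cdef extern from "string.h":
        size_t strlen(char* s)

    def text_length(char* s):
        # Without the directive, only bytes is accepted here. With it,
        # unicode (Py2) / str (Py3) would also be accepted, encoded
        # automatically for the duration of the call.
        return strlen(s)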

The encoding used would be the system default (in Py2) and UTF-8 (in Py3). This would use the defenc slot, so the encoded char* would be valid as long as the unicode object is around; the long-term future of the defenc slot needs to be ensured before this could be used for non-argument conversions.
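
For background, here is a minimal Py2-only sketch of why the lifetime works out, using the C-API call that fills the defenc slot (PyUnicode_AsDefaultEncodedString caches the encoded bytes object on the unicode object itself):

    cdef extern from "Python.h":
        ctypedef struct PyObject
        # Returns a *borrowed* reference to the bytes object cached in
        # the unicode object's defenc slot (CPython 2 C-API).
        PyObject* PyUnicode_AsDefaultEncodedString(object u, char* errors) except NULL
        char* PyString_AS_STRING(PyObject* b)

    cdef char* as_default_encoded(u) except NULL:
        # The encoded bytes object is cached on u itself, so the
        # returned char* stays valid for as long as u is alive.
        return PyString_AS_STRING(PyUnicode_AsDefaultEncodedString(u, NULL))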

Also out there is the idea of a directive that would make char* become  
unicode in both Py2 and Py3.


On Nov 29, 2009, at 8:47 AM, Stefan Behnel wrote:

> Robert Bradshaw, 28.11.2009 22:12:
>> My personal concern is the pain I see porting Sage to Py3. I'd have to go through the codebase and throw in encodes() and decodes() and change signatures of functions that take char* arguments
>
> That's what I figured. Instead of having to fix up the code, you want a do-what-I-mean str data type that unifies everything that's unicode, bytes and char*, and that magically handles it all for you.

Exactly. Improve the compiler rather than change the code.

> In that case, you should drop the argument of Pyrex compatibility for now, because I don't think you can have a Cython specific hyper-versatile data type with automatic memory management and all that, while staying compatible to the simple str/bytes type in Pyrex - even if we manage to get it working without new syntax.

Just because I don't need Pyrex compatibility doesn't mean it isn't a  
worthwhile goal (though that was the main point of the previous  
thread, not this one).

> We'd clearly break a lot of existing Pyrex/Cython code by starting to coerce char* to unicode, for example.

Only if the directive was enabled, and perhaps only in Py3. Existing  
code wouldn't break.

>> (which, I just realized, will be a step backwards for cpdef functions).
>
> True. For cpdef functions, a char* parameter would be well-defined as long as user code doesn't use different encodings for char* internally (which is somewhat unlikely).
>
> Ok, let's think this through. There's two different scenarios. One deals with function signatures (strings going in and out), the other one deals with conversion on assignments or casts.

I think it's easier if the Python-to-C and C-to-Python conversions are uniform whether they happen via coercion, assignment, or function signature constraints. Then the question is what objects can be turned into a char* (the directive would add unicode) and what object char* turns into (with the directive, str in both Py2 and Py3).

Function arguments typed as char* are a particularly useful case  
though, and it would be nice to make this friendlier for Py3.

> In total, there are three cases: accepting bytes/str/unicode in a str/bytes/char* signature, coercing str/unicode to char*, and coercing char* to bytes or unicode.
>
> Function signatures have two sides to them that are not symmetric. One is that you want your string accepting functions to be agnostic about the type of string that comes in (although you may or may not want to have control about memory usage if you use char* in the signature), and the other side is that you want some string to go back out, which you may want to be a Py2-str (read: bytes) or a unicode string (maybe in Py2 and definitely in Py3). Remember that if your code originally couldn't handle unicode, there's likely to be more code that can't handle it, either, so you wouldn't want your hyper-versatile type to always turn into unicode.

You're right, it would be nice to be able to return a str in both Py2  
and Py3, which neither "return some_c_string" nor "return  
some_c_string.decode(...)" will do.
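
For concreteness (the C function is hypothetical), under the directive the plain return would do the right thing in both:

    cdef extern char* get_version()   # hypothetical C function

    def version():
        cdef char* v = get_version()
        # Today this returns bytes in both Py2 and Py3. Under the
        # directive it would return str in both: the bytes unchanged
        # in Py2, decoded as UTF-8 in Py3.
        return v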

> 1) Passing unicode strings into a function that expects char* means that some kind of encoding must happen and a new Python bytes object must be created on the fly. The input object isn't a problem here as the caller holds a reference to it anyway. The encoded object, however, must have a lifetime. Looking at buffer arguments, I wouldn't mind if that was the lifetime of the function call itself. After all, it's the user's choice to use char* instead of str/bytes/unicode. So the case of a parameter typed as char* is actually easy to handle from a memory POV, given that some kind of automatic encoding is in place.
>
> 2) Automatic encoding for an assignment from unicode to char* is tricky, because you can't easily make assumptions about the lifetime of the unicode object itself. You could get away with a weak-ref mapping from unicode strings to their byte encoded representation. I think every other attempt to keep track of the lifetime of the unicode object is futile in current Cython. Think of code like this, which I would expect to work:
>
>    cdef unicode u = u"abcdefg"
>    cdef char* s1 = u
>    u2 = u
>    cdef char* s2 = u2
>    u = None
>    print u, u2, s1, s2
>
> So supporting automatic unicode->char* coercion on assignments is really hard to do internally.

As Greg pointed out, this would have *exactly* the same semantics as  
we now have for

    cdef bytes b = b"abcdefg"
    cdef char* s1 = b
    b2 = b
    cdef char* s2 = b2
    b = None
    print b, b2, s1, s2

using the defenc slot.

> 3) The third case is the same for both sigs and assignments: automatic decoding of char* to unicode vs. instantiation of a bytes object, i.e. the following should do The Right Thing:
>
>    cdef char* some_c_string = ...
>    some_python_name = some_c_string
>
> This would be heavily simplified if some_python_name was typed as either bytes or unicode (the latter of which might fail due to decoding errors), and even str would work if it did different things in Py2 and Py3 (with potential decoding errors only in Py3). However, that won't work for untyped return values of def functions.

This is not really about assignment; it's about coercion. If we declare

    some_python_name = some_c_string

to always have the same meaning as

    some_python_name = <typeof(some_python_name)>some_c_string

then the meanings of <bytes>some_c_string and <unicode>some_c_string are clear; <object>some_c_string is the only ambiguity, and the directive would control what it means.
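
A sketch of the resulting semantics (assuming the <unicode> cast of a char* discussed here were supported):

    def casts_demo(char* c):
        b = <bytes>c     # always a byte string
        u = <unicode>c   # always decoded; may raise on undecodable input
        o = <object>c    # the only ambiguous case: bytes today, str
                         # under the directive (bytes in Py2, decoded
                         # as UTF-8 in Py3)
        return b, u, o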

> Given that users would likely want to use bytes in Py2 (for simple non-unicode strings) and unicode for other strings in Py2 and all text strings in Py3, this isn't easy to handle automatically.

If a user wants to return a mixture of str/unicode in Py2, or bytes/str in Py3, they're going to have to be explicit one way or another, whether or not this directive is used. (It just changes the default.)
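
Being explicit would look the same either way, e.g. (a sketch, using the c_string.decode(enc) form mentioned below):

    def raw_name(char* c):
        return <bytes>c            # always bytes, directive or not

    def text_name(char* c):
        return c.decode('utf-8')   # always unicode, directive or not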

> Now, the proposal was to enable this with a compiler directive, which would basically provide a default encoding. If this directive was used, all untyped coercions from char* to a Python object would use it. As Dag noted already, this would interfere with type inference, as the resulting type would still be char* in that case.

This is completely orthogonal to type inference. Type inference happens before any coercions are inserted, and would work just as well with or without this proposal. It's a question of what kind of coercion to allow/insert when one is needed.
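
A sketch of why (the C function is hypothetical):

    cdef extern char* get_data()   # hypothetical C function

    def f():
        s = get_data()   # inferred as char*; no coercion is inserted
        return s         # the coercion happens here, and only here
                         # does the directive choose bytes vs. str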

> The only exceptions are untyped function return values.
>
> For typed coercions to str or unicode, I personally don't think that it's too much typing to require "c_string.decode(enc)", which would work nicely with type inference. However, that would, again, not yield the do-what-I-mean result of returning a byte string in Py2. Arguably, that might be considered an optimisation, but it could still fall under the DWIM compiler directive, e.g. as a "return_bytes_in_py2" option.
>
> Ok, to sum things up, it looks like a special kind of coercion at function call boundaries would be quite easy to support, and would work nicely with type inference enabled. It would also match the support that CPython's C-API argument unpacking functions have for converting Python strings. Everything else would mean hard work inside of Cython and be rather hard to explain to users.

Argument unpacking would be the most useful case. If defenc is  
actually going away, then I agree things could get messy, but  
otherwise it would be as easy (or hard) to explain and use as str ->  
char* is now.

> BTW, I wouldn't mind extending the string input argument conversion support to everything that supports the buffer protocol.

That might be interesting, though one difficulty is that buffers in general don't have an intrinsic notion of length once reduced to a char*. (Technically, neither do C strings, but null-terminated strings encoded with null-free encodings are common enough to make char* usable.)
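
For instance, an embedded null silently truncates anything that relies on termination, which is why buffer support would presumably need an explicit length alongside the pointer:

    cdef extern from "string.h":
        size_t strlen(char* s)

    def null_demo():
        cdef bytes b = b"ab\0cd"
        cdef char* p = b
        print strlen(p), len(b)   # prints "2 5": strlen stops at the null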

- Robert
