Re: [Cython] Another string encoding idea

Stefan Behnel Mon, 30 Nov 2009 22:42:18 -0800

Robert Bradshaw, 01.12.2009 04:09:
> Just to clarify discussion, here is what I'm proposing (which is still  
> in flux, and simplified due to memory issues, which does make it less  
> attractive as one does not get to choose the used encoding, but it  
> would always be UTF-8 in Py3).

... and the 'default encoding' in Py2, which may or may not be ASCII, but
would likely be at least something that's compatible with ASCII, as it
would break tons of code otherwise.
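
As a rough illustration of what "compatible with ASCII" means here, the following plain Python 3 sketch checks whether an encoding decodes every ASCII byte to the same character that ASCII does. The helper name `is_ascii_compatible` is made up for this example and is not part of Cython or CPython.

```python
import sys


def is_ascii_compatible(encoding):
    """Return True if all 128 ASCII bytes decode identically under `encoding`."""
    ascii_bytes = bytes(range(128))
    try:
        return ascii_bytes.decode(encoding) == ascii_bytes.decode("ascii")
    except UnicodeDecodeError:
        # The encoding cannot even decode plain ASCII bytes.
        return False


# The interpreter-wide default that an implicit conversion would rely on.
print(sys.getdefaultencoding(), is_ascii_compatible(sys.getdefaultencoding()))
```

Under such an encoding, pure-ASCII char* data round-trips unchanged, which is why a non-ASCII-compatible default would break so much existing code.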

> Without directive(s) (as it is now):
> 
>     char* <-> bytes
>
> With the directive(s) (which can be applied locally or globally):
> 
>      char* <-> str
>      unicode/bytes -> char* would also work (for Py2/Py3 respectively)

'respectively' in the sense of 'for both'?

> The encoding used would be the system default (in Py2) and UTF-8 (in  
> Py3). This would use the defenc slot so the encoded char* would be  
> valid as long as the unicode object is around, and the long term  
> future of the defenc slot needs to be ensured before this could be  
> used for non-arguments conversion.

That's my main concern here. We are basing a major feature on a side-effect
of something that's declared "for internal use only".

The new buffer interface isn't even supported by Unicode strings in Py3, so
the mere existence of the defenc slot in Py3 is plainly for internal
optimisation purposes, and the fact that it's safe for external code to
just borrow the reference into a char* is anything but clear to me.

It's obvious enough that defenc isn't going to go away in Py2 any more, but
since you keep insisting, please ask on python-dev for making that part of
the C-API publicly specified (i.e. the slot itself and the fact that the
object in defenc is kept alive for the lifetime of the unicode string)
before we even consider doing anything like this.

I still don't like the list.pop() optimisation, but this is much worse, as
we can't just take this feature back when we realise that it was a mistake
in the first place.

> Also out there is the idea of a directive that would make char* become  
> unicode in both Py2 and Py3.

... which would likely only be useful for new code, as existing code would
break in all sorts of places if you enable that (just as with type inference).

> On Nov 29, 2009, at 8:47 AM, Stefan Behnel wrote:
> 
>> Robert Bradshaw, 28.11.2009 22:12:
>>> My personal concern is the pain I see porting Sage to Py3. I'd have 
>>> to go through the codebase and throw in encodes() and decodes() and 
>>> change signatures of functions that take char* arguments
>> That's what I figured. Instead of having to fix up the code, you want
>> a do-what-I-mean str data type that unifies everything that's unicode,
>> bytes and char*, and that magically handles it all for you.
> 
> Exactly. Improve the compiler rather than change the code.

Calling it 'improve' makes it sound better than I think it is.

I do see the interest of simplifying the path between unicode strings and
char*, but I also see an interest in making it easy for developers to write
safe APIs that reject broken input (e.g. with NUL bytes or other control
characters). I really don't like APIs that use "well, it's written in C"
(and certainly not "well, it's written in Cython"!) as an excuse for
silently dropping parts of my accidentally broken input (which I may not
even have control of myself). Automatic coercion to char* is only one side
of input handling, and it may just as well lead to less helpful APIs being
written. So enabling such a directive requires careful consideration, too,
because it's not a simple all-win thing, not even in the long term.
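
To make the "reject broken input" point concrete, here is a minimal Python 3 sketch of an explicit conversion step that refuses input a char* would silently truncate. The function name `to_c_text` and the exact rejection policy (NUL plus C0 control characters other than tab, newline and carriage return) are assumptions chosen for illustration, not anything Cython provides.

```python
def to_c_text(value, encoding="utf-8"):
    """Encode text for a C API, raising instead of passing on broken input."""
    if not isinstance(value, str):
        raise TypeError("expected str, got %s" % type(value).__name__)
    for ch in value:
        # A NUL byte would end the C string early and drop the rest of the
        # input. Rejecting the other control characters is an assumed policy.
        if ch == "\x00" or (ord(ch) < 0x20 and ch not in "\t\n\r"):
            raise ValueError("control character %r not allowed" % ch)
    return value.encode(encoding)
```

With automatic coercion to char*, a check like this has to be remembered and written separately, since the conversion itself would not raise.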

> I think it's easier if the Python to C and C to Python conversions are  
> uniform whether they happen via coercion, assignment, or function  
> signature constraints. Then the question is what objects can be turned  
> into a char* (the directive would add unicode) and what object does  
> char* turn into (the directive would create str in Py2 and Py3).
> [...]
> If we declare
>
>      some_python_name = some_c_string
>
> to always have the same meaning as
>
>      some_python_name = <typeof(some_python_name)>some_c_string
>
> then the meaning of <bytes> some_c_string and <unicode> some_c_string
> are clear, and <object> some_c_string is the only ambiguity, and the
> directive would control what <object> some_c_string means.

This sounds reasonable - except for the implementation details.
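
The rule quoted above can be modelled in a few lines of plain Python 3. This is only a toy model of the proposed semantics, not Cython's implementation: `coerce_c_string`, `str_directive` and the UTF-8 default are hypothetical names and choices, and `raw` stands in for the char* data.

```python
def coerce_c_string(raw, target=object, str_directive=False, encoding="utf-8"):
    """Toy model of what `<target>some_c_string` would yield under the proposal."""
    if target is bytes:
        return raw                      # <bytes>: always the raw bytes
    if target is str:
        return raw.decode(encoding)     # <unicode>: always decoded text
    if target is object:
        # The only ambiguous case: the (hypothetical) directive decides it.
        return raw.decode(encoding) if str_directive else raw
    raise TypeError("unsupported target type: %r" % (target,))
```

The model shows that only the untyped `<object>` case changes meaning when the directive is switched on, which is also the case that type inference touches.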

> Function arguments typed as char* are a particularly useful case  
> though, and it would be nice to make this friendlier for Py3.

Function arguments typed bytes/str/unicode are a lot easier and safer to
handle, though, and not a bit slower in general. Coercion from bytes to
plain char* is pretty fast, and could be even faster if it's typed (as we
could use a None check and a macro in that case).

>> Now, the proposal was to enable this with a compiler directive, which
>> would basically provide a default encoding. If this directive was
>> used, all untyped coercions from char* to a Python object would use
>> it. As Dag noted already, this would interfere with type inference, as
>> the resulting type would still be char* in that case.
>
> This is completely orthogonal to type inference.

It's not orthogonal, as type inference currently breaks C type to untyped
Python name assignments, which is exactly the case you want to influence
with the directive. This means that the char* directive would override the
type inference directive for one special case.

>> BTW, I wouldn't mind extending the string input argument conversion  
>> support to everything that supports the buffer protocol.
> 
> That might be interesting, though one difficulty is that buffers in  
> general don't have an intrinsic notion of length.

Huh? Py_buffer.len will do just fine for a 1D buffer. Actually,
PyUnicode_FromEncodedObject() will handle this for us in Py3 (although
incorrectly by using PyObject_AsCharBuffer(), I just filed a bug report on
their tracker).
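
The length of a 1D buffer can be observed from Python 3 as well, where `memoryview.nbytes` reports the same total byte count that `Py_buffer.len` carries at the C level. The sketch below uses a made-up helper name, `buffer_length`, and also shows that `str` does not export a buffer at all in Py3.

```python
import array


def buffer_length(obj):
    """Return the byte length of a one-dimensional buffer-protocol object."""
    with memoryview(obj) as view:
        if view.ndim != 1:
            raise ValueError("expected a one-dimensional buffer")
        return view.nbytes  # total size in bytes, like Py_buffer.len


print(buffer_length(b"abc"))
print(buffer_length(bytearray(b"hello")))
print(buffer_length(array.array("B", [1, 2, 3, 4])))

try:
    memoryview("abc")  # Unicode strings do not support the buffer protocol
except TypeError as exc:
    print("str is not a buffer:", exc)
```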

But the case where this would matter doesn't seem to be part of your
revised proposal above any more.

Stefan

