On Apr 16, 2008, at 2:25 PM, Dag Sverre Seljebotn wrote:
>
>> The more basic question is if we can transparently support unicode in
>> char*, why not? Even for non-English speakers, the majority of
>> strings being passed around will be ASCII.
>>
> Always defaulting to UTF-8 for this could be confusing in some  
> contexts.
> For instance, if one has a Cython source file in latin1, and calls a
> spelling correction library that works exclusively in latin1 (I've
> worked with such a library once...), and in general don't touch UTF-8
> anywhere, it might seem confusing that UTF-8 is passed to the library.

True. If you're using an external library that takes non-ASCII  
strings and you don't bother to think about encoding, you should be  
surprised if things just work. My goals are to make it natural to go  
object -> char* -> object for unicode objects, and object -> char* ->  
c library for ASCII unicode objects. When string literals become  
unicode objects, people are going to have unicode objects floating  
around everywhere, not bytes objects.
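In plain-Python terms (a sketch, not Cython's actual generated code), the round trip described above looks like this — object -> char* is an encode to UTF-8 bytes, and char* -> object is the matching decode:

```python
# Sketch of the object -> char* -> object round trip for unicode,
# using plain Python encode/decode in place of Cython's coercions.
text = u"my \u00e6\u00f8\u00e5\u00c5"   # non-ASCII unicode string

# object -> char*: the buffer a char* would point at
buf = text.encode("utf-8")

# char* -> object: decoding the same buffer recovers the original
assert buf.decode("utf-8") == text

# For ASCII-only unicode objects, the UTF-8 bytes are byte-for-byte
# ASCII, so they can go straight to an ASCII-expecting C library.
assert u"hello".encode("utf-8") == b"hello"
```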

> All in all it seems to be the lesser of evils though. (In particular I
> like defaulting to UTF-8 a lot better than having the encoding of the
> Cython source matter, which is where Stefan would disagree if I
> understand correctly.)

Having the source files be transferable from computer to computer is  
a big plus, and UTF-8 plays nice with ASCII and most standard c  
string processing functions.
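Two properties behind that "plays nice" claim can be checked directly (an illustrative sketch, not from the thread):

```python
# Why UTF-8 cooperates with ASCII and with C-style string handling.
s = u"caf\u00e9 \u00e6\u00f8\u00e5"
utf8 = s.encode("utf-8")

# 1. ASCII text encodes to identical bytes, so pure-ASCII sources
#    are already valid UTF-8 with no transformation needed.
assert u"plain ascii".encode("utf-8") == b"plain ascii"

# 2. UTF-8 never produces NUL bytes for non-NUL characters, so
#    NUL-terminated strlen/strcpy-style processing stays safe.
assert b"\x00" not in utf8
```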

At least we don't have people clamoring for EBCDIC support :).

>> I think both (a) and (b) are non-negligible issues, especially in the
>> context of wrapping existing C libraries. Having to learn a new type
>> like utf8charbuf (which also masks its pointer nature; is its memory
>> managed?) isn't desirable, especially if one is casting back and forth
>> everywhere between object and char*. It also creates the
>> expectation that all different kinds of encodings need to be
>> supported with their own special type, and I don't think we want
>> anything as heavy as a class.
>>
> OK, I've polished it to deal with some of these. Your main points are
> still valid though so I'll consider it dismissed...
>
> It wouldn't be beyond the Cython compiler to do something like
>
> cdef uchar("utf-8")* buf = "my æøåÅ"
>
> Which would directly be translated to
>
> cdef char* buf = "my \some\escape\sequence"
>
> and have
>
> cdef uchar("utf-8")* buf = pyobj
>
> become
>
> cdef char* buf = unicode(pyobj).encode("utf-8")

^^^ I always want to support people being able to do this if they need  
to be explicit.

The magic of uchar("gb5")* getting translated to the above would  
complicate the type system (both in terms of the codebase and the  
user's perspective). Would conversion be performed when assigning from  
a uchar("gb5") to a uchar("utf-8"), or to a plain uchar*? If we decide  
to support such automatic conversions, this seems like the best syntax  
I've seen, but I still think the default should be to accept unicode  
objects (via UTF-8).
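If such cross-encoding assignments were supported, the generated code would presumably round-trip through a unicode object. A minimal Python sketch (the encoding names here are chosen for illustration, not taken from the thread):

```python
# Hedged sketch: what assigning between differently-encoded uchar
# pointers would have to do under the hood -- decode with the source
# encoding, then re-encode with the destination encoding.
src = u"my \u00e6\u00f8\u00e5".encode("latin-1")   # a uchar("latin-1")* value

# uchar("latin-1") -> uchar("utf-8"): round-trip through unicode
dst = src.decode("latin-1").encode("utf-8")

# Both buffers represent the same text, in different byte encodings.
assert dst.decode("utf-8") == src.decode("latin-1")
```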

> It wouldn't be complicated to support many encodings, they would  
> just be
> passed on to CPython. No heavy class involved.

OK, I wasn't sure if your utf8charbuf was a class or not.

- Robert

_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev
