>>>> from __future__ import unicode_literals (or -3)
>>>>
>>>>       len(x1) == 4
>>>>       len(x2) == 4
>>>>
>>>> Otherwise
>>>>
>>>>       len(x1) == 9
>>>>       len(x2) == 4
>>>
>>> Hmm, now *that* looks unexpected to me.
>>
>> But this is *exactly* how Python handles it.
>>
>> x1 = 'abc\u0001'
>> x2 = 'abc\x01'
>> len(x1), len(x2)
>>
>> both with and without unicode_literals.
>
> Not for byte strings.

We're talking about unprefixed literals.
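
To make that concrete, here is what the quoted example does in plain
Python 2 (a hypothetical module, nothing Cython-specific about it):

    # Without the __future__ import, unprefixed literals are byte strings;
    # '\u0001' is not a recognized escape there, so the backslash survives
    # literally, while '\x01' is a single byte.
    x1 = 'abc\u0001'
    x2 = 'abc\x01'
    print len(x1), len(x2)   # -> 9 4

    # The same two assignments under
    #     from __future__ import unicode_literals
    # make both literals unicode, where '\u0001' is one character,
    # so the same print gives  -> 4 4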

> Seriously, what you are trying to push here is that users must decide if
> they prefix a char* literal with a 'b' or not, depending on the content of
> the string.

Users are free to always prefix all their byte literals with 'b'; I'm
proposing that, for the simple, unambiguous case, they aren't forced to.

> Sometimes, Cython will force them to do it, sometimes, it will
> just work, even for calls to exactly the same function. Great. Why can't we
> *always* require a 'b'

I think this is overkill for the vast majority of libraries that I've
wrapped (admittedly mostly math), as well as all the standard C
libraries that take char* arguments (e.g. stdio, as in my previous
example).
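
The kind of call I have in mind looks roughly like this (a sketch, not
the exact code from my earlier mail; the function and names are just
illustrative):

    # Hypothetical Cython sketch: wrapping a stdio function taking char*.
    cdef extern from "stdio.h":
        int puts(char *s)

    def greet():
        # The literal is only ever used as a char* here, so the case is
        # unambiguous; forcing puts(b"Hello, world!") adds nothing.
        puts("Hello, world!")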

> or *always* make it work as expected? What would be
> wrong with that?

Because clearly "what is expected" is not consistent across the
participants in this thread, and I'd certainly rather have an
unexpected compile-time error than unexpected (potentially undetected)
runtime behavior.

>>> The way I see it, a C string is the
>>> C equivalent of a Python byte string and should always and predictably
>>> behave like a Python byte string, regardless of the way Python object
>>> literals are handled.
>>
>> Python bytes are very different than strings. C (and most C libraries)
>> use char* for both strings and binary data.
>
> No. They use it for binary data and *encoded* text content, even if the
> encoding is ASCII. That's different. The fact that they accept text content
> encoded in ASCII, CP1250, UTF-8, UCS4, Latin-15, Kanji or whatever doesn't
> mean they know what Unicode is or even how to handle text. They may just
> store it away as binary, they may interpret it as a filename encoded in a
> platform specific way, or they may pass it to a recoder. Cython can't know.
> The user will know it, though, and will (in almost all cases) pass content
> that suits the other side, be it ASCII encoded or not.

[Sigh] I know the difference, but to say the C statement

    char *x = "abc";

doesn't contain any strings, only encoded text content, is IMHO overly
pedantic, and I think it's too much to push this level of pedantry on
all our users when the result is unambiguous.

> Could you comment on this please?

Sure, at the risk of being redundant.

> http://permalink.gmane.org/gmane.comp.python.cython.devel/10243

> I think I made it pretty clear there what I think the two suitable
> alternatives are.

Yes, you favor either (1) re-interpretation of literals depending on
the type context they're used in, or (2) disallowing interpretation of
string literals when unicode literals are enabled.

I think (1) is a bad path to take and would prefer not to burden users
with (2).
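
Roughly, as I read them (purely illustrative, and in comment form since
neither is current behaviour):

    # (1) re-interpretation by type context:
    #     cdef char *c = "abc"   # the literal would be read as bytes here
    #     s = "abc"              # the same spelling read as unicode here
    #
    # (2) no unprefixed literals in char* contexts under unicode_literals:
    #     cdef char *c = "abc"   # would be a compile-time error
    #     cdef char *c = b"abc"  # the prefix would always be required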

- Robert
_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev
