After all your comments, my conclusion is the following:
- Any code dealing with string processing should be written so that
byte and unicode string literals are explicitly prefixed, like b"abc"
and u"abc". This way, the code is not only semantically correct, but
also explicit about the intended meaning of each string literal.
Unfortunately, after long discussions, Guido decided that Python 3
would not support the u"abc" form. And now, from a user/developer
perspective, even though keeping it would contradict the "only one
way..." principle, I'm not sure at all that dropping it was a good idea
in practice.
- I still think that unprefixed forms should match the builtin 'str'
type in Py2 and Py3. This way, things like docstrings, exception
messages, and calls like getattr([], "append") will just work. And of
course, casting to a raw 'char*' pointer should only accept bytes (by
using PyString_AsString). This is going to be safe in Py3, but not in
Py2. But Py2 is broken anyway, right?
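To make the second point concrete, here is a small plain-Python sketch (not Cython) of the places where an unprefixed literal has to be the builtin 'str' type. The function name is made up for this mail, and the last check assumes a Py3 runtime:

```python
import sys


def check_native_literals():
    """Unprefixed docstring: it has the builtin 'str' type on Py2 and Py3."""
    results = {}
    # Docstrings are stored as written; an unprefixed literal gives 'str'.
    results["docstring_is_str"] = type(check_native_literals.__doc__) is str
    # Attribute names given as unprefixed literals are accepted by getattr().
    results["getattr_works"] = callable(getattr([], "append"))
    # Exception messages built from unprefixed literals are 'str' as well.
    try:
        raise ValueError("invalid input")
    except ValueError as exc:
        results["message_is_str"] = type(exc.args[0]) is str
    # On Py3 a bytes attribute name is rejected, so treating every
    # unprefixed literal as a byte string would break calls like this one.
    if sys.version_info[0] >= 3:
        try:
            getattr([], b"append")
            results["bytes_name_rejected"] = False
        except TypeError:
            results["bytes_name_rejected"] = True
    return results
```

Every entry of the returned dict should be True if unprefixed literals follow the builtin 'str' type of the running interpreter.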
In my current understanding of the problem, the evil thing is
automatic conversion. I'm completely convinced of this, I believe
Robert and Greg are also convinced, and Stefan and Dag are definitely
sure. So perhaps a way to make all of us happy is the following:
- Add a '-py3' command line flag to Cython (this will be needed in the
future anyway, right?). When this flag is active, then the following
will happen at runtime:
* Something like 'cdef char *p = obj' will only accept a byte string
('str' type in Py2, 'bytes' in Py3). If 'obj' is a unicode string,
the generated code raises a TypeError on both a Py2 and a Py3 runtime.
* A new C pseudo-type has to be added; let's call it 'uchar' (a better
name is needed, as this one can be confused with unsigned char). Then
something like 'cdef uchar *p = obj' will only accept a unicode
string ('unicode' type in Py2 and 'str' in Py3). If 'obj' is a byte
string, the generated code raises a TypeError on both a Py2 and a Py3
runtime.
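The two runtime checks above could behave roughly like the following pure-Python sketch. The helper names as_char_ptr and as_uchar_ptr are hypothetical; they only stand in for the checks the generated C code would perform, and they return the object itself instead of a real pointer:

```python
import sys

# Pick the byte and text types of the running interpreter.
if sys.version_info[0] >= 3:
    byte_type, text_type = bytes, str
else:
    byte_type, text_type = str, unicode  # noqa: F821 (name exists on Py2 only)


def as_char_ptr(obj):
    """Stand-in for 'cdef char *p = obj': accept byte strings only."""
    if not isinstance(obj, byte_type):
        raise TypeError("expected a byte string, got %s" % type(obj).__name__)
    return obj


def as_uchar_ptr(obj):
    """Stand-in for 'cdef uchar *p = obj': accept unicode strings only."""
    if not isinstance(obj, text_type):
        raise TypeError("expected a unicode string, got %s" % type(obj).__name__)
    return obj
```

With these, as_char_ptr(b"abc") passes while as_uchar_ptr(b"abc") raises TypeError, and the outcome is the same whether the code runs on Py2 or Py3. There is no implicit encoding or decoding anywhere, which is the whole point.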
I want to stress that the above behavior should be enabled ONLY through
a command line switch. This proposal just tries to make Cython/Pyrex
stricter and more explicit, even in a Py2 runtime.
That's all. Stefan, please do not get angry if all this is
nonsense. All this stuff is going to be a real pain for all Python
users; all of us have to be ready to handle this, and to answer
questions from confused end-users.
About a week ago, I asked Fernando Perez and Brian Granger for
suggestions on how they think I should handle the byte/unicode stuff
in mpi4py. I paste Fernando's answer below:
"""
Wow, this one's going to be a *huge* thorn in everyone's side, if I
understand the problem correctly. What has been the policy of other
projects?
"""
And I believe this is the situation of many, many Python users and
developers out there, even of the very smart and productive ones like
Fernando. If Cython takes a good direction on all this, then this will
be beneficial for the Cython project itself, and it will also help other
people and other projects to follow the right path.
On 5/18/08, Stefan Behnel <[EMAIL PROTECTED]> wrote:
> Hi,
>
> since we had a lengthy discussion on whether or not non-prefixed byte strings
> should automatically mutate into unicode strings when compiled for Py3, here
> are some initial lessons from my first attempt to port lxml.
>
> My first approach was (obviously) to import unicode_literals from __future__.
> This failed miserably, and even showed a couple of further bugs in Cython. :)
>
> I then chose the route to explicitly prepend unicode strings with 'u', as I
> wanted to keep my source compilable with older Cython versions that do not
> support the 'b' prefix. Currently, I have changed about 700 lines this way in
> a quick walk-through, and now I'm searching the places where this was the
> wrong thing to do. :)
>
> Most important evidence found: it's definitely non-trivial in a lot of places
> to decide what has to be unicode and what doesn't. It's non-trivial for me,
> and definitely not easier for Cython.
>
> One important place where I ended up with a lot of trivial changes are
> docstrings. Here, I would give an almost 100% chance that the user meant a
> unicode string if it's not prefixed. The remaining cases, e.g. where some
> external tool may require binary data for some kind of configuration or
> analysis are rare enough to just ignore them. For exactly this reason (I
> think), the doctest module in Py3 ignores docstrings that are not unicode.
> This might be a place where an automatic conversion might make sense
> (although, if it's the only place, that would be some funny string
> semantics...)
>
> Another important place are exception messages. Here, I'd give a real 100%
> for string literals, as their only purpose is to be human readable.
>
> A field where I really had to take care is when working with byte sequences.
> For example, lxml has a couple of places where strings are converted into
> UTF-8 and then passed into re.findall() or re.sub(). When substituting, the
> replacement string obviously has to be a byte string, too. I also found a bug
> in the Py3 re module when working with byte strings in one specific case.
>
> There are actually quite a number of places where strings are built as byte
> strings by combining and formatting literals, and then converted to a char*.
> Another place where automatic conversion must not happen.
>
> So, while still on the way, my first real-world impression meets my original
> opinion. There are definitely a lot of unprefixed strings in my own code that
> are meant to be unicode strings. Simply switching their type in Py3 will fix
> a lot of them, but at the same time break many others. The things that it
> fixes are the trivial parts: docstrings and exceptions. Almost everything
> else really were byte strings, and some were non-trivial things that need
> real work.
>
> If I can choose, I opt for going through this once and then having code that
> correctly distinguishes between byte strings and unicode strings in *both*
> Py2 and Py3, instead of additionally having to deal with changing string
> semantics for identical code in different environments. We might think about
> a way to simplify the transition from unprefixed docstrings and exception
> messages to unicode strings. As it currently stands, everything else is
> definitely out of scope for any automatism.
>
> Stefan
>
> _______________________________________________
> Cython-dev mailing list
> [email protected]
> http://codespeak.net/mailman/listinfo/cython-dev
>
--
Lisandro Dalcín
---------------
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594