Re: [Cython] How to deal with byte strings and unicode strings at the API level

Stefan Behnel Fri, 09 May 2008 12:30:46 -0700

Hi,

Lisandro Dalcin wrote:
> I have this class Info (something like a dictionary ) wrapping some
> MPI calls  for getting and setting (key,value) pairs (they have to be
> ascii, 8-bits strings in order MPI understands them)


:) first problem: ASCII is 7-bit, not 8-bit.


> cdef class Info:
>     def Get(self, char key[]):
>         [...]
>     def Set(self, char key[], char value[]):
>         [...]

Here, you define your API as taking byte strings as input. Fair enough.

However, in your example, you do not verify that your input is really ASCII
encoded, so you allow users to pass 8-bit strings without any warning.
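
To illustrate the missing check, here is a minimal sketch of an input validator in plain Python 3. The helper name require_ascii is hypothetical (it is not part of Cython or of your wrapper), and it assumes you want to reject anything that is not pure 7-bit ASCII:

```python
def require_ascii(data, name="argument"):
    """Return `data` unchanged if it is a bytes object holding only 7-bit ASCII.

    Hypothetical helper for illustration; not part of any existing API.
    """
    if not isinstance(data, bytes):
        raise TypeError("%s must be a byte string, got %s"
                        % (name, type(data).__name__))
    try:
        # Decoding as ASCII fails for any byte with the high bit set.
        data.decode("ascii")
    except UnicodeDecodeError:
        raise ValueError("%s contains non-ASCII bytes: %r" % (name, data))
    return data
```

Calling something like this at the top of Get() and Set() would turn silent 8-bit input into an explicit error.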


> So then all your comments actually means that I should not take any
> special action in Cython for wrapping this, and then
> 
> * - If I run this in Python 2, I just do: info.Set("key", "value")

This works, as plain string literals in Python 2 are byte strings. Also,
conversion between plain ASCII byte strings and unicode strings happens
automatically in Python 2 (which is where the infamous UnicodeDecodeError
comes from as soon as non-ASCII bytes are involved).


> * - If I run this in Python 3, I should do: info.Set(b"key", b"value")

Again, this works as you pass byte strings, as enforced by your API.


> In that case, suppose now that a user running on Python 3 does the following:
> 
> >>> info.Set("key", "value")
> 
> This is broken,

It's not broken, it's just incorrect API usage. In a way, it's like doing this:

   >>> info.Set(-999, 123456789)

and, guess what: Python will raise a TypeError for this!


> because MPI will not recognize the key or the value,
> as they are not C plain char arrays containing null-terminated ascii
> 8-bits string.
>
> Then, how should I modify my *.pyx code to detect this and generate an
> error/warning,

Once the code is in place, Cython will generate a TypeError for you, just like
Py3 itself does when attempting automatic conversion between unicode strings
and bytes objects.
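
The Py3 behaviour referred to here can be checked with a small sketch. It only shows that Python 3 itself refuses to mix the two types; it does not show what code Cython generates. The function name can_concatenate is made up for the example:

```python
def can_concatenate(a, b):
    """Return True if `a + b` works, False if Python raises a TypeError."""
    try:
        a + b
    except TypeError:
        return False
    return True


print(can_concatenate(b"key", b"value"))  # True: bytes + bytes
print(can_concatenate(b"key", "value"))   # False: bytes + str is a TypeError
```

There is no implicit ASCII conversion between the two types in Python 3, which is exactly what keeps wrong API usage from going unnoticed.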


> or even try to coerce the input to ascii 8-bits if the input is 'unicode'

No, that's one of the reasons why there is a lot of broken code in Python2:
"works on my machine, so it can't be broken, can it?"


> or pass it unchanged if the input is 'bytes'?

That will work, as a bytes object (a PyStringObject in Py2, named
PyBytesObject in the released Py3) is compatible with a char*.

However, imagine this line in a Python2 source file:

    info.Set("üöä", "äöüßßß")

or this line in a Python3 source file:

    info.Set(b"üöä", b"äöüßßß")

What will be the byte sequence that you get in your char[] for key and value?

Well, it depends on the source code encoding, which you can declare at the
beginning of your source file. If, for example, it's "UTF-8", you will get a
UTF-8 encoded byte sequence, which is 6 bytes long for key and 12 bytes long
for value. If, on the other hand, it's "iso-8859-1", you will get a 3-byte
sequence for key and a 6-byte sequence for value. Same code, looks exactly the
same in an editor, but results in completely different input to your methods.
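
The byte counts above can be reproduced with a short Python 3 sketch. It encodes the text explicitly instead of relying on the source file encoding, and writes the characters as escapes so that the example itself does not depend on how the file is saved. The function name encoded_lengths is made up for the illustration:

```python
def encoded_lengths(text, encodings=("utf-8", "iso-8859-1")):
    """Map each encoding name to the number of bytes `text` occupies in it."""
    return {enc: len(text.encode(enc)) for enc in encodings}


key = "\u00fc\u00f6\u00e4"                      # "üöä"
value = "\u00e4\u00f6\u00fc\u00df\u00df\u00df"  # "äöüßßß"

print(encoded_lengths(key))    # {'utf-8': 6, 'iso-8859-1': 3}
print(encoded_lengths(value))  # {'utf-8': 12, 'iso-8859-1': 6}
```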

How is your code going to deal with this? How will it even know what happens?


> And all this working both in 2.3/2.4/2.5/2.6 and 3.0??

If byte strings are what your API deals with, this will continue to work across
all of those versions. You can use the 2to3 tool to convert your Python2 code
to Python3. It works quite well and will change this

    info.Set("key", "value")

into this

    info.Set(b"key", b"value")

for you (you can even call the tool from your setup.py).

If, however, what you actually want is text input (i.e. unicode characters),
you can fix your API like this (probably plus some more input checking):

  cdef class Info:
      def Get(self, key):
          key = key.encode("ASCII") # or whatever encoding you use internally
          [...]
      def Set(self, key, value):
          key   = key.encode("ASCII")   # or whatever encoding you use
          value = value.encode("ASCII") # or whatever encoding you use
          [...]

That way, users can call your API in Python2 like this:

    info.Set(u"key", u"value")

and the 2to3 tool will convert this to

    info.Set("key", "value")

for them, which (again) will continue to work, and (now the cool thing) your
methods will always receive the input in the expected encoding, and your users
will get an encoding exception if they pass non-ASCII strings. :)
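
As a rough, runnable illustration of this text-based API, here is a pure-Python 3 stand-in. The class name InfoSketch is hypothetical, a dict replaces the real MPI calls, and decoding the result in Get() is my assumption rather than something the original wrapper is known to do:

```python
class InfoSketch:
    """Hypothetical stand-in for the Info wrapper; a dict replaces MPI."""

    def __init__(self):
        self._store = {}  # maps ASCII-encoded keys to ASCII-encoded values

    def Set(self, key, value):
        # .encode("ascii") raises UnicodeEncodeError for non-ASCII text.
        self._store[key.encode("ascii")] = value.encode("ascii")

    def Get(self, key):
        # Assumption: hand text back to the caller, mirroring Set().
        return self._store[key.encode("ascii")].decode("ascii")


info = InfoSketch()
info.Set("key", "value")
print(info.Get("key"))  # value

try:
    info.Set("\u00fc\u00f6\u00e4", "value")  # "üöä" is not ASCII
except UnicodeEncodeError as exc:
    print("rejected:", exc.reason)
```

Note that passing a bytes object to this sketch under Python 3 fails with an AttributeError, since bytes has no encode() method. That is one place where the "more input checking" would come in.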

So all you have to take care of is what your actual API is: bytes? characters?

Does this make things a bit clearer?

Stefan