Re: [Cython] Odd behavior with std::string and .decode()

Barry Warsaw Fri, 06 Jul 2012 07:22:22 -0700

Thanks for the follow-up, Stefan,

On Jul 06, 2012, at 06:48 AM, Stefan Behnel wrote:


>This is very weird behaviour indeed. I wouldn't know why that should
>happen. What "return as_bytes.decode('utf-8')" does is that it calls
>strlen() to see how long the string is, then it calls the UTF-8 decode
>C-API function with that.

It seems like either the strlen() or the cast through char* is the problem.

>The string that get_description() returns is allocated internally in the
>C++ object, right? So it can't suddenly die or something?

I don't think so.

>One thing I would generally suggest is to do this:
>
>    descr = self._this.get_description()
>    return descr.data()[:descr.size()].decode('utf-8')
>
>Avoids the call to strlen() by explicitly slicing the pointer. Also avoids
>needing to make sure the C string is 0-terminated.

According to

http://www.cplusplus.com/reference/string/string/c_str/

    The returned array points to an internal location with the required
    storage space for this sequence of characters plus its terminating
    null-character, but the values in this array should not be modified in the
    program and are only guaranteed to remain unchanged until the next call to
    a non-constant member function of the string object.

I believe the const char* returned by c_str() is guaranteed to be null
terminated.  AFAICT, there are no embedded NULs.  I also don't think there are
any non-constant member function calls of the parent string object getting in
the way.
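
As a quick sanity check of the strlen()-versus-size() question, here is a
minimal C++ sketch. The helper names length_via_strlen(), length_via_size()
and check_lengths() are made up for illustration and come from neither Xapian
nor Cython. The two lengths agree for ordinary text and only diverge when the
std::string holds an embedded NUL, so if they match for the description
string, strlen() itself is probably not the culprit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: length as strlen() sees it, stopping at the first NUL.
std::size_t length_via_strlen(const std::string& s) {
    return std::strlen(s.c_str());
}

// Hypothetical helper: length as std::string tracks it, NULs included.
std::size_t length_via_size(const std::string& s) {
    return s.size();
}

// Self-check covering both the plain and the embedded-NUL case.
void check_lengths() {
    const std::string plain("abc");
    const std::string embedded("ab\0cd", 5);  // 5 bytes, NUL in the middle
    assert(length_via_strlen(plain) == length_via_size(plain));
    assert(length_via_strlen(embedded) == 2);
    assert(length_via_size(embedded) == 5);
}
```

Calling check_lengths() on a copy of the real description string would show
whether the two length computations ever disagree for that data.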

Next, I tried two different implementations:

    property description:
        def __get__(self):
            # works
            descr = self._this.get_description()
            return descr.c_str()[:descr.size()].decode('utf-8')

    property destruction:
        def __get__(self):
            # broken
            as_bytes = <char *>self._this.get_description().c_str()
            return as_bytes.decode('utf-8')

The second case requires the cast or you get an error:

    xapian.cpp:1409:67: error: invalid conversion from ‘const char*’ to ‘char*’ [-fpermissive]

but I don't think that's the problem.  Looking at the generated C++ code, I
see these two different implementations:

works:

  __pyx_t_1 = ((PyObject *)PyUnicode_Decode(__pyx_v_descr.c_str(),
      __pyx_v_descr.size(), __pyx_k_1, NULL));
  if (unlikely(!__pyx_t_1)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 84;
      __pyx_clineno = __LINE__; goto __pyx_L1_error;}

broken:

  __pyx_t_1 = ((PyObject *)PyUnicode_Decode(__pyx_v_as_bytes,
      strlen(__pyx_v_as_bytes), __pyx_k_1, NULL));
  if (unlikely(!__pyx_t_1)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 91;
      __pyx_clineno = __LINE__; goto __pyx_L1_error;}

In the working case, __pyx_v_descr is a std::string, so the const char*
returned by .c_str() is passed directly to PyUnicode_Decode() without a cast.
The length is returned by std::string::size().

In the broken case, __pyx_v_as_bytes is a char* (I could not figure out how to
preserve the const char* type) and strlen() is used to find the length.

Those are the only substantive differences I could find.
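
One difference that the two call sites above do not show is object lifetime.
The following C++ sketch is not the Xapian code: Description,
get_description() and the live counter are hypothetical stand-ins, and it
assumes that get_description() returns its string by value, which is not
confirmed here. Under that assumption, the "works" shape keeps the string in a
named local while the pointer is in use, whereas the "broken" shape takes the
pointer from a temporary that is destroyed at the end of the statement.

```cpp
#include <cassert>
#include <string>

// Counts Description objects that are currently alive.
static int live = 0;

// Hypothetical stand-in for a std::string-like return value.
struct Description {
    std::string text;
    explicit Description(const char* s) : text(s) { ++live; }
    Description(const Description& other) : text(other.text) { ++live; }
    ~Description() { --live; }
    const char* c_str() const { return text.c_str(); }
};

// ASSUMPTION: returns by value, so each call yields a temporary.
Description get_description() { return Description("hello"); }

// Mirrors the "broken" shape: pointer taken from a temporary.
int live_after_pointer_from_temporary() {
    const char* as_bytes = get_description().c_str();
    (void)as_bytes;  // dangling here, so it is deliberately never read
    return live;     // the temporary is already gone
}

// Mirrors the "works" shape: result bound to a named local first.
int live_after_pointer_from_named_local() {
    Description descr = get_description();
    const char* as_bytes = descr.c_str();
    (void)as_bytes;  // valid for as long as descr is in scope
    return live;     // descr is still alive
}
```

If the real get_description() returns a reference to a string owned by the
C++ object instead, this sketch does not apply and both shapes should see the
same live buffer.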

>I wouldn't know any differences off the top of my head, except that 0.17
>has generally better support for STL containers and std::string (but that's
>unrelated to this failure). I'm planning to enable direct support for
>cpp_string.decode(...) as well, but that's not implemented yet. It would
>basically generate the verbose code above automatically.
>
>> Is this a bug or am I doing something stupid?
>
>Definitely not doing something stupid, but I have no idea why this should
>go wrong.

Okay, at least I have a few workarounds :).  I'd file a bug but I don't have
permission to file new issues.

If you have any other suggestions for ways to debug this, I'm happy to give
them a try.

Cheers,
-Barry

