Hi Barry, Barry Warsaw, 06.07.2012 00:29: > I'm currently exploring using Cython to provide new Python 3 bindings for > Xapian. I'm pretty much a Cython n00b but the documentation is great, and I > was able to pretty quickly get something really simple working. I'm using > Cython 0.15 on Ubuntu 12.04 with Python 3.2 and Xapian 1.2.12. I've pushed my > current branch to github: > > https://github.com/warsaw/xapian/tree/py3/xapian-bindings/python3 > > There you'll see my xapianlib.pxd and xapian.pyx files. > > Where I'm seeing some odd behavior is in trying to expose the > Xapian::TermGenerator.get_description() method. This returns a std::string > and I'm trying to create a `description` property that coerces this to unicode > before returning it to Python. Here's the relevant code: > > -----snip snip----- > cdef class TermGenerator: > cdef xapianlib.TermGenerator * _this > > def __cinit__(self): > self._this = new xapianlib.TermGenerator() > > def __dealloc__(self): > del self._this > > property description: > def __get__(self): > as_bytes = <char *>self._this.get_description().c_str() > #return as_bytes > return as_bytes.decode('utf-8') > -----snip snip----- > > I'm sure I'm doing something naive or stupid, but the problem is that > as written above, .description is returning nonsense. > > % python > Python 3.2.3 (default, May 3 2012, 15:51:42) > [GCC 4.6.3] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> import xapian > >>> tg = xapian.TermGenerator() > >>> tg.description > '\x00\x00\x00\x00_\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' > > If instead, I return just the bytes object (i.e. 
what > .get_description().c_str() returns), then I get more like what I expect. > > % python > Python 3.2.3 (default, May 3 2012, 15:51:42) > [GCC 4.6.3] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> import xapian > >>> tg = xapian.TermGenerator() > >>> tg.description > b'Xapian::TermGenerator(stem=Xapian::Stem(none), > doc=Document(Xapian::Document::Internal()), termpos=0)' > >>> tg.description.decode('utf-8') > 'Xapian::TermGenerator(stem=Xapian::Stem(none), > doc=Document(Xapian::Document::Internal()), termpos=0)'
This is very weird behaviour indeed. I wouldn't know why that should happen.

What "return as_bytes.decode('utf-8')" does is that it calls strlen() to see
how long the string is, then it calls the UTF-8 decode C-API function with
that length.

The string that get_description() returns is allocated internally in the C++
object, right? So it can't suddenly die or something?

One thing I would generally suggest is to do this:

    descr = self._this.get_description()
    return descr.data()[:descr.size()].decode('utf-8')

This avoids the call to strlen() by explicitly slicing the pointer. It also
avoids needing to make sure the C string is 0-terminated.

> I looked at the generated code in the first example, but didn't really see
> anything obvious. There are no NULs in the char* description afaict. I
> haven't yet tested Cython 0.16 or 0.17 to see if this behaves differently.

I wouldn't know of any differences off the top of my head, except that 0.17
has generally better support for STL containers and std::string (but that's
unrelated to this failure).

I'm planning to enable direct support for cpp_string.decode(...) as well, but
that's not implemented yet. It would basically generate the verbose code
above automatically.

> Is this a bug or am I doing something stupid?

Definitely not doing something stupid, but I have no idea why this should go
wrong.

Stefan
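
For illustration only: the difference between letting the decoder find the
length via strlen() and passing the length explicitly (as in the
descr.data()[:descr.size()] suggestion) can be sketched in plain Python with
ctypes. This is not Cython's generated code and does not touch Xapian; the
payload and the variable names are made up for the example, and the embedded
NUL byte is an assumption chosen to make the two reads differ.

```python
import ctypes

# Hypothetical stand-in for the buffer behind a std::string. The payload
# deliberately contains an embedded NUL byte (an assumption for this demo).
payload = b"termpos=0\x00trailing"

# create_string_buffer() copies the payload and appends a terminating NUL,
# so the buffer is len(payload) + 1 bytes long.
buf = ctypes.create_string_buffer(payload)

# strlen()-style read: no size given, so reading stops at the first NUL.
by_strlen = ctypes.string_at(buf)

# Explicit-length read: the size is passed in, so no NUL scan takes place.
by_size = ctypes.string_at(buf, len(payload))

print(repr(by_strlen.decode("utf-8")))
print(repr(by_size.decode("utf-8")))
```

The explicit-length read returns the whole payload, while the strlen()-style
read depends entirely on where the first NUL byte happens to sit in memory.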