On Fri, Feb 6, 2015 at 10:27 AM, Francis Giraldeau
<francis.girald...@gmail.com> wrote:
> Instead, I access members directly:
> char *str = PyUnicode_DATA(frame->f_code->co_filename);
> size_t len = PyUnicode_GET_DATA_SIZE(frame->f_code->co_filename);
>
> Is it safe to assume that unicode objects co_filename and co_name are always
> UTF-8 data for loaded code? I looked at the PyTokenizer_FromString() and it
> seems to convert everything to UTF-8 upfront, and I would like to make sure
> this assumption is valid.

I don't think you should be using PyUnicode_GET_DATA_SIZE with
PyUnicode_DATA. One comes from the old (Py_UNICODE) API and the other
from the new one, so the size need not describe the buffer you're
pointing at. If you want a raw, no-allocation look at the data, you'd
need to check PyUnicode_KIND and then read Latin-1, UCS-2, or UCS-4 data
accordingly:

https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_1BYTE_DATA

(By the way, I don't think the name "UCS-1" is part of the Unicode spec.
But it's an obvious parallel to UCS-2 and UCS-4.)

Getting UTF-8 data out of the structure, if it had indeed been cached,
ought to be possible. But I can't see a documented function or macro for
doing it. Is there a way? Reaching into the structure and grabbing the
utf8 and utf8_length members seems like a bad idea.

ChrisA
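
To make the "check the kind, then read" idea concrete, here is a rough standalone sketch. It is a model only: struct flex_str, flex_read and flex_byte_size are hypothetical names I made up so that it compiles without Python.h. In real code the corresponding pieces would be PyUnicode_KIND, PyUnicode_DATA, PyUnicode_GET_LENGTH and PyUnicode_READ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for PyUnicode_1BYTE_KIND, PyUnicode_2BYTE_KIND
   and PyUnicode_4BYTE_KIND. As in CPython, each value is the width of
   one code unit in bytes. */
enum str_kind { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 4 };

/* Hypothetical model of a string object: a kind tag, a length in code
   points, and a raw buffer whose element width depends on the kind. */
struct flex_str {
    enum str_kind kind;
    size_t length;
    const void *data;
};

/* Read code point i by dispatching on the kind, roughly the job that
   PyUnicode_READ(kind, data, index) does for a real str object. */
static uint32_t flex_read(const struct flex_str *s, size_t i)
{
    assert(i < s->length);
    switch (s->kind) {
    case KIND_1BYTE: return ((const uint8_t *)s->data)[i];   /* Latin-1 */
    case KIND_2BYTE: return ((const uint16_t *)s->data)[i];  /* UCS-2 */
    case KIND_4BYTE: return ((const uint32_t *)s->data)[i];  /* UCS-4 */
    }
    return 0xFFFFFFFFu; /* unknown kind */
}

/* Size of the raw buffer in bytes: length times code unit width.
   This is not a UTF-8 length. */
static size_t flex_byte_size(const struct flex_str *s)
{
    return s->length * (size_t)s->kind;
}
```

Note what this implies for the original question: a 1-byte-kind string holds Latin-1, so U+00E9 is stored as the single byte 0xE9, whereas UTF-8 would need the two bytes 0xC3 0xA9. The raw buffer only coincides with UTF-8 when the string is pure ASCII.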