[pypy-dev] [cpyext] partial fake PEP393 implementation to provide access to single unicode characters in strings

Stefan Behnel Sat, 14 Apr 2012 09:45:17 -0700

Hi,

PEP393 (the new Unicode type in Py3.3) defines a rather useful C interface
to the characters of a Unicode string. I think it would be cool if
cpyext provided that, so that access to single characters no longer requires
copying the unicode buffer into C space.
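
As a rough illustration of the kind of access meant here, the following
self-contained C sketch reads one character from a buffer whose element
width is selected by a "kind" value, similar in spirit to CPython's
PyUnicode_READ(kind, data, index). The names sketch_kind and sketch_read
are hypothetical: they are not part of CPython or cpyext, and the attached
patch does not work this way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the 1-, 2- and 4-byte kinds of PEP 393. */
enum sketch_kind { SKETCH_1BYTE = 1, SKETCH_2BYTE = 2, SKETCH_4BYTE = 4 };

/* Read the character at `index` from a buffer whose element width is
   given by `kind`. No temporary string object is created and no bounds
   check is done; the caller must pass a valid index. */
static uint32_t sketch_read(enum sketch_kind kind, const void *data,
                            size_t index)
{
    assert(data != NULL);
    switch (kind) {
    case SKETCH_1BYTE:
        return ((const uint8_t *)data)[index];
    case SKETCH_2BYTE:
        return ((const uint16_t *)data)[index];
    default: /* SKETCH_4BYTE */
        return ((const uint32_t *)data)[index];
    }
}
```

The point of the sketch is only that a single read becomes an array
lookup plus a cast, without copying the whole buffer first.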


I attached an untested (and likely non-working) patch that adds the most
important parts of it. The implementation does not care about non-BMP
characters, which (if I'm not mistaken) are encoded as surrogate pairs in
PyPy. Apart from that, the functions behave like their CPython
counterparts, which means that the implementation shouldn't get in the way
of a future real PEP393 implementation.
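
To make the non-BMP limitation concrete, here is a small self-contained C
sketch. It assumes the string is stored as UTF-16 code units, as guessed
above for PyPy, and shows how a surrogate pair makes the code-unit length
differ from the code-point length. The helper names sketch_decode_utf16
and sketch_count_code_points are hypothetical and are not part of the
patch or of cpyext.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode the code point that starts at units[i]. Sets *width to the
   number of code units consumed (2 for a valid surrogate pair, else 1).
   An unpaired surrogate is returned unchanged. */
static uint32_t sketch_decode_utf16(const uint16_t *units, size_t len,
                                    size_t i, size_t *width)
{
    assert(units != NULL && width != NULL && i < len);
    uint16_t hi = units[i];
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < len) {
        uint16_t lo = units[i + 1];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            *width = 2;
            return 0x10000u + ((uint32_t)(hi - 0xD800) << 10)
                            + (uint32_t)(lo - 0xDC00);
        }
    }
    *width = 1;
    return hi;
}

/* Count code points rather than code units. */
static size_t sketch_count_code_points(const uint16_t *units, size_t len)
{
    size_t count = 0, i = 0, width = 1;
    while (i < len) {
        (void)sketch_decode_utf16(units, len, i, &width);
        i += width;
        count++;
    }
    return count;
}
```

With the patch as attached, a string containing one such pair would
report the code-unit count, and reading at the index of either half would
return a lone surrogate rather than the combined code point.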

What do you think?

I have no idea whether the way the index access is done in
PyUnicode_READ_CHAR() is in any way efficient; it would be good if it were.
Specifically, the intention is to avoid creating a 1-character unicode
string copy before taking its ord(). Does this happen automatically, or is
there a way to make sure it does?

Stefan

diff -r 886d352cf776 pypy/module/cpyext/include/unicodeobject.h
--- a/pypy/module/cpyext/include/unicodeobject.h	Sat Apr 14 20:04:30 2012 +1000
+++ b/pypy/module/cpyext/include/unicodeobject.h	Sat Apr 14 18:42:31 2012 +0200
@@ -18,6 +18,17 @@
 
 #define Py_UNICODE_REPLACEMENT_CHARACTER ((Py_UNICODE) 0xFFFD)
 
+/* PEP393-like macros for Unicode string objects */
+enum PyUnicode_Kind {
+/* String contains only wstr byte characters.  Currently always true in cpyext.  */
+    PyUnicode_WCHAR_KIND = 0,
+/* CPython-only return values of the PyUnicode_KIND() macro: */
+    PyUnicode_1BYTE_KIND = 1,
+    PyUnicode_2BYTE_KIND = 2,
+    PyUnicode_4BYTE_KIND = 4
+};
+#define PyUnicode_KIND(obj) (PyUnicode_WCHAR_KIND)
+
 typedef struct {
     PyObject_HEAD
     Py_UNICODE *buffer;
diff -r 886d352cf776 pypy/module/cpyext/test/test_unicodeobject.py
--- a/pypy/module/cpyext/test/test_unicodeobject.py	Sat Apr 14 20:04:30 2012 +1000
+++ b/pypy/module/cpyext/test/test_unicodeobject.py	Sat Apr 14 18:42:31 2012 +0200
@@ -485,3 +485,10 @@
                 api.PyUnicode_Splitlines(w_str, 0)))
         assert r"[u'a\n', u'b\n', u'c\n', u'd']" == space.unwrap(space.repr(
                 api.PyUnicode_Splitlines(w_str, 1)))
+
+    def test_fake_pep393(self, space, api):
+        w_str = space.wrap(u"abc\u0111d")
+        assert api.PyUnicode_GET_LENGTH(w_str) == 5
+        assert api.PyUnicode_READ_CHAR(w_str, 0) == ord(u'a')
+        assert api.PyUnicode_READ_CHAR(w_str, 3) == 0x111
+        assert api.PyUnicode_READ_CHAR(w_str, 4) == ord(u'd')
diff -r 886d352cf776 pypy/module/cpyext/unicodeobject.py
--- a/pypy/module/cpyext/unicodeobject.py	Sat Apr 14 20:04:30 2012 +1000
+++ b/pypy/module/cpyext/unicodeobject.py	Sat Apr 14 18:42:31 2012 +0200
@@ -180,6 +180,31 @@
     """Get the maximum ordinal for a Unicode character."""
     return runicode.UNICHR(runicode.MAXUNICODE)
 
+@cpython_api([PyObject], Py_ssize_t, error=CANNOT_FAIL)
+def PyUnicode_GET_LENGTH(space, w_obj):
+    """Return the length of the Unicode string, in code points.
+    w_obj has to be a Unicode object in the 'canonical' representation
+    (not checked).
+
+    PyPy: Currently no 'canonical' (PEP393) representation involved.
+    Returned length is in code units (i.e. it counts any
+    surrogate pairs as two characters).
+    """
+    assert isinstance(w_obj, unicodeobject.W_UnicodeObject)
+    return space.len_w(w_obj)
+
+@cpython_api([PyObject, Py_ssize_t], lltype.Unsigned, error=CANNOT_FAIL)
+def PyUnicode_READ_CHAR(space, w_obj, index):
+    """Read a character from a Unicode object w_obj, which must be in
+    the 'canonical' representation. This is less efficient than
+    PyUnicode_READ() if you do multiple consecutive reads.
+
+    PyPy: Currently no 'canonical' (PEP393) representation involved.
+    """
+    assert isinstance(w_obj, unicodeobject.W_UnicodeObject)
+    assert 0 <= index < space.len_w(w_obj)
+    return rffi.cast(lltype.Unsigned, ord(space.unicode_w(w_obj)[index]))
+
 @cpython_api([PyObject], rffi.CCHARP, error=CANNOT_FAIL)
 def PyUnicode_AS_DATA(space, ref):
     """Return a pointer to the internal buffer of the object. o has to be a
