On Tue, Jun 30, 2020 at 13:53, M.-A. Lemburg <m...@egenix.com> wrote:
> > I would prefer to analyze the list on a case by case basis. I don't
> > think that it's useful to expose every single encoding supported by
> > Python as a C function.
>
> (...)
> This does not mean we have to give up the symmetry in the C API,
> or that the encoding APIs are now suddenly useless. It only means
> that we have to replace Py_UNICODE with one of the supported data
> types for storing Unicode.

Let's agree to disagree :-) I don't think that completeness is a good
rationale for designing the C API. The C API is too large, we have to
make it smaller. A specialized function, like PyUnicode_AsUTF8String(),
can be justified by different reasons:

* It is a very common use case and so it helps to write C extensions
* It is significantly faster than the alternative generic function

In C, you can execute arbitrary Python code by calling methods on
Python str objects. For example, "abc".encode("utf-8", "surrogateescape")
in Python becomes PyObject_CallMethod(obj, "encode", "ss", "utf-8",
"surrogateescape") in C. Well, there is already a more specialized, yet
still generic, function: PyUnicode_AsEncodedString().

We must not add a C API function for every single Python feature,
otherwise it would be too expensive to maintain, and it would become
impossible for other Python implementations to implement the full
C API. Well, even today, PyPy already only implements a small subset of
the C API.

> Since the C world has adopted wchar_t for this purpose, it's the
> natural choice.

In my experience, in C extensions, there are two kinds of data:

* bytes is used as a "char*": array of bytes
* Unicode is used as a Python object

For the very rare cases involving wchar_t*, PyUnicode_FromWideChar()
can be used. I don't think that performance justifies duplicating each
function, once for a Python str object, once for wchar_t*.

I mostly saw code involving wchar_t* to initialize Python. But this
code was wrong since it used PyUnicode functions *before* Python was
initialized. That's bad and can now crash in recent Python versions.

The new PEP 587 has a different design and avoids Python objects and
anything related to the Python runtime:
https://docs.python.org/dev/c-api/init_config.html#c.PyConfig_SetString

Moreover, CPython implements functions taking a wchar_t* string by
calling PyUnicode_FromWideChar() internally...

> PyUnicode_AsEncodedString() converts Unicode objects to a
> bytes object.
> This is not a symmetric replacement for the
> PyUnicode_Encode*() APIs, since those go from Py_UNICODE to
> a bytes object.

I don't see which feature is missing from PyUnicode_AsEncodedString().
If it's about parameters specific to some encodings like UTF-7, I
already replied in another email.

> Since the C API is not only meant to be used by the CPython interpreter,
> we should stick to standards rather than expecting the world to adapt
> to our implementations. This also makes the APIs future proof, e.g.
> in case we make another transition from the current hybrid internal
> data type for Unicode towards UTF-8 buffers as internal data type.

Do you know of C extensions in the wild which are using wchar_t* on
purpose? I haven't seen such a C extension yet.

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/YW4FDLEHFXA6NKWZYVFOBNGPE33VQJ7U/
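
For readers who want to try the generic path discussed above without
building an extension, here is a minimal sketch that calls
PyUnicode_AsEncodedString() and PyUnicode_FromWideChar() through
ctypes.pythonapi. It assumes CPython (other implementations may not
provide ctypes.pythonapi), and encode_generic() and from_wide() are
hypothetical helper names chosen for illustration, not part of any API.

```python
import ctypes

# Assumption: running on CPython, where ctypes.pythonapi exposes the C API.
api = ctypes.pythonapi

# PyObject *PyUnicode_AsEncodedString(PyObject *unicode,
#                                     const char *encoding,
#                                     const char *errors)
api.PyUnicode_AsEncodedString.argtypes = [
    ctypes.py_object, ctypes.c_char_p, ctypes.c_char_p]
api.PyUnicode_AsEncodedString.restype = ctypes.py_object

# PyObject *PyUnicode_FromWideChar(const wchar_t *wstr, Py_ssize_t size)
api.PyUnicode_FromWideChar.argtypes = [ctypes.c_wchar_p, ctypes.c_ssize_t]
api.PyUnicode_FromWideChar.restype = ctypes.py_object


def encode_generic(text, encoding, errors="strict"):
    """Hypothetical helper: a single generic C function handles any codec."""
    return api.PyUnicode_AsEncodedString(
        text, encoding.encode("ascii"), errors.encode("ascii"))


def from_wide(text):
    """Hypothetical helper: build a str object from a wchar_t* buffer.

    ctypes converts `text` to a NUL-terminated wchar_t* first;
    size -1 asks the C function to compute the length itself.
    """
    return api.PyUnicode_FromWideChar(text, -1)
```

With these helpers, encode_generic("abc", "utf-8", "surrogateescape")
should return b"abc", the same result as
"abc".encode("utf-8", "surrogateescape") in pure Python, and the same
call works for other codecs without any encoding-specific C function.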