penguin-wwy commented on issue #1887: URL: https://github.com/apache/fury/issues/1887#issuecomment-2431771329
Hi, I ran comparative experiments with [pybind11](https://github.com/pybind/pybind11), [nanobind](https://github.com/wjakob/nanobind), and hand-written C-API code.

- pybind11 had the worst performance, which matches my expectations. It doesn't apply any scenario-specific optimizations, and its type conversion is relatively complex. However, its code is the simplest to maintain. For code that only requires API binding, it can be written as follows:

  ```c++
  PYBIND11_MODULE(_pyutil, util_mod) {
    py::class_<fury::Buffer>(util_mod, "Buffer")
        .def(py::init<>())
        .def("own_data", &fury::Buffer::own_data)
        .def("reserve", &fury::Buffer::Reserve)
        // Lambdas adapt the Python-facing signature to the unsafe setters.
        .def("put_bool", [](fury::Buffer &self, uint32_t offset, bool v) { self.UnsafePutByte(offset, v); })
        .def("put_int8", [](fury::Buffer &self, uint32_t offset, int8_t v) { self.UnsafePutByte(offset, v); })
        .def("get_bool", &fury::Buffer::GetBool)
        .def("get_int8", &fury::Buffer::GetInt8)
        ...
        .def_static("allocate", [](uint32_t size) { return fury::AllocateBuffer(size); });
  }
  ```

- nanobind's performance is slightly better than Cython's, and its binding style is much the same as pybind11's. However, it only supports Python 3.8+.
- Hand-written C-API code can outperform Cython if it is optimized for specific Python versions (especially >= 3.11). However, it is detrimental to the goal of keeping the code easy to maintain. For example:
  - Cython generates redundant checks when creating `get_bool`, and because of a poorly chosen `ml_flags` value (it should be `METH_O` rather than `METH_FASTCALL | METH_KEYWORDS`), argument parsing also adds overhead.

    ```c++
    // METH_O: the single argument arrives directly as a PyObject*.
    static PyObject *
    cbuffer_get_bool(CBufferObject *self, PyObject *offset)
    {
      long off_val = PyLong_AsLong(offset);
      assert(off_val <= UINT32_MAX);
      return self->buffer->GetBool(off_val) ? Py_NewRef(Py_True) : Py_NewRef(Py_False);
    }

    static PyMethodDef cbuffer_methods[] = {
      {"get_bool", (PyCFunction)cbuffer_get_bool, METH_O, nullptr},
      ...
      {NULL, NULL} /* sentinel */
    };
    ```

Additionally, after analyzing the Cython code, I found that some performance gains are possible by calling certain C-API functions directly in the .pyx file. The idea is to use higher-level knowledge to keep Cython from generating certain guard code. I will try to submit these optimizations as a PR in the future.
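
A side note on the `get_bool` sketch above: `assert` is compiled out when `NDEBUG` is defined, and the check `off_val <= UINT32_MAX` also lets negative values through, including the `-1` that `PyLong_AsLong` returns on failure. A minimal sketch of an explicit range check follows. `checked_offset` is a hypothetical helper for illustration only, not part of Fury or the CPython API, and it takes a `long long` on the assumption that the check should behave the same on platforms where `long` is 32 bits.

```c++
#include <cstdint>

// Hypothetical helper (not part of Fury or CPython): checks that a signed
// value fits into a uint32_t offset. Returns true and writes *out on success;
// returns false and leaves *out untouched otherwise.
inline bool checked_offset(long long value, uint32_t *out) {
  if (value < 0 || value > static_cast<long long>(UINT32_MAX)) {
    return false;  // negative, or too large for a 32-bit offset
  }
  *out = static_cast<uint32_t>(value);
  return true;
}
```

In a real binding the input would presumably come from `PyLong_AsLongLong`, with `PyErr_Occurred()` checked for a conversion failure first. On a `false` result the method could set an exception, for example with `PyErr_SetString(PyExc_ValueError, ...)`, and return `NULL`.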
