gaogaotiantian commented on PR #53391: URL: https://github.com/apache/spark/pull/53391#issuecomment-3634568417
Okay, I think the title is a bit misleading. The PR is not about `PyCapsule` (which is a CPython concept for passing raw pointers around); it's about implementing `__dataframe__` and `__arrow_c_stream__` for Spark DataFrames so they can be converted directly to other dataframes. Also, I believe `__dataframe__` and `__arrow_c_stream__` are two different and largely unrelated protocols that we can treat separately.

Before digging too far into the code details, I think we need to discuss the general direction. If we are going to implement either protocol, we probably want to do it right. The current implementation basically ignores a lot of the required methods and implements only the minimum needed to make `from_dataframe` work. We can start with that, but we need a commitment to make it correct eventually.

Then about Arrow. Do we want Spark to rely on undocumented Arrow classes in order to work? Both `_PyArrowColumn` and `_PyArrowDataFrame` are undocumented (they are not very actively changed, though) and subject to change at any time. And to take one more step back: do we even want `pyarrow` to be a hard dependency? I think there were discussions about this when arrow_by_default was turned on, but that path can fall back to the old implementation. What about this one? If I understand the protocol correctly, you don't need Arrow at all to implement the `__dataframe__` protocol, so maybe that's a direction we should consider. Eventually we may still need some underlying library to hold the buffers, but the current implementation seems a bit unnecessary to me (I guess it's a proof of concept?).

Anyway, this is a very interesting feature, but also a big commitment. Users will expect it to work properly if we ship it, so I think we need to give it more thought before moving forward.
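
To make the "required methods" point concrete, here is a rough sketch of a gap check against the dataframe interchange protocol. It is not code from the PR: `MinimalFrame` and `missing_members` are hypothetical names, the member list is taken from the interchange spec as best understood and should be verified against it, and nothing here moves any data. It only illustrates that the protocol surface can be written without importing `pyarrow`, and how much a minimum-viable implementation can leave out.

```python
# Members the interchange spec expects on the object returned by
# `__dataframe__()`. Assumption: list transcribed from the spec;
# `metadata` is a property, the rest are methods.
REQUIRED_DATAFRAME_MEMBERS = (
    "__dataframe__", "metadata", "num_columns", "num_rows", "num_chunks",
    "column_names", "get_column", "get_column_by_name", "get_columns",
    "select_columns", "select_columns_by_name", "get_chunks",
)


class MinimalFrame:
    """Hypothetical partial implementation backed by plain Python lists."""

    def __init__(self, columns):
        # Mapping of column name -> list of values; no Arrow involved.
        self._columns = dict(columns)

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        # The spec lets an object return itself as the interchange object.
        return self

    def num_columns(self):
        return len(self._columns)

    def num_rows(self):
        # Length of the first column, or 0 when there are no columns.
        return len(next(iter(self._columns.values()), []))

    def column_names(self):
        return list(self._columns)


def missing_members(obj):
    """Return the required members the interchange object does not define."""
    interchange = obj.__dataframe__()
    # Look on the type so properties are not evaluated during the check.
    return [
        name for name in REQUIRED_DATAFRAME_MEMBERS
        if not hasattr(type(interchange), name)
    ]


frame = MinimalFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
print(missing_members(frame))
```

A check like this says nothing about whether the implemented methods behave correctly (column buffers, dtypes, and chunking are where most of the work is), but it gives a quick view of how far a partial implementation is from the full interface.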
