Re: [PR] [SPARK-54337][PS] Add support for PyCapsule to Pyspark [spark]

via GitHub Tue, 09 Dec 2025 14:33:55 -0800


gaogaotiantian commented on PR #53391:
URL: https://github.com/apache/spark/pull/53391#issuecomment-3634568417


   Okay, I think the title is a bit misleading - the PR is not about `PyCapsule` 
(which is a CPython concept for passing raw pointers around); it's about 
implementing `__dataframe__` and `__arrow_c_stream__` for the Spark DataFrame so 
it can be converted directly to other dataframes.
   
   Also, I believe `__dataframe__` and `__arrow_c_stream__` are two different 
and almost unrelated protocols that we can treat separately.
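
   To illustrate that the two entry points are independent, here is a minimal 
sketch. `supported_protocols`, `OnlyInterchange` and `OnlyArrowStream` are 
hypothetical names made up for this example, not anything in Spark or the PR; 
the method signatures follow my reading of the two specs.

```python
# Hypothetical helper: reports which of the two entry points an object exposes.
def supported_protocols(obj):
    names = ("__dataframe__", "__arrow_c_stream__")
    return [name for name in names if callable(getattr(obj, name, None))]


class OnlyInterchange:
    """Toy class exposing just the dataframe interchange entry point."""

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        raise NotImplementedError("sketch only")


class OnlyArrowStream:
    """Toy class exposing just the Arrow C stream entry point."""

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("sketch only")
```

   A consumer only looks for the one method it needs, so an object can offer 
either protocol without the other.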
   
   Before digging into the code details too much, I think we need to discuss 
the general direction.
   
   If we are going to implement either protocol, we probably want to do it 
right. The current implementation basically ignores many of the required 
methods and implements only the minimum needed to make `from_dataframe` work - 
we can start with that, but we need a commitment to make it correct 
eventually.
   
   Then about Arrow. Do we want Spark to rely on undocumented Arrow classes in 
order to work? Both `_PyArrowColumn` and `_PyArrowDataFrame` are undocumented 
(not very active though) and subject to change at any time. And taking an extra 
step back - do we even want `pyarrow` to be a hard dependency? I think there 
were discussions about it when turning on arrow_by_default, but that can fall 
back to the old implementation. What about this feature?
   
   If I understand the protocol correctly, you don't need Arrow at all to 
implement the `__dataframe__` protocol - maybe that's a direction we should 
consider. Eventually, maybe we still need some underlying library to hold the 
buffers, but the current implementation seems a bit unnecessary to me (I guess 
it's a proof of concept?).
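
   As a rough illustration of the "no Arrow required" point, here is a sketch 
of an interchange object built only on the standard library. `ToyBuffer`, 
`ToyColumn` and `ToyFrame` are hypothetical names, the integer codes (dtype 
kind 2 for float, null type 0 for non-nullable, device 1 for CPU) come from my 
reading of the interchange spec, and I have not checked it against real 
consumers such as pandas `from_dataframe`. It only handles single-chunk, 
non-nullable float64 columns.

```python
from array import array


class ToyBuffer:
    """Contiguous CPU memory backed by a stdlib array."""

    def __init__(self, data):
        self._data = data  # keep the array alive while the pointer is in use

    @property
    def bufsize(self):
        return len(self._data) * self._data.itemsize  # size in bytes

    @property
    def ptr(self):
        return self._data.buffer_info()[0]  # address of the first element

    def __dlpack_device__(self):
        return (1, None)  # assumed: 1 is the CPU device type

    def __dlpack__(self):
        raise NotImplementedError("DLPack export is left out of this sketch")


class ToyColumn:
    """Single-chunk, non-nullable float64 column."""

    # Plain class attributes stand in for the spec's read-only properties.
    offset = 0
    null_count = 0
    metadata = {}
    dtype = (2, 64, "g", "=")   # (kind=float, bit width, Arrow format, endianness)
    describe_null = (0, None)   # assumed: 0 means non-nullable

    def __init__(self, values):
        self._data = array("d", values)

    def size(self):
        return len(self._data)

    @property
    def describe_categorical(self):
        raise TypeError("not a categorical column")

    def num_chunks(self):
        return 1

    def get_chunks(self, n_chunks=None):
        yield self

    def get_buffers(self):
        data = (ToyBuffer(self._data), self.dtype)
        return {"data": data, "validity": None, "offsets": None}


class ToyFrame:
    """Dict of equal-length float columns exposing `__dataframe__`."""

    metadata = {}

    def __init__(self, columns):
        self._columns = {name: ToyColumn(vals) for name, vals in columns.items()}

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        return self  # this object doubles as its own interchange object

    def num_columns(self):
        return len(self._columns)

    def num_rows(self):
        return next(iter(self._columns.values())).size() if self._columns else 0

    def num_chunks(self):
        return 1

    def column_names(self):
        return list(self._columns)

    def get_column(self, i):
        return self._columns[self.column_names()[i]]

    def get_column_by_name(self, name):
        return self._columns[name]

    def get_columns(self):
        return list(self._columns.values())

    def select_columns_by_name(self, names):
        out = ToyFrame({})
        out._columns = {name: self._columns[name] for name in names}
        return out

    def select_columns(self, indices):
        names = self.column_names()
        return self.select_columns_by_name([names[i] for i in indices])

    def get_chunks(self, n_chunks=None):
        yield self
```

   Even this toy version needs something to own the memory (here 
`array.array`), which is the "underlying library to hold the buffers" question 
in miniature.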
   
   Anyway, this is a very interesting feature, but also a big commitment. Users 
will expect this to work properly if we do this, so I think we need to give it 
more thought before going forward.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


