gaogaotiantian commented on PR #53391:
URL: https://github.com/apache/spark/pull/53391#issuecomment-3634972665

   So I'm not that familiar with the standard itself - I don't see that the 
`__dataframe__` standard itself is being deprecated. `pandas` is deprecating its 
implementation, yes. However, in any case, I think it is helpful to separate the 
two implementations. They do not rely on each other, and each of them requires 
plenty of discussion.
   
   From the Arrow documentation, the Arrow PyCapsule Interface is still 
experimental. I'm not sure what the policy is for depending on features at that 
stage.
   
   For the implementation itself, `_get_arrow_array_partition_stream` is a bit 
difficult to follow. It seems like the function transforms a 
`pyarrow.RecordBatch` into another `pyarrow.RecordBatch`. I don't quite 
understand why it needs to convert the data through different formats only to 
yield the same type as the input. Maybe that's due to my lack of knowledge of 
`arrow`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

