Re: [I] [VL] Story: Improve PySpark support [incubator-gluten]


zhouyuan commented on issue #9224:
URL: https://github.com/apache/incubator-gluten/issues/9224#issuecomment-2805121870

Hi @zjuwangg

In general, there are two parts I have learned about so far. I will update
this issue as I make more progress.

- PySpark itself: ideally Gluten should support all DataFrame operators and
fall back for unsupported cases. However, we do see failures reported in some
corner cases, which is due to a lack of tests covering them. Based on previous
experience, it would help a lot to port the following tests to Gluten; the
issues will surface once those tests run:
https://github.com/apache/spark/tree/master/python/pyspark/sql/tests

- The second part is Pandas/Arrow support. It is commonly used in AI/ML
workloads for data cleanup, since pandas is so widely used there. Gluten
should already have the general support enabled. The TODO list should include
more tests and cover the rest of the APIs:
  - Arrow on pandas
    https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html
  - Using Arrow as the transformer layer
  - Pandas vectorized UDFs
    https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#pandas-udfs-a-k-a-vectorized-udfs
  - Arrow Python UDFs
    https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#arrow-python-udfs
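
As a rough illustration of the kind of test that could be ported, here is a
minimal sketch that exercises the three Arrow/pandas paths listed above and
compares each result with a plain-Python reference. It is only a sketch under
assumptions: PySpark 3.5 or later (for `useArrow=True`), pandas and pyarrow
installed, and a `spark` session that was already started with Gluten enabled.
The helper names `expected_plus_one` and `check_arrow_pandas_paths` are
hypothetical and not part of Spark or Gluten.

```python
# Sketch only: assumes PySpark >= 3.5, pandas and pyarrow are installed, and
# that `spark` was started with Gluten enabled. Helper names are hypothetical.


def expected_plus_one(values):
    """Plain-Python reference result to compare the UDF outputs against."""
    return [v + 1 for v in values]


def check_arrow_pandas_paths(spark, values=(1, 2, 3)):
    """Run the three Arrow/pandas paths and compare with the reference."""
    # Imported lazily so this file can be loaded without PySpark present.
    import pandas as pd
    from pyspark.sql.functions import col, pandas_udf, udf

    values = list(values)
    expected = expected_plus_one(values)

    # 1. Arrow on pandas: Arrow-backed conversion in both directions.
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    df = spark.createDataFrame(pd.DataFrame({"x": values}))
    assert sorted(df.toPandas()["x"].tolist()) == sorted(values)

    # 2. Pandas vectorized UDF: receives and returns a pandas Series.
    @pandas_udf("long")
    def pandas_plus_one(s: pd.Series) -> pd.Series:
        return s + 1

    # 3. Arrow Python UDF: row-at-a-time function with Arrow serialization.
    @udf(returnType="long", useArrow=True)
    def arrow_plus_one(x):
        return x + 1

    for fn in (pandas_plus_one, arrow_plus_one):
        result = df.select(fn(col("x")).alias("y"))
        # Print the physical plan so offloaded vs. fallback operators
        # can be inspected by eye.
        result.explain()
        got = sorted(row["y"] for row in result.collect())
        assert got == sorted(expected), (fn, got)
```

With a session in hand, `check_arrow_pandas_paths(spark)` either returns
quietly or raises an AssertionError naming the UDF whose output differed, and
the printed plans give a first hint of where a fallback happened.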

thanks, -yuan

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
