zhouyuan commented on issue #9224:
URL: https://github.com/apache/incubator-gluten/issues/9224#issuecomment-2805121870

   Hi @zjuwangg 
   
   In general, there are two parts to this, based on what I have learned so far.
   I will update this issue as I make more progress.
    
   - PySpark itself: ideally Gluten should support all DataFrame operators and
   fall back in unsupported cases. However, we do see failures in some corner
   cases, because Gluten lacks tests for them. Based on previous experience,
   porting the PySpark SQL tests to Gluten would help a lot, since it would
   surface these issues:
   https://github.com/apache/spark/tree/master/python/pyspark/sql/tests
    
    
   - The second part is Pandas/Arrow support. It is commonly used in AI/ML
   workloads for data cleanup, since pandas is so widely used. Gluten should
   already have the general support enabled. The TODO list should include
   more tests and coverage of the remaining APIs.
     - Arrow on pandas:
       https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html
     - Using Arrow as the transformer layer
     - Pandas vectorized UDFs:
       https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#pandas-udfs-a-k-a-vectorized-udfs
     - Arrow Python UDFs:
       https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#arrow-python-udfs
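
For the first item, ported tests could also check whether a query was actually offloaded or quietly fell back. Below is a rough sketch of such a check, not something that exists in Gluten today. The helper names `split_plan_operators` and `executed_plan_text` are hypothetical. It assumes that offloaded operators show up in the physical plan dump with names ending in `Transformer` (for example `FilterExecTransformer`), and it reads the plan through PySpark's private `_jdf` handle, which only works with classic (non-Connect) sessions.

```python
import re

# Leading tree-drawing characters and stage markers such as "+- ", ":  ",
# "*(1) " or "^(1) " that can precede an operator name in a plan dump.
_PREFIX = r"[\s:+\-|*^()\d]*"
_NODE = re.compile(rf"^{_PREFIX}([A-Za-z]\w*)")


def split_plan_operators(plan_text):
    """Return (transformer_ops, other_ops) found in a physical plan dump.

    Assumption: operators offloaded by Gluten have names ending in
    "Transformer"; everything else is either a vanilla Spark operator or a
    helper node and needs a manual look.
    """
    transformer_ops, other_ops = [], []
    for line in plan_text.splitlines():
        match = _NODE.match(line)
        if match is None:
            continue  # headers such as "== Physical Plan ==" or blank lines
        name = match.group(1)
        if name.endswith("Transformer"):
            transformer_ops.append(name)
        else:
            other_ops.append(name)
    return transformer_ops, other_ops


def executed_plan_text(df):
    """Plan dump for a classic (non-Connect) PySpark DataFrame.

    Relies on the private `_jdf` handle, so treat it as test-only code.
    """
    return df._jdf.queryExecution().executedPlan().toString()
```

In a test this would be used as `split_plan_operators(executed_plan_text(df))`, followed by an assertion on the operators the test cares about. With adaptive query execution enabled, the dump taken before an action may not show the final plan, so it is safer to run the query first.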
   
   thanks, -yuan


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

