zhengruifeng commented on code in PR #40896:
URL: https://github.com/apache/spark/pull/40896#discussion_r1182025735


##########
python/pyspark/sql/udf.py:
##########
@@ -249,6 +259,38 @@ def __init__(
         self.evalType = evalType
         self.deterministic = deterministic
 
+        # Since 3.5.0, we introduce an internal optional function attribute '_is_barrier',
+        # which is dedicated to integration with external ML training frameworks, including
+        # PyTorch and XGBoost.
+        # It indicates whether this UDF will be executed in barrier mode, and is only accepted
+        # in the methods 'mapInPandas' and 'mapInArrow'.
+        # For example:
+        #
+        # df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))
+        #
+        # def filter_func(iterator):
+        #     for pdf in iterator:
+        #         yield pdf[pdf.id == 1]
+        #
+        # filter_func._is_barrier = True  # Mark this UDF as barrier

Review Comment:
   @WeichenXu123 I added an example here, and also added a UT to make sure it works.
   
   > Our xgboost users are already aware of the limitations, it is not an issue.
   > Note that currently the xgboost library (python package) already uses the RDD barrier API,
   > in the future we need to adapt xgboost to spark connect mode,
   > this means the SQL side barrier flag should also be a user-facing interface.
   
   I'm aware of the integration with XGBoost. It looks like only developers will use it. Is the current implementation as a function attribute enough to support it?
   
   UDFs are used much more widely than the RDDBarrier APIs, and my concern is that end users are likely to abuse it.
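   For reference, a minimal end-to-end sketch of how the `_is_barrier` attribute described in the snippet above could be used together with `mapInPandas`. This is an illustrative assumption based on the code comment in the diff, not a confirmed final API; the DataFrame contents and schema are examples only.
   
   ```python
   # Minimal sketch, assuming Spark 3.5.0+ with the internal '_is_barrier'
   # attribute from this PR; the attribute name and behavior follow the
   # code comment above and may change before release.
   from pyspark.sql import SparkSession
   
   spark = SparkSession.builder.getOrCreate()
   df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))
   
   
   def filter_func(iterator):
       # Receives an iterator of pandas.DataFrame batches; keep rows with id == 1.
       for pdf in iterator:
           yield pdf[pdf.id == 1]
   
   
   # Mark the function so that mapInPandas executes it as a barrier stage
   # (per the comment above, only honored by 'mapInPandas' and 'mapInArrow').
   filter_func._is_barrier = True
   
   df.mapInPandas(filter_func, schema=df.schema).show()
   ```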


