[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40896: [SPARK-43229][ML][PYTHON][CONNECT] Introduce Barrier Python UDF

via GitHub Thu, 04 May 2023 01:43:17 -0700


WeichenXu123 commented on code in PR #40896:
URL: https://github.com/apache/spark/pull/40896#discussion_r1184721358



##########
python/pyspark/sql/udf.py:
##########
@@ -249,6 +259,38 @@ def __init__(
         self.evalType = evalType
         self.deterministic = deterministic
 
+        # since 3.5.0, we introduce an internal optional function attribute 
'_is_barrier',
+        # which is dedicated for integration with external ML training 
frameworks including
+        # PyTorch and XGBoost.
+        # It indicates whether this UDF will be executed on barrier mode, and 
is only accepted
+        # in methods 'mapInPandas' and 'mapInArrow'.
+        # For example:
+        #
+        # df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))
+        #
+        # def filter_func(iterator):
+        #     for pdf in iterator:
+        #         yield pdf[pdf.id == 1]
+        #
+        # filter_func._is_barrier = True # Mark this UDF is barrier

Review Comment:
   Define a `@barrier` decorator looks better, and we can document `barrier` 
doc saying this is a developer API and we should keep it stable



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40896: [SPARK-43229][ML][PYTHON][CONNECT] Introduce Barrier Python UDF

Reply via email to