[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

GitBox Sun, 18 Sep 2022 22:19:36 -0700


HeartSaVioR commented on code in PR #37893:
URL: https://github.com/apache/spark/pull/37893#discussion_r973868967



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -2705,6 +2705,44 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+  val MAP_PANDAS_UDF_WITH_STATE_SOFT_LIMIT_SIZE_PER_BATCH =
+    
buildConf("spark.sql.execution.applyInPandasWithState.softLimitSizePerBatch")
+      .internal()
+      .doc("When using applyInPandasWithState, set a soft limit of the 
accumulated size of " +
+        "records that can be written to a single ArrowRecordBatch in memory. 
This is used to " +
+        "restrict the amount of memory being used to materialize the data in 
both executor and " +
+        "Python worker. The accumulated size of records are calculated via 
sampling a set of " +
+        "records. Splitting the ArrowRecordBatch is performed per record, so 
unless a record " +
+        "is quite huge, the size of constructed ArrowRecordBatch will be 
around the " +
+        "configured value.")
+      .version("3.4.0")
+      .bytesConf(ByteUnit.BYTE)
+      .createWithDefaultString("64MB")

Review Comment:
   Ah, SPARK-23258 is about restricting arrow record batch to size, seems 
similar with what we propose in this PR. It's still questionable if we 
calculate in every addition of row (accurate but would be super bad on 
performance) or do sampling as we do here (cannot be accurate and err might be 
non-trivial with variable-length columns).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

Reply via email to