HeartSaVioR commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r974854298
########## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ########## @@ -2705,6 +2705,44 @@ object SQLConf { .booleanConf .createWithDefault(false) + val MAP_PANDAS_UDF_WITH_STATE_SOFT_LIMIT_SIZE_PER_BATCH = + buildConf("spark.sql.execution.applyInPandasWithState.softLimitSizePerBatch") + .internal() + .doc("When using applyInPandasWithState, set a soft limit of the accumulated size of " + + "records that can be written to a single ArrowRecordBatch in memory. This is used to " + + "restrict the amount of memory being used to materialize the data in both executor and " + + "Python worker. The accumulated size of records are calculated via sampling a set of " + + "records. Splitting the ArrowRecordBatch is performed per record, so unless a record " + + "is quite huge, the size of constructed ArrowRecordBatch will be around the " + + "configured value.") + .version("3.4.0") + .bytesConf(ByteUnit.BYTE) + .createWithDefaultString("64MB") Review Comment: (closing the loop) We decided to simply use the number of rows for the condition of constructing Arrow RecordBatch. This will remove all new configs being introduced here, as well as reduce lots of complexity. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org