igorghi commented on PR #38624: URL: https://github.com/apache/spark/pull/38624#issuecomment-1688166890
@HyukjinKwon this may be a misunderstanding on my part about the inner workings, but for the `repartition(grouping_cols).mapInArrow` workaround, wouldn't the [batch size](https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#setting-arrow-batch-size) present a problem? Depending on the batch size parameter, the full group might not be available in a single Arrow RecordBatch, for example with the default 10K batch size when a partition holds more than 10K rows.
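
To make the concern concrete, here is a minimal plain-Python sketch of the splitting I have in mind. It does not call Spark or Arrow: `to_batches` and the `partition` data are hypothetical stand-ins, and it only assumes that a partition is handed to the function in consecutive batches capped at the configured size.

```python
from itertools import islice


def to_batches(rows, max_records):
    """Yield consecutive lists of at most max_records rows.

    Hypothetical stand-in for how a partition might be cut into record
    batches; this is not Spark's actual implementation.
    """
    it = iter(rows)
    while True:
        batch = list(islice(it, max_records))
        if not batch:
            return
        yield batch


# Hypothetical partition after repartitioning by the grouping column:
# a single group "a" with 25,000 rows.
partition = [("a", i) for i in range(25_000)]

# With a 10,000-row cap, the group is spread over several batches.
batch_sizes = [len(batch) for batch in to_batches(partition, 10_000)]

# Number of batches in which group "a" appears.
batches_with_group = sum(
    1 for batch in to_batches(partition, 10_000)
    if any(key == "a" for key, _ in batch)
)

print(batch_sizes)
print(batches_with_group)
```

If batching works this way, a function that treats each batch as a complete group would see group "a" in three separate pieces rather than once.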
