Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22275#discussion_r231311398

--- Diff: python/pyspark/sql/tests.py ---
@@ -4923,6 +4923,28 @@ def test_timestamp_dst(self):
         self.assertPandasEqual(pdf, df_from_python.toPandas())
         self.assertPandasEqual(pdf, df_from_pandas.toPandas())
 
+    def test_toPandas_batch_order(self):
+
+        # Collect Arrow RecordBatches out of order in the driver JVM, then re-order them in Python
+        def run_test(num_records, num_parts, max_records):
+            df = self.spark.range(num_records, numPartitions=num_parts).toDF("a")
+            with self.sql_conf({"spark.sql.execution.arrow.maxRecordsPerBatch": max_records}):
+                pdf, pdf_arrow = self._toPandas_arrow_toggle(df)
+                self.assertPandasEqual(pdf, pdf_arrow)
+
+        cases = [
+            (1024, 512, 2),  # Large number of partitions for a good chance of not collecting in order
+            (512, 64, 2),    # Medium number of partitions to test out-of-order collection
+            (64, 8, 2),      # Small number of partitions to test out-of-order collection
+            (64, 64, 1),     # Single batch per partition
+            (64, 1, 64),     # Single partition, single batch
+            (64, 1, 8),      # Single partition, multiple batches
+            (30, 7, 2),      # Different-sized partitions
+        ]
--- End diff --

Yeah, it's not a guarantee, but with a large number of partitions there is only a slim chance they will all arrive in order. I can also add a case with a delay; my only concern is how long the delay needs to be to reliably force out-of-order collection without adding wasted time to the tests. How about we keep the case with a large number of partitions and add a case with a 100ms delay on the first partition?
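As a concrete sketch of that proposal (not the final patch: the `use_delay` flag and `delay_first_part` helper are hypothetical names, and 100ms is just the figure floated above), the delayed case could be wired into the existing test like this:

```python
import time

def test_toPandas_batch_order(self):

    def delay_first_part(partition_index, iterator):
        # Hypothetical helper: sleep only in partition 0 so batches from
        # later partitions are likely to reach the driver first.
        if partition_index == 0:
            time.sleep(0.1)
        return iterator

    # Collect Arrow RecordBatches out of order in the driver JVM, then re-order them in Python
    def run_test(num_records, num_parts, max_records, use_delay=False):
        df = self.spark.range(num_records, numPartitions=num_parts).toDF("a")
        if use_delay:
            # Route through the RDD API to inject the per-partition delay
            df = df.rdd.mapPartitionsWithIndex(delay_first_part).toDF()
        with self.sql_conf({"spark.sql.execution.arrow.maxRecordsPerBatch": max_records}):
            pdf, pdf_arrow = self._toPandas_arrow_toggle(df)
            self.assertPandasEqual(pdf, pdf_arrow)

    cases = [
        (1024, 512, 2),    # Large number of partitions, slim chance of ordered arrival
        (64, 8, 2, True),  # 100ms delay on the first partition forces out-of-order collection
    ]

    for case in cases:
        run_test(*case)
```

One trade-off worth noting: the delayed case adds a bounded, known cost (100ms) to the suite and deterministically exercises the re-ordering path, whereas the many-partitions case is cheap but only probabilistic.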