Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22275#discussion_r231311398

--- Diff: python/pyspark/sql/tests.py ---
@@ -4923,6 +4923,28 @@ def test_timestamp_dst(self):
         self.assertPandasEqual(pdf, df_from_python.toPandas())
         self.assertPandasEqual(pdf, df_from_pandas.toPandas())
 
+    def test_toPandas_batch_order(self):
+
+        # Collect Arrow RecordBatches out of order in the driver JVM, then re-order them in Python
+        def run_test(num_records, num_parts, max_records):
+            df = self.spark.range(num_records, numPartitions=num_parts).toDF("a")
+            with self.sql_conf({"spark.sql.execution.arrow.maxRecordsPerBatch": max_records}):
+                pdf, pdf_arrow = self._toPandas_arrow_toggle(df)
+                self.assertPandasEqual(pdf, pdf_arrow)
+
+        cases = [
+            (1024, 512, 2),  # Large number of partitions for a good chance of not collecting in order
+            (512, 64, 2),    # Medium number of partitions to test out-of-order collection
+            (64, 8, 2),      # Small number of partitions to test out-of-order collection
+            (64, 64, 1),     # Single batch per partition
+            (64, 1, 64),     # Single partition, single batch
+            (64, 1, 8),      # Single partition, multiple batches
+            (30, 7, 2),      # Different-sized partitions
+        ]
--- End diff --

Yeah, it's not a guarantee, but with a large number of partitions there is only a slim chance they will all arrive in order. I can also add a case with a delay; my only concern is how long the delay needs to be to reliably force out-of-order collection without adding wasted time to the tests. How about we keep the case with a large number of partitions and add a case with a 100ms delay on the first partition?
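As a concrete sketch of that proposal (not the final patch: the `use_delay` flag and `delay_first_part` helper are hypothetical names, and 100ms is just the figure floated above), the delayed case could be wired into the existing test like this:

```python
import time

def test_toPandas_batch_order(self):

    def delay_first_part(partition_index, iterator):
        # Hypothetical helper: sleep only in partition 0 so batches from
        # later partitions are likely to reach the driver first.
        if partition_index == 0:
            time.sleep(0.1)
        return iterator

    # Collect Arrow RecordBatches out of order in the driver JVM, then re-order them in Python
    def run_test(num_records, num_parts, max_records, use_delay=False):
        df = self.spark.range(num_records, numPartitions=num_parts).toDF("a")
        if use_delay:
            # Route through the RDD API to inject the per-partition delay
            df = df.rdd.mapPartitionsWithIndex(delay_first_part).toDF()
        with self.sql_conf({"spark.sql.execution.arrow.maxRecordsPerBatch": max_records}):
            pdf, pdf_arrow = self._toPandas_arrow_toggle(df)
            self.assertPandasEqual(pdf, pdf_arrow)

    cases = [
        (1024, 512, 2),    # Large number of partitions, slim chance of ordered arrival
        (64, 8, 2, True),  # 100ms delay on the first partition forces out-of-order collection
    ]

    for case in cases:
        run_test(*case)
```

One trade-off worth noting: the delayed case adds a bounded, known cost (100ms) to the suite and deterministically exercises the re-ordering path, whereas the many-partitions case is cheap but only probabilistic.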