This is an automated email from the ASF dual-hosted git repository.
cutlerb pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-2.4 by this push:
new 5084c71 [SPARK-32300][PYTHON][2.4] toPandas should work from a Spark DataFrame with no partitions
5084c71 is described below
commit 5084c7100b20c49c423ababdd4fcd59eaebd9909
Author: HyukjinKwon <[email protected]>
AuthorDate: Tue Jul 14 13:28:36 2020 -0700
[SPARK-32300][PYTHON][2.4] toPandas should work from a Spark DataFrame with no partitions
### What changes were proposed in this pull request?
This PR proposes to simply bypass the case where the array size would be negative when collecting data from a Spark DataFrame with no partitions via `toPandas` with Arrow optimization enabled:
```python
spark.sparkContext.emptyRDD().toDF("col1 int").toPandas()
```
In master and branch-3.0, this was fixed together with
https://github.com/apache/spark/commit/ecaa495b1fe532c36e952ccac42f4715809476af,
but that commit was legitimately not ported back to branch-2.4.
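For context: with zero partitions, the JVM-side buffer size `numPartitions - 1` evaluates to `-1`, and allocating a negative-length array throws `NegativeArraySizeException`. A minimal sketch of the guard in Python terms (an illustration only; the actual fix is the one-line Scala change in the diff below):

```python
num_partitions = 0                       # an empty RDD has no partitions

# Before the fix the buffer was sized `num_partitions - 1`, i.e. -1 here,
# which on the JVM raises NegativeArraySizeException at allocation time.
# The fix clamps the size at zero:
buffer_slots = max(0, num_partitions - 1)
assert buffer_slots == 0
```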
### Why are the changes needed?
To make an empty Spark DataFrame convertible to a pandas DataFrame.
### Does this PR introduce _any_ user-facing change?
Yes,
```python
spark.sparkContext.emptyRDD().toDF("col1 int").toPandas()
```
**Before:**
```
...
Caused by: java.lang.NegativeArraySizeException
    at org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3293)
    at org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3287)
    at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
...
```
**After:**
```
Empty DataFrame
Columns: [col1]
Index: []
```
### How was this patch tested?
Manually tested, and a unit test was added.
Closes #29098 from HyukjinKwon/SPARK-32300.
Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Bryan Cutler <[email protected]>
---
 python/pyspark/sql/tests.py                                | 6 ++++++
 sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala | 2 +-
 2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 49acf04..c144b41 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -4579,6 +4579,12 @@ class ArrowTests(ReusedSQLTestCase):
             self.spark.createDataFrame(
                 pd.DataFrame({'a': [1, 2, 3]}, index=[2., 3., 4.])).distinct().count(), 3)
 
+    def test_no_partition_toPandas(self):
+        # SPARK-32300: toPandas should work from a Spark DataFrame with no partitions
+        pdf = self.spark.sparkContext.emptyRDD().toDF("col1 int").toPandas()
+        self.assertEqual(len(pdf), 0)
+        self.assertEqual(list(pdf.columns), ["col1"])
+
 
 @unittest.skipIf(
     not _have_pandas or not _have_pyarrow,
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index a755a6f..6e45775 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -3290,7 +3290,7 @@ class Dataset[T] private[sql](
       val numPartitions = arrowBatchRdd.partitions.length
 
       // Store collection results for worst case of 1 to N-1 partitions
-      val results = new Array[Array[Array[Byte]]](numPartitions - 1)
+      val results = new Array[Array[Array[Byte]]](Math.max(0, numPartitions - 1))
       var lastIndex = -1  // index of last partition written
 
       // Handler to eagerly write partitions to Python in order
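This array backs a handler that streams partition results to Python in partition order even though Spark may complete partitions out of order: a partition `i` that arrives early is buffered in slot `i - 1`, so at most `N - 1` results are ever pending (partition 0 is always written immediately), and with zero partitions the buffer is now simply empty. A rough Python sketch of that protocol (an illustration of the logic, not PySpark's actual code; all names are made up):

```python
def write_partitions_in_order(arrivals, num_partitions, write):
    """Write partition payloads in index order, given out-of-order arrivals.

    `arrivals` yields (partition_index, payload) in completion order.
    """
    # The fix: max(0, ...) keeps this allocation valid when num_partitions == 0.
    results = [None] * max(0, num_partitions - 1)
    last_index = -1  # index of the last partition written

    for index, payload in arrivals:
        if index == last_index + 1:
            write(payload)       # next expected partition: write it eagerly
            last_index = index
            # Flush any buffered successors that are now in order.
            while last_index + 1 < num_partitions and results[last_index] is not None:
                write(results[last_index])
                results[last_index] = None
                last_index += 1
        else:
            results[index - 1] = payload  # arrived early: buffer it in slot i - 1

# With no partitions, nothing is buffered and nothing is written:
write_partitions_in_order(iter([]), num_partitions=0, write=print)
# Partitions completing as 1, 0, 2 are still written as 0, 1, 2:
write_partitions_in_order(iter([(1, "p1"), (0, "p0"), (2, "p2")]), 3, print)
```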
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]