HyukjinKwon opened a new pull request #29098:
URL: https://github.com/apache/spark/pull/29098


   ### What changes were proposed in this pull request?
   
   This PR proposes to simply bypass the case where the array size is negative when collecting data via `toPandas` from a Spark DataFrame with no partitions.
   
   ```python
   spark.sparkContext.emptyRDD().toDF("col1 int").toPandas()
   ```
   
   In master and branch-3.0, this was fixed as part of 
https://github.com/apache/spark/commit/ecaa495b1fe532c36e952ccac42f4715809476af, 
but that change was legitimately not backported.
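   Conceptually, with zero partitions there are no Arrow batches to collect, so `toPandas` can fall back to building an empty pandas DataFrame from the schema's column names. A minimal sketch in plain pandas (the hard-coded column list stands in for the Spark schema; this is illustrative, not the actual code path):
   
   ```python
   import pandas as pd
   
   # Hypothetical sketch of the fallback: when no Arrow batches are
   # produced, an empty pandas DataFrame is built from the schema's
   # column names instead of slicing a negative-sized array.
   columns = ["col1"]  # stand-in for the Spark DataFrame's schema fields
   pdf = pd.DataFrame(columns=columns)
   
   print(pdf.empty)          # True
   print(list(pdf.columns))  # ['col1']
   ```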
   
   ### Why are the changes needed?
   
   To make an empty Spark DataFrame convertible to a pandas DataFrame.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes,
   
   ```python
   spark.sparkContext.emptyRDD().toDF("col1 int").toPandas()
   ```
   
   **Before:**
   
   ```
   ...
   Caused by: java.lang.NegativeArraySizeException
        at 
org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3293)
        at 
org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3287)
        at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
        at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
   ...
   ```
   
   **After:**
   
   ```
   Empty DataFrame
   Columns: [col1]
   Index: []
   ```
   
   ### How was this patch tested?
   
   Manually tested, and a unit test was added.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
