Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18378#discussion_r123461244
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1721,7 +1721,18 @@ def toPandas(self):
             1    5    Bob
             """
             import pandas as pd
    -        return pd.DataFrame.from_records(self.collect(), columns=self.columns)
    +
    +        dtype = {}
    +        for field in self.schema:
    +            pandas_type = _to_corrected_pandas_type(field.dataType)
    +            if pandas_type is not None:
    +                dtype[field.name] = pandas_type
    +
    +        pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
    +
    +        for f, t in dtype.items():
    +            pdf[f] = pdf[f].astype(t, copy=False)
    --- End diff --
    
    Just in case someone blames this line in the future, as a little side note: it looks like `copy` was introduced in 0.11.0 [here](https://github.com/pandas-dev/pandas/blob/v0.11.0/pandas/core/generic.py#L521), so Pandas 0.10.0 does not work with it (see [here](https://github.com/pandas-dev/pandas/blob/v0.10.0/pandas/core/generic.py#L489)).
    
    ```python
    from pyspark.sql.types import *
    
    schema = StructType().add("a", IntegerType()).add("b", StringType())\
                         .add("c", BooleanType()).add("d", FloatType())
    data = [
        (1, "foo", True, 3.0,), (2, "foo", True, 5.0),
        (3, "bar", False, -1.0), (4, "bar", False, 6.0),
    ]
    spark.createDataFrame(data, schema).toPandas().dtypes
    ```
    
    Pandas 0.10.0:
    
    ```
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/hyukjinkwon/Desktop/workspace/repos/forked/spark/python/pyspark/sql/dataframe.py", line 1734, in toPandas
        pdf[f] = pdf[f].astype(t, copy=False)
    TypeError: astype() got an unexpected keyword argument 'copy'
    ```
    
    However, I guess this is really fine because:
    
    - 0.10.0 was released in 2012, when Spark was 0.6.x and Java was 6 & 7. That was 5 years ago, so I guess this is really fine.
    
    - It does work without `copy`, but then the types are not properly set as proposed here:
    
      ```
      spark.createDataFrame(data, schema).toPandas().dtypes
      a      int64  # <- this should be 'int32'
      b     object
      c       bool
      d    float64  # <- this should be 'float32'
      ```
    
    I am writing this comment only because, to my knowledge, we didn't specify a Pandas version requirement - https://github.com/apache/spark/blob/314cf51ded52834cfbaacf58d3d05a220965ca2a/python/setup.py#L202.
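
    Since the version floor is unspecified, if we ever wanted to keep older Pandas working, the keyword could be guarded on the installed version, roughly like this (just a sketch; `astype_compat` is a hypothetical helper, not part of this patch):

    ```python
    import pandas as pd

    # Hypothetical compatibility helper (not part of this patch): only pass
    # copy=False when the installed Pandas is 0.11.0 or newer, since older
    # releases do not accept the `copy` keyword on astype().
    def astype_compat(series, dtype):
        major, minor = (int(x) for x in pd.__version__.split(".")[:2])
        if (major, minor) >= (0, 11):
            return series.astype(dtype, copy=False)
        return series.astype(dtype)

    print(astype_compat(pd.Series([1, 2, 3]), "int32").dtype)  # int32
    ```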


