Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19459#discussion_r145865969
--- Diff: python/pyspark/sql/session.py ---
@@ -510,6 +578,12 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
except Exception:
has_pandas = False
if has_pandas and isinstance(data, pandas.DataFrame):
+ if self.conf.get("spark.sql.execution.arrow.enabled", "false").lower() == "true" \
+ and len(data) > 0:
+ df = self._createFromPandasWithArrow(data, schema)
--- End diff --
As of https://github.com/apache/spark/pull/19459#issuecomment-337674952,
the `schema` returned from `_parse_datatype_string` might not be a `StructType`:
https://github.com/apache/spark/blob/bfc7e1fe1ad5f9777126f2941e29bbe51ea5da7c/python/pyspark/sql/tests.py#L1325
although, to my knowledge, we have not supported this case with `pd.DataFrame`,
since the primitive-type case (e.g. `int`) resembles a `Dataset` of primitive types:
```
spark.createDataFrame(["a", "b"], "string").show()
+-----+
|value|
+-----+
| a|
| b|
+-----+
```
For the `pd.DataFrame` case, it looks like we always have a list of lists.
https://github.com/apache/spark/blob/d492cc5a21cd67b3999b85d97f5c41c3734b1ba3/python/pyspark/sql/session.py#L515
So, I think we should only support a list of strings as the schema here, and
raise a proper exception for the primitive-type (`int`) case.
Of course, this case should work:
```
>>> spark.createDataFrame(pd.DataFrame([1]), "struct<a: int>").show()
+---+
| a|
+---+
| 1|
+---+
```
---