Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19459#discussion_r145865969
--- Diff: python/pyspark/sql/session.py ---
@@ -510,6 +578,12 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
except Exception:
has_pandas = False
if has_pandas and isinstance(data, pandas.DataFrame):
+ if self.conf.get("spark.sql.execution.arrow.enabled", "false").lower() == "true" \
+ and len(data) > 0:
+ df = self._createFromPandasWithArrow(data, schema)
--- End diff --
As of https://github.com/apache/spark/pull/19459#issuecomment-337674952,
the `schema` returned from `_parse_datatype_string` might not be a `StructType`:
https://github.com/apache/spark/blob/bfc7e1fe1ad5f9777126f2941e29bbe51ea5da7c/python/pyspark/sql/tests.py#L1325
although, to my knowledge, we have not supported this case with `pd.DataFrame`,
since the primitive-type case (e.g. `int`) resembles a `Dataset` of primitive types:
```
spark.createDataFrame(["a", "b"], "string").show()
+-----+
|value|
+-----+
| a|
| b|
+-----+
```
For the `pd.DataFrame` case, it looks like we always have a list of lists.
https://github.com/apache/spark/blob/d492cc5a21cd67b3999b85d97f5c41c3734b1ba3/python/pyspark/sql/session.py#L515
So, I think we should only support a list of strings as the schema here, and
raise a proper exception for the primitive-type (`int`) case.
Of course, this case should work:
```
>>> spark.createDataFrame(pd.DataFrame([1]), "struct<a: int>").show()
+---+
| a|
+---+
| 1|
+---+
```
---