Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/19459#discussion_r144598051
--- Diff: python/pyspark/sql/session.py ---
@@ -510,9 +511,43 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
         except Exception:
             has_pandas = False
         if has_pandas and isinstance(data, pandas.DataFrame):
-            if schema is None:
-                schema = [str(x) for x in data.columns]
-            data = [r.tolist() for r in data.to_records(index=False)]
+            if self.conf.get("spark.sql.execution.arrow.enabled", "false").lower() == "true" \
+                    and len(data) > 0:
+                from pyspark.serializers import ArrowSerializer
--- End diff ---
That's probably a good idea, since it's a big block of code. The other
create functions return an (rdd, schema) pair and then do further processing
to create a DataFrame. Here we would have to just return a DataFrame,
since we don't want to do that further processing.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]