zhengruifeng commented on PR #38979:
URL: https://github.com/apache/spark/pull/38979#issuecomment-1342247448
PySpark's `createDataFrame` infers and validates the data types, creates an RDD from the list, and directly assigns the SQL schema in the JVM. There are many related configurations, including:

```
self._jconf.inferDictAsStruct()
self._jconf.sessionLocalTimeZone()
self._jconf.arrowPySparkEnabled()
self._jconf.arrowPySparkFallbackEnabled()
self._jconf.arrowMaxRecordsPerBatch()
self._jconf.arrowSafeTypeConversion()
self._jconf.legacyInferArrayTypeFromFirstElement()
is_timestamp_ntz_preferred()
...
```

In Connect, datasets are always converted to a Pandas DataFrame (internally a PyArrow Table), so I simply use `pd.DataFrame(list(data))` to infer the data types. The two approaches are so different that I am afraid it is hard to 100% match PySpark's `createDataFrame`.
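As a rough illustration of the gap (a minimal sketch of the pandas/Arrow route, not the actual Connect code path), the inference below never consults any of the configs listed above, so its results can diverge from PySpark's:

```python
import pandas as pd
import pyarrow as pa

# Hypothetical input rows, as a user might pass to createDataFrame.
data = [(1, "a", {"k": 1}), (2, "b", {"k": 2})]

# Connect-style inference: hand the raw rows to pandas and let
# pandas/PyArrow decide the types.
pdf = pd.DataFrame(list(data))
print(pdf.dtypes)     # int64, object, object

# Converting to a PyArrow Table fixes the final types. Note the dict
# column comes out as an Arrow struct, whereas PySpark's inference
# would give a MapType by default (a StructType only when
# inferDictAsStruct is enabled).
table = pa.Table.from_pandas(pdf)
print(table.schema)   # int64, string, struct<k: int64>
```

That mismatch on dicts is one concrete case of the general problem: each config above changes an inference behavior that pandas and Arrow never see.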
