zhengruifeng commented on PR #38979:
URL: https://github.com/apache/spark/pull/38979#issuecomment-1342247448
PySpark's `createDataFrame` infers and validates the data types, creates an RDD from the list, and directly assigns the SQL schema in the JVM. There are many related configurations, including:

```
self._jconf.inferDictAsStruct()
self._jconf.sessionLocalTimeZone()
self._jconf.arrowPySparkEnabled()
self._jconf.arrowPySparkFallbackEnabled()
self._jconf.arrowMaxRecordsPerBatch()
self._jconf.arrowSafeTypeConversion()
self._jconf.legacyInferArrayTypeFromFirstElement()
is_timestamp_ntz_preferred()
...
```

In Connect, datasets are always converted to a Pandas DataFrame (internally a PyArrow Table), so I simply use `pd.DataFrame(list(data))` to infer the data types. The two approaches are so different that I am afraid it is hard to 100% match PySpark's `createDataFrame`.
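As a rough illustration of the gap (a minimal sketch of the pandas/Arrow route, not the actual Connect code path), the inference below never consults any of the configs listed above, so its results can diverge from PySpark's:

```python
import pandas as pd
import pyarrow as pa

# Hypothetical input rows, as a user might pass to createDataFrame.
data = [(1, "a", {"k": 1}), (2, "b", {"k": 2})]

# Connect-style inference: hand the raw rows to pandas and let
# pandas/PyArrow decide the types.
pdf = pd.DataFrame(list(data))
print(pdf.dtypes)     # int64, object, object

# Converting to a PyArrow Table fixes the final types. Note the dict
# column comes out as an Arrow struct, whereas PySpark's inference
# would give a MapType by default (a StructType only when
# inferDictAsStruct is enabled).
table = pa.Table.from_pandas(pdf)
print(table.schema)   # int64, string, struct<k: int64>
```

That mismatch on dicts is one concrete case of the general problem: each config above changes an inference behavior that pandas and Arrow never see.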
