Yicong Huang created SPARK-55312:
------------------------------------
Summary: Always infer Spark types in createDataFrame when schema is not provided
Key: SPARK-55312
URL: https://issues.apache.org/jira/browse/SPARK-55312
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 4.2.0
Reporter: Yicong Huang
Currently, when {{createDataFrame}} is called without a schema, the
{{spark_types}} list contains {{None}} for non-timestamp columns:
{code:python}
# pyspark/sql/pandas/conversion.py
spark_types = [
    TimestampType()
    if is_datetime64_dtype(t) or isinstance(t, pd.DatetimeTZDtype)
    else None  # Non-timestamp columns get None
    for t in pdf.dtypes
]
{code}
Each {{None}} is passed through to the serializer, which falls back to letting
PyArrow infer the type. The result is inconsistent: timestamp columns get an
explicit Spark type, while every other column's Spark type is never defined on
the Python side.
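To make the difference concrete, here is a minimal standalone sketch (plain
pandas/PyArrow, not the serializer code itself) of inference versus an
explicit type:
{code:python}
import pandas as pd
import pyarrow as pa

pdf = pd.DataFrame({"x": [1, 2, 3]})

# With no type given, PyArrow infers one from the pandas values.
inferred = pa.Array.from_pandas(pdf["x"])
print(inferred.type)  # int64

# With an explicit type, the caller pins the result.
explicit = pa.Array.from_pandas(pdf["x"], type=pa.int32())
print(explicit.type)  # int32
{code}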
The fix is to always infer Spark types from Arrow types when no schema is
provided:
{code:python}
arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
spark_types = [
    TimestampType()
    if is_datetime64_dtype(t) or isinstance(t, pd.DatetimeTZDtype)
    else from_arrow_type(field.type, prefer_timestamp_ntz)
    for t, field in zip(pdf.dtypes, arrow_schema)
]
{code}
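For reference, a minimal sketch of the mapping this relies on:
{{from_arrow_type}} (from {{pyspark.sql.pandas.types}}) resolves every Arrow
type to a concrete Spark type, so no column is left as {{None}}:
{code:python}
import pyarrow as pa
from pyspark.sql.pandas.types import from_arrow_type

print(from_arrow_type(pa.int64()))   # LongType()
print(from_arrow_type(pa.string()))  # StringType()
# Timezone-naive Arrow timestamps map to TimestampNTZType when preferred.
print(from_arrow_type(pa.timestamp("us"), prefer_timestamp_ntz=True))  # TimestampNTZType()
{code}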
This ensures {{spark_type}} is always defined, which:
1. Makes the code more consistent
2. Enables downstream code to always have a valid Spark type
3. Prepares for making {{spark_type}} a required parameter in serializers
Files to change:
- {{pyspark/sql/pandas/conversion.py}}
- {{pyspark/sql/connect/session.py}}