Yicong Huang created SPARK-55312:
------------------------------------

             Summary: Always infer Spark types in createDataFrame when schema is not provided
                 Key: SPARK-55312
                 URL: https://issues.apache.org/jira/browse/SPARK-55312
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 4.2.0
            Reporter: Yicong Huang



Currently, when {{createDataFrame}} is called without a schema, the 
{{spark_types}} list contains {{None}} for non-timestamp columns:

{code:python}
# pyspark/sql/pandas/conversion.py
spark_types = [
    TimestampType()
    if is_datetime64_dtype(t) or isinstance(t, pd.DatetimeTZDtype)
    else None  # Non-timestamp columns get None
    for t in pdf.dtypes
]
{code}

This {{None}} is passed to the serializer, which then lets PyArrow infer the
type on its own. This is inconsistent: timestamp columns get an explicit Spark
type, while every other column's Spark type is left implicit in PyArrow's
inference.
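For illustration, a minimal standalone sketch (not the serializer code itself)
of what the fallback to PyArrow inference looks like; the {{pdf}} below is a
made-up example:

{code:python}
import pandas as pd
import pyarrow as pa

pdf = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# With spark_type=None, the effective behavior is PyArrow's own inference:
for name in pdf.columns:
    arr = pa.Array.from_pandas(pdf[name])
    print(name, arr.type)  # a -> int64, b -> string: chosen by PyArrow, not Spark
{code}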

The fix is to always infer Spark types from Arrow types when no schema is 
provided:

{code:python}
# pyspark/sql/pandas/conversion.py
arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
spark_types = [
    TimestampType()
    if is_datetime64_dtype(t) or isinstance(t, pd.DatetimeTZDtype)
    # Non-timestamp columns now get an explicit Spark type from the Arrow schema
    else from_arrow_type(field.type, prefer_timestamp_ntz)
    for t, field in zip(pdf.dtypes, arrow_schema)
]
{code}
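For context, {{from_arrow_type}} (from {{pyspark.sql.pandas.types}}) maps Arrow
types to Spark types. A minimal sketch of the mapping the fix relies on
(illustrative values):

{code:python}
import pyarrow as pa
from pyspark.sql.pandas.types import from_arrow_type

print(from_arrow_type(pa.int64()))   # LongType()
print(from_arrow_type(pa.string()))  # StringType()
# prefer_timestamp_ntz controls how tz-naive Arrow timestamps are mapped:
print(from_arrow_type(pa.timestamp("us"), prefer_timestamp_ntz=True))   # TimestampNTZType()
print(from_arrow_type(pa.timestamp("us"), prefer_timestamp_ntz=False))  # TimestampType()
{code}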

This ensures {{spark_type}} is always defined, which:
1. Makes the code more consistent
2. Enables downstream code to always have a valid Spark type
3. Prepares for making {{spark_type}} a required parameter in serializers

Files to change:
- {{pyspark/sql/pandas/conversion.py}}
- {{pyspark/sql/connect/session.py}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
