[ 
https://issues.apache.org/jira/browse/SPARK-54421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Bode updated SPARK-54421:
--------------------------------
    Summary: LocalDataToArrowConversion fails on Windows for very low and high 
datetime/TimestampType values  (was: LocalDataToArrowConversion fails on 
Windows for for very low and high datetime/TimestampType values)

> LocalDataToArrowConversion fails on Windows for very low and high 
> datetime/TimestampType values
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-54421
>                 URL: https://issues.apache.org/jira/browse/SPARK-54421
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect, PySpark
>    Affects Versions: 4.0.1
>         Environment: Windows 11
>            Reporter: Martin Bode
>            Priority: Major
>
> Creating a `DataFrame` from a Python list of dicts fails with an `OSError` 
> when some datetime values are very low (<`1970-01-01`) or very high 
> (>`3001-01-19`).
> This seems to be specific to {*}Windows OS{*}.
> h1. Reproduce
> {code:python}
> from datetime import datetime
> data = [
>     {"id": 1, "some_datetime": datetime(1970, 1, 1, 3, 4, 5)},  # ❌ causes error
>     {"id": 2, "some_datetime": datetime(1970, 1, 2, 3, 4, 5)},  # ✅ works fine
>     {"id": 3, "some_datetime": datetime(2025, 1, 2, 3, 4, 5)},  # ✅ works fine
>     {"id": 4, "some_datetime": datetime(3001, 1, 19, 3, 4, 5)},  # ✅ works fine
>     {"id": 5, "some_datetime": datetime(3001, 1, 20, 3, 4, 5)},  # ❌ causes error
>     {"id": 6, "some_datetime": datetime(9999, 1, 2, 3, 4, 5)},  # ❌ causes error
> ]
> df_testdata = spark.createDataFrame(data=data, schema="id LONG, some_datetime TIMESTAMP")
> df_testdata.show(truncate=False)
> {code}
> h1. Error
> {code:python}
> ---------------------------------------------------------------------------
> OSError                                   Traceback (most recent call last)
> Cell In[76], line 12
>       1 from datetime import datetime
>       3 data = [
>       4     {"id": 1, "some_datetime": datetime(1970, 1, 1, 3, 4, 5)},  # ❌ causes error
>       5     {"id": 2, "some_datetime": datetime(1970, 1, 2, 3, 4, 5)},  # ✅ works fine
>    (...)      9     {"id": 6, "some_datetime": datetime(9999, 1, 2, 3, 4, 5)},  # ❌ causes error
>      10 ]
> ---> 12 df_testdata = spark.createDataFrame(data=data, schema="id LONG, some_datetime TIMESTAMP")
>      14 df_testdata.show(truncate=False)
> File c:\...\.venv\Lib\site-packages\pyspark\sql\connect\session.py:707, in SparkSession.createDataFrame(self, data, schema, samplingRatio, verifySchema)
>     700     from pyspark.sql.conversion import (
>     701         LocalDataToArrowConversion,
>     702     )
>     704     # Spark Connect will try its best to build the Arrow table with the
>     705     # inferred schema in the client side, and then rename the columns and
>     706     # cast the datatypes in the server side.
> --> 707     _table = LocalDataToArrowConversion.convert(_data, _schema, prefers_large_types)
>     709 # TODO: Beside the validation on number of columns, we should also check
>     710 # whether the Arrow Schema is compatible with the user provided Schema.
>     711 if _num_cols is not None and _num_cols != _table.shape[1]:
> File c:\...\.venv\Lib\site-packages\pyspark\sql\conversion.py:347, in LocalDataToArrowConversion.convert(data, schema, use_large_var_types)
>     345 if isinstance(item, dict):
>     346     for i, col in enumerate(column_names):
> --> 347         pylist[i].append(column_convs[i](item.get(col)))
>     348 else:
>     349     if len(item) != len(column_names):
> File c:\...\.venv\Lib\site-packages\pyspark\sql\conversion.py:222, in LocalDataToArrowConversion._create_converter.<locals>.convert_timestamp(value)
>     220 else:
>     221     assert isinstance(value, datetime.datetime)
> --> 222     return value.astimezone(datetime.timezone.utc)
> OSError: [Errno 22] Invalid argument
> {code}
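
The traceback ends in `value.astimezone(datetime.timezone.utc)` being called on a naive `datetime`. The sketch below isolates that call from Spark and shows one possible client-side workaround. It rests on two assumptions that have not been verified against Spark: that the failure comes from the platform's local-time conversion for naive values, and that timezone-aware values pass through the same converter unchanged. `naive_to_utc_fails` and `attach_utc` are hypothetical helper names, not part of PySpark.

```python
from datetime import datetime, timezone


def naive_to_utc_fails(value: datetime) -> bool:
    """Return True if converting `value` to UTC raises on this platform."""
    try:
        # For a naive datetime this consults the system's local time zone,
        # which is the call that raises in the traceback above.
        value.astimezone(timezone.utc)
    except (OSError, OverflowError, ValueError):
        return True
    return False


def attach_utc(value: datetime) -> datetime:
    """Hypothetical helper: label a naive datetime as UTC without shifting it.

    Assumption: the wall-clock values are already meant to be UTC. If they are
    local times, this changes their meaning.
    """
    if value.tzinfo is None:
        return value.replace(tzinfo=timezone.utc)
    return value


samples = [
    datetime(1970, 1, 1, 3, 4, 5),
    datetime(2025, 1, 2, 3, 4, 5),
    datetime(9999, 1, 2, 3, 4, 5),
]

for sample in samples:
    aware = attach_utc(sample)
    # An aware datetime is converted by offset arithmetic only.
    converted = aware.astimezone(timezone.utc)
    print(sample, "| naive conversion fails:", naive_to_utc_fails(sample), "|", converted)
```

If the assumptions hold, applying `attach_utc` to each `some_datetime` value before calling `spark.createDataFrame` should avoid the failing code path. Naive datetimes are normally interpreted as local time, so the stored instants will differ unless the data is already in UTC.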



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
