Martin Bode created SPARK-54421:
-----------------------------------
Summary: LocalDataToArrowConversion fails on Windows for very low and high datetime/TimestampType values
Key: SPARK-54421
URL: https://issues.apache.org/jira/browse/SPARK-54421
Project: Spark
Issue Type: Bug
Components: Connect, PySpark
Affects Versions: 4.0.1
Environment: Windows 11
Reporter: Martin Bode
When creating a `DataFrame` from a Python list of dicts where some timestamp
values are very low (< `1970-01-01`) or very high (> `3001-01-19`), an error is
raised.
This seems to be specific to {*}Windows OS{*}.
h1. Reproduce
{code:python}
from datetime import datetime

data = [
    {"id": 1, "some_datetime": datetime(1970, 1, 1, 3, 4, 5)},   # ❌ causes error
    {"id": 2, "some_datetime": datetime(1970, 1, 2, 3, 4, 5)},   # ✅ works fine
    {"id": 3, "some_datetime": datetime(2025, 1, 2, 3, 4, 5)},   # ✅ works fine
    {"id": 4, "some_datetime": datetime(3001, 1, 19, 3, 4, 5)},  # ✅ works fine
    {"id": 5, "some_datetime": datetime(3001, 1, 20, 3, 4, 5)},  # ❌ causes error
    {"id": 6, "some_datetime": datetime(9999, 1, 2, 3, 4, 5)},   # ❌ causes error
]

df_testdata = spark.createDataFrame(data=data, schema="id LONG, some_datetime TIMESTAMP")
df_testdata.show(truncate=False)
{code}
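If the diagnosis below is right, the failure should reproduce without Spark at all: converting a *naive* datetime with `astimezone()` asks the C runtime for the local UTC offset, and on Windows `localtime()` rejects timestamps outside roughly 1970–3001. A minimal probe (a sketch assuming plain CPython, no PySpark involved; the `probe` helper is hypothetical):

```python
from datetime import datetime, timezone

def probe(value: datetime):
    # Same call as pyspark/sql/conversion.py's convert_timestamp: converting
    # a naive datetime to UTC makes CPython ask the OS for the local offset
    # at that instant.
    try:
        return value.astimezone(timezone.utc)
    except OSError as exc:
        # Reported on Windows 11 for pre-epoch / far-future values:
        # OSError: [Errno 22] Invalid argument
        return exc

result = probe(datetime(1970, 1, 1, 3, 4, 5))
# On Windows 11 this is expected to be the OSError; on Linux/macOS the
# conversion succeeds and a timezone-aware datetime is returned.
```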
h1. Error
{code:python}
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[76], line 12
      1 from datetime import datetime
      3 data = [
      4     {"id": 1, "some_datetime": datetime(1970, 1, 1, 3, 4, 5)},   # ❌ causes error
      5     {"id": 2, "some_datetime": datetime(1970, 1, 2, 3, 4, 5)},   # ✅ works fine
   (...)
      9     {"id": 6, "some_datetime": datetime(9999, 1, 2, 3, 4, 5)},   # ❌ causes error
     10 ]
---> 12 df_testdata = spark.createDataFrame(data=data, schema="id LONG, some_datetime TIMESTAMP")
     14 df_testdata.show(truncate=False)

File c:\...\.venv\Lib\site-packages\pyspark\sql\connect\session.py:707, in SparkSession.createDataFrame(self, data, schema, samplingRatio, verifySchema)
    700 from pyspark.sql.conversion import (
    701     LocalDataToArrowConversion,
    702 )
    704 # Spark Connect will try its best to build the Arrow table with the
    705 # inferred schema in the client side, and then rename the columns and
    706 # cast the datatypes in the server side.
--> 707 _table = LocalDataToArrowConversion.convert(_data, _schema, prefers_large_types)
    709 # TODO: Beside the validation on number of columns, we should also check
    710 # whether the Arrow Schema is compatible with the user provided Schema.
    711 if _num_cols is not None and _num_cols != _table.shape[1]:

File c:\...\.venv\Lib\site-packages\pyspark\sql\conversion.py:347, in LocalDataToArrowConversion.convert(data, schema, use_large_var_types)
    345 if isinstance(item, dict):
    346     for i, col in enumerate(column_names):
--> 347         pylist[i].append(column_convs[i](item.get(col)))
    348 else:
    349     if len(item) != len(column_names):

File c:\...\.venv\Lib\site-packages\pyspark\sql\conversion.py:222, in LocalDataToArrowConversion._create_converter.<locals>.convert_timestamp(value)
    220 else:
    221     assert isinstance(value, datetime.datetime)
--> 222     return value.astimezone(datetime.timezone.utc)

OSError: [Errno 22] Invalid argument
{code}
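A possible workaround sketch (hypothetical, not part of PySpark): attach the local zone with `replace(tzinfo=...)` instead of calling `astimezone()` on a naive value. Once the datetime is timezone-aware, `astimezone()` is pure arithmetic and never touches the C runtime's `localtime()`. Caveat: this uses the *current* local offset, so historical DST transitions are not honored.

```python
from datetime import datetime, timezone

# Current local offset, resolved once from an in-range "now" instant
# (safe even on Windows, since "now" is within the supported range).
LOCAL_TZ = datetime.now(timezone.utc).astimezone().tzinfo

def convert_timestamp_safe(value: datetime) -> datetime:
    # Hypothetical replacement for convert_timestamp: replace() attaches the
    # zone without any OS call, then the aware->aware astimezone() is pure
    # offset arithmetic, so pre-1970 and post-3001 values convert fine.
    if value.tzinfo is None:
        value = value.replace(tzinfo=LOCAL_TZ)
    return value.astimezone(timezone.utc)
```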
--
This message was sent by Atlassian Jira
(v8.20.10#820010)