[
https://issues.apache.org/jira/browse/SPARK-54421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Martin Bode updated SPARK-54421:
--------------------------------
Summary: LocalDataToArrowConversion fails on Windows for very low and high
datetime/TimestampType values (was: LocalDataToArrowConversion fails on
Windows for for very low and high datetime/TimestampType values)
> LocalDataToArrowConversion fails on Windows for very low and high
> datetime/TimestampType values
> -----------------------------------------------------------------------------------------------
>
> Key: SPARK-54421
> URL: https://issues.apache.org/jira/browse/SPARK-54421
> Project: Spark
> Issue Type: Bug
> Components: Connect, PySpark
> Affects Versions: 4.0.1
> Environment: Windows 11
> Reporter: Martin Bode
> Priority: Major
>
> When trying to create a `DataFrame` from a Python list of dicts where some
> values are very low (< `1970-01-01`) or very high (> `3001-01-19`), an error
> is raised.
> This seems to be specific to {*}Windows OS{*}.
> h1. Reproduce
> {code:python}
> from datetime import datetime
>
> data = [
>     {"id": 1, "some_datetime": datetime(1970, 1, 1, 3, 4, 5)},   # ❌ causes error
>     {"id": 2, "some_datetime": datetime(1970, 1, 2, 3, 4, 5)},   # ✅ works fine
>     {"id": 3, "some_datetime": datetime(2025, 1, 2, 3, 4, 5)},   # ✅ works fine
>     {"id": 4, "some_datetime": datetime(3001, 1, 19, 3, 4, 5)},  # ✅ works fine
>     {"id": 5, "some_datetime": datetime(3001, 1, 20, 3, 4, 5)},  # ❌ causes error
>     {"id": 6, "some_datetime": datetime(9999, 1, 2, 3, 4, 5)},   # ❌ causes error
> ]
>
> df_testdata = spark.createDataFrame(data=data, schema="id LONG, some_datetime TIMESTAMP")
> df_testdata.show(truncate=False)
> {code}
> h1. Error
> {code:python}
> ---------------------------------------------------------------------------
> OSError                                   Traceback (most recent call last)
> Cell In[76], line 12
>       1 from datetime import datetime
>       3 data = [
>       4     {"id": 1, "some_datetime": datetime(1970, 1, 1, 3, 4, 5)},   # ❌ causes error
>       5     {"id": 2, "some_datetime": datetime(1970, 1, 2, 3, 4, 5)},   # ✅ works fine
>    (...)
>       9     {"id": 6, "some_datetime": datetime(9999, 1, 2, 3, 4, 5)},   # ❌ causes error
>      10 ]
> ---> 12 df_testdata = spark.createDataFrame(data=data, schema="id LONG, some_datetime TIMESTAMP")
>      14 df_testdata.show(truncate=False)
>
> File c:\...\.venv\Lib\site-packages\pyspark\sql\connect\session.py:707, in SparkSession.createDataFrame(self, data, schema, samplingRatio, verifySchema)
>     700 from pyspark.sql.conversion import (
>     701     LocalDataToArrowConversion,
>     702 )
>     704 # Spark Connect will try its best to build the Arrow table with the
>     705 # inferred schema in the client side, and then rename the columns and
>     706 # cast the datatypes in the server side.
> --> 707 _table = LocalDataToArrowConversion.convert(_data, _schema, prefers_large_types)
>     709 # TODO: Beside the validation on number of columns, we should also check
>     710 # whether the Arrow Schema is compatible with the user provided Schema.
>     711 if _num_cols is not None and _num_cols != _table.shape[1]:
>
> File c:\...\.venv\Lib\site-packages\pyspark\sql\conversion.py:347, in LocalDataToArrowConversion.convert(data, schema, use_large_var_types)
>     345 if isinstance(item, dict):
>     346     for i, col in enumerate(column_names):
> --> 347         pylist[i].append(column_convs[i](item.get(col)))
>     348 else:
>     349     if len(item) != len(column_names):
>
> File c:\...\.venv\Lib\site-packages\pyspark\sql\conversion.py:222, in LocalDataToArrowConversion._create_converter.<locals>.convert_timestamp(value)
>     220 else:
>     221     assert isinstance(value, datetime.datetime)
> --> 222     return value.astimezone(datetime.timezone.utc)
>
> OSError: [Errno 22] Invalid argument
> {code}
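> h1. Likely cause (untested sketch)
> The failing call is `value.astimezone(datetime.timezone.utc)` on a *naive* datetime. For naive values, CPython's `astimezone()` has to query the OS for the local UTC offset, and on Windows the underlying C runtime rejects timestamps outside roughly `1970-01-01`..`3001-01-19`, hence `OSError: [Errno 22]`. The sketch below is *not* the pyspark fix, just an illustration under the assumption that the session timezone is known (UTC here): attaching `tzinfo` explicitly before calling `astimezone()` skips the OS lookup, so the same out-of-range values convert fine on any platform.

```python
from datetime import datetime, timezone


def to_utc(value: datetime, session_tz=timezone.utc) -> datetime:
    """Convert a datetime to UTC without an OS localtime lookup.

    Hypothetical workaround: naive datetimes are interpreted in
    session_tz (assumed UTC here for illustration).  Because the value
    is made timezone-aware first, astimezone() does pure arithmetic and
    never hits the Windows C-runtime range limit.
    """
    if value.tzinfo is None:
        value = value.replace(tzinfo=session_tz)
    return value.astimezone(timezone.utc)


# Both of the values that fail in the report convert without error:
print(to_utc(datetime(1970, 1, 1, 3, 4, 5)))
print(to_utc(datetime(9999, 1, 2, 3, 4, 5)))
```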
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]