Daniel Himmelstein created SPARK-49858:
------------------------------------------
Summary: Pyspark JSON reader incorrectly considers a string of
digits a timestamp and fails
Key: SPARK-49858
URL: https://issues.apache.org/jira/browse/SPARK-49858
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.5.0
Reporter: Daniel Himmelstein
With pyspark 3.5.0, reading the following JSON fails:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("timestamp_test").getOrCreate()
data = spark.sparkContext.parallelize(['{"field" : "23456"}'])
df = (
    spark.read.option("inferTimestamp", True)
    # .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]")
    .json(path=data)
)
df.printSchema()
df.collect()
{code}
The printSchema output shows that the field is inferred as a timestamp, and collect then fails with
the following error:
{code:java}
File ~/miniforge3/envs/facets/lib/python3.11/site-packages/pyspark/sql/types.py:282, in TimestampType.fromInternal(self, ts)
    279 def fromInternal(self, ts: int) -> datetime.datetime:
    280     if ts is not None:
    281         # using int to avoid precision loss in float
--> 282         return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)

ValueError: year 23455 is out of range
{code}
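For context on the ValueError itself (a minimal plain-Python sketch, not Spark internals): Python datetimes only support years 1 through 9999, so once the digit string is treated as a timestamp, converting the internal value back to a datetime lands far outside that range.

```python
import datetime

# Python datetimes are limited to years 1..9999.
assert datetime.MAXYEAR == 9999

# A year in the 23000s, as produced by misinterpreting "23456" as a
# timestamp, is unrepresentable and raises ValueError.
try:
    datetime.datetime(23455, 1, 1)
except ValueError as exc:
    print(exc)  # e.g. "year 23455 is out of range"
```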
If we uncomment the timestampFormat option, the command succeeds.
I believe there are two issues:
# a string of digits longer than 4 characters is inferred to be a timestamp
# setting timestampFormat to the default given in [the documentation|https://spark.apache.org/docs/3.5.0/sql-data-sources-json.html] fixes the problem, which suggests the documented default is not the actual default
This might be related to SPARK-45424.