[
https://issues.apache.org/jira/browse/SPARK-49858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-49858:
-----------------------------------
Labels: pull-request-available (was: )
> Pyspark JSON reader incorrectly considers a string of digits a timestamp and
> fails
> ----------------------------------------------------------------------------------
>
> Key: SPARK-49858
> URL: https://issues.apache.org/jira/browse/SPARK-49858
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Daniel Himmelstein
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2025-03-22-17-20-24-495.png,
> image-2025-03-22-17-23-17-473.png
>
>
> With pyspark 3.5.0, reading the following JSON fails:
> {code:python}
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.appName("timestamp_test").getOrCreate()
> data = spark.sparkContext.parallelize(['{"field" : "23456"}'])
> df = (
>     spark.read.option("inferTimestamp", True)
>     # .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]")
>     .json(path=data)
> )
> df.printSchema()
> df.collect()
> {code}
> The printSchema command shows that the field is inferred as a timestamp,
> and collect() then fails with the following error:
> {code:java}
> File ~/miniforge3/envs/facets/lib/python3.11/site-packages/pyspark/sql/types.py:282, in TimestampType.fromInternal(self, ts)
>     279 def fromInternal(self, ts: int) -> datetime.datetime:
>     280     if ts is not None:
>     281         # using int to avoid precision loss in float
> --> 282         return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)
> ValueError: year 23455 is out of range
> {code}
> If we uncomment the timestampFormat option, the command succeeds.
> I believe there are two issues:
> # that a string of digits with length > 4 is inferred to be a timestamp
> # that explicitly setting timestampFormat to the default given in [the
> documentation|https://spark.apache.org/docs/3.5.0/sql-data-sources-json.html]
> fixes the problem, which implies that the documented default is not the actual default.
> This might be related to SPARK-45424.
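
The Python-side half of the failure can be reproduced without Spark. The sketch below assumes (the report does not confirm this) that the string "23456" was inferred as the start of year 23456 in UTC; the reported "year 23455" would be consistent with that instant viewed from a timezone behind UTC. The names micros_for_year_start_utc, from_internal and try_convert are hypothetical helpers for illustration; only the conversion expression is copied from the traceback.

```python
import datetime

# 400 Gregorian years always span exactly 146097 days.
DAYS_PER_400_YEARS = 146_097


def micros_for_year_start_utc(year: int) -> int:
    """Microseconds from the Unix epoch to <year>-01-01T00:00:00 UTC.

    Hypothetical helper: datetime.date stops at year 9999, so the year is
    shifted down by whole 400-year cycles and the cycles are added back as days.
    """
    cycles = max(0, (year - 9999 + 399) // 400)
    shifted = datetime.date(year - 400 * cycles, 1, 1)
    epoch = datetime.date(1970, 1, 1)
    days = shifted.toordinal() - epoch.toordinal() + cycles * DAYS_PER_400_YEARS
    return days * 86_400 * 1_000_000


def from_internal(ts: int) -> datetime.datetime:
    # Same expression as the line flagged in the traceback (types.py:282).
    return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)


def try_convert(ts: int):
    """Return (value, None) on success or (None, exception) on failure."""
    try:
        return from_internal(ts), None
    except (ValueError, OverflowError, OSError) as exc:
        # ValueError is what the report shows; other platforms may raise
        # OverflowError or OSError for the same out-of-range input.
        return None, exc


# Assumption: Spark stored the inferred value as the start of year 23456 (UTC).
ts = micros_for_year_start_utc(23456)
value, error = try_convert(ts)
print(type(error).__name__, error)
```

If that assumption holds, the Python exception is only a symptom: datetime cannot represent years past 9999, so any fix has to stop the value from being inferred as a timestamp in the first place.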
--
This message was sent by Atlassian Jira
(v8.20.10#820010)