Daniel Himmelstein created SPARK-49858:
------------------------------------------
Summary: Pyspark JSON reader incorrectly considers a string of
digits a timestamp and fails
Key: SPARK-49858
URL: https://issues.apache.org/jira/browse/SPARK-49858
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.5.0
Reporter: Daniel Himmelstein
With pyspark 3.5.0, reading the following JSON fails:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("timestamp_test").getOrCreate()
data = spark.sparkContext.parallelize(['{"field" : "23456"}'])
df = (
    spark.read.option("inferTimestamp", True)
    # .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]")
    .json(path=data)
)
df.printSchema()
df.collect()
{code}
The printSchema output shows that the field is inferred as a timestamp, and collect then fails with
the following error:
{code:java}
File ~/miniforge3/envs/facets/lib/python3.11/site-packages/pyspark/sql/types.py:282, in TimestampType.fromInternal(self, ts)
    279 def fromInternal(self, ts: int) -> datetime.datetime:
    280     if ts is not None:
    281         # using int to avoid precision loss in float
--> 282         return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)

ValueError: year 23455 is out of range
{code}
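For context on the ValueError itself (a minimal plain-Python sketch, not Spark internals): Python datetimes only support years 1 through 9999, so once the digit string is treated as a timestamp, converting the internal value back to a datetime lands far outside that range.

```python
import datetime

# Python datetimes are limited to years 1..9999.
assert datetime.MAXYEAR == 9999

# A year in the 23000s, as produced by misinterpreting "23456" as a
# timestamp, is unrepresentable and raises ValueError.
try:
    datetime.datetime(23455, 1, 1)
except ValueError as exc:
    print(exc)  # e.g. "year 23455 is out of range"
```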
If we uncomment the timestampFormat option, the command succeeds.
I believe there are two issues:
# a string of digits longer than 4 characters is inferred to be a timestamp
# setting timestampFormat to the default given in [the documentation|https://spark.apache.org/docs/3.5.0/sql-data-sources-json.html] fixes the problem, which suggests the documented default is not the actual default
This might be related to SPARK-45424.