ueshin opened a new pull request #23795: [SPARK-26887][SQL][PYTHON] Create 
datetime.date directly instead of creating datetime64[ns] as intermediate data.
URL: https://github.com/apache/spark/pull/23795
 
 
   ## What changes were proposed in this pull request?
   
   Currently, `DataFrame.toPandas()` with Arrow enabled, or 
`ArrowStreamPandasSerializer` for pandas UDFs with pyarrow<0.12, creates a 
`datetime64[ns]` series as intermediate data and then converts it to a 
`datetime.date` series. The intermediate `datetime64[ns]` can overflow 
even if the date is valid.
   
   ```
   >>> import datetime
   >>>
   >>> t = [datetime.date(2262, 4, 12), datetime.date(2263, 4, 12)]
   >>>
   >>> df = spark.createDataFrame(t, 'date')
   >>> df.show()
   +----------+
   |     value|
   +----------+
   |2262-04-12|
   |2263-04-12|
   +----------+
   
   >>>
   >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
   >>>
   >>> df.toPandas()
           value
   0  1677-09-21
   1  1678-09-21
   ```
   
   We should avoid creating such intermediate data and create `datetime.date` 
series directly instead.
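
A minimal sketch of why the intermediate type matters, using only the standard library. It assumes the overflow behaves like signed 64-bit wraparound of a nanoseconds-since-epoch count, which is consistent with the output shown earlier. The helper names (`wrap_int64`, `via_int64_nanoseconds`, `direct_from_days`) are hypothetical and are not part of Spark, pandas, or pyarrow.

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)
NS_PER_DAY = 86_400 * 10**9  # nanoseconds in one day


def wrap_int64(n):
    """Reduce a Python int to the signed 64-bit range (two's complement)."""
    return (n + 2**63) % 2**64 - 2**63


def via_int64_nanoseconds(d):
    """Round-trip a date through an int64 nanosecond count (illustrative only)."""
    days = (d - EPOCH).days
    ns = wrap_int64(days * NS_PER_DAY)  # wraps for dates after 2262-04-11
    return EPOCH + datetime.timedelta(days=ns // NS_PER_DAY)


def direct_from_days(days):
    """Build the date straight from a days-since-epoch count, with no ns step."""
    return EPOCH + datetime.timedelta(days=days)


for d in [datetime.date(2262, 4, 11),
          datetime.date(2262, 4, 12),
          datetime.date(2263, 4, 12)]:
    days = (d - EPOCH).days
    print(d, "->", via_int64_nanoseconds(d), "|", direct_from_days(days))
```

Under these assumptions, 2262-04-11 survives the nanosecond round trip, while 2262-04-12 and 2263-04-12 wrap to 1677-09-21 and 1678-09-21. The direct conversion returns every date unchanged.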
   
   ## How was this patch tested?
   
   Modified some tests to include dates that overflow during the 
intermediate conversion.
   Ran the tests with pyarrow 0.8, 0.10, 0.11, and 0.12 in my local environment.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
