holdenk commented on a change in pull request #23795:
[SPARK-26887][SQL][PYTHON] Create datetime.date directly instead of creating
datetime64[ns] as intermediate data.
URL: https://github.com/apache/spark/pull/23795#discussion_r257439925
##########
File path: python/pyspark/sql/types.py
##########
@@ -1681,38 +1681,53 @@ def from_arrow_schema(arrow_schema):
for field in arrow_schema])
-def _check_series_convert_date(series, data_type):
- """
- Cast the series to datetime.date if it's a date type, otherwise returns the original series.
+def _arrow_column_to_pandas(column, data_type):
+ """ Convert Arrow Column to pandas Series.
+
+ If the given column is a date type column, creates a series of datetime.date directly instead
+ of creating datetime64[ns] as intermediate data.
- :param series: pandas.Series
- :param data_type: a Spark data type for the series
+ :param series: pyarrow.lib.Column
+ :param data_type: a Spark data type for the column
"""
- import pyarrow
+ import pandas as pd
+ import pyarrow as pa
from distutils.version import LooseVersion
- # As of Arrow 0.12.0, date_as_objects is True by default, see ARROW-3910
- if LooseVersion(pyarrow.__version__) < LooseVersion("0.12.0") and type(data_type) == DateType:
- return series.dt.date
+ # Since Arrow 0.11.0, support date_as_object to return datetime.date instead of np.datetime64.
Review comment:
Include a comment about the overflow here so we know why we are avoiding
`np.datetime64`.
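
For context on the overflow in question: `datetime64[ns]` stores a signed 64-bit count of nanoseconds since 1970-01-01, so it can only represent dates from roughly 1677 to 2262. A rough standard-library sketch of that limit follows. `fits_in_datetime64_ns` is a hypothetical helper written for illustration, not part of PySpark, pandas, or Arrow, and it assumes the lowest int64 value is reserved (NumPy uses it for NaT).

```python
import datetime

# Assumption for illustration: datetime64[ns] is an int64 count of
# nanoseconds since the Unix epoch, with the minimum int64 value reserved.
INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)
NS_PER_DAY = 86_400 * 10**9
EPOCH = datetime.date(1970, 1, 1)


def fits_in_datetime64_ns(d):
    """Hypothetical helper: True if midnight of date `d` is representable
    as an int64 number of nanoseconds since the epoch."""
    ns = (d - EPOCH).days * NS_PER_DAY
    return INT64_MIN < ns <= INT64_MAX


# Dates that datetime.date handles fine but nanosecond precision cannot hold.
for d in (datetime.date(1, 1, 1),
          datetime.date(2019, 2, 15),
          datetime.date(9999, 12, 31)):
    print(d.isoformat(), fits_in_datetime64_ns(d))
```

Any Spark `DateType` value outside that window would overflow if it passed through `datetime64[ns]` on the way to `datetime.date`, which is presumably what the in-code comment should spell out.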