[
https://issues.apache.org/jira/browse/SPARK-23290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349796#comment-16349796
]
Takuya Ueshin commented on SPARK-23290:
---------------------------------------
Thanks for the report!
I'm afraid I couldn't figure out what's going on because something seems to be
wrong with your example.
In your first example, the dtype of {{pdf['date']}} is {{object}}, but the
actual type of its values is {{str}}:
{code:python}
>>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
>>> pdf.dtypes
date object
num int64
dtype: object
>>> type(pdf['date'][0])
<type 'str'>
{code}
So the lambda works, because the function it calls expects a string:
{code:python}
>>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
0 2015-01-01
Name: date, dtype: object
{code}
Spark, on the other hand, returns {{datetime.date}} in 2.2 and {{pd.Timestamp}} in 2.3:
{code:python}
>>> df = spark.createDataFrame([('2015-01-01', 1)], ['date', 'num']).selectExpr("cast(date as date)", "num")
>>> df.printSchema()
root
|-- date: date (nullable = true)
|-- num: long (nullable = true)
>>> df.show()
+----------+---+
|      date|num|
+----------+---+
|2015-01-01|  1|
+----------+---+
{code}
In 2.2:
{code:python}
>>> pdf = df.toPandas()
>>> pdf.dtypes
date object
num int64
dtype: object
>>> type(pdf['date'][0])
<type 'datetime.date'>
{code}
In 2.3:
{code:python}
>>> pdf = df.toPandas()
>>> pdf.dtypes
date datetime64[ns]
num int64
dtype: object
>>> type(pdf['date'][0])
<class 'pandas._libs.tslib.Timestamp'>
{code}
In either version, the lambda wouldn't work anyway, because the value passed to it is not a string.
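As a rough sketch of that point, the snippet below applies the same parsing function to two hand-built series that mimic the values seen above: {{datetime.date}} objects as in 2.2 and {{pd.Timestamp}} values as in 2.3. The series are constructed directly in pandas rather than by {{toPandas()}}, and the names {{parse}}, {{old_style}} and {{new_style}} are only illustrative.

```python
import datetime as dt

import pandas as pd


def parse(d):
    # Same logic as the lambda in the report: expects a string.
    return dt.datetime.strptime(d, '%Y-%m-%d').date()


# Hand-built stand-ins for the column values (assumption: not produced by Spark).
old_style = pd.Series([dt.date(2015, 1, 1)], dtype=object)   # datetime.date values
new_style = pd.to_datetime(pd.Series(['2015-01-01']))        # datetime64[ns] values

failed_on = []
for series in (old_style, new_style):
    try:
        series.apply(parse)
    except TypeError:
        # strptime() rejects anything that is not a string.
        failed_on.append(type(series[0]).__name__)

print(failed_on)
```

Both series raise {{TypeError}} here, so the string-parsing lambda fails regardless of which of the two value types the column holds.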
Could you provide some other example to elaborate the problem?
IIUC, {{datetime.date}} and {{pd.Timestamp}} are kind of compatible, so we can
handle them in the same way. cc: [~bryanc]
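To illustrate what "kind of compatible" could mean for client code, here is a minimal sketch of a helper that accepts a string, a {{datetime.date}}, or a {{pd.Timestamp}} and always returns a {{datetime.date}}. It relies on {{pd.Timestamp}} being a subclass of {{datetime.datetime}}. The helper name {{to_date}} is hypothetical and not part of Spark or pandas, and the sample series are built by hand.

```python
import datetime as dt

import pandas as pd


def to_date(value):
    """Hypothetical helper: normalize str, datetime.date or pd.Timestamp to datetime.date."""
    if isinstance(value, dt.datetime):
        # Covers pd.Timestamp, which subclasses datetime.datetime.
        return value.date()
    if isinstance(value, dt.date):
        return value
    # Fall back to the string format used in the report.
    return dt.datetime.strptime(value, '%Y-%m-%d').date()


strings = pd.Series(['2015-01-01'])
dates = pd.Series([dt.date(2015, 1, 1)], dtype=object)
stamps = pd.to_datetime(strings)

results = [s.apply(to_date)[0] for s in (strings, dates, stamps)]
print(results)
```

The {{datetime.datetime}} check comes first because every {{datetime.datetime}} is also a {{datetime.date}}; testing in the other order would return the timestamp unchanged.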
Thanks!
> inadvertent change in handling of DateType when converting to pandas dataframe
> ------------------------------------------------------------------------------
>
> Key: SPARK-23290
> URL: https://issues.apache.org/jira/browse/SPARK-23290
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.3.0
> Reporter: Andre Menck
> Priority: Blocker
>
> In [this
> PR|https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968]
> there was a change in how `DateType` is being returned to users (line 1968
> in dataframe.py). This can cause client code to fail, as in the following
> example from a python terminal:
> {code:python}
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> date object
> num int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> 0 2015-01-01
> Name: date, dtype: object
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> date object
> num int64
> dtype: object
> >>> pdf['date'] = pd.to_datetime(pdf['date'])
> >>> pdf.dtypes
> date datetime64[ns]
> num int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/Users/amenck/anaconda2/lib/python2.7/site-packages/pandas/core/series.py", line 2355, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
> File "pandas/_libs/src/inference.pyx", line 1574, in pandas._libs.lib.map_infer
> File "<stdin>", line 1, in <lambda>
> TypeError: strptime() argument 1 must be string, not Timestamp
> >>>
> {code}
> Above we show both the old behavior (returning an "object" col) and the new
> behavior (returning a datetime column). Since there may be user code relying
> on the old behavior, I'd suggest reverting this specific part of this change.
> Also note that the NOTE on the docstring for the "_to_corrected_pandas_type"
> seems to be off, referring to the old behavior and not the current one.