GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/18403
[SPARK-21193][PYTHON] Specify Pandas version in setup.py
## What changes were proposed in this pull request?
It looks like we missed specifying the minimum Pandas version. This PR
proposes to fix that. In the current state, it should be Pandas 0.13.0, based on my tests.
This could be lowered to 0.11.0 if we remove the `copy` option used in `astype`.
Using `copy` does not actually look recommended:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.astype.html
> Return a copy when copy = True (be really careful with this!)
I guess this code path is not very hot, so it is probably slightly
better not to use `copy` for now.
With Pandas 0.10.0, the conversion starts to produce incorrect dtypes even
without `copy`. So, this PR proposes to remove `copy` and set the minimum version to 0.11.0.
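As a rough sketch of what such a pin could look like (this is not the actual diff in this PR; the `sql` extra name and the `MIN_PANDAS_VERSION` variable are assumptions for illustration):

```python
# Hypothetical fragment in the style of a setup.py; not the actual diff.
# Assumption: Pandas is declared as an optional extra named 'sql'.
MIN_PANDAS_VERSION = "0.11.0"  # lowest version that gave correct dtypes without `copy`

extras_require = {
    # Require at least the minimum version; newer releases are allowed.
    "sql": ["pandas>=%s" % MIN_PANDAS_VERSION],
}

print(extras_require)
```

In a real `setup.py`, a dictionary like this would be passed to `setup(...)` through the `extras_require` argument.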
**With Pandas 0.13.0** - released, 2014-01
```
a int32
b object
c bool
d float32
dtype: object
```
**With Pandas 0.12.0** - released, 2013-06
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 1734, in toPandas
    pdf[f] = pdf[f].astype(t, copy=False)
TypeError: astype() got an unexpected keyword argument 'copy'
```
without `copy`
```
a int32
b object
c bool
d float32
dtype: object
```
**With Pandas 0.11.0** - released, 2013-03
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 1734, in toPandas
    pdf[f] = pdf[f].astype(t, copy=False)
TypeError: astype() got an unexpected keyword argument 'copy'
```
without `copy`
```
a int32
b object
c bool
d float32
dtype: object
```
**With Pandas 0.10.0** - released, 2012-12
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 1734, in toPandas
    pdf[f] = pdf[f].astype(t, copy=False)
TypeError: astype() got an unexpected keyword argument 'copy'
```
without `copy`
```
a int64 # <- this should be 'int32'
b object
c bool
d float64 # <- this should be 'float32'
```
## How was this patch tested?
Manually tested with Pandas from 0.10.0 to 0.13.0.
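For reference, the dtype check can be approximated outside Spark with a small script like the one below. This is only a sketch: the column values, the `cast_columns` helper, and the dtype mapping are made up for illustration and are not the code in `toPandas`. It casts columns with `astype` without the `copy` argument and prints the resulting dtypes.

```python
import numpy as np
import pandas as pd


def cast_columns(pdf, dtypes):
    """Cast the named columns to the given dtypes, without passing `copy`."""
    for name, dtype in dtypes.items():
        # Same shape as the statement in the traceback, minus `copy=False`.
        pdf[name] = pdf[name].astype(dtype)
    return pdf


# Hypothetical data mirroring the column names in the outputs above.
pdf = pd.DataFrame({
    "a": [1, 2],
    "b": ["x", "y"],
    "c": [True, False],
    "d": [1.5, 2.5],
})

# Only the narrowed numeric columns need an explicit cast.
result = cast_columns(pdf, {"a": np.int32, "d": np.float32})
print(result.dtypes)
```

Running this under each Pandas version of interest and comparing the printed dtypes for `a` and `d` would be one way to repeat the manual check.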
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark SPARK-21193
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18403.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18403
----
commit 4ddc54b19b3ab036b49cbc9ce955c34bb6625c3a
Author: hyukjinkwon <[email protected]>
Date: 2017-06-23T10:43:42Z
Specify Pandas version in setup.py
----