Hi all,

I'd like to raise a discussion about the Pandas versions we support.
We originally discussed this at
https://github.com/apache/spark/pull/19607, but we'd like to ask for
feedback from the community.


Currently we don't explicitly specify the Pandas versions we support,
but we need to decide which versions to support because:

  - There have been a number of API evolutions around extension dtypes that
make supporting pandas 0.18.x and lower challenging.

  - Pandas versions older than 0.19.2 sometimes don't handle timestamp
values correctly, and we want to provide proper support for timestamps.

  - If users want to use vectorized UDFs, or toPandas / createDataFrame
from a Pandas DataFrame with Arrow (to be released in Spark 2.3), they
have to upgrade to Pandas 0.19.2 or later anyway, because we use pyarrow
internally, which only supports Pandas 0.19.2 or later.


The questions I'd like to ask are:

Can we drop support for old Pandas (< 0.19.2)?
If not, which versions should we support?


References:

- vectorized UDF
  - https://github.com/apache/spark/pull/18659
  - https://github.com/apache/spark/pull/18732
- toPandas with Arrow
  - https://github.com/apache/spark/pull/18459
- createDataFrame from pandas DataFrame with Arrow
  - https://github.com/apache/spark/pull/19646


Any comments are welcome!

Thanks.

-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin
