Thanks for the feedback.

Hyukjin Kwon:
> My only worry is users who depend on lower pandas versions
That's what I was worried about, and one of the reasons I moved this discussion here.

Li Jin:
> how complicated it is to support pandas < 0.19.2 with old non-Arrow interops

In my original PR (https://github.com/apache/spark/pull/19607) we fix the behavior of timestamp values for Pandas. If we need to support old Pandas, we will need at least some workarounds like the following:

https://github.com/apache/spark/blob/e919ed55758f75733d56287d5a49326b1067a44c/python/pyspark/sql/types.py#L1718-L1774

Thanks.

On Wed, Nov 15, 2017 at 12:59 AM, Li Jin <ice.xell...@gmail.com> wrote:

> I think this makes sense. PySpark/Pandas interops in 2.3 are new anyway; I
> don't think we need to support the new functionality with older versions
> of pandas (Takuya's reason 3).
>
> One thing I am not sure about is how complicated it would be to support
> pandas < 0.19.2 with the old non-Arrow interops while requiring pandas >=
> 0.19.2 for the new Arrow interops. Maybe it makes sense to let users keep
> using their existing PySpark code if they don't want any of the new stuff.
> If that is still too complicated, I would lean towards not supporting
> < 0.19.2.
>
>
> On Tue, Nov 14, 2017 at 6:04 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> +0 to drop it, as I said in the PR. I am seeing that it makes it hard to
>> get the cool changes through, and is slowing down getting them pushed.
>>
>> My only worry is users who depend on lower pandas versions (Pandas
>> 0.19.2 seems to have been released less than a year ago; Spark 2.1.0 was
>> released around the same time).
>>
>> If this worry is smaller than I expect, I definitely support it. It
>> should speed up those cool changes.
>>
>>
>> On 14 Nov 2017 7:14 pm, "Takuya UESHIN" <ues...@happy-camper.st> wrote:
>>
>> Hi all,
>>
>> I'd like to raise a discussion about the Pandas version.
>> Originally we were discussing it at
>> https://github.com/apache/spark/pull/19607, but we'd like to ask for
>> feedback from the community.
>>
>> Currently we don't explicitly specify the Pandas version we support,
>> but we need to decide what version to support because:
>>
>> - There have been a number of API evolutions around extension dtypes
>>   that make supporting pandas 0.18.x and lower challenging.
>>
>> - Pandas older than 0.19.2 sometimes doesn't handle timestamp values
>>   properly. We want to provide proper support for timestamp values.
>>
>> - If users want to use vectorized UDFs, or toPandas / createDataFrame
>>   from a Pandas DataFrame with Arrow, which will be released in Spark
>>   2.3, they have to upgrade to Pandas 0.19.2 or later anyway, because we
>>   need pyarrow internally, which supports only 0.19.2 or later.
>>
>>
>> The point I'd like to ask is:
>>
>> Can we drop support for old Pandas (< 0.19.2)?
>> If not, what version should we support?
>>
>>
>> References:
>>
>> - vectorized UDF
>>   - https://github.com/apache/spark/pull/18659
>>   - https://github.com/apache/spark/pull/18732
>> - toPandas with Arrow
>>   - https://github.com/apache/spark/pull/18459
>> - createDataFrame from pandas DataFrame with Arrow
>>   - https://github.com/apache/spark/pull/19646
>>
>>
>> Any comments are welcome!
>>
>> Thanks.
>>
>> --
>> Takuya UESHIN
>> Tokyo, Japan
>>
>> http://twitter.com/ueshin
>>
>>
>

--
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin
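[Editor's note] To make the minimum-version requirement discussed in this thread concrete, here is a minimal sketch of a pandas version gate. The function names, the tuple-based version parsing, and the error message are illustrative assumptions for this sketch, not the actual PySpark implementation.

```python
# Sketch of a minimum-pandas-version check, similar in spirit to the gate
# discussed in the thread. All names here are hypothetical, not PySpark API.

MINIMUM_PANDAS_VERSION = "0.19.2"


def parse_version(version):
    # Turn "0.19.2" into (0, 19, 2) so versions compare numerically,
    # not lexicographically (string comparison would say "0.9" > "0.19").
    return tuple(int(part) for part in version.split("."))


def require_minimum_pandas_version(installed_version):
    """Raise ImportError if the installed pandas is older than the minimum."""
    if parse_version(installed_version) < parse_version(MINIMUM_PANDAS_VERSION):
        raise ImportError(
            "Pandas >= %s must be installed for Arrow-based interops; "
            "found %s." % (MINIMUM_PANDAS_VERSION, installed_version))


# 0.18.1 would be rejected; 0.20.3 passes silently.
try:
    require_minimum_pandas_version("0.18.1")
except ImportError as exc:
    print("rejected:", exc)
require_minimum_pandas_version("0.20.3")  # no error
```

A real implementation would read the version from `pandas.__version__`; the point of the sketch is only that the Arrow code paths can fail fast with a clear message while non-Arrow paths stay untouched.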