I think this makes sense. The PySpark/Pandas interops in 2.3 are new anyway, so I don't think we need to support the new functionality with older versions of pandas (Takuya's reason 3).
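To make reason 3 concrete, here is a rough sketch of a minimum-version check that the Arrow-based paths could run up front. It is not taken from the PR: the names `MINIMUM_PANDAS_VERSION`, `parse_version` and `check_pandas_for_arrow` are made up for illustration, and the installed version is passed in as a string so the snippet runs without pandas installed.

```python
import re

# Minimum pandas version named in this thread for the Arrow-based paths.
MINIMUM_PANDAS_VERSION = (0, 19, 2)


def parse_version(version):
    """Turn '0.19.2' or '0.20.0rc1' into a tuple of ints such as (0, 19, 2)."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad short versions so that '0.19' compares as (0, 19, 0).
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_pandas_for_arrow(installed_version):
    """Raise ImportError if the given pandas version is too old (hypothetical helper)."""
    if parse_version(installed_version) < MINIMUM_PANDAS_VERSION:
        raise ImportError(
            "pandas >= 0.19.2 is required for Arrow-based conversion; found %s"
            % installed_version
        )
```

In practice the argument would be `pandas.__version__`, and only the Arrow-based code paths would need to call the check.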
One thing I am not sure about is how complicated it would be to support pandas < 0.19.2 with the old non-Arrow interops while requiring pandas >= 0.19.2 for the new Arrow interops. Maybe it makes sense to let users keep using their existing PySpark code if they don't want to use any of the new stuff. If this is still complicated, I would lean towards not supporting < 0.19.2.

On Tue, Nov 14, 2017 at 6:04 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> +0 to drop it, as I said in the PR. I am seeing that it makes it a lot
> harder to get the cool changes through, and slows down getting them
> pushed.
>
> My only worry is users who depend on lower pandas versions (pandas
> 0.19.2 seems to have been released less than a year ago; Spark
> 2.1.0 was released at around the same time).
>
> If this worry is smaller than I expected, I definitely support it. It should
> speed up those cool changes.
>
>
> On 14 Nov 2017 7:14 pm, "Takuya UESHIN" <ues...@happy-camper.st> wrote:
>
> Hi all,
>
> I'd like to raise a discussion about the Pandas version.
> Originally we were discussing it at
> https://github.com/apache/spark/pull/19607 but we'd like to ask for
> feedback from the community.
>
>
> Currently we don't explicitly specify the Pandas versions we support,
> but we need to decide which version we should support because:
>
> - There have been a number of API evolutions around extension dtypes
> that make supporting pandas 0.18.x and lower challenging.
>
> - Sometimes Pandas older than 0.19.2 doesn't handle timestamp values
> properly. We want to provide proper support for timestamp values.
>
> - If users want to use vectorized UDFs, or toPandas / createDataFrame
> from a Pandas DataFrame with Arrow, which will be released in Spark 2.3,
> they have to upgrade to Pandas 0.19.2 or later anyway, because we need
> pyarrow internally, which supports only 0.19.2 or later.
>
>
> The point I'd like to ask is:
>
> Can we drop support for old Pandas (<0.19.2)?
> If not, what version should we support?
>
>
> References:
>
> - vectorized UDF
>   - https://github.com/apache/spark/pull/18659
>   - https://github.com/apache/spark/pull/18732
> - toPandas with Arrow
>   - https://github.com/apache/spark/pull/18459
> - createDataFrame from pandas DataFrame with Arrow
>   - https://github.com/apache/spark/pull/19646
>
>
> Any comments are welcome!
>
> Thanks.
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin