Yeah, PyArrow is the only other PySpark dependency for which we check a minimum version. We updated that not too long ago to 0.12.1, which I think we are still good on for now.
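For anyone not familiar with the mechanism under discussion: the check is a lazy gate that only fires when a Pandas- or Arrow-dependent API is actually used. Below is a minimal sketch of what such a gate looks like, modeled loosely on the helpers in pyspark.sql.utils - the exact function name and error messages here are assumptions, and the version constant is just the number from this thread.

    # A minimal sketch of a lazy minimum-version gate, modeled loosely on
    # pyspark.sql.utils; exact names and messages in Spark may differ.
    from distutils.version import LooseVersion

    MINIMUM_PANDAS_VERSION = "0.23.2"  # proposed minimum in this thread

    def require_minimum_pandas_version():
        """Raise ImportError if Pandas is missing or below the minimum.

        Called only when a Pandas-dependent API (e.g. toPandas) is used,
        so users who never touch Pandas functionality are unaffected.
        """
        try:
            import pandas
        except ImportError:
            raise ImportError(
                "Pandas >= %s must be installed; however, it was not "
                "found." % MINIMUM_PANDAS_VERSION)
        if LooseVersion(pandas.__version__) < LooseVersion(MINIMUM_PANDAS_VERSION):
            raise ImportError(
                "Pandas >= %s must be installed; however, your version "
                "was %s." % (MINIMUM_PANDAS_VERSION, pandas.__version__))

    # An analogous require_minimum_pyarrow_version() gates PyArrow,
    # currently at 0.12.1 per the above.

With a gate like this, someone on an older cluster with Pandas 0.19.2 gets a clear ImportError at the point of use rather than an obscure failure inside a workaround, which is why the choice of minimum matters mostly for users on older clusters, as Bryan notes below.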
On Fri, Jun 14, 2019 at 11:36 AM Felix Cheung <felixcheun...@hotmail.com> wrote:

> How about PyArrow?
>
> ------------------------------
> *From:* Holden Karau <hol...@pigscanfly.ca>
> *Sent:* Friday, June 14, 2019 11:06:15 AM
> *To:* Felix Cheung
> *Cc:* Bryan Cutler; Dongjoon Hyun; Hyukjin Kwon; dev; shane knapp
> *Subject:* Re: [DISCUSS] Increasing minimum supported version of Pandas
>
> Are there other Python dependencies we should consider upgrading at the
> same time?
>
> On Fri, Jun 14, 2019 at 7:45 PM Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>
>> So to be clear: the min version check is 0.23, and the Jenkins test env
>> is 0.24.
>>
>> I'm ok with this. I hope someone will test 0.23 on releases, though,
>> before we sign off?
>>
> We should maybe add this to the release instruction notes?
>
>> ------------------------------
>> *From:* shane knapp <skn...@berkeley.edu>
>> *Sent:* Friday, June 14, 2019 10:23:56 AM
>> *To:* Bryan Cutler
>> *Cc:* Dongjoon Hyun; Holden Karau; Hyukjin Kwon; dev
>> *Subject:* Re: [DISCUSS] Increasing minimum supported version of Pandas
>>
>> excellent. i shall not touch anything. :)
>>
>> On Fri, Jun 14, 2019 at 10:22 AM Bryan Cutler <cutl...@gmail.com> wrote:
>>
>>> Shane, I think 0.24.2 is probably more common right now, so if we were
>>> to pick one to test against, I still think it should be that one. Our
>>> Pandas usage in PySpark is pretty conservative, so it's pretty unlikely
>>> that we will add something that would break 0.23.X.
>>>
>>> On Fri, Jun 14, 2019 at 10:10 AM shane knapp <skn...@berkeley.edu>
>>> wrote:
>>>
>>>> ah, ok... should we downgrade the testing env on jenkins then? any
>>>> specific version?
>>>>
>>>> shane, who is loath (and i mean LOATH) to touch python envs ;)
>>>>
>>>> On Fri, Jun 14, 2019 at 10:08 AM Bryan Cutler <cutl...@gmail.com>
>>>> wrote:
>>>>
>>>>> I should have stated this earlier, but when the user does something
>>>>> that requires Pandas, the minimum version is checked against what was
>>>>> imported, and an exception is raised if the imported version is lower.
>>>>> So I'm concerned that using 0.24.2 might be a little too new for users
>>>>> running older clusters. To give some release dates: 0.23.2 was
>>>>> released about a year ago, 0.24.0 in January, and 0.24.2 in March.
>>>>>
> I think, given that we're switching to requiring Python 3 and are also a
> bit of a way from cutting a release, 0.24 could be OK as a minimum
> version requirement.
>
>>>>> On Fri, Jun 14, 2019 at 9:27 AM shane knapp <skn...@berkeley.edu>
>>>>> wrote:
>>>>>
>>>>>> just so everyone knows, our python 3.6 testing infra is currently on
>>>>>> 0.24.2...
>>>>>>
>>>>>> On Fri, Jun 14, 2019 at 9:16 AM Dongjoon Hyun
>>>>>> <dongjoon.h...@gmail.com> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> Thank you for this effort, Bryan!
>>>>>>>
>>>>>>> Bests,
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>> On Fri, Jun 14, 2019 at 4:24 AM Holden Karau <hol...@pigscanfly.ca>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I'm +1 for upgrading, although since this is probably the last
>>>>>>>> easy chance we'll have to bump version numbers, I'd suggest 0.24.2.
>>>>>>>>
>>>>>>>> On Fri, Jun 14, 2019 at 4:38 AM Hyukjin Kwon <gurwls...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I am +1 to go for 0.23.2 - keeping such an old minimum brings some
>>>>>>>>> overhead to testing PyArrow and Pandas combinations. Spark 3
>>>>>>>>> should be a good time to increase it.
>>>>>>>>>
>>>>>>>>> On Fri, Jun 14, 2019 at 9:46 AM, Bryan Cutler <cutl...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> We would like to discuss increasing the minimum supported
>>>>>>>>>> version of Pandas in Spark, which is currently 0.19.2.
>>>>>>>>>>
>>>>>>>>>> Pandas 0.19.2 was released nearly 3 years ago, and there are
>>>>>>>>>> some workarounds in PySpark that could be removed if such an old
>>>>>>>>>> version were no longer required. This will help keep the code
>>>>>>>>>> clean and reduce maintenance effort.
>>>>>>>>>>
>>>>>>>>>> The change is targeted for the Spark 3.0.0 release; see
>>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-28041. The current
>>>>>>>>>> thought is to bump the version to 0.23.2, but we would like to
>>>>>>>>>> discuss before making a change. Does anyone else have thoughts
>>>>>>>>>> on this?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Bryan
>>>>>>>>>
>>>>>>>> --
>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>>>> https://amzn.to/2MaRAG9
>>>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>
>>>>>> --
>>>>>> Shane Knapp
>>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>>> https://rise.cs.berkeley.edu
>>>>
>>>> --
>>>> Shane Knapp
>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>> https://rise.cs.berkeley.edu
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau