Yeah, PyArrow is the only other PySpark dependency for which we check a minimum version. We updated that not too long ago to 0.12.1, which I think we are still good on for now.
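For anyone not familiar with the mechanism under discussion: the check is a lazy gate that only fires when a Pandas- or Arrow-dependent API is actually used. Below is a minimal sketch of what such a gate looks like, modeled loosely on the helpers in pyspark.sql.utils - the exact function name and error messages here are assumptions, and the version constant is just the number from this thread.

    # A minimal sketch of a lazy minimum-version gate, modeled loosely on
    # pyspark.sql.utils; exact names and messages in Spark may differ.
    from distutils.version import LooseVersion

    MINIMUM_PANDAS_VERSION = "0.23.2"  # proposed minimum in this thread

    def require_minimum_pandas_version():
        """Raise ImportError if Pandas is missing or below the minimum.

        Called only when a Pandas-dependent API (e.g. toPandas) is used,
        so users who never touch Pandas functionality are unaffected.
        """
        try:
            import pandas
        except ImportError:
            raise ImportError(
                "Pandas >= %s must be installed; however, it was not "
                "found." % MINIMUM_PANDAS_VERSION)
        if LooseVersion(pandas.__version__) < LooseVersion(MINIMUM_PANDAS_VERSION):
            raise ImportError(
                "Pandas >= %s must be installed; however, your version "
                "was %s." % (MINIMUM_PANDAS_VERSION, pandas.__version__))

    # An analogous require_minimum_pyarrow_version() gates PyArrow,
    # currently at 0.12.1 per the above.

With a gate like this, someone on an older cluster with Pandas 0.19.2 gets a clear ImportError at the point of use rather than an obscure failure inside a workaround, which is why the choice of minimum matters mostly for users on older clusters, as Bryan notes below.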
On Fri, Jun 14, 2019 at 11:36 AM Felix Cheung <felixcheun...@hotmail.com> wrote:

> How about PyArrow?
>
> ------------------------------
> *From:* Holden Karau <hol...@pigscanfly.ca>
> *Sent:* Friday, June 14, 2019 11:06:15 AM
> *To:* Felix Cheung
> *Cc:* Bryan Cutler; Dongjoon Hyun; Hyukjin Kwon; dev; shane knapp
> *Subject:* Re: [DISCUSS] Increasing minimum supported version of Pandas
>
> Are there other Python dependencies we should consider upgrading at the
> same time?
>
> On Fri, Jun 14, 2019 at 7:45 PM Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>
>> So to be clear: the min version check is 0.23, and the Jenkins test env
>> is 0.24.
>>
>> I'm ok with this. I hope someone will test 0.23 on releases, though,
>> before we sign off?
>>
> We should maybe add this to the release instruction notes?
>
>> ------------------------------
>> *From:* shane knapp <skn...@berkeley.edu>
>> *Sent:* Friday, June 14, 2019 10:23:56 AM
>> *To:* Bryan Cutler
>> *Cc:* Dongjoon Hyun; Holden Karau; Hyukjin Kwon; dev
>> *Subject:* Re: [DISCUSS] Increasing minimum supported version of Pandas
>>
>> excellent. i shall not touch anything. :)
>>
>> On Fri, Jun 14, 2019 at 10:22 AM Bryan Cutler <cutl...@gmail.com> wrote:
>>
>>> Shane, I think 0.24.2 is probably more common right now, so if we were
>>> to pick one to test against, I still think it should be that one. Our
>>> Pandas usage in PySpark is pretty conservative, so it's pretty unlikely
>>> that we will add something that would break 0.23.X.
>>>
>>> On Fri, Jun 14, 2019 at 10:10 AM shane knapp <skn...@berkeley.edu>
>>> wrote:
>>>
>>>> ah, ok... should we downgrade the testing env on jenkins then? any
>>>> specific version?
>>>>
>>>> shane, who is loath (and i mean LOATH) to touch python envs ;)
>>>>
>>>> On Fri, Jun 14, 2019 at 10:08 AM Bryan Cutler <cutl...@gmail.com>
>>>> wrote:
>>>>
>>>>> I should have stated this earlier, but when the user does something
>>>>> that requires Pandas, the minimum version is checked against what was
>>>>> imported, and an exception is raised if the imported version is lower.
>>>>> So I'm concerned that using 0.24.2 might be a little too new for users
>>>>> running older clusters. To give some release dates: 0.23.2 was
>>>>> released about a year ago, 0.24.0 in January, and 0.24.2 in March.
>>>>>
> I think, given that we're switching to requiring Python 3 and are also a
> bit of a way from cutting a release, 0.24 could be OK as a minimum
> version requirement.
>
>>>>> On Fri, Jun 14, 2019 at 9:27 AM shane knapp <skn...@berkeley.edu>
>>>>> wrote:
>>>>>
>>>>>> just so everyone knows, our python 3.6 testing infra is currently on
>>>>>> 0.24.2...
>>>>>>
>>>>>> On Fri, Jun 14, 2019 at 9:16 AM Dongjoon Hyun
>>>>>> <dongjoon.h...@gmail.com> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> Thank you for this effort, Bryan!
>>>>>>>
>>>>>>> Bests,
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>> On Fri, Jun 14, 2019 at 4:24 AM Holden Karau <hol...@pigscanfly.ca>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I'm +1 for upgrading, although since this is probably the last
>>>>>>>> easy chance we'll have to bump version numbers, I'd suggest 0.24.2.
>>>>>>>>
>>>>>>>> On Fri, Jun 14, 2019 at 4:38 AM Hyukjin Kwon <gurwls...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I am +1 to go for 0.23.2 - keeping such an old minimum brings some
>>>>>>>>> overhead to testing PyArrow and Pandas combinations. Spark 3
>>>>>>>>> should be a good time to increase it.
>>>>>>>>>
>>>>>>>>> On Fri, Jun 14, 2019 at 9:46 AM, Bryan Cutler <cutl...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> We would like to discuss increasing the minimum supported
>>>>>>>>>> version of Pandas in Spark, which is currently 0.19.2.
>>>>>>>>>>
>>>>>>>>>> Pandas 0.19.2 was released nearly 3 years ago, and there are
>>>>>>>>>> some workarounds in PySpark that could be removed if such an old
>>>>>>>>>> version were no longer required. This will help keep the code
>>>>>>>>>> clean and reduce maintenance effort.
>>>>>>>>>>
>>>>>>>>>> The change is targeted for the Spark 3.0.0 release; see
>>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-28041. The current
>>>>>>>>>> thought is to bump the version to 0.23.2, but we would like to
>>>>>>>>>> discuss before making a change. Does anyone else have thoughts
>>>>>>>>>> on this?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Bryan
>>>>>>>>>
>>>>>>>> --
>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>>>> https://amzn.to/2MaRAG9
>>>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>
>>>>>> --
>>>>>> Shane Knapp
>>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>>> https://rise.cs.berkeley.edu
>>>>
>>>> --
>>>> Shane Knapp
>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>> https://rise.cs.berkeley.edu
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau