Hi all,

I want to revive this old thread since no action was taken so far. If we
plan to mark Python 2 as deprecated in Spark 3.0, we should do it as early
as possible and let users know ahead. PySpark depends on Python, numpy,
pandas, and pyarrow, all of which are sunsetting Python 2 support by
2020/01/01 per https://python3statement.org/. After that date we cannot
realistically support Python 2, because those libraries do not plan to make
new releases, even for security fixes. So I suggest the following:

1. Update Spark website and state that Python 2 is deprecated in Spark 3.0
and its support will be removed in a release after 2020/01/01.
2. Make a formal announcement to dev@ and users@.
3. Add Apache Spark project to https://python3statement.org/ timeline.
4. Update PySpark to check the Python version and print a deprecation
warning when the version is < 3.
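For step 4, a minimal sketch of what the startup check could look like (the helper name `warn_if_python2` is hypothetical, not existing PySpark code; the exact wording and placement would be up to whoever picks up the JIRA):

```python
import sys
import warnings


def warn_if_python2():
    # Emit a DeprecationWarning when PySpark is imported under Python 2.
    if sys.version_info[0] < 3:
        warnings.warn(
            "Python 2 support is deprecated as of Spark 3.0 and will be "
            "removed in a release after 2020/01/01. Please migrate to "
            "Python 3.",
            DeprecationWarning,
        )
```

This keeps Python 2 working as before; the only behavior change is the warning at import time.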

Any thoughts and suggestions?

Best,
Xiangrui

On Mon, Sep 17, 2018 at 6:54 PM Erik Erlandson <eerla...@redhat.com> wrote:

>
> I think that makes sense. The main benefit of deprecating *prior* to 3.0
> would be informational - making the community aware of the upcoming
> transition earlier. But there are other ways to start informing the
> community between now and 3.0, besides formal deprecation.
>
> I have some residual curiosity about what it might mean for a release like
> 2.4 to still be in its support lifetime after Py2 goes EOL. I asked Apache
> Legal <https://issues.apache.org/jira/browse/LEGAL-407> to comment. It is
> possible there are no issues with this at all.
>
>
> On Mon, Sep 17, 2018 at 4:26 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> i'd like to second that.
>>
>> if we want to communicate timeline, we can add to the release notes
>> saying py2 will be deprecated in 3.0, and removed in a 3.x release.
>>
>> --
>> excuse the brevity and lower case due to wrist injury
>>
>>
>> On Mon, Sep 17, 2018 at 4:24 PM Matei Zaharia <matei.zaha...@gmail.com>
>> wrote:
>>
>>> That’s a good point — I’d say there’s just a risk of creating a
>>> perception issue. First, some users might feel that this means they have to
>>> migrate now, which is before Python itself drops support; they might also
>>> be surprised that we did this in a minor release (e.g. might we drop Python
>>> 2 altogether in a Spark 2.5 if that later comes out?). Second, contributors
>>> might feel that this means new features no longer have to work with Python
>>> 2, which would be confusing. Maybe it’s OK on both fronts, but it just
>>> seems scarier for users to do this now if we do plan to have Spark 3.0 in
>>> the next 6 months anyway.
>>>
>>> Matei
>>>
>>> > On Sep 17, 2018, at 1:04 PM, Mark Hamstra <m...@clearstorydata.com>
>>> wrote:
>>> >
>>> > What is the disadvantage of deprecating now in 2.4.0? I mean, it
>>> doesn't change the code at all; it's just a notification that we will
>>> eventually cease supporting Py2. Wouldn't users prefer to get that
>>> notification sooner rather than later?
>>> >
>>> > On Mon, Sep 17, 2018 at 12:58 PM Matei Zaharia <
>>> matei.zaha...@gmail.com> wrote:
>>> > I’d like to understand the maintenance burden of Python 2 before
>>> deprecating it. Since it is not EOL yet, it might make sense to only
>>> deprecate it once it’s EOL (which is still over a year from now).
>>> Supporting Python 2+3 seems less burdensome than supporting, say, multiple
>>> Scala versions in the same codebase, so what are we losing out on?
>>> >
>>> > The other thing is that even though Python core devs might not support
>>> 2.x later, it’s quite possible that various Linux distros will if moving
>>> from 2 to 3 remains painful. In that case, we may want Apache Spark to
>>> continue releasing for it despite the Python core devs not supporting it.
>>> >
>>> > Basically, I’d suggest to deprecate this in Spark 3.0 and then remove
>>> it later in 3.x instead of deprecating it in 2.4. I’d also consider looking
>>> at what other data science tools are doing before fully removing it: for
>>> example, if Pandas and TensorFlow no longer support Python 2 past some
>>> point, that might be a good point to remove it.
>>> >
>>> > Matei
>>> >
>>> > > On Sep 17, 2018, at 11:01 AM, Mark Hamstra <m...@clearstorydata.com>
>>> wrote:
>>> > >
>>> > > If we're going to do that, then we need to do it right now, since
>>> 2.4.0 is already in release candidates.
>>> > >
>>> > > On Mon, Sep 17, 2018 at 10:57 AM Erik Erlandson <eerla...@redhat.com>
>>> wrote:
>>> > > I like Mark’s concept of deprecating Py2 starting with 2.4: it may
>>> seem like a ways off, but even now some Spark versions may still be
>>> supporting Py2 past the point where Py2 is no longer receiving security
>>> patches.
>>> > >
>>> > >
>>> > > On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra <
>>> m...@clearstorydata.com> wrote:
>>> > > We could also deprecate Py2 already in the 2.4.0 release.
>>> > >
>>> > > On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson <eerla...@redhat.com>
>>> wrote:
>>> > > In case this didn't make it onto this thread:
>>> > >
>>> > > There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and
>>> remove it entirely on a later 3.x release.
>>> > >
>>> > > On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson <
>>> eerla...@redhat.com> wrote:
>>> > > On a separate dev@spark thread, I raised a question of whether or
>>> not to support python 2 in Apache Spark, going forward into Spark 3.0.
>>> > >
>>> > > Python-2 is going EOL at the end of 2019. The upcoming release of
>>> Spark 3.0 is an opportunity to make breaking changes to Spark's APIs, and
>>> so it is a good time to reconsider Python-2 support in PySpark.
>>> > >
>>> > > Key advantages to dropping Python 2 are:
>>> > >       • Support for PySpark becomes significantly easier.
>>> > >       • Avoid having to support Python 2 until Spark 4.0, which is
>>> likely to imply supporting Python 2 for some time after it goes EOL.
>>> > > (Note that supporting python 2 after EOL means, among other things,
>>> that PySpark would be supporting a version of python that was no longer
>>> receiving security patches)
>>> > >
>>> > > The main disadvantage is that PySpark users who have legacy python-2
>>> code would have to migrate their code to python 3 to take advantage of
>>> Spark 3.0.
>>> > >
>>> > > This decision obviously has large implications for the Apache Spark
>>> community and we want to solicit community feedback.
>>> > >
>>> > >
>>> >
>>>
>>>
>
