On Fri, Feb 28, 2020 at 9:48 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
> Hi, Matei and Michael.
>
> I'm also a big supporter of policy-based project management.
>
> Before going further,
>
> 1. Could you estimate how many revert commits are required in `branch-3.0` for the new rubric?
> 2. Are you going to revert all removed test cases for the deprecated ones?

This is a good point; making sure we keep the tests as well is important (worse than removing a deprecated API is shipping it broken).

> 3. Will it cause any delay for the Apache Spark 3.0.0 release?
> (I believe it was previously scheduled for June, before Spark Summit 2020.)

I think it is OK if we need to delay to make a better release, especially given that our current preview releases are available to gather community feedback.

> Although there was a discussion already, I want to make sure about the following tough parts.
>
> 4. We are not going to add a Scala 2.11 API, right?

I hope not.

> 5. We are not going to support Python 2.x in Apache Spark 3.1+, right?

I think doing that would be bad; it has already been end-of-lifed elsewhere.

> 6. Do we have enough resources for testing the deprecated ones?
> (Currently, we have 8 heavy Jenkins jobs for `branch-3.0` already.)
>
> Especially for (2) and (6), we know that keeping deprecated APIs without testing doesn't give us any support for the new rubric.
>
> Bests,
> Dongjoon.
>
> On Thu, Feb 27, 2020 at 5:31 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:
>
>> +1 on this new rubric. It definitely captures the issues I’ve seen in Spark and in other projects. If we write down this rubric (or something like it), it will also be easier to refer to it during code reviews or in proposals of new APIs (we could ask “do you expect to have to change this API in the future, and if so, how”).
>>
>> Matei
>>
>> On Feb 24, 2020, at 3:02 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>
>> Hello Everyone,
>>
>> As more users have started upgrading to the Spark 3.0 preview (including myself), there have been many discussions around APIs that have been broken compared with Spark 2.x. In many of these discussions, one of the rationales for breaking an API seems to be "Spark follows semantic versioning <https://spark.apache.org/versioning-policy.html>, so this major release is our chance to get it right [by breaking APIs]". Similarly, in many cases the response to questions about why an API was completely removed has been, "this API has been deprecated since x.x, so we have to remove it".
>>
>> As a long-time contributor to and user of Spark, I find this interpretation of the policy concerning. This reasoning misses the intention of the original policy, and I am worried that it will hurt the long-term success of the project.
>>
>> I definitely understand that these are hard decisions, and I'm not proposing that we never remove anything from Spark. However, I would like to give some additional context and also propose a different rubric for thinking about API breakage moving forward.
>>
>> Spark adopted semantic versioning back in 2014 during the preparations for the 1.0 release. As this was the first major release -- and as, up until fairly recently, Spark had only been an academic project -- no real promises about API stability had ever been made.
>>
>> During the discussion, some committers suggested that this was an opportunity to clean up cruft and give the Spark APIs a once-over, making cosmetic changes to improve consistency.
>> However, in the end, it was decided that in many cases it was not in the best interests of the Spark community to break things just because we could. Matei actually said it pretty forcefully <http://apache-spark-developers-list.1001551.n3.nabble.com/Proposal-for-Spark-Release-Strategy-td464i20.html#a503>:
>>
>> I know that some names are suboptimal, but I absolutely detest breaking APIs, config names, etc. I’ve seen it happen way too often in other projects (even things we depend on that are officially post-1.0, like Akka or Protobuf or Hadoop), and it’s very painful. I think that we as fairly cutting-edge users are okay with libraries occasionally changing, but many others will consider it a show-stopper. Given this, I think that any cosmetic change now, even though it might improve clarity slightly, is not worth the tradeoff in terms of creating an update barrier for existing users.
>>
>> In the end, while some changes were made, most APIs remained the same, and users of Spark <= 0.9 were able to upgrade to 1.0 fairly easily. I think this served the project very well, as compatibility means users are able to upgrade, and we keep as many people on the latest versions of Spark (though maybe not the latest APIs of Spark) as possible.
>>
>> As Spark grows, I think compatibility actually becomes more important, and we should be more conservative rather than less. Today, there are very likely more Spark programs running than there were at any other time in the past. Spark is no longer a tool used only by advanced hackers; it is now also running "traditional enterprise workloads." In many cases these jobs are powering important processes long after the original author leaves.
>>
>> Broken APIs can also affect libraries that extend Spark. This dependency can be even harder for users: if a library has not been upgraded to use the new APIs and they need that library, they are stuck.
>>
>> Given all of this, I'd like to propose the following rubric as an addition to our semantic versioning policy. After discussion, and if people agree this is a good idea, I'll call a vote of the PMC to ratify its inclusion in the official policy.
>>
>> Considerations When Breaking APIs
>>
>> The Spark project strives to avoid breaking APIs or silently changing behavior, even at major versions. While this is not always possible, the balance of the following factors should be considered before choosing to break an API.
>>
>> Cost of Breaking an API
>>
>> Breaking an API almost always has a non-trivial cost to the users of Spark. A broken API means that Spark programs need to be rewritten before they can be upgraded. However, there are a few considerations when thinking about what the cost will be:
>>
>> - Usage - an API that is actively used in many different places is always very costly to break. While it is hard to know usage for sure, there are a bunch of ways that we can estimate:
>>   - How long has the API been in Spark?
>>   - Is the API common even for basic programs?
>>   - How often do we see recent questions in JIRA or on the mailing lists?
>>   - How often does it appear on StackOverflow or in blogs?
>> - Behavior after the break - How will a program that works today behave after the break? The following are listed roughly in order of increasing severity:
>>   - Will there be a compiler or linker error?
>>   - Will there be a runtime exception?
>>   - Will that exception happen after significant processing has been done?
>>   - Will we silently return different answers? (very hard to debug; users might not even notice!)
>>
>> Cost of Maintaining an API
>>
>> Of course, the above does not mean that we will never break any APIs. We must also consider the cost, both to the project and to our users, of keeping the API in question.
>>
>> - Project Costs - Every API we have needs to be tested and needs to keep working as other parts of the project change. These costs are significantly exacerbated when external dependencies change (the JVM, Scala, etc.). In some cases, while maintaining a particular API may not be completely technically infeasible, its cost can become too high.
>> - User Costs - APIs also have a cognitive cost to users learning Spark or trying to understand Spark programs. This cost becomes even higher when the API in question has confusing or undefined semantics.
>>
>> Alternatives to Breaking an API
>>
>> In cases where there is a "bad API", but where the cost of removal is also high, there are alternatives that should be considered that do not hurt existing users but do address some of the maintenance costs.
>>
>> - Avoid Bad APIs - While this is a bit obvious, it is an important point. Any time we are adding a new interface to Spark we should consider that we might be stuck with this API forever. Think deeply about how new APIs relate to existing ones, as well as how you expect them to evolve over time.
>> - Deprecation Warnings - All deprecation warnings should point to a clear alternative and should never just say that an API is deprecated.
>> - Updated Docs - Documentation should point to the "best" recommended way of performing a given task. In the cases where we maintain legacy documentation, we should clearly point to newer APIs and suggest to users the "right" way.
>> - Community Work - Many people learn Spark by reading blogs and other sites such as StackOverflow. However, many of these resources are out of date. Updating them reduces the cost of eventually removing deprecated APIs.
>>
>> Examples
>>
>> Here are some examples of how I think the policy above could be applied to different issues that have been discussed recently. These are only meant to illustrate how to apply the above rubric and are not intended to be part of the official policy.
>>
>> [SPARK-26362] Remove 'spark.driver.allowMultipleContexts' to disallow multiple creation of SparkContexts #23311 <https://github.com/apache/spark/pull/23311>
>>
>> - Cost to Break - Multiple contexts in a single JVM never worked properly. When users tried it, they would nearly always report that Spark was broken (SPARK-2243 <https://issues.apache.org/jira/browse/SPARK-2243>), due to the confusing set of log messages. Given this, I think it is very unlikely that there are many real-world use cases active today. Even those cases likely suffer from undiagnosed issues, as there are many areas of Spark that assume a single context per JVM.
>> - Cost to Maintain - We have recently had users ask on the mailing list if this was supported, as the conf led them to believe it was, and the existence of this configuration as "supported" makes it harder to reason about certain global state in SparkContext.
>>
>> Decision: Remove this configuration and related code.
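
For concreteness, here is a rough sketch (in Scala, assuming Spark 2.x in local mode; the exact exception message varies by version) of the behavior this configuration controlled:

    import org.apache.spark.{SparkConf, SparkContext}

    // The first context starts normally.
    val sc1 = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("first"))

    // In Spark 2.x, creating a second context in the same JVM fails fast with a
    // SparkException along the lines of "Only one SparkContext may be running in
    // this JVM (see SPARK-2243)".
    // val sc2 = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("second"))

    // The conf only suppressed that check; much of Spark still assumed one context
    // per JVM, so behavior with two live contexts was effectively undefined.
    val permissive = new SparkConf()
      .setMaster("local[2]")
      .setAppName("second")
      .set("spark.driver.allowMultipleContexts", "true")
    // val sc2 = new SparkContext(permissive)  // no longer an escape hatch in Spark 3.0

    sc1.stop()
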
>> [SPARK-25908] Remove registerTempTable #22921 <https://github.com/apache/spark/pull/22921/> (only looking at one API of this PR)
>>
>> - Cost to Break - This is a wildly popular API of Spark SQL that has been there since the first release. There are tons of blog posts and examples that use this syntax if you google "dataframe registerTempTable <https://www.google.com/search?q=dataframe+registertemptable&rlz=1C5CHFA_enUS746US746&oq=dataframe+registertemptable&aqs=chrome.0.0l8.3040j1j7&sourceid=chrome&ie=UTF-8>" (even more than the "correct" API "dataframe createOrReplaceTempView <https://www.google.com/search?rlz=1C5CHFA_enUS746US746&ei=TkZMXrj1ObzA0PEPpLKR2A4&q=dataframe+createorreplacetempview&oq=dataframe+createor&gs_l=psy-ab.3.0.0j0i22i30l7.663.1303..2750...0.3..1.212.782.7j0j1......0....1..gws-wiz.......0i71j0i131.zP34wH1novM>"). All of these will be invalid for users of Spark 3.0.
>> - Cost to Maintain - This is just an alias, so there is not a lot of extra machinery required to keep the API. Users have two ways to do the same thing, but we can note in the docs that this is just an alias.
>>
>> Decision: Do not remove this API; I would even consider un-deprecating it. I anecdotally asked several users, and this is the API they prefer over the "correct" one.
>>
>> [SPARK-25496] Deprecate from_utc_timestamp and to_utc_timestamp #24195 <https://github.com/apache/spark/pull/24195>
>>
>> - Cost to Break - I think that this case actually exemplifies several anti-patterns in breaking APIs. In some languages, the deprecation warning gives you no help other than the version the function was removed in. In R, it points users to a really deep conversation on the semantics of time in Spark SQL. None of the messages tell you how you should correctly be parsing a timestamp that is given to you in a format other than UTC. My guess is all users will blindly flip the flag to true (to keep using this function), so you've only succeeded in annoying them.
>> - Cost to Maintain - These are two relatively isolated expressions, so there should be little cost to keeping them. Users can be confused by their semantics, so we probably should update the docs to point them to a best practice (I learned only by complaining on the PR that a good practice is to parse timestamps including the timezone in the format expression, which naturally shifts them to UTC).
>>
>> Decision: Do not deprecate these two functions. We should update the docs to talk about best practices for parsing timestamps, including how to correctly shift them to UTC for storage.
>>
>> [SPARK-28093] Fix TRIM/LTRIM/RTRIM function parameter order issue #24902 <https://github.com/apache/spark/pull/24902>
>>
>> - Cost to Break - The TRIM function takes two string parameters. If we switch the parameter order, queries that use the TRIM function would silently get different results on different versions of Spark. Users may not notice it for a long time, and wrong query results may cause serious problems.
>> - Cost to Maintain - We will have some inconsistency inside Spark, as the TRIM function in the Scala API and in SQL will have different parameter orders.
>>
>> Decision: Do not switch the parameter order. Promote the TRIM(trimStr FROM srcStr) syntax in our SQL docs, as it is the SQL standard. Deprecate (with a warning, not by removing) the two-parameter SQL TRIM function and move users to the SQL-standard TRIM syntax.
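
To make the three examples above concrete, a short Scala sketch (the view name, sample data, and timestamp format string here are illustrative choices, not prescribed by the proposal):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.to_timestamp

    val spark = SparkSession.builder().master("local[2]").appName("sketch").getOrCreate()
    import spark.implicits._

    // registerTempTable is just an alias: both calls register the DataFrame under a
    // name that SQL queries can reference.
    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")
    df.createOrReplaceTempView("events")   // the "correct" API
    // df.registerTempTable("events")      // deprecated alias with the same effect

    // Parsing a timestamp with the zone offset in the format expression shifts it to
    // the session time zone, so from_utc_timestamp is not needed for this case.
    val parsed = Seq("2020-02-24 15:02:00-08:00").toDF("raw")
      .select(to_timestamp($"raw", "yyyy-MM-dd HH:mm:ssXXX").as("ts"))
    parsed.show(truncate = false)

    // The SQL-standard TRIM syntax sidesteps the two-argument parameter-order question.
    spark.sql("SELECT TRIM('x' FROM 'xxSparkxx') AS trimmed").show()
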
>> Thanks for taking the time to read this! Happy to discuss the specifics and amend this policy as the community sees fit.
>>
>> Michael
>>

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau