+1 

Well said! 


> On Feb 24, 2020, at 3:03 PM, Michael Armbrust <mich...@databricks.com> wrote:
> 
> 
> Hello Everyone,
> 
> As more users have started upgrading to Spark 3.0 preview (including myself), 
> there have been many discussions around APIs that have been broken compared 
> with Spark 2.x. In many of these discussions, one of the rationales for 
> breaking an API seems to be "Spark follows semantic versioning, so this major 
> release is our chance to get it right [by breaking APIs]". Similarly, in many 
> cases the response to questions about why an API was completely removed has 
> been, "this API has been deprecated since x.x, so we have to remove it".
> 
> As a long-time contributor to and user of Spark, this interpretation of the 
> policy concerns me. This reasoning misses the intention of the original 
> policy, and I am worried that it will hurt the long-term success of the 
> project.
> 
> I definitely understand that these are hard decisions, and I'm not proposing 
> that we never remove anything from Spark. However, I would like to give some 
> additional context and also propose a different rubric for thinking about API 
> breakage moving forward.
> 
> Spark adopted semantic versioning back in 2014 during the preparations for 
> the 1.0 release. As this was the first major release -- and as, up until 
> fairly recently, Spark had only been an academic project -- no real promises 
> about API stability had ever been made.
> 
> During the discussion, some committers suggested that this was an opportunity 
> to clean up cruft and give the Spark APIs a once-over, making cosmetic 
> changes to improve consistency. However, in the end, it was decided that in 
> many cases it was not in the best interests of the Spark community to break 
> things just because we could. Matei actually said it pretty forcefully:
> 
> I know that some names are suboptimal, but I absolutely detest breaking APIs, 
> config names, etc. I’ve seen it happen way too often in other projects (even 
> things we depend on that are officially post-1.0, like Akka or Protobuf or 
> Hadoop), and it’s very painful. I think that we as fairly cutting-edge users 
> are okay with libraries occasionally changing, but many others will consider 
> it a show-stopper. Given this, I think that any cosmetic change now, even 
> though it might improve clarity slightly, is not worth the tradeoff in terms 
> of creating an update barrier for existing users.
> 
> In the end, while some changes were made, most APIs remained the same and 
> users of Spark <= 0.9 were pretty easily able to upgrade to 1.0. I think this 
> served the project very well, as compatibility means users are able to 
> upgrade and we keep as many people on the latest versions of Spark (though 
> maybe not the latest APIs of Spark) as possible.
> 
> As Spark grows, I think compatibility actually becomes more important and we 
> should be more conservative rather than less. Today, there are very likely 
> more Spark programs running than at any other time in the past. Spark is no 
> longer a tool used only by advanced hackers; it is now also running 
> "traditional enterprise workloads." In many cases these jobs power important 
> processes long after the original author has left.
> 
> Broken APIs can also affect libraries that extend Spark. This can be even 
> harder for users: if a library they depend on has not been upgraded to use 
> the new APIs, they are stuck.
> 
> Given all of this, I'd like to propose the following rubric as an addition to 
> our semantic versioning policy. After discussion and if people agree this is 
> a good idea, I'll call a vote of the PMC to ratify its inclusion in the 
> official policy.
> 
> Considerations When Breaking APIs
> The Spark project strives to avoid breaking APIs or silently changing 
> behavior, even at major versions. While this is not always possible, the 
> balance of the following factors should be considered before choosing to 
> break an API.
> 
> Cost of Breaking an API
> Breaking an API almost always has a non-trivial cost to the users of Spark. A 
> broken API means that Spark programs need to be rewritten before they can be 
> upgraded. However, there are a few considerations when thinking about what 
> the cost will be:
> - Usage - an API that is actively used in many different places is always 
>   very costly to break. While it is hard to know usage for sure, there are a 
>   bunch of ways that we can estimate it:
>   - How long has the API been in Spark?
>   - Is the API common even for basic programs?
>   - How often do we see recent questions in JIRA or mailing lists?
>   - How often does it appear in StackOverflow or blogs?
> - Behavior after the break - How will a program that works today behave after 
>   the break? The following are listed roughly in order of increasing severity:
>   - Will there be a compiler or linker error?
>   - Will there be a runtime exception?
>   - Will that exception happen after significant processing has been done?
>   - Will we silently return different answers? (very hard to debug; users 
>     might not even notice!)
> 
> Cost of Maintaining an API
> Of course, the above does not mean that we will never break any APIs. We must 
> also consider the cost both to the project and to our users of keeping the 
> API in question.
> - Project Costs - Every API we have needs to be tested and needs to keep 
>   working as other parts of the project change. These costs are significantly 
>   exacerbated when external dependencies change (the JVM, Scala, etc.). In 
>   some cases, while not completely technically infeasible, the cost of 
>   maintaining a particular API can become too high.
> - User Costs - APIs also have a cognitive cost to users learning Spark or 
>   trying to understand Spark programs. This cost becomes even higher when the 
>   API in question has confusing or undefined semantics.
> 
> Alternatives to Breaking an API
> In cases where there is a "Bad API", but where the cost of removal is also 
> high, there are alternatives that should be considered that do not hurt 
> existing users but do address some of the maintenance costs.
> 
> - Avoid Bad APIs - While this is a bit obvious, it is an important point. 
>   Anytime we are adding a new interface to Spark we should consider that we 
>   might be stuck with this API forever. Think deeply about how new APIs 
>   relate to existing ones, as well as how you expect them to evolve over 
>   time.
> - Deprecation Warnings - All deprecation warnings should point to a clear 
>   alternative and should never just say that an API is deprecated (see the 
>   short sketch after this list).
> - Updated Docs - Documentation should point to the "best" recommended way of 
>   performing a given task. In the cases where we maintain legacy 
>   documentation, we should clearly point to newer APIs and suggest to users 
>   the "right" way.
> - Community Work - Many people learn Spark by reading blogs and other sites 
>   such as StackOverflow. However, many of these resources are out of date. 
>   Updating them reduces the cost of eventually removing deprecated APIs.
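> 
> For illustration, here is a minimal Scala sketch of such a warning. The 
> method names are hypothetical (not real Spark APIs); the point is that the 
> message names the replacement and the release in which the deprecation 
> happened:
> 
>     // Hypothetical API names, for illustration only.
>     def newSaveTable(name: String): Unit = println(s"saving $name")
> 
>     // The message tells users exactly what to call instead, rather than
>     // just flagging the method as deprecated.
>     @deprecated("Use newSaveTable(name) instead; it has identical behavior.", "3.0.0")
>     def oldSaveTable(name: String): Unit = newSaveTable(name)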
> 
> Examples
> 
> Here are some examples of how I think the policy above could be applied to 
> different issues that have been discussed recently. These are only to 
> illustrate how to apply the above rubric, but are not intended to be part of 
> the official policy.
> 
> [SPARK-26362] Remove 'spark.driver.allowMultipleContexts' to disallow 
> multiple creation of SparkContexts #23311
> 
> Cost to Break - Multiple Contexts in a single JVM never worked properly. When 
> users tried it they would nearly always report that Spark was broken 
> (SPARK-2243), due to the confusing set of log messages. Given this, I think 
> it is very unlikely that there are many real-world use cases active today. 
> Even those cases likely suffer from undiagnosed issues, as there are many 
> areas of Spark that assume a single context per JVM.
> Cost to Maintain - We have recently had users ask on the mailing list if this 
> was supported, as the conf led them to believe it was, and the existence of 
> this configuration as "supported" makes it harder to reason about certain 
> global state in SparkContext.
> 
> Decision: Remove this configuration and related code.
> 
> [SPARK-25908] Remove registerTempTable #22921 (only looking at one API of 
> this PR)
> 
> Cost to Break - This is a wildly popular API of Spark SQL that has been there 
> since the first release. There are tons of blog posts and examples that use 
> this syntax if you google "dataframe registerTempTable" (even more than the 
> "correct" API "dataframe createOrReplaceTempView"). All of these will be 
> invalid for users of Spark 3.0.
> Cost to Maintain - This is just an alias, so there is not a lot of extra 
> machinery required to keep the API. Users have two ways to do the same thing, 
> but we can note that this is just an alias in the docs.
> 
> Decision: Do not remove this API; I would even consider un-deprecating it. I 
> anecdotally asked several users, and this is the API they prefer over the 
> "correct" one.
> [SPARK-25496] Deprecate from_utc_timestamp and to_utc_timestamp #24195
> 
> Cost to Break - I think that this case actually exemplifies several 
> anti-patterns in breaking APIs. In some languages, the deprecation warning 
> gives you no help other than what version the function was deprecated in. In 
> R, it points users to a really deep conversation on the semantics of time in 
> Spark SQL. None of the messages tell you how you should correctly be parsing 
> a timestamp that is given to you in a format other than UTC. My guess is that 
> all users will blindly flip the flag to true (to keep using these functions), 
> so you've only succeeded in annoying them.
> Cost to Maintain - These are two relatively isolated expressions; there 
> should be little cost to keeping them. Users can be confused by their 
> semantics, so we probably should update the docs to point them to a best 
> practice (I learned, only by complaining on the PR, that a good practice is 
> to parse timestamps including the timezone in the format expression, which 
> naturally shifts them to UTC).
> 
> Decision: Do not deprecate these two functions. We should update the docs to 
> talk about best practices for parsing timestamps, including how to correctly 
> shift them to UTC for storage.
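> 
> As an illustration of that best practice, here is a minimal sketch (assuming 
> a spark-shell session with spark.implicits._ in scope; the column name and 
> sample value are made up):
> 
>     import org.apache.spark.sql.functions.to_timestamp
> 
>     val events = Seq("2020-02-24 15:03:00-08:00").toDF("ts_string")
> 
>     // Putting the zone offset (XXX) in the pattern lets Spark apply the
>     // offset while parsing, so the stored timestamp already refers to the
>     // correct instant and no separate to_utc_timestamp call is needed.
>     events
>       .select(to_timestamp($"ts_string", "yyyy-MM-dd HH:mm:ssXXX").as("ts_utc"))
>       .show()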
> 
> [SPARK-28093] Fix TRIM/LTRIM/RTRIM function parameter order issue #24902
> 
> Cost to Break - The TRIM function takes two string parameters. If we switch 
> the parameter order, queries that use the TRIM function would silently get 
> different results on different versions of Spark. Users may not notice it for 
> a long time, and wrong query results may cause serious problems.
> Cost to Maintain - We will have some inconsistency inside Spark, as the TRIM 
> function in the Scala API and in SQL will have different parameter orders.
> 
> Decision: Do not switch the parameter order. Promote the TRIM(trimStr FROM 
> srcStr) syntax in our SQL docs, as it's the SQL standard. Deprecate (with a 
> warning, not by removing) the two-parameter SQL TRIM function and move users 
> to the SQL-standard TRIM syntax.
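> 
> For reference, a minimal sketch of the SQL-standard syntax that decision 
> recommends (assuming an existing SparkSession named spark; the sample string 
> is made up):
> 
>     // TRIM([BOTH | LEADING | TRAILING] trimStr FROM srcStr) is unambiguous
>     // about which argument is the trim characters and which is the source.
>     spark.sql("SELECT TRIM(BOTH 'x' FROM 'xxSpark SQLxx') AS trimmed").show()
>     // Strips leading and trailing 'x' characters, yielding 'Spark SQL'.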
> 
> Thanks for taking the time to read this! Happy to discuss the specifics and 
> amend this policy as the community sees fit.
> 
> Michael
> 
