Those are all quite reasonable guidelines, and I'd put them into the contributing or developer guide, sure. Although it isn't argued here, I think we should go further than codifying and enforcing common-sense guidelines like these. I think the bias should shift in favor of retaining APIs going forward, and even retroactively shift somewhat for 3.0. (Hence some reverts currently in progress.) It's a natural evolution from 1.x to 2.x to 3.x: the API surface area stops expanding, changing, and needing fixes as much, and years more experience prove out which APIs make sense.

On Mon, Feb 24, 2020 at 5:03 PM Michael Armbrust <mich...@databricks.com> wrote:

> Hello Everyone,
>
> As more users have started upgrading to Spark 3.0 preview (including myself), there have been many discussions around APIs that have been broken compared with Spark 2.x. In many of these discussions, one of the rationales for breaking an API seems to be "Spark follows semantic versioning, so this major release is our chance to get it right [by breaking APIs]". Similarly, in many cases the response to questions about why an API was completely removed has been, "this API has been deprecated since x.x, so we have to remove it".
>
> As a long-time contributor to and user of Spark, this interpretation of the policy is concerning to me. This reasoning misses the intention of the original policy, and I am worried that it will hurt the long-term success of the project.
>
> I definitely understand that these are hard decisions, and I'm not proposing that we never remove anything from Spark. However, I would like to give some additional context and also propose a different rubric for thinking about API breakage moving forward.
>
> Spark adopted semantic versioning back in 2014 during the preparations for the 1.0 release. As this was the first major release -- and as, up until fairly recently, Spark had only been an academic project -- no real promises had ever been made about API stability.
>
> During the discussion, some committers suggested that this was an opportunity to clean up cruft and give the Spark APIs a once-over, making cosmetic changes to improve consistency. However, in the end, it was decided that in many cases it was not in the best interests of the Spark community to break things just because we could. Matei actually said it pretty forcefully:
>
> I know that some names are suboptimal, but I absolutely detest breaking APIs, config names, etc. I’ve seen it happen way too often in other projects (even things we depend on that are officially post-1.0, like Akka or Protobuf or Hadoop), and it’s very painful. I think that we as fairly cutting-edge users are okay with libraries occasionally changing, but many others will consider it a show-stopper. Given this, I think that any cosmetic change now, even though it might improve clarity slightly, is not worth the tradeoff in terms of creating an update barrier for existing users.
>
> In the end, while some changes were made, most APIs remained the same and users of Spark <= 0.9 were pretty easily able to upgrade to 1.0. I think this served the project very well, as compatibility means users are able to upgrade, and we keep as many people on the latest versions of Spark (though maybe not the latest APIs of Spark) as possible.
>
> As Spark grows, I think compatibility actually becomes more important and we should be more conservative rather than less. Today, there are very likely more Spark programs running than there were at any other time in the past. Spark is no longer a tool only used by advanced hackers; it is now also running "traditional enterprise workloads." In many cases these jobs are powering important processes long after the original author leaves.
>
> Broken APIs can also affect libraries that extend Spark. This dependency can be even harder for users: if the library has not been upgraded to use the new APIs and they need that library, they are stuck.
>
> Given all of this, I'd like to propose the following rubric as an addition to our semantic versioning policy. After discussion, and if people agree this is a good idea, I'll call a vote of the PMC to ratify its inclusion in the official policy.
>
> Considerations When Breaking APIs
>
> The Spark project strives to avoid breaking APIs or silently changing behavior, even at major versions. While this is not always possible, the balance of the following factors should be considered before choosing to break an API.
>
> Cost of Breaking an API
>
> Breaking an API almost always has a non-trivial cost to the users of Spark. A broken API means that Spark programs need to be rewritten before they can be upgraded. However, there are a few considerations when thinking about what the cost will be:
>
> Usage - an API that is actively used in many different places is always very costly to break. While it is hard to know usage for sure, there are a bunch of ways that we can estimate:
> - How long has the API been in Spark?
> - Is the API common even for basic programs?
> - How often do we see recent questions in JIRA or mailing lists?
> - How often does it appear in StackOverflow or blogs?
>
> Behavior after the break - How will a program that works today work after the break? The following are listed roughly in order of increasing severity:
> - Will there be a compiler or linker error?
> - Will there be a runtime exception?
> - Will that exception happen after significant processing has been done?
> - Will we silently return different answers? (very hard to debug, might not even notice!)
>
> Cost of Maintaining an API
>
> Of course, the above does not mean that we will never break any APIs. We must also consider the cost, both to the project and to our users, of keeping the API in question.
>
> Project Costs - Every API we have needs to be tested and needs to keep working as other parts of the project change. These costs are significantly exacerbated when external dependencies change (the JVM, Scala, etc.). In some cases, while not completely technically infeasible, the cost of maintaining a particular API can become too high.
>
> User Costs - APIs also have a cognitive cost to users learning Spark or trying to understand Spark programs. This cost becomes even higher when the API in question has confusing or undefined semantics.
>
> Alternatives to Breaking an API
>
> In cases where there is a "Bad API", but where the cost of removal is also high, there are alternatives that should be considered that do not hurt existing users but do address some of the maintenance costs.
>
> Avoid Bad APIs - While this is a bit obvious, it is an important point. Any time we are adding a new interface to Spark we should consider that we might be stuck with this API forever. Think deeply about how new APIs relate to existing ones, as well as how you expect them to evolve over time.
>
> Deprecation Warnings - All deprecation warnings should point to a clear alternative and should never just say that an API is deprecated.
>
> Updated Docs - Documentation should point to the "best" recommended way of performing a given task. In the cases where we maintain legacy documentation, we should clearly point to newer APIs and suggest to users the "right" way.
>
> Community Work - Many people learn Spark by reading blogs and other sites such as StackOverflow. However, many of these resources are out of date. Update them, to reduce the cost of eventually removing deprecated APIs.
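
The deprecation-warnings point (and the registerTempTable example further down) can be made concrete with a small Scala sketch. This is illustrative only, not Spark's actual source; the class name and version string are made up:

    class TableOps {
      // The replacement API that the docs should point to.
      def createOrReplaceTempView(viewName: String): Unit = { /* ... */ }

      // The old name stays as a thin alias, and the deprecation message names
      // the alternative instead of just saying "this is deprecated".
      @deprecated("Use createOrReplaceTempView(viewName) instead.", "2.0.0")
      def registerTempTable(tableName: String): Unit =
        createOrReplaceTempView(tableName)
    }

Keeping an alias like this costs almost nothing to maintain, while the message gives users a copy-pasteable migration path.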

> Examples
>
> Here are some examples of how I think the policy above could be applied to different issues that have been discussed recently. These are only to illustrate how to apply the above rubric, but are not intended to be part of the official policy.
>
> [SPARK-26362] Remove 'spark.driver.allowMultipleContexts' to disallow multiple creation of SparkContexts #23311
>
> Cost to Break - Multiple contexts in a single JVM never worked properly. When users tried it, they would nearly always report that Spark was broken (SPARK-2243), due to the confusing set of log messages. Given this, I think it is very unlikely that there are many real-world use cases active today. Even those cases likely suffer from undiagnosed issues, as there are many areas of Spark that assume a single context per JVM.
>
> Cost to Maintain - We have recently had users ask on the mailing list if this was supported, as the conf led them to believe it was, and the existence of this configuration as "supported" makes it harder to reason about certain global state in SparkContext.
>
> Decision: Remove this configuration and related code.
>
> [SPARK-25908] Remove registerTempTable #22921 (only looking at one API of this PR)
>
> Cost to Break - This is a wildly popular API of Spark SQL that has been there since the first release. There are tons of blog posts and examples that use this syntax if you google "dataframe registerTempTable" (even more than the "correct" API, "dataframe createOrReplaceTempView"). All of these will be invalid for users of Spark 3.0.
>
> Cost to Maintain - This is just an alias, so there is not a lot of extra machinery required to keep the API. Users have two ways to do the same thing, but we can note in the docs that this is just an alias.
>
> Decision: Do not remove this API; I would even consider un-deprecating it. I anecdotally asked several users, and this is the API they prefer over the "correct" one.
>
> [SPARK-25496] Deprecate from_utc_timestamp and to_utc_timestamp #24195
>
> Cost to Break - I think that this case actually exemplifies several anti-patterns in breaking APIs. In some languages, the deprecation warning gives you no help other than what version the function was removed in. In R, it points users to a really deep conversation on the semantics of time in Spark SQL. None of the messages tell you how you should correctly be parsing a timestamp that is given to you in a format other than UTC. My guess is all users will blindly flip the flag to true (to keep using this function), so you've only succeeded in annoying them.
>
> Cost to Maintain - These are two relatively isolated expressions; there should be little cost to keeping them. Users can be confused by their semantics, so we probably should update the docs to point them to a best practice (I learned, only by complaining on the PR, that a good practice is to parse timestamps including the timezone in the format expression, which naturally shifts them to UTC).
>
> Decision: Do not deprecate these two functions. We should update the docs to talk about best practices for parsing timestamps, including how to correctly shift them to UTC for storage.
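
On the timestamp best practice mentioned just above, a minimal sketch, assuming a DataFrame named df with a hypothetical string column ts_string that carries an explicit offset (e.g. "2020-02-24 17:03:00+05:30"):

    import org.apache.spark.sql.functions.{col, to_timestamp}

    // Putting the zone offset (XXX) in the format expression makes the parse
    // produce an instant; Spark stores timestamps internally relative to UTC
    // and renders them in the session time zone, so no separate
    // to_utc_timestamp call is needed.
    val parsed = df.withColumn(
      "ts", to_timestamp(col("ts_string"), "yyyy-MM-dd HH:mm:ssXXX"))

Something along these lines in the docs would probably answer the question the deprecation message currently doesn't.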

> [SPARK-28093] Fix TRIM/LTRIM/RTRIM function parameter order issue #24902
>
> Cost to Break - The TRIM function takes two string parameters. If we switch the parameter order, queries that use the TRIM function would silently get different results on different versions of Spark. Users may not notice it for a long time, and wrong query results may cause serious problems for users.
>
> Cost to Maintain - We will have some inconsistency inside Spark, as the TRIM function in the Scala API and in SQL have different parameter orders.
>
> Decision: Do not switch the parameter order. Promote the TRIM(trimStr FROM srcStr) syntax in our SQL docs, as it's the SQL standard. Deprecate (with a warning, not by removing) the SQL TRIM function and move users to the SQL-standard TRIM syntax.
>
> Thanks for taking the time to read this! Happy to discuss the specifics and amend this policy as the community sees fit.
>
> Michael
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
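
PS on the TRIM item: for anyone who hasn't seen it, a quick Scala sketch (hypothetical snippet, not taken from the docs) of the SQL-standard form the docs would promote:

    import org.apache.spark.sql.SparkSession

    // Hypothetical local session, just for illustration.
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // SQL-standard syntax TRIM([BOTH | LEADING | TRAILING] trimStr FROM srcStr);
    // with the default BOTH this strips 'x' from both ends and returns "Spark".
    spark.sql("SELECT trim('x' FROM 'xxSparkxx') AS trimmed").show()

Documenting that form sidesteps the two-argument ordering question entirely.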