On Fri, Feb 28, 2020 at 9:48 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
> Hi, Matei and Michael.
>
> I'm also a big supporter of policy-based project management.
>
> Before going further,
>
> 1. Could you estimate how many revert commits are required in `branch-3.0` for the new rubric?
> 2. Are you going to revert all removed test cases for the deprecated ones?

This is a good point; making sure we keep the tests as well is important (worse than removing a deprecated API is shipping it broken).

> 3. Will it cause any delay for the Apache Spark 3.0.0 release?
> (I believe it was previously scheduled for June, before Spark Summit 2020.)

I think it is OK if we need to delay to make a better release, especially given that our current preview releases are available to gather community feedback.

> Although there was a discussion already, I want to make sure about the following tough parts.
>
> 4. We are not going to add a Scala 2.11 API, right?

I hope not.

> 5. We are not going to support Python 2.x in Apache Spark 3.1+, right?

I think doing that would be bad; it has already been end-of-lifed elsewhere.

> 6. Do we have enough resources for testing the deprecated ones?
> (Currently, we have 8 heavy Jenkins jobs for `branch-3.0` already.)
>
> Especially for (2) and (6), we know that keeping deprecated APIs without testing doesn't give us any support for the new rubric.
>
> Bests,
> Dongjoon.
>
> On Thu, Feb 27, 2020 at 5:31 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:
>
>> +1 on this new rubric. It definitely captures the issues I’ve seen in Spark and in other projects. If we write down this rubric (or something like it), it will also be easier to refer to it during code reviews or in proposals of new APIs (we could ask “do you expect to have to change this API in the future, and if so, how”).
>>
>> Matei
>>
>> On Feb 24, 2020, at 3:02 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>
>> Hello Everyone,
>>
>> As more users have started upgrading to the Spark 3.0 preview (including myself), there have been many discussions around APIs that have been broken compared with Spark 2.x. In many of these discussions, one of the rationales for breaking an API seems to be "Spark follows semantic versioning <https://spark.apache.org/versioning-policy.html>, so this major release is our chance to get it right [by breaking APIs]". Similarly, in many cases the response to questions about why an API was completely removed has been, "this API has been deprecated since x.x, so we have to remove it".
>>
>> As a long-time contributor to and user of Spark, I find this interpretation of the policy concerning. This reasoning misses the intention of the original policy, and I am worried that it will hurt the long-term success of the project.
>>
>> I definitely understand that these are hard decisions, and I'm not proposing that we never remove anything from Spark. However, I would like to give some additional context and also propose a different rubric for thinking about API breakage moving forward.
>>
>> Spark adopted semantic versioning back in 2014 during the preparations for the 1.0 release. As this was the first major release -- and as, up until fairly recently, Spark had only been an academic project -- no real promises about API stability had ever been made.
>>
>> During the discussion, some committers suggested that this was an opportunity to clean up cruft and give the Spark APIs a once-over, making cosmetic changes to improve consistency.
>> However, in the end, it was decided that in many cases it was not in the best interests of the Spark community to break things just because we could. Matei actually said it pretty forcefully <http://apache-spark-developers-list.1001551.n3.nabble.com/Proposal-for-Spark-Release-Strategy-td464i20.html#a503>:
>>
>> I know that some names are suboptimal, but I absolutely detest breaking APIs, config names, etc. I’ve seen it happen way too often in other projects (even things we depend on that are officially post-1.0, like Akka or Protobuf or Hadoop), and it’s very painful. I think that we as fairly cutting-edge users are okay with libraries occasionally changing, but many others will consider it a show-stopper. Given this, I think that any cosmetic change now, even though it might improve clarity slightly, is not worth the tradeoff in terms of creating an update barrier for existing users.
>>
>> In the end, while some changes were made, most APIs remained the same, and users of Spark <= 0.9 were able to upgrade to 1.0 fairly easily. I think this served the project very well, as compatibility means users are able to upgrade, and we keep as many people on the latest versions of Spark (though maybe not the latest APIs of Spark) as possible.
>>
>> As Spark grows, I think compatibility actually becomes more important, and we should be more conservative rather than less. Today, there are very likely more Spark programs running than there were at any other time in the past. Spark is no longer a tool used only by advanced hackers; it is now also running "traditional enterprise workloads." In many cases these jobs are powering important processes long after the original author leaves.
>>
>> Broken APIs can also affect libraries that extend Spark. This dependency can be even harder for users: if a library has not been upgraded to use the new APIs and they need that library, they are stuck.
>>
>> Given all of this, I'd like to propose the following rubric as an addition to our semantic versioning policy. After discussion, and if people agree this is a good idea, I'll call a vote of the PMC to ratify its inclusion in the official policy.
>>
>> Considerations When Breaking APIs
>>
>> The Spark project strives to avoid breaking APIs or silently changing behavior, even at major versions. While this is not always possible, the balance of the following factors should be considered before choosing to break an API.
>>
>> Cost of Breaking an API
>>
>> Breaking an API almost always has a non-trivial cost to the users of Spark. A broken API means that Spark programs need to be rewritten before they can be upgraded. However, there are a few considerations when thinking about what the cost will be:
>>
>> - Usage - an API that is actively used in many different places is always very costly to break. While it is hard to know usage for sure, there are a bunch of ways that we can estimate:
>>   - How long has the API been in Spark?
>>   - Is the API common even for basic programs?
>>   - How often do we see recent questions in JIRA or on the mailing lists?
>>   - How often does it appear on StackOverflow or in blogs?
>> - Behavior after the break - How will a program that works today behave after the break? The following are listed roughly in order of increasing severity:
>>   - Will there be a compiler or linker error?
>>   - Will there be a runtime exception?
>>   - Will that exception happen after significant processing has been done?
>>   - Will we silently return different answers? (very hard to debug; users might not even notice!)
>>
>> Cost of Maintaining an API
>>
>> Of course, the above does not mean that we will never break any APIs. We must also consider the cost, both to the project and to our users, of keeping the API in question.
>>
>> - Project Costs - Every API we have needs to be tested and needs to keep working as other parts of the project change. These costs are significantly exacerbated when external dependencies change (the JVM, Scala, etc.). In some cases, while maintaining a particular API may not be completely technically infeasible, its cost can become too high.
>> - User Costs - APIs also have a cognitive cost to users learning Spark or trying to understand Spark programs. This cost becomes even higher when the API in question has confusing or undefined semantics.
>>
>> Alternatives to Breaking an API
>>
>> In cases where there is a "bad API", but where the cost of removal is also high, there are alternatives that should be considered that do not hurt existing users but do address some of the maintenance costs.
>>
>> - Avoid Bad APIs - While this is a bit obvious, it is an important point. Any time we are adding a new interface to Spark we should consider that we might be stuck with this API forever. Think deeply about how new APIs relate to existing ones, as well as how you expect them to evolve over time.
>> - Deprecation Warnings - All deprecation warnings should point to a clear alternative and should never just say that an API is deprecated.
>> - Updated Docs - Documentation should point to the "best" recommended way of performing a given task. In the cases where we maintain legacy documentation, we should clearly point to newer APIs and suggest to users the "right" way.
>> - Community Work - Many people learn Spark by reading blogs and other sites such as StackOverflow. However, many of these resources are out of date. Updating them reduces the cost of eventually removing deprecated APIs.
>>
>> Examples
>>
>> Here are some examples of how I think the policy above could be applied to different issues that have been discussed recently. These are only meant to illustrate how to apply the above rubric and are not intended to be part of the official policy.
>>
>> [SPARK-26362] Remove 'spark.driver.allowMultipleContexts' to disallow multiple creation of SparkContexts #23311 <https://github.com/apache/spark/pull/23311>
>>
>> - Cost to Break - Multiple contexts in a single JVM never worked properly. When users tried it, they would nearly always report that Spark was broken (SPARK-2243 <https://issues.apache.org/jira/browse/SPARK-2243>), due to the confusing set of log messages. Given this, I think it is very unlikely that there are many real-world use cases active today. Even those cases likely suffer from undiagnosed issues, as there are many areas of Spark that assume a single context per JVM.
>> - Cost to Maintain - We have recently had users ask on the mailing list if this was supported, as the conf led them to believe it was, and the existence of this configuration as "supported" makes it harder to reason about certain global state in SparkContext.
>>
>> Decision: Remove this configuration and related code.
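
For concreteness, here is a rough sketch (in Scala, assuming Spark 2.x in local mode; the exact exception message varies by version) of the behavior this configuration controlled:

    import org.apache.spark.{SparkConf, SparkContext}

    // The first context starts normally.
    val sc1 = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("first"))

    // In Spark 2.x, creating a second context in the same JVM fails fast with a
    // SparkException along the lines of "Only one SparkContext may be running in
    // this JVM (see SPARK-2243)".
    // val sc2 = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("second"))

    // The conf only suppressed that check; much of Spark still assumed one context
    // per JVM, so behavior with two live contexts was effectively undefined.
    val permissive = new SparkConf()
      .setMaster("local[2]")
      .setAppName("second")
      .set("spark.driver.allowMultipleContexts", "true")
    // val sc2 = new SparkContext(permissive)  // no longer an escape hatch in Spark 3.0

    sc1.stop()
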
>> [SPARK-25908] Remove registerTempTable #22921 <https://github.com/apache/spark/pull/22921/> (only looking at one API of this PR)
>>
>> - Cost to Break - This is a wildly popular API of Spark SQL that has been there since the first release. There are tons of blog posts and examples that use this syntax if you google "dataframe registerTempTable <https://www.google.com/search?q=dataframe+registertemptable&rlz=1C5CHFA_enUS746US746&oq=dataframe+registertemptable&aqs=chrome.0.0l8.3040j1j7&sourceid=chrome&ie=UTF-8>" (even more than the "correct" API "dataframe createOrReplaceTempView <https://www.google.com/search?rlz=1C5CHFA_enUS746US746&ei=TkZMXrj1ObzA0PEPpLKR2A4&q=dataframe+createorreplacetempview&oq=dataframe+createor&gs_l=psy-ab.3.0.0j0i22i30l7.663.1303..2750...0.3..1.212.782.7j0j1......0....1..gws-wiz.......0i71j0i131.zP34wH1novM>"). All of these will be invalid for users of Spark 3.0.
>> - Cost to Maintain - This is just an alias, so there is not a lot of extra machinery required to keep the API. Users have two ways to do the same thing, but we can note in the docs that this is just an alias.
>>
>> Decision: Do not remove this API; I would even consider un-deprecating it. I anecdotally asked several users, and this is the API they prefer over the "correct" one.
>>
>> [SPARK-25496] Deprecate from_utc_timestamp and to_utc_timestamp #24195 <https://github.com/apache/spark/pull/24195>
>>
>> - Cost to Break - I think that this case actually exemplifies several anti-patterns in breaking APIs. In some languages, the deprecation warning gives you no help other than the version the function was removed in. In R, it points users to a really deep conversation on the semantics of time in Spark SQL. None of the messages tell you how you should correctly be parsing a timestamp that is given to you in a format other than UTC. My guess is all users will blindly flip the flag to true (to keep using this function), so you've only succeeded in annoying them.
>> - Cost to Maintain - These are two relatively isolated expressions, so there should be little cost to keeping them. Users can be confused by their semantics, so we probably should update the docs to point them to a best practice (I learned only by complaining on the PR that a good practice is to parse timestamps including the timezone in the format expression, which naturally shifts them to UTC).
>>
>> Decision: Do not deprecate these two functions. We should update the docs to talk about best practices for parsing timestamps, including how to correctly shift them to UTC for storage.
>>
>> [SPARK-28093] Fix TRIM/LTRIM/RTRIM function parameter order issue #24902 <https://github.com/apache/spark/pull/24902>
>>
>> - Cost to Break - The TRIM function takes two string parameters. If we switch the parameter order, queries that use the TRIM function would silently get different results on different versions of Spark. Users may not notice it for a long time, and wrong query results may cause serious problems.
>> - Cost to Maintain - We will have some inconsistency inside Spark, as the TRIM function in the Scala API and in SQL will have different parameter orders.
>>
>> Decision: Do not switch the parameter order. Promote the TRIM(trimStr FROM srcStr) syntax in our SQL docs, as it is the SQL standard. Deprecate (with a warning, not by removing) the two-parameter SQL TRIM function and move users to the SQL-standard TRIM syntax.
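
To make the three examples above concrete, a short Scala sketch (the view name, sample data, and timestamp format string here are illustrative choices, not prescribed by the proposal):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.to_timestamp

    val spark = SparkSession.builder().master("local[2]").appName("sketch").getOrCreate()
    import spark.implicits._

    // registerTempTable is just an alias: both calls register the DataFrame under a
    // name that SQL queries can reference.
    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")
    df.createOrReplaceTempView("events")   // the "correct" API
    // df.registerTempTable("events")      // deprecated alias with the same effect

    // Parsing a timestamp with the zone offset in the format expression shifts it to
    // the session time zone, so from_utc_timestamp is not needed for this case.
    val parsed = Seq("2020-02-24 15:02:00-08:00").toDF("raw")
      .select(to_timestamp($"raw", "yyyy-MM-dd HH:mm:ssXXX").as("ts"))
    parsed.show(truncate = false)

    // The SQL-standard TRIM syntax sidesteps the two-argument parameter-order question.
    spark.sql("SELECT TRIM('x' FROM 'xxSparkxx') AS trimmed").show()
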
>> Thanks for taking the time to read this! Happy to discuss the specifics and amend this policy as the community sees fit.
>>
>> Michael
>>

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau