In general +1. I think these are good guidelines, and making it easier to
upgrade is beneficial to everyone. The decision needs to happen at API/config
change time; otherwise the deprecation warning has no purpose if we are never
going to remove the deprecated things. That said, we still need to be able to
remove deprecated things and change APIs in major releases, otherwise why do a
major release in the first place? Is it purely to support newer
Scala/Python/Java versions?

I think the hardest part listed here is judging what the impact is. Whose call
is that? It's hard to know how everyone is using things, and I think it has
become harder to get feedback on SPIPs and API changes in general as people are
busy with other things. Like you mention, I think StackOverflow is unreliable;
the posts could be many years old and no longer relevant.

Tom

On Monday, February 24, 2020, 05:03:44 PM CST, Michael Armbrust
<mich...@databricks.com> wrote:
 
 
Hello Everyone,


As more users have started upgrading to Spark 3.0 preview (including myself), 
there have been many discussions around APIs that have been broken compared 
with Spark 2.x. In many of these discussions, one of the rationales for 
breaking an API seems to be "Spark follows semantic versioning, so this major 
release is our chance to get it right [by breaking APIs]". Similarly, in many 
cases the response to questions about why an API was completely removed has 
been, "this API has been deprecated since x.x, so we have to remove it".


As a long-time contributor to and user of Spark, I find this interpretation of
the policy concerning. This reasoning misses the intention of the original
policy, and I am worried that it will hurt the long-term success of the project.


I definitely understand that these are hard decisions, and I'm not proposing 
that we never remove anything from Spark. However, I would like to give some 
additional context and also propose a different rubric for thinking about API 
breakage moving forward.


Spark adopted semantic versioning back in 2014 during the preparations for the
1.0 release. As this was the first major release -- and as, up until fairly
recently, Spark had only been an academic project -- no real promises about API
stability had ever been made.


During the discussion, some committers suggested that this was an opportunity 
to clean up cruft and give the Spark APIs a once-over, making cosmetic changes 
to improve consistency. However, in the end, it was decided that in many cases 
it was not in the best interests of the Spark community to break things just 
because we could. Matei actually said it pretty forcefully:


I know that some names are suboptimal, but I absolutely detest breaking APIs, 
config names, etc. I’ve seen it happen way too often in other projects (even 
things we depend on that are officially post-1.0, like Akka or Protobuf or 
Hadoop), and it’s very painful. I think that we as fairly cutting-edge users 
are okay with libraries occasionally changing, but many others will consider it 
a show-stopper. Given this, I think that any cosmetic change now, even though 
it might improve clarity slightly, is not worth the tradeoff in terms of 
creating an update barrier for existing users.


In the end, while some changes were made, most APIs remained the same and users 
of Spark <= 0.9 were pretty easily able to upgrade to 1.0. I think this served 
the project very well, as compatibility means users are able to upgrade and we 
keep as many people on the latest versions of Spark (though maybe not the 
latest APIs of Spark) as possible.


As Spark grows, I think compatibility actually becomes more important and we
should be more conservative rather than less. Today, there are very likely more
Spark programs running than at any other time in the past. Spark is no longer a
tool used only by advanced hackers; it is now also running "traditional
enterprise workloads." In many cases these jobs power important processes long
after the original author has left.


Broken APIs can also affect libraries that extend Spark. This can be even
harder on users: if a library they depend on has not been upgraded to the new
APIs, they are stuck.


Given all of this, I'd like to propose the following rubric as an addition to 
our semantic versioning policy. After discussion and if people agree this is a 
good idea, I'll call a vote of the PMC to ratify its inclusion in the official 
policy.


Considerations When Breaking APIs

The Spark project strives to avoid breaking APIs or silently changing behavior, 
even at major versions. While this is not always possible, the balance of the 
following factors should be considered before choosing to break an API.


Cost of Breaking an API

Breaking an API almost always has a non-trivial cost to the users of Spark. A 
broken API means that Spark programs need to be rewritten before they can be 
upgraded. However, there are a few considerations when thinking about what the 
cost will be:

- Usage - an API that is actively used in many different places is always very
  costly to break. While it is hard to know usage for sure, there are a number
  of ways that we can estimate:
  - How long has the API been in Spark?
  - Is the API common even for basic programs?
  - How often do we see recent questions in JIRA or on the mailing lists?
  - How often does it appear in StackOverflow or blogs?

- Behavior after the break - how will a program that works today behave after
  the break? The following are listed roughly in order of increasing severity:
  - Will there be a compiler or linker error?
  - Will there be a runtime exception?
  - Will that exception happen after significant processing has been done?
  - Will we silently return different answers? (very hard to debug; users might
    not even notice!)



Cost of Maintaining an API

Of course, the above does not mean that we will never break any APIs. We must 
also consider the cost both to the project and to our users of keeping the API 
in question.

- Project Costs - Every API we have needs to be tested and needs to keep
  working as other parts of the project change. These costs are significantly
  exacerbated when external dependencies change (the JVM, Scala, etc.). In some
  cases, while keeping a particular API is not technically infeasible, the cost
  of maintaining it can simply become too high.

- User Costs - APIs also have a cognitive cost to users learning Spark or
  trying to understand Spark programs. This cost becomes even higher when the
  API in question has confusing or undefined semantics.



Alternatives to Breaking an API

In cases where there is a "Bad API", but where the cost of removal is also 
high, there are alternatives that should be considered that do not hurt 
existing users but do address some of the maintenance costs.

- Avoid Bad APIs - While this is a bit obvious, it is an important point.
  Anytime we are adding a new interface to Spark, we should consider that we
  might be stuck with this API forever. Think deeply about how new APIs relate
  to existing ones, as well as how you expect them to evolve over time.

- Deprecation Warnings - All deprecation warnings should point to a clear
  alternative and should never just say that an API is deprecated (see the
  sketch after this list).

- Updated Docs - Documentation should point to the "best" recommended way of
  performing a given task. In the cases where we maintain legacy documentation,
  we should clearly point to newer APIs and suggest to users the "right" way.

- Community Work - Many people learn Spark by reading blogs and other sites
  such as StackOverflow. However, many of these resources are out of date.
  Updating them reduces the cost of eventually removing deprecated APIs.
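
To make the deprecation-warning point concrete, here is a small Scala sketch
(hypothetical, not actual Spark source; the class and method names are only for
illustration) of a message that points to a drop-in replacement instead of just
announcing the deprecation:

    // Hypothetical sketch of a deprecation that names a drop-in replacement.
    class ExampleDataset {
      def createOrReplaceTempView(viewName: String): Unit = { /* ... */ }

      @deprecated(
        message = "Use createOrReplaceTempView(viewName) instead; behavior is identical.",
        since = "2.0.0")
      def registerTempTable(tableName: String): Unit =
        createOrReplaceTempView(tableName)
    }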



Examples


Here are some examples of how I think the policy above could be applied to 
different issues that have been discussed recently. These are only to 
illustrate how to apply the above rubric, but are not intended to be part of 
the official policy.


[SPARK-26362] Remove 'spark.driver.allowMultipleContexts' to disallow multiple 
creation of SparkContexts #23311

- Cost to Break - Multiple contexts in a single JVM never worked properly. When
  users tried it, they would nearly always report that Spark was broken
  (SPARK-2243), due to the confusing set of log messages. Given this, I think
  it is very unlikely that there are many real-world use cases active today.
  Even those cases likely suffer from undiagnosed issues, as there are many
  areas of Spark that assume a single context per JVM.

- Cost to Maintain - We have recently had users ask on the mailing list whether
  this was supported, as the conf led them to believe it was, and the existence
  of this configuration as "supported" makes it harder to reason about certain
  global state in SparkContext.



Decision: Remove this configuration and related code.
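
For anyone who has not run into this conf, here is roughly what the pre-3.0
escape hatch looked like from the user side (an illustrative Scala sketch only;
the exact logging and failure behavior varied by version):

    import org.apache.spark.{SparkConf, SparkContext}

    // Pre-3.0 escape hatch: suppress the "only one SparkContext per JVM" check.
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("first-context")
      .set("spark.driver.allowMultipleContexts", "true")

    val sc1 = new SparkContext(conf)

    // A second context in the same JVM was never actually supported; much of
    // Spark assumes a single context, so this tended to fail in confusing ways.
    val sc2 = new SparkContext(conf.clone().setAppName("second-context"))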


[SPARK-25908] Remove registerTempTable #22921 (only looking at one API of this 
PR)

- Cost to Break - This is a wildly popular API of Spark SQL that has been there
  since the first release. There are tons of blog posts and examples that use
  this syntax if you google "dataframe registerTempTable" (even more than the
  "correct" API, "dataframe createOrReplaceTempView"). All of these will be
  invalid for users of Spark 3.0.

- Cost to Maintain - This is just an alias, so there is not a lot of extra
  machinery required to keep the API. Users have two ways to do the same thing,
  but we can note in the docs that this is just an alias.



Decision: Do not remove this API; I would even consider un-deprecating it. I
anecdotally asked several users, and this is the API they prefer over the
"correct" one.

[SPARK-25496] Deprecate from_utc_timestamp and to_utc_timestamp #24195

- Cost to Break - I think that this case actually exemplifies several
  anti-patterns in breaking APIs. In some languages, the deprecation warning
  gives you no help other than what version the function was removed in. In R,
  it points users to a really deep conversation on the semantics of time in
  Spark SQL. None of the messages tell you how you should correctly be parsing
  a timestamp that is given to you in a format other than UTC. My guess is that
  all users will blindly flip the flag to true (to keep using this function),
  so you've only succeeded in annoying them.

- Cost to Maintain - These are two relatively isolated expressions; there
  should be little cost to keeping them. Users can be confused by their
  semantics, so we probably should update the docs to point them to a best
  practice (I learned only by complaining on the PR that a good practice is to
  parse timestamps with the timezone included in the format expression, which
  naturally shifts them to UTC).



Decision: Do not deprecate these two functions. We should update the docs to 
talk about best practices for parsing timestamps, including how to correctly 
shift them to UTC for storage.
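
To make that best practice concrete, here is a hedged Scala sketch. It assumes
a DataFrame df with a string column event_time that carries its own offset
(e.g. "2020-02-24 17:03:44 -06:00"); the exact pattern letters differ slightly
between the Spark 2.x and 3.0 parsers:

    import org.apache.spark.sql.functions.{col, to_timestamp}

    // Including the zone offset in the pattern lets Spark interpret the instant
    // correctly (timestamps are stored internally relative to UTC), rather than
    // assuming the session time zone and then reaching for from_utc_timestamp /
    // to_utc_timestamp to patch things up afterwards.
    val parsed = df.withColumn(
      "event_ts",
      to_timestamp(col("event_time"), "yyyy-MM-dd HH:mm:ss XXX"))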


[SPARK-28093] Fix TRIM/LTRIM/RTRIM function parameter order issue #24902

- Cost to Break - The TRIM function takes two string parameters. If we switch
  the parameter order, queries that use the TRIM function would silently get
  different results on different versions of Spark. Users may not notice it for
  a long time, and wrong query results may cause serious problems for users.

- Cost to Maintain - We will have some inconsistency inside Spark, as the TRIM
  function in the Scala API and in SQL will have different parameter orders.



Decision: Do not switch the parameter order. Promote the TRIM(trimStr FROM
srcStr) syntax in our SQL docs, as it's the SQL standard. Deprecate (with a
warning, not by removing) the two-argument SQL TRIM function and move users to
the SQL-standard TRIM syntax.
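
For contrast, here is a small sketch of the SQL-standard syntax we would
promote (assuming a SparkSession named spark on a version that supports it):

    // SQL-standard syntax: unambiguous about which argument is the trim string,
    // so promoting it in the docs sidesteps the parameter-order question entirely.
    spark.sql("SELECT trim(BOTH 'x' FROM 'xxSparkxx')").show()   // -> Spark

    // By contrast, the two-argument trim(...) form reads differently in
    // different databases, so switching its parameter order in Spark would
    // silently change results for existing queries.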


Thanks for taking the time to read this! Happy to discuss the specifics and 
amend this policy as the community sees fit.


Michael

  
