Re: Spark Improvement Proposals

Cody Koeninger Sun, 09 Oct 2016 13:59:19 -0700

If there's confusion there, the document is specifically what I'm
proposing.  The email is just by way of introduction.


On Sun, Oct 9, 2016 at 3:47 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Oh, hmm… I guess I’m a little confused on the relation between Cody’s
> email and the document he linked to, which says:
>
> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md#when
>
> SIPs should be used for significant user-facing or cross-cutting changes,
> not day-to-day improvements. When in doubt, if a committer thinks a change
> needs an SIP, it does.
>
> Nick
> 
>
> On Sun, Oct 9, 2016 at 4:40 PM Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
>
>> Yup, but the example you gave is for alternatives about *user-facing
>> behavior*, not implementation. The current SIP doc describes "strategy"
>> more as implementation strategy. I'm just saying there are different
>> possible goals for these types of docs.
>>
>> BTW, PEPs and Scala SIPs focus primarily on user-facing behavior, but
>> also require a reference implementation. This is a bit different from what
>> Cody had in mind, I think.
>>
>>
>> Matei
>>
>> On Oct 9, 2016, at 1:25 PM, Nicholas Chammas <nicholas.cham...@gmail.com>
>> wrote:
>>
>>
>>    - Rejected strategies: I personally wouldn’t put this, because what’s
>>    the point of voting to reject a strategy before you’ve really begun
>>    designing and implementing something? What if you discover that the
>>    strategy is actually better when you start doing stuff?
>>
>> I would guess the point is to document alternatives that were discussed
>> and rejected, so that later on people can be pointed to that discussion and
>> the devs don’t have to repeat themselves unnecessarily every time someone
>> comes along and asks “Why didn’t you do this other thing?” That doesn’t
>> mean a rejected proposal can’t later be revisited and the SIP can’t be
>> updated.
>>
>> For reference from the Python community, PEP 492
>> <https://www.python.org/dev/peps/pep-0492/>, a Python Enhancement
>> Proposal for adding async and await syntax and “first-class” coroutines
>> to Python, has a section on rejected ideas
>> <https://www.python.org/dev/peps/pep-0492/#why-async-def> for the new
>> syntax. It captures a summary of what the devs discussed, but it doesn’t
>> mean the PEP can’t be updated and a previously rejected proposal can’t be
>> revived.
>>
>> At least in the Python community, a PEP serves not just as a formal
>> starting point for a proposal (the “real” starting point is usually a
>> discussion on python-ideas or python-dev), but also as documentation of
>> what was agreed on and a living “spec” of sorts. So PEPs sometimes get
>> updated years after they are approved when revisions are agreed upon. PEPs
>> are also intended for wide consumption, vs. bug tracker issues which the
>> broader Python dev community are not expected to follow closely.
>>
>> Dunno if we want to follow a similar pattern for Spark, since the
>> project’s needs are different. But the Python community has used PEPs to
>> help organize and steer development since 2000; there are plenty of
>> examples there we can probably take inspiration from.
>>
>> By the way, can we call these things something other than Spark
>> Improvement Proposals? The acronym, SIP, conflicts with Scala SIPs
>> <http://docs.scala-lang.org/sips/index.html>. Since the Scala and Spark
>> communities have a lot of overlap, we don’t want, for example, names like
>> “SIP-10” to have an ambiguous meaning.
>>
>> Nick
>> 
>>
>> On Sun, Oct 9, 2016 at 3:34 PM Matei Zaharia <matei.zaha...@gmail.com>
>> wrote:
>>
>>> Hi Cody,
>>>
>>> I think this would be a lot more concrete if we had a more detailed
>>> template for SIPs. Right now, it's not super clear what's in scope -- e.g.
>>> are they a way to solicit feedback on the user-facing behavior or on the
>>> internals? "Goals" can cover both things. I've been thinking of SIPs more
>>> as Product Requirements Docs (PRDs), which focus on *what* a code change
>>> should do as opposed to how.
>>>
>>> In particular, here are some things that you may or may not consider in
>>> scope for SIPs:
>>>
>>> - Goals and non-goals: This is definitely in scope, and IMO should focus
>>> on user-visible behavior (e.g. "system supports SQL window functions" or
>>> "system continues working if one node fails"). BTW I wouldn't say "rejected
>>> goals" because some of them might become goals later, so we're not
>>> definitively rejecting them.
>>>
>>> - Public API: Probably should be included in most SIPs unless it's too
>>> large to fully specify then (e.g. "let's add an ML library").
>>>
>>> - Use cases: I usually find this very useful in PRDs to better
>>> communicate the goals.
>>>
>>> - Internal architecture: This is usually *not* a thing users can easily
>>> comment on and it sounds more like a design doc item. Of course it's
>>> important to show that the SIP is feasible to implement. One exception,
>>> however, is that I think we'll have some SIPs primarily on internals (e.g.
>>> if somebody wants to refactor Spark's query optimizer or something).
>>>
>>> - Rejected strategies: I personally wouldn't put this, because what's
>>> the point of voting to reject a strategy before you've really begun
>>> designing and implementing something? What if you discover that the
>>> strategy is actually better when you start doing stuff?
>>>
>>> At a super high level, it depends on whether you want the SIPs to be
>>> PRDs for getting some quick feedback on the goals of a feature before it is
>>> designed, or something more like full-fledged design docs (just a more
>>> visible design doc for bigger changes). I looked at Kafka's KIPs, and they
>>> actually seem to be more like design docs. This can work too but it does
>>> require more work from the proposer and it can lead to the same problems
>>> you mentioned with people already having a design and implementation in
>>> mind.
>>>
>>> Basically, the question is, are you trying to iterate faster on design
>>> by adding a step for user feedback earlier? Or are you just trying to make
>>> design docs for key features more visible (and their approval more formal)?
>>>
>>> BTW note that in either case, I'd like to have a template for design
>>> docs too, which should also include goals. I think that would've avoided
>>> some of the issues you brought up.
>>>
>>> Matei
>>>
>>> On Oct 9, 2016, at 10:40 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>
>>> Here's my specific proposal (meta-proposal?)
>>>
>>> Spark Improvement Proposals (SIP)
>>>
>>>
>>> Background:
>>>
>>> The current problem is that design and implementation of large features
>>> are often done in private, before soliciting user feedback.
>>>
>>> When feedback is solicited, it is often as to detailed design specifics,
>>> not focused on goals.
>>>
>>> When implementation does take place after design, there is often
>>> disagreement as to what goals are or are not in scope.
>>>
>>> This results in commits that don't fully meet user needs.
>>>
>>>
>>> Goals:
>>>
>>> - Ensure user, contributor, and committer goals are clearly identified
>>> and agreed upon, before implementation takes place.
>>>
>>> - Ensure that a technically feasible strategy is chosen that is likely
>>> to meet the goals.
>>>
>>>
>>> Rejected Goals:
>>>
>>> - SIPs are not for detailed design.  Design by committee doesn't work.
>>>
>>> - SIPs are not for every change.  We don't need that much process.
>>>
>>>
>>> Strategy:
>>>
>>> My suggestion is outlined as a Spark Improvement Proposal process
>>> documented at
>>>
>>> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>
>>> Specifics of Jira manipulation are an implementation detail we can
>>> figure out.
>>>
>>> I'm suggesting voting; the need here is for a _clear_ outcome.
>>>
>>>
>>> Rejected Strategies:
>>>
>>> Having someone who understands the problem implement it first works, but
>>> only if significant iteration after user feedback is allowed.
>>>
>>> Historically this has been problematic due to pressure to limit public
>>> api changes.
>>>
>>> On Fri, Oct 7, 2016 at 5:16 PM, Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> Alright, looks like there is quite a bit of support. We should wait to
>>>> hear from more people too.
>>>>
>>>> To push this forward, Cody and I will be working together in the next
>>>> couple of weeks to come up with a concrete, detailed proposal on what this
>>>> entails, and then we can discuss the specific proposal as well.
>>>>
>>>>
>>>> On Fri, Oct 7, 2016 at 2:29 PM, Cody Koeninger <c...@koeninger.org>
>>>> wrote:
>>>>
>>>>> Yeah, in case it wasn't clear, I was talking about SIPs for major
>>>>> user-facing or cross-cutting changes, not minor feature adds.
>>>>>
>>>>> On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos <
>>>>> stavros.kontopou...@lightbend.com> wrote:
>>>>>
>>>>>> +1 to the SIP label as long as it does not slow things down and it
>>>>>> targets optimizing efforts, coordination etc. For example, really small
>>>>>> features (assuming they don't touch public interfaces) or refactorings
>>>>>> should not need to go through this process, and I hope it will be kept
>>>>>> this way. So a guideline doc should be provided, as in the KIP case.
>>>>>>
>>>>>> IMHO so far, aside from tagging things and linking them elsewhere,
>>>>>> simply having design docs and prototype implementations in PRs is not
>>>>>> something that has not worked so far. What is really a pain in many
>>>>>> projects out there is discontinuity in progress of PRs, missing features,
>>>>>> slow reviews, which is understandable to some extent... It is not only
>>>>>> about Spark, but things can be improved for sure for this project in
>>>>>> particular, as already stated.
>>>>>>
>>>>>> On Fri, Oct 7, 2016 at 11:14 PM, Cody Koeninger <c...@koeninger.org>
>>>>>> wrote:
>>>>>>
>>>>>>> +1 to adding an SIP label and linking it from the website.  I think
>>>>>>> it needs
>>>>>>>
>>>>>>> - template that focuses it towards soliciting user goals / non goals
>>>>>>> - clear resolution as to which strategy was chosen to pursue.  I'd
>>>>>>> recommend a vote.
>>>>>>>
>>>>>>> Matei asked me to clarify what I meant by changing interfaces. I think
>>>>>>> it's directly relevant to the SIP idea so I'll clarify here, and split
>>>>>>> a thread for the other discussion per Nicholas' request.
>>>>>>>
>>>>>>> I meant changing public user interfaces.  I think the first design is
>>>>>>> unlikely to be right, because it's done at a time when you have the
>>>>>>> least information.  As a user, I find it considerably more frustrating
>>>>>>> to be unable to use a tool to get my job done, than I do having to
>>>>>>> make minor changes to my code in order to take advantage of features.
>>>>>>> I've seen committers be seriously reluctant to allow changes to
>>>>>>> @experimental code that are needed in order for it to really work
>>>>>>> right.  You need to be able to iterate, and if people on both sides of
>>>>>>> the fence aren't going to respect that some newer apis are subject to
>>>>>>> change, then why even mark them as such?
>>>>>>>
>>>>>>> Ideally a finished SIP should give me a checklist of things that an
>>>>>>> implementation must do, and things that it doesn't need to do.
>>>>>>> Contributors/committers should be seriously discouraged from putting
>>>>>>> out a version 0.1 that doesn't have at least a prototype
>>>>>>> implementation of all those things, especially if they're then going
>>>>>>> to argue against interface changes necessary to get the rest of
>>>>>>> the things done in the 0.2 version.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin <r...@databricks.com>
>>>>>>> wrote:
>>>>>>> > I like the lightweight proposal to add a SIP label.
>>>>>>> >
>>>>>>> > During Spark 2.0 development, Tom (Graves) and I suggested using wiki
>>>>>>> > to track the list of major changes, but that never really materialized
>>>>>>> > due to the overhead. Adding a SIP label on major JIRAs and then linking
>>>>>>> > to them prominently on the Spark website makes a lot of sense.
>>>>>>> >
>>>>>>> >
>>>>>>> > On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia <matei.zaha...@gmail.com>
>>>>>>> > wrote:
>>>>>>> >>
>>>>>>> >> For the improvement proposals, I think one major point was to make them
>>>>>>> >> really visible to users who are not contributors, so we should do more
>>>>>>> >> than sending stuff to dev@. One very lightweight idea is to have a new
>>>>>>> >> type of JIRA called a SIP and have a link to a filter that shows all
>>>>>>> >> such JIRAs from http://spark.apache.org. I also like the idea of SIP
>>>>>>> >> and design doc templates (in fact many projects have them).
>>>>>>> >>
>>>>>>> >> Matei
>>>>>>> >>
>>>>>>> >> On Oct 7, 2016, at 10:38 AM, Reynold Xin <r...@databricks.com> wrote:
>>>>>>> >>
>>>>>>> >> I called Cody last night and talked about some of the topics in his
>>>>>>> >> email. It became clear to me Cody genuinely cares about the project.
>>>>>>> >>
>>>>>>> >> Some of the frustrations come from the success of the project itself
>>>>>>> >> becoming very "hot", and it is difficult to get clarity from people who
>>>>>>> >> don't dedicate all their time to Spark. In fact, it is in some ways
>>>>>>> >> similar to scaling an engineering team in a successful startup: old
>>>>>>> >> processes that worked well might not work so well when it gets to a
>>>>>>> >> certain size, cultures can get diluted, building culture vs building
>>>>>>> >> process, etc.
>>>>>>> >>
>>>>>>> >> I would also really like to have a more visible process for larger
>>>>>>> >> changes, especially major user facing API changes. Historically we
>>>>>>> >> upload design docs for major changes, but it is not always consistent
>>>>>>> >> and it is difficult to ensure the quality of the docs, due to the
>>>>>>> >> volunteering nature of the organization.
>>>>>>> >>
>>>>>>> >> Some of the more concrete ideas we discussed focus on building a
>>>>>>> >> culture to improve clarity:
>>>>>>> >>
>>>>>>> >> - Process: Large changes should have design docs posted on JIRA. One
>>>>>>> >> thing Cody and I didn't discuss, but an idea that just came to me, is
>>>>>>> >> that we should create a design doc template for the project and ask
>>>>>>> >> everybody to follow it. The design doc template should also explicitly
>>>>>>> >> list goals and non-goals, to make design docs more consistent.
>>>>>>> >>
>>>>>>> >> - Process: Email dev@ to solicit feedback. We have done this with some
>>>>>>> >> changes, but again very inconsistently. Just posting something on JIRA
>>>>>>> >> isn't sufficient, because there are simply too many JIRAs and the
>>>>>>> >> signal gets lost in the noise. While this is generally impossible to
>>>>>>> >> enforce because we can't force all volunteers to conform to a process
>>>>>>> >> (or they might not even be aware of this), those who are more familiar
>>>>>>> >> with the project can help by emailing dev@ when they see something
>>>>>>> >> that hasn't been.
>>>>>>> >>
>>>>>>> >> - Culture: The design doc author(s) should be open to feedback. A
>>>>>>> >> design doc should serve as the base for discussion and is by no means
>>>>>>> >> the final design. Of course, this does not mean the author has to
>>>>>>> >> accept every piece of feedback. They should also be comfortable
>>>>>>> >> accepting / rejecting ideas on technical grounds.
>>>>>>> >>
>>>>>>> >> - Process / Culture: For major ongoing projects, it can be useful to
>>>>>>> >> have some monthly Google hangouts that are open to the world. I am
>>>>>>> >> actually not sure how well this will work, because of the volunteering
>>>>>>> >> nature and because we need to adjust for timezones for people across
>>>>>>> >> the globe, but it seems worth trying.
>>>>>>> >>
>>>>>>> >> - Culture: Contributors (including committers) should be more direct in
>>>>>>> >> setting expectations, including whether they are working on a specific
>>>>>>> >> issue, whether they will be working on a specific issue, and whether an
>>>>>>> >> issue or pr or jira should be rejected. Most people I know in this
>>>>>>> >> community are nice and don't enjoy telling other people no, but it is
>>>>>>> >> often more annoying to a contributor to not know anything than to get
>>>>>>> >> a no.
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> On Fri, Oct 7, 2016 at 10:03 AM, Matei Zaharia <matei.zaha...@gmail.com>
>>>>>>> >> wrote:
>>>>>>> >>>
>>>>>>> >>>
>>>>>>> >>> Love the idea of a more visible "Spark Improvement Proposal" process
>>>>>>> >>> that solicits user input on new APIs. For what it's worth, I don't
>>>>>>> >>> think committers are trying to minimize their own work -- every
>>>>>>> >>> committer cares about making the software useful for users. However,
>>>>>>> >>> it is always hard to get user input and so it helps to have this kind
>>>>>>> >>> of process. I've certainly looked at the *IPs a lot in other software
>>>>>>> >>> I use just to see the biggest things on the roadmap.
>>>>>>> >>>
>>>>>>> >>> When you're talking about "changing interfaces", are you talking about
>>>>>>> >>> public or internal APIs? I do think many people hate changing public
>>>>>>> >>> APIs and I actually think that's for the best of the project. That's a
>>>>>>> >>> technical debate, but basically, the worst thing when you're using a
>>>>>>> >>> piece of software is that the developers constantly ask you to rewrite
>>>>>>> >>> your app to update to a new version (and thus benefit from bug fixes,
>>>>>>> >>> etc). Cue anyone who's used Protobuf, or Guava. The "let's get
>>>>>>> >>> everyone to change their code this release" model works well within a
>>>>>>> >>> single large company, but doesn't work well for a community, which is
>>>>>>> >>> why nearly all *very* widely used programming interfaces (I'm talking
>>>>>>> >>> things like Java standard library, Windows API, etc) almost *never*
>>>>>>> >>> break backwards compatibility. All this is done within reason though,
>>>>>>> >>> e.g. we do change things in major releases (2.x, 3.x, etc).
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Stavros Kontopoulos
>>>>>>
>>>>>> *Senior Software Engineer*
>>>>>> *Lightbend, Inc.*
>>>>>>
>>>>>> *p:  +30 6977967274*
>>>>>> *e: stavros.kontopou...@lightbend.com*
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
