Re: Spark Improvement Proposals

Cody Koeninger Mon, 10 Oct 2016 12:08:13 -0700

I think this is closer to a procedural issue than a code modification
issue, hence why majority.  If everyone thinks consensus is better, I
don't care.  Again, I don't feel strongly about the way we achieve
clarity, just that we achieve clarity.


On Mon, Oct 10, 2016 at 2:02 PM, Ryan Blue <[email protected]> wrote:
> Sorry, I missed that the proposal includes majority approval. Why majority
> instead of consensus? I think we want to build consensus around these
> proposals and it makes sense to discuss until no one would veto.
>
> rb
>
> On Mon, Oct 10, 2016 at 11:54 AM, Ryan Blue <[email protected]> wrote:
>>
>> +1 to votes to approve proposals. I agree that proposals should have an
>> official mechanism to be accepted, and a vote is an established means of
>> doing that well. I like that it includes a period to review the proposal and
>> I think proposals should have been discussed enough ahead of a vote to
>> survive the possibility of a veto.
>>
>> I also like the names that are short and (mostly) unique, like SEP.
>>
>> Where I disagree is with the requirement that a committer must formally
>> propose an enhancement. I don't see the value of restricting this: if
>> someone has the will to write up a proposal then they should be encouraged
>> to do so and start a discussion about it. Even if there is a political
>> reality as Cody says, what is the value of codifying that in our process? I
>> think restricting who can submit proposals would only undermine them by
>> pushing contributors out. Maybe I'm missing something here?
>>
>> rb
>>
>>
>>
>> On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger <[email protected]>
>> wrote:
>>>
>>> Yes, users suggesting SIPs is a good thing and is explicitly called
>>> out in the linked document under the Who? section.  Formally proposing
>>> them, not so much, because of the political realities.
>>>
>>> Yes, implementation strategy definitely affects goals.  There are all
>>> kinds of examples of this, I'll pick one that's my fault so as to
>>> avoid sounding like I'm blaming:
>>>
>>> When I implemented the Kafka DStream, one of my (not explicitly agreed
>>> upon by the community) goals was to make sure people could use the
>>> DStream with however they were already using Kafka at work.  The lack
>>> of explicit agreement on that goal led to all kinds of fighting with
>>> committers, that could have been avoided.  The lack of explicit
>>> up-front strategy discussion led to the DStream not really working
>>> with compacted topics.  I knew about compacted topics, but don't have
>>> a use for them, so had a blind spot there.  If there was explicit
>>> up-front discussion that my strategy was "assume that batches can be
>>> defined on the driver solely by beginning and ending offsets", there's
>>> a greater chance that a user would have seen that and said, "hey, what
>>> about non-contiguous offsets in a compacted topic".
>>>
>>> This kind of thing is only going to happen smoothly if we have a
>>> lightweight user-visible process with clear outcomes.
>>>
>>> On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
>>> <[email protected]> wrote:
>>> > I agree with most of what Cody said.
>>> >
>>> > Two things:
>>> >
>>> > First we can always have other people suggest SIPs but mark them as
>>> > “unreviewed” and have committers basically move them forward. The
>>> > problem is
>>> > that writing a good document takes time. This way we can leverage non
>>> > committers to do some of this work (it is just another way to
>>> > contribute).
>>> >
>>> >
>>> >
>>> > As for strategy, in many cases implementation strategy can affect the
>>> > goals.
>>> > I will give  a small example: In the current structured streaming
>>> > strategy,
>>> > we group by the time to achieve a sliding window. This is definitely an
>>> > implementation decision and not a goal. However, I can think of several
>>> > aggregation functions which have the time inside their calculation
>>> > buffer.
>>> > For example, let’s say we want to return a set of all distinct values.
>>> > One
>>> > way to implement this would be to make the set into a map and have the
>>> > value
>>> > contain the last time seen. Multiplying it across the groupby would
>>> > cost a
>>> > lot in performance. So adding such a strategy would have a great effect
>>> > on
>>> > the type of aggregations and their performance which does affect the
>>> > goal.
>>> > Without adding the strategy, it is easy for whoever goes to the design
>>> > document to not think about these cases. Furthermore, it might be
>>> > decided
>>> > that these cases are rare enough so that the strategy is still good
>>> > enough
>>> > but how would we know it without user feedback?
>>> >
>>> > I believe this example is exactly what Cody was talking about. Since
>>> > many
>>> > times implementation strategies have a large effect on the goal, we
>>> > should
>>> > have it discussed when discussing the goals. In addition, while it is
>>> > often
>>> > easy to throw out completely infeasible goals, it is often much harder
>>> > to
>>> > figure out that the goals are unfeasible without fine tuning.
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > Assaf.
>>> >
>>> >
>>> >
>>> > From: Cody Koeninger-2 [via Apache Spark Developers List]
>>> > [mailto:ml-node+[hidden email]]
>>> > Sent: Monday, October 10, 2016 2:25 AM
>>> > To: Mendelson, Assaf
>>> > Subject: Re: Spark Improvement Proposals
>>> >
>>> >
>>> >
>>> > Only committers should formally submit SIPs because in an Apache
>>> > project only committers have explicit political power.  If a user can't
>>> > find a committer willing to sponsor an SIP idea, they have no way to
>>> > get the idea passed in any case.  If I can't find a committer to
>>> > sponsor this meta-SIP idea, I'm out of luck.
>>> >
>>> > I do not believe unrealistic goals can be found solely by inspection.
>>> > We've managed to ignore unrealistic goals even after implementation!
>>> > Focusing on APIs can allow people to think they've solved something,
>>> > when there's really no way of implementing that API while meeting the
>>> > goals.  Rapid iteration is clearly the best way to address this, but
>>> > we've already talked about why that hasn't really worked.  If adding a
>>> > non-binding API section to the template is important to you, I'm not
>>> > against it, but I don't think it's sufficient.
>>> >
>>> > On your PRD vs design doc spectrum, I'm saying this is closer to a
>>> > PRD.  Clear agreement on goals is the most important thing and that's
>>> > why it's the thing I want binding agreement on.  But I cannot agree to
>>> > goals unless I have enough minimal technical info to judge whether the
>>> > goals are likely to actually be accomplished.
>>> >
>>> >
>>> >
>>> > On Sun, Oct 9, 2016 at 5:35 PM, Matei Zaharia <[hidden email]> wrote:
>>> >
>>> >
>>> >> Well, I think there are a few things here that don't make sense.
>>> >> First,
>>> >> why
>>> >> should only committers submit SIPs? Development in the project should
>>> >> be
>>> >> open to all contributors, whether they're committers or not. Second, I
>>> >> think
>>> >> unrealistic goals can be found just by inspecting the goals, and I'm
>>> >> not
>>> >> super worried that we'll accept a lot of SIPs that are then infeasible
>>> >> --
>>> >> we
>>> >> can then submit new ones. But this depends on whether you want this
>>> >> process
>>> >> to be a "design doc lite", where people also agree on implementation
>>> >> strategy, or just a way to agree on goals. This is what I asked
>>> >> earlier
>>> >> about PRDs vs design docs (and I'm open to either one but I'd just
>>> >> like
>>> >> clarity). Finally, both as a user and designer of software, I always
>>> >> want
>>> >> to
>>> >> give feedback on APIs, so I'd really like a culture of having those
>>> >> early.
>>> >> People don't argue about prettiness when they discuss APIs, they argue
>>> >> about
>>> >> the core concepts to expose in order to meet various goals, and then
>>> >> they're
>>> >> stuck maintaining those for a long time.
>>> >>
>>> >> Matei
>>> >>
>>> >> On Oct 9, 2016, at 3:10 PM, Cody Koeninger <[hidden email]> wrote:
>>> >>
>>> >> Users instead of people, sure.  Committers and contributors are (or at
>>> >> least
>>> >> should be) a subset of users.
>>> >>
>>> >> Non goals, sure. I don't care what the name is, but we need to clearly
>>> >> say
>>> >> e.g. 'no we are not maintaining compatibility with XYZ right now'.
>>> >>
>>> >> API, what I care most about is whether it allows me to accomplish the
>>> >> goals.
>>> >> Arguing about how ugly or pretty it is can be saved for design/
>>> >> implementation imho.
>>> >>
>>> >> Strategy, this is necessary because otherwise goals can be out of line
>>> >> with
>>> >> reality.  Don't propose goals you don't have at least some idea of how
>>> >> to
>>> >> implement.
>>> >>
>>> >> Rejected strategies, given that committers are the only ones I'm saying
>>> >> should formally submit SPARKLIs or SIPs, if they put junk in a
>>> >> required
>>> >> section then slap them down for it and tell them to fix it.
>>> >>
>>> >>
>>> >> On Oct 9, 2016 4:36 PM, "Matei Zaharia" <[hidden email]> wrote:
>>> >>>
>>> >>> Yup, this is the stuff that I found unclear. Thanks for clarifying
>>> >>> here,
>>> >>> but we should also clarify it in the writeup. In particular:
>>> >>>
>>> >>> - Goals needs to be about user-facing behavior ("people" is broad)
>>> >>>
>>> >>> - I'd rename Rejected Goals to Non-Goals. Otherwise someone will dig
>>> >>> up
>>> >>> one of these and say "Spark's developers have officially rejected X,
>>> >>> which
>>> >>> our awesome system has".
>>> >>>
>>> >>> - For user-facing stuff, I think you need a section on API. Virtually
>>> >>> all
>>> >>> other *IPs I've seen have that.
>>> >>>
>>> >>> - I'm still not sure why the strategy section is needed if the
>>> >>> purpose is
>>> >>> to define user-facing behavior -- unless this is the strategy for
>>> >>> setting
>>> >>> the goals or for defining the API. That sounds squarely like a design
>>> >>> doc
>>> >>> issue. In some sense, who cares whether the proposal is technically
>>> >>> feasible
>>> >>> right now? If it's infeasible, that will be discovered later during
>>> >>> design
>>> >>> and implementation. Same thing with rejected strategies -- listing
>>> >>> some
>>> >>> of
>>> >>> those is definitely useful sometimes, but if you make this a
>>> >>> *required*
>>> >>> section, people are just going to fill it in with bogus stuff (I've
>>> >>> seen
>>> >>> this happen before).
>>> >>>
>>> >>> Matei
>>> >>>
>>> >
>>> >>> > On Oct 9, 2016, at 2:14 PM, Cody Koeninger <[hidden email]> wrote:
>>> >>> >
>>> >>> > So to focus the discussion on the specific strategy I'm suggesting,
>>> >>> > documented at
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>> >>> >
>>> >>> > "Goals: What must this allow people to do, that they can't
>>> >>> > currently?"
>>> >>> >
>>> >>> > Is it unclear that this is focusing specifically on people-visible
>>> >>> > behavior?
>>> >>> >
>>> >>> > Rejected goals -  are important because otherwise people keep
>>> >>> > trying
>>> >>> > to argue about scope.  Of course you can change things later with a
>>> >>> > different SIP and different vote, the point is to focus.
>>> >>> >
>>> >>> > Use cases - are something that people are going to bring up in
>>> >>> > discussion.  If they aren't clearly documented as a goal ("This
>>> >>> > must
>>> >>> > allow me to connect using SSL"), they should be added.
>>> >>> >
>>> >>> > Internal architecture - if the people who need specific behavior
>>> >>> > are
>>> >>> > implementers of other parts of the system, that's fine.
>>> >>> >
>>> >>> > Rejected strategies - If you have none of these, you have no
>>> >>> > evidence
>>> >>> > that the proponent didn't just go with the first thing they had in
>>> >>> > mind (or have already implemented), which is a big problem
>>> >>> > currently.
>>> >>> > Approval isn't binding as to specifics of implementation, so these
>>> >>> > aren't handcuffs.  The goals are the contract, the strategy is
>>> >>> > evidence that contract can actually be met.
>>> >>> >
>>> >>> > Design docs - I'm not touching design docs.  The markdown file I
>>> >>> > linked specifically says of the strategy section "This is not a
>>> >>> > full
>>> >>> > design document."  Is this unclear?  Design docs can be worked on
>>> >>> > obviously, but that's not what I'm concerned with here.
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia <[hidden email]>
>>> >>> > wrote:
>>> >>> >> Hi Cody,
>>> >>> >>
>>> >>> >> I think this would be a lot more concrete if we had a more
>>> >>> >> detailed
>>> >>> >> template
>>> >>> >> for SIPs. Right now, it's not super clear what's in scope -- e.g.
>>> >>> >> are
>>> >>> >> they
>>> >>> >> a way to solicit feedback on the user-facing behavior or on the
>>> >>> >> internals?
>>> >>> >> "Goals" can cover both things. I've been thinking of SIPs more as
>>> >>> >> Product
>>> >>> >> Requirements Docs (PRDs), which focus on *what* a code change
>>> >>> >> should
>>> >>> >> do
>>> >>> >> as
>>> >>> >> opposed to how.
>>> >>> >>
>>> >>> >> In particular, here are some things that you may or may not
>>> >>> >> consider
>>> >>> >> in
>>> >>> >> scope for SIPs:
>>> >>> >>
>>> >>> >> - Goals and non-goals: This is definitely in scope, and IMO should
>>> >>> >> focus on
>>> >>> >> user-visible behavior (e.g. "system supports SQL window functions"
>>> >>> >> or
>>> >>> >> "system continues working if one node fails"). BTW I wouldn't say
>>> >>> >> "rejected
>>> >>> >> goals" because some of them might become goals later, so we're not
>>> >>> >> definitively rejecting them.
>>> >>> >>
>>> >>> >> - Public API: Probably should be included in most SIPs unless it's
>>> >>> >> too
>>> >>> >> large
>>> >>> >> to fully specify then (e.g. "let's add an ML library").
>>> >>> >>
>>> >>> >> - Use cases: I usually find this very useful in PRDs to better
>>> >>> >> communicate
>>> >>> >> the goals.
>>> >>> >>
>>> >>> >> - Internal architecture: This is usually *not* a thing users can
>>> >>> >> easily
>>> >>> >> comment on and it sounds more like a design doc item. Of course
>>> >>> >> it's
>>> >>> >> important to show that the SIP is feasible to implement. One
>>> >>> >> exception,
>>> >>> >> however, is that I think we'll have some SIPs primarily on
>>> >>> >> internals
>>> >>> >> (e.g.
>>> >>> >> if somebody wants to refactor Spark's query optimizer or
>>> >>> >> something).
>>> >>> >>
>>> >>> >> - Rejected strategies: I personally wouldn't put this, because
>>> >>> >> what's
>>> >>> >> the
>>> >>> >> point of voting to reject a strategy before you've really begun
>>> >>> >> designing
>>> >>> >> and implementing something? What if you discover that the strategy
>>> >>> >> is
>>> >>> >> actually better when you start doing stuff?
>>> >>> >>
>>> >>> >> At a super high level, it depends on whether you want the SIPs to
>>> >>> >> be
>>> >>> >> PRDs
>>> >>> >> for getting some quick feedback on the goals of a feature before
>>> >>> >> it is
>>> >>> >> designed, or something more like full-fledged design docs (just a
>>> >>> >> more
>>> >>> >> visible design doc for bigger changes). I looked at Kafka's KIPs,
>>> >>> >> and
>>> >>> >> they
>>> >>> >> actually seem to be more like design docs. This can work too but
>>> >>> >> it
>>> >>> >> does
>>> >>> >> require more work from the proposer and it can lead to the same
>>> >>> >> problems you
>>> >>> >> mentioned with people already having a design and implementation
>>> >>> >> in
>>> >>> >> mind.
>>> >>> >>
>>> >>> >> Basically, the question is, are you trying to iterate faster on
>>> >>> >> design
>>> >>> >> by
>>> >>> >> adding a step for user feedback earlier? Or are you just trying to
>>> >>> >> make
>>> >>> >> design docs for key features more visible (and their approval more
>>> >>> >> formal)?
>>> >>> >>
>>> >>> >> BTW note that in either case, I'd like to have a template for
>>> >>> >> design
>>> >>> >> docs
>>> >>> >> too, which should also include goals. I think that would've
>>> >>> >> avoided
>>> >>> >> some of
>>> >>> >> the issues you brought up.
>>> >>> >>
>>> >>> >> Matei
>>> >>> >>
>>> >>> >> On Oct 9, 2016, at 10:40 AM, Cody Koeninger <[hidden email]>
>>> >>> >> wrote:
>>> >>> >>
>>> >>> >> Here's my specific proposal (meta-proposal?)
>>> >>> >>
>>> >>> >> Spark Improvement Proposals (SIP)
>>> >>> >>
>>> >>> >>
>>> >>> >> Background:
>>> >>> >>
>>> >>> >> The current problem is that design and implementation of large
>>> >>> >> features
>>> >>> >> are
>>> >>> >> often done in private, before soliciting user feedback.
>>> >>> >>
>>> >>> >> When feedback is solicited, it is often as to detailed design
>>> >>> >> specifics, not
>>> >>> >> focused on goals.
>>> >>> >>
>>> >>> >> When implementation does take place after design, there is often
>>> >>> >> disagreement as to what goals are or are not in scope.
>>> >>> >>
>>> >>> >> This results in commits that don't fully meet user needs.
>>> >>> >>
>>> >>> >>
>>> >>> >> Goals:
>>> >>> >>
>>> >>> >> - Ensure user, contributor, and committer goals are clearly
>>> >>> >> identified
>>> >>> >> and
>>> >>> >> agreed upon, before implementation takes place.
>>> >>> >>
>>> >>> >> - Ensure that a technically feasible strategy is chosen that is
>>> >>> >> likely
>>> >>> >> to
>>> >>> >> meet the goals.
>>> >>> >>
>>> >>> >>
>>> >>> >> Rejected Goals:
>>> >>> >>
>>> >>> >> - SIPs are not for detailed design.  Design by committee doesn't
>>> >>> >> work.
>>> >>> >>
>>> >>> >> - SIPs are not for every change.  We don't need that much process.
>>> >>> >>
>>> >>> >>
>>> >>> >> Strategy:
>>> >>> >>
>>> >>> >> My suggestion is outlined as a Spark Improvement Proposal process
>>> >>> >> documented
>>> >>> >> at
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> >>> >> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>> >>> >>
>>> >>> >> Specifics of Jira manipulation are an implementation detail we can
>>> >>> >> figure
>>> >>> >> out.
>>> >>> >>
>>> >>> >> I'm suggesting voting; the need here is for a _clear_ outcome.
>>> >>> >>
>>> >>> >>
>>> >>> >> Rejected Strategies:
>>> >>> >>
>>> >>> >> Having someone who understands the problem implement it first
>>> >>> >> works,
>>> >>> >> but
>>> >>> >> only if significant iteration after user feedback is allowed.
>>> >>> >>
>>> >>> >> Historically this has been problematic due to pressure to limit
>>> >>> >> public
>>> >>> >> api
>>> >>> >> changes.
>>> >>> >>
>>> >>> >>
>>> >>> >> On Fri, Oct 7, 2016 at 5:16 PM, Reynold Xin <[hidden email]>
>>> >>> >> wrote:
>>> >>> >>>
>>> >>> >>> Alright looks like there is quite a bit of support. We should
>>> >>> >>> wait
>>> >>> >>> to
>>> >>> >>> hear from more people too.
>>> >>> >>>
>>> >>> >>> To push this forward, Cody and I will be working together in the
>>> >>> >>> next
>>> >>> >>> couple of weeks to come up with a concrete, detailed proposal on
>>> >>> >>> what
>>> >>> >>> this
>>> >>> >>> entails, and then we can discuss the specific proposal as
>>> >>> >>> well.
>>> >>> >>>
>>> >>> >>>
>>> >>> >>> On Fri, Oct 7, 2016 at 2:29 PM, Cody Koeninger <[hidden email]>
>>> >>> >>> wrote:
>>> >>> >>>>
>>> >>> >>>> Yeah, in case it wasn't clear, I was talking about SIPs for
>>> >>> >>>> major
>>> >>> >>>> user-facing or cross-cutting changes, not minor feature adds.
>>> >>> >>>>
>>> >>> >>>> On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos
>>> >>> >>>> <[hidden email]> wrote:
>>> >>> >>>>>
>>> >>> >>>>> +1 to the SIP label as long as it does not slow down things and
>>> >>> >>>>> it
>>> >>> >>>>> targets optimizing efforts, coordination etc. For example
>>> >>> >>>>> really
>>> >>> >>>>> small
>>> >>> >>>>> features should not need to go through this process (assuming
>>> >>> >>>>> they
>>> >>> >>>>> don't
>>> >>> >>>>> touch public interfaces)  or re-factorings and hope it will be
>>> >>> >>>>> kept
>>> >>> >>>>> this
>>> >>> >>>>> way. So as a guideline doc should be provided, like in the KIP
>>> >>> >>>>> case.
>>> >>> >>>>>
>>> >>> >>>>> IMHO so far aside from tagging things and linking them
>>> >>> >>>>> elsewhere
>>> >>> >>>>> simply
>>> >>> >>>>> having design docs and prototype implementations in PRs is not
>>> >>> >>>>> something
>>> >>> >>>>> that has not worked so far. What is really a pain in many
>>> >>> >>>>> projects
>>> >>> >>>>> out there
>>> >>> >>>>> is discontinuity in progress of PRs, missing features, slow
>>> >>> >>>>> reviews
>>> >>> >>>>> which is
>>> >>> >>>>> understandable to some extent... it is not only about Spark but
>>> >>> >>>>> things can
>>> >>> >>>>> be improved for sure for this project in particular as already
>>> >>> >>>>> stated.
>>> >>> >>>>>
>>> >>> >>>>> On Fri, Oct 7, 2016 at 11:14 PM, Cody Koeninger <[hidden
>>> >>> >>>>> email]>
>>> >>> >>>>> wrote:
>>> >>> >>>>>>
>>> >>> >>>>>> +1 to adding an SIP label and linking it from the website.  I
>>> >>> >>>>>> think
>>> >>> >>>>>> it
>>> >>> >>>>>> needs
>>> >>> >>>>>>
>>> >>> >>>>>> - template that focuses it towards soliciting user goals / non
>>> >>> >>>>>> goals
>>> >>> >>>>>> - clear resolution as to which strategy was chosen to pursue.
>>> >>> >>>>>> I'd
>>> >>> >>>>>> recommend a vote.
>>> >>> >>>>>>
>>> >>> >>>>>> Matei asked me to clarify what I meant by changing interfaces,
>>> >>> >>>>>> I
>>> >>> >>>>>> think
>>> >>> >>>>>> it's directly relevant to the SIP idea so I'll clarify here,
>>> >>> >>>>>> and
>>> >>> >>>>>> split
>>> >>> >>>>>> a thread for the other discussion per Nicholas' request.
>>> >>> >>>>>>
>>> >>> >>>>>> I meant changing public user interfaces.  I think the first
>>> >>> >>>>>> design
>>> >>> >>>>>> is
>>> >>> >>>>>> unlikely to be right, because it's done at a time when you
>>> >>> >>>>>> have
>>> >>> >>>>>> the
>>> >>> >>>>>> least information.  As a user, I find it considerably more
>>> >>> >>>>>> frustrating
>>> >>> >>>>>> to be unable to use a tool to get my job done, than I do
>>> >>> >>>>>> having to
>>> >>> >>>>>> make minor changes to my code in order to take advantage of
>>> >>> >>>>>> features.
>>> >>> >>>>>> I've seen committers be seriously reluctant to allow changes
>>> >>> >>>>>> to
>>> >>> >>>>>> @experimental code that are needed in order for it to really
>>> >>> >>>>>> work
>>> >>> >>>>>> right.  You need to be able to iterate, and if people on both
>>> >>> >>>>>> sides
>>> >>> >>>>>> of
>>> >>> >>>>>> the fence aren't going to respect that some newer apis are
>>> >>> >>>>>> subject
>>> >>> >>>>>> to
>>> >>> >>>>>> change, then why even mark them as such?
>>> >>> >>>>>>
>>> >>> >>>>>> Ideally a finished SIP should give me a checklist of things
>>> >>> >>>>>> that
>>> >>> >>>>>> an
>>> >>> >>>>>> implementation must do, and things that it doesn't need to do.
>>> >>> >>>>>> Contributors/committers should be seriously discouraged from
>>> >>> >>>>>> putting
>>> >>> >>>>>> out a version 0.1 that doesn't have at least a prototype
>>> >>> >>>>>> implementation of all those things, especially if they're then
>>> >>> >>>>>> going
>>> >>> >>>>>> to argue against interface changes necessary to get the
>>> >>> >>>>>> rest
>>> >>> >>>>>> of
>>> >>> >>>>>> the things done in the 0.2 version.
>>> >>> >>>>>>
>>> >>> >>>>>>
>>> >>> >>>>>> On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin <[hidden email]>
>>> >>> >>>>>> wrote:
>>> >>> >>>>>>> I like the lightweight proposal to add a SIP label.
>>> >>> >>>>>>>
>>> >>> >>>>>>> During Spark 2.0 development, Tom (Graves) and I suggested
>>> >>> >>>>>>> using
>>> >>> >>>>>>> wiki
>>> >>> >>>>>>> to
>>> >>> >>>>>>> track the list of major changes, but that never really
>>> >>> >>>>>>> materialized
>>> >>> >>>>>>> due to
>>> >>> >>>>>>> the overhead. Adding a SIP label on major JIRAs and then linking
>>> >>> >>>>>>> to
>>> >>> >>>>>>> them
>>> >>> >>>>>>> prominently on the Spark website makes a lot of sense.
>>> >>> >>>>>>>
>>> >>> >>>>>>>
>>> >>> >>>>>>> On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia
>>> >>> >>>>>>> <[hidden email]>
>>> >>> >>>>>>> wrote:
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> For the improvement proposals, I think one major point was
>>> >>> >>>>>>>> to
>>> >>> >>>>>>>> make
>>> >>> >>>>>>>> them
>>> >>> >>>>>>>> really visible to users who are not contributors, so we
>>> >>> >>>>>>>> should
>>> >>> >>>>>>>> do
>>> >>> >>>>>>>> more than
>>> >>> >>>>>>>> sending stuff to dev@. One very lightweight idea is to have
>>> >>> >>>>>>>> a
>>> >>> >>>>>>>> new
>>> >>> >>>>>>>> type of
>>> >>> >>>>>>>> JIRA called a SIP and have a link to a filter that shows all
>>> >>> >>>>>>>> such
>>> >>> >>>>>>>> JIRAs from
>>> >>> >>>>>>>> http://spark.apache.org. I also like the idea of SIP and
>>> >>> >>>>>>>> design
>>> >>> >>>>>>>> doc
>>> >>> >>>>>>>> templates (in fact many projects have them).
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> Matei
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> On Oct 7, 2016, at 10:38 AM, Reynold Xin <[hidden email]>
>>> >>> >>>>>>>> wrote:
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> I called Cody last night and talked about some of the topics
>>> >>> >>>>>>>> in
>>> >>> >>>>>>>> his
>>> >>> >>>>>>>> email.
>>> >>> >>>>>>>> It became clear to me Cody genuinely cares about the
>>> >>> >>>>>>>> project.
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> Some of the frustrations come from the success of the
>>> >>> >>>>>>>> project
>>> >>> >>>>>>>> itself
>>> >>> >>>>>>>> becoming very "hot", and it is difficult to get clarity from
>>> >>> >>>>>>>> people
>>> >>> >>>>>>>> who
>>> >>> >>>>>>>> don't dedicate all their time to Spark. In fact, it is in
>>> >>> >>>>>>>> some
>>> >>> >>>>>>>> ways
>>> >>> >>>>>>>> similar
>>> >>> >>>>>>>> to scaling an engineering team in a successful startup: old
>>> >>> >>>>>>>> processes that
>>> >>> >>>>>>>> worked well might not work so well when it gets to a certain
>>> >>> >>>>>>>> size,
>>> >>> >>>>>>>> cultures
>>> >>> >>>>>>>> can get diluted, building culture vs building process, etc.
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> I also really like to have a more visible process for larger
>>> >>> >>>>>>>> changes,
>>> >>> >>>>>>>> especially major user facing API changes. Historically we
>>> >>> >>>>>>>> upload
>>> >>> >>>>>>>> design docs
>>> >>> >>>>>>>> for major changes, but it is not always consistent, and it is
>>> >>> >>>>>>>> difficult to ensure the quality
>>> >>> >>>>>>>> of the docs, due to the volunteering nature of the
>>> >>> >>>>>>>> organization.
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> Some of the more concrete ideas we discussed focus on
>>> >>> >>>>>>>> building a
>>> >>> >>>>>>>> culture
>>> >>> >>>>>>>> to improve clarity:
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> - Process: Large changes should have design docs posted on
>>> >>> >>>>>>>> JIRA.
>>> >>> >>>>>>>> One
>>> >>> >>>>>>>> thing
>>> >>> >>>>>>>> Cody and I didn't discuss but an idea that just came to me
>>> >>> >>>>>>>> is we
>>> >>> >>>>>>>> should
>>> >>> >>>>>>>> create a design doc template for the project and ask
>>> >>> >>>>>>>> everybody
>>> >>> >>>>>>>> to
>>> >>> >>>>>>>> follow.
>>> >>> >>>>>>>> The design doc template should also explicitly list goals
>>> >>> >>>>>>>> and
>>> >>> >>>>>>>> non-goals, to
>>> >>> >>>>>>>> make design doc more consistent.
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> - Process: Email dev@ to solicit feedback. We have done this
>>> >>> >>>>>>>> with
>>> >>> >>>>>>>> some
>>> >>> >>>>>>>> changes, but again very inconsistent. Just posting something
>>> >>> >>>>>>>> on
>>> >>> >>>>>>>> JIRA
>>> >>> >>>>>>>> isn't
>>> >>> >>>>>>>> sufficient, because there are simply too many JIRAs and the
>>> >>> >>>>>>>> signal
>>> >>> >>>>>>>> gets lost
>>> >>> >>>>>>>> in the noise. While this is generally impossible to enforce
>>> >>> >>>>>>>> because
>>> >>> >>>>>>>> we can't
>>> >>> >>>>>>>> force all volunteers to conform to a process (or they might
>>> >>> >>>>>>>> not
>>> >>> >>>>>>>> even
>>> >>> >>>>>>>> be
>>> >>> >>>>>>>> aware of this),  those who are more familiar with the
>>> >>> >>>>>>>> project
>>> >>> >>>>>>>> can
>>> >>> >>>>>>>> help by
>>> >>> >>>>>>>> emailing the dev@ when they see something that hasn't been.
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> - Culture: The design doc author(s) should be open to
>>> >>> >>>>>>>> feedback.
>>> >>> >>>>>>>> A
>>> >>> >>>>>>>> design
>>> >>> >>>>>>>> doc should serve as the base for discussion and is by no
>>> >>> >>>>>>>> means
>>> >>> >>>>>>>> the
>>> >>> >>>>>>>> final
>>> >>> >>>>>>>> design. Of course, this does not mean the author has to
>>> >>> >>>>>>>> accept
>>> >>> >>>>>>>> every
>>> >>> >>>>>>>> feedback. They should also be comfortable accepting /
>>> >>> >>>>>>>> rejecting
>>> >>> >>>>>>>> ideas on
>>> >>> >>>>>>>> technical grounds.
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> - Process / Culture: For major ongoing projects, it can be
>>> >>> >>>>>>>> useful
>>> >>> >>>>>>>> to
>>> >>> >>>>>>>> have
>>> >>> >>>>>>>> some monthly Google hangouts that are open to the world. I
>>> >>> >>>>>>>> am
>>> >>> >>>>>>>> actually not
>>> >>> >>>>>>>> sure how well this will work, because of the volunteering
>>> >>> >>>>>>>> nature
>>> >>> >>>>>>>> and
>>> >>> >>>>>>>> we need
>>> >>> >>>>>>>> to adjust for timezones for people across the globe, but it
>>> >>> >>>>>>>> seems
>>> >>> >>>>>>>> worth
>>> >>> >>>>>>>> trying.
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> - Culture: Contributors (including committers) should be
>>> >>> >>>>>>>> more
>>> >>> >>>>>>>> direct
>>> >>> >>>>>>>> in
>>> >>> >>>>>>>> setting expectations, including whether they are working on
>>> >>> >>>>>>>> a
>>> >>> >>>>>>>> specific
>>> >>> >>>>>>>> issue, whether they will be working on a specific issue, and
>>> >>> >>>>>>>> whether
>>> >>> >>>>>>>> an
>>> >>> >>>>>>>> issue or pr or jira should be rejected. Most people I know
>>> >>> >>>>>>>> in
>>> >>> >>>>>>>> this
>>> >>> >>>>>>>> community
>>> >>> >>>>>>>> are nice and don't enjoy telling other people no, but it is
>>> >>> >>>>>>>> often
>>> >>> >>>>>>>> more
>>> >>> >>>>>>>> annoying to a contributor to not know anything than getting
>>> >>> >>>>>>>> a
>>> >>> >>>>>>>> no.
>>> >>> >>>>>>>>
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> On Fri, Oct 7, 2016 at 10:03 AM, Matei Zaharia
>>> >>> >>>>>>>> <[hidden email]>
>>> >>> >>>>>>>> wrote:
>>> >>> >>>>>>>>>
>>> >>> >>>>>>>>>
>>> >>> >>>>>>>>> Love the idea of a more visible "Spark Improvement
>>> >>> >>>>>>>>> Proposal"
>>> >>> >>>>>>>>> process that
>>> >>> >>>>>>>>> solicits user input on new APIs. For what it's worth, I
>>> >>> >>>>>>>>> don't
>>> >>> >>>>>>>>> think
>>> >>> >>>>>>>>> committers are trying to minimize their own work -- every
>>> >>> >>>>>>>>> committer
>>> >>> >>>>>>>>> cares
>>> >>> >>>>>>>>> about making the software useful for users. However, it is
>>> >>> >>>>>>>>> always
>>> >>> >>>>>>>>> hard to
>>> >>> >>>>>>>>> get user input and so it helps to have this kind of
>>> >>> >>>>>>>>> process.
>>> >>> >>>>>>>>> I've
>>> >>> >>>>>>>>> certainly
>>> >>> >>>>>>>>> looked at the *IPs a lot in other software I use just to
>>> >>> >>>>>>>>> see
>>> >>> >>>>>>>>> the
>>> >>> >>>>>>>>> biggest
>>> >>> >>>>>>>>> things on the roadmap.
>>> >>> >>>>>>>>>
>>> >>> >>>>>>>>> When you're talking about "changing interfaces", are you
>>> >>> >>>>>>>>> talking
>>> >>> >>>>>>>>> about
>>> >>> >>>>>>>>> public or internal APIs? I do think many people hate
>>> >>> >>>>>>>>> changing
>>> >>> >>>>>>>>> public APIs
>>> >>> >>>>>>>>> and I actually think that's for the best of the project.
>>> >>> >>>>>>>>> That's
>>> >>> >>>>>>>>> a
>>> >>> >>>>>>>>> technical
>>> >>> >>>>>>>>> debate, but basically, the worst thing when you're using a
>>> >>> >>>>>>>>> piece
>>> >>> >>>>>>>>> of
>>> >>> >>>>>>>>> software
>>> >>> >>>>>>>>> is that the developers constantly ask you to rewrite your
>>> >>> >>>>>>>>> app
>>> >>> >>>>>>>>> to
>>> >>> >>>>>>>>> update to a
>>> >>> >>>>>>>>> new version (and thus benefit from bug fixes, etc). Cue
>>> >>> >>>>>>>>> anyone
>>> >>> >>>>>>>>> who's used
>>> >>> >>>>>>>>> Protobuf, or Guava. The "let's get everyone to change their
>>> >>> >>>>>>>>> code
>>> >>> >>>>>>>>> this
>>> >>> >>>>>>>>> release" model works well within a single large company,
>>> >>> >>>>>>>>> but
>>> >>> >>>>>>>>> doesn't work
>>> >>> >>>>>>>>> well for a community, which is why nearly all *very* widely
>>> >>> >>>>>>>>> used
>>> >>> >>>>>>>>> programming
>>> >>> >>>>>>>>> interfaces (I'm talking things like Java standard library,
>>> >>> >>>>>>>>> Windows
>>> >>> >>>>>>>>> API, etc)
>>> >>> >>>>>>>>> almost *never* break backwards compatibility. All this is
>>> >>> >>>>>>>>> done
>>> >>> >>>>>>>>> within reason
>>> >>> >>>>>>>>> though, e.g. we do change things in major releases (2.x,
>>> >>> >>>>>>>>> 3.x,
>>> >>> >>>>>>>>> etc).
>>> >>> >>>>>>>>
>>> >>> >>>>>>>>
>>> >>> >>>>>>>>
>>> >>> >>>>>>>>
>>> >>> >>>>>>>
>>> >>> >>>>>>
>>> >>> >>>>>>
>>> >>> >>>>>>
>>> >>> >>>>>>
>>> >>> >>>>>> ---------------------------------------------------------------------
>>> >>> >>>>>> To unsubscribe e-mail: [hidden email]
>>> >>> >>>>>>
>>> >>> >>>>>
>>> >>> >>>>>
>>> >>> >>>>>
>>> >>> >>>>> --
>>> >>> >>>>> Stavros Kontopoulos
>>> >>> >>>>> Senior Software Engineer
>>> >>> >>>>> Lightbend, Inc.
>>> >>> >>>>> p:  +30 6977967274
>>> >>> >>>>> e: [hidden email]
>>> >>> >>>>>
>>> >>> >>>>>
>>> >>> >>>>
>>> >>> >>>
>>> >>> >>
>>> >>> >>
>>> >>>
>>> >>
>>> >
>>> >
>>>
>>>
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix

