Re: [REMINDER] Technical discussion on the mailing list

Eugene Kirpichov Thu, 06 Oct 2016 10:59:34 -0700

Hi Daniel,

Thanks for raising this. I think I was a major contributor to your
frustration with the process by suggesting big changes to your IO PR.


As others have said, ideally that process should have gone differently: 1)
we should have had documentation on the best practices in developing IOs,
and 2) I should not have requested you to do the AutoValue conversion in
your PR because that (using AutoValue for PTransform builders) was a new
idea at the time and it was unreasonable to put the burden of implementing
it on you.

I'm very much in favor of improving that process, e.g. by looking at
recently submitted IOs and putting the essence of discussion on those PRs
together into a best-practices document (this document, when it appears,
should *absolutely* be discussed on the mailing list!); and I should have
at least mentioned the idea "consider using it for PTransform builders too"
on the mailing list, because even though we already use AutoValue in many
places, we didn't use it for PTransforms, so that is a new technical idea,
even if not a new concept. This was an example of a technical discussion
that happened off list, and thanks to JB for reflecting it back onto the
list. Indeed, it takes some habit-building time to learn to do this
consistently.

***

Now, for SplittableDoFn in particular - I just looked through all the 171
comments on the PR [https://github.com/apache/incubator-beam/pull/896] and
I honestly don't think there's anything there that merited a broad
discussion on the list - the API is nearly exactly the same as was proposed
on the mailing list, and the comments are about style nitpicks or
implementation details relevant only to people actively working on the
involved classes. Weekly updates, if I provided them, would have been of
the form "Still addressing comments / refactoring $x / extracting part $y
into a separate PR / adding tests". The large number of comments is because
this is a large PR and there's a lot of gritty details there to get right,
but none of them are high-level - the high-level design has been finalized
already.

The PR is taking so long also because I've spun off several smaller PRs
making focused changes to individual subsystems (e.g. DoFnReflector), which
I think also do not merit being discussed on the mailing list, because they
don't change the technical direction of the project, these are deep
implementation-detail classes, there's only 2-3 people involved in
developing or using them (including myself), and all these people are
listed as reviewers.

Please let me know if you feel differently - as the person driving SDF, I
really would like to be sure that the community is satisfied with the
communication around it.

On Thu, Oct 6, 2016 at 3:05 AM Daniel Kulp <dk...@apache.org> wrote:

> At the end of the day, it comes down to two questions:
>
> 1) Are there technical and project direction discussions happening off
> list and not reflected back to the list?
>
> 2) If yes, are the concrete decisions being made as a result of the off
> list discussions?
>
>
> The answer to #1 is a definite yes.   From a quick perusal, several of the
> pull requests that have more than 10-15 comments is a discussion that
> should be back here.  Look at the Metrics PR, the splittable DoFn pr, etc…
> There are discussions there that NEED to be coming back to this list one
> way or another.   For things like the Splittable DoFn that is a long
> running discussion, it would most likely need periodic summaries/updates.
> Trying to follow everything going on in that PR is impossible with the
> comments on the outdated commits and such.
>
> Because of the scale of the #1 problem, I’m unsure of the answer to #2,
> but my gut feeling is yes.  If the off list discussions are resulting in
> technical changes that people don’t know about and cannot object to or
> comment on, then there is a problem.
>
> From an Apache standpoint, we have to get the answer to BOTH questions to
> a “no” state.   That’s a requirement.   The question is how do we get there?
>
> Dan
>
>
>
> > On Oct 6, 2016, at 8:29 AM, Davor Bonaci <da...@google.com.INVALID>
> wrote:
> >
> > Daniel, so glad you are starting to contribute to Beam! It was great
> > talking with you in person back in May. Welcome!
> >
> > --
> >
> > There are lots of different things mentioned here; I'll try to address
> them
> > separately.
> >
> > The first use of AutoValue should have been discussed on the dev@
> mailing
> > list. I think the main reason for the discussion is a bit different --
> > AutoValue has a non-trivial tradeoff -- compile complexity vs.
> boilerplate
> > code. For example, AutoValue may degrade IDE experience for some
> > contributors. If we'd go in depth on this, I'm sure we'd find opposing
> > opinions on the use of AutoValue. This tradeoff should have been
> discussed
> > on the dev@ list, followed by a community decision.
> >
> > Note that this has happened *before* the JdbcIO work. Since AutoValue has
> > been already used elsewhere in the project, there was no real reason not
> to
> > use it in JdbcIO, as appropriate. Therefore, I think JB and Eugene did
> > everything right! Second, third or thousandth usage of a concept doesn't
> > require any particular discussion. They didn't make anything worse. Their
> > discussion is totally appropriate for code review.
> >
> > Now, as Daniel points out, I think it is not right to ask a contributor
> to
> > change his PR to use AutoValue when none of the existing IO connectors
> use
> > it. This is making a too high standard. In fact, it is desirable for new
> > contributions to follow already established patterns, instead of
> inventing
> > something new. If we want to change a pattern, we should do it as a
> > separate effort across the board.
> >
> > On the other hand, dev@ discussion wouldn't have helped to prevent
> review
> > comments/discussions. Let's say we have had the discussion, and a new
> > contributor comes a year later. Should we ask her to read all discussions
> > that ever happened in the project to learn everything she might need? Of
> > course not! She should follow already established patterns and learn any
> > specifics during code review. And then, best practices should be
> documented
> > on the website.
> >
> > To summarize, a few things could have been better:
> > * Discussion of the first use of AutoValue on dev@.
> > * Avoiding overzealous core review comments.
> > * Changing a pattern should have been done by filing several starter
> tasks
> > in JIRA.
> >
> > --
> >
> > There are also several different proposals for altering a part of the
> > workflow.
> >
> >> code review comments not making to a list
> >
> > We have >1000 PRs so far, with at least a dozen comments on average, with
> > pace increasing. This is >10,000 emails, most of which are "fix a typo".
> > This leads into information overload, with actual information being
> missed.
> >
> > If someone wants this extra information -- just clicking the Watch button
> > in the GitHub UI will make it happen!
> >
> >> creating new JIRA and opening PR to dev@
> >
> > These currently go to commits@. This would have resulted in another
> 1,700
> > email threads compared to <150 now.
> >
> > Generally speaking, *all* of this is already available to anyone who
> wants
> > to receive it. However, anyone I know that has tried, has given up very
> > quickly ;). If anybody is concerned, we can create several new lists for
> > this traffic -- but we shouldn't repurpose dev@ for it.
> >
> >> I feel I’m missing things as there is significant amount of things not
> > happening on a list
> >
> > I think "feeling of missing things" is totally valid. I feel that too, as
> > well as almost everybody else.
> >
> > My best answer is -- we should realize that we are an extremely large and
> > complex project, with >100 contributors and >20 people working on it full
> > time. Nobody can follow every SDK, every runner, every IO connector,
> every
> > pull request, every comment that all contributors make each and every
> day.
> >
> > While nobody can follow everything, everything is being followed by
> > multiple people. And, we need to be accountable to each other to surface
> > everything relevant to the dev@ list. And I believe that is already
> > happening the vast majority of time. This is just one example where it
> > didn't happen.
> >
> > --
> >
> > All that said, there are certainly areas for improvement. If anyone has
> > specific ideas, please reach out! I'd love to discuss them in detail and
> > propose improvements to the wider community.
> >
> > Thanks!
> >
> >
> > On Wed, Oct 5, 2016 at 6:16 PM, Thomas Weise <t...@apache.org> wrote:
> >>
> >> How about sending just the notifications for creating new JIRA and
> opening
> >> PR to dev@ so that those that are interested can start watching?
> >>
> >> Thanks,
> >> Thomas
> >>
> >> On Wed, Oct 5, 2016 at 5:33 PM, Dan Halperin
> <dhalp...@google.com.invalid>
> >> wrote:
> >>
> >>> On Wed, Oct 5, 2016 at 5:13 PM, Daniel Kulp <dk...@apache.org> wrote:
> >>>
> >>>> I just want to give a little more context to this….  I’ve been
> > lurking on
> >>>> this list for several months now reading everything that’s going on.
> >>> From
> >>>> Apache’s standpoint, that should be a “very good start” for getting to
> >>> know
> >>>> what is happening in a project.
> >>>>
> >>>> On my last PR, Eugene commented about using the AutoValue pattern for
> >>> part
> >>>> of it which caught me off guard.   None of the other IO’s in master
> > were
> >>>> using it, there wasn’t any discussion on this list about it, I had no
> >>> idea
> >>>> what it was about.   So I asked JB to make sure I hadn’t missed
> > anything.
> >>>>
> >>>
> >>>> Anyway, this is one of the main concerns I have with Beam’s PR work
> > flow,
> >>>> I feel I’m missing things as there is significant amount of things not
> >>>> happening on a list.   The initial pull request is going to the
> > commits
> >>>> list (ok, would prefer the dev list, but at least its on a list).
> >>> However,
> >>>> none of the comments or discussions or anything that is occurring as
> > part
> >>>> of the review is making it to any list.   The only people that “learn”
> >>> from
> >>>> the reviews are the reviewers and the person who initiated the PR
> > unless
> >>>> they go into each and every PR and read the comments (and find the
> > news
> >>>> ones and such).    With my Apache hat on, this bothers me.
> >>>
> >>>
> >>> Anyway, I don’t really understand why the full github integration
> wasn’t
> >>>> setup for the beam PR’s so that the comments would come back to the
> > lists
> >>>> as well (and JIRA, BTW).
> >>>>
> >>>
> >>> This part confuses me. We've been told that discussions on JIRA, even
> >>> though they are emailed to the mailing lists, don't count as happening
> > on
> >>> the mailing list. So why would github integration be helpful vs just
> > more
> >>> spam?
> >>>
> >>> As another example, the comments on PR1003 are very applicable to
> anyone
> >>>> looking into writing IO’s and they could learn about some of the “best
> >>>> practices” presented there.
> >>>
> >>>
> >>> As Beam grows during its incubation, we are moving a lot of knowledge
> to
> >>> documentation, but yes -- right now, most of the I/O related practices
> > live
> >>> in Eugene's and my head (and now, JB's!). We're working on it, and hope
> > to
> >>> dramatically improve documentation for source authors in the next
> > quarter.
> >>>
> >>> For AutoValue specifically, this is by no means codified and it is
> >>> DEFINITELY not mandatory. Eugene and JB just experimented with it in
> the
> >>> last few days and decided it was useful in a few cases. We do (or did,
> >>> before this thread) need to have an actual discussion on the mailing
> > list
> >>> before moving forward further towards making it policy.
> >>>
> >>> Right now Ben Chambers is trying to apply AutoValue in places that need
> >>> templated types and struggling with multiple ?s, so the discussion may
> > need
> >>> to continue! ...
> >>>
> >>> Thanks,
> >>> Dan
> >>>
> >>> That’s basically the background as to why JB sent this.  I was confused
> > and
> >>>> bugged him.   :-)
> >>>>
> >>>> Dan
> >>>>
> >>>>
> >>>>
> >>>>> On Oct 5, 2016, at 1:51 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> >>>> wrote:
> >>>>>
> >>>>> Hi team,
> >>>>>
> >>>>> I would like to excuse myself to have forgotten to discuss and share
> >>>> with you a technical point and generally speaking do a small reminder.
> >>>>>
> >>>>> When we work with Eugene on the JdbcIO, we experimented AutoValue to
> >>>> deal with IO configuration. AutoValue provides a nice way to reduce
> > and
> >>>> limit the boilerplate code required by the IO configuration.
> >>>>> We used AutoValue in JdbcIO and, regarding the good improvements we
> >>> saw,
> >>>> we started to refactor the other IOs.
> >>>>>
> >>>>> The use of AutoValue should have been notice and discussed on the
> >>>> mailing list.
> >>>>>
> >>>>> "If it doesn't exist on the mailing list, it doesn't exist at all."
> >>>>>
> >>>>> So, any comment happening on a GitHub pull request, or discussion on
> >>>> hangouts which can impact the project (generally speaking) has to
> > happen
> >>> on
> >>>> the mailing list.
> >>>>>
> >>>>> It provides project transparency and facilitates the new
> > contribution
> >>>> onboarding.
> >>>>>
> >>>>> Thanks !
> >>>>>
> >>>>> Regards
> >>>>> JB
> >>>>>
> >>>>
> >>>> --
> >>>> Daniel Kulp
> >>>> dk...@apache.org <mailto:dk...@apache.org> - http://dankulp.com/blog
> <
> >>>> http://dankulp.com/blog>
> >>>> Talend Community Coder - http://coders.talend.com <
> >>>> http://coders.talend.com/>
> >>>>
> >>>
>
> --
> Daniel Kulp
> dk...@apache.org <mailto:dk...@apache.org> - http://dankulp.com/blog <
> http://dankulp.com/blog>
> Talend Community Coder - http://coders.talend.com <
> http://coders.talend.com/>
>

Re: [REMINDER] Technical discussion on the mailing list

Reply via email to