That is a great point. I have been meaning to set up the Jenkins build for branch-2 for a while, so I took the 10 mins and just did it.
https://builds.apache.org/job/Hadoop-Common-2-Commit/

Don't let the name fool you: it publishes not just Common, but HDFS, YARN, MR, and tools too. You should now have branch-2 SNAPSHOTs updated on each commit to branch-2. Feel free to bug me if you need more integration points. I am not an RE guy, but I can hack it to make things work :)

--Bobby

On 3/5/13 12:15 AM, "Konstantin Boudnik" <[email protected]> wrote:

>Arun,
>
>First of all, I don't think anyone is trying to put the blame on
>someone else. E.g. I had a similar experience with Oozie being broken
>by certain released changes in the upstream.
>
>I am sure that most people in the BigTop community - especially those
>who share committership privileges in BigTop and other upstream
>projects, including Hadoop - would be happy to help with the
>stabilization of the Hadoop base. The issue that a downstream
>integration project is likely to have is - for one - the absence of
>regularly published development artifacts. In the spirit of "it didn't
>happen if there's no picture", here are a couple of examples:
>
> - 2.0.2-SNAPSHOT artifacts weren't published at all; only release
>   2.0.2-alpha artifacts were
> - 2.0.3-SNAPSHOT wasn't published until Feb 29, 2013 (it happened
>   just once)
>
>So, technically speaking, unless an integration project is willing to
>build and maintain its own artifacts, it is impossible to do any
>preventive validation.
>
>Which brings me to my next question: how do you guys address
>"Integration is high on the list of *every* release"? Again, please
>don't get me wrong - I am not looking to lay blame on or corner
>anyone - I am genuinely curious and would appreciate the input.
>
>
>Vinod:
>
>> As you yourself noted later, the pain is part of the 'alpha' status
>> of the release. We are targeting one of the immediate future
>> releases to be a beta, and so these troubles are really only the
>> short term.
>
>I don't really want to get into a discussion about what constitutes
>the alpha and how it has delayed the adoption of the Hadoop 2 line.
>However, I want to point out that it is especially important for an
>"alpha" platform to work nicely with downstream consumers of said
>platform. For quite obvious reasons, I believe.
>
>> I think there is a fundamental problem with the interaction of
>> Bigtop with the downstream projects, if nothing else, with
>
>BigTop is as downstream as it can get, because BigTop essentially
>consumes all other component releases in order to produce a viable
>stack. Technicalities aside...
>
>> Hadoop. We never formalized the process: will BigTop step in
>> after an RC is up for vote, or before? As I see it, it's happening
>
>BigTop can essentially give any component, including Hadoop - and
>better yet, the set of components - certain guarantees about
>compatibility and dependencies being included. A case in point is the
>commons libraries missed in the 1.0.1 release, which essentially
>prevented HBase from working properly.
>
>> after the vote is up, so no wonder we are in this state. Shall we
>> have a pre-notice to Bigtop so that it can step in before?
>
>The above contradicts the earlier statement that "Integration is high
>on the list of *every* release". If BigTop isn't used for integration
>testing, then how is said integration testing performed? Is it some
>sort of test-patch process, as Luke mentioned earlier? And why does it
>leave room for integration issues to go uncaught? Again, I am
>genuinely interested to know.
>
>> these short term pains. I'd rather like us to swim through these now
>> instead of supporting broken APIs and features in our beta, having
>> seen this very thing happen with 1.*.
>
>I think you're mixing up the point of integration with downstream and
>being in an alpha phase of development.
>The former isn't about supporting "broken APIs" - it is about being
>consistent and avoiding breaking the downstream applications without
>letting said applications accommodate the platform changes first.
>
>Changes in the API, after all, can be relatively easily traced by
>integration validation - this is the whole point of integration
>testing. And BigTop does the job better than anything around, simply
>because there's nothing else around to do it.
>
>If you stay in a shape-shifting "alpha" that doesn't integrate well
>for a very long time, you risk losing downstream customers' interest,
>because they might get tired of waiting until the next stable API is
>ready for them.
>
>> Let's fix the way the release-related communication is happening
>> across our projects so that we can all work together and make 2.X a
>> success.
>
>This is a very good point indeed! Let's start a separate discussion
>thread on how we can improve the release model for the coming Hadoop
>releases, where we - as the community - can provide better guarantees
>of inter-component compatibility (sorry for an overused word).
>
>Cos
>
>On Fri, Mar 01, 2013 at 10:58AM, Arun C Murthy wrote:
>> I feel this is being blown out of proportion.
>>
>> Integration is high on the list of *every* release. In the future,
>> if anyone or BigTop wants to help, running integration tests on a
>> Hadoop RC and providing feedback would be very welcome. I'm pretty
>> sure I would stop an RC and re-spin it if it broke Oozie or HBase or
>> Pig or Hive. For example, see the recent efforts to do a
>> 2.0.4-alpha.
>>
>> With hadoop-2.0.3-alpha we discovered 3 *bugs* - making it sound
>> like we intentionally disregard integration issues is very harsh.
>>
>> Please also see the other thread where we discussed stabilizing
>> APIs, protocols etc. for the next 'beta' release.
>>
>> Arun
>>
>> On Feb 26, 2013, at 5:43 PM, Roman Shaposhnik wrote:
>>
>> > Hi!
>> >
>> > For the past couple of releases of the Hadoop 2.X code line, the
>> > issue of integration between Hadoop and its downstream projects
>> > has become quite thorny. The poster child here is Oozie, where
>> > every release of Hadoop 2.X seems to break compatibility in
>> > various unpredictable ways. At times other components (such as
>> > HBase, for example) also seem to be affected.
>> >
>> > Now, to be extremely clear -- I'm NOT talking about the *latest*
>> > version of Oozie working with the *latest* version of Hadoop;
>> > instead, my observations come from running previous *stable*
>> > releases of Bigtop on top of Hadoop 2.X RCs.
>> >
>> > As many of you know, Apache Bigtop aims at providing a single
>> > platform for the integration of Hadoop and Hadoop ecosystem
>> > projects. As such, we're uniquely positioned to track
>> > compatibility between different Hadoop releases with regard to
>> > the downstream components (things like Oozie, Pig, Hive, Mahout,
>> > etc.). For every single RC we've been pretty diligent about
>> > trying to provide integration-level feedback on the quality of
>> > the upcoming release, but it seems that our efforts don't quite
>> > suffice in stabilizing Hadoop 2.X.
>> >
>> > Of course, one could argue that while the Hadoop 2.X code line
>> > was designated 'alpha', expecting much in the way of perfect
>> > integration and compatibility was NOT what the Hadoop community
>> > was focusing on. I can appreciate that view, but what I'm
>> > interested in is the future of Hadoop 2.X, not its past. Hence,
>> > here's my question to all of you as the Hadoop community at
>> > large:
>> >
>> > Do you think that the project has reached a point where
>> > integration and compatibility issues should be prioritized really
>> > high on the list of things that make or break each future
>> > release?
>> >
>> > The good news is that Bigtop's charter is in big part *exactly*
>> > about providing you with this kind of feedback.
>> > We can easily tell you when Hadoop behavior, with regard to
>> > downstream components, changes between a previous stable release
>> > and the new RC (or even branch/trunk). What we can NOT do is
>> > submit patches for all the issues. We are simply too small a
>> > project, and we need your help with that.
>> >
>> > I truly believe that we owe it to the downstream projects, and in
>> > the second half of this email I will try to convince you of that.
>> >
>> > We all know that integration projects are impossible to pull off
>> > unless there's a general consensus among all of the projects
>> > involved that they indeed need to work with each other. You can
>> > NOT force that notion, but you can always try to influence it.
>> > This relationship goes both ways.
>> >
>> > Consider the question in front of the downstream communities of
>> > whether or not to adopt Hadoop 2.X as the basis. To answer that
>> > question, each downstream project has to be reasonably sure that
>> > their concerns will NOT fall on deaf ears and that Hadoop
>> > developers are, essentially, 'ready' for them to pick up Hadoop
>> > 2.X. I would argue that so far the Hadoop community has gone out
>> > of its way to signal that the 2.X codeline is NOT ready for the
>> > downstream.
>> >
>> > I would argue that moving forward this is a really unfortunate
>> > situation that may end up undermining the long-term success of
>> > Hadoop 2.X if we don't start addressing the problem. Think about
>> > it -- 90% of the unit tests that run downstream on Apache
>> > infrastructure are still exercising Hadoop 1.X underneath. In
>> > fact, if you were to forcefully make, let's say, HBase's unit
>> > tests run on top of Hadoop 2.X, quite a few of them are going to
>> > fail. The Hadoop community is, in effect, cutting itself off from
>> > its biggest source of feedback -- its downstream users. This in
>> > turn:
>> >
>> > * leaves the Hadoop project in a perpetual state of broken
>> >   windows syndrome.
>> >
>> > * leaves Apache Hadoop 2.X releases in a state considerably
>> >   inferior to the vendor releases that *include* Apache Hadoop.
>> >   Users have no choice but to align themselves with vendor
>> >   offerings if they wish to utilize the latest Hadoop
>> >   functionality. The artifact known as Apache Hadoop 2.X has
>> >   stopped being a viable choice, thus fracturing the user
>> >   community and reducing the benefits of a commonly deployed
>> >   codebase.
>> >
>> > * leaves the downstream projects of Hadoop in a jaded state where
>> >   they legitimately get very discouraged and frustrated and
>> >   eventually give up, thinking: well, we work with one release of
>> >   Hadoop (the stable Hadoop 1.X) and we shall wait for the Hadoop
>> >   community to get its act together.
>> >
>> > In my view (shared by quite a few members of Apache Bigtop), we
>> > can definitely do better than this if we all agree that the
>> > proposed first 'beta' release of Hadoop 2.0.4 is the right time
>> > for it to happen.
>> >
>> > It is about time the Hadoop 2.X community won back all those end
>> > users and downstream projects that got left behind during the
>> > alpha stabilization phase.
>> >
>> > Thanks,
>> > Roman.
>>
>> --
>> Arun C. Murthy
>> Hortonworks Inc.
>> http://hortonworks.com/
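For reference, the branch-2 SNAPSHOT artifacts that Bobby's Jenkins job publishes are the kind of development artifacts Cos says a downstream project needs. A minimal sketch of how a downstream pom could consume them follows; the repository URL is the standard Apache snapshots location, while the `2.0.4-SNAPSHOT` version and the choice of `hadoop-common` are illustrative assumptions, not details taken from this thread.

```xml
<!-- Sketch of a downstream pom.xml fragment. Assumptions: the
     standard Apache snapshots repository URL, and 2.0.4-SNAPSHOT as
     an example branch-2 development version. -->
<repositories>
  <repository>
    <id>apache.snapshots</id>
    <name>Apache Snapshots Repository</name>
    <url>https://repository.apache.org/content/repositories/snapshots/</url>
    <!-- Pull snapshots only from here; releases come from Central. -->
    <releases>
      <enabled>false</enabled>
    </releases>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <!-- Example version; use the branch-2 SNAPSHOT actually published. -->
    <version>2.0.4-SNAPSHOT</version>
  </dependency>
</dependencies>
```

Running `mvn -U verify` against such a pom forces Maven to check the snapshots repository for artifacts updated since the last commit-triggered publish, which is what makes per-commit preventive validation possible at all.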
