Hey Roman, I don't want to hijack your topic with tech talk if it detracts from your primary purpose, so please redirect me as you see fit.
There are a lot of details, but it seems the primary problem is that changes that "break" other code are introduced without being specifically included. It's one thing to optimize a function and keep all of the functionality the same; no segregation should be required. But if you're going to change the functionality of an API call or an existing routine, it should be sectioned off and specifically included (we've got plenty of XML) until the old way has been deprecated for 2 or 3 releases (see the sketch at the bottom of this mail). I have some ideas, but I want to make sure this is the right forum. :)

Thanks for all of your work.

Chris

On Tue, Feb 26, 2013 at 8:43 PM, Roman Shaposhnik <[email protected]> wrote:

> Hi!
>
> For the past couple of releases of the Hadoop 2.X code line, the issue
> of integration between Hadoop and its downstream projects has
> become quite thorny. The poster child here is Oozie, where
> every release of Hadoop 2.X seems to break compatibility
> in various unpredictable ways. At times other components (such
> as HBase, for example) also seem to be affected.
>
> Now, to be extremely clear -- I'm NOT talking about the *latest* version
> of Oozie working with the *latest* version of Hadoop; instead,
> my observations come from running previous *stable* releases
> of Bigtop on top of Hadoop 2.X RCs.
>
> As many of you know, Apache Bigtop aims at providing a single
> platform for integration of Hadoop and Hadoop ecosystem projects.
> As such, we're uniquely positioned to track compatibility between
> different Hadoop releases with regard to the downstream components
> (things like Oozie, Pig, Hive, Mahout, etc.). For every single RC
> we've been pretty diligent about providing integration-level feedback
> on the quality of the upcoming release, but it seems that our efforts
> don't quite suffice to stabilize Hadoop 2.X.
>
> Of course, one could argue that while the Hadoop 2.X code line was
> designated 'alpha', expecting much in the way of perfect integration
> and compatibility was NOT what the Hadoop community was
> focusing on. I can appreciate that view, but what I'm interested in
> is the future of Hadoop 2.X, not its past. Hence, here's my question
> to all of you as the Hadoop community at large:
>
> Do you guys think that the project has reached a point where integration
> and compatibility issues should be prioritized really high on the list
> of things that make or break each future release?
>
> The good news is that Bigtop's charter is in large part *exactly* about
> providing you with this kind of feedback. We can easily tell you when
> Hadoop behavior, with regard to downstream components, changes
> between a previous stable release and the new RC (or even branch/trunk).
> What we can NOT do is submit patches for all the issues. We are simply
> too small a project, and we need your help with that.
>
> I truly believe that we owe it to the downstream projects, and in the
> second half of this email I will try to convince you of that.
>
> We all know that integration projects are impossible to pull off
> unless there's a general consensus between all of the projects involved
> that they indeed need to work with each other. You can NOT force
> that notion, but you can always try to influence it. This relationship
> goes both ways.
>
> Consider the question in front of the downstream communities
> of whether or not to adopt Hadoop 2.X as their basis. To answer
> that question, each downstream project has to be reasonably
> sure that their concerns will NOT fall on deaf ears and that
> Hadoop developers are, essentially, 'ready' for them to pick
> up Hadoop 2.X. I would argue that so far the Hadoop community
> has gone out of its way to signal that the 2.X code line is NOT
> ready for the downstream.
>
> Moving forward, this is a really unfortunate situation that may
> end up undermining the long-term success of Hadoop 2.X if we
> don't start addressing the problem. Think about it -- 90% of the
> unit tests that run downstream on Apache infrastructure are still
> exercising Hadoop 1.X underneath. In fact, if you were to
> forcefully make, let's say, HBase's unit tests run on top of
> Hadoop 2.X, quite a few of them are going to fail. The Hadoop
> community is, in effect, cutting itself off from its biggest
> source of feedback -- its downstream users. This in turn:
>
> * leaves the Hadoop project in a perpetual state of
> broken-windows syndrome.
>
> * leaves Apache Hadoop 2.X releases in a state considerably
> inferior to the vendors' releases that *include* Apache Hadoop.
> Users have no choice but to align themselves with vendor
> offerings if they wish to utilize the latest Hadoop functionality.
> The artifact known as Apache Hadoop 2.X has stopped being
> a viable choice, thus fracturing the user community and reducing
> the benefits of a commonly deployed codebase.
>
> * leaves downstream projects of Hadoop in a jaded state where
> they legitimately get very discouraged and frustrated and eventually
> give up, thinking: well, we work with one release of Hadoop
> (the stable Hadoop 1.X) and we shall wait for the Hadoop
> community to get its act together.
>
> In my view (shared by quite a few members of the Apache Bigtop
> community), we can definitely do better than this if we all agree
> that the proposed first 'beta' release, Hadoop 2.0.4, is the right
> time for it to happen.
>
> It is about time the Hadoop 2.X community won back all those end users
> and downstream projects that got left behind during the alpha
> stabilization phase.
>
> Thanks,
> Roman.
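P.S. To make the "sectioning off" idea above concrete, here's a rough
sketch of what I mean, in Java. All of the class and method names here
are made up for illustration -- this is not actual Hadoop API:

// Hypothetical sketch only: FileClient, Handle, open(), and
// openWithChecksums() are invented names, not real Hadoop code.
public class FileClient {

    /** Stand-in result type, just to keep the sketch self-contained. */
    public static class Handle { }

    /**
     * Old call: functionality stays exactly the same, so nothing
     * downstream breaks.
     *
     * @deprecated use {@link #openWithChecksums(String)} instead;
     *             slated for removal after 2 or 3 releases.
     */
    @Deprecated
    public Handle open(String path) {
        return new Handle(); // legacy code path, left untouched
    }

    /**
     * The changed behavior is sectioned off behind a new entry point
     * that callers have to specifically opt into.
     */
    public Handle openWithChecksums(String path) {
        Handle handle = new Handle();
        // ... changed functionality goes here ...
        return handle;
    }
}

Old callers keep compiling (with a deprecation warning), new callers opt
in explicitly, and the old path gets removed on a published schedule
rather than out from under anyone.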
