Re: [VOTE] Abandon hdfsproxy HDFS contrib

Bernd Fondermann Fri, 18 Feb 2011 05:21:18 -0800

Hi Eric,

On Fri, Feb 18, 2011 at 13:46, Eric Baldeschwieler <eri...@yahoo-inc.com> wrote:
> Hi Bernd,
>
> Apache Hadoop is about scale. Most clusters will always be small, but Hadoop 
> is going mainstream precisely because it scales to huge data and cluster 
> sizes.
>
> There are lots of systems that work well on 10 node clusters. People select   
> Hadoop because they are confident that as their business / problem grows, 
> Hadoop can grow with it.


Please note that I did not say that Hadoop should not scale.
I know that winning sorting contests is a great achievement and a huge
selling point.

I'm thinking along the lines of: How much scalability would the
majority of users be willing to trade for
a. more active committers (guess: 0%)
b. more regular releases
c. more non-scalability features (hot standby NN, security, you name it)

I myself, as a low-scale user, *would* trade a few percent for b. and c.

Thanks,

  Bernd

> ---
> E14 - via iPhone
>
> On Feb 17, 2011, at 7:25 AM, "Bernd Fondermann" 
> <bernd.fonderm...@googlemail.com> wrote:
>
>> On Thu, Feb 17, 2011 at 14:58, Ian Holsman <had...@holsman.net> wrote:
>>> Hi Bernd.
>>>
>>> On Feb 17, 2011, at 7:43 AM, Bernd Fondermann wrote:
>>>>
>>>> We have the very unfortunate situation here at Hadoop where Apache
>>>> Hadoop is not the primary and foremost place of Hadoop development.
>>>> Instead, code is developed internally at Yahoo and then contributed in
>>>> (smaller or larger) chunks to Hadoop.
>>>
>>> This has been the situation in the past,
>>> but as you can see in the last month, this has changed.
>>>
>>> Yahoo! has publicly committed to move their development into the main code 
>>> base, and you can see they have started doing this with the 20.100 branch,
>>> and their recent commits to trunk.
>>> Combine this with Nige taking on the 0.22 release branch (and shepherding
>>> it into a stable release), and I think we are addressing your concerns.
>>>
>>> They have also started bringing the discussions back on the list, see the 
>>> recent discussion about Jobtracker-nextgen Arun has re-started in 
>>> MAPREDUCE-279.
>>>
>>> I'm not saying it's perfect, but I think the major players understand there 
>>> is an issue, and they are *ALL* moving in the right direction.
>>
>> I would very much like to see your optimism confirmed.
>> Maybe I'm misreading the statements issued publicly, but I don't think
>> that this is fully understood. I agree though that it's a move in
>> the right direction.
>>
>>>> This is open source development upside down.
>>>> It is not ok for people to diff ASF svn against their internal code
>>>> and provide the diff as a patch without reviewing IP first for every
>>>> line of code changed.
>>>> For larger chunks I'd suggest to even go via the Incubator IP clearance 
>>>> process.
>>>> Only then will we force committers to primarily work here in the open
>>>> and return to what I'd consider a healthy project.
>>>>
>>>> To be honest: Hadoop is in the process of falling apart.
>>>> Contrib Code gets moved out of Apache instead of being maintained here.
>>>> Discussions are seldom consensus-driven.
>>>> Release branches stagnate.
>>>
>>> True. releases do take a long time. This is mainly due to it being 
>>> extremely hard to test and verify that a release is stable.
>>> It's not enough to just run the thing on 4 machines, you need at least 50 
>>> to test some of the major problems. This requires some serious $ for 
>>> someone to verify.
>>
>> It has been proposed on the list before, IIRC. Don't know how to get
>> there, but the project seriously needs access to a cluster of this
>> size.
>>
>>>> Downstream projects like HBase don't get proper support.
>>>> Production setups are made from 3rd party distributions.
>>>> Development is not happening here, but elsewhere behind corporate doors.
>>>> Discussions about future developments are started on corporate blogs (
>>>> http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
>>>> ) instead of on the proper mailing list.
>>>> Hurdles for committing are way too high.
>>>> On the bright side, new committers and PMC members are added, this is
>>>> an improvement.
>>>>
>>>> I'd suggest to move away from relying on large code dumps from
>>>> corporations, and move back to the ASF-proven "individual committer
>>>> commits on trunk"-model where more committers can get involved.
>>>> If that means not to support high end cluster sizes for some months,
>>>> well, so be it.
>>>
>>>> Average committers cannot run - e.g. test - on high
>>>> end cluster sizes. If that would mean they cannot participate, then
>>>> the open source project better concentrate on small and medium sized
>>>> clusters instead.
>>>
>>>
>>> Well.. that's one approach.. but there are several companies out there who 
>>> rely on apache's hadoop to power their large clusters, so I'd hate to see 
>>> hadoop become something that only runs well on
>>> 10-nodes.. as I don't think that will help anyone either.
>>
>> But only looking at high-end scale doesn't help either.
>>
>> Let's face the fact that Hadoop is now moving from the early adopters phase
>> into a much broader market. I predict that small to medium sized
>> clusters will be the majority of Hadoop deployments in a few months'
>> time. 4000, or even 500 machines is the high-end range. If the open
>> source project Hadoop cannot support those users adequately (without
>> becoming defunct), the committership might be better off to focus on
>> the low-end and medium sized users.
>>
>> I'm not suggesting to turn away from the handful (?) of high-end
>> users. They certainly have most valuable input. But also, *they*
>> obviously have the resources in terms of larger clusters and
>> developers to deal with their specific setups. Obviously, they don't
>> need to rely on the open source project to make releases. In fact,
>> they *do* work on their own Hadoop derivatives.
>> All the other users, the hundreds of boring small cluster users, don't
>> have that choice. They *depend* on the open source releases.
>>
>> Hadoop is an Apache project, to provide HDFS and MR free of charge to
>> the general public. Not only to me - nor to only one or two big
>> companies either.
>> Focus on all the users.
>>
>>  Bernd
>
