Hi Eric,

On Fri, Feb 18, 2011 at 13:46, Eric Baldeschwieler
<eri...@yahoo-inc.com> wrote:
> Hi Bernd,
>
> Apache Hadoop is about scale. Most clusters will always be small, but Hadoop
> is going mainstream precisely because it scales to huge data and cluster
> sizes.
>
> There are lots of systems that work well on 10 node clusters. People select
> Hadoop because they are confident that as their business / problem grows,
> Hadoop can grow with it.

Please note that I did not say that Hadoop should not scale.
I know that winning Sorting contests is a great achievement and a huge
selling point.

I'm thinking along the lines of: How much scalability would the
majority of users be willing to trade for
a. more active committers (guess: 0%)
b. more regular releases
c. more non-scalability features (hot standby NN, security, younameit)

I myself, as a low-scale user, *would* trade a few percent for b. and c.

Thanks,

  Bernd

> ---
> E14 - via iPhone
>
> On Feb 17, 2011, at 7:25 AM, "Bernd Fondermann"
> <bernd.fonderm...@googlemail.com> wrote:
>
>> On Thu, Feb 17, 2011 at 14:58, Ian Holsman <had...@holsman.net> wrote:
>>> Hi Bernd.
>>>
>>> On Feb 17, 2011, at 7:43 AM, Bernd Fondermann wrote:
>>>>
>>>> We have the very unfortunate situation here at Hadoop where Apache
>>>> Hadoop is not the primary and foremost place of Hadoop development.
>>>> Instead, code is developed internally at Yahoo and then contributed in
>>>> (smaller or larger) chunks to Hadoop.
>>>
>>> This has been the situation in the past,
>>> but as you can see in the last month, this has changed.
>>>
>>> Yahoo! has publicly committed to move their development into the main code
>>> base, and you can see they have started doing this with the 20.100 branch,
>>> and their recent commits to trunk.
>>> Combine this with Nige taking on the 0.22 release branch, (and shepherding
>>> it into a stable release) and I think we are addressing your concerns.
>>>
>>> They have also started bringing the discussions back on the list, see the
>>> recent discussion about Jobtracker-nextgen Arun has re-started in
>>> MAPREDUCE-279.
>>>
>>> I'm not saying it's perfect, but I think the major players understand there
>>> is an issue, and they are *ALL* moving in the right direction.
>>
>> I enthusiastically would like to see your optimism be verified.
>> Maybe I'm misreading the statements issued publicly, but I don't think
>> that this is fully understood.
>> I agree though that it's a move in the
>> right direction.
>>
>>>> This is open source development upside down.
>>>> It is not ok for people to diff ASF svn against their internal code
>>>> and provide the diff as a patch without reviewing IP first for every
>>>> line of code changed.
>>>> For larger chunks I'd suggest to even go via the Incubator IP clearance
>>>> process.
>>>> Only then will we force committers to primarily work here in the open
>>>> and return to what I'd consider a healthy project.
>>>>
>>>> To be honest: Hadoop is in the process of falling apart.
>>>> Contrib Code gets moved out of Apache instead of being maintained here.
>>>> Discussions are seldom consensus-driven.
>>>> Release branches stagnate.
>>>
>>> True. Releases do take a long time. This is mainly due to it being
>>> extremely hard to test and verify that a release is stable.
>>> It's not enough to just run the thing on 4 machines, you need at least 50
>>> to test some of the major problems. This requires some serious $ for
>>> someone to verify.
>>
>> It has been proposed on the list before, IIRC. Don't know how to get
>> there, but the project seriously needs access to a cluster of this
>> size.
>>
>>>> Downstream projects like HBase don't get proper support.
>>>> Production setups are made from 3rd party distributions.
>>>> Development is not happening here, but elsewhere behind corporate doors.
>>>> Discussions about future developments are started on corporate blogs (
>>>> http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
>>>> ) instead of on the proper mailing list.
>>>> Hurdles for committing are way too high.
>>>> On the bright side, new committers and PMC members are added, this is
>>>> an improvement.
>>>>
>>>> I'd suggest to move away from relying on large code dumps from
>>>> corporations, and move back to the ASF-proven "individual committer
>>>> commits on trunk"-model where more committers can get involved.
>>>> If that means not to support high end cluster sizes for some months,
>>>> well, so be it.
>>>
>>>> Average committers cannot run - e.g. test - on high
>>>> end cluster sizes. If that would mean they cannot participate, then
>>>> the open source project better concentrate on small and medium sized
>>>> clusters instead.
>>>
>>>
>>> Well.. that's one approach.. but there are several companies out there who
>>> rely on apache's hadoop to power their large clusters, so I'd hate to see
>>> hadoop become something that only runs well on
>>> 10-nodes.. as I don't think that will help anyone either.
>>
>> But only looking at high-end scale doesn't help either.
>>
>> Let's face the fact that Hadoop is now moving from the early-adopter phase
>> into a much broader market. I predict that small to medium sized
>> clusters will be the majority of Hadoop deployments in a few months'
>> time. 4000, or even 500 machines is the high-end range. If the open
>> source project Hadoop cannot support those users adequately (without
>> becoming defunct), the committership might be better off to focus on
>> the low-end and medium sized users.
>>
>> I'm not suggesting to turn away from the handful (?) of high-end
>> users. They certainly have most valuable input. But also, *they*
>> obviously have the resources in terms of larger clusters and
>> developers to deal with their specific setups. Obviously, they don't
>> need to rely on the open source project to make releases. In fact,
>> they *do* work on their own Hadoop derivatives.
>> All the other users, the hundreds of boring small cluster users, don't
>> have that choice. They *depend* on the open source releases.
>>
>> Hadoop is an Apache project, to provide HDFS and MR free of charge to
>> the general public. Not only to me - nor to only one or two big
>> companies either.
>> Focus on all the users.
>>
>> Bernd
>