"Let's see if we can be smart and define the landscape"

Well put, @cos... I think Roman's point was that it would be hard, not that it
would be bad. And I think you're both right: is it hard? Yes. Is it
worthwhile? Possibly. As a next step, we will all have to get in a room and
think about this face to face.
Let's shoot for a meetup after January in California, where we can plan the
future direction of Bigtop. In the meantime, I hope to hear more opinions on
this.

> On Dec 8, 2014, at 3:23 PM, Konstantin Boudnik <[email protected]> wrote:
>
> First I want to address RJ's question:
>
> The most prominent downstream Bigtop dependency would be any commercial
> Hadoop distribution like HDP and CDH. The former is trying to
> disguise its affiliation by pushing Ambari forward, and Cloudera is seemingly
> shifting its focus to compressed tarball media (aka parcels), which require
> a closed-source solution like Cloudera Manager to deploy and control your
> cluster, effectively rendering it useless if you ever decide to uninstall the
> control software. In the interest of full disclosure, I don't think parcels
> have any chance of shifting the industry consensus from Linux
> packaging towards something as obscure and proprietary as parcels are.
>
>
> And now to my actual points:
>
> I do strongly believe that Bigtop was and is the only completely transparent,
> vendor-friendly way, sticking 100% to official ASF product releases, of
> building your stack from the ground up, and deploying and controlling it any
> way you want to. I agree with Roman's presentation on how this project can move
> forward. However, I somewhat disagree with his view on the prospects. It
> might be a hard road to drive the opinion of the community. But it is a high
> road.
>
> We are definitely small and mostly unsupported by the commercial groups that
> are using the framework. Being a box of LEGO won't win us anything.
> If anything, the empirical evidence is against it, as commercial distros have
> decided to move towards their own means of "vendor lock-in" (yes, you heard me
> right - that's exactly what I said: all so-called open-source companies have
> invented a way to lock in their customers, either with fancy "enterprise
> features" that aren't adding to but amending the underlying stack, or with a
> custom set of patches that oftentimes renders clusters incompatible between
> different vendors).
>
> By all means, my money is on the second way, yet slightly modified (as
> use-cases are coming from users, not developers):
>   #2 start driving adoption of software stacks for the particular kind of data
>      workloads
>
> This community has enough day-to-day practitioners on board to
> accumulate a near-complete insight into where the technology is moving.
> And instead of wobbling in a backwash, let's see if we can be smart and define
> this landscape. After all, Bigtop adopted Spark well before any of the
> commercial vendors officially accepted it. We are seemingly moving more and
> more into the in-memory realm of data processing: Apache Ignite (GridGain),
> Tachyon, Spark. I don't know how much legs Hive has got in it, but I am
> doubtful that it can walk for much longer... Maybe it's just me.
>
> In this thread http://is.gd/MV2BH9 we already discussed some of the aspects
> influencing the future of this project. And we are de facto working on the
> implementation. In my opinion, Hadoop has been more or less commoditized
> already. And it isn't a bad thing, but it means that the innovation is
> elsewhere. E.g. Spark is moving beyond its ties with the storage layer via
> the Tachyon abstraction; GridGain simply doesn't care what the underlying
> storage is. However, data needs to be stored somewhere before it can be
> processed. And HCFS seems to fit the bill OK. But, as I said already, I see
> the real action elsewhere.
> If I were to define the shape of our mid- to long'ish term
> roadmap, it'd be something like this:
>
>    ^ Dashboard/Visualization ^
>    |   OLTP/ML processing    |
>    |  Caching/Acceleration   |
>    |        Storage          |
>
> And around this we can add/improve on deployment (R8???),
> virtualization/containers/clouds. In other words - let's focus on the
> vertical part of the stack, instead of simply supporting the status quo.
>
> Does Cassandra fit the Storage layer in that model? I don't know and, most
> importantly, I don't care. If there's the interest and manpower to have a
> Cassandra-based stack - sure, but perhaps let's do it as a separate branch or
> something, so we aren't over-complicating things. As Roman said earlier, in
> this case it'd be great to engage Cassandra/DataStax people in this project.
> But something tells me they won't be eager to jump on board.
>
> And finally, all of the above leads to "how": how can we start reshaping the
> stack into its next incarnation? Perhaps the Ubuntu model might be an answer
> for that, but we have discussed that elsewhere and dropped the idea as it
> wasn't feasible back in the day. Perhaps its time has just come?
>
> Apologies for a long post.
>   Cos
>
>
>> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
>> Which other projects depend on Bigtop? How will the questions about the
>> direction of Bigtop affect those projects?
>>
>> On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <[email protected]> wrote:
>>
>>> Hi!
>>>
>>> On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <[email protected]> wrote:
>>>> hi bigtop!
>>>>
>>>> I thought I'd start a thread on a few vaguely related thoughts I have
>>>> around the next couple of iterations of Bigtop.
>>>
>>> I think in general I see two major ways for something like
>>> Bigtop to evolve:
>>>    #1 remain a 'box of LEGO bricks' with very little opinion on
>>>       how these pieces need to be integrated
>>>    #2 start driving opinionated use-cases for the particular kind of
>>>       bigdata workloads
>>>
>>> #1 is sort of what all of the Linux distros have been doing for
>>> the majority of the time they have existed. #2 is close to what CentOS
>>> is doing with SIGs.
>>>
>>> Honestly, given the size of our community so far and a total
>>> lack of corporate backing (with the small exception of Cloudera
>>> still paying for our EC2 time), I think #1 is all we can do. I'd
>>> love to be wrong, though.
>>>
>>>> 1) Hive: How will Bigtop evolve to support it, now that it is much more
>>>> than a MapReduce query wrapper?
>>>
>>> I think Hive will remain a big part of Hadoop workloads for the foreseeable
>>> future. What I'd love to see more of is rationalizing things like how
>>> HCatalog, etc. need to be deployed.
>>>
>>>> 2) I wonder whether we should confirm Cassandra interoperability of Spark
>>>> in Bigtop distros,
>>>
>>> Only if there's significant interest from the Cassandra community, and even
>>> then my biggest fear is that with Cassandra we're totally changing the
>>> requirements for the underlying storage subsystem (nothing wrong with
>>> that, it's just that in the Hadoop ecosystem everything assumes very
>>> HDFS'ish requirements for the scale-out storage).
>>>
>>>> 4) In general, I think Bigtop can move in one of 3 directions.
>>>>
>>>>   EXPAND ? : Expanding to include new components, with just basic interop,
>>>> and let folks evolve their own stacks on top of Bigtop on their own.
>>>>
>>>>   CONTRACT+FOCUS ? Contracting to focus on a lean set of core components,
>>>> with super high quality.
>>>>
>>>>   STAY THE COURSE ? Staying the same ~ a packaging platform for just
>>>> Hadoop's direct ecosystem.
>>>>
>>>> I am intrigued by the idea of A and B; both have clear benefits and
>>>> costs... would like to see the opinions of folks --- do we lean in one
>>>> direction or another? What are the criteria for adding a new feature,
>>>> package, or stack to Bigtop?
>>>>
>>>> ... Or maybe I'm just overthinking it and should be spending this time
>>>> testing Spark for the 0.9 release....
>>>
>>> I'd love to know what others think, but for 0.9 I'd rather stay the course.
>>>
>>> Thanks,
>>> Roman.
>>>
>>> P.S. There are also market forces at play that may fundamentally change
>>> the focus of what we're all working on in the next year or so.
>>>
