First, I want to address RJ's question: the most prominent downstream Bigtop dependency would be any commercial Hadoop distribution, like HDP and CDH. The former is trying to disguise its affiliation by pushing Ambari forward, and Cloudera is seemingly shifting its focus to compressed-tarball media (aka parcels), which require a closed-source solution like Cloudera Manager to deploy and control your cluster, effectively rendering it useless if you ever decide to uninstall the control software. In the interest of full disclosure, I don't think parcels have any chance of shifting the industry consensus away from Linux packaging towards something as obscure and proprietary as parcels are.

And now to my actual points:

I strongly believe that Bigtop was and is the only way of building your stack from the ground up, and deploying and controlling it any way you want to, that is completely transparent, vendor-friendly, and sticks 100% to official ASF product releases.

I agree with Roman's presentation of how this project can move forward. However, I somewhat disagree with his view of the prospects. It might be a hard road to drive the opinion of the community. But it is a high road.

We are definitely small and mostly unsupported by the commercial groups that are using the framework. Being a box of LEGO won't win us anything. If anything, the empirical evidence is against it, as commercial distros have decided to move towards their own means of "vendor lock-in" (yes, you heard me right, that's exactly what I said: all the so-called open-source companies have invented a way to lock in their customers, either with fancy "enterprise features" that aren't adding to but amending the underlying stack, or with a custom set of patches that oftentimes renders clusters incompatible between different vendors).

By all means, my money is on the second way, yet slightly modified (as use-cases are coming from users, not developers):

    #2 start driving adoption of software stacks for the particular kind of data workloads

This community has enough day-to-day practitioners on board to build a near-complete picture of where the technology is moving. And instead of wobbling in a backwash, let's see if we can be smart and define this landscape. After all, Bigtop adopted Spark well before any of the commercial vendors officially accepted it. We seemingly are moving more and more into the in-memory realm of data processing: Apache Ignite (GridGain), Tachyon, Spark. I don't know how much legs Hive has got left in it, but I am doubtful that it can walk for much longer... Maybe it's just me.

In this thread http://is.gd/MV2BH9 we already discussed some of the aspects influencing the future of this project. And we are de facto working on the implementation.

In my opinion, Hadoop has been more or less commoditized already. And it isn't a bad thing, but it means that the innovations are elsewhere. E.g. Spark is moving beyond its ties with the storage layer via the Tachyon abstraction; GridGain simply doesn't care what the underlying storage is. However, data needs to be stored somewhere before it can be processed. And HCFS seems to be fitting the bill ok. But, as I said already, I see the real action elsewhere. If I were to define the shape of our mid- to long'ish-term roadmap, it'd be something like this:

    ^ Dashboard/Visualization ^
    |    OLTP/ML processing   |
    |   Caching/Acceleration  |
    |         Storage         |

And around this we can add/improve on deployment (R8???), virtualization/containers/clouds. In other words - let's focus on the vertical part of the stack, instead of simply supporting the status quo.

Does Cassandra fit the Storage layer in that model? I don't know and, more importantly, I don't care. If there's interest and manpower to have a Cassandra-based stack - sure, but perhaps let's do it as a separate branch or something, so we aren't over-complicating things. As Roman said earlier, in this case it'd be great to engage the Cassandra/DataStax people in this project. But something tells me they won't be eager to jump on board.

And finally, all of the above leads to the "how": how can we start reshaping the stack into its next incarnation? Perhaps the Ubuntu model might be an answer for that, but we have discussed that elsewhere and dropped the idea as it wasn't feasible back in the day. Perhaps its time has just come?

Apologies for a long post.
  Cos

On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
> Which other projects depend on BigTop? How will the questions about the
> direction of BigTop affect those projects?
>
> On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <[email protected]> wrote:
>
> > Hi!
> >
> > On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <[email protected]> wrote:
> > > hi bigtop !
> > >
> > > I thought id start a thread a few vaguely related thoughts i have around
> > > next couple iterations of bigtop.
> >
> > I think in general I see two major ways for something like
> > Bigtop to evolve:
> >    #1 remain a 'box of LEGO bricks' with very little opinion on
> >       how these pieces need to be integrated
> >    #2 start driving oppinioned use-cases for the particular kind of
> >       bigdata workloads
> >
> > #1 is sort of what all of the Linux distros have been doing for
> > the majority of time they existed. #2 is close to what CentOS
> > is doing with SIGs.
> >
> > Honestly, given the size of our community so far and a total
> > lack of corporate backing (with a small exception of Cloudera
> > still paying for our EC2 time) I think #1 is all we can do. I'd
> > love to be wrong, though.
> >
> > > 1) Hive: How will bigtop to evolve to support it, now that it is much more
> > > than a mapreduce query wrapper?
> >
> > I think Hive will remain a big part of Hadoop workloads for forseeable
> > future. What I'd love to see more of is rationalizing things like how
> > HCatalog, etc. need to be deployed.
> >
> > > 2) I wonder wether we should confirm cassandra interoperability of spark in
> > > bigtop distros,
> >
> > Only if there's a significant interest from cassandra community and even
> > then my biggest fear is that with cassandra we're totally changing the
> > requirements for the underlying storage subsystem (nothing wrong with
> > that, its just that in Hadoop ecosystem everything assumes very HDFS'ish
> > requirements for the scale-out storage).
> >
> > > 4) in general, i think bigtop can move in one of 3 directions.
> > >
> > > EXPAND: Expanding to include new components, with just basic interop,
> > > and let folks evolve their own stacks on top of bigtop on their own.
> > >
> > > CONTRACT+FOCUS: Contracting to focus on a lean set of core components,
> > > with super high quality.
> > >
> > > STAY THE COURSE: Staying the same ~ a packaging platform for just
> > > hadoop's direct ecosystem.
> > >
> > > I am intrigued by the idea of A and B both have clear benefits and costs...
> > > would like to see the opinions of folks --- do we lean in one direction or
> > > another? What is the criteria for adding a new feature, package, stack to
> > > bigtop?
> > >
> > > ... Or maybe im just overthinking it and should be spending this time
> > > testing spark for 0.9 release....
> >
> > I'd love to know what other think, but for 0.9 I'd rather stay the course.
> >
> > Thanks,
> > Roman.
> >
> > P.S. There are also market forces at play that may fundamentally change
> > the focus of what we're all working on in the year or so.
> >
