Rj - that's not too radical, seems like a lot of folks are embracing that idiom.
1) I like featuring Spark along with some persistence technology. Cassandra
doesn't seem to have interest in BigTop, however. So maybe...

Spark, Tachyon, HBase+Phoenix, SOLR, Kafka

Could be pretty effective.

2) Visualization? I think that is an afterthought, at least for now... It's a
lot of work just to get the stack compiling.

> On Dec 11, 2014, at 1:14 PM, RJ Nowling <[email protected]> wrote:
>
> GraphX, Streaming, MLlib, and Spark SQL are all part of Spark and would be
> included in BigTop if Spark is included. They're also pretty well integrated
> with each other.
>
> I'd like to throw out a radical idea, based on Andrew's comments: focus on
> the vertical rather than the horizontal with a slimmed down, Spark-oriented
> stack. (This could be a subset of the current stack.) Strat.io's work
> provides a nice example of a pure Spark stack.
>
> Spark offers a smaller footprint, far less maintenance, functionality of many
> Hadoop components in one (and better integration!), and is better suited for
> diverse deployment situations (cloud, non-HDFS storage, etc.)
>
> A few other complementary components would be needed: Kafka would be needed
> for HA with Spark streaming. Tachyon. Maybe offer Cassandra or similar as
> an alternative storage option. Combine this with dashboards and
> visualization and high quality deployment options (Puppet, Docker, etc.).
> With the data generator and Spark implementation of BigPetStore, my goal is
> to expand BPS to provide high quality analytics examples, oriented more
> towards data scientists.
>
> Just a thought...
>
>> On Thu, Dec 11, 2014 at 12:39 PM, Andrew Purtell <[email protected]> wrote:
>> This is a really great post and I was nodding along with most of it.
>>
>> My personal view is Bigtop starts as a deployable stack of Apache ecosystem
>> components for Big Data. Commodification of (Linux) deployable packages and
>> basic install integration is the baseline.
>>
>> Bigtop packaging Spark components first is an unfortunately little known win
>> of this community, but it's still a win. Although replicating that success
>> with choice of the 'next big thing' is going to be a hit or miss proposition
>> unless one of us can figure out time travel, definitely we can make some
>> observations and scour and/or influence the Apache project landscape to pick
>> up coverage in the space:
>>
>> - Storage is commoditized. Nearly everyone bases the storage stack on HDFS.
>> Everyone does so with what we'd call HCFS. Best to focus elsewhere.
>>
>> - Packaging is commoditized. It's a shame that vendors pursue misguided
>> lock-in strategies but we have no control over that. It's still true that
>> someone using HDP or CDH 4 can switch to Bigtop and vice versa without
>> changing package management tools or strategy. As a user of Apache stack
>> technologies I want long term sustainable package management so will vote
>> with my feet for the commodity option, and won't be alone. Bigtop should
>> provide this, and does, and it's mostly a solved problem.
>>
>> - Deployment is also a "solved" problem but unfortunately everyone solves it
>> differently. :-) This is an area where Bigtop can provide real value, and
>> does, with the Puppet scripts, with the containerization work. One function
>> Bigtop can serve is as repository and example of Hadoop-ish production
>> tooling.
>>
>> - YARN is a reasonably generic grid resource manager. We don't have the
>> resources to stand up an alternate RM and all the tooling necessary with
>> Mesos, but if Mesosphere made a contribution of that I suspect we'd take it.
>> From the Bigtop perspective I think computation framework options are well
>> handled, in that I don't see Bigtop or anyone else developing credible
>> alternatives to MR and Spark for some time. Not sure there's enough oxygen.
>> And we have Giraph (and is GraphX packaged with Spark?).
>> To the extent Spark-on-YARN has rough edges in the Bigtop framework that's an
>> area where contributors can produce value. Related, support for Hive on
>> Spark, Pig on Spark (spork).
>>
>> - The Apache stack includes three streaming computation frameworks - Storm,
>> Spark Streaming, Samza - but Bigtop has mostly missed the boat here. Spark
>> streaming is included in the spark package (I think) but how well is it
>> integrated? Samza is well integrated with YARN but we don't package it.
>> There's also been Storm-on-YARN work out of Yahoo, not sure about what was
>> upstreamed or might be available. Anyway, integration of stream computation
>> frameworks into Bigtop's packaging and deployment/management scripts can
>> produce value, especially if we provide multiple options, because vendors
>> are choosing favorites.
>>
>> - Data access. We do have players differentiating themselves here. Bigtop
>> provides two SQL options (Hive, Phoenix+HBase), can add a third, I see
>> someone's proposed Presto packaging. I'm not sure from the Bigtop
>> perspective we need to pursue additional alternatives, but if there were
>> contributions, we might well take them. "Enterprise friendly API" (SQL) is
>> half of the data access picture I think, the other half is access control.
>> There are competing projects in incubation, Sentry and Ranger, with no
>> shared purpose, which is a real shame. To the extent that Bigtop adopts a
>> cross-component full-stack access control technology, or helps bring another
>> alternative into incubation and adopts that, we can move the needle in this
>> space. We'd offer a vendor neutral access control option devoid of lock-in
>> risk, this would be a big deal for big-E enterprises.
>>
>> - Data management and provenance. Now we're moving up the value chain from
>> storage and data access to the next layer. This is mostly greenfield / blue
>> ocean space in the Apache stack. We have interesting options in incubation:
>> Falcon, Taverna, NiFi.
>> (I think the last one might be truly comprehensive.)
>> All of these are higher level data management and processing workflows which
>> include aspects of management and provenance. One or more could be adopted
>> and refined. There are a lot of relevant integration opportunities up and
>> down the stack that could be undertaken with shared effort of the Bigtop,
>> framework, and component communities.
>>
>> - Machine learning. Moving further up the value chain, we have data and
>> computation and workflow, now how do we derive the competitive advantage
>> that all of the lower layer technologies are in place for? The new hotness
>> is surfacing of insights out of scaled parallel statistical inference.
>> Unfortunately this space doesn't present itself well to the toolbox
>> approach. Bigtop provides Mahout and MLLib as part of Spark (right?), they
>> themselves are toolkits with components of varying utility and maturity (and
>> relevance). I think Bigtop could provide some value by curating ML
>> frameworks that tie in with other Apache stack technologies. ML toolkits
>> leave would-be users in the cold. One has to know what one is doing, and
>> what to do is highly use case specific, this is why "data scientists" can
>> command obscene salaries and only commercial vendors have the resources to
>> focus on specific verticals.
>>
>> - Visualization and preparation. Moving further up, now we are almost
>> touching directly the use case. We have data but we need to clean it,
>> normalize, regularize, filter, slice and dice. Where there are reasonably
>> generic open source tools, preferably at Apache, for data preparation and
>> cleaning Bigtop could provide baseline value by packaging it, and additional
>> value with deeper integration with Apache stack components. Data preparation
>> is a concern hand in hand with data ingest, so we have an interesting
>> feedback loop from the top back down to ingest tools/building blocks like
>> Kafka and Flume.
>> Data cleaning concerns might overlap with the workflow frameworks too. If
>> there's a friendly licensed open source graphical front end to the data
>> cleaning/munging/exploration process that is generic enough that would be a
>> really interesting "acquisition".
>> - We can also package visualization libraries and toolkits for building
>> dashboards. Like with ML algorithms, a complete integration is probably out
>> of scope because every instance would be use case and user specific.
>>
>>
>>
>>> On Mon, Dec 8, 2014 at 12:23 PM, Konstantin Boudnik <[email protected]> wrote:
>>> First I want to address RJ's question:
>>>
>>> The most prominent downstream Bigtop dependency would be any commercial
>>> Hadoop distribution like HDP and CDH. The former is trying to
>>> disguise their affiliation by pushing Ambari forward, and Cloudera's
>>> seemingly shifting its focus to compressed tarballs media (aka parcels)
>>> which requires a closed-source solution like Cloudera Manager to deploy and
>>> control your cluster, effectively rendering it useless if you ever decide
>>> to uninstall the control software. In the interest of full disclosure, I
>>> don't think parcels have any chance to landslide the consensus in the
>>> industry from Linux packaging towards something so obscure and proprietary
>>> as parcels are.
>>>
>>>
>>> And now to my actual points....:
>>>
>>> I do strongly believe the Bigtop was and is the only completely transparent,
>>> vendors' friendly, and 100% sticking to official ASF product releases way of
>>> building your stack from ground up, deploying and controlling it anyway you
>>> want to. I agree with Roman's presentation on how this project can move
>>> forward. However, I somewhat disagree with his view on the perspectives. It
>>> might be a hard road to drive the opinion of the community. But, it is a
>>> high road.
>>>
>>> We are definitely small and mostly unsupported by commercial groups that are
>>> using the framework.
>>> Being a box of LEGO won't win us anything. If anything, the empirical
>>> evidence is against it as commercial distros have decided to
>>> move towards their own means of "vendor lock-in" (yes, you heard me
>>> right - that's exactly what I said: all so called open-source companies have
>>> invented a way to lock-in their customers either with fancy "enterprise
>>> features" that aren't adding but amending underlying stack; or with custom
>>> set of patches oftentimes rendering the cluster to become incompatible
>>> between different vendors).
>>>
>>> By all means, my money is on the second way, yet slightly modified (as
>>> use-cases are coming from users, not developers):
>>>   #2 start driving adoption of software stacks for the particular kind of
>>>      data workloads
>>>
>>> This community has enough day-to-day practitioners on board to
>>> accumulate a near-complete introspection of where the technology is moving.
>>> And instead of wobbling in a backwash, let's see if we can be smart and
>>> define this landscape. After all, Bigtop has adopted Spark well before any
>>> of the commercials have officially accepted it. We seemingly are moving more
>>> and more into in-memory realm of data processing: Apache Ignite (Gridgain),
>>> Tachyon, Spark. I don't know how much legs Hive got in it, but I am
>>> doubtful that it can walk for much longer... Maybe it's just me.
>>>
>>> In this thread http://is.gd/MV2BH9 we already discussed some of the aspects
>>> influencing the feature of this project. And we are de-facto working on the
>>> implementation. In my opinion, Hadoop has been more or less commoditized
>>> already. And it isn't a bad thing, but it means that the innovations are
>>> elsewhere. E.g. Spark is moving beyond its ties with storage layer via
>>> Tachyon abstraction; GridGain simply doesn't care what's underlying storage
>>> is. However, data needs to be stored somewhere before it can be processed.
>>> And HCFS seems to be fitting the bill ok. But, as I said already, I see the
>>> real action elsewhere. If I were to define the shape of our mid- to long'ish
>>> term roadmap it'd be something like that:
>>>
>>>     ^ Dashboard/Visualization ^
>>>     |    OLTP/ML processing   |
>>>     |   Caching/Acceleration  |
>>>     |         Storage         |
>>>
>>> And around this we can add/improve on deployment (R8???),
>>> virtualization/containers/clouds. In other words - let's focus on the
>>> vertical part of the stack, instead of simply supporting the status quo.
>>>
>>> Does Cassandra fit the Storage layer in that model? I don't know and most
>>> important - I don't care. If there's an interest and manpower to have
>>> Cassandra-based stack - sure, but perhaps let's do as a separate branch or
>>> something, so we aren't over-complicating things. As Roman said earlier, in
>>> this case it'd be great to engage Cassandra/DataStax people into this
>>> project. But something tells me they won't be eager to jump on board.
>>>
>>> And finally, all this above leads to "how": how we can start reshaping the
>>> stack into its next incarnation? Perhaps, Ubuntu model might be an answer
>>> for that, but we have discussed that elsewhere and dropped the idea as it
>>> wasn't feasible back in the day. Perhaps its time just came?
>>>
>>> Apologies for a long post.
>>> Cos
>>>
>>>
>>> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
>>> > Which other projects depend on BigTop? How will the questions about the
>>> > direction of BigTop affect those projects?
>>> >
>>> > On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <[email protected]>
>>> > wrote:
>>> >
>>> > > Hi!
>>> > >
>>> > > On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <[email protected]>
>>> > > wrote:
>>> > > > hi bigtop !
>>> > > >
>>> > > > I thought id start a thread a few vaguely related thoughts i have
>>> > > > around next couple iterations of bigtop.
>>> > >
>>> > > I think in general I see two major ways for something like
>>> > > Bigtop to evolve:
>>> > >    #1 remain a 'box of LEGO bricks' with very little opinion on
>>> > >         how these pieces need to be integrated
>>> > >    #2 start driving opinionated use-cases for the particular kind of
>>> > >         bigdata workloads
>>> > >
>>> > > #1 is sort of what all of the Linux distros have been doing for
>>> > > the majority of time they existed. #2 is close to what CentOS
>>> > > is doing with SIGs.
>>> > >
>>> > > Honestly, given the size of our community so far and a total
>>> > > lack of corporate backing (with a small exception of Cloudera
>>> > > still paying for our EC2 time) I think #1 is all we can do. I'd
>>> > > love to be wrong, though.
>>> > >
>>> > > > 1) Hive: How will bigtop evolve to support it, now that it is much
>>> > > > more than a mapreduce query wrapper?
>>> > >
>>> > > I think Hive will remain a big part of Hadoop workloads for foreseeable
>>> > > future. What I'd love to see more of is rationalizing things like how
>>> > > HCatalog, etc. need to be deployed.
>>> > >
>>> > > > 2) I wonder whether we should confirm cassandra interoperability of
>>> > > > spark in bigtop distros,
>>> > >
>>> > > Only if there's a significant interest from cassandra community and even
>>> > > then my biggest fear is that with cassandra we're totally changing the
>>> > > requirements for the underlying storage subsystem (nothing wrong with
>>> > > that, it's just that in Hadoop ecosystem everything assumes very HDFS'ish
>>> > > requirements for the scale-out storage).
>>> > >
>>> > > > 4) in general, i think bigtop can move in one of 3 directions.
>>> > > >
>>> > > >   EXPAND ? : Expanding to include new components, with just basic
>>> > > > interop, and let folks evolve their own stacks on top of bigtop on
>>> > > > their own.
>>> > > >
>>> > > >   CONTRACT+FOCUS ?
>>> > > > Contracting to focus on a lean set of core components,
>>> > > > with super high quality.
>>> > > >
>>> > > >   STAY THE COURSE ? Staying the same ~ a packaging platform for just
>>> > > > hadoop's direct ecosystem.
>>> > > >
>>> > > > I am intrigued by the idea of A and B; both have clear benefits and
>>> > > > costs... would like to see the opinions of folks --- do we lean in one
>>> > > > direction or another? What are the criteria for adding a new feature,
>>> > > > package, stack to bigtop?
>>> > > >
>>> > > > ... Or maybe im just overthinking it and should be spending this time
>>> > > > testing spark for 0.9 release....
>>> > >
>>> > > I'd love to know what others think, but for 0.9 I'd rather stay the
>>> > > course.
>>> > >
>>> > > Thanks,
>>> > > Roman.
>>> > >
>>> > > P.S. There are also market forces at play that may fundamentally change
>>> > > the focus of what we're all working on in the year or so.
>>> > >
>>
>>
>>
>> --
>> Best regards,
>>
>> - Andy
>>
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> (via Tom White)
>
