I want to agree with Andrew. While Spark is a huge step forward compared to basic Hadoop, it isn't a solution for everything, and it definitely isn't a solution for fast processing of data sets that don't fit in memory. Oh, and by the way, let's not forget that ML/analytics on Hadoop isn't the whole world of data processing. Say, OLTP workloads command a way larger market share than just ML. That's why I am very optimistic about Ignite (incubating).
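To make that point concrete, here is a minimal sketch (Scala; the path, app name, and object name are placeholders, not anything shipped by Bigtop) of the behavior Andrew describes below: with Spark's default MEMORY_ONLY caching, partitions of a working set bigger than aggregate RAM are dropped and recomputed (or the job dies with OOM-style errors), while the MEMORY_AND_DISK fallback survives by spilling, at which point the "fast, in-memory" advantage is mostly gone.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object MemoryPressureSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("memory-pressure-sketch"))

        // Hypothetical large dataset; the path is a placeholder
        val big = sc.textFile("hdfs:///data/placeholder/*").map(_.split('\t'))

        // Default caching: partitions that don't fit in memory are dropped
        // and recomputed from lineage, or the job fails under memory pressure.
        // big.persist(StorageLevel.MEMORY_ONLY)

        // Spill-to-disk variant: survives a working set larger than RAM,
        // but is no longer the fast, purely in-memory processing Spark is sold on.
        big.persist(StorageLevel.MEMORY_AND_DISK)

        println(big.count())
        sc.stop()
      }
    }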
Cos

On Thu, Dec 11, 2014 at 02:04PM, Andrew Purtell wrote:
> The problem I see with a Spark-only stack is, in my experience, Spark falls apart as soon as the working set exceeds all available RAM on the cluster. (One is presented with a sea of exceptions.) We need Hadoop anyway for HDFS and Common (required by many, many components), we get YARN and the MR runtime as part of this package, and Hadoop MR is still eminently useful when data sets and storage requirements are far beyond aggregate RAM.
>
> We have an open JIRA for adding Kafka; it would be fantastic if someone picks it up and brings it over the finish line.
>
>
> On Thu, Dec 11, 2014 at 10:14 AM, RJ Nowling <[email protected]> wrote:
>
> > GraphX, Streaming, MLlib, and Spark SQL are all part of Spark and would be included in BigTop if Spark is included. They're also pretty well integrated with each other.
> >
> > I'd like to throw out a radical idea, based on Andrew's comments: focus on the vertical rather than the horizontal with a slimmed down, Spark-oriented stack. (This could be a subset of the current stack.) Strat.io's work provides a nice example of a pure Spark stack.
> >
> > Spark offers a smaller footprint, far less maintenance, the functionality of many Hadoop components in one (and better integration!), and is better suited for diverse deployment situations (cloud, non-HDFS storage, etc.).
> >
> > A few other complementary components would be needed: Kafka would be needed for HA with Spark streaming. Tachyon. Maybe offer Cassandra or similar as an alternative storage option. Combine this with dashboards and visualization and high quality deployment options (Puppet, Docker, etc.). With the data generator and Spark implementation of BigPetStore, my goal is to expand BPS to provide high quality analytics examples, oriented more towards data scientists.
> >
> > Just a thought...
> >
> > On Thu, Dec 11, 2014 at 12:39 PM, Andrew Purtell <[email protected]> wrote:
> >
> >> This is a really great post and I was nodding along with most of it.
> >>
> >> My personal view is that Bigtop starts as a deployable stack of Apache ecosystem components for Big Data. Commodification of (Linux) deployable packages and basic install integration is the baseline.
> >>
> >> Bigtop packaging Spark components first is an unfortunately little-known win of this community, but it's still a win. Although replicating that success with a choice of the 'next big thing' is going to be a hit-or-miss proposition unless one of us can figure out time travel, we can definitely make some observations and scour and/or influence the Apache project landscape to pick up coverage in the space:
> >>
> >> - Storage is commoditized. Nearly everyone bases the storage stack on HDFS. Everyone does so with what we'd call HCFS. Best to focus elsewhere.
> >>
> >> - Packaging is commoditized. It's a shame that vendors pursue misguided lock-in strategies but we have no control over that. It's still true that someone using HDP or CDH 4 can switch to Bigtop and vice versa without changing package management tools or strategy. As a user of Apache stack technologies I want long term sustainable package management, so I will vote with my feet for the commodity option, and I won't be alone. Bigtop should provide this, and does, and it's mostly a solved problem.
> >>
> >> - Deployment is also a "solved" problem but unfortunately everyone solves it differently.
> >> :-) This is an area where Bigtop can provide real value, and does, with the Puppet scripts and with the containerization work. One function Bigtop can serve is as a repository and example of Hadoop-ish production tooling.
> >>
> >> - YARN is a reasonably generic grid resource manager. We don't have the resources to stand up an alternate RM and all the tooling necessary with Mesos, but if Mesosphere made a contribution of that I suspect we'd take it. From the Bigtop perspective I think computation framework options are well handled, in that I don't see Bigtop or anyone else developing credible alternatives to MR and Spark for some time. Not sure there's enough oxygen. And we have Giraph (and is GraphX packaged with Spark?). To the extent Spark-on-YARN has rough edges in the Bigtop framework, that's an area where contributors can produce value. Related: support for Hive on Spark and Pig on Spark (spork).
> >>
> >> - The Apache stack includes three streaming computation frameworks - Storm, Spark Streaming, Samza - but Bigtop has mostly missed the boat here. Spark Streaming is included in the spark package (I think) but how well is it integrated? Samza is well integrated with YARN but we don't package it. There's also been Storm-on-YARN work out of Yahoo; not sure what was upstreamed or might be available. Anyway, integration of stream computation frameworks into Bigtop's packaging and deployment/management scripts can produce value, especially if we provide multiple options, because vendors are choosing favorites.
> >>
> >> - Data access. We do have players differentiating themselves here. Bigtop provides two SQL options (Hive, Phoenix+HBase) and can add a third; I see someone has proposed Presto packaging. I'm not sure from the Bigtop perspective we need to pursue additional alternatives, but if there were contributions, we might well take them. "Enterprise friendly API" (SQL) is half of the data access picture I think; the other half is access control. There are competing projects in incubation, Sentry and Ranger, with no shared purpose, which is a real shame. To the extent that Bigtop adopts a cross-component full-stack access control technology, or helps bring another alternative into incubation and adopts that, we can move the needle in this space. We'd offer a vendor-neutral access control option devoid of lock-in risk; this would be a big deal for big-E enterprises.
> >>
> >> - Data management and provenance. Now we're moving up the value chain from storage and data access to the next layer. This is mostly greenfield / blue ocean space in the Apache stack. We have interesting options in incubation: Falcon, Taverna, NiFi. (I think the last one might be truly comprehensive.) All of these are higher level data management and processing workflows which include aspects of management and provenance. One or more could be adopted and refined. There are a lot of relevant integration opportunities up and down the stack that could be undertaken with shared effort of the Bigtop, framework, and component communities.
> >>
> >> - Machine learning. Moving further up the value chain, we have data and computation and workflow; now how do we derive the competitive advantage that all of the lower-layer technologies are in place for?
> >> The new hotness is the surfacing of insights out of scaled parallel statistical inference. Unfortunately this space doesn't present itself well to the toolbox approach. Bigtop provides Mahout, and MLlib as part of Spark (right?); they themselves are toolkits with components of varying utility and maturity (and relevance). I think Bigtop could provide some value by curating ML frameworks that tie in with other Apache stack technologies. ML toolkits leave would-be users in the cold. One has to know what one is doing, and what to do is highly use-case specific; this is why "data scientists" can command obscene salaries and only commercial vendors have the resources to focus on specific verticals.
> >>
> >> - Visualization and preparation. Moving further up, now we are almost directly touching the use case. We have data but we need to clean it, normalize, regularize, filter, slice and dice. Where there are reasonably generic open source tools, preferably at Apache, for data preparation and cleaning, Bigtop could provide baseline value by packaging them, and additional value with deeper integration with Apache stack components. Data preparation goes hand in hand with data ingest, so we have an interesting feedback loop from the top back down to ingest tools/building blocks like Kafka and Flume. Data cleaning concerns might overlap with the workflow frameworks too. If there's a friendly-licensed open source graphical front end to the data cleaning/munging/exploration process that is generic enough, that would be a really interesting "acquisition".
> >> - We can also package visualization libraries and toolkits for building dashboards. Like with ML algorithms, a complete integration is probably out of scope because every instance would be use case and user specific.
> >>
> >>
> >> On Mon, Dec 8, 2014 at 12:23 PM, Konstantin Boudnik <[email protected]> wrote:
> >>
> >>> First I want to address RJ's question:
> >>>
> >>> The most prominent downstream Bigtop dependency would be any commercial Hadoop distribution like HDP and CDH. The former is trying to disguise their affiliation by pushing Ambari forward, and Cloudera is seemingly shifting its focus to compressed tarball media (aka parcels), which requires a closed-source solution like Cloudera Manager to deploy and control your cluster, effectively rendering it useless if you ever decide to uninstall the control software. In the interest of full disclosure, I don't think parcels have any chance to swing the industry consensus away from Linux packaging towards something as obscure and proprietary as parcels are.
> >>>
> >>> And now to my actual points:
> >>>
> >>> I do strongly believe that Bigtop was and is the only completely transparent, vendor-friendly way of building your stack from the ground up, deploying and controlling it any way you want to, while sticking 100% to official ASF product releases. I agree with Roman's presentation on how this project can move forward. However, I somewhat disagree with his view on the perspectives. It might be a hard road to drive the opinion of the community. But it is a high road.
> >>>
> >>> We are definitely small and mostly unsupported by commercial groups that are using the framework.
> >>> Being a box of LEGO won't win us anything. If anything, the empirical evidence is against it, as commercial distros have decided to move towards their own means of "vendor lock-in" (yes, you heard me right - that's exactly what I said: all so-called open-source companies have invented a way to lock in their customers, either with fancy "enterprise features" that aren't adding to but amending the underlying stack, or with a custom set of patches oftentimes rendering the cluster incompatible between different vendors).
> >>>
> >>> By all means, my money is on the second way, yet slightly modified (as use-cases are coming from users, not developers):
> >>>   #2 start driving adoption of software stacks for the particular kind of data workloads
> >>>
> >>> This community has enough day-to-day practitioners on board to accumulate a near-complete introspection of where the technology is moving. And instead of wobbling in a backwash, let's see if we can be smart and define this landscape. After all, Bigtop adopted Spark well before any of the commercial vendors officially accepted it. We seemingly are moving more and more into the in-memory realm of data processing: Apache Ignite (GridGain), Tachyon, Spark. I don't know how many legs Hive has left, but I am doubtful that it can walk for much longer... Maybe it's just me.
> >>>
> >>> In this thread http://is.gd/MV2BH9 we already discussed some of the aspects influencing the future of this project. And we are de-facto working on the implementation. In my opinion, Hadoop has been more or less commoditized already. That isn't a bad thing, but it means that the innovations are elsewhere. E.g. Spark is moving beyond its ties to the storage layer via the Tachyon abstraction; GridGain simply doesn't care what the underlying storage is. However, data needs to be stored somewhere before it can be processed, and HCFS seems to be fitting the bill ok. But, as I said already, I see the real action elsewhere. If I were to define the shape of our mid- to long-ish term roadmap, it'd be something like this:
> >>>
> >>>   ^ Dashboard/Visualization ^
> >>>   |   OLTP/ML processing    |
> >>>   |  Caching/Acceleration   |
> >>>   |         Storage         |
> >>>
> >>> And around this we can add/improve on deployment (R8???), virtualization/containers/clouds. In other words - let's focus on the vertical part of the stack, instead of simply supporting the status quo.
> >>>
> >>> Does Cassandra fit the Storage layer in that model? I don't know and, more importantly, I don't care. If there's interest and manpower to have a Cassandra-based stack - sure, but perhaps let's do it as a separate branch or something, so we aren't over-complicating things. As Roman said earlier, in this case it'd be great to engage Cassandra/DataStax people in this project. But something tells me they won't be eager to jump on board.
> >>>
> >>> And finally, all of the above leads to "how": how can we start reshaping the stack into its next incarnation? Perhaps the Ubuntu model might be an answer for that, but we have discussed that elsewhere and dropped the idea as it wasn't feasible back in the day. Perhaps its time has just come?
> >>>
> >>> Apologies for a long post.
> >>> Cos
> >>>
> >>> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
> >>> > Which other projects depend on BigTop? How will the questions about the direction of BigTop affect those projects?
> >>> >
> >>> > On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <[email protected]> wrote:
> >>> >
> >>> > > Hi!
> >>> > >
> >>> > > On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <[email protected]> wrote:
> >>> > > > hi bigtop !
> >>> > > >
> >>> > > > I thought I'd start a thread on a few vaguely related thoughts I have around the next couple iterations of bigtop.
> >>> > >
> >>> > > I think in general I see two major ways for something like Bigtop to evolve:
> >>> > >   #1 remain a 'box of LEGO bricks' with very little opinion on how these pieces need to be integrated
> >>> > >   #2 start driving opinionated use-cases for the particular kind of bigdata workloads
> >>> > >
> >>> > > #1 is sort of what all of the Linux distros have been doing for the majority of the time they have existed. #2 is close to what CentOS is doing with SIGs.
> >>> > >
> >>> > > Honestly, given the size of our community so far and a total lack of corporate backing (with the small exception of Cloudera still paying for our EC2 time), I think #1 is all we can do. I'd love to be wrong, though.
> >>> > >
> >>> > > > 1) Hive: How will bigtop evolve to support it, now that it is much more than a mapreduce query wrapper?
> >>> > >
> >>> > > I think Hive will remain a big part of Hadoop workloads for the foreseeable future. What I'd love to see more of is rationalizing things like how HCatalog, etc. need to be deployed.
> >>> > >
> >>> > > > 2) I wonder whether we should confirm cassandra interoperability of spark in bigtop distros,
> >>> > >
> >>> > > Only if there's significant interest from the cassandra community, and even then my biggest fear is that with cassandra we're totally changing the requirements for the underlying storage subsystem (nothing wrong with that, it's just that in the Hadoop ecosystem everything assumes very HDFS'ish requirements for the scale-out storage).
> >>> > >
> >>> > > > 4) in general, i think bigtop can move in one of 3 directions.
> >>> > > >
> >>> > > > EXPAND: Expanding to include new components, with just basic interop, and let folks evolve their own stacks on top of bigtop on their own.
> >>> > > >
> >>> > > > CONTRACT+FOCUS: Contracting to focus on a lean set of core components, with super high quality.
> >>> > > >
> >>> > > > STAY THE COURSE: Staying the same ~ a packaging platform for just hadoop's direct ecosystem.
> >>> > > >
> >>> > > > I am intrigued by the idea; A and B both have clear benefits and costs... would like to see the opinions of folks --- do we lean in one direction or another? What are the criteria for adding a new feature, package, or stack to bigtop?
> >>> > > >
> >>> > > > ... Or maybe I'm just overthinking it and should be spending this time testing spark for the 0.9 release....
> >>> > >
> >>> > > I'd love to know what others think, but for 0.9 I'd rather stay the course.
> >>> > >
> >>> > > Thanks,
> >>> > > Roman.
> >>> > >
> >>> > > P.S. There are also market forces at play that may fundamentally change the focus of what we're all working on in the next year or so.
> >>
> >> --
> >> Best regards,
> >>
> >>    - Andy
> >>
> >> Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
