Re: What will the next generation of bigtop look like?

Jay Vyas Thu, 11 Dec 2014 13:11:48 -0800

Rj - that's not too radical, seems like a lot of folks are embracing that idiom.


1) I like featuring spark along with some persistence technology.  The Cassandra 
community doesn't seem to have interest in BigTop, however.  So maybe...

Spark
Tachyon
Hbase+Phoenix
SOLR
Kafka

Could be pretty effective.

2) visualization ? I think that is an afterthought, at least for now.... It's a 
lot of work just to get the stack compiling.  

> On Dec 11, 2014, at 1:14 PM, RJ Nowling <[email protected]> wrote:
> 
> GraphX, Streaming, MLlib, and Spark SQL are all part of Spark and would be 
> included in BigTop if Spark is included. They're also pretty well integrated 
> with each other.
> 
> I'd like to throw out a radical idea, based on Andrew's comments: focus on 
> the vertical rather than the horizontal with a slimmed down, Spark-oriented 
> stack.  (This could be a subset of the current stack.)  Strat.io's work 
> provides a nice example of a pure Spark stack.
> 
> Spark offers a smaller footprint, far less maintenance, functionality of many 
> Hadoop components in one (and better integration!), and is better suited for 
> diverse deployment situations (cloud, non-HDFS storage, etc.) 
> 
> A few other complementary components would be needed: Kafka would be needed 
> for HA with Spark streaming.  Tachyon.  Maybe offer Cassandra or similar as 
> an alternative storage option.    Combine this with dashboards and 
> visualization and high quality deployment options (Puppet, Docker, etc.).  
> With the data generator and Spark implementation of BigPetStore, my goal is 
> to expand BPS to provide high quality analytics examples, oriented more 
> towards data scientists.
> 
> Just a thought...
> 
>> On Thu, Dec 11, 2014 at 12:39 PM, Andrew Purtell <[email protected]> wrote:
>> This is a really great post and I was nodding along with most of it. 
>> 
>> My personal view is Bigtop starts as a deployable stack of Apache ecosystem 
>> components for Big Data. Commodification of (Linux) deployable packages and 
>> basic install integration is the baseline. 
>> 
>> Bigtop packaging Spark components first is an unfortunately little known win 
>> of this community, but it's still a win. Although replicating that success 
>> with choice of the 'next big thing' is going to be a hit or miss proposition 
>> unless one of us can figure out time travel, definitely we can make some 
>> observations and scour and/or influence the Apache project landscape to pick 
>> up coverage in the space:
>> 
>> - Storage is commoditized. Nearly everyone bases the storage stack on HDFS. 
>> Everyone does so with what we'd call HCFS. Best to focus elsewhere.
>> 
>> - Packaging is commoditized. It's a shame that vendors pursue misguided 
>> lock-in strategies but we have no control over that. It's still true that 
>> someone using HDP or CDH 4 can switch to Bigtop and vice versa without 
>> changing package management tools or strategy. As a user of Apache stack 
>> technologies I want long term sustainable package management so will vote 
>> with my feet for the commodity option, and won't be alone. Bigtop should 
>> provide this, and does, and it's mostly a solved problem.
>> 
>> - Deployment is also a "solved" problem but unfortunately everyone solves it 
>> differently. :-) This is an area where Bigtop can provide real value, and 
>> does, with the Puppet scripts, with the containerization work. One function 
>> Bigtop can serve is as repository and example of Hadoop-ish production 
>> tooling.
>> 
>> - YARN is a reasonably generic grid resource manager. We don't have the 
>> resources to stand up an alternate RM and all the tooling necessary with 
>> Mesos, but if Mesosphere made a contribution of that I suspect we'd take it. 
>> From the Bigtop perspective I think computation framework options are well 
>> handled, in that I don't see Bigtop or anyone else developing credible 
>> alternatives to MR and Spark for some time. Not sure there's enough oxygen. 
>> And we have Giraph (and is GraphX packaged with Spark?). To the extent 
>> Spark-on-YARN has rough edges in the Bigtop framework that's an area where 
>> contributors can produce value. Related, support for Hive on Spark, Pig on 
>> Spark (spork). 
>> 
>> - The Apache stack includes three streaming computation frameworks - Storm, 
>> Spark Streaming, Samza - but Bigtop has mostly missed the boat here. Spark 
>> streaming is included in the spark package (I think) but how well is it 
>> integrated? Samza is well integrated with YARN but we don't package it. 
>> There's also been Storm-on-YARN work out of Yahoo, not sure about what was 
>> upstreamed or might be available. Anyway, integration of stream computation 
>> frameworks into Bigtop's packaging and deployment/management scripts can 
>> produce value, especially if we provide multiple options, because vendors 
>> are choosing favorites. 
>> 
>> - Data access. We do have players differentiating themselves here. Bigtop 
>> provides two SQL options (Hive, Phoenix+HBase), can add a third, I see 
>> someone's proposed Presto packaging. I'm not sure from the Bigtop 
>> perspective we need to pursue additional alternatives, but if there were 
>> contributions, we might well take them. "Enterprise friendly API" (SQL) is 
>> half of the data access picture I think, the other half is access control. 
>> There are competing projects in incubation, Sentry and Ranger, with no 
>> shared purpose, which is a real shame. To the extent that Bigtop adopts a 
>> cross-component full-stack access control technology, or helps bring another 
>> alternative into incubation and adopts that, we can move the needle in this 
>> space. We'd offer a vendor neutral access control option devoid of lock-in 
>> risk, this would be a big deal for big-E enterprises.
>> 
>> - Data management and provenance. Now we're moving up the value chain from 
>> storage and data access to the next layer. This is mostly greenfield / blue 
>> ocean space in the Apache stack. We have interesting options in incubation: 
>> Falcon, Taverna, NiFi. (I think the last one might be truly comprehensive.) 
>> All of these are higher level data management and processing workflows which 
>> include aspects of management and provenance. One or more could be adopted 
>> and refined. There are a lot of relevant integration opportunities up and 
>> down the stack that could be undertaken with shared effort of the Bigtop, 
>> framework, and component communities.
>> 
>> - Machine learning. Moving further up the value chain, we have data and 
>> computation and workflow, now how do we derive the competitive advantage 
>> that all of the lower layer technologies are in place for? The new hotness 
>> is surfacing of insights out of scaled parallel statistical inference. 
>> Unfortunately this space doesn't present itself well to the toolbox 
>> approach. Bigtop provides Mahout and MLlib as part of Spark (right?), they 
>> themselves are toolkits with components of varying utility and maturity (and 
>> relevance). I think Bigtop could provide some value by curating ML 
>> frameworks that tie in with other Apache stack technologies. ML toolkits 
>> leave would-be users in the cold. One has to know what one is doing, and 
>> what to do is highly use case specific, this is why "data scientists" can 
>> command obscene salaries and only commercial vendors have the resources to 
>> focus on specific verticals. 
>> 
>> - Visualization and preparation. Moving further up, now we are almost 
>> touching directly the use case. We have data but we need to clean it, 
>> normalize, regularize, filter, slice and dice. Where there are reasonably 
>> generic open source tools, preferably at Apache, for data preparation and 
>> cleaning Bigtop could provide baseline value by packaging it, and additional 
>> value with deeper integration with Apache stack components. Data preparation 
>> is a concern hand in hand with data ingest, so we have an interesting 
>> feedback loop from the top back down to ingest tools/building blocks like 
>> Kafka and Flume. Data cleaning concerns might overlap with the workflow 
>> frameworks too. If there's a friendly licensed open source graphical front 
>> end to the data cleaning/munging/exploration process that is generic enough 
>> that would be a really interesting "acquisition". 
>> - We can also package visualization libraries and toolkits for building 
>> dashboards. Like with ML algorithms, a complete integration is probably out 
>> of scope because every instance would be use case and user specific.
>> 
>> 
>> 
>>> On Mon, Dec 8, 2014 at 12:23 PM, Konstantin Boudnik <[email protected]> wrote:
>>> First I want to address RJ's question:
>>> 
>>> The most prominent downstream Bigtop dependency would be any commercial
>>> Hadoop distribution like HDP and CDH. The former is trying to
>>> disguise their affiliation by pushing Ambari forward, and Cloudera is
>>> seemingly shifting its focus to compressed tarball media (aka parcels), which
>>> requires a closed-source solution like Cloudera Manager to deploy and control
>>> your cluster, effectively rendering it useless if you ever decide to uninstall
>>> the control software. In the interest of full disclosure, I don't think parcels
>>> have any chance of shifting the industry consensus away from Linux
>>> packaging towards something as obscure and proprietary as parcels.
>>> 
>>> 
>>> And now to my actual points....:
>>> 
>>> I do strongly believe that Bigtop was and is the only completely transparent,
>>> vendor-friendly way, sticking 100% to official ASF product releases, of
>>> building your stack from the ground up and deploying and controlling it any
>>> way you want to. I agree with Roman's presentation on how this project can
>>> move forward. However, I somewhat disagree with his view on the prospects. It
>>> might be a hard road to drive the opinion of the community. But it is a high
>>> road.
>>> 
>>> We are definitely small and mostly unsupported by commercial groups that are
>>> using the framework. Being a box of LEGO won't win us anything. If anything,
>>> the empirical evidence is against it, as commercial distros have decided to
>>> move towards their own means of "vendor lock-in" (yes, you heard me
>>> right - that's exactly what I said: all so-called open-source companies have
>>> invented a way to lock in their customers, either with fancy "enterprise
>>> features" that aren't adding to but amending the underlying stack, or with a
>>> custom set of patches that oftentimes renders clusters incompatible between
>>> different vendors).
>>> 
>>> By all means, my money is on the second way, yet slightly modified (as
>>> use-cases are coming from users, not developers):
>>>   #2 start driving adoption of software stacks for the particular kind of
>>>      data workloads
>>> 
>>> This community has enough day-to-day practitioners on board to
>>> accumulate a near-complete picture of where the technology is moving.
>>> And instead of wobbling in a backwash, let's see if we can be smart and
>>> define this landscape. After all, Bigtop adopted Spark well before any of
>>> the commercial vendors officially accepted it. We seemingly are moving more
>>> and more into the in-memory realm of data processing: Apache Ignite
>>> (GridGain), Tachyon, Spark. I don't know how many legs Hive has left, but I
>>> am doubtful that it can walk for much longer... Maybe it's just me.
>>> 
>>> In this thread http://is.gd/MV2BH9 we already discussed some of the aspects
>>> influencing the future of this project. And we are de facto working on the
>>> implementation. In my opinion, Hadoop has been more or less commoditized
>>> already. And it isn't a bad thing, but it means that the innovations are
>>> elsewhere. E.g. Spark is moving beyond its ties with the storage layer via
>>> the Tachyon abstraction; GridGain simply doesn't care what the underlying
>>> storage is. However, data needs to be stored somewhere before it can be
>>> processed. And HCFS seems to be fitting the bill OK. But, as I said already,
>>> I see the real action elsewhere. If I were to define the shape of our mid- to
>>> long'ish term roadmap it'd be something like this:
>>> 
>>>             ^   Dashboard/Visualization  ^
>>>             |     OLTP/ML processing     |
>>>             |    Caching/Acceleration    |
>>>             |         Storage            |
>>> 
>>> And around this we can add/improve on deployment (R8???),
>>> virtualization/containers/clouds.  In other words - let's focus on the
>>> vertical part of the stack, instead of simply supporting the status quo.
>>> 
>>> Does Cassandra fit the Storage layer in that model? I don't know and, more
>>> importantly, I don't care. If there's interest and manpower to have a
>>> Cassandra-based stack - sure, but perhaps let's do it as a separate branch or
>>> something, so we aren't over-complicating things. As Roman said earlier, in
>>> this case it'd be great to engage Cassandra/DataStax people in this project.
>>> But something tells me they won't be eager to jump on board.
>>> 
>>> And finally, all of the above leads to "how": how can we start reshaping the
>>> stack into its next incarnation? Perhaps the Ubuntu model might be an answer
>>> for that, but we have discussed that elsewhere and dropped the idea as it
>>> wasn't feasible back in the day. Perhaps its time has just come?
>>> 
>>> Apologies for a long post.
>>>   Cos
>>> 
>>> 
>>> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
>>> > Which other projects depend on BigTop?  How will the questions about the
>>> > direction of BigTop affect those projects?
>>> >
>>> > On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <[email protected]>
>>> > wrote:
>>> >
>>> > > Hi!
>>> > >
>>> > > On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <[email protected]>
>>> > > wrote:
>>> > > > hi bigtop !
>>> > > >
>>> > > > I thought I'd start a thread on a few vaguely related thoughts I have
>>> > > > around the next couple of iterations of bigtop.
>>> > >
>>> > > I think in general I see two major ways for something like
>>> > > Bigtop to evolve:
>>> > >    #1 remain a 'box of LEGO bricks' with very little opinion on
>>> > >         how these pieces need to be integrated
>>> > >    #2 start driving opinionated use-cases for the particular kind of
>>> > >         bigdata workloads
>>> > >
>>> > > #1 is sort of what all of the Linux distros have been doing for
>>> > > the majority of time they existed. #2 is close to what CentOS
>>> > > is doing with SIGs.
>>> > >
>>> > > Honestly, given the size of our community so far and a total
>>> > > lack of corporate backing (with a small exception of Cloudera
>>> > > still paying for our EC2 time) I think #1 is all we can do. I'd
>>> > > love to be wrong, though.
>>> > >
>>> > > > 1) Hive:  How will bigtop evolve to support it, now that it is much
>>> > > > more than a mapreduce query wrapper?
>>> > >
>>> > > I think Hive will remain a big part of Hadoop workloads for the
>>> > > foreseeable future. What I'd love to see more of is rationalizing things
>>> > > like how HCatalog, etc. need to be deployed.
>>> > >
>>> > > > 2) I wonder whether we should confirm cassandra interoperability of
>>> > > > spark in bigtop distros,
>>> > >
>>> > > Only if there's significant interest from the cassandra community, and
>>> > > even then my biggest fear is that with cassandra we're totally changing
>>> > > the requirements for the underlying storage subsystem (nothing wrong with
>>> > > that, it's just that in the Hadoop ecosystem everything assumes very
>>> > > HDFS'ish requirements for the scale-out storage).
>>> > >
>>> > > > 4) in general, i think bigtop can move in one of 3 directions.
>>> > > >
>>> > > >   EXPAND ? : Expanding to include new components, with just basic
>>> > > > interop, and let folks evolve their own stacks on top of bigtop on
>>> > > > their own.
>>> > > >
>>> > > >   CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
>>> > > > components, with super high quality.
>>> > > >
>>> > > >   STAY THE COURSE ? Staying the same ~ a packaging platform for just
>>> > > > hadoop's direct ecosystem.
>>> > > >
>>> > > > I am intrigued by the idea of A and B; both have clear benefits and
>>> > > > costs... would like to see the opinions of folks --- do we lean in one
>>> > > > direction or another? What are the criteria for adding a new feature,
>>> > > > package, stack to bigtop?
>>> > > >
>>> > > > ... Or maybe im just overthinking it and should be spending this time
>>> > > > testing spark for 0.9 release....
>>> > >
>>> > > I'd love to know what others think, but for 0.9 I'd rather stay the
>>> > > course.
>>> > >
>>> > > Thanks,
>>> > > Roman.
>>> > >
>>> > > P.S. There are also market forces at play that may fundamentally change
>>> > > the focus of what we're all working on in the next year or so.
>>> > >
>> 
>> 
>> 
>> -- 
>> Best regards,
>> 
>>    - Andy
>> 
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein 
>> (via Tom White)
>
