First, I want to address RJ's question:

The most prominent downstream dependents of Bigtop are the commercial
Hadoop distributions such as HDP and CDH. The former is trying to
disguise its affiliation by pushing Ambari forward, and Cloudera is seemingly
shifting its focus to compressed tarball media (aka parcels), which require
a closed-source solution like Cloudera Manager to deploy and control your
cluster, effectively rendering it useless if you ever decide to uninstall the
control software. In the interest of full disclosure, I don't think parcels
have any chance of shifting the industry consensus away from Linux
packaging towards something so obscure and proprietary.


And now to my actual points:

I strongly believe that Bigtop was and is the only way of building your stack
from the ground up, and of deploying and controlling it any way you want to,
that is completely transparent, vendor-friendly, and sticks 100% to official
ASF product releases. I agree with Roman's presentation of how this project
can move forward. However, I somewhat disagree with his view of the prospects.
Driving the opinion of the community might be a hard road, but it is the high
road.

We are definitely small and mostly unsupported by the commercial groups that
are using the framework. Being a box of LEGO won't win us anything. If
anything, the empirical evidence is against it, as commercial distros have
decided to move towards their own means of "vendor lock-in" (yes, you heard me
right, that's exactly what I said: all the so-called open-source companies
have invented a way to lock in their customers, either with fancy "enterprise
features" that don't add to but amend the underlying stack, or with a custom
set of patches that often render clusters incompatible between different
vendors).

By all means, my money is on the second way, yet slightly modified (as
use cases come from users, not developers):
  #2 start driving adoption of software stacks for particular kinds of
     data workloads

This community has enough day-to-day practitioners on board to
accumulate a near-complete picture of where the technology is moving.
And instead of wobbling in the backwash, let's see if we can be smart and
define this landscape. After all, Bigtop adopted Spark well before any of the
commercial vendors officially accepted it. We seem to be moving more and
more into the in-memory realm of data processing: Apache Ignite (GridGain),
Tachyon, Spark. I don't know whether Hive still has legs, but I doubt
that it can walk for much longer... Maybe it's just me.

In this thread http://is.gd/MV2BH9 we already discussed some of the aspects
influencing the future of this project. And we are de facto working on the
implementation. In my opinion, Hadoop has been more or less commoditized
already. That isn't a bad thing, but it means that the innovation is
elsewhere. E.g. Spark is moving beyond its ties with the storage layer via
the Tachyon abstraction; GridGain simply doesn't care what the underlying
storage is. However, data needs to be stored somewhere before it can be
processed, and HCFS seems to fit the bill OK. But, as I said already, I see
the real action elsewhere. If I were to define the shape of our mid- to
longish-term roadmap, it'd be something like this:

            ^   Dashboard/Visualization  ^
            |     OLTP/ML processing     |
            |    Caching/Acceleration    |
            |         Storage            |

And around this we can add/improve on deployment (R8???),
virtualization/containers/clouds. In other words, let's focus on the
vertical part of the stack instead of simply supporting the status quo.

Does Cassandra fit the Storage layer in that model? I don't know and, most
importantly, I don't care. If there's interest and manpower to have a
Cassandra-based stack, sure, but perhaps let's do it as a separate branch or
something, so we aren't over-complicating things. As Roman said earlier, in
this case it'd be great to engage Cassandra/DataStax people in this project.
But something tells me they won't be eager to jump on board.

And finally, all of the above leads to the "how": how can we start reshaping
the stack into its next incarnation? Perhaps the Ubuntu model might be an
answer to that, but we have discussed it elsewhere and dropped the idea, as it
wasn't feasible back in the day. Perhaps its time has just come?

Apologies for a long post.
  Cos


On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
> Which other projects depend on BigTop?  How will the questions about the
> direction of BigTop affect those projects?
> 
> On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <[email protected]>
> wrote:
> 
> > Hi!
> >
> > On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <[email protected]>
> > wrote:
> > > hi bigtop !
> > >
> > > I thought id start a thread a few vaguely related thoughts i have around
> > > next couple iterations of bigtop.
> >
> > I think in general I see two major ways for something like
> > Bigtop to evolve:
> >    #1 remain a 'box of LEGO bricks' with very little opinion on
> >         how these pieces need to be integrated
> >    #2 start driving oppinioned use-cases for the particular kind of
> >         bigdata workloads
> >
> > #1 is sort of what all of the Linux distros have been doing for
> > the majority of time they existed. #2 is close to what CentOS
> > is doing with SIGs.
> >
> > Honestly, given the size of our community so far and a total
> > lack of corporate backing (with a small exception of Cloudera
> > still paying for our EC2 time) I think #1 is all we can do. I'd
> > love to be wrong, though.
> >
> > > 1) Hive:  How will bigtop to evolve to support it, now that it is much
> > > more than a mapreduce query wrapper?
> >
> > I think Hive will remain a big part of Hadoop workloads for forseeable
> > future. What I'd love to see more of is rationalizing things like how
> > HCatalog, etc. need to be deployed.
> >
> > > 2) I wonder wether we should confirm cassandra interoperability of spark
> > > in bigtop distros,
> >
> > Only if there's a significant interest from cassandra community and even
> > then my biggest fear is that with cassandra we're totally changing the
> > requirements for the underlying storage subsystem (nothing wrong with
> > that, its just that in Hadoop ecosystem everything assumes very HDFS'ish
> > requirements for the scale-out storage).
> >
> > > 4) in general, i think bigtop can move in one of 3 directions.
> > >
> > >   EXPAND ? : Expanding to include new components, with just basic
> > > interop, and let folks evolve their own stacks on top of bigtop on their
> > > own.
> > >
> > >   CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
> > > components, with super high quality.
> > >
> > >   STAY THE COURSE ? Staying the same ~ a packaging platform for just
> > > hadoop's direct ecosystem.
> > >
> > > I am intrigued by the idea of A and B both have clear benefits and
> > > costs... would like to see the opinions of folks --- do we  lean in one
> > > direction or another? What is the criteria for adding a new feature,
> > > package, stack to bigtop?
> > >
> > > ... Or maybe im just overthinking it and should be spending this time
> > > testing spark for 0.9 release....
> >
> > I'd love to know what other think, but for 0.9 I'd rather stay the course.
> >
> > Thanks,
> > Roman.
> >
> > P.S. There are also market forces at play that may fundamentally change
> > the focus of what we're all working on in the year or so.
> >
