Either via extension point or OSGi service, yes (a rough sketch of the extension-point wiring is at the bottom of this message, below the quoted thread). While I agree we would like user involvement to be minimal (even compared to SVN), the user will be involved in a few ways:
1) Encountering a situation where a connector is needed that is not present. Hopefully, this can be presented to the user in unambiguous and clear terms.
2) Locating the needed connector.
3) Installing it.

We can avoid this most of the time by including a full range of connectors with the original update site. And if it's just a new version of Hadoop, a simple update from the original site can handle it. If we can induce Cloudera to supply the connector for their releases, even better. But the potential will always be there for the need for a connector to some different environment. So to that extent, I think a connector is a user-visible concept.

I hope we can avoid the situation where the user has to make a decision about which connector to use. For that to work reliably, we'd need to be sure the network protocols provide sufficient information to clearly identify the target environment. There may be some issues to resolve there on the Hadoop side.

I think we should consider SSH, local, and grid connectors as well. Not as core offerings (unless people find them very useful, perhaps in administering a cluster). I think of them as outlier cases that would help keep the architecture honest, forcing us to consider, "if this functionality is not available, what should we do?" And I have a suspicion that we'd find that very useful with our products.

Note that I'm not suggesting we provide full functionality, but rather that we provide API and UI that, at the level of "run a job" or "see if the job has finished" or "look at files", can be agnostic about the back end and its specific capabilities. This can lead to an architecture where those specific capabilities are exposed incrementally by the individual connector, rather than by an upgrade of the core.

For example, I think connectors should provide metadata that allows configuration UI fragments to be constructed, so a wizard to submit a job can expose options that are specific to the target. (Metadata would be better than supplying actual UI fragments, because it allows more reuse, for example in a web-based interface, and helps ensure consistency.)

Also, I think we should expose a clean API that gives access to the functionality made available by the connectors, so it can be used by other tools besides our own. For example, at FICO, this would let us more closely integrate with our tools, and would encourage people to contribute additional useful tools.

I think the connector architecture should also support parts of the ecosystem beyond MapReduce and HDFS, for example Hive and HBase. This might be as simple as a design pattern and a shared registry for connectors. We could drive this with separate connectors for MapReduce and HDFS, rather than one monolithic connector. (We would not need to package the MapReduce and HDFS connectors as two bundles -- just register two connectors to service the two functional areas.)

I think all connectors would want to work with a shared description of the target environment. In fact, you'd like to start up by loading a descriptor that describes the target environment, and have that drive selection of the proper connectors from the available set. (To what extent the existing configuration files meet the need is an open question, but a pointer to a directory containing them would do as a starting point.) The point is, configuration data describes the target, and so should be conceptually shared, rather than specified (with partial redundancy) for each type of connector.
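To make the metadata and shared-descriptor ideas a little more concrete, here is a rough Java sketch. Every name in it (Connector, ConnectorParameter, TargetDescriptor, ConnectorRegistry) is something I made up for illustration, not proposed API:

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical description of one configuration option a connector exposes.
// A wizard (or a web UI) could render these generically, instead of the
// connector shipping its own UI fragments.
final class ConnectorParameter {
    final String key;          // e.g. a made-up "mapreduce.queue"
    final String label;        // human-readable label for the UI
    final String type;         // "string", "int", "boolean", ...
    final String defaultValue; // may be null

    ConnectorParameter(String key, String label, String type, String defaultValue) {
        this.key = key;
        this.label = label;
        this.type = type;
        this.defaultValue = defaultValue;
    }
}

// Hypothetical shared description of the target environment, loaded once at
// startup (perhaps from a directory of existing Hadoop config files) and
// handed to every connector, so the configuration is not repeated per
// connector type.
final class TargetDescriptor {
    private final Properties props = new Properties();

    String get(String key) { return props.getProperty(key); }
    void put(String key, String value) { props.setProperty(key, value); }
}

// Hypothetical common contract: one implementation per functional area
// (MapReduce, HDFS, later Hive/HBase) and per supported version range.
interface Connector {
    String functionalArea();                   // "mapreduce", "hdfs", ...
    boolean supports(TargetDescriptor target); // can I talk to this environment?
    List<ConnectorParameter> parameters();     // drives generic config UI
    void connect(TargetDescriptor target) throws Exception;
}

// Hypothetical shared registry. The MapReduce and HDFS connectors could live
// in a single bundle; they would simply register separately here for their
// two functional areas.
final class ConnectorRegistry {
    private final List<Connector> connectors = new ArrayList<Connector>();

    void register(Connector c) { connectors.add(c); }

    // Pick the first registered connector for an area that claims to support
    // the described target environment.
    Connector select(String area, TargetDescriptor target) {
        for (Connector c : connectors) {
            if (c.functionalArea().equals(area) && c.supports(target)) {
                return c;
            }
        }
        return null; // caller decides how to surface "no connector available"
    }
}

The supports() check is where the static-versus-dynamic question below bites: it could trust a version declared in the descriptor, probe the wire, or some combination of the two.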
Exactly how this bootstraps, and how much configuration lives in a static descriptor vs. dynamic discovery (e.g., OK, we have Hadoop, but which version? OK, we have Cassandra, but which version?) is something we'd need to work out. If you check the network protocol to discover the protocol version, you have a chicken-and-egg problem. If you get it from a static specification, you have maintenance and accuracy headaches. We'd like connectors and the core HDT to be as independent of each other as possible.

On Fri, Jan 11, 2013 at 11:27 AM, Adam Berry <[email protected]> wrote:
> I think we will need a discussion on how to work the connector/multiple
> version side of this. There are a couple of different ways, and a few
> subtle issues.
>
> But, I think however we actually supply the implementation for versions of
> hadoop, they should be contributed via extension points. This would make it
> easiest to add newer versions down the line, and would also make it
> possible for connectors that support things like CDH, if anything special
> is required. Not that I'm suggesting that we would do these vendor
> connectors, just that it makes most sense to make these extensible.
>
> On Fri, Jan 11, 2013 at 10:02 AM, Simone Gianni <[email protected]> wrote:
>
> > I'm also +1 for Tycho.
> >
> > When talking about connectors, its IMHO fine to use them as an internal
> > abstraction, but I don't know if it's a good idea (or useful) to expose
> > connectors to the user as SVN plugins do (they need to do that for
> > licence problems). Were you talking about internal abstractions for
> > different versions?
> >
> > Simone
> >
> >
> > 2013/1/10 Adam Berry <[email protected]>
> >
> > > Hi everyone,
> > >
> > > First, I've dropped the code from Hadoop contrib into our git repo,
> > > its on its own branch, hadoop-contrib. The reason I put it on a branch
> > > is because I think that splitting things up a little would be a good
> > > idea, and should make it a little easier to support multiple versions
> > > of Hadoop.
> > >
> > > So, the tools as they stand are just in one plugin. Broadly, the
> > > features right now can be divided into;
> > >
> > > MapReduce project and class code support (wizards etc)
> > > Launch support for Hadoop
> > > HDFS interaction
> > >
> > > So taking a root name space of org.apache.hdt, I suggest something
> > > like the following for the plugin names
> > >
> > > org.apache.hdt.core
> > >
> > > org.apache.hdt.ui
> > >
> > > org.apache.hdt.debug.core
> > > org.apache.hdt.debug.ui
> > >
> > > org.apache.hdt.hdfs.core
> > > org.apache.hdt.hdfs.ui
> > >
> > > org.apache.hdt.help
> > >
> > > These may be a little fluid as we get into the details here, but from
> > > 10000 feet it looks ok.
> > >
> > > Finally, I would also like to suggest Tycho (Maven plugin for doing
> > > Eclipse build stuff) as our build tool. I've done my fair share of
> > > pure Ant PDE build stuff over the years, and Tycho is vastly easier,
> > > and would make it much easier for people to build themselves without
> > > having to do a bunch of local setup first.
> > >
> > > Thoughts? If everyone thinks these are ok, I'll enter some issues and
> > > get cracking.
> > >
> > > Cheers,
> > > Adam
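P.S. For the extension-point flavour I mentioned at the top (and that Adam describes above), the wiring on the core side could look roughly like the sketch below, reusing the hypothetical Connector interface from the earlier sketch. The extension point ID org.apache.hdt.core.connector and the attribute names are invented for illustration, not settled API:

import java.util.ArrayList;
import java.util.List;

import org.eclipse.core.runtime.CoreException;
import org.eclipse.core.runtime.IConfigurationElement;
import org.eclipse.core.runtime.Platform;

// Reads connector contributions from a hypothetical extension point. A
// contributing plugin (ours or a vendor's) would declare something like:
//
//   <extension point="org.apache.hdt.core.connector">
//     <connector area="hdfs"
//                versions="1.0.x"
//                class="com.example.Hadoop1HdfsConnector"/>
//   </extension>
//
public final class ConnectorExtensionLoader {

    private static final String EXTENSION_POINT_ID = "org.apache.hdt.core.connector";

    public List<Connector> loadConnectors() {
        List<Connector> result = new ArrayList<Connector>();
        IConfigurationElement[] elements = Platform.getExtensionRegistry()
                .getConfigurationElementsFor(EXTENSION_POINT_ID);
        for (IConfigurationElement element : elements) {
            try {
                // Instantiates the class named in the "class" attribute using
                // the contributing plugin's classloader, so a connector can
                // ship its own Hadoop client jars.
                Object contributed = element.createExecutableExtension("class");
                if (contributed instanceof Connector) {
                    result.add((Connector) contributed);
                }
            } catch (CoreException e) {
                // A broken contribution shouldn't take down the core;
                // log it and keep going.
                e.printStackTrace();
            }
        }
        return result;
    }
}

An OSGi declarative service could do the same job; the main difference is whether contributions are declared in plugin.xml or as service components, and either way the core never needs a compile-time dependency on any particular Hadoop version.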
