Re: [DISCUSS] Most storage plug-ins don't belong in the Drill repo

Jacques Nadeau Tue, 25 Aug 2015 08:33:11 -0700

I don't think we need to have any complex build infrastructure to support
building multiple versions.  Most storage plugins depend on Drill core
rather than the other way around.  As such, you could have 7 different hive
modules that all depend on Drill core and conflict with each other.  As
long as they never source their peers, the testing should work fine.  Each
storage plugin module should be responsible for how to compose itself based
on its needs.


Focusing on Hive specifically: it seems like minor pom variations could be
captured using Maven profiles, and the hive module should then manage those
profiles and decide how to rerun the surefire tests with each one active.
For major changes, I think you need separate modules (for example, HBase
0.94 versus 0.98+).



--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Aug 25, 2015 at 8:14 AM, Jason Altekruse <[email protected]>
wrote:

> I think Jacques has a good point about the code belonging together, but we
> should then talk about how to solve this problem. Users are going to have
> different versions of these datasources that they have deployed, and I
> don't think we should consider it an afterthought as to how they get Drill
> to work with them.
>
> We should make part of our testing suite automate the tasks of building
> against multiple different versions and testing each. While this might
> require some creativity if we are working with changing APIs, I think this
> is a problem that needs to be solved directly with targeted effort to make
> maintaining these dependencies easy.
>
> Although this general goal is not trivial, the last time we upgraded Hive
> it was only a pom.xml file change that was needed [1]. So for starters we
> can simply work on making the build capable of interacting with several
> versions of Hive simultaneously. I am pretty sure our last HBase upgrade
> did require changes to how Drill interacted with the HBase plugin. I don't
> think it is strictly necessary, but I think it would be good to look at how
> we can maintain code that stretches across API versions.
>
> [1] https://github.com/apache/drill/commit/93533835bdcaff018a6b6ee6ea5999f3c5659d70
>
> On Mon, Aug 24, 2015 at 9:30 PM, Jacques Nadeau <[email protected]>
> wrote:
>
> > It seems like you have a couple of requirements:
> >
> > - Support multiple versions of a plugin against a particular system (e.g.
> > Hive, HBase, etc)
> > - Support loading these multiple versions in the same Drillbit
> >
> > I'm entirely in support of these goals and requirements.  We've been
> > talking about adding a classloading containerization system to better
> > encapsulate individual plugins so that we no longer have to choose only
> > one version.  If you want to put together some proposals around this, I
> > think that would be great for the community.
> >
> > On the flipside, I see the idea of taking plugins out of the Drill repo
> > as completely orthogonal to the issues/requirements above.  In fact, it
> > would be a mistake to separate the code at this point.  It wouldn't
> > provide new value to end users and would make Drill harder to use.  It
> > would also lower the quality of the product.
> >
> > As someone who has worked on all of the current storage plugins, I can
> > say the interface is still maturing.  As we integrate new types of data
> > sources, the model around optimization continues to develop.  For
> > example, I'm still working through the enhancements required to support
> > the JDBC interface in the right way, to control which query planning
> > phases certain rules are injected into.  This is a set of core storage
> > plugin enhancements.  By having the code all in one place, I'll make the
> > fixes to any other storage plugins as necessary since I know the meaning
> > of these (to be documented) enhancements.  If we had these as
> > disconnected modules, coordinating this type of change would be very
> > difficult.  There is also substantial value from the other side: by
> > including the storage plugins in the general build, we can also ensure
> > that a core change doesn't have an unintended consequence for those
> > plugins.
> >
> > If storage plugins start to become a huge burden *and* we have found the
> > storage plugin API to be extremely stable, this might make sense with
> > tertiary plugins.  However, for now, I strongly recommend we focus on the
> > items at the top of this email and don't start slicing up the codebase.
> >
> > TL;DR. I'm -1 on the statement "Most storage plug-ins don't belong in the
> > Drill repo".
> >
> > I think that's exactly where they belong.
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Mon, Aug 24, 2015 at 2:09 PM, Chris Westin <[email protected]>
> > wrote:
> >
> > > I'd like to propose that we move most of the storage plug-ins out of
> > > the main Drill codebase/repo.
> > >
> > > The storage plug-ins don't belong with the Drill code. They can live
> > > anywhere, and can have independent releases. This allows them to be
> > > more aligned with the storage systems they represent, and removes the
> > > need for review by Drill committers who may not necessarily have any
> > > expertise with the target storage systems.
> > >
> > > Here's an example. We have a customer who wants to use current Drill
> > > with an older version of Hive (0.13). My understanding is that the
> > > current Hive plug-in can only work with newer versions of Hive (1.0+)
> > > because of API incompatibilities with Hive. However, unless the storage
> > > plug-in interface has changed in that time, there's no reason why they
> > > shouldn't be able to use the old 0.13 plug-in with current Drill. It's
> > > just not built and packaged that way. If the plug-in source were
> > > separate, then it would be easier to just use the old plug-in.
> > >
> > > Another example, again involving Hive. We have a customer who has two
> > > Hive clusters of different versions (because they belong to different
> > > departments). They want to use current Drill to join data between the
> > > two. Given the Hive API incompatibilities, I've suggested that we find
> > > a way to use both versions of the Hive plugin (configured with
> > > different workspace prefixes) at the same time. Assuming the storage
> > > plug-in interface hasn't changed in that time, it seems like this
> > > should work. (The Hive folks have mentioned that there may be library
> > > dependency incompatibilities between the two plug-in implementations,
> > > but it seems like we should be able to handle that with some adjustment
> > > to use separate class loaders for the storage plug-ins, if that
> > > happens.)
> > >
> > > I would suggest that the Drill source only keep a few basic plug-ins,
> > > such as the text/csv, text/json ones, and possibly the parquet one.
> > > These don't depend on anything other than the file system, and are
> > > useful for immediate testing. Other plug-ins (e.g., Hive, MongoDB,
> > > Cassandra) should live somewhere else, and have their own independent
> > > existence. The other plug-ins can then be released on their own
> > > schedule, possibly coinciding with significant changes to the storage
> > > systems they provide access to.
> > >
> > > Assuming we do this, there are some logistics questions:
> > > (*) Where do we put the source?
> > > Does the Apache process have a provision for plug-in architectures like
> > > this? Or would each plug-in have to go through the whole incubation
> > > process? That seems pretty heavyweight for these, so is there something
> > > else? Or should people just put them on GitHub (or their own favorite
> > > public repo)?
> > >
> > > (*) Versioning the storage plug-in interface.
> > > It's possible that the storage plug-in interface will change over time.
> > > (I think there are already plans to make it possible to get more
> > > metadata and/or statistics from a storage system, if it supports them,
> > > in order to do a better job of optimizing queries.) The interface is
> > > really the only thing whose version matters here. We need to take some
> > > steps to handle that. We might add an annotation to it, or require that
> > > we use a new name, or add a digit suffix to the name. Other ideas?
> > > Ideally we'd have adapters that allow the use of older plug-ins (with
> > > less capable interfaces) with newer Drill so that users aren't held
> > > back from updating Drill if there aren't newer plug-ins for their
> > > storage of choice.
> > >
> >
>
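
The "separate class loaders for the storage plug-ins" idea raised in the
thread above can be sketched in Java roughly as follows. This is only a
minimal illustration under stated assumptions, not Drill's actual plug-in
loading code: the class name PluginIsolationSketch, the register method and
the plug-in names "hive13" and "hive1x" are all hypothetical, and the
sketch assumes Java 9+ for ClassLoader.getPlatformClassLoader(). A real
system would also need a parent loader that exposes the shared storage
plug-in interface to every plug-in.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical registry that gives each configured plug-in its own class loader. */
public class PluginIsolationSketch {
    private final Map<String, URLClassLoader> loaders = new LinkedHashMap<>();

    /**
     * Registers a plug-in under a name with its own set of jars.
     * The parent is the platform loader, so JDK classes are shared while
     * nothing from the application classpath leaks into the plug-in.
     */
    public URLClassLoader register(String name, URL... jars) {
        if (loaders.containsKey(name)) {
            throw new IllegalArgumentException("duplicate plug-in name: " + name);
        }
        URLClassLoader loader =
            new URLClassLoader(jars, ClassLoader.getPlatformClassLoader());
        loaders.put(name, loader);
        return loader;
    }

    public static void main(String[] args) throws Exception {
        PluginIsolationSketch registry = new PluginIsolationSketch();
        // Real use would pass the jar URLs of each Hive client version;
        // the jar lists are left empty here so the sketch runs anywhere.
        URLClassLoader hive13 = registry.register("hive13");
        URLClassLoader hive1x = registry.register("hive1x");

        // Each plug-in gets a distinct loader, so same-named library
        // classes loaded from different jars would not collide.
        System.out.println(hive13 != hive1x);                                   // true
        // JDK classes still resolve to the single shared definition.
        System.out.println(hive13.loadClass("java.lang.String") == String.class); // true

        hive13.close();
        hive1x.close();
    }
}
```

The design choice here is parent-first delegation with a deliberately
narrow parent: anything the plug-ins must agree on lives in the parent,
and everything version-specific lives in the per-plug-in jars.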

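The interface-versioning suggestions in the original proposal (an
annotation on the plug-in, plus adapters so older plug-ins keep working
with newer Drill) could look something like the sketch below. Every name
in it is hypothetical and chosen for illustration only: PluginApiVersion,
ExamplePluginV1, ExamplePluginV2 and OldPlugin are not Drill types, and
returning -1 for an unknown row count is just one possible convention.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

public class PluginApiVersionSketch {
    /** Hypothetical marker recording which interface version a plug-in was built against. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    public @interface PluginApiVersion { int value(); }

    /** Older, less capable interface (illustrative only). */
    public interface ExamplePluginV1 { String scan(String table); }

    /** Newer interface that adds a statistics call (illustrative only). */
    public interface ExamplePluginV2 extends ExamplePluginV1 {
        long estimatedRowCount(String table);
    }

    /** A plug-in written against the old interface. */
    @PluginApiVersion(1)
    public static class OldPlugin implements ExamplePluginV1 {
        @Override public String scan(String table) { return "rows of " + table; }
    }

    /** Reads the declared interface version from the annotation. */
    public static int apiVersionOf(Class<?> pluginClass) {
        PluginApiVersion v = pluginClass.getAnnotation(PluginApiVersion.class);
        if (v == null) {
            throw new IllegalArgumentException("no version declared: " + pluginClass);
        }
        return v.value();
    }

    /** Wraps an old plug-in so newer code can treat it as a V2 plug-in. */
    public static ExamplePluginV2 adapt(ExamplePluginV1 plugin) {
        if (plugin instanceof ExamplePluginV2) {
            return (ExamplePluginV2) plugin;
        }
        return new ExamplePluginV2() {
            @Override public String scan(String table) { return plugin.scan(table); }
            // The old plug-in has no statistics, so report "unknown".
            @Override public long estimatedRowCount(String table) { return -1L; }
        };
    }

    public static void main(String[] args) {
        System.out.println(apiVersionOf(OldPlugin.class));                 // 1
        System.out.println(adapt(new OldPlugin()).estimatedRowCount("t")); // -1
    }
}
```

With this shape, the engine only ever talks to the newest interface, and
the adapter is the single place that decides how missing capabilities of
an older plug-in are reported.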