[DISCUSS] Most storage plug-ins don't belong in the Drill repo

Chris Westin Mon, 24 Aug 2015 14:10:21 -0700

I'd like to propose that we move most of the storage plug-ins out of the
main Drill codebase/repo.


The storage plug-ins don't belong with the Drill code. They can live
anywhere and can have independent releases. This allows them to be more
aligned with the storage systems they represent, and removes the need for
review by Drill committers who may not have any expertise with the target
storage systems.

Here's an example. We have a customer who wants to use current Drill with
an older version of Hive (0.13). My understanding is that the current Hive
plug-in can only work with newer versions of Hive (1.0+) because of Hive API
incompatibilities. However, unless the storage plug-in interface has changed
in that time, there's no reason why they shouldn't be able to use the old
0.13 plug-in with current Drill. It's just not built and packaged that way.
If the plug-in source were separate, then it would be easier to just use the
old plug-in.

Another example, again involving Hive. We have a customer who has two Hive
clusters of different versions (because they belong to different
departments). They want to use current Drill to join data between the two.
Given the Hive API incompatibilities, I've suggested that we find a way to
use both versions of the Hive plug-in (configured with different workspace
prefixes) at the same time. Assuming the storage plug-in interface hasn't
changed in that time, it seems like this should work. (The Hive folks have
mentioned that there may be library dependency incompatibilities between the
two plug-in implementations, but if that happens, it seems like we should be
able to handle it by adjusting Drill to use separate class loaders for the
storage plug-ins.)
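
To make the class loader idea concrete, here is a minimal sketch, not
anything Drill does today. It assumes each plug-in version is unpacked into
its own directory of jars; the class name PluginLoaders and the method
forDirectory are hypothetical and chosen only for illustration.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: one class loader per plug-in directory.
final class PluginLoaders {
    private PluginLoaders() {}

    /**
     * Builds a class loader that sees only the *.jar files in pluginDir,
     * plus whatever the parent loader exposes (for example, the storage
     * plug-in interface itself).
     */
    static URLClassLoader forDirectory(Path pluginDir, ClassLoader parent)
            throws IOException {
        List<URL> jars = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(pluginDir, "*.jar")) {
            for (Path jar : stream) {
                jars.add(jar.toUri().toURL());
            }
        }
        return new URLClassLoader(jars.toArray(new URL[0]), parent);
    }
}
```

With one loader for the 0.13 directory and another for the 1.0 directory,
classes with the same name come from different loaders and don't collide.
URLClassLoader delegates to its parent first, so this only helps if the Hive
libraries themselves are kept off the parent's class path.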

I would suggest that the Drill source only keep a few basic plug-ins, such
as the text/csv, text/json ones, and possibly the parquet one. These don't
depend on anything other than the file system, and are useful for immediate
testing. Other plug-ins (e.g., Hive, MongoDB, Cassandra, etc.) should live
somewhere else, and have their own independent existence. The other plug-ins
can then be released on their own schedule, possibly coinciding with
significant changes to the storage systems they provide access to.

Assuming we do this, there are some logistics questions:
(*) Where do we put the source?
Does the Apache process have a provision for plug-in architectures like
this? Or would each plug-in have to go through the whole incubation process?
That seems pretty heavyweight for these, so is there something else? Or
should people just put them on GitHub (or their own favorite public repo)?

(*) Versioning the storage plug-in interface.
It's possible that the storage plug-in interface will change over time. (I
think there are already plans to make it possible to get more metadata
and/or statistics from a storage system, if it supports them, in order to do
a better job of optimizing queries.) The interface is really the only thing
whose version matters here. We need to take some steps to handle that. We
might add an annotation to it, or require that we use a new name, or add a
digit suffix to the name. Other ideas? Ideally we'd have adapters that allow
the use of older plug-ins (with less capable interfaces) with newer Drill so
that users aren't held back from updating Drill if there aren't newer
plug-ins for their storage of choice.
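
As a rough illustration of the annotation, digit-suffix, and adapter ideas
together, here is a sketch. Every name in it (PluginApiVersion,
StoragePluginV1, StoragePluginV2, estimatedRowCount, and so on) is made up
for this example and is not Drill's actual interface.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.OptionalLong;

/** Hypothetical marker: which interface version a plug-in was built for. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface PluginApiVersion {
    int value();
}

/** Hypothetical older interface: no statistics. */
interface StoragePluginV1 {
    String name();
}

/** Hypothetical newer interface: adds optional statistics. */
interface StoragePluginV2 {
    String name();
    OptionalLong estimatedRowCount(String table);
}

/** Lets newer code call a V1 plug-in through the V2 interface. */
final class V1ToV2Adapter implements StoragePluginV2 {
    private final StoragePluginV1 delegate;

    V1ToV2Adapter(StoragePluginV1 delegate) {
        this.delegate = delegate;
    }

    @Override public String name() {
        return delegate.name();
    }

    // An old plug-in has no statistics, so report "unknown".
    @Override public OptionalLong estimatedRowCount(String table) {
        return OptionalLong.empty();
    }
}

final class PluginVersions {
    private PluginVersions() {}

    /** Reads the declared version; unannotated classes count as 1. */
    static int declaredVersion(Class<?> pluginClass) {
        PluginApiVersion v = pluginClass.getAnnotation(PluginApiVersion.class);
        return v == null ? 1 : v.value();
    }

    /** Returns a V2 view of the plug-in, wrapping it if it is V1. */
    static StoragePluginV2 asV2(Object plugin) {
        if (plugin instanceof StoragePluginV2) {
            return (StoragePluginV2) plugin;
        }
        if (plugin instanceof StoragePluginV1) {
            return new V1ToV2Adapter((StoragePluginV1) plugin);
        }
        throw new IllegalArgumentException(
            "not a storage plug-in: " + plugin.getClass().getName());
    }
}

/** Stand-in for an old plug-in built against the V1 interface. */
@PluginApiVersion(1)
final class OldHivePlugin implements StoragePluginV1 {
    @Override public String name() {
        return "hive-0.13";
    }
}
```

The optimizer would only ever see StoragePluginV2; an adapted V1 plug-in
just reports that it has no row-count estimate, so the user gets a less
informed plan rather than being blocked from upgrading.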
