I'd like to propose that we move most of the storage plug-ins out of the main Drill codebase/repo.
The storage plug-ins don't belong with the Drill code. They can live anywhere, and can have independent releases. This allows them to be more closely aligned with the storage systems they represent, and removes the need for review by Drill committers who may not have any expertise with the target storage systems.

Here's an example. We have a customer who wants to use current Drill with an older version of Hive (0.13). My understanding is that the current Hive plug-in can only work with newer versions of Hive (1.0+) because of Hive API incompatibilities. However, unless the storage plug-in interface has changed in that time, there's no reason why they shouldn't be able to use the old 0.13 plug-in with current Drill. It's just not built and packaged that way. If the plug-in source were separate, then it would be easier to just use the old plug-in.

Another example, again involving Hive. We have a customer who has two Hive clusters of different versions (because they belong to different departments). They want to use current Drill to join data between the two. Given the Hive API incompatibilities, I've suggested that we find a way to use both versions of the Hive plug-in (configured with different workspace prefixes) at the same time. Assuming the storage plug-in interface hasn't changed in that time, it seems like this should work. (The Hive folks have mentioned that there may be library dependency incompatibilities between the two plug-in implementations, but if that happens, it seems like we should be able to handle it by loading each storage plug-in with its own class loader.)

I would suggest that the Drill source only keep a few basic plug-ins, such as the text/csv and text/json ones, and possibly the parquet one. These don't depend on anything other than the file system, and are useful for immediate testing. Other plug-ins (e.g., Hive, MongoDB, Cassandra, etc.) should live somewhere else, and have their own independent existence.
The other plug-ins can then be released on their own schedule, possibly coinciding with significant changes to the storage systems they provide access to.

Assuming we do this, there are some logistics questions:

(*) Where do we put the source? Does the Apache process have a provision for plug-in architectures like this? Or would each plug-in have to go through the whole incubation process? That seems pretty heavyweight for these, so is there something else? Or should people just put them on GitHub (or their own favorite public repo)?

(*) Versioning the storage plug-in interface. It's possible that the storage plug-in interface will change over time. (I think there are already plans to make it possible to get more metadata and/or statistics from a storage system, if it supports them, in order to do a better job of optimizing queries.) The interface is really the only thing whose version matters here, so we need to take some steps to handle that. We might add a version annotation to it, or require that each new version use a new name, or add a digit suffix to the name. Other ideas? Ideally we'd have adapters that allow the use of older plug-ins (with less capable interfaces) with newer Drill, so that users aren't held back from updating Drill if there aren't newer plug-ins for their storage system of choice.
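
To make the annotation-plus-adapter idea a bit more concrete, here is a rough Java sketch. None of these names are Drill's real types: PluginApiVersion, StoragePluginV1, StoragePluginV2, V1Adapter and OldHivePlugin are all made up for illustration, and the real storage plug-in interface is considerably larger. It only shows the shape of the approach: a plug-in declares which interface version it was written against, and the loader wraps older plug-ins so that newer code can still call them.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class PluginVersioningSketch {

    /** Hypothetical marker: which interface version a plug-in class was written against. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface PluginApiVersion {
        int value();
    }

    /** Hypothetical older interface: table discovery only. */
    interface StoragePluginV1 {
        List<String> listTables();
    }

    /** Hypothetical newer interface: adds optional statistics for the planner. */
    interface StoragePluginV2 extends StoragePluginV1 {
        Optional<Long> estimatedRowCount(String table);
    }

    /** Wraps a V1 plug-in so newer code can use it; statistics are reported as unknown. */
    static final class V1Adapter implements StoragePluginV2 {
        private final StoragePluginV1 delegate;

        V1Adapter(StoragePluginV1 delegate) {
            this.delegate = delegate;
        }

        @Override
        public List<String> listTables() {
            return delegate.listTables();
        }

        @Override
        public Optional<Long> estimatedRowCount(String table) {
            return Optional.empty(); // an old plug-in cannot supply statistics
        }
    }

    /** Reads the declared version from the plug-in class and adapts it if needed. */
    static StoragePluginV2 adapt(Object plugin) {
        PluginApiVersion v = plugin.getClass().getAnnotation(PluginApiVersion.class);
        if (v == null) {
            throw new IllegalArgumentException("plug-in does not declare an API version");
        }
        switch (v.value()) {
            case 1:
                return new V1Adapter((StoragePluginV1) plugin);
            case 2:
                return (StoragePluginV2) plugin;
            default:
                throw new IllegalArgumentException("unsupported plug-in API version: " + v.value());
        }
    }

    /** Stand-in for an old plug-in built against the V1 interface. */
    @PluginApiVersion(1)
    static final class OldHivePlugin implements StoragePluginV1 {
        @Override
        public List<String> listTables() {
            return Arrays.asList("orders", "customers");
        }
    }

    public static void main(String[] args) {
        StoragePluginV2 plugin = adapt(new OldHivePlugin());
        System.out.println(plugin.listTables());                            // [orders, customers]
        System.out.println(plugin.estimatedRowCount("orders").isPresent()); // false
    }
}
```

With something along these lines, the planner would only ever program against the newest interface and treat missing statistics as "unknown", which is what it has to do today anyway. The sketch says nothing about the dependency-conflict problem; that would still need separate class loaders per plug-in.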
