Hi All,

As I’ve been playing with and learning about Drill, it struck me that Drill is 
a wonderful “industrial strength” query engine, but that the client API is a 
bit complex if all an app wants to do is execute a few queries. I wondered if 
we need an adapter between the full-blown columnar, asynchronous RPC that 
Drill uses internally and the row-based, synchronous API that most apps know 
and love.

In thinking about a simpler client API, a few items came to mind:

- We have the JDBC API for Java apps, but the current JDBC driver is built on 
the Drill client, and so the JDBC jar is quite big (20MB).

- The current client API is not versioned, requiring clients to be upgraded in 
lock-step with servers. Many admins, however, find it necessary to upgrade 
clients on a schedule different from that of the server. (Imagine upgrading 
dozens of desktop users at the same time as the Drill cluster.) Many of the 
traditional DB products version their interfaces to simplify this task.

- A cool feature of Drill is schema-on-read, which means Drill may encounter 
different schemas as data is read. At present, it is a bit hard for clients to 
consume different schemas. It turns out, however, that stored procedures 
provide something similar (multiple result sets), and we could leverage that 
idea to make schema changes a first-class feature of the API.

Playing around a bit in my spare time, I found that we can grab lots of ideas 
from “traditional” DB APIs to solve the above problems (and more):

- A simplified client API provides a row-based view of results, with schema 
changes as a first-class API concept.
- A “direct” version of the client can sit directly on top of the Drill Client, 
much like the current JDBC driver.
- Because the client API is simple, it is easy to create a new wire protocol to 
carry the required row-based client messages.
- That wire protocol enables a very light-weight remote version of the client 
API.
- A new server implements the server-side of the new wire protocol. The server 
is an adapter: it converts the “retail” row-based API into the “wholesale” 
columnar API of Drill.
- A new JDBC implementation uses the remote API instead of directly using the 
Drill Client API.
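
To make the adapter idea above concrete, here is a minimal sketch of a 
row-based cursor layered over columnar batches, where a schema change shows up 
as a new result set (the same pattern JDBC uses for multiple result sets). 
Every name here (RowAdapterSketch, Batch, Cursor, nextResultSet) is made up 
for illustration; this is not the Jig API or the Drill client API, and the 
in-memory batches stand in for whatever the server would actually send.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch; these names are illustrative and are not the Jig API.
public class RowAdapterSketch {

    /** One columnar batch: column names plus one value array per column. */
    static final class Batch {
        final List<String> schema;
        final Object[][] columns; // columns[col][row]
        Batch(List<String> schema, Object[][] columns) {
            this.schema = schema;
            this.columns = columns;
        }

        int rowCount() { return columns.length == 0 ? 0 : columns[0].length; }
    }

    /** Row-based cursor; a schema change starts a new result set. */
    static final class Cursor {
        private final Iterator<Batch> batches;
        private Batch current;  // batch the cursor is positioned in
        private Batch pending;  // read-ahead batch whose schema differs
        private int row = -1;

        Cursor(Iterator<Batch> batches) { this.batches = batches; }

        /** Moves to the next result set, skipping unread rows of the current one. */
        boolean nextResultSet() {
            while (next()) { /* discard remaining rows */ }
            if (pending != null) {
                current = pending;
                pending = null;
            } else if (current == null && batches.hasNext()) {
                current = batches.next();
            } else {
                return false;
            }
            row = -1;
            return true;
        }

        /** Moves to the next row of the current result set. */
        boolean next() {
            if (current == null) return false;
            if (row + 1 < current.rowCount()) {
                row++;
                return true;
            }
            while (pending == null && batches.hasNext()) {
                Batch b = batches.next();
                if (!b.schema.equals(current.schema)) {
                    pending = b; // schema changed: end of this result set
                } else if (b.rowCount() > 0) {
                    current = b;
                    row = 0;
                    return true;
                }
            }
            return false;
        }

        List<String> schema() { return current.schema; }
        Object get(String column) { return current.columns[current.schema.indexOf(column)][row]; }
    }

    /** Three in-memory batches; the third one adds a "city" column. */
    static List<Batch> sampleBatches() {
        return Arrays.asList(
            new Batch(Arrays.asList("id", "name"), new Object[][] {{1, 2}, {"a", "b"}}),
            new Batch(Arrays.asList("id", "name"), new Object[][] {{3}, {"c"}}),
            new Batch(Arrays.asList("id", "name", "city"), new Object[][] {{4}, {"d"}, {"SF"}}));
    }

    /** Walks every result set and row, returning one line per schema or row. */
    static List<String> render(Iterator<Batch> batches) {
        Cursor cursor = new Cursor(batches);
        List<String> lines = new ArrayList<>();
        while (cursor.nextResultSet()) {
            lines.add("schema=" + cursor.schema());
            while (cursor.next()) {
                List<Object> values = new ArrayList<>();
                for (String column : cursor.schema()) {
                    values.add(cursor.get(column));
                }
                lines.add(values.toString());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        render(sampleBatches().iterator()).forEach(System.out::println);
    }
}
```

With the sample data, the first two batches share a schema and so read as one 
result set of three rows; the third batch adds a column and so starts a second 
result set. The app only ever sees two nested loops, which is the "retail" 
view; the batch boundaries stay hidden inside the cursor.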

Because the remote client has no dependencies on Drill (or, indeed, anything 
other than the JDK), it is very small.  Indeed, the revised JDBC jar is about 
1% of the size of the existing JDBC driver. (200KB instead of 20MB.)

The result is a little prototype project called “Jig”. I’d like to toss it out 
to the community to see if this is something of interest to others. The code 
works just well enough to prove the concept, though I’ve left off the more 
“advanced” data types, multiple cursors per connection, and other details.

The advantage for Java users is a simpler API, smaller JDBC driver, fewer 
dependencies and cross-version compatibility.

If we add clients in other languages, then just about any language can easily 
query Drill without a Java or ODBC bridge. This would be handy for that Caravel 
integration project discussed here a month or so back. Also for data scientists 
who prefer Python or R.

In case there is interest in this idea, a more detailed proposal is available: 
https://docs.google.com/document/d/1TpJOEUO-DBDGIidOML2_InpJ-fK4yHmsbV5ncqXT6pM

The code is in a GitHub repo: https://github.com/paul-rogers/drill-jig

The JIRA for this enhancement: DRILL-4791: 
https://issues.apache.org/jira/browse/DRILL-4791

This has been a great little learning exercise. Is this something that we 
might want to take further? Thoughts on the approach taken?

Thanks,

- Paul

