Re: Apache Drill and Schema

Julian Hyde Mon, 13 May 2013 13:04:09 -0700

Gunnar,

I've come across some interesting use cases putting Optiq on top of Drill and 
Splunk. They help explain my philosophy for putting a strongly typed language 
(SQL) on top of systems that are weakly typed. If one tried to remove strong 
typing from SQL, one would end up with a different language, just as JavaScript 
is different from Java. But weakly typed systems clearly have their advantages; 
the problem is how to surface those benefits without abandoning SQL.


(For what it's worth, I believe that this philosophy is consistent with 
Jacques' vision for Drill.)

Optiq implements the SQL specification, and as such is strongly typed. Splunk 
and Drill are weakly typed, but in different ways.

In Splunk, there are a few core system fields, but every record can have a 
different set of fields. In a sense, the whole Splunk system is a single table 
with hundreds or thousands of fields, the set of distinct fields across all 
records. And of course that set of fields may change each time a record is 
added or removed.

The most convenient way to model a Splunk table was like this:

CREATE TABLE splunk (
  source VARCHAR NOT NULL,
  sourcetype VARCHAR NOT NULL,
  _extra VARCHAR);

There are two built-in fields "source" and "sourcetype", present in every 
record, and the "_extra ANY" declaration tells Optiq that you can ask for any 
other field. Those extra fields have type VARCHAR (in splunk all fields are 
strings) and their value in SQL will be NULL if they don't exist in a 
particular row. Thus you can write

SELECT s."source", s."sourcetype", s."action", CAST(s."discount" AS INTEGER)
FROM "splunk"."splunk" AS s
WHERE s."action" = 'purchase'

To implement the Optiq SQL front-end to Drill, I added support for the JSON 
object model to SQL by adding MAP and ARRAY type constructors and a variant 
type ANY. Thus Drill's "donuts" example table appears in SQL as

CREATE TABLE donuts (
   _map MAP(VARCHAR, ANY) NOT NULL);

This is strongly typed, and therefore consistent with the SQL standard. Every 
row has a single column called "_map", and its type is MAP(VARCHAR, ANY). 
(Equivalent to a Java field "final Map<String, Object> map;".) If you want to 
access the fields of the "_map" column you use the "[ ... ]" operator and 
casting.

To access the "ppu" field, you could write "SELECT CAST(_map['ppu'] AS INTEGER) 
as ppu FROM donuts". I would like to add syntactic sugar so that you can write 
"SELECT ppu FROM donuts" (implicitly inserting the "_map" field, just as Java 
inserts "this." before field references, and converting "x.field" to 
"x['field']" for any field of type MAP).

If later you know the schema, you can create a view:

CREATE VIEW donuts_typed AS
  SELECT CAST(_map['name'] AS VARCHAR) AS name,
    CAST(_map['ppu'] AS FLOAT) AS ppu
FROM donuts

You can build your applications on top of such views at little or no loss of 
efficiency.

These illustrate ways that we can add strongly-typed standard SQL on top of a 
system with late or no schema. It is still standard SQL, which everyone agrees 
is very important. The sugarings don't damage the core language; if you don't 
like them, don't use them.

The biggest shortfall between the systems I've described above and what you 
seem to want is enumerating what columns are available. (Your examples all 
contain 'select *'.) In strongly typed SQL, if you want to know what columns 
are available, you ask the catalog (i.e. you use metadata). In a weakly typed 
system, you issue a query (i.e. you use data). If you issue "select *" against 
Optiq-for-Splunk or Optiq-for-Drill you will be disappointed. The column list 
will expand to "source, sourcetype" and "_map" respectively, because that is 
the only metadata known to the system before the query is run.

The easiest solution is to ask your users to specify explicit columns rather 
than writing "select *". (Machine-generated queries, such as my own Mondrian 
engine, already tend to do this.) Or, as I suggested earlier, go ahead and 
provide strongly typed views for your users to use. Or, one could build a UI to 
assist writing queries: the UI could suggest which columns to ask for, based on 
sketches of the data, previous query results, and columns that are already used 
in the query.

Julian

Re: Apache Drill and Schema

Reply via email to