Gunnar,
I've come across some interesting use cases putting Optiq on top of Drill and
Splunk. They help explain my philosophy for putting a strongly typed language
(SQL) on top of systems that are weakly typed. If one tried to remove strong
typing from SQL, one would end up with a different language, just as JavaScript
is different from Java. But weakly typed systems clearly have their advantages;
the problem is how to surface those benefits without abandoning SQL.
(For what it's worth, I believe that this philosophy is consistent with
Jacques' vision for Drill.)
Optiq implements the SQL specification, and as such is strongly typed. Splunk
and Drill are weakly typed, but in different ways.
In Splunk, there are a few core system fields, but every record can have a
different set of fields. In a sense, the whole Splunk system is a single table
with hundreds or thousands of fields, the set of distinct fields across all
records. And of course that set of fields may change each time a record is
added or removed.
The most convenient way to model a Splunk table was like this:
CREATE TABLE splunk (
source VARCHAR NOT NULL,
sourcetype VARCHAR NOT NULL,
_extra VARCHAR);
There are two built-in fields "source" and "sourcetype", present in every
record, and the "_extra ANY" declaration tells Optiq that you can ask for any
other field. Those extra fields have type VARCHAR (in splunk all fields are
strings) and their value in SQL will be NULL if they don't exist in a
particular row. Thus you can write
SELECT s."source", s."sourcetype", s."action", CAST(s."discount" AS INTEGER)
FROM "splunk"."splunk" AS s
WHERE s."action" = 'purchase'
To implement the Optiq SQL front-end to Drill, I added support for the JSON
object model to SQL by adding MAP and ARRAY type constructors and a variant
type ANY. Thus Drill's "donuts" example table appears in SQL as
CREATE TABLE donuts (
_map MAP(VARCHAR, ANY) NOT NULL);
This is strongly typed, and therefore consistent with the SQL standard. Every
row has a single column called "_map", and its type is MAP(VARCHAR, ANY).
(Equivalent to a Java field "final Map<String, Object> map;".) If you want to
access the fields of the "_map" column you use the "[ ... ]" operator and
casting.
To access the "ppu" field, you could write "SELECT CAST(_map['ppu'] AS INTEGER)
as ppu FROM donuts". I would like to add syntactic sugar so that you can write
"SELECT ppu FROM donuts" (implicitly inserting the "_map" field, just as Java
inserts "this." before field references, and converting "x.field" to
"x['field']" for any field of type MAP).
If later you know the schema, you can create a view:
CREATE VIEW donuts_typed AS
SELECT CAST(_map['name'] AS VARCHAR) AS name,
CAST(_map['ppu'] AS FLOAT) AS ppu
FROM donuts
You can build your applications on top of such views at little or no loss of
efficiency.
These illustrate ways that we can add strongly-typed standard SQL on top of a
system with late or no schema. It is still standard SQL, which everyone agrees
is very important. The sugarings don't damage the core language; if you don't
like them, don't use them.
The biggest shortfall between the systems I've described above and what you
seem to want is enumerating what columns are available. (Your examples all
contain 'select *'.) In strongly typed SQL, if you want to know what columns
are available, you ask the catalog (i.e. you use metadata). In a weakly typed
system, you issue a query (i.e. you use data). If you issue "select *" against
Optiq-for-Splunk or Optiq-for-Drill you will be disappointed. The column list
will expand to "source, sourcetype" and "_map" respectively, because that is
the only metadata known to the system before the query is run.
The easiest solution is to ask your users to specify explicit columns rather
than writing "select *". (Machine-generated queries, such as my own Mondrian
engine, already tend to do this.) Or, as I suggested earlier, go ahead and
provide strongly typed views for your users to use. Or, one could build a UI to
assist writing queries: the UI could suggest which columns to ask for, based on
sketches of the data, previous query results, and columns that are already used
in the query.
Julian