I guess I've been thinking about something a little different. I've been
thinking about what would be an appropriate algebra to internally represent,
manipulate, and optimize Drill queries.
My conclusion is that the best candidate is relational algebra, augmented with
data types that allow collections of nested records and with "explode" and
"implode" operators. ("explode" takes a record with nested records and converts
it to a sequence of flat records, plus a "location" value that indicates how
the nested record fits into the parent; "implode" is the inverse: it takes a
sequence of flat records with a "location" field and converts into a nested
record.) Here is how I came to that conclusion.
The algebra is lower level than DrQL: viz, it would have fewer operators, and
the operators would be easier to reason about and to write transformation rules
for. (Anyone who thinks that DrQL's 'where' operator is straightforward should
ponder why BigQuery will sometimes give the error "Cannot query the cross
product of repeated fields".)
The algebra is (probably) higher level than what Ted calls a "logical plan".
His operators produce two outputs, and while that makes perfect sense for
physical operators, it is difficult to reason about.
Here are a few reasons why I consider DrQL to be a less clean model than the
relational model. As I've said before, the "where" operator has a much more
complex behavior than SQL's where. It is best understood as decomposing
records, applying a filtering predicate, then re-composing the fragments of the
row that made it through the filter. The "within" clause is a nice shorthand,
but is too limited to be considered a full operator. Trees (collections within
rows) are similar to relations (collections of rows) but are handled using
different operators. If the "tree" model was as powerful as advertised then we
wouldn't need the concept of "relation" at all.
That doesn't mean that DrQL is not a good query language. It seems to be
concise, and users learn it quickly and like it. Syntactic sugar operators like
"within" is totally appropriate (just like syntactic sugar "select distinct"
and "having" in SQL).
To fix up DrQL's "where" operator, we convert it to "explode" followed by
"filter" followed by "implode". To fix up aggregate "within", we apply
"explode" then aggregate. We find that we never need to operate on trees. If we
need to operate on a tree, we explode into several records, apply relational
operators on those records, and if necessary implode back again. We're
operating in relational algebra.
This is good news, because relational algebra is well behaved and well
understood.
And by the way, even if the algebra is about exploded sets of flat records,
there's no reason that the physical operators can't operate on tree-structured
records. We could recognize explode-followed-by-filter-followed-by-implode and
implement an operator that does precisely the same as a DrQL "where" clause.
Am I over-engineering here? It's possible. Maybe Drill doesn't need query
optimization. Maybe queries can go straight from a DrQL parse tree to a DAG of
operators using a straightforward mapping. But I'll argue that many people will
come to Drill with SQL queries, or queries very similar to SQL, data sets with
minimal nesting, and will be saddened when Drill can't execute their queries.
This particular user kicked the tires, was impressed with the speed of the car,
but was disappointed that he couldn't drive it where he wanted to go:
http://cwebbbi.wordpress.com/2012/05/20/a-look-at-google-bigquery/.
Julian
On Oct 12, 2012, at 7:37 PM, Ted Dunning <[email protected]> wrote:
> I talked to Jason some more. He had some very good suggestions.
>
> a) some operators need to have multiple outputs. For instance, the group
> operator needs to output the main data stream and a reference to the
> grouped field
>
> b) what Julian was calling nest/unnest is more naturally called explode and
> flatten. The idea is that some field has a list-like value and the output
> will be each of those values. Actually, there are two outputs. One is the
> original input and the second is the explode sequence. This can be the
> input to a DAG which does whatever we want to that exploded sequence,
> typically aggregating it, but really doing whatever we want. Then the
> flatten operator handles splicing the output of the sub-DAG into the
> original record that had the list-like value. There are two outputs of the
> flatten operator as well, which are the main data flow and a reference to
> the output of the DAG in the main data.
>
> This style handles all of the normal grouping/aggregating type of things we
> want to do and it also handles all of Dremel's within syntax.
>
> I also realized a few things as well
>
> 1) the bind needs to be rooted in some data source so that we can
> understand scoping relative to schemas
>
> 2) there is an important difference between two separate outputs of a DAG
> element and a single output that goes two places.
>
> 3) everywhere I was wanting to inject an output field name can be handled
> by multiple outputs
>
>
>
> I think that the logical plan spec is ready for two things, both of which
> can be done by somebody other than me:
>
>
> A - We can now start trying to convert an abstract syntax tree from
> Dremel-ish source into a logical plan
>
> B - We can implement a toy interpreter for the logical plan that transforms
> sequences of trees.