Stymied by queries

Rob Grzywinski Sun, 26 Aug 2012 15:51:27 -0700

I wanted to share some of the internal work that we had done on Dremel +
BigQuery-like queries to inject some momentum into the group.


We finished a basic implementation of a field-stripe encoder and decoder and
wanted to put a little effort into querying some data. (See
https://github.com/rgrzywinski/field-stripe/ )

We had the following requirements:
* 1 pass over the data
* SAX-style output (result set) record writer (implying that you don¹t cache
results < you simply hand them off as you project / select)
* support for select (filter), project and aggregate

We quickly came to the realization that there is only one model that is
supported under these requirements: sub-tree pruning. (You have to spend
some time thinking about a schema tree and what happens if you have select
and project in different branches of that tree. You realize that there are
always cases in which you cannot evaluate your select before you have to
project. This boils down to the fact that will hand off a result and then
later realize that it should have been filtered. The only way that you can
prevent this is by saying that selects only affect themselves and their
children (i.e. ³sub-tree pruning²).) The problem with sub-tree pruning is
that it¹s 100% counter to everything that you¹ve ever done with SQL. For
example:

    SELECT DocId
      FROM data
     WHERE Name.Url = ³whatever²

(using the schema from the Dremel paper)

That query would return *all records* but prune those sub-records that don¹t
have Name.Url equal to ³whatever². What would be more natural (i.e. more
SQL-like) would be to filter the records that don¹t have Name.Url. But then
one immediately asks ³what about the sub-records²? Do you both filter the
records and prune the sub-trees? And what does ³=² mean if Name.Url is a
collection? Does it mean ³has only², ³contains², other?

So how does Dremel solve all of this? They just do the sub-record/tree
purning and then wrap the query within another query to get what they want.
(And FYI < Dremel / BigQuery only support two table joins (where either or
both table can be a sub-select) --
https://developers.google.com/bigquery/docs/query-reference)

    SELECT DocId
      FROM (
               SELECT DocId
                 FROM data
               WHERE Name.Url = ³whatever² )
     WHERE Name.Url IS NOT NULL

(Or something like that) You can always make a sequence of queries that
gives you what you want.

(I would love to see someone from the group validate or squash all of this
theory and conjecture! I really want to be wrong here!)

One really really wants to try this out in BigQuery. Unfortunately, in spite
of promises, you still can¹t add hierarchical data to BigQuery. You can only
use their pre-defined data sets which unfortunately doesn¹t have any
interesting data in it to allow you to tease out answers. The moment that
you use a flat structure then all of this hoopla goes away and you devolve
to SQL (which is what it seems that the public version of BigQuery only
supports at this moment).

So we started thinking: what else does structured queries? XPath / XQuery.
XPath really only gives you subtrees which isn¹t that interesting. XQuery
immediately forces you into FLWOR. Having step 1 of writing a query
processor be FLWOR is depressing.

So we kept looking. What about JSON? There¹s a ton of dead projects for JSON
queries :(  We looked at MongoDB¹s query interface.  We looked at CouchBase.
Ah! UnQL ( http://www.unqlspec.org/display/UnQL/Home ) and a ton of press
releases. But, oh wait, it seems to be dead (
http://www.arangodb.org/2012/04/07/is_unql_dead ). Oh boy!. Then we find
this AQL thing ( http://www.arangodb.org/manuals/Aql.html and more info at
http://www.arangodb.org/2012/04/19/rfc-the-avocadodb-query-language ).
Oddly, they had to resort to FLWOR-like structures too.

(More to come ...)


--
Rob Grzywinski

Stymied by queries

Reply via email to