I wanted to share some of the internal work that we had done on Dremel + BigQuery-like queries to inject some momentum into the group.
We finished a basic implementation of a field-stripe encoder and decoder and wanted to put a little effort into querying some data. (See https://github.com/rgrzywinski/field-stripe/ ) We had the following requirements: * 1 pass over the data * SAX-style output (result set) record writer (implying that you don¹t cache results < you simply hand them off as you project / select) * support for select (filter), project and aggregate We quickly came to the realization that there is only one model that is supported under these requirements: sub-tree pruning. (You have to spend some time thinking about a schema tree and what happens if you have select and project in different branches of that tree. You realize that there are always cases in which you cannot evaluate your select before you have to project. This boils down to the fact that will hand off a result and then later realize that it should have been filtered. The only way that you can prevent this is by saying that selects only affect themselves and their children (i.e. ³sub-tree pruning²).) The problem with sub-tree pruning is that it¹s 100% counter to everything that you¹ve ever done with SQL. For example: SELECT DocId FROM data WHERE Name.Url = ³whatever² (using the schema from the Dremel paper) That query would return *all records* but prune those sub-records that don¹t have Name.Url equal to ³whatever². What would be more natural (i.e. more SQL-like) would be to filter the records that don¹t have Name.Url. But then one immediately asks ³what about the sub-records²? Do you both filter the records and prune the sub-trees? And what does ³=² mean if Name.Url is a collection? Does it mean ³has only², ³contains², other? So how does Dremel solve all of this? They just do the sub-record/tree purning and then wrap the query within another query to get what they want. (And FYI < Dremel / BigQuery only support two table joins (where either or both table can be a sub-select) -- https://developers.google.com/bigquery/docs/query-reference) SELECT DocId FROM ( SELECT DocId FROM data WHERE Name.Url = ³whatever² ) WHERE Name.Url IS NOT NULL (Or something like that) You can always make a sequence of queries that gives you what you want. (I would love to see someone from the group validate or squash all of this theory and conjecture! I really want to be wrong here!) One really really wants to try this out in BigQuery. Unfortunately, in spite of promises, you still can¹t add hierarchical data to BigQuery. You can only use their pre-defined data sets which unfortunately doesn¹t have any interesting data in it to allow you to tease out answers. The moment that you use a flat structure then all of this hoopla goes away and you devolve to SQL (which is what it seems that the public version of BigQuery only supports at this moment). So we started thinking: what else does structured queries? XPath / XQuery. XPath really only gives you subtrees which isn¹t that interesting. XQuery immediately forces you into FLWOR. Having step 1 of writing a query processor be FLWOR is depressing. So we kept looking. What about JSON? There¹s a ton of dead projects for JSON queries :( We looked at MongoDB¹s query interface. We looked at CouchBase. Ah! UnQL ( http://www.unqlspec.org/display/UnQL/Home ) and a ton of press releases. But, oh wait, it seems to be dead ( http://www.arangodb.org/2012/04/07/is_unql_dead ). Oh boy!. Then we find this AQL thing ( http://www.arangodb.org/manuals/Aql.html and more info at http://www.arangodb.org/2012/04/19/rfc-the-avocadodb-query-language ). Oddly, they had to resort to FLWOR-like structures too. (More to come ...) -- Rob Grzywinski
