Hi Rob, Looks like you guys have put a lot of thought and work into building support for Dremel-style data packing. I'm in total agreement with you on the requirements of decoding. I'm also in agreement about FLWOR - I've personally never seen anything about XQuery that made me excited to program.
All that said, I don't quite understand the dilemma with your query. SELECT DocId FROM data WHERE Name.Url = <whatever> Why is filtering on Name.Url different than in SQL? SQL database already use the same NULL rejection pattern you represented as a subquery. Should equality be treated differently for collections (e.g. "contains") and scalars? I don't see why this should be the case. -Jason On Sun, Aug 26, 2012 at 3:50 PM, Rob Grzywinski <[email protected]>wrote: > I wanted to share some of the internal work that we had done on Dremel + > BigQuery-like queries to inject some momentum into the group. > > We finished a basic implementation of a field-stripe encoder and decoder > and > wanted to put a little effort into querying some data. (See > https://github.com/rgrzywinski/field-stripe/ ) > > We had the following requirements: > * 1 pass over the data > * SAX-style output (result set) record writer (implying that you don¹t > cache > results < you simply hand them off as you project / select) > * support for select (filter), project and aggregate > > We quickly came to the realization that there is only one model that is > supported under these requirements: sub-tree pruning. (You have to spend > some time thinking about a schema tree and what happens if you have select > and project in different branches of that tree. You realize that there are > always cases in which you cannot evaluate your select before you have to > project. This boils down to the fact that will hand off a result and then > later realize that it should have been filtered. The only way that you can > prevent this is by saying that selects only affect themselves and their > children (i.e. ³sub-tree pruning²).) The problem with sub-tree pruning is > that it¹s 100% counter to everything that you¹ve ever done with SQL. For > example: > > SELECT DocId > FROM data > WHERE Name.Url = ³whatever² > > (using the schema from the Dremel paper) > > That query would return *all records* but prune those sub-records that > don¹t > have Name.Url equal to ³whatever². What would be more natural (i.e. more > SQL-like) would be to filter the records that don¹t have Name.Url. But then > one immediately asks ³what about the sub-records²? Do you both filter the > records and prune the sub-trees? And what does ³=² mean if Name.Url is a > collection? Does it mean ³has only², ³contains², other? > > So how does Dremel solve all of this? They just do the sub-record/tree > purning and then wrap the query within another query to get what they want. > (And FYI < Dremel / BigQuery only support two table joins (where either or > both table can be a sub-select) -- > https://developers.google.com/bigquery/docs/query-reference) > > SELECT DocId > FROM ( > SELECT DocId > FROM data > WHERE Name.Url = ³whatever² ) > WHERE Name.Url IS NOT NULL > > (Or something like that) You can always make a sequence of queries that > gives you what you want. > > (I would love to see someone from the group validate or squash all of this > theory and conjecture! I really want to be wrong here!) > > One really really wants to try this out in BigQuery. Unfortunately, in > spite > of promises, you still can¹t add hierarchical data to BigQuery. You can > only > use their pre-defined data sets which unfortunately doesn¹t have any > interesting data in it to allow you to tease out answers. The moment that > you use a flat structure then all of this hoopla goes away and you devolve > to SQL (which is what it seems that the public version of BigQuery only > supports at this moment). > > So we started thinking: what else does structured queries? XPath / XQuery. > XPath really only gives you subtrees which isn¹t that interesting. XQuery > immediately forces you into FLWOR. Having step 1 of writing a query > processor be FLWOR is depressing. > > So we kept looking. What about JSON? There¹s a ton of dead projects for > JSON > queries :( We looked at MongoDB¹s query interface. We looked at > CouchBase. > Ah! UnQL ( http://www.unqlspec.org/display/UnQL/Home ) and a ton of press > releases. But, oh wait, it seems to be dead ( > http://www.arangodb.org/2012/04/07/is_unql_dead ). Oh boy!. Then we find > this AQL thing ( http://www.arangodb.org/manuals/Aql.html and more info at > http://www.arangodb.org/2012/04/19/rfc-the-avocadodb-query-language ). > Oddly, they had to resort to FLWOR-like structures too. > > (More to come ...) > > > -- > Rob Grzywinski > >
