I think that you¹ve captured it well. My meta-point on all of this is that if you just read the Dremel paper casually then you can walk away thinking "Oooh! Look. I just run normal SQL and things go fast!". But the reality is that when you start introducing the hierarchical / nested types then SQL starts to get in your way more than it helps. (Which is why I started to look at FLWOR and the like.)
-- Rob Grzywinski On 8/27/12 11:12 AM, "Constantine Peresypkin" <[email protected]> wrote: > I think I understand what you want. > In BQL you write it as follows: > > SELECT DocId, LAST(Name.Url) WITHIN Name AS Url > FROM data > WHERE Name.Url = 'whatever' > > And then you can prune the NULL ones. > > And if you think that this is a "problem", you haven't seen anything yet. :) > > Constantine > > On Mon, Aug 27, 2012 at 5:56 PM, Rob Grzywinski > <[email protected]>wrote: > >> (And now that I really look at this, perhaps my outer query should be: >> >> SELECT DocId >> FROM data >> WHERE Name IS NOT NULL >> >> (rather than "Name.Url IS NOT NULL") so that the level of the filter and >> project are the same.) >> >> (And yes, I was being loose with the "SELECT" part for illustrative >> purposes >> below. It should have been "SELECT DocId, Name.Url" to return the results >> that I show.) >> >> On 8/27/12 8:57 AM, "Rob Grzywinski" <[email protected]> wrote: >> >>> You're highlighting a fundamental concern that I have about the queries >> as >>> stated in the Dremel paper. Section 5 in the Dremel paper says: >>> >>> "Think of a nested record as a labeled tree, where each label >> corresponds to a >>> field name. The selection op-erator prunes away the branches of the tree >> that >>> do not satisfy the specified conditions. Thus, only those nested records >> are >>> retained where Name.Url is defined and starts with http." >>> >>> Notice that it says "only those *nested* records are retained" and not >> "only >>> those records are retained". (This also follows if you work through the >>> algorithm in Figure 19 from the Dremel paper (what we've called the >> "trickle >>> reassembly").) >>> >>> Specifically, >>> >>> SELECT DocId >>> FROM data >>> WHERE Name.Url = ³whatever² >>> >>> will return *all* records but prune those nested records where Name.Url >>> doesn't equal "whatever". For example if you had the data (in JSON'y >> format): >>> >>> { {"DocId":1, "Name":[{"Url":"A1"},{"Url","whatever"},{"Url":"C1"}]}, >>> {"DocId":2, "Name":[{"Url":"A2"}]} } >>> >>> then the above query would return: >>> >>> { {"DocId":1, "Name":[{"Url","whatever"}]}, >>> {"DocId":2, "Name":null} } >>> >>> To then get only those records that have Name.Url equal to "whatever" you >>> would need to wrap this with: >>> >>> SELECT DocId >>> FROM ( results ) >>> WHERE Name.Url IS NOT NULL >>> >>> which gives: >>> >>> { {"DocId":1, "Name":[{"Url","whatever"}]} } >>> >>> The inner query that you presented would result in: >>> >>> SELECT DocId >>> FROM data >>> WHERE Name.Url IS NOT NULL >>> >>> { {"DocId":1, "Name":[{"Url":"A1"},{"Url","whatever"},{"Url":"C1"}]}, >>> {"DocId":2, "Name":[{"Url":"A2"}]} } >>> >>> (Nothing is restricted) And then the outer query is the same as my first >> query >>> above: >>> >>> SELECT DocId >>> FROM ( result ) >>> WHERE Name.Url = ³whatever² >>> >>> { {"DocId":1, "Name":[{"Url","whatever"}]}, >>> {"DocId":2, "Name":null} } >>> >>> I want to emphasize again that I really want to be wrong about my >> reading and >>> interpretation of the Dremel paper. >>> >>> >>> -- >>> Rob Grzywinski >>> >>> >>> >>> On 8/26/12 9:46 PM, "Constantine Peresypkin" <[email protected]> >> wrote: >>> >>>> Hello Rob, >>>> I think you've meant >>>> >>>> SELECT DocId >>>> FROM ( >>>> SELECT DocId >>>> FROM data >>>> WHERE Name.Url IS NOT NULL) >>>> WHERE Name.Url = ³whatever² >>>> >>>> in the example query. >>>> >>>> In general you're right, you can only "optimize" queries that have >> nothing >>>> but "required" fields inside. >>>> But the whole point in Dremel columnar structure is: do a fast full-scan >>>> and then fast prune. >>>> That's why they have all those integer repetition/definition levels, >> they >>>> only read the levels, if possible, and make early "pruning" decision. >>>> But you can optimize the queries by other means, mainly by data >>>> "compression", in fact you just optimize full column scans. >>>> For example: do not store NULLs, use RLE encoding, dictionary encoding >> and >>>> "plain" compression. >>>> We even implemented some of these optimizations in OpenDremel. >>>> >>>> On Mon, Aug 27, 2012 at 1:50 AM, Rob Grzywinski >>>> <[email protected]>wrote: >>>> >>>>> I wanted to share some of the internal work that we had done on Dremel >> + >>>>> BigQuery-like queries to inject some momentum into the group. >>>>> >>>>> We finished a basic implementation of a field-stripe encoder and >> decoder >>>>> and >>>>> wanted to put a little effort into querying some data. (See >>>>> https://github.com/rgrzywinski/field-stripe/ ) >>>>> >>>>> We had the following requirements: >>>>> * 1 pass over the data >>>>> * SAX-style output (result set) record writer (implying that you don¹t >>>>> cache >>>>> results < you simply hand them off as you project / select) >>>>> * support for select (filter), project and aggregate >>>>> >>>>> We quickly came to the realization that there is only one model that is >>>>> supported under these requirements: sub-tree pruning. (You have to >> spend >>>>> some time thinking about a schema tree and what happens if you have >> select >>>>> and project in different branches of that tree. You realize that there >> are >>>>> always cases in which you cannot evaluate your select before you have >> to >>>>> project. This boils down to the fact that will hand off a result and >> then >>>>> later realize that it should have been filtered. The only way that you >> can >>>>> prevent this is by saying that selects only affect themselves and their >>>>> children (i.e. ³sub-tree pruning²).) The problem with sub-tree pruning >> is >>>>> that it¹s 100% counter to everything that you¹ve ever done with SQL. >> For >>>>> example: >>>>> >>>>> SELECT DocId >>>>> FROM data >>>>> WHERE Name.Url = ³whatever² >>>>> >>>>> (using the schema from the Dremel paper) >>>>> >>>>> That query would return *all records* but prune those sub-records that >>>>> don¹t >>>>> have Name.Url equal to ³whatever². What would be more natural (i.e. >> more >>>>> SQL-like) would be to filter the records that don¹t have Name.Url. But >> then >>>>> one immediately asks ³what about the sub-records²? Do you both filter >> the >>>>> records and prune the sub-trees? And what does ³=² mean if Name.Url is >> a >>>>> collection? Does it mean ³has only², ³contains², other? >>>>> >>>>> So how does Dremel solve all of this? They just do the sub-record/tree >>>>> purning and then wrap the query within another query to get what they >> want. >>>>> (And FYI < Dremel / BigQuery only support two table joins (where >> either or >>>>> both table can be a sub-select) -- >>>>> https://developers.google.com/bigquery/docs/query-reference) >>>>> >>>>> SELECT DocId >>>>> FROM ( >>>>> SELECT DocId >>>>> FROM data >>>>> WHERE Name.Url = ³whatever² ) >>>>> WHERE Name.Url IS NOT NULL >>>>> >>>>> (Or something like that) You can always make a sequence of queries that >>>>> gives you what you want. >>>>> >>>>> (I would love to see someone from the group validate or squash all of >> this >>>>> theory and conjecture! I really want to be wrong here!) >>>>> >>>>> One really really wants to try this out in BigQuery. Unfortunately, in >>>>> spite >>>>> of promises, you still can¹t add hierarchical data to BigQuery. You can >>>>> only >>>>> use their pre-defined data sets which unfortunately doesn¹t have any >>>>> interesting data in it to allow you to tease out answers. The moment >> that >>>>> you use a flat structure then all of this hoopla goes away and you >> devolve >>>>> to SQL (which is what it seems that the public version of BigQuery only >>>>> supports at this moment). >>>>> >>>>> So we started thinking: what else does structured queries? XPath / >> XQuery. >>>>> XPath really only gives you subtrees which isn¹t that interesting. >> XQuery >>>>> immediately forces you into FLWOR. Having step 1 of writing a query >>>>> processor be FLWOR is depressing. >>>>> >>>>> So we kept looking. What about JSON? There¹s a ton of dead projects for >>>>> JSON >>>>> queries :( We looked at MongoDB¹s query interface. We looked at >>>>> CouchBase. >>>>> Ah! UnQL ( http://www.unqlspec.org/display/UnQL/Home ) and a ton of >> press >>>>> releases. But, oh wait, it seems to be dead ( >>>>> http://www.arangodb.org/2012/04/07/is_unql_dead ). Oh boy!. Then we >> find >>>>> this AQL thing ( http://www.arangodb.org/manuals/Aql.html and more >> info at >>>>> http://www.arangodb.org/2012/04/19/rfc-the-avocadodb-query-language ). >>>>> Oddly, they had to resort to FLWOR-like structures too. >>>>> >>>>> (More to come ...) >>>>> >>>>> >>>>> -- >>>>> Rob Grzywinski >>>>> >>>>> >>>> >> >> >
