Of course the optimizer must work with some intermediate form of the query, which I think could for now be an object graph expressed with ordinary programming-language objects (as in Java or Python), without much fuss. However, I think formalizing it upfront is going to turn into a never-ending story, mainly because iteration over nested datasets is far from clear: different nested-dataset languages differ not only in syntax but also in the iteration model itself. I think Rob Grzywinski already started this discussion in a separate thread...
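To make the "ordinary objects" idea concrete, here is a minimal sketch in Python of a query held as a small object graph. Everything in it is an assumption for illustration only: the class names (Scan, Filter, Project), the chain() helper and the sample rows are made up and are not part of any Drill design. It also deliberately uses flat records, so it says nothing about the nested iteration model.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List

Record = dict  # hypothetical: one flat record; nesting is deliberately ignored


@dataclass
class Scan:
    """Source node: yields records from an in-memory list."""
    records: List[Record]

    def run(self) -> Iterator[Record]:
        return iter(self.records)


@dataclass
class Filter:
    """Keeps only the records for which the predicate holds."""
    source: object
    predicate: Callable[[Record], bool]

    def run(self) -> Iterator[Record]:
        return (r for r in self.source.run() if self.predicate(r))


@dataclass
class Project:
    """Keeps only the named fields of each record."""
    source: object
    fields: List[str]

    def run(self) -> Iterator[Record]:
        return ({f: r[f] for f in self.fields} for r in self.source.run())


def chain(node) -> List[str]:
    """Walk the graph from sink back to source; return node names in data-flow order."""
    names = []
    while node is not None:
        names.append(type(node).__name__)
        node = getattr(node, "source", None)  # Scan has no source, so the walk stops
    return names[::-1]


# Made-up sample data, roughly: SELECT url FROM rows WHERE views > 10
rows = [{"url": "a", "views": 5}, {"url": "b", "views": 42}]
plan = Project(Filter(Scan(rows), lambda r: r["views"] > 10), ["url"])
result = list(plan.run())
```

Here chain(plan) walks the source links and shows that the whole plan is just a linear sequence of nodes, which is the "chain of transformations" case; an optimizer would rewrite this object graph before calling run().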
Also, regarding the optimizer and the DAG: since there is not much index/join action going on, what we have now is mostly a chain of transformations, which of course is also formally a DAG :) But thinking of it as a chain will simplify things at first, since there are not many optimizations you can do with it. If you go one level down and consider scalar operations, then of course you get a more elaborate DAG.

-----------------

Let's separate the issues:

1. The query plan is a distributed workload and that must be formalized; I think no one suggests otherwise. Also, no one except me suggests a model other than a DAG. I suggest an unrestricted graph just to keep the backend useful for other stuff; for the purposes of DrQL, a DAG is more than adequate in my opinion. However, this is a DAG of identical nodes, and it follows the physical data partitioning. Let's label it the physical DAG so as not to confuse it with the logical query plan.

2. An open and somewhat confused issue is what actually runs on a single node of the above-mentioned physical DAG. Is it a formalized query plan or just arbitrary code?

3. Query plan formalization: the main obstacle here is that the model of iterating over nested datasets is far from clear. In particular, neither the Dremel paper nor the BigQuery reference describes well the behavior of querying nested datasets in all the different subcases. There are many other languages for querying nested data, but the iteration model varies significantly between them. For formalization we are missing one more academic paper, which would rigorously define a canonical high-performance iteration model for nested datasets.

4. Another complicating factor is columnar optimization. Drill is going to be a nested-columnar engine, and as such part of the query plan must be columnar. So a full set of column-oriented and record-oriented primitives is needed, as well as record-construction primitives.

On Mon, Aug 27, 2012 at 9:29 AM, Hyunsik Choi <[email protected]> wrote:

> Hi David,
>
> I agree with some of your claims.
> I also think that for now DrQL may be enough for the Drill project.
>
> Even if we don't support various query languages, I think complex query
> languages (like SQL and DrQL) should have a logical form in order to deal
> with a given query without considering actual physical information. It
> provides an easy way to modify the query into a more optimized one (e.g.,
> pushing down projection, selection, and finding the best operator order)
> while the optimized one is logically equivalent to the original query.
>
> Also, it would not hurt performance. For example, OLTP systems that
> process a query within a few milliseconds already employ such a logical
> plan model. Although a logical plan is generic, it is not hugely different
> from existing logical plan models.
>
> --
> Hyunsik Choi
>
> On Mon, Aug 27, 2012 at 2:34 PM, David Gruzman <[email protected]> wrote:
>
> > Hi,
> > Dremel is a high-performance system. I think building something generic
> > and "inter-language" will hurt performance.
> > Having a generic executor service, we can add several different
> > paradigms of local (and even non-local) computation. But I think an
> > SQL-like query language should be done in the most efficient way.
> > David
> >
> > On Mon, Aug 27, 2012 at 3:20 AM, Hyunsik Choi <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > How about having a generic logical plan described as a DAG, where each
> > > vertex indicates a logical operator including various annotations and
> > > each edge represents a data flow? A DAG has much expressive power. Much
> > > of the literature has shown that most logical plans of various data
> > > manipulation languages can be described as such a DAG.
> > >
> > > Additional languages have different ASTs, and they can be transformed
> > > into the generic logical plan. In this case, we can reuse the logical
> > > plan, logical plan optimization, and physical execution plan.
> > > Besides, Drill may consider a global plan that represents the
> > > distributed execution plan. Since the global plan generally depends on
> > > the logical plan, we can also reuse all code related to the global plan.
> > >
> > > --
> > > Hyunsik Choi
> > >
> > > On Mon, Aug 27, 2012 at 6:22 AM, Ted Dunning <[email protected]> wrote:
> > >
> > > > Camuel,
> > > >
> > > > Do you have a grammar test suite that demonstrates the range of
> > > > expressions?
> > > >
> > > > Also, I believe that some have a goal to use additional languages
> > > > besides SQL-like languages. A limited version of Pig, for instance,
> > > > would be very interesting. To do this, it will be important to have a
> > > > logical plan structure that is common for different syntaxes and is
> > > > not limited to the idiosyncrasies of any particular syntax.
> > > >
> > > > How do you think that should be handled? Do you have an idea for a
> > > > logical plan structure?
> > > >
> > > > On Sun, Aug 26, 2012 at 4:11 PM, Camuel Gilyadov <[email protected]> wrote:
> > > >
> > > > > I've written and attached an ANTLR grammar for DrQL, which I assume
> > > > > is the same as the BigQuery language described in the Query
> > > > > Reference on the BigQuery website. This grammar includes AST
> > > > > production rules.
