Regarding UNION vs UNION ALL. Sure, let's go with UNION. I could imagine cases where UNION is not sufficient (e.g. we want to generate a multiset, or annotate each row with the stratum at which it was generated) but UNION is good enough at first.
Regarding a parser/AST, did you see Riccardo's email in this thread[1] mentioning the datalog parser at the University of Bolzano. Julian [1] https://lists.apache.org/thread.html/e80629a16883ce32b07be0e14002aef6734a68913fe420f6edddfc17@%3Cdev.calcite.apache.org%3E On Thu, Dec 21, 2017 at 6:42 PM, Walaa Eldin Moustafa <[email protected]> wrote: > Agreed on the high level description of the iterate operator and table > function. The table function is basically a "combiner" that will combine > the delta with the result of past iteration somehow. > > I would say we need a UINION (versus UNION ALL) operator since we do not > want to re-add facts that were already inferred in past iteration (in case > they are re-inferred). > > Are you aware of anyone working on the parser/AST? I am giving them a try > in case someone wants to collaborate. > > Thanks, > Walaa. > > > On Mon, Dec 18, 2017 at 12:02 AM, Julian Hyde <[email protected]> wrote: > >> Yes, I agree. >> >> My first instinct is to add an Iterate operator whose arguments are (1) >> the input, (2) a table function that applies to the delta at each >> iteration. When the table function returns the empty set, iteration stops. >> The “table function” is not a function per se, but a RelNode tree that >> references a pseudo-table called “Delta”. Think of it as a relational >> lambda expression, and the “Delta" is the argument. >> >> Intermediate results are combined using UNION ALL. Is this too >> restrictive? I think maybe not, because you can always add a “finalizer” >> such as an Aggregate after the Iterate operator. >> >> Julian >> >> >> > On Dec 15, 2017, at 3:11 PM, Walaa Eldin Moustafa <[email protected]> >> wrote: >> > >> > Yes, Magic sets is a very important and popular optimization as well. I >> > guess once we can get a basic notion of recursion as a construct in >> > Calcite, and get it to work correctly, we can start cracking >> optimizations. >> > One thing to note is that the convergence/fixed point depend on the data, >> > and there is no way to know beforehand what the (complete) plan will look >> > like (i.e., how many joins). It seems that there must be some sort of >> > awareness in the host engine of the fact that the query is recursive, and >> > it should keep iterating till fixed point, or at least tell Calcite if it >> > converged or not, and if not, Calcite will ask it to keep trying, so >> every >> > iteration Calcite sends a traditional (non-recursive) RA plan, or ask the >> > engine to stop. Do you agree? >> > >> > >> > On Fri, Dec 15, 2017 at 12:06 PM, Julian Hyde <[email protected]> wrote: >> > >> >> (Moving Carl, Shrikanth, Vasanth to bcc.) >> >> >> >> Regarding optimizations. One one hand, it is daunting that there so >> >> many optimizations are required to make graph queries run efficiently, >> >> but on the other hand, it is good news for the project if those can be >> >> expressed in relational algebra. >> >> >> >> Looking at the previous research, some of the optimizations applied >> >> are genuinely only possible at run-time, but others should be thought >> >> of as logical rewrites. Semi-naive evaluation, which Walaa mentions, >> >> can be expressed as a logical operation (very similar to incremental >> >> view maintenance and streaming, by the way). >> >> >> >> (Untangling the capabilities of a particular engine from algebraic >> >> rewrites is Calcite's gift to the world!) >> >> >> >> Another very important logical rewrite is "magic sets"[1], which can >> >> be modeled as semi-join push-down and done entirely at planning >> >> time[2] or (if the runtime supports it) as side-ways information >> >> passing of bloom filters or similar. Magic sets are important for >> >> graph queries but also very useful for star-schema queries with a >> >> fixed number of joins. >> >> >> >> Julian >> >> >> >> [1] http://db.cs.berkeley.edu/papers/sigmod96-magic.pdf >> >> >> >> [2] https://issues.apache.org/jira/browse/CALCITE-468 >> >> >> >> >> >> On Fri, Dec 15, 2017 at 11:21 AM, Edmon Begoli <[email protected]> >> wrote: >> >>> Great initiative. >> >>> >> >>> I will also share some comparative performance studies we did at ORNL >> on >> >>> different graph processing engines. Could be useful. >> >>> >> >>> On Fri, Dec 15, 2017 at 14:11 Walaa Eldin Moustafa < >> >> [email protected]> >> >>> wrote: >> >>> >> >>>> Hi Julian, >> >>>> >> >>>> Thanks for referencing our Datalog query processing paper [5]. I have >> >> been >> >>>> thinking about the same idea for a while now too :) I think Calcite is >> >> very >> >>>> well positioned as a generic query optimizer to add Datalog/recursive >> >> query >> >>>> support. Also, it makes a lot of sense since it opens a completely new >> >>>> dimension for the kind of logic and queries that Calcite can handle, >> >>>> including but not limited to graph queries, and that can be >> immediately >> >>>> available to engines talking to Calcite. >> >>>> >> >>>> To answer your questions, we probably need to add a transitive closure >> >>>> operator. This 1988 paper <http://ieeexplore.ieee.org/document/42731/ >> > >> >> by >> >>>> Rakesh Agrawal proposes the notion of alpha relations, and defines an >> >> alpha >> >>>> operator on top of them which computes the transitive closure of alpha >> >>>> relations. The operator fits well with the rest of Cod's relational >> >> algebra >> >>>> operators. >> >>>> >> >>>> For query optimizations, one of the commonly used Datalog >> optimizations >> >> is >> >>>> Semi-naive evaluation, where instead of re-evaluating the recursive >> >> program >> >>>> using all existing facts, only new facts inferred since last iteration >> >> are >> >>>> used. Datalog optimizations become much more interesting when >> >> introducing >> >>>> aggregation and negation, and it is still an open research question, >> but >> >>>> there is already some tangible progress. Otherwise, as you mentioned >> >>>> transitive closure is repeated joins, so pretty much many of the join >> >>>> optimizations apply such as predicate pushdown, and aggregation >> >>>> pushdown/pull up. >> >>>> >> >>>> Regarding the effort, we can always start from basic features and >> expand >> >>>> from there. I have already started working on the parser, AST and >> >> logical >> >>>> plan builder for basic Datalog without recursion. I am happy to >> create a >> >>>> JIRA ticket to track this effort there. >> >>>> >> >>>> Thanks, >> >>>> Walaa. >> >>>> >> >>>> >> >>>> On Fri, Dec 15, 2017 at 10:26 AM, Julian Hyde <[email protected]> >> wrote: >> >>>> >> >>>>> I've been thinking about Datalog front end to Calcite. How much >> effort >> >>>>> would it be? Would there be an audience who would find it useful? >> >>>>> >> >>>>> The genesis of the idea is talks by Frank McSherry[1] and Vasia >> >>>>> Kalavri[2] about graph queries and in particular Timely >> >>>>> Dataflow[3][4], and a paper about using Datalog for graph processing >> >>>>> [5]. >> >>>>> >> >>>>> A few observations: >> >>>>> * Graph queries require repeated (unbounded) joins, and so are beyond >> >>>>> SQL:92. >> >>>>> * Datalog allows recursion, and can therefore compute things like >> >>>>> transitive closure, and is therefore powerful enough for graph >> >>>>> queries. >> >>>>> * SQL:99 added 'WITH RECURSIVE' so can handle a pretty useful class >> of >> >>>>> queries. However, for a variety of reasons, people tend not to use >> SQL >> >>>>> for querying graph databases. >> >>>>> * Datalog is more than just recursive queries. For instance, it is >> >>>>> popular with academics analyzing the power/complexity of languages. >> >>>>> And it can do deductive queries. >> >>>>> * There are different "strengths" of Datalog. Some variants support >> >>>>> only linearizable recursion; most variants only support queries whose >> >>>>> results are stratified (i.e. can be defined using well-founded >> >>>>> induction, necessary when negations are involved). >> >>>>> * The "big data" languages (Hadoop, Spark, Flink, even Pig) have all >> >>>>> discussed how they can handle graph queries and iterative >> computation. >> >>>>> However they have mainly focused on improvements to their engine and >> >>>>> physical algebra, not looked at logical algebra or query language. >> >>>>> * If Calcite's algebra could express deductive query / recursive >> query >> >>>>> / iteration, then not only would Datalog be possible, but also SQL's >> >>>>> WITH RECURSIVE, and also SPARQL. >> >>>>> >> >>>>> So, questions on my mind: >> >>>>> * What new operator(s) would we add to Calcite's algebra to enable >> >>>>> recursive query? >> >>>>> * What optimization rules are possible/necessary for graph queries? >> >>>>> * How much effort would it be to add a Datalog parser to Calcite and >> >>>>> translate to Calcite's algebra? >> >>>>> >> >>>>> Julian >> >>>>> >> >>>>> [1] http://www.dataengconf.com/scalability-but-at-what-cost >> >>>>> >> >>>>> [2] https://qconsf.com/sf2017/speakers/vasia-kalavri >> >>>>> >> >>>>> [3] https://github.com/frankmcsherry/timely-dataflow >> >>>>> >> >>>>> [4] http://sigops.org/sosp/sosp13/papers/p439-murray.pdf >> >>>>> >> >>>>> [5] http://www.sysnet.ucsd.edu/sysnet/miscpapers/datalog-icbd16.pdf >> >>>>> >> >>>> >> >> >> >>
