Re: Recursive query, graph query, Datalog

Julian Hyde Mon, 18 Dec 2017 00:03:29 -0800

Yes, I agree.

My first instinct is to add an Iterate operator whose arguments are (1) the 
input, (2) a table function that applies to the delta at each iteration. When 
the table function returns the empty set, iteration stops. The “table function” 
is not a function per se, but a RelNode tree that references a pseudo-table 
called “Delta”. Think of it as a relational lambda expression, and the “Delta" 
is the argument.


Intermediate results are combined using UNION ALL. Is this too restrictive? I 
think maybe not, because you can always add a “finalizer” such as an Aggregate 
after the Iterate operator.

Julian


> On Dec 15, 2017, at 3:11 PM, Walaa Eldin Moustafa <[email protected]> 
> wrote:
> 
> Yes, Magic sets is a very important and popular optimization as well. I
> guess once we can get a basic notion of recursion as a construct in
> Calcite, and get it to work correctly, we can start cracking optimizations.
> One thing to note is that the convergence/fixed point depend on the data,
> and there is no way to know beforehand what the (complete) plan will look
> like (i.e., how many joins). It seems that there must be some sort of
> awareness in the host engine of the fact that the query is recursive, and
> it should keep iterating till fixed point, or at least tell Calcite if it
> converged or not, and if not, Calcite will ask it to keep trying, so every
> iteration Calcite sends a traditional (non-recursive) RA plan, or ask the
> engine to stop. Do you agree?
> 
> 
> On Fri, Dec 15, 2017 at 12:06 PM, Julian Hyde <[email protected]> wrote:
> 
>> (Moving Carl, Shrikanth, Vasanth to bcc.)
>> 
>> Regarding optimizations. One one hand, it is daunting that there so
>> many optimizations are required to make graph queries run efficiently,
>> but on the other hand, it is good news for the project if those can be
>> expressed in relational algebra.
>> 
>> Looking at the previous research, some of the optimizations applied
>> are genuinely only possible at run-time, but others should be thought
>> of as logical rewrites. Semi-naive evaluation, which Walaa mentions,
>> can be expressed as a logical operation (very similar to incremental
>> view maintenance and streaming, by the way).
>> 
>> (Untangling the capabilities of a particular engine from algebraic
>> rewrites is Calcite's gift to the world!)
>> 
>> Another very important logical rewrite is "magic sets"[1], which can
>> be modeled as semi-join push-down and done entirely at planning
>> time[2] or (if the runtime supports it) as side-ways information
>> passing of bloom filters or similar. Magic sets are important for
>> graph queries but also very useful for star-schema queries with a
>> fixed number of joins.
>> 
>> Julian
>> 
>> [1] http://db.cs.berkeley.edu/papers/sigmod96-magic.pdf
>> 
>> [2] https://issues.apache.org/jira/browse/CALCITE-468
>> 
>> 
>> On Fri, Dec 15, 2017 at 11:21 AM, Edmon Begoli <[email protected]> wrote:
>>> Great initiative.
>>> 
>>> I will also share some comparative performance studies we did at ORNL on
>>> different graph processing engines. Could be useful.
>>> 
>>> On Fri, Dec 15, 2017 at 14:11 Walaa Eldin Moustafa <
>> [email protected]>
>>> wrote:
>>> 
>>>> Hi Julian,
>>>> 
>>>> Thanks for referencing our Datalog query processing paper [5]. I have
>> been
>>>> thinking about the same idea for a while now too :) I think Calcite is
>> very
>>>> well positioned as a generic query optimizer to add Datalog/recursive
>> query
>>>> support. Also, it makes a lot of sense since it opens a completely new
>>>> dimension for the kind of logic and queries that Calcite can handle,
>>>> including but not limited to graph queries, and that can be immediately
>>>> available to engines talking to Calcite.
>>>> 
>>>> To answer your questions, we probably need to add a transitive closure
>>>> operator. This 1988 paper <http://ieeexplore.ieee.org/document/42731/>
>> by
>>>> Rakesh Agrawal proposes the notion of alpha relations, and defines an
>> alpha
>>>> operator on top of them which computes the transitive closure of alpha
>>>> relations. The operator fits well with the rest of Cod's relational
>> algebra
>>>> operators.
>>>> 
>>>> For query optimizations, one of the commonly used Datalog optimizations
>> is
>>>> Semi-naive evaluation, where instead of re-evaluating the recursive
>> program
>>>> using all existing facts, only new facts inferred since last iteration
>> are
>>>> used. Datalog optimizations become much more interesting when
>> introducing
>>>> aggregation and negation, and it is still an open research question, but
>>>> there is already some tangible progress. Otherwise, as you mentioned
>>>> transitive closure is repeated joins, so pretty much many of the join
>>>> optimizations apply such as predicate pushdown, and aggregation
>>>> pushdown/pull up.
>>>> 
>>>> Regarding the effort, we can always start from basic features and expand
>>>> from there. I have already started working on the parser, AST and
>> logical
>>>> plan builder for basic Datalog without recursion. I am happy to create a
>>>> JIRA ticket to track this effort there.
>>>> 
>>>> Thanks,
>>>> Walaa.
>>>> 
>>>> 
>>>> On Fri, Dec 15, 2017 at 10:26 AM, Julian Hyde <[email protected]> wrote:
>>>> 
>>>>> I've been thinking about Datalog front end to Calcite. How much effort
>>>>> would it be? Would there be an audience who would find it useful?
>>>>> 
>>>>> The genesis of the idea is talks by Frank McSherry[1] and Vasia
>>>>> Kalavri[2] about graph queries and in particular Timely
>>>>> Dataflow[3][4], and a paper about using Datalog for graph processing
>>>>> [5].
>>>>> 
>>>>> A few observations:
>>>>> * Graph queries require repeated (unbounded) joins, and so are beyond
>>>>> SQL:92.
>>>>> * Datalog allows recursion, and can therefore compute things like
>>>>> transitive closure, and is therefore powerful enough for graph
>>>>> queries.
>>>>> * SQL:99 added 'WITH RECURSIVE' so can handle a pretty useful class of
>>>>> queries. However, for a variety of reasons, people tend not to use SQL
>>>>> for querying graph databases.
>>>>> * Datalog is more than just recursive queries. For instance, it is
>>>>> popular with academics analyzing the power/complexity of languages.
>>>>> And it can do deductive queries.
>>>>> * There are different "strengths" of Datalog. Some variants support
>>>>> only linearizable recursion; most variants only support queries whose
>>>>> results are stratified (i.e. can be defined using well-founded
>>>>> induction, necessary when negations are involved).
>>>>> * The "big data" languages (Hadoop, Spark, Flink, even Pig) have all
>>>>> discussed how they can handle graph queries and iterative computation.
>>>>> However they have mainly focused on improvements to their engine and
>>>>> physical algebra, not looked at logical algebra or query language.
>>>>> * If Calcite's algebra could express deductive query / recursive query
>>>>> / iteration, then not only would Datalog be possible, but also SQL's
>>>>> WITH RECURSIVE, and also SPARQL.
>>>>> 
>>>>> So, questions on my mind:
>>>>> * What new operator(s) would we add to Calcite's algebra to enable
>>>>> recursive query?
>>>>> * What optimization rules are possible/necessary for graph queries?
>>>>> * How much effort would it be to add a Datalog parser to Calcite and
>>>>> translate to Calcite's algebra?
>>>>> 
>>>>> Julian
>>>>> 
>>>>> [1] http://www.dataengconf.com/scalability-but-at-what-cost
>>>>> 
>>>>> [2] https://qconsf.com/sf2017/speakers/vasia-kalavri
>>>>> 
>>>>> [3] https://github.com/frankmcsherry/timely-dataflow
>>>>> 
>>>>> [4] http://sigops.org/sosp/sosp13/papers/p439-murray.pdf
>>>>> 
>>>>> [5] http://www.sysnet.ucsd.edu/sysnet/miscpapers/datalog-icbd16.pdf
>>>>> 
>>>> 
>>

Re: Recursive query, graph query, Datalog

Reply via email to