Re: Recursive query, graph query, Datalog

Julian Hyde Sun, 24 Dec 2017 22:14:26 -0800

Regarding UNION vs UNION ALL. Sure, let's go with UNION. I could
imagine cases where UNION is not sufficient (e.g. we want to generate
a multiset, or annotate each row with the stratum at which it was
generated) but UNION is good enough at first.


Regarding a parser/AST, did you see Riccardo's email in this thread[1]
mentioning the datalog parser at the University of Bolzano.

Julian

[1] 
https://lists.apache.org/thread.html/e80629a16883ce32b07be0e14002aef6734a68913fe420f6edddfc17@%3Cdev.calcite.apache.org%3E

On Thu, Dec 21, 2017 at 6:42 PM, Walaa Eldin Moustafa
<[email protected]> wrote:
> Agreed on the high level description of the iterate operator and table
> function. The table function is basically a "combiner" that will combine
> the delta with the result of past iteration somehow.
>
> I would say we need a UINION (versus UNION ALL) operator since we do not
> want to re-add facts that were already inferred in past iteration (in case
> they are re-inferred).
>
> Are you aware of anyone working on the parser/AST? I am giving them a try
> in case someone wants to collaborate.
>
> Thanks,
> Walaa.
>
>
> On Mon, Dec 18, 2017 at 12:02 AM, Julian Hyde <[email protected]> wrote:
>
>> Yes, I agree.
>>
>> My first instinct is to add an Iterate operator whose arguments are (1)
>> the input, (2) a table function that applies to the delta at each
>> iteration. When the table function returns the empty set, iteration stops.
>> The “table function” is not a function per se, but a RelNode tree that
>> references a pseudo-table called “Delta”. Think of it as a relational
>> lambda expression, and the “Delta" is the argument.
>>
>> Intermediate results are combined using UNION ALL. Is this too
>> restrictive? I think maybe not, because you can always add a “finalizer”
>> such as an Aggregate after the Iterate operator.
>>
>> Julian
>>
>>
>> > On Dec 15, 2017, at 3:11 PM, Walaa Eldin Moustafa <[email protected]>
>> wrote:
>> >
>> > Yes, Magic sets is a very important and popular optimization as well. I
>> > guess once we can get a basic notion of recursion as a construct in
>> > Calcite, and get it to work correctly, we can start cracking
>> optimizations.
>> > One thing to note is that the convergence/fixed point depend on the data,
>> > and there is no way to know beforehand what the (complete) plan will look
>> > like (i.e., how many joins). It seems that there must be some sort of
>> > awareness in the host engine of the fact that the query is recursive, and
>> > it should keep iterating till fixed point, or at least tell Calcite if it
>> > converged or not, and if not, Calcite will ask it to keep trying, so
>> every
>> > iteration Calcite sends a traditional (non-recursive) RA plan, or ask the
>> > engine to stop. Do you agree?
>> >
>> >
>> > On Fri, Dec 15, 2017 at 12:06 PM, Julian Hyde <[email protected]> wrote:
>> >
>> >> (Moving Carl, Shrikanth, Vasanth to bcc.)
>> >>
>> >> Regarding optimizations. One one hand, it is daunting that there so
>> >> many optimizations are required to make graph queries run efficiently,
>> >> but on the other hand, it is good news for the project if those can be
>> >> expressed in relational algebra.
>> >>
>> >> Looking at the previous research, some of the optimizations applied
>> >> are genuinely only possible at run-time, but others should be thought
>> >> of as logical rewrites. Semi-naive evaluation, which Walaa mentions,
>> >> can be expressed as a logical operation (very similar to incremental
>> >> view maintenance and streaming, by the way).
>> >>
>> >> (Untangling the capabilities of a particular engine from algebraic
>> >> rewrites is Calcite's gift to the world!)
>> >>
>> >> Another very important logical rewrite is "magic sets"[1], which can
>> >> be modeled as semi-join push-down and done entirely at planning
>> >> time[2] or (if the runtime supports it) as side-ways information
>> >> passing of bloom filters or similar. Magic sets are important for
>> >> graph queries but also very useful for star-schema queries with a
>> >> fixed number of joins.
>> >>
>> >> Julian
>> >>
>> >> [1] http://db.cs.berkeley.edu/papers/sigmod96-magic.pdf
>> >>
>> >> [2] https://issues.apache.org/jira/browse/CALCITE-468
>> >>
>> >>
>> >> On Fri, Dec 15, 2017 at 11:21 AM, Edmon Begoli <[email protected]>
>> wrote:
>> >>> Great initiative.
>> >>>
>> >>> I will also share some comparative performance studies we did at ORNL
>> on
>> >>> different graph processing engines. Could be useful.
>> >>>
>> >>> On Fri, Dec 15, 2017 at 14:11 Walaa Eldin Moustafa <
>> >> [email protected]>
>> >>> wrote:
>> >>>
>> >>>> Hi Julian,
>> >>>>
>> >>>> Thanks for referencing our Datalog query processing paper [5]. I have
>> >> been
>> >>>> thinking about the same idea for a while now too :) I think Calcite is
>> >> very
>> >>>> well positioned as a generic query optimizer to add Datalog/recursive
>> >> query
>> >>>> support. Also, it makes a lot of sense since it opens a completely new
>> >>>> dimension for the kind of logic and queries that Calcite can handle,
>> >>>> including but not limited to graph queries, and that can be
>> immediately
>> >>>> available to engines talking to Calcite.
>> >>>>
>> >>>> To answer your questions, we probably need to add a transitive closure
>> >>>> operator. This 1988 paper <http://ieeexplore.ieee.org/document/42731/
>> >
>> >> by
>> >>>> Rakesh Agrawal proposes the notion of alpha relations, and defines an
>> >> alpha
>> >>>> operator on top of them which computes the transitive closure of alpha
>> >>>> relations. The operator fits well with the rest of Cod's relational
>> >> algebra
>> >>>> operators.
>> >>>>
>> >>>> For query optimizations, one of the commonly used Datalog
>> optimizations
>> >> is
>> >>>> Semi-naive evaluation, where instead of re-evaluating the recursive
>> >> program
>> >>>> using all existing facts, only new facts inferred since last iteration
>> >> are
>> >>>> used. Datalog optimizations become much more interesting when
>> >> introducing
>> >>>> aggregation and negation, and it is still an open research question,
>> but
>> >>>> there is already some tangible progress. Otherwise, as you mentioned
>> >>>> transitive closure is repeated joins, so pretty much many of the join
>> >>>> optimizations apply such as predicate pushdown, and aggregation
>> >>>> pushdown/pull up.
>> >>>>
>> >>>> Regarding the effort, we can always start from basic features and
>> expand
>> >>>> from there. I have already started working on the parser, AST and
>> >> logical
>> >>>> plan builder for basic Datalog without recursion. I am happy to
>> create a
>> >>>> JIRA ticket to track this effort there.
>> >>>>
>> >>>> Thanks,
>> >>>> Walaa.
>> >>>>
>> >>>>
>> >>>> On Fri, Dec 15, 2017 at 10:26 AM, Julian Hyde <[email protected]>
>> wrote:
>> >>>>
>> >>>>> I've been thinking about Datalog front end to Calcite. How much
>> effort
>> >>>>> would it be? Would there be an audience who would find it useful?
>> >>>>>
>> >>>>> The genesis of the idea is talks by Frank McSherry[1] and Vasia
>> >>>>> Kalavri[2] about graph queries and in particular Timely
>> >>>>> Dataflow[3][4], and a paper about using Datalog for graph processing
>> >>>>> [5].
>> >>>>>
>> >>>>> A few observations:
>> >>>>> * Graph queries require repeated (unbounded) joins, and so are beyond
>> >>>>> SQL:92.
>> >>>>> * Datalog allows recursion, and can therefore compute things like
>> >>>>> transitive closure, and is therefore powerful enough for graph
>> >>>>> queries.
>> >>>>> * SQL:99 added 'WITH RECURSIVE' so can handle a pretty useful class
>> of
>> >>>>> queries. However, for a variety of reasons, people tend not to use
>> SQL
>> >>>>> for querying graph databases.
>> >>>>> * Datalog is more than just recursive queries. For instance, it is
>> >>>>> popular with academics analyzing the power/complexity of languages.
>> >>>>> And it can do deductive queries.
>> >>>>> * There are different "strengths" of Datalog. Some variants support
>> >>>>> only linearizable recursion; most variants only support queries whose
>> >>>>> results are stratified (i.e. can be defined using well-founded
>> >>>>> induction, necessary when negations are involved).
>> >>>>> * The "big data" languages (Hadoop, Spark, Flink, even Pig) have all
>> >>>>> discussed how they can handle graph queries and iterative
>> computation.
>> >>>>> However they have mainly focused on improvements to their engine and
>> >>>>> physical algebra, not looked at logical algebra or query language.
>> >>>>> * If Calcite's algebra could express deductive query / recursive
>> query
>> >>>>> / iteration, then not only would Datalog be possible, but also SQL's
>> >>>>> WITH RECURSIVE, and also SPARQL.
>> >>>>>
>> >>>>> So, questions on my mind:
>> >>>>> * What new operator(s) would we add to Calcite's algebra to enable
>> >>>>> recursive query?
>> >>>>> * What optimization rules are possible/necessary for graph queries?
>> >>>>> * How much effort would it be to add a Datalog parser to Calcite and
>> >>>>> translate to Calcite's algebra?
>> >>>>>
>> >>>>> Julian
>> >>>>>
>> >>>>> [1] http://www.dataengconf.com/scalability-but-at-what-cost
>> >>>>>
>> >>>>> [2] https://qconsf.com/sf2017/speakers/vasia-kalavri
>> >>>>>
>> >>>>> [3] https://github.com/frankmcsherry/timely-dataflow
>> >>>>>
>> >>>>> [4] http://sigops.org/sosp/sosp13/papers/p439-murray.pdf
>> >>>>>
>> >>>>> [5] http://www.sysnet.ucsd.edu/sysnet/miscpapers/datalog-icbd16.pdf
>> >>>>>
>> >>>>
>> >>
>>
>>

Re: Recursive query, graph query, Datalog

Reply via email to