Re: How does Jena-ARQ execute the queries containing ORDER BY + LIMIT clause

Andy Seaborne Tue, 10 Mar 2015 07:18:06 -0700

On 10/03/15 14:10, Rob Vesse wrote:

Andy


Paul's MINUS implementation does use a hash join


Yes - true.

One day I'll port properly Quack to ARQ. It's getting tested in otherwork. Maybe I should temporarily rip out the hash (left)join part tocatch the cases where index join is not used for joins.


But a release first ...

        Andy


Rob

On 10/03/2015 12:16, "Andy Seaborne" <[email protected]> wrote:

What's the background for all these questions ??!!

Index Joins have the useful feature that they take constant memory
overhead.  This means one major area where RAM would otherwise run out
is removed.

ARQ does not currently use hash joins.  It should, though SPARQL tends
quite strongly to produce chains of joins. (There are cases in SPARQL
involving OPTIONALs where index joins, with scope elimination, does not
work as an execution strategy.  See the literature.)

I have an alternative library of a collection of various join algorithms
that can be used to build alternative query execution strategies.  This
evaluator is not ready for production and not part of the project (yet?).

Pipeline joins are interesting as a sort of two-sided hash join which is
non-blocking.  Useful for multi-core and distributed evaluation.

        Andy



On 10/03/15 11:49, Rose Beck wrote:

Dear Rob,

Thanks a lot again :)

It seems to me that Index Joins are an efficient solution when the
query plan is a left deep join tree. However, when the query plan is
bushy and the RHS itself includes many Index Join operators
(essentially forming a sub-tree), then dont you think computing the
entire RHS sub-tree for each join result of the LHS sub-tree is
expensive. Please correct me if I am wrong. Or does Jena only builds
left deep join trees.

On Tue, Mar 10, 2015 at 5:11 PM, Rob Vesse <[email protected]> wrote:

Jena does use hash joins where necessary which are indeed blocking

However in most cases it can instead use index joins which are
essentially
a form of merge join whereby you substitute candidate solutions from
the
LHS into the RHS and evaluate the RHS for each LHS solution

Rob

On 10/03/2015 11:10, "Rose Beck" <[email protected]> wrote:

Dear Rob,

Thanks a lot for the reply again.

But I am curious does Jena implement Hash joins -- which are
essentially blocking in nature. If Jena does not, then how is a join
between two unsorted lists (of intermediate results) done in Jena?



On Tue, Mar 10, 2015 at 4:34 PM, Rob Vesse <[email protected]>
wrote:

Yes that is pretty much what happens

Note though that most query evaluation in ARQ is done in a streaming
fashion so the full set of solutions is typically never held in
memory
for
any query unless an operator which requires full /partial
materialisation
e.g. DISTINCT is encountered

Rob

On 10/03/2015 10:24, "Rose Beck" <[email protected]> wrote:

Dear Rob,

First and foremost thank you for such a wonderful explanation.

Just to clarify, say my example query is:

select ?a?b?c where{?a <pred1> ?c. ?a <pred2> ?b} order by ?c limit
10

Then for the query above all the solutions are generated just as it
would be the case for the SPARQL query: select ?a?b?c where{?a
<pred1>
?c. ?a <pred2> ?b}. But within the priority queue (which is the last
operation which is applied before results are output to the users)
at
any point just 10 solutions ordered by ?c are placed. Please correct
me if I am wrong.

Cheers,
Rose


On Tue, Mar 10, 2015 at 3:38 PM, Rob Vesse <[email protected]>
wrote:

Query execution in ARQ is based on nested iterators so
QueryIterTopN
will
always apply over another iterator

A PriorityQueue is used internally as temporary storage within
QueryIterTopN while it exhausts the inner iterator allowing it to
only
use
at most the limit amount of storage in the priority queue plus
whatever
temporary storage the inner iterator(s) may need.

There is still a "total sort" in the sense that every possible
solution
has to be compared to see if it should be placed into the priority
queue
however there is not a "total sort" in the sense of needing to
materialise
all possible solutions into memory and then sort.

Rob

p.s. Please don't post identical questions to both users@ and dev@
-
one
list is sufficient as the developers are on both lists.  As a
general
rule
general support questions should go to users@ and
technical/architecture
questions like this should go to dev@



On 10/03/2015 05:54, "Rose Beck" <[email protected]> wrote:

Hi,

I saw the following issue posted on Jena website (which has been
recently resolved):
Avoid a total sort for ORDER BY + LIMIT queries
(https://issues.apache.org/jira/browse/JENA-89).

I am very interested in understanding as to how does Jena-ARQ
avoids
total sort for ORDER BY + LIMIT queries. In the post it is
mentioned
that Jena-ARQ uses priority queue for avoiding a final sort,
however
it is also mentioned that "ARQ's algebra package contains already
a
OpTopN [3] operator. The OpExecutor [4] will need to use a new
QueryIterTopN instead of QueryIterSort + QueryIterSlice." It is
not
clear now does the priority queue benefit from OpTopN operator and
QueryIterTopN as the links [3] and [4] mentioned on the website
does
not work, so I am not able to understand their operation and as
to how
do they help in avoiding a total sort.

Can someone please explain how does Jena-ARQ execute the queries
containing ORDER BY + LIMIT clause.

With Warm Regards,
Rose




--
With Warm Regards,
Rose




--
With Warm Regards,
Rose

Re: How does Jena-ARQ execute the queries containing ORDER BY + LIMIT clause

Reply via email to