Re: Parallelism in query execution

Rob Vesse Fri, 29 Jan 2016 06:08:11 -0800

Andy

Parallelism is used in a couple of main places.  Bear in mind that
dotNetRDF uses a block-based engine rather than a streaming engine (i.e.
no OpSequence, OpConditional), although it does do a form of index join to
avoid unnecessary work wherever possible.  The main areas in which it uses
parallelism are joins and filters, so the unit of work is a multiset of
possible solutions, in SPARQL spec parlance.

For joins, since we are joining (or left joining or minus-ing) the results
of two operators, we essentially do a parallelized hash join between the
two sides.  The hash table is built serially from the LHS results, and
then we parallelise over the RHS results, doing lookups into the hash
table and outputting the join results in parallel.  You have to be careful
about the data structures used to avoid threads stomping on each other's
results, but this isn't too difficult.
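
As a rough illustration only (this is not dotNetRDF's actual code, which
is C#), the build-then-probe shape described above might look like this in
Java.  The class name, the Map<String, String> representation of a
solution, and the restriction to a single join variable are all
assumptions made for the sketch; real SPARQL join compatibility is more
involved.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: a solution is a map from variable name to value.
final class ParallelHashJoinSketch {

    static List<Map<String, String>> join(List<Map<String, String>> lhs,
                                          List<Map<String, String>> rhs,
                                          String joinVar) {
        // Build phase: serial, on the calling thread only.
        Map<String, List<Map<String, String>>> table = new HashMap<>();
        for (Map<String, String> left : lhs) {
            table.computeIfAbsent(left.get(joinVar), k -> new ArrayList<>()).add(left);
        }

        // Probe phase: parallel over the RHS. The table is only read from
        // here on, and every output solution is a fresh map, so threads do
        // not write to shared state.
        return rhs.parallelStream()
                  .flatMap(right -> table
                          .getOrDefault(right.get(joinVar), Collections.emptyList())
                          .stream()
                          .map(left -> merge(left, right)))
                  .collect(Collectors.toList());
    }

    // Combine two compatible solutions into a new one.
    private static Map<String, String> merge(Map<String, String> left,
                                             Map<String, String> right) {
        Map<String, String> merged = new HashMap<>(left);
        merged.putAll(right);
        return merged;
    }
}
```

Letting the stream's collector assemble the output, rather than having
each thread append to a shared list, is one simple way to avoid the
stomping problem.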

For filters we do something much closer to vectorization, where
essentially we parallelise the evaluation of the expression over all the
possible solutions (again, we're a blocking engine) and recombine the
results afterwards.
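
A minimal Java sketch of that evaluate-then-recombine shape, again with a
hypothetical class name and the same assumed Map<String, String> solution
representation; the filter expression is stood in for by a plain
Predicate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.IntStream;

// Hypothetical sketch: the whole block of solutions is already materialised.
final class ParallelFilterSketch {

    static List<Map<String, String>> filter(List<Map<String, String>> solutions,
                                            Predicate<Map<String, String>> expr) {
        // Phase 1: evaluate the expression for every solution in parallel.
        // Each index is written by exactly one task, so no locking is needed.
        // Assumes a random-access list such as ArrayList.
        boolean[] keep = new boolean[solutions.size()];
        IntStream.range(0, keep.length)
                 .parallel()
                 .forEach(i -> keep[i] = expr.test(solutions.get(i)));

        // Phase 2: recombine serially, preserving the input order.
        List<Map<String, String>> out = new ArrayList<>();
        for (int i = 0; i < keep.length; i++) {
            if (keep[i]) {
                out.add(solutions.get(i));
            }
        }
        return out;
    }
}
```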

In terms of control, in the .Net world we benefit from PLINQ, which is a
set of CLR-supplied extensions to the basic LINQ constructs (aka Streams
in Java 8) that automatically parallelize according to the available
resources on the machine.  I'm not sure how smart this is, but at least in
the .Net world you can constrain it if you want.
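
On the Java side, one way to get a comparable bound (and larger units of
work) is to skip the shared stream pool and hand explicit chunks to a
fixed-size executor.  This is only a sketch under assumptions: the class
name, method name and parameters are hypothetical, and a real server would
presumably share one pool across requests rather than create one per call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch: cap the thread count and control the unit of work.
final class BoundedParallelismSketch {

    static <T, R> List<R> mapInChunks(List<T> input,
                                      Function<? super T, ? extends R> work,
                                      int maxThreads,
                                      int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        // At most maxThreads worker threads, however many cores there are.
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            // One task per chunk, so each task is a reasonably large unit.
            List<Future<List<R>>> pending = new ArrayList<>();
            for (int start = 0; start < input.size(); start += chunkSize) {
                List<T> chunk =
                    input.subList(start, Math.min(start + chunkSize, input.size()));
                pending.add(pool.submit(() -> {
                    List<R> out = new ArrayList<>(chunk.size());
                    for (T item : chunk) {
                        out.add(work.apply(item));
                    }
                    return out;
                }));
            }
            // Recombine in submission order so the output order is stable.
            List<R> results = new ArrayList<>(input.size());
            for (Future<List<R>> future : pending) {
                results.addAll(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for chunks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a chunk failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```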

For Jena, which is a predominantly streaming engine, I don't see either of
these approaches providing great benefits.  They could be used in the
cases where ARQ does have to do block evaluation to improve things,
subject to the points about how best to control the level of parallelism.
The other area where there may be some benefit is queries with UNION
constructs, particularly queries where we are doing lazy evaluation
(there's a LIMIT and/or OFFSET, ASK etc.), where it may be possible to
dispatch multiple branches of the UNION in parallel.
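
To make the UNION idea concrete, here is a hedged Java sketch, not
anything ARQ does today: each branch is modelled as an Iterable of
solutions (an assumption), branches run on a bounded pool and feed a
bounded queue, and the consumer stops as soon as the LIMIT is satisfied.
All names are hypothetical, and handling of a branch that throws is left
out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: evaluate UNION branches in parallel, stop at LIMIT.
final class ParallelUnionSketch {

    static <T> List<T> unionWithLimit(List<? extends Iterable<T>> branches,
                                      int limit,
                                      int maxThreads) {
        List<T> results = new ArrayList<>();
        if (branches.isEmpty() || limit <= 0) {
            return results;
        }
        // Bounded queue: producers block rather than running far ahead.
        BlockingQueue<Optional<T>> queue = new ArrayBlockingQueue<>(64);
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            for (Iterable<T> branch : branches) {
                pool.execute(() -> {
                    try {
                        for (T solution : branch) {
                            queue.put(Optional.of(solution));
                        }
                        // An empty Optional marks the end of this branch.
                        queue.put(Optional.empty());
                    } catch (InterruptedException e) {
                        // Cancelled because the consumer has enough results.
                        Thread.currentThread().interrupt();
                    }
                });
            }
            int finished = 0;
            while (finished < branches.size() && results.size() < limit) {
                Optional<T> next = queue.take();
                if (next.isPresent()) {
                    results.add(next.get());
                } else {
                    finished++;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            // Interrupts any branch still running or blocked on the queue.
            pool.shutdownNow();
        }
        return results;
    }
}
```

The result order depends on thread scheduling, which is acceptable for a
UNION without ORDER BY; an ASK is roughly the same sketch with a limit
of 1.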

Rob

On 29/01/2016 11:49, "Andy Seaborne" <[email protected]> wrote:

>Rob,
>
>In dotNetRDF, there is parallel execution, isn't there?
>
>I have been thinking (toying with) the idea of parallel execution and I
>wondered what unit of work is for the parallelism in dotNetRDF.
>
>What little thinking I've done suggests that tapping into the
>parallelism in java streams is not the right way to do it (which is a
>shame as that's less work).  It needs more control and probably larger
>units of work. There is a danger that small/fast queries slow down due
>to too much thinking.
>
>It needs more control as well to limit how much of the machine it will
>take over because, in Fuseki, it might lead to starvation of other
>requests.  As some usage is "many clients, many small requests",
>parallelism can impact the system negatively as well as positively.
>At some point, the limitation will be the connection of CPUs to RAM
>rather than cycles.
>
>     Andy
>
>Historical note: RDQL had true parallelism once-upon-a-time.  An RDQL
>query is a BGP+Filter and not more.  The filter ran on a separate thread
>to the BGP solver.  Timed gain ... just 10%.  This was on an early
>generation 2 CPU, 2 processor machine so the cost of threading was huge.
>  Most users then did not have a multi-anything machine. It led to lots
>of problems with thread management when Java wasn't what it is today.
