Rob,
Thanks for the details. Lots of good stuff in there.
I am thinking about both tweaking current execution and for a new
(hypothetical) execution engine more targeted at analytics workloads, say.
I have seem some truly monstrous UNIONs (1000s), and FILTER (NOT) IN
leads to unions as well.
The case of a IFP yielding one result, and using that as an index join
is somewhat import. A small amount of buffering probably would not hurt
the execution and make it adapt to large LHS matches. One thing that
might be interesting is switch between index and hash join by sniffing
the input stream and if it is below some threshold use index join and do
a block hash join on a few multiple threads above the threshold. Or a
full-blown parallel hash join (takes twice the working state - it
retains up to about twice the size of the smaller side of the join).
Andy
On 29/01/16 14:06, Rob Vesse wrote:
Andy
Parallelism is used in a couple of main places, bear in mind that
dotNetRDF uses a block based engine rather than a streaming engine (I.e.
no OpSequence, OpConditional) although it does do a form of index join to
avoid unnecessary work wherever possible. The main areas in which it uses
parallelism are joins and filters. So the unit of work is over a
multi-set of possible solutions in SPARQL spec parlance.
For joins since we are joining (or left joining or minus-ing) the results
of two operators we essentially do a parallelized hash join between the
two sides. The hash table is built in serial from the LHS results and
then we parallelise over the RHS results doing look ups into the hash
table and outputting the join results in parallel. You have to be careful
about the data structures used to avoid threads stomping each others
results but this isn't too difficult.
For filters we do something much closer to vectorization where essentially
we parallelise the evaluation of the expression over all the possible
solutions (again we're a blocking engine) and recombine the results
afterwards.
In terms of control in the .Net world we benefit from PLinq which are CLR
supplied extensions to the basic Linq constructs (aka Streams in Java 8)
that automatically parallelize according to the available resources on the
machine. I'm not sure how smart this is but at least in the .Net world
you can constrain it if you want.
For Jena which is a predominantly streaming engine I don't see either of
these approaches providing great benefits. They could be used in the
cases where ARQ does have to do block evaluation to improve things subject
to the points about how best to control the level of parallelism. The
other area where there may be some benefit is to consider the case of
queries with UNION constructs, particularly for queries where we are doing
lazy evaluation (there's a LIMIT and/or OFFSET, ASK etc.) where it may be
possible to dispatch multiple branches of the UNION in parallel.
Rob
On 29/01/2016 11:49, "Andy Seaborne" <[email protected]> wrote:
Rob,
In dotNetRDF, there is parallel execution, isn't there?
I have been thinking (toying with) the idea of parallel execution and I
wondered what unit of work is for the parallelism in dotNetRDF.
What little thinking I've done suggests that tapping into the
parallelism in java streams is not the right way to do it (which is a
shame as that's less work). It needs more control and probably larger
units of work. There is a danger that small/fast queries slow down due
to too much thinking.
It needs more control as well to limit how much of the machine it will
take over because, in Fuseki, it might lead to starvation of other
requests. As some usage is " many clients, many small requests",
parallelism can impact the the system negatively as well as positively.
At some point, the limitation will be the connection of CPUs to RAM
rather than cycles.
Andy
Historical note: RDQL had true parallelism once-upon-a-time. An RDQL
query is a BGP+Filter and not more. The filter ran on a separate thread
to the BGP solver. Timed gain ... just 10%. This was on a early
generation 2 CPU, 2 processor machine so the cost of threading was huge.
Most users then did not have a multi-anything machine. It lead to lots
of problems with thread management when Java wasn't what it is today.