Re: Parallelism in query execution

Rob Vesse Tue, 02 Feb 2016 06:54:12 -0800

The other place huge UNIONs can appear is if you want to try and flatten
out property paths queries involving +, - and * into approximated queries
that yield the same results


I don't think Jena does this but we at Cray certainly provide users the
option to do this

I like the idea of maybe buffering slightly and then intelligently
deciding what kind of join to use.  However this probably needs to take
into account the overall query structure because if you have a small
LIMIT/OFFSET present then you want to do the minimum work possible so
don't want to compute an unnecessary hash join when an index join could
have yielded the one result you needed much faster.

Rob

On 02/02/2016 14:17, "Andy Seaborne" <[email protected]> wrote:

>Rob,
>
>Thanks for the details.  Lots of good stuff in there.
>
>I am thinking about both tweaking current execution and for a new
>(hypothetical) execution engine more targeted at analytics workloads, say.
>
>I have seem some truly monstrous UNIONs (1000s), and FILTER (NOT) IN
>leads to unions as well.
>
>The case of a IFP yielding one result, and using that as an index join
>is somewhat import. A small amount of buffering probably would not hurt
>the execution and make it adapt to large LHS matches. One thing that
>might be interesting is switch between index and hash join by sniffing
>the input stream and if it is below some threshold use index join and do
>a block hash join on a few multiple threads above the threshold.  Or a
>full-blown parallel hash join (takes twice the working state - it
>retains up to about twice the size of the smaller side of the join).
>
>
>       Andy
>
>On 29/01/16 14:06, Rob Vesse wrote:
>> Andy
>>
>> Parallelism is used in a couple of main places, bear in mind that
>> dotNetRDF uses a block based engine rather than a streaming engine (I.e.
>> no OpSequence, OpConditional) although it does do a form of index join
>>to
>> avoid unnecessary work wherever possible.  The main areas in which it
>>uses
>> parallelism are joins and filters.  So the unit of work is over a
>> multi-set of possible solutions in SPARQL spec parlance.
>>
>> For joins since we are joining (or left joining or minus-ing) the
>>results
>> of two operators we essentially do a parallelized hash join between the
>> two sides.  The hash table is built in serial from the LHS results and
>> then we parallelise over the RHS results doing look ups into the hash
>> table and outputting the join results in parallel.  You have to be
>>careful
>> about the data structures used to avoid threads stomping each others
>> results but this isn't too difficult.
>>
>> For filters we do something much closer to vectorization where
>>essentially
>> we parallelise the evaluation of the expression over all the possible
>> solutions (again we're a blocking engine) and recombine the results
>> afterwards.
>>
>> In terms of control in the .Net world we benefit from PLinq which are
>>CLR
>> supplied extensions to the basic Linq constructs (aka Streams in Java 8)
>> that automatically parallelize according to the available resources on
>>the
>> machine.  I'm not sure how smart this is but at least in the .Net world
>> you can constrain it if you want.
>>
>> For Jena which is a predominantly streaming engine I don't see either of
>> these approaches providing great benefits.  They could be used in the
>> cases where ARQ does have to do block evaluation to improve things
>>subject
>> to the points about how best to control the level of parallelism.  The
>> other area where there may be some benefit is to consider the case of
>> queries with UNION constructs, particularly for queries where we are
>>doing
>> lazy evaluation (there's a LIMIT and/or OFFSET, ASK etc.) where it may
>>be
>> possible to dispatch multiple branches of the UNION in parallel.
>>
>> Rob
>>
>> On 29/01/2016 11:49, "Andy Seaborne" <[email protected]> wrote:
>>
>>> Rob,
>>>
>>> In dotNetRDF, there is parallel execution, isn't there?
>>>
>>> I have been thinking (toying with) the idea of parallel execution and I
>>> wondered what unit of work is for the parallelism in dotNetRDF.
>>>
>>> What little thinking I've done suggests that tapping into the
>>> parallelism in java streams is not the right way to do it (which is a
>>> shame as that's less work).  It needs more control and probably larger
>>> units of work. There is a danger that small/fast queries slow down due
>>> to too much thinking.
>>>
>>> It needs more control as well to limit how much of the machine it will
>>> take over because, in Fuseki, it might lead to starvation of other
>>> requests.  As some usage is " many clients, many small requests",
>>> parallelism can impact the the system negatively as well as positively.
>>> At some point, the limitation will be the connection of CPUs to RAM
>>> rather than cycles.
>>>
>>>      Andy
>>>
>>> Historical note: RDQL had true parallelism once-upon-a-time.  An RDQL
>>> query is a BGP+Filter and not more.  The filter ran on a separate
>>>thread
>>> to the BGP solver.  Timed gain ... just 10%.  This was on a early
>>> generation 2 CPU, 2 processor machine so the cost of threading was
>>>huge.
>>>   Most users then did not have a multi-anything machine. It lead to
>>>lots
>>> of problems with thread management when Java wasn't what it is today.
>>
>>
>>
>>
>

Re: Parallelism in query execution

Reply via email to