Re: Nested Aggregates?

Rob Vesse Mon, 04 Jun 2012 11:37:33 -0700

Again comments inline:

On 6/4/12 11:10 AM, "Andy Seaborne" <[email protected]> wrote:



>On 04/06/12 18:51, Rob Vesse wrote:
>> Comments inline:
>>
>>
>> On 6/3/12 6:05 AM, "Andy Seaborne"<[email protected]>  wrote:
>>
>>>
>>>> So two  questions:
>>>>
>>>>    1.  Are nested aggregates permitted?  The grammar says yes so I'm
>>>> assuming yes
>>>>    2.  Is there a bug in ARQ's implementation of this?
>>>
>>>
>>> 1 - yes, legal syntax ... that can be changed :-)  I'm going to pass
>>> this on to the WG.
>>
>> I'm not sure they necessarily need forbidding; I think they may have
>> some uses.
>
>Do you have a real example? While I don't want to forbid anything just
>because, there is a specific implementation impact (count(*) -- see
>below) so I'm looking for a real example, not just that some app might
>want to do it.

No, and to be honest my colleagues and I would probably be quite happy if
they were forbidden, because it makes things easier.  Especially since, as
you and I both noted, a subquery works around this problem and expresses
the meaning of the query more explicitly.
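
To illustrate the two-step evaluation the subquery form makes explicit, here is a minimal Python sketch over the sample data quoted further down the thread. The function names are hypothetical and this only mirrors the arithmetic; it is not how ARQ or dotNetRDF evaluate the query.

```python
from fractions import Fraction

# Sample data from the original question: (mealPrice, mealTip) pairs.
meals = [(25, 7), (50, 10), (100, 25)]


def max_tip_percent(rows):
    """Inner step, like the subquery: MAX(?mealTip / ?mealPrice)."""
    return max(Fraction(tip, price) for price, tip in rows)


def avg_cost_with_best_tip(rows):
    """Outer step: AVG(?mealPrice * (1 + ?MaxTipPercent))."""
    # The inner aggregate is fully evaluated before the outer pass starts.
    best = max_tip_percent(rows)
    costs = [price * (1 + best) for price, _ in rows]
    return sum(costs) / len(costs)


result = avg_cost_with_best_tip(meals)
print(result, float(result))
```

The outer pass cannot begin until the inner maximum is known for every row, which is exactly the ordering the subquery forces.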

>
>As you pointed out - the subquery may have been what was meant in the
>first place - a sort of groups-in-groups.
>
>The complicating example is sum(?x)/count(?x) which is the moral
>equivalent of avg(?x) give or take count == 0
>
>Now replace ?x with something silly ... like avg(?x) !
>
>As you cannot technically write anything but tree expressions, we don't
>have the maniac cases of shared sub-expressions (SPARQL only needs to be
>defined over strings - not programmatically constructed shapes).
>
>I ask about real examples because the cost at scale of this has the
>potential to be quite bad (having to keep rows around longer).  Of
>course, we can spill to disk, but it's still disk.
>
>(BTW I still think SPARQL should add standard deviation aggregates).

Personally I have advocated in the past (though not necessarily in a
formal comment) that engines should be able to introduce new aggregates
using the same extension mechanism by which new functions can be
introduced now.  Parsers just have to be well informed enough to know
what's an extension function vs what's an extension aggregate.

dotNetRDF already has lvn:nmax and lvn:nmin (MAX and MIN on numerics only;
non-numerics are ignored) and lvn:median included.
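
As a rough sketch of that idea (not dotNetRDF's actual implementation), the Python below keeps extension aggregates and extension functions in separate lookup tables so a parser could tell them apart by name. The table names, the classify helper and the ex:double function are hypothetical; lvn:median is left out because its handling of non-numeric values is not described here.

```python
from numbers import Real


def _numerics(values):
    """Keep numeric values only; bool is excluded although it subclasses int."""
    return [v for v in values if isinstance(v, Real) and not isinstance(v, bool)]


def nmax(values):
    """MAX over numerics only; non-numeric values are ignored."""
    nums = _numerics(values)
    return max(nums) if nums else None


def nmin(values):
    """MIN over numerics only; non-numeric values are ignored."""
    nums = _numerics(values)
    return min(nums) if nums else None


# Hypothetical registries: one for aggregates, one for ordinary functions.
EXTENSION_AGGREGATES = {"lvn:nmax": nmax, "lvn:nmin": nmin}
EXTENSION_FUNCTIONS = {"ex:double": lambda v: v * 2}


def classify(name):
    """Tell a parser whether a name is an aggregate, a function, or unknown."""
    if name in EXTENSION_AGGREGATES:
        return "aggregate"
    if name in EXTENSION_FUNCTIONS:
        return "function"
    return "unknown"


values = [3, "abc", 7.5, None]
print(classify("lvn:nmax"), nmax(values), nmin(values))
```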

>
>>> 2 - What ARQ does is to calculate the aggregates of a group as the
>>> group is seen, not at the end of the group block when all the elements
>>> are known.
>>>
>>> A line of argument is that the expression inside the aggregate is
>>> applied to each row, so only row variables are in-scope.  The aggregate
>>> AVG(max(?x)+1) is violating that.  So for streaming and for scoping,
>>> I'd argue it's wrong - is there a use case that argues for it?
>>
>> Yes, the reason this works easily in dotNetRDF (after I fixed the
>> scoping bug) is that dotNetRDF always calculates the full result at
>> every stage (with some special exceptions for some ASK and LIMITed
>> queries), so all the groups have been calculated before the aggregates
>> are applied.
>
>You do that for count(*) and count(?x)?
>
>TDB counts these with zero space overhead.

Yes, count(*) is still very fast because the data structures provide a
Count property so no additional loop over the data is required.  Count(?x)
less so but still relatively fast.
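
A small Python analogue of the difference, assuming a group is already materialised as a list of solution rows (dicts here, which is an illustrative choice rather than dotNetRDF's data structure): count(*) is just the size of the collection, while count(?x) has to look at each row to see whether ?x is bound.

```python
# A materialised group: each dict is one solution; the third row leaves ?x unbound.
group = [{"x": 1}, {"x": 2}, {}, {"x": 2}]


def count_star(rows):
    """count(*): the collection already knows its size, no loop needed."""
    return len(rows)


def count_var(rows, var):
    """count(?x): one pass to count the rows where the variable is bound."""
    return sum(1 for row in rows if var in row)


print(count_star(group), count_var(group, "x"))
```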

The non-streaming thing is a historical decision.  I have designs for a
fully streaming engine but it requires more effort and time than I have
free right now.

Also, non-streaming does have some advantage in .Net in leveraging
automated parallelization of some operations (Joins, Products, Filters,
Extends etc).  In addition, the join algorithm we use trades the memory
cost of having everything in memory at once for performance, so it works
well in our engine.  These things can all be done when moving to a
streaming engine, but there is other rework involved and the risk of
losing performance by not having our current join algorithm available
(though with some work that may be achievable).
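
For contrast with full materialisation, here is a minimal Python sketch of the streaming style Andy describes, where only the group key and the accumulators are kept and each row is dropped after use. The class and function names are hypothetical and this is not ARQ's code.

```python
class GroupState:
    """Constant-size state for one group: enough for AVG and MAX, no rows."""

    def __init__(self):
        self.total = 0
        self.count = 0
        self.maximum = None

    def add(self, value):
        self.total += value
        self.count += 1
        if self.maximum is None or value > self.maximum:
            self.maximum = value

    def avg(self):
        return self.total / self.count if self.count else None


def stream_aggregate(rows, key, var):
    """Consume rows one at a time, keeping only per-group accumulators."""
    groups = {}
    for row in rows:  # each row is used once and then discarded
        state = groups.get(row[key])
        if state is None:
            state = groups[row[key]] = GroupState()
        state.add(row[var])
    return groups


def row_source():
    """A generator, so rows are never held in a list."""
    yield {"g": "a", "x": 1}
    yield {"g": "b", "x": 10}
    yield {"g": "a", "x": 3}


states = stream_aggregate(row_source(), "g", "x")
print({k: (s.avg(), s.maximum) for k, s in states.items()})
```

An expression such as AVG(?x * MAX(?y)) does not fit this shape: the final maximum is only known after the last row, by which point the earlier rows have already been thrown away.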

>
>> I think because of the potential scoping confusion and the fact that
>> allowing nesting may make implementation much harder for streaming
>> implementations like ARQ it is certainly worth the working group
>> discussing this.
>>  Do you want me to make a formal comment to the working group?
>
>I have :
>
>http://lists.w3.org/Archives/Public/public-rdf-dawg/2012AprJun/0194.html
>
>but don't let that stop you doing so.  I did it to make sure it didn't
>get lost.

Thanks for doing that

>
>>>
>>> The spec needs clarifying if it is to be bad syntax.  The simple
>>> solution is no nested aggregate expressions.
>>>
>>> Here is a related example:
>>>
>>> SELECT (max(?x) As ?M) (avg(?M+1) AS ?A)
>>>
>>> because the select expression rules say you can use ?M inside AVG().
>>
>> I tried that but it didn't work in ARQ either so not sure if that is a
>> bug
>
>It won't for much the same reason -- I presented it as a related but
>different example.
>
>>
>>>
>>> Not sure what SQL says about this - the SPARQL processing model is the
>>> same framework as SQL.
>>>
>>> The failure in ARQ is because AVG is calculating the sum on each row of
>>> a group as each row comes in (and the row can be thrown away afterward -
>>> just the key and aggregators need be kept) so it's before MAX() is
>>> ready overall and even possibly before it has been called even with the
>>> current row (undefined ordering).
>>
>> As I noted above even separating out the aggregates and using the
>> variable for the max inside the average seemed not to work so may be
>> another ARQ issue?
>
>It's the same root cause: ?M is out-of-scope of a row, so avg(?M+1) is
>avg(undef+1) -> error -> no binding for ?A.

Of course, should have realized that
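
Andy's chain (unbound variable, then expression error, then no binding) can be mimicked in a few lines of Python. This is a simplified sketch with hypothetical names; a row is a dict of the variables in scope for that row, and empty-group handling is glossed over.

```python
class ExprError(Exception):
    """Raised when an expression cannot be evaluated for a row."""


def m_plus_one(row):
    """Evaluate ?M + 1 using only the variables bound in this row."""
    if "M" not in row:
        raise ExprError("?M is not in scope for this row")
    return row["M"] + 1


def avg_or_unbound(rows, expr):
    """AVG over expr(row); any evaluation error leaves the result unbound (None)."""
    total = 0
    count = 0
    try:
        for row in rows:
            total += expr(row)
            count += 1
    except ExprError:
        return None  # error in the aggregate -> no binding for the result
    return total / count if count else None  # simplified for empty groups


# Rows of the group only carry ?x; ?M is a select-level alias, not a row variable.
rows_without_m = [{"x": 1}, {"x": 5}]
rows_with_m = [{"M": 1}, {"M": 3}]

print(avg_or_unbound(rows_without_m, m_plus_one))
print(avg_or_unbound(rows_with_m, m_plus_one))
```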

Cheers,

Rob



>
>       Andy
>
>>
>> Rob
>>
>>>
>>>     Andy
>>>
>>> On 01/06/12 18:15, Rob Vesse wrote:
>>>> Just a thought - for those who may be interested here is a version of
>>>> the query that does work as expected with the current ARQ snapshot.
>>>>
>>>> It removes the GROUP BY so the query actually answers the question of
>>>> interest and moves the MAX() into a subquery to force evaluation order
>>>> and scope:
>>>>
>>>> PREFIX ex:<http://example.org/meals#>
>>>> SELECT (AVG(?mealPrice * (1.0 + ?MaxTipPercent)) AS ?avgCostWithBestTip)
>>>> WHERE
>>>> {
>>>>     ?description ex:mealPrice ?mealPrice .
>>>>     {
>>>>       SELECT (MAX(?mealTip / ?mealPrice) AS ?MaxTipPercent)
>>>>       WHERE
>>>>       {
>>>>         ?description ex:mealPrice ?mealPrice .
>>>>         ?description ex:mealTip ?mealTip .
>>>>       }
>>>>     }
>>>> }
>>>>
>>>> Rob
>>>>
>>>>
>>>>
>>>>
>>>> On 6/1/12 10:11 AM, "Rob Vesse"<[email protected]>   wrote:
>>>>
>>>>> Hey All
>>>>>
>>>>> I have an interesting question about nested aggregates which was
>>>>> posed by some colleagues that I'm trying to figure out.
>>>>>
>>>>> The sample data is as follows:
>>>>>
>>>>> @prefix ex:<http://example.org/meals#>   .
>>>>>
>>>>> [] ex:mealPrice 25 ; ex:mealTip 7 .
>>>>> [] ex:mealPrice 50 ; ex:mealTip 10 .
>>>>> [] ex:mealPrice 100 ; ex:mealTip 25 .
>>>>>
>>>>> The question they were trying to answer is what would my meals cost
>>>>> on average if one always tipped at their best percentage.  The
>>>>> original query they came up with was as follows:
>>>>>
>>>>> PREFIX ex:<http://example.org/meals#>
>>>>> SELECT (AVG(?mealPrice * (1.0 + MAX( ?mealTip / ?mealPrice))) AS
>>>>> ?avgCostWithBestTip)
>>>>> WHERE {
>>>>>    ?description ex:mealPrice ?mealPrice .
>>>>>    ?description ex:mealTip ?mealTip .
>>>>> } GROUP BY ?description
>>>>>
>>>>> Now this looks reasonable enough but is in fact incorrect because
>>>>> they added a spurious GROUP BY, so it is actually calculating the
>>>>> total price of each individual meal if the query worked.  (It works
>>>>> in dotNetRDF but gives an incorrect answer due to a previously
>>>>> undiscovered scoping bug with nested aggregates.)
>>>>>
>>>>> With ARQ at least this query doesn't work, though the SPARQL algebra
>>>>> generated looks semi-reasonable.  The problem is that while it moves
>>>>> the inner MAX() aggregate out to be evaluated before the outer AVG()
>>>>> it fails to then substitute the ?.0 into the AVG, leaving the original
>>>>> MAX in place, and this seems to lead to an evaluation failure in the
>>>>> AVG and so we get unbound values for each result.  (dotNetRDF gives
>>>>> bound values, just the values are incorrect due to a scoping issue.)
>>>>>
>>>>> (base<http://example/base/>
>>>>>    (prefix ((ex:<http://example.org/meals#>))
>>>>>      (project (?avgCostWithBestTip)
>>>>>        (extend ((?avgCostWithBestTip ?.1))
>>>>>          (group (?description)
>>>>>            ((?.0 (max (/ ?mealTip ?mealPrice)))
>>>>>             (?.1 (avg (* ?mealPrice (+ 1.0 (max (/ ?mealTip ?mealPrice)))))))
>>>>>            (quadpattern
>>>>>              (quad<urn:x-arq:DefaultGraphNode>   ?description rdf:mealPrice ?mealPrice)
>>>>>              (quad<urn:x-arq:DefaultGraphNode>   ?description rdf:mealTip ?mealTip)
>>>>>            ))))))
>>>>>
>>>>> Regardless of the correctness of the query with respect to the
>>>>> original question (which is easily fixable by just stripping off the
>>>>> GROUP BY clause) it still appears that ARQ is not generating entirely
>>>>> correct algebra here.  It looks like it is trying to do the right
>>>>> thing but only partially succeeds.
>>>>>
>>>>> So two  questions:
>>>>>
>>>>>    1.  Are nested aggregates permitted?  The grammar says yes so I'm
>>>>> assuming yes
>>>>>    2.  Is there a bug in ARQ's implementation of this?
>>>>>
>>>>> I'll poke around in the source code myself and maybe if it is a bug
>>>>> it's an easy fix but I imagine Andy can answer this much faster than
>>>>> I can.  From what I've found so far it looks like ARQ does aim to
>>>>> intern and reuse aggregates but it doesn't seem to be working properly
>>>>> in this case so maybe some subtle bug that I can't see due to lack of
>>>>> knowledge of the code :-S
>>>>>
>>>>> Rob
>>>>
>>>
>>
>
