Re: Nested Aggregates?

Andy Seaborne Mon, 04 Jun 2012 11:10:38 -0700

On 04/06/12 18:51, Rob Vesse wrote:

Comments inline:



On 6/3/12 6:05 AM, "Andy Seaborne"<[email protected]>  wrote:

So two  questions:

   1.  Are nested aggregates permitted?  The grammar says yes so I'm
assuming yes
   2.  Is there a bug in ARQ's implementation of this?



1 - yes legal syntax ... that can be changed :-)  I'm going to pass this
on to the WG.


I'm not sure if they necessarily need forbidding, I think they may have
some uses

Do you have a real example? While I don't want to forbid anything just because, there is a specific implementation impact (count(*) -- see below) so I'm looking for a real example, not just that some app might want to do it.

As you pointed out - the subquery may have been what was meant in the first place - a sort of groups-in-groups.

The complicating example is sum(?x)/count(?x) which is the moral equivalent of avg(?x) give or take count == 0
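To make the "give or take count == 0" caveat concrete, here is a minimal Python sketch. It assumes the SPARQL 1.1 definition in which AVG over an empty group is 0, and it models the division-by-zero expression error as None (unbound). The function names are hypothetical and are not part of ARQ.

```python
from typing import Optional, Sequence


def sparql_avg(values: Sequence[float]) -> float:
    """AVG as assumed here: 0 for an empty group, the mean otherwise."""
    if not values:
        return 0
    return sum(values) / len(values)


def sum_over_count(values: Sequence[float]) -> Optional[float]:
    """SUM(?x)/COUNT(?x): an empty group divides by zero.

    The resulting expression error is modelled as None (no binding).
    """
    if len(values) == 0:
        return None
    return sum(values) / len(values)
```

The two agree on every non-empty group and differ only on the empty one.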


Now replace ?x with something silly ... like avg(?x) !

As you cannot technically write anything but tree expressions we don't have the maniac cases of shared sub-expressions (SPARQL only needs to be defined over strings - not programmatically constructed shapes).

I ask about real examples because the cost at scale of this has the potential to be quite bad (having to keep rows around longer). Of course, we can spill to disk, but it's still disk.


(BTW I still think SPARQL should add standard deviation aggregates).

2 - What ARQ does is to calculate the aggregates of a group as the group
is seen, not at the end of the group block when all the elements are
known.

A line of argument is that the expression inside the aggregate is
applied to each row, so only row variables are in-scope.  The aggregate
AVG(max(?x)+1) is violating that.  So for streaming and for scoping, I'd
argue it's wrong - is there a use case that argues for it?


Yes, the reason this works easily in dotNetRDF (after I fixed the scoping
bug) was that dotNetRDF always calculates the full result at every stage
(with some special exceptions for some ASK and LIMITed queries) so when
applying the aggregates all the groups have been calculated and then
aggregates are applied afterwards.


You do that for count(*) and count(?x)?

TDB counts these with zero space overhead.

I think because of the potential scoping confusion and the fact that
allowing nesting may make implementation much harder for streaming
implementations like ARQ it is certainly worth the working group discussing
this.
 Do you want me to make a formal comment to the working group?


I have :

http://lists.w3.org/Archives/Public/public-rdf-dawg/2012AprJun/0194.html

but don't let that stop you doing so. I did it to make sure it didn't get lost.


The spec needs clarifying if it is to be bad syntax.  The simple
solution is no nested aggregate expressions.

Here is a related example:

SELECT (max(?x) As ?M) (avg(?M+1) AS ?A)

because the select expression rules say you can use ?M inside AVG().


I tried that but it didn't work in ARQ either so not sure if that is a bug

It won't for much the same reason -- I presented it as a related but different example.


Not sure what SQL says about this - the SPARQL processing model is the
same framework as SQL.

The failure in ARQ is because AVG is calculating the sum on each row of
a group as each row comes in (and the row can be thrown away afterward -
just the key and aggregators need be kept) so it's before MAX() is ready
overall and even possibly before it has been called even with the
current row (undefined ordering).
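The streaming behaviour described above can be sketched in Python as follows. This is an illustration only, not ARQ's actual code: the class and function names are hypothetical. Each group key keeps one small accumulator, and every row is dropped as soon as it has been folded in, so a group-wide value such as MAX() does not exist yet while AVG is consuming rows.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


class AvgAccumulator:
    """Running state for AVG: only a sum and a count are retained."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def accumulate(self, value: float) -> None:
        self.total += value
        self.count += 1

    def result(self) -> float:
        return self.total / self.count if self.count else 0.0


def streaming_avg(rows: Iterable[Tuple[Hashable, float]]) -> Dict[Hashable, float]:
    """Fold (group key, value) rows into per-key accumulators in one pass."""
    accumulators: Dict[Hashable, AvgAccumulator] = defaultdict(AvgAccumulator)
    for key, value in rows:
        # After this call the row itself is no longer needed.
        accumulators[key].accumulate(value)
    return {key: acc.result() for key, acc in accumulators.items()}
```

For example, `streaming_avg(iter([("a", 1.0), ("a", 3.0), ("b", 10.0)]))` works on a one-shot iterator because no row is ever revisited. A nested aggregate would need either a second pass or the rows kept around.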


As I noted above even separating out the aggregates and using the variable
for the max inside the average seemed not to work so may be another ARQ
issue?

It's the same root cause: ?M is out-of-scope of a row, so avg(?M+1) is avg(undef+1) -> error -> no binding for ?A.
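The chain "out of scope -> error -> unbound" can be sketched in Python as below. It assumes that an error while evaluating the aggregate's argument makes the whole aggregate an error, which in turn leaves the result variable unbound (modelled as None). The helper names are hypothetical and this is not ARQ code.

```python
from typing import Callable, Dict, Iterable, Optional

Row = Dict[str, float]


class ExprError(Exception):
    """Raised when an expression touches a variable the row does not bind."""


def lookup(row: Row, var: str) -> float:
    if var not in row:
        raise ExprError(f"?{var} is not bound in this row")
    return row[var]


def avg_or_unbound(rows: Iterable[Row], expr: Callable[[Row], float]) -> Optional[float]:
    """AVG over expr(row); any evaluation error yields None (no binding)."""
    total, count = 0.0, 0
    try:
        for row in rows:
            total += expr(row)
            count += 1
    except ExprError:
        return None
    return total / count if count else 0.0


rows = [{"x": 1.0}, {"x": 5.0}]
in_scope = avg_or_unbound(rows, lambda r: lookup(r, "x") + 1)
out_of_scope = avg_or_unbound(rows, lambda r: lookup(r, "M") + 1)
```

Here `in_scope` is 4.0, while `out_of_scope` is None because ?M is assigned by the SELECT expression and never appears in an individual row.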


        Andy

Rob


        Andy

On 01/06/12 18:15, Rob Vesse wrote:

Just a thought - for those who may be interested here is a version of the
query that does work as expected with the current ARQ snapshot

It removes the GROUP BY so the query actually answers the question of
interest and moves the MAX() into a subquery to force evaluation order and
scope:

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT (AVG(?mealPrice * (1.0 + ?MaxTipPercent)) AS ?avgCostWithBestTip)
WHERE
{
    ?description rdf:mealPrice ?mealPrice .
    {
      SELECT (MAX(?mealTip / ?mealPrice) AS ?MaxTipPercent)
      WHERE
      {
        ?description rdf:mealPrice ?mealPrice .
        ?description rdf:mealTip ?mealTip .
      }
    }
}

Rob




On 6/1/12 10:11 AM, "Rob Vesse"<[email protected]>   wrote:

Hey All

I have an interesting question about nested aggregates which was posed by
some colleagues that I'm trying to figure out.

The sample data is as follows:

@prefix ex:<http://example.org/meals#>   .

[] ex:mealPrice 25 ; ex:mealTip 7 .
[] ex:mealPrice 50 ; ex:mealTip 10 .
[] ex:mealPrice 100 ; ex:mealTip 25 .

The question they were trying to answer is what would my meals cost on
average if one always tipped at their best percentage.  The original
query they came up with was as follows:

PREFIX ex:<http://example.org/meals#>
SELECT (AVG(?mealPrice * (1.0 + MAX( ?mealTip / ?mealPrice))) AS
?avgCostWithBestTip)
WHERE {
   ?description ex:mealPrice ?mealPrice .
   ?description ex:mealTip ?mealTip .
} GROUP BY ?description

Now this looks reasonable enough but is in fact incorrect because they
added a spurious GROUP BY, so it would actually calculate the total price
of each individual meal if the query worked.  (It works in dotNetRDF but
gives an incorrect answer due to a previously undiscovered scoping bug
with nested aggregates)
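The effect of the spurious GROUP BY can be checked by hand on the three sample meals. The Python sketch below assumes the intended reading of the nested aggregate (take MAX over the group first, then AVG over the same group) and uses exact fractions to avoid rounding. The function name is hypothetical.

```python
from fractions import Fraction

# (mealPrice, mealTip) pairs from the sample data.
meals = [(25, 7), (50, 10), (100, 25)]


def cost_with_best_tip(group):
    """AVG(price * (1 + MAX(tip / price))) evaluated over one group."""
    best = max(Fraction(tip, price) for price, tip in group)
    costs = [price * (1 + best) for price, _ in group]
    return sum(costs) / len(costs)


# GROUP BY ?description: every group holds exactly one meal.
per_meal = [cost_with_best_tip([meal]) for meal in meals]

# No GROUP BY: a single group containing all three meals.
overall = cost_with_best_tip(meals)
```

With one meal per group the best tip is that meal's own tip, so `per_meal` is 32, 60 and 125, which is just price plus tip. Without grouping the best tip is 7/25 (28%), and `overall` is 224/3, roughly 74.67.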

With ARQ at least this query doesn't work, the SPARQL algebra generated
looks semi-reasonable. The problem is that while it moves the inner MAX()
aggregate out to be evaluated before the outer AVG() it fails to then
substitute the ?.0 into the AVG leaving the original MAX in place and
this seems to lead to an evaluation failure in the AVG and so we get
unbound values for each result.  (dotNetRDF gives bound values just the
values are incorrect due to a scoping issue)

(base<http://example/base/>
   (prefix ((ex:<http://example.org/meals#>))
     (project (?avgCostWithBestTip)
       (extend ((?avgCostWithBestTip ?.1))
         (group (?description) ((?.0 (max (/ ?mealTip ?mealPrice))) (?.1 (avg (* ?mealPrice (+ 1.0 (max (/ ?mealTip ?mealPrice)))))))
           (quadpattern
             (quad <urn:x-arq:DefaultGraphNode> ?description rdf:mealPrice ?mealPrice)
             (quad <urn:x-arq:DefaultGraphNode> ?description rdf:mealTip ?mealTip)
           ))))))

Regardless of the correctness of the query with respect to the original
question (which is easily fixable by just stripping off the GROUP BY clause)
it still appears that ARQ is not generating entirely correct algebra here.
It looks like it is trying to do the right thing but only partially
succeeds.

So two  questions:

   1.  Are nested aggregates permitted?  The grammar says yes so I'm
assuming yes
   2.  Is there a bug in ARQ's implementation of this?

I'll poke around in the source code myself and maybe if it is a bug it's
an easy fix but I imagine Andy can answer this much faster than I can.
From what I've found so far it looks like ARQ does aim to intern and
reuse aggregates but it doesn't seem to be working properly in this case
so maybe some subtle bug that I can't see due to lack of knowledge of the
code :-S

Rob
