[arangodb-google] Re: AQL: Logical or technical Update Issue

Simran Brucherseifer Tue, 24 Oct 2017 10:02:14 -0700

COLLECT clears all variables in the current scope. Click on Explain in the
web interface and inspect the execution plan:


<https://lh3.googleusercontent.com/-0P4-5oHCxII/We9inq0LnlI/AAAAAAAAAC4/hwHz7nP744kwImjRWLugN98A9z6U4jhyQCL4CGAYYCw/s1600/movie_ratings_query_plan.png>


Follow the CollectNode with Id 7 all the way up to the next SingletonNode - 
it's Id 3 and the ROOT of the current scope.
The variables emitted by the graph traversal (m, e) and the variable for
the genre iteration (g) are in that scope, which means you don't have access
to them after COLLECT.
The parent scope (which is the top-level scope) contains the iteration over
all user documents. Variable u can still be accessed after COLLECT,
because it is defined outside the scope that COLLECT clears.

If you want to know into which "buckets" values are grouped, use the COLLECT 
... INTO syntax.
https://docs.arangodb.com/3.2/AQL/Operations/Collect.html

The COLLECT ... WITH COUNT INTO ... syntax is a shorthand if you want to
group and count the number of occurrences (how many items per bucket, if you
will).
However, this syntax cannot be combined with an additional INTO clause for
the group contents. We still need the counts, so we need to rework the query
a bit.

We could use the standard INTO syntax, but it would keep way too much data 
which we don't need further down the query. All we actually need is the 
rating stored as edge attribute.
Thus, we can create a projection like so: COLLECT ... INTO r = e.rating
For every bucket (genre), we will have access to an array with the rating 
values via variable r.
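
To make the projection concrete, here is a small Python model of what
COLLECT genre = g INTO r = e.rating produces. It is not AQL and says nothing
about how ArangoDB implements grouping; the helper name collect_into and the
sample rows are made up for illustration.

```python
from collections import defaultdict

def collect_into(rows, key, projection):
    """Group rows by key(row) and keep only projection(row) per bucket."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[key(row)].append(projection(row))
    return dict(buckets)

# Hypothetical rows as they exist right before COLLECT:
# one row per (movie, genre) combination, with the edge e attached.
rows = [
    {"g": "Drama", "e": {"rating": 4, "ts": 1}},
    {"g": "Comedy", "e": {"rating": 3, "ts": 2}},
    {"g": "Drama", "e": {"rating": 5, "ts": 3}},
]

# Only the rating survives; everything else in the row is dropped.
groups = collect_into(
    rows,
    key=lambda row: row["g"],
    projection=lambda row: row["e"]["rating"],
)

for genre, r in groups.items():
    print(genre, r)
```

Each bucket carries nothing but the projected ratings, which is why this is
cheaper than a plain INTO that would keep all variables of every row.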

We had to remove the counting, and need to add it back in a different way 
now. There are two options. Post-calculation:
COLLECT genre = g INTO r = e.rating
// array length of ratings equals number of items in bucket
// (what if there's no rating attribute though?)
RETURN LENGTH(r)
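
Regarding the question in the comment: as far as I know, accessing a missing
attribute in AQL evaluates to null, so the projected array would contain a
null for that edge and its length would still include it. A Python sketch of
that assumption, with None standing in for null and an invented bucket of
three edges:

```python
# Hypothetical bucket: three edges for one genre, one of them without a rating.
edges = [{"rating": 4}, {}, {"rating": 5}]

# Assumption: e.rating on a missing attribute yields null (None here).
r = [e.get("rating") for e in edges]

count = len(r)                           # nulls are array members and get counted
rated = [x for x in r if x is not None]  # edges that actually carry a rating

print(r, count, len(rated))
```

If unrated edges should not be counted, they would have to be filtered out
before grouping.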

Aggregation (can be more efficient, although it shouldn't make any 
difference in your case):
COLLECT genre = g AGGREGATE count = LENGTH(1) INTO r = e.rating

For every item in a bucket, a counter is increased by one (the LENGTH 
function always returns 1 in conjunction with AGGREGATE, no matter what you 
pass to it).
AGGREGATE could also be used to find out the minimum and maximum values as 
well as a few other statistical metrics, but it's not needed in this 
context.
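
The difference to post-calculation can be pictured with a small Python model:
aggregation keeps running values per bucket while the rows are grouped,
instead of building an array first and measuring it afterwards. This is only
an illustration of the idea, not ArangoDB's implementation; the function name
and the sample data are invented.

```python
def aggregate(rows):
    """Single pass over (genre, rating) rows with running count, min and max."""
    stats = {}
    for genre, rating in rows:
        s = stats.setdefault(genre, {"count": 0, "min": rating, "max": rating})
        s["count"] += 1                    # one increment per row in the bucket
        s["min"] = min(s["min"], rating)
        s["max"] = max(s["max"], rating)
    return stats

# Hypothetical input rows: (genre, rating) pairs
rows = [("Drama", 4), ("Comedy", 3), ("Drama", 5)]
stats = aggregate(rows)

for genre, s in stats.items():
    print(genre, s)
```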


The full query:

FOR u IN users
    LET genreStats = MERGE(
        // get all movies a user is linked to
        FOR m, e IN OUTBOUND u GRAPH 'ratedGraph'
            // ignore duplicate movies
            OPTIONS {uniqueVertices: 'global', bfs: true}
            FOR g IN m.genre
                // group by genre
                COLLECT genre = g AGGREGATE count = LENGTH(1) INTO r = e.rating
                // return one object per genre (merged into a single object
                // by the MERGE function in the 2nd line)
                RETURN {[genre]: count * AVERAGE(r)}
    )
    // don't update user documents which are not linked to any movie
    FILTER LENGTH(genreStats)
    UPDATE u WITH {genreStats} IN users
    RETURN NEW


count is multiplied by the AVERAGE (mean) of the ratings per genre. There 
are also functions like MEDIAN which could be used instead:
https://docs.arangodb.com/3.2/AQL/Functions/Numeric.html
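
A worked example of the arithmetic, written in Python rather than AQL so it
can be run anywhere. The ratings are invented; mean and median from Python's
statistics module stand in for AVERAGE and MEDIAN.

```python
from statistics import mean, median

# Hypothetical ratings per genre for a single user
ratings_by_genre = {"Drama": [5, 5, 2], "Comedy": [3]}

# count * AVERAGE(r), as in the query
with_mean = {g: len(r) * mean(r) for g, r in ratings_by_genre.items()}

# count * MEDIAN(r), the alternative mentioned above
with_median = {g: len(r) * median(r) for g, r in ratings_by_genre.items()}

print(with_mean)
print(with_median)
```

With these numbers Drama scores 3 * 4 = 12 using the mean and 3 * 5 = 15
using the median, because the single low rating pulls the mean down but not
the median. As long as every edge has a rating, count times mean is simply
the sum of the ratings.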
