[ 
https://issues.apache.org/jira/browse/UNOMI-204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16647950#comment-16647950
 ] 

Thomas Draier commented on UNOMI-204:
-------------------------------------

There are several optimizations, depending on the case. 

Segment creation : 

1/ On segment/scoring save, rules were already created for the existing past 
event conditions. These rules set a property on each profile that matched 
the event condition, holding the number of occurrences. The initial property 
values are set for every profile in 
updateExistingProfilesForPastEventCondition . A simple terms aggregation was 
used, but terms aggregations are limited to 5000 buckets, so only 5000 
profiles were updated with the property. I split the terms aggregation into 
partitions so that all profiles are correctly updated, while avoiding memory 
issues. 
In a second phase, updateExistingProfilesForSegment/ 
updateExistingProfilesForScoring is called to add segments to, or remove them 
from, all profiles. This is done with a scrolling query on profiles, built by 
the PastEventConditionESQueryBuilder. 
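
As an illustration of the partitioned terms aggregation described above, here 
is a minimal sketch in Python that only builds the Elasticsearch request 
bodies. The field name "profileId", the aggregation name "profiles" and both 
helper functions are assumptions made for the example, not the actual Unomi 
code. Terms are assigned to partitions by hash, so a partition can hold more 
terms than the average; a real implementation would probably add some 
headroom when choosing the number of partitions.

```python
def partitions_needed(distinct_profiles, bucket_size):
    """Number of partitions so each one is expected to fit in bucket_size."""
    # Ceiling division, with at least one partition.
    return max(1, -(-distinct_profiles // bucket_size))


def partitioned_terms_aggs(field, num_partitions, bucket_size):
    """Yield one search body per partition of a terms aggregation on field."""
    for partition in range(num_partitions):
        yield {
            "size": 0,  # only the aggregation buckets are needed, not hits
            "aggs": {
                "profiles": {  # hypothetical aggregation name
                    "terms": {
                        "field": field,
                        "size": bucket_size,
                        # Only terms hashed into this partition are returned.
                        "include": {
                            "partition": partition,
                            "num_partitions": num_partitions,
                        },
                    }
                }
            },
        }
```

Each body is sent as a separate search request on the events, and the 
buckets of each response are processed before moving on to the next 
partition, so that only one partition is held in memory at a time.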

PastEventConditionESQueryBuilder / query : 

This builder used to run a first terms aggregation to list all profile ids 
(again capped at 5000), then generate an ids query from them and return it. 
The ids query held at most 5000 ids, so this query could never return more 
than 5000 users (which in effect capped the size of segments). 
- I changed the PastEventConditionESQueryBuilder to check whether a generated 
property exists, and to use it if so (which is the case after segment/scoring 
creation). If a number of occurrences is specified in the condition, it uses 
a range query, otherwise a simple exists query. This is much lighter than the 
ids query, does not need an intermediate aggregation, and finds all profiles 
(not only 5000). 
- If the property is not set, I still generate an ids query, but again split 
the terms aggregation into partitions to avoid the 5000 limit. This may not 
even be very useful, as I added a limit on the size of the ids query that can 
be generated, to avoid creating a huge query and memory issues. I would 
rather throw an exception here than generate a big (tbd?) query. Note that 
this case is actually rarely used: in our cases, only for getting an 
aggregated count when the count method cannot be used. 
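
The two cases above can be sketched as follows, again in Python and only as 
query bodies. The function names, the exception class and the max_ids 
parameter are hypothetical, and treating the min/max bounds as inclusive is 
an assumption.

```python
class IdsQueryTooLargeError(Exception):
    """Raised when the fallback ids query would exceed the allowed size."""


def property_query(property_key, min_count=None, max_count=None):
    """Query on the generated profile property (first case)."""
    if min_count is None and max_count is None:
        # No occurrence bounds: the property being set is enough.
        return {"exists": {"field": property_key}}
    bounds = {}
    if min_count is not None:
        bounds["gte"] = min_count  # inclusive bound, by assumption
    if max_count is not None:
        bounds["lte"] = max_count  # inclusive bound, by assumption
    return {"range": {property_key: bounds}}


def fallback_ids_query(profile_ids, max_ids):
    """Ids query used when the property is not set (second case)."""
    ids = list(profile_ids)
    if len(ids) > max_ids:
        # Fail loudly rather than build a huge query.
        raise IdsQueryTooLargeError(
            "ids query would contain %d ids, limit is %d" % (len(ids), max_ids)
        )
    return {"ids": {"values": ids}}
```

For example, property_query("someKey", min_count=2) yields a range query 
with only a lower bound, while property_query("someKey") yields an exists 
query; "someKey" stands in for whatever key the generated property uses.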

queryCount : 

This is where count methods are introduced: queryCount calls them when they 
are available on the condition builder. Only 
PastEventConditionESQueryBuilder implements one for now, so a count on a 
boolean condition (with inner past event conditions) will still use the query 
builder (and probably the second case, where the property is not set). 
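
The dispatch described above could look roughly like this. It is a sketch 
of the idea only: the method names count and build_query and the 
run_count_query callback are hypothetical and do not reflect the actual 
Unomi interfaces.

```python
def query_count(builder, condition, run_count_query):
    """Use the builder's own count method if it has one, else count the query."""
    count = getattr(builder, "count", None)
    if callable(count):
        # Optimized path: the builder knows how to count directly.
        return count(condition)
    # Generic path: build the full query and let the caller count its results.
    return run_count_query(builder.build_query(condition))
```

A builder for a boolean condition would have no count method in this 
sketch, so it would take the generic path even when its inner conditions 
are past event conditions.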

PastEventConditionESQueryBuilder / count : 

For a count on a past event condition (so, as root condition), there are 2 
cases : 
- no min/max event occurrences : this used to run the query and count the 
results, which meant a first terms aggregation to list all profile ids 
(5000), then generating an ids query, then executing the ids query to get the 
count of profiles (!). Now it just does a cardinality aggregation and returns 
the total count. 
- with min/max event occurrences : do a terms aggregation with partitions, 
and count the profiles that match the number of occurrences. This also takes 
into account that buckets are sorted by doc count. Return the count. 
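
Both cases can be sketched in Python as below. The field and aggregation 
names are assumptions, and the early exit is only one possible way of using 
the bucket ordering (terms buckets are sorted by descending doc count by 
default); the actual implementation may use it differently. Also keep in 
mind that the Elasticsearch cardinality aggregation returns an approximate 
count.

```python
def cardinality_body(field):
    """Search body counting distinct values of field (no min/max case)."""
    return {
        "size": 0,
        "aggs": {"profiles": {"cardinality": {"field": field}}},
    }


def count_matching(partitions, min_count=None, max_count=None):
    """Count buckets whose doc_count lies within [min_count, max_count].

    partitions is an iterable of bucket lists, one list per partition,
    each sorted by descending doc_count.
    """
    total = 0
    for buckets in partitions:
        for bucket in buckets:
            occurrences = bucket["doc_count"]
            if min_count is not None and occurrences < min_count:
                # Remaining buckets in this partition are smaller still.
                break
            if max_count is None or occurrences <= max_count:
                total += 1
    return total
```

Only a running total is kept, so the list of profile ids never has to be 
held in memory for counting.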

> Optimize pastEvents conditions execution and count
> --------------------------------------------------
>
>                 Key: UNOMI-204
>                 URL: https://issues.apache.org/jira/browse/UNOMI-204
>             Project: Apache Unomi
>          Issue Type: Improvement
>            Reporter: Thomas Draier
>            Priority: Major
>
> Past event condition query execution is based on an aggregate on events to 
> get all profile ids, then generate an id query on profiles with each id. This 
> leads to different issues :
> - the terms aggregate is limited to 5000 buckets by default ( configurable 
> thanks to UNOMI-119 ), so the condition will anyway not return more than 5000 
> users (which is an issue for updateExistingProfilesForSegment ). The limit is 
> necessary to avoid out of memory, but we still need the list of profiles - 
> using aggregate filter/partition should help getting all items.
> - The id query can be huge (millions of ids ?) - even if, in the end, we have 
> a limit on the size of results we want. This is unfortunately difficult to 
> optimize, as 1/ we don't know if a limit will be used or not and 2/ the 
> condition can be part of an "and" boolean condition, which would require an 
> unknown minimal number of ids
> - the "count" method is not optimal as it executes the full query and gets 
> the number of results, where it can in some cases be optimized. For 
> pastEventCondition, we generate an IdQuery with a list of ids to just get the 
> count of profiles - counting the ids should be enough, and in some cases we 
> could even use cardinality aggregate to directly get the count. In all cases, 
> keeping the list of all ids in memory should not be needed for counting.



