[ https://issues.apache.org/jira/browse/UNOMI-204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16647950#comment-16647950 ]
Thomas Draier commented on UNOMI-204:
-------------------------------------

There are multiple optimizations, depending on the case.

Segment creation:

1/ On segment/scoring save, rules were already created for the existing past event conditions. These rules set a property on each profile that matched the event condition, holding the number of occurrences. The initial property values are set for every profile in updateExistingProfilesForPastEventCondition. A simple terms aggregate was used, but terms aggregates are limited to 5000 buckets, so only 5000 profiles were updated with the property. I split the terms aggregate into partitions so that all profiles are correctly updated while avoiding memory issues.

2/ In a second phase, updateExistingProfilesForSegment / updateExistingProfilesForScoring is called to add/remove segments on all profiles. This is done by running a query on profiles, with query scrolling, using the PastEventConditionESQueryBuilder.

PastEventConditionESQueryBuilder / query:

This one used to run a first terms aggregate to list all profile ids (still sized to 5000), generate an ids query from them, and return that ids query. The ids query held at most 5000 ids, so it could never return more than 5000 users, which limited the segments.

- I changed the PastEventConditionESQueryBuilder to check whether a generated property exists, and to use it (which is the case after segment/scoring creation). If a number of occurrences is specified in the condition, it uses a range query, otherwise a simple exists query. This is much lighter than the ids query, does not need an intermediate aggregate, and finds all profiles (not only 5000).
- If the property is not set, I still generate an ids query, but again split the terms aggregate into partitions to avoid the 5000 limit. This may not even be very useful, as I added a limit on the size of the ids query that can be generated, to avoid creating huge queries and memory issues.
I would rather throw an exception here than generate a big (tbd?) query. Note that this case is actually rarely used: in our cases, only for getting an aggregated count when the count method cannot be used.

queryCount:

This is where count methods are introduced: queryCount calls them if they are available on the condition builder. Only PastEventConditionESQueryBuilder implements one for now, so a count on a boolean condition (with inner past events) will still use the query builder (and probably the second case, where the property is not set).

PastEventConditionESQueryBuilder / count:

For a count on past event conditions (so, as root condition), there are 2 cases:

- no min/max event occurrences: this used to run the query and count the results, which meant a first terms aggregate to list all profile ids (5000), then generating an ids query, then executing the ids query to get the count of profiles (!). Now it just does a cardinality aggregate and returns the total count.
- with min/max event occurrences: do a terms aggregate with partitions, and count the profiles that match the number of occurrences. This also takes into account that buckets are sorted by doc count. Return the count.

> Optimize pastEvents conditions execution and count
> --------------------------------------------------
>
> Key: UNOMI-204
> URL: https://issues.apache.org/jira/browse/UNOMI-204
> Project: Apache Unomi
> Issue Type: Improvement
> Reporter: Thomas Draier
> Priority: Major
>
> Past event condition query execution is based on an aggregate on events to
> get all profile ids, then generates an id query on profiles with each id. This
> leads to several issues:
> - The terms aggregate is limited to 5000 buckets by default (configurable
> thanks to UNOMI-119), so the condition will never return more than 5000
> users (which is an issue for updateExistingProfilesForSegment).
> The limit is necessary to avoid running out of memory, but we still need the
> list of profiles - using aggregate filter/partition should help to get all items.
> - The id query can be huge (millions of ids?), even if, in the end, we have
> a limit on the size of results we want. This is unfortunately difficult to
> optimize, as 1/ we don't know if a limit will be used or not and 2/ the
> condition can be part of an "and" boolean condition, which would require an
> unknown minimal number of ids.
> - The "count" method is not optimal, as it executes the full query and gets
> the number of results, where it could in some cases be optimized. For
> pastEventCondition, we generate an IdQuery with a list of ids just to get the
> count of profiles - counting the ids should be enough, and in some cases we
> could even use a cardinality aggregate to get the count directly. In all cases,
> keeping the list of all ids in memory should not be needed for counting.
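
For readers unfamiliar with the Elasticsearch features mentioned above, here is a minimal sketch of the request bodies involved: a partitioned terms aggregate, a cardinality aggregate, and the exists/range query on a generated property. It is not the Unomi implementation. The field name `profileId`, the aggregation names, and the helper function names are all assumptions made for illustration; only the JSON shapes (`include` with `partition`/`num_partitions`, `cardinality`, `exists`, `range`) come from the Elasticsearch query DSL.

```python
from typing import Any, Dict, List, Optional

# Assumed field name, for illustration only; the real Unomi mapping may differ.
PROFILE_ID_FIELD = "profileId"


def partitioned_terms_body(event_query: Dict[str, Any], partition: int,
                           num_partitions: int, bucket_size: int = 5000) -> Dict[str, Any]:
    """Body for ONE partition of a terms aggregate over profile ids.

    Running it for partition = 0 .. num_partitions - 1 covers every profile id
    while each response stays bounded by bucket_size buckets.
    """
    return {
        "size": 0,  # only the aggregation is needed, not the event hits
        "query": event_query,
        "aggs": {
            "profiles": {
                "terms": {
                    "field": PROFILE_ID_FIELD,
                    "size": bucket_size,
                    "include": {"partition": partition,
                                "num_partitions": num_partitions},
                }
            }
        },
    }


def cardinality_body(event_query: Dict[str, Any]) -> Dict[str, Any]:
    """Body that counts distinct profile ids without listing them."""
    return {
        "size": 0,
        "query": event_query,
        "aggs": {"profile_count": {"cardinality": {"field": PROFILE_ID_FIELD}}},
    }


def property_query(property_name: str, min_count: Optional[int] = None,
                   max_count: Optional[int] = None) -> Dict[str, Any]:
    """Exists query when no occurrence bounds are given, range query otherwise."""
    if min_count is None and max_count is None:
        return {"exists": {"field": property_name}}
    bounds: Dict[str, int] = {}
    if min_count is not None:
        bounds["gte"] = min_count
    if max_count is not None:
        bounds["lte"] = max_count
    return {"range": {property_name: bounds}}


def count_matching(buckets: List[Dict[str, Any]], min_count: Optional[int] = None,
                   max_count: Optional[int] = None) -> int:
    """Count the buckets of one partition whose doc_count is within the bounds."""
    return sum(
        1 for b in buckets
        if (min_count is None or b["doc_count"] >= min_count)
        and (max_count is None or b["doc_count"] <= max_count)
    )
```

A caller would sum `count_matching` over the buckets returned for each partition. This sketch scans every bucket; because buckets come back sorted by doc count, a real implementation could stop early once counts fall below the minimum. Note also that the cardinality aggregate returns an approximate count on large data sets.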