David,

This is excellent feedback! Thank you for this very informative write-up!
Cheers,
Lee

On Wed, Nov 4, 2020 at 6:14 AM David Cromberge <[email protected]> wrote:

Hi Folks,

I thought those on the dev mailing list might appreciate some details of how we successfully applied the Frequent Items sketch to a complex problem in our application. Although we are still in the early phases of the release, we are very pleased with our initial results.

*Background*
For the purposes of establishing context, our customers deploy our SDK, which typically runs on customers' mobile devices. The SDK collects usage data in a privacy-compliant manner without tracking any user identity. This data is then forwarded to our backend, where it is processed, stored and made available as reports within our customer-facing dashboards. One such report is a list of the most popular items of customer content (e.g. page title), along with the number of discrete user views pertaining to that content. The user is able to apply ad-hoc criteria (e.g. region), in conjunction with a date range that may extend back up to two years.

Our initial solution was to use a cloud database to query the data directly in one composite SQL query. Whilst this provided a quick solution that was familiar to most team members, there were several drawbacks:

- It was necessary to retain the data in its raw form for lengthy periods of time. Pre-aggregation was not an option: sorting through all matching items and counting the frequency of each group is not easily mergeable across combinations of date dimension and filter criteria. The fundamental difficulty with merging is in accurately preserving a distinct count of the unique users alongside each frequent item. This made our solution cost-prohibitive.
- The data for the report was sourced from different relations. This resulted in many complex dynamic predicates that became difficult to maintain.
- Performance: querying raw data using exact methods required us to scan the complete dataset, which made queries expensive to compute and execute.

To address these problems, we decided to investigate an alternative: the Frequent Items sketch from the Java library.

*Investigatory Analysis*
We typically integrate a new sketch by performing an investigatory analysis, which involves building and reading from the sketch under a production-like workload. An entire day's worth of production data from one of our biggest customers was written into the Frequent Items sketch. The serialised size of the sketch was roughly 3.5 MB, which is considerably less than our raw column-compressed equivalent in the database.

Another aspect of the investigation was to assess the suitability of the sketch for streaming updates. We consume our input stream across multiple independent consumers, so we need sketches to be both mergeable and order-independent. Whilst this is difficult to test in an investigation, we typically compare the accuracy of the merged output to satisfy these criteria. To do this, we took the same dataset, shuffled it, and split it randomly over several sketches which were ultimately merged. The output from the merged Frequent Items sketch was compared to the exact method, and we could not measure a significant loss in accuracy for the merged result. A crucial parameter we had to adjust during our tests was the sketch size, which allowed us to trade off the size of the sketch against the accuracy of the final result.

*Streaming Implementation*
Once we were satisfied that the sketch met our initial criteria, we divided up the implementation in order to facilitate a phased rollout. This gave us more control over the release and allowed us to accurately monitor its operational aspects.
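For readers curious about the shuffle-split-merge test described under *Investigatory Analysis*, it can be sketched in a few lines of Python, with a toy Misra-Gries counter standing in for the library's Frequent Items sketch. Everything here (class, names, data distribution) is illustrative only, not our production Java code:

```python
import random
from collections import Counter

class MisraGries:
    """Toy mergeable frequent-items counter (Misra-Gries style);
    'capacity' plays the role of the sketch size parameter."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counters = {}

    def update(self, item):
        if item in self.counters:
            self.counters[item] += 1
        elif len(self.counters) < self.capacity:
            self.counters[item] = 1
        else:
            # Full: decrement every counter, dropping those that hit zero.
            for k in list(self.counters):
                self.counters[k] -= 1
                if self.counters[k] == 0:
                    del self.counters[k]

    def merge(self, other):
        # Sum counters, then shrink back to capacity by subtracting the
        # (capacity + 1)-th largest count from every entry.
        for k, v in other.counters.items():
            self.counters[k] = self.counters.get(k, 0) + v
        if len(self.counters) > self.capacity:
            cut = sorted(self.counters.values(), reverse=True)[self.capacity]
            self.counters = {k: v - cut
                             for k, v in self.counters.items() if v > cut}
        return self

# Skewed synthetic stream: "page1" is most frequent, with a long tail.
data = [f"page{i}" for i in range(1, 201) for _ in range(1000 // i)]
exact = Counter(data)

# Shuffle, split over several independent sketches, then merge them.
random.seed(42)
random.shuffle(data)
parts = [data[i::4] for i in range(4)]
sketches = [MisraGries(capacity=32) for _ in parts]
for sketch, part in zip(sketches, parts):
    for item in part:
        sketch.update(item)

merged = sketches[0]
for other in sketches[1:]:
    merged.merge(other)

# Compare the merged estimate with the exact method on the top item.
top_estimated = max(merged.counters, key=merged.counters.get)
top_exact = exact.most_common(1)[0][0]
```

Counters of this family never overestimate, and the combined undercount stays bounded by the stream length divided by the capacity, which is why the merged result tracks the exact top items closely regardless of input order.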
Furthermore, we found it useful to first integrate the Frequent Items sketch into our streaming infrastructure, and to implement the changes to our queries once we were successfully storing the data.

Since we are dealing with a large quantity of data, we first window the Frequent Items sketch for a short period in memory before serialising it to disk. To do this safely, we rely on language-level concurrency primitives to ensure that we never access the same sketch from multiple threads concurrently. Whilst this is efficient at write time, we need to further combine these smaller sketches into larger composites to ensure that our queries are not adversely affected. This compaction process occurs at regular intervals and gradually promotes sketches into lower time resolutions, e.g. days and weeks. Duplicates presented a problem: if we compacted the same sketch into the larger composite twice, it would skew the results. The solution was straightforward: we do not compact our sketches incrementally, but instead combine pristine copies at each resolution for every scheduled compaction update. As a side note, this design also accommodates data that arrives late in event time.

Lastly, there was some trial and error in dynamically adjusting the size of the sketch with the time resolution, to ensure that we did not eliminate items from the sketch over larger intervals.

*Query Implementation*
Our proposed solution had to support the identical set of report filter criteria. This presented us with two alternatives:

- construct a Frequent Items sketch for every filter permutation
- integrate the Frequent Items sketch with our existing body of Theta sketches

The former approach raised concerns for high-cardinality dimensions, and we dismissed it as a viable solution.
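As an aside, the "pristine copies" compaction scheme described under *Streaming Implementation* is easy to illustrate: each run rebuilds the composite from scratch rather than merging new sketches into the previous composite. A minimal Python illustration, with plain Counters standing in for serialised sketches (all names and values here are made up):

```python
from collections import Counter

# Pristine hourly sketches, keyed by hour (plain Counters stand in for
# serialised Frequent Items sketches in this illustration).
hourly = {
    "2020-11-03T00": Counter({"pageA": 40, "pageB": 10}),
    "2020-11-03T01": Counter({"pageA": 35, "pageC": 20}),
}

def compact_day(hourly_sketches):
    """Rebuild the daily composite from scratch on every run.

    Because the composite is never updated incrementally, re-running a
    compaction (or running it again after a late hourly sketch arrives)
    can never merge the same hour in twice and skew the counts."""
    day = Counter()
    for sketch in hourly_sketches.values():
        day += sketch
    return day

daily = compact_day(hourly)            # pageA: 75, pageB: 10, pageC: 20

# A late-arriving hour simply triggers another full rebuild.
hourly["2020-11-03T02"] = Counter({"pageB": 5})
daily = compact_day(hourly)            # pageB is now 15; nothing double-counted
```

The trade-off is that every run re-merges all the pristine copies for that interval, which is why the compaction runs on a schedule rather than per update.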
Since we already have a large body of Theta sketches for our data, we decided to build upon and leverage these sketches to extract the distinct cardinality estimates for each frequent item. This amounted to storing a Theta sketch for each frequent item key, which allows us to perform additional set operations against our existing Theta sketches to apply the correct filtering. However, this did incur a penalty, as we need N secondary queries for the N most frequent items. In other words, we first retrieve the top N most frequent item keys, and for each of these keys apply a series of Theta set operations to extract the final estimate. The resulting report is then assembled after all the subqueries complete. Fortunately, we are able to batch these to lessen round-trip times.

*Conclusion*
To conclude, a nested query is not as performant as a single composite query, but this is an acceptable trade-off, since the solution is more maintainable and orders of magnitude more cost-efficient!

Unfortunately I'm not able to show graphs or quantify the improvements. But we are pleased with the results: the frequent items and cardinality estimates are difficult to distinguish from those of our existing report that uses exact methods! Moreover, from a performance perspective, we have the ability to control the size of the sketch through our configuration options.

This would not have been possible without the excellent documentation and the community, who have always been helpful with the questions we had along the way. I would like to thank the team for the library and for giving me the opportunity to share this feedback.

Thank you,
David
