Hi Folks, I thought those on the dev mailing list might appreciate some details of how we successfully applied the frequent items sketch to a complex problem in our application. Although we are still in the early phases of the release, we are very pleased with our initial results.
*Background*

For context, our customers deploy our SDK, which typically runs on their customers' mobile devices. The SDK collects usage data in a privacy-compliant manner without tracking any user identity. This data is forwarded to our backend, where it is processed, stored, and made available as reports within our customer-facing dashboards.

One such report is a list of the most popular items of customer content (e.g. page title), along with the number of distinct user views pertaining to that content. The user is able to apply ad-hoc criteria (e.g. region), in conjunction with a date range that may extend back up to two years.

Our initial solution was to use a cloud database to query the raw data directly in one composite SQL query. Whilst this provided a quick solution that was familiar to most team members, there were several drawbacks:

- It was necessary to retain the data in its raw form for lengthy periods of time. Pre-aggregation was not an option: sorting through all matching items and counting the frequency of each group is not easily mergeable across combinations of date dimension and filter criteria. The fundamental difficulty with merging is in accurately preserving a distinct count of the unique users alongside each frequent item. This made our solution cost-prohibitive.
- The data for the report was sourced from different relations. This resulted in many complex dynamic predicates that became difficult to maintain.
- Performance: querying raw data using exact methods required us to access the dataset in its entirety, which made queries expensive to compute and execute.

To address these problems, we decided to investigate an alternative: the Frequent Items sketch from the Java library.

*Investigatory Analysis*

We typically integrate a new sketch by performing an investigatory analysis, which involves building and reading from the sketch under a production-like workload.
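For readers who have not used it, the basic build-and-read cycle looks roughly like the following. This is only a minimal sketch, assuming the Apache DataSketches Java library (4.x package layout); the item names and counts are hypothetical stand-ins for a day of view events, not our actual workload:

```java
import org.apache.datasketches.common.ArrayOfStringsSerDe;
import org.apache.datasketches.frequencies.ErrorType;
import org.apache.datasketches.frequencies.ItemsSketch;

public class FrequentItemsInvestigation {

    // Build a sketch from a (hypothetical) skewed view-event workload.
    static ItemsSketch<String> buildSketch() {
        // maxMapSize must be a power of two; it bounds memory use and
        // determines the error guarantee of the estimates.
        ItemsSketch<String> sketch = new ItemsSketch<>(1 << 10);
        sketch.update("home", 50_000);
        sketch.update("pricing", 30_000);
        sketch.update("blog", 15_000);
        sketch.update("docs", 5_000);
        return sketch;
    }

    public static void main(String[] args) {
        ItemsSketch<String> sketch = buildSketch();

        // Read back the heavy hitters with their estimated counts,
        // sorted by descending estimate.
        for (ItemsSketch.Row<String> row
                : sketch.getFrequentItems(ErrorType.NO_FALSE_POSITIVES)) {
            System.out.println(row.getItem() + " ~" + row.getEstimate());
        }

        // The serialised form is what gets compared against raw storage.
        byte[] bytes = sketch.toByteArray(new ArrayOfStringsSerDe());
        System.out.println("serialised size: " + bytes.length + " bytes");
    }
}
```

With only four distinct items there are no evictions, so the estimates here are exact; under a real high-cardinality stream the rows carry upper and lower bounds instead.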
An entire day's worth of production data from one of our biggest customers was written into the Frequent Items sketch. The serialised size of the sketch was roughly 3.5 MB, which is considerably less than our raw, column-compressed equivalent in the database.

Another aspect of the investigation was to assess the suitability of the sketch for streaming updates. We consume our input stream across multiple independent consumers, so we need sketches to be both mergeable and order-independent. Whilst this is difficult to test exhaustively in an investigation, we typically compare the accuracy of the merged output to satisfy these criteria. To do this, we took the same dataset, shuffled it, and split it randomly over several sketches which were ultimately merged. The output from the merged Frequent Items sketch was compared to the exact method, and we could not measure a significant loss of accuracy in the merged result. A crucial parameter we had to adjust during our tests was the sketch size, which allowed us to trade off size against the accuracy of the final result.

*Streaming Implementation*

Once we were satisfied that the sketch met our initial criteria, we divided up the implementation in order to facilitate a phased rollout. This gave us more control over the release and allowed us to accurately monitor its operational aspects. Furthermore, we found it useful to first integrate the Frequent Items sketch into our streaming infrastructure, and to implement the changes to our queries once we were successfully storing the data.

Since we are dealing with a large quantity of data, we first window the Frequent Items sketch for a short period in memory before serialising it to disk. In order to do this safely, we rely on language-level concurrency primitives to ensure, for thread safety, that we never access the same sketch concurrently.
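The windowing approach can be sketched as follows. This is a simplified, hypothetical illustration (class name, window rotation, and lock strategy are ours to choose; the library calls assume DataSketches Java 4.x), not our production code:

```java
import org.apache.datasketches.common.ArrayOfStringsSerDe;
import org.apache.datasketches.frequencies.ItemsSketch;

/**
 * Hypothetical windowed writer: one in-memory sketch per window,
 * guarded by a lock so that concurrent consumers never update the
 * same sketch unsynchronised (ItemsSketch is not thread-safe).
 */
public class WindowedFrequentItemsWriter {
    private final Object lock = new Object();
    private ItemsSketch<String> current = new ItemsSketch<>(1 << 10);

    /** Called by stream consumers for each view event. */
    public void record(String contentKey) {
        synchronized (lock) {
            current.update(contentKey);
        }
    }

    /**
     * Called on a timer at the end of each window: swaps in a fresh
     * sketch and returns the closed one serialised for disk.
     */
    public byte[] rotate() {
        ItemsSketch<String> closed;
        synchronized (lock) {
            closed = current;
            current = new ItemsSketch<>(1 << 10);
        }
        // Serialisation happens outside the lock, on the closed sketch.
        return closed.toByteArray(new ArrayOfStringsSerDe());
    }
}
```

Swapping the reference under the lock and serialising outside it keeps the critical section short, so update throughput is not held hostage to disk writes.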
Whilst this is efficient at write time, we need to further combine these smaller sketches into larger composites to ensure that our queries are not adversely affected. This compaction process occurs at regular intervals and gradually promotes sketches into lower time resolutions, e.g. days and weeks. Duplicates presented a problem: if we compacted the same sketch into the larger composite more than once, it would skew the results. The solution was straightforward: rather than compacting incrementally, we combine pristine copies at each resolution on every scheduled compaction update. As a side note, this design also allows for late-arriving event-time data. Lastly, there was some trial and error in dynamically scaling the size of the sketch with the time resolution, to ensure that items were not evicted from the sketch over larger intervals.

*Query Implementation*

Our proposed solution had to support the identical set of report filter criteria. This presented us with two alternatives:

- construct a Frequent Items sketch for every filter permutation
- integrate the Frequent Items sketch with our existing body of Theta sketches

The former approach raised concerns for high-cardinality dimensions, and we dismissed it as a viable solution. Since we already have a large body of Theta sketches for our data, we decided to build upon and leverage these sketches to extract the distinct-cardinality estimate for each frequent item. This amounted to storing a Theta sketch for each frequent item key, as this allows us to perform additional set operations with our existing Theta sketches to apply the correct filtering. However, this did incur a penalty, as we need N secondary queries for the N most frequent items. In other words, we first retrieve the top N most frequent item keys, and for each of these keys we apply a set of Theta set operations to extract the final estimate.
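One such secondary query can be illustrated with an intersection. The sketch below is a hedged example, not our actual query code: the per-item and per-filter sketches, the user ranges, and the class name are all hypothetical, and the `intersect` call assumes the DataSketches Java 4.x Theta API:

```java
import org.apache.datasketches.theta.Intersection;
import org.apache.datasketches.theta.SetOperation;
import org.apache.datasketches.theta.Sketch;
import org.apache.datasketches.theta.UpdateSketch;

public class PerItemThetaQuery {

    // Estimate the distinct users matching both a frequent item key
    // ("pricing" viewers) and a filter dimension ("EU" users).
    static double euPricingViewers() {
        // Hypothetical per-item Theta sketch: users who viewed "pricing".
        UpdateSketch pricingViewers = UpdateSketch.builder().build();
        for (int u = 0; u < 10_000; u++) pricingViewers.update("user-" + u);

        // Hypothetical filter Theta sketch: users in region "EU".
        UpdateSketch euUsers = UpdateSketch.builder().build();
        for (int u = 5_000; u < 20_000; u++) euUsers.update("user-" + u);

        // Intersect the two to filter the item's audience by region.
        Intersection intersection = SetOperation.builder().buildIntersection();
        intersection.intersect(pricingViewers);
        intersection.intersect(euUsers);
        Sketch result = intersection.getResult();
        return result.getEstimate();
    }

    public static void main(String[] args) {
        // True overlap is users 5,000..9,999, i.e. 5,000 distinct users.
        System.out.printf("EU viewers of pricing: ~%.0f%n", euPricingViewers());
    }
}
```

In the report path, this operation is repeated once per top-N key, which is exactly where the N secondary queries come from.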
The resulting report is then assembled after all the subqueries complete. Fortunately, we are able to batch these to reduce round-trip times.

*Conclusion*

To conclude, a nested query is not as performant as a single composite query, but this is an acceptable trade-off, since the solution is more maintainable and orders of magnitude more cost-efficient! Unfortunately I'm not able to show graphs or quantify the improvements. But we are pleased with the results: both the frequent item and cardinality estimates are difficult to distinguish from our existing report that uses exact methods! Moreover, from a performance perspective, we have the ability to control the size of the sketch through our configuration options.

This would not have been possible without the extremely helpful documentation and community, who have always been helpful with questions we had along the way. I would like to thank the team for the library and for giving me the opportunity to share this feedback.

Thank you,
David
