David,
This is excellent feedback! Thank you for this very informative write-up!

Cheers,
Lee.

On Wed, Nov 4, 2020 at 6:14 AM David Cromberge <[email protected]>
wrote:

> Hi Folks,
>
> I thought those on the dev mailing list might appreciate some details of
> how we successfully applied the frequent items sketch to a complex problem
> in our application.  Although we are still in the early phases of the
> release, we are very pleased with our initial results.
>
> *Background*
> For the purposes of establishing context, our customers deploy our SDK
> which typically runs on customers' mobile devices.  This SDK collects usage
> data in a privacy compliant manner without tracking any user identity.
> This data is then forwarded to our backend where it is processed, stored
> and made available as reports within our customer facing dashboards. One
> such report is a list of the most popular items of customer content (e.g.
> page title), along with the number of distinct user views pertaining to
> that content.  The user is able to apply ad-hoc criteria (e.g. region) in
> conjunction with a date range that may extend back up to two years.
>
> Our initial solution was to use a cloud database to query the data
> directly in one composite SQL query.  Whilst this provided a quick solution
> that was familiar to most team members, there were several drawbacks:
>
> - It was necessary to retain the data in its raw form for lengthy periods
> of time.  Pre-aggregation was not an option: sorting all matching items
> and counting the frequency of each group is not easily mergeable across
> combinations of date dimension and filter criteria.  The fundamental
> difficulty with merging is accurately preserving a distinct count of the
> unique users alongside each frequent item.  This made our solution
> cost-prohibitive.
> - The data for the report was sourced from different relations.  This
> resulted in many complex dynamic predicates that became difficult to
> maintain.
> - Performance - querying raw data using exact methods required scanning
> the complete dataset, which made queries expensive to compute and execute.
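As a small illustration of the mergeability problem (the data and names here are hypothetical, not our production schema), merging only each partition's exact top-K list can drop an item that is actually the global winner:

```python
from collections import Counter

# Hypothetical example: two stream partitions of page-title views.
part_a = ["home"] * 5 + ["news"] * 4 + ["sale"] * 3
part_b = ["blog"] * 5 + ["docs"] * 4 + ["sale"] * 3

K = 2
top_a = dict(Counter(part_a).most_common(K))   # {'home': 5, 'news': 4}
top_b = dict(Counter(part_b).most_common(K))   # {'blog': 5, 'docs': 4}

# Merging only the local top-K lists loses 'sale' entirely, even though
# its global count (6) is the highest of any single item.
merged = Counter(top_a) + Counter(top_b)
exact = Counter(part_a + part_b)
assert "sale" not in merged
assert exact.most_common(1) == [("sale", 6)]
```

A frequent-items sketch avoids this by retaining enough per-item weight information to remain mergeable with a bounded error.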
>
> To address these problems, we decided to investigate an alternative - the
> Frequent Items sketch from the Java library.
>
> *Investigatory Analysis*
> We typically integrate a new sketch by performing an investigatory
> analysis, which involves building and reading from the sketch under a
> production-like workload.  An entire day's worth of production data from
> one of our biggest customers was written into the Frequent Items sketch.
> The serialised size of the sketch was roughly 3.5 MB, which is considerably
> less than our raw column-compressed equivalent in the database.
>
> Another aspect to the investigation was to assess the suitability of the
> sketch to streaming updates.  We consume our input stream across multiple
> independent consumers, and we need sketches to be both mergeable and order
> independent.  Whilst this is difficult to test in an investigation, we
> typically compare the accuracy of the merged output to satisfy these
> criteria.  To do this, we took the same dataset and shuffled it, splitting
> it randomly over several sketches which are ultimately merged.  The output
> from the merged Frequent Items sketch was compared to the exact method, and
> we could not measure a significant loss in accuracy for the merged result.
> A crucial parameter we had to adjust during our tests was the sketch size,
> which allowed us to trade off sketch footprint against the accuracy of the
> final result.
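The shuffle/split/merge check described above can be sketched as follows. A tiny Misra-Gries implementation stands in for the library's Frequent Items sketch; all names, parameters, and data are illustrative, not our production code:

```python
import random
from collections import Counter

class MisraGries:
    """Minimal, illustrative frequent-items sketch (Misra-Gries).
    A stand-in used only to show the shape of the accuracy test."""
    def __init__(self, k):
        self.k = k                  # number of tracked counters
        self.counters = Counter()

    def update(self, item, weight=1):
        c = self.counters
        c[item] += weight
        if len(c) > self.k:         # over capacity: decrement all by the min
            dec = min(c.values())
            for key in list(c):
                c[key] -= dec
                if c[key] <= 0:
                    del c[key]

    def merge(self, other):
        # Feed the other sketch's estimates in as weighted updates.
        for item, est in other.counters.items():
            self.update(item, est)
        return self

# One "hot" item plus a long tail, shuffled and split across 4 sketches.
stream = ["hot"] * 300 + [f"cold{i}" for i in range(200) for _ in range(2)]
random.seed(42)
random.shuffle(stream)

parts = [MisraGries(k=50) for _ in range(4)]
for i, item in enumerate(stream):
    parts[i % 4].update(item)

merged = parts[0]
for p in parts[1:]:
    merged.merge(p)

exact = Counter(stream)
n = len(stream)
# Estimates never overcount, the undercount stays within the sketch's
# error bound, and the merged result finds the same top item as the
# exact method -- independent of the input order.
assert max(merged.counters, key=merged.counters.get) == "hot"
assert all(merged.counters[x] <= exact[x] for x in merged.counters)
assert exact["hot"] - merged.counters["hot"] <= 2 * n // 50
```

The `k` parameter plays the same role as the sketch-size knob mentioned above: a larger `k` shrinks the error bound at the cost of a larger serialised sketch.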
>
> *Streaming Implementation*
> Once we were satisfied that the sketch met our initial criteria, we
> divided up the implementation in order to facilitate a phased rollout.
> This gave us more control over the release and allowed us to accurately
> monitor its operational aspects.  Furthermore, we found it useful to first
> integrate the Frequent Items sketch into our streaming infrastructure, and
> to implement the changes to our queries only once we were successfully
> storing the data.
>
> Since we are dealing with a large quantity of data, we first window the
> Frequent Items sketch for a short period in memory before serialising it to
> disk.  To do this safely, we rely on language-level concurrency
> primitives to ensure that the same sketch is never accessed from multiple
> threads at once.  Whilst this is efficient at write-time, we need to
> further combine these smaller sketches into larger composites to ensure
> that our queries are not adversely affected.  This compaction process
> occurs at regular intervals and gradually promotes sketches into lower time
> resolutions, e.g. days and weeks.  Duplicates presented a problem - if we
> compacted the same sketch into the larger composite more than once, it
> would skew the results.  The solution was straightforward - rather than
> compacting incrementally, we combine pristine copies at each resolution on
> every scheduled compaction run.  As a side note, this design also
> accommodates late-arriving event-time data.
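A minimal sketch of that compaction idea, with plain Counters standing in for serialised, mergeable sketches (the keys and structure are hypothetical): because each lower-resolution composite is rebuilt from pristine copies, re-running compaction never double-counts, and a late update to an hourly sketch is picked up on the next run.

```python
from collections import Counter

# Hourly sketches, keyed by window start (Counters stand in for
# serialised frequent-items sketches).
hourly = {
    "2020-11-04T00": Counter({"home": 10, "news": 4}),
    "2020-11-04T01": Counter({"home": 7, "sale": 5}),
}

def compact_day(hourly_sketches):
    """Rebuild the daily composite from scratch out of pristine
    hourly copies, so the operation is idempotent."""
    day = Counter()
    for sketch in hourly_sketches.values():
        day.update(sketch)          # merge a pristine copy
    return day

daily = compact_day(hourly)

# A late-arriving event updates its hour; the next scheduled run simply
# rebuilds the day, with no risk of skew from re-merging the same sketch.
hourly["2020-11-04T01"]["home"] += 1
daily = compact_day(hourly)
assert daily["home"] == 18
```

Incremental compaction (merging the hourly sketch into the existing daily composite) would have to track which sketches were already merged; rebuilding from pristine copies sidesteps that bookkeeping entirely.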
>
> Lastly, there was some trial and error in dynamically adjusting the
> sketch size with the time resolution, to ensure that items were not
> evicted from the sketch over the larger intervals.
>
> *Query implementation*
> Our proposed solution had to support the identical set of report
> filter criteria.  This presented us with two alternatives:
>
> - construct a frequent items sketch for every filter permutation
> - integrate the frequent items sketch with our existing body of Theta
> sketches
>
> The former approach raised concerns for high-cardinality dimensions,
> and we dismissed it as a viable solution.  Since we already have a large
> body of Theta sketches for our data, we decided to build upon and leverage
> these sketches to extract the distinct cardinality estimates that are
> identified for each frequent item.  This amounted to storing a Theta sketch
> for each frequent item key, as this would allow us to perform additional
> set operations with our existing Theta sketches to perform the correct
> filtering.  However, this did incur a penalty, as we needed N secondary
> queries for the N-most frequent items.  In other words, we first retrieve
> the top N most frequent item keys, and for each of these keys apply a set
> of Theta set operations to extract the final estimate.  The resulting
> report is then assembled after all the subqueries complete.  Fortunately,
> we are able to batch these to lessen roundtrip times.
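The nested query flow might be sketched as below, with plain Python sets standing in for the per-key Theta sketches and their set operations (all names and data are hypothetical): first retrieve the top-N frequent item keys, then run one secondary lookup per key, intersecting its sketch with the filter's sketch to get the filtered distinct-user estimate.

```python
# Top-N keys, as returned by the frequent-items query (hypothetical).
top_n = ["page:home", "page:news", "page:sale"]

# Per-item "Theta sketches": here, literal sets of user ids; a real
# Theta sketch holds a sample of hashed ids and estimates cardinality.
theta_by_key = {
    "page:home": {1, 2, 3, 4, 5},
    "page:news": {2, 3, 6},
    "page:sale": {5, 7},
}
region_filter = {1, 2, 3, 7}    # users matching the ad-hoc criteria

def filtered_report(keys, per_key, flt):
    """One secondary query per key (batched in production):
    intersect each item's sketch with the filter sketch and
    take the distinct-count estimate of the result."""
    return {k: len(per_key[k] & flt) for k in keys}

report = filtered_report(top_n, theta_by_key, region_filter)
assert report == {"page:home": 3, "page:news": 2, "page:sale": 1}
```

This is the source of the N-secondary-queries cost noted above: the top-N retrieval and the N intersections are separate round trips unless batched.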
>
> *Conclusion*
> To conclude, a nested query is not as performant as a single composite
> query, but this is an acceptable trade-off since the solution is more
> maintainable and orders of magnitude more cost-efficient!
>
> Unfortunately I'm not able to show graphs or quantify the improvements.
> But we are pleased with the results - the frequent-item and cardinality
> estimates are difficult to distinguish from those in our existing report,
> which uses exact methods!  Moreover, from a performance
> perspective, we have the ability to control the size of the sketch through
> our configuration options.
>
> This would not have been possible without the excellent documentation and
> the community, who have always been helpful with the questions we had
> along the way.  I would like to thank the team for the library and for
> giving me the opportunity to share this feedback.
>
> Thank you,
> David
>
