Hi Folks,

I thought those on the dev mailing list might appreciate some details of
how we successfully applied the frequent items sketch to a complex problem
in our application.  Although we are still in the early phases of the
release, we are very pleased with our initial results.

*Background*
For the purposes of establishing context, our customers deploy our SDK,
which typically runs on their mobile devices.  The SDK collects usage
data in a privacy-compliant manner without tracking any user identity.
This data is then forwarded to our backend, where it is processed, stored
and made available as reports within our customer-facing dashboards. One
such report is a list of the most popular items of customer content (e.g.
page titles), along with the number of distinct user views pertaining to
that content.  The user is able to apply ad-hoc criteria (e.g. region) in
conjunction with a date range that may extend back up to two years.

Our initial solution was to use a cloud database to query the data directly
in one composite SQL query.  Whilst this provided a quick solution that was
familiar to most team members, there were several drawbacks:

- It was necessary to retain the data in its raw form for lengthy periods
of time.  Pre-aggregation was not an option: sorting through all matching
items and counting the frequency of each group is not easily mergeable
across combinations of date dimension and filter criteria.  The fundamental
difficulty with merging is in accurately preserving a distinct count of the
unique users alongside each frequent item. This made our solution
cost-prohibitive.
- The data for the report was sourced from different relations.  This
resulted in many complex dynamic predicates that became difficult to
maintain.
- Performance - querying raw data using exact methods required scanning the
entire dataset, which made queries expensive to compute and execute.

To address these problems, we decided to investigate an alternative - the
Frequent Items sketch from the DataSketches Java library.

*Investigatory Analysis*
We typically integrate a new sketch by performing an investigatory
analysis, which involves building and reading from the sketch under a
production-like workload.  An entire day's worth of production data from
one of our biggest customers was written into the Frequent Items sketch.
The serialised size of the sketch was roughly 3.5 MB, which is considerably
smaller than the equivalent raw, column-compressed data in the database.
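
For anyone curious what that probe looked like, the following is a minimal,
self-contained illustration of the idea using the DataSketches ItemsSketch.
The maxMapSize value and the loadDayOfEvents() helper are placeholders
rather than our production code, and import paths may differ slightly
between library versions.

```java
import org.apache.datasketches.common.ArrayOfStringsSerDe;
import org.apache.datasketches.frequencies.ItemsSketch;

import java.util.List;

public class SizeProbe {
    public static void main(String[] args) {
        // maxMapSize must be a power of 2; larger values trade memory for accuracy.
        ItemsSketch<String> sketch = new ItemsSketch<>(1 << 10);

        // loadDayOfEvents() is a stand-in for reading one day of production events.
        for (String pageTitle : loadDayOfEvents()) {
            sketch.update(pageTitle);
        }

        // Serialise to measure the on-disk footprint of the sketch.
        byte[] bytes = sketch.toByteArray(new ArrayOfStringsSerDe());
        System.out.printf("serialised size: %d bytes%n", bytes.length);
    }

    private static List<String> loadDayOfEvents() {
        // Placeholder for the real event source.
        return List.of("Home", "Pricing", "Home", "Blog", "Home");
    }
}
```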

Another aspect of the investigation was to assess the suitability of the
sketch for streaming updates.  We consume our input stream across multiple
independent consumers, so we need sketches to be both mergeable and order
independent.  Whilst this is difficult to test in an investigation, we
typically compare the accuracy of the merged output to satisfy these
criteria.  To do this, we took the same dataset, shuffled it, and split it
randomly over several sketches which were ultimately merged.  The output
from the merged Frequent Items sketch was compared to the exact method, and
we could not measure a significant loss in accuracy for the merged result.
A crucial parameter we had to adjust during our tests was the sketch size,
which allowed us to trade off space against the accuracy of the final
result.
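
A simplified version of that shuffle-split-merge check might look like the
sketch below; the shard count, map size and event source are hypothetical
stand-ins, and the exact counts are computed with a plain HashMap purely
for comparison.

```java
import org.apache.datasketches.frequencies.ErrorType;
import org.apache.datasketches.frequencies.ItemsSketch;

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergeAccuracyCheck {
    public static void main(String[] args) {
        final int numShards = 8;          // illustrative, not our production value
        final int maxMapSize = 1 << 10;   // power of 2, as required by ItemsSketch

        List<String> events = new ArrayList<>(List.of(
            "Home", "Pricing", "Home", "Blog", "Home", "Docs", "Home"));
        Collections.shuffle(events);      // order independence: feed data in random order

        // Split the shuffled stream over several independent sketches...
        List<ItemsSketch<String>> shards = new ArrayList<>();
        for (int i = 0; i < numShards; i++) {
            shards.add(new ItemsSketch<>(maxMapSize));
        }
        for (int i = 0; i < events.size(); i++) {
            shards.get(i % numShards).update(events.get(i));
        }

        // ...then merge them back into a single sketch.
        ItemsSketch<String> merged = new ItemsSketch<>(maxMapSize);
        for (ItemsSketch<String> shard : shards) {
            merged.merge(shard);
        }

        // Exact counts for comparison against the merged estimates.
        Map<String, Long> exact = new HashMap<>();
        for (String e : events) {
            exact.merge(e, 1L, Long::sum);
        }
        for (ItemsSketch.Row<String> row : merged.getFrequentItems(ErrorType.NO_FALSE_NEGATIVES)) {
            System.out.printf("%s estimate=%d exact=%d%n",
                row.getItem(), row.getEstimate(), exact.getOrDefault(row.getItem(), 0L));
        }
    }
}
```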

*Streaming Implementation*
Once we were satisfied that the sketch met our initial criteria, we divided
up the implementation in order to facilitate a phased rollout.  This gave
us more control over the release and allowed us to accurately monitor its
operational aspects.  Furthermore, we found it useful to first integrate
the Frequent Items sketch into our streaming infrastructure, and to
implement the changes to our queries only once we were successfully storing
the data.

Since we are dealing with a large quantity of data, we first window the
Frequent Items sketch for a short period in memory before serialising it to
disk.  In order to do this safely, we rely on language-level concurrency
primitives to ensure that the same sketch is never accessed from multiple
threads at once.  Whilst this is efficient at write time, we need to
further combine these smaller sketches into larger composites to ensure
that our queries are not adversely affected.  This compaction process
occurs at regular intervals and gradually promotes sketches into coarser
time resolutions, e.g. days and weeks.  Duplicates presented a problem - if
we compacted the same sketch into the larger composite more than once, the
results would be skewed.  The solution was straightforward - rather than
incrementally compacting our sketches, each scheduled compaction combines
pristine copies at each resolution.  As a side note, this design also
allows for late arrival of event-time data.
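
To make the windowing and compaction idea concrete, here is a heavily
simplified illustration.  The hourly window, the synchronized blocks and
the class/method names are assumptions for the example only; our actual
implementation flushes window sketches to disk rather than keeping them all
in memory.

```java
import org.apache.datasketches.frequencies.ItemsSketch;

import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Events are accumulated into small per-window sketches; a scheduled
 * compaction rebuilds the daily composite from pristine copies of those
 * window sketches rather than merging into it incrementally, which would
 * double-count on repeated runs and could not absorb late-arriving data.
 */
public class WindowedFrequentItems {
    private static final int MAX_MAP_SIZE = 1 << 10; // illustrative

    // One sketch per in-memory window (e.g. per hour).
    private final Map<Instant, ItemsSketch<String>> windows = new ConcurrentHashMap<>();

    public void record(Instant eventTime, String item) {
        Instant window = eventTime.truncatedTo(ChronoUnit.HOURS);
        ItemsSketch<String> sketch =
            windows.computeIfAbsent(window, w -> new ItemsSketch<>(MAX_MAP_SIZE));
        // Never update the same sketch from two threads at once.
        synchronized (sketch) {
            sketch.update(item);
        }
    }

    /** Rebuild the composite for one day from pristine window sketches. */
    public ItemsSketch<String> compactDay(Instant dayStart) {
        ItemsSketch<String> daily = new ItemsSketch<>(MAX_MAP_SIZE);
        Instant dayEnd = dayStart.plus(1, ChronoUnit.DAYS);
        windows.forEach((window, sketch) -> {
            if (!window.isBefore(dayStart) && window.isBefore(dayEnd)) {
                synchronized (sketch) {
                    daily.merge(sketch);
                }
            }
        });
        return daily;
    }
}
```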

Lastly, there was some trial and error in dynamically adjusting the size of
the sketch with the time resolution, to ensure that items were not evicted
from the sketch over larger intervals.
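
Purely as an illustration of that tuning, one might map each time
resolution to an ItemsSketch maxMapSize along these lines.  The values are
hypothetical and were, in our case, found by trial and error; maxMapSize
must be a power of two, and a larger map retains more distinct items before
eviction.

```java
import java.time.temporal.ChronoUnit;

final class SketchSizing {
    // Hypothetical mapping from time resolution to ItemsSketch maxMapSize.
    // Coarser windows see more distinct items, so they get a larger map to
    // avoid evicting items the report still needs.
    static int maxMapSizeFor(ChronoUnit resolution) {
        switch (resolution) {
            case HOURS: return 1 << 9;   // 512
            case DAYS:  return 1 << 11;  // 2048
            case WEEKS: return 1 << 13;  // 8192
            default: throw new IllegalArgumentException("unsupported resolution: " + resolution);
        }
    }
}
```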

*Query Implementation*
Our proposed solution had to support the same set of report filter
criteria as the original.  This presented us with two alternatives:

- construct a Frequent Items sketch for every filter permutation
- integrate the Frequent Items sketch with our existing body of Theta
sketches

The former approach raised concerns for high-cardinality dimensions, and we
dismissed it as a viable solution.  Since we already have a large body of
Theta sketches for our data, we decided to build upon and leverage these
sketches to extract a distinct cardinality estimate for each frequent item.
This amounted to storing a Theta sketch for each frequent item key, which
allows us to perform additional set operations against our existing Theta
sketches to apply the correct filtering.  However, this did incur a
penalty, as we needed N secondary queries for the N most frequent items.
In other words, we first retrieve the top N most frequent item keys, and
for each of these keys apply a set of Theta set operations to extract the
final estimate.  The resulting report is then assembled after all the
subqueries complete.  Fortunately, we are able to batch these queries to
lessen round-trip times.
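
As a rough sketch of that two-stage query: the lookupThetaSketchFor()
helper and the filter sketch are hypothetical stand-ins for our storage
layer, and the Theta Intersection calls follow the current DataSketches
Java API.

```java
import org.apache.datasketches.frequencies.ErrorType;
import org.apache.datasketches.frequencies.ItemsSketch;
import org.apache.datasketches.theta.Intersection;
import org.apache.datasketches.theta.SetOperation;
import org.apache.datasketches.theta.Sketch;
import org.apache.datasketches.theta.UpdateSketch;

import java.util.LinkedHashMap;
import java.util.Map;

public class FrequentItemsReport {

    /**
     * Stage 1: take the top-N keys from the merged Frequent Items sketch.
     * Stage 2: for each key, intersect its per-key Theta sketch with the
     * Theta sketch representing the report's filter criteria (e.g. region)
     * to obtain a filtered distinct-user estimate.
     */
    static Map<String, Double> topNWithFilter(ItemsSketch<String> frequent,
                                              Sketch filterSketch,
                                              int n) {
        Map<String, Double> report = new LinkedHashMap<>();
        ItemsSketch.Row<String>[] rows = frequent.getFrequentItems(ErrorType.NO_FALSE_NEGATIVES);

        for (int i = 0; i < Math.min(n, rows.length); i++) {
            String key = rows[i].getItem();

            // Secondary query: load the stored per-key Theta sketch.
            Sketch perKey = lookupThetaSketchFor(key);

            Intersection intersection = SetOperation.builder().buildIntersection();
            intersection.intersect(perKey);
            intersection.intersect(filterSketch);

            report.put(key, intersection.getResult().getEstimate());
        }
        return report;
    }

    private static Sketch lookupThetaSketchFor(String key) {
        // Placeholder: in practice this is a (batched) database lookup.
        return UpdateSketch.builder().build().compact();
    }
}
```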

*Conclusion*
To conclude, a nested query is not as performant as a single composite
query, but this is an acceptable trade-off since the solution is more
maintainable, and orders of magnitude more cost-efficient!

Unfortunately I'm not able to show graphs or quantify the improvements.
But we are pleased with the results - the frequent item and cardinality
estimates are difficult to distinguish from those in our existing report
that uses exact methods!  Moreover, from a performance perspective, we are
able to control the size of the sketch through our configuration options.

This would not have been possible without the excellent documentation and
the community, who have always been helpful with the questions we had along
the way. I would like to thank the team for the library and for giving me
the opportunity to share this feedback.

Thank you,
David
