Hi all,

Just to follow up on what I said at the end of the chat yesterday, the
paper that does the 'dyadic trick' but with larger branching factor is

Graham Cormode, Marios Hadjieleftheriou: Finding frequent items in
data streams. PVLDB 1(2): 1530-1541 (2008)

The relevant part is Section 3.3, the paragraph with "Finding Frequent
Items using a Hierarchy" in bold.

Recall that to get failure probability p for k-heavy hitters in a
CountMin (CM) sketch, you should have O(k) columns and O(log 1/p)
rows, for total memory O(k log 1/p) words. The update time and the
time for a point query are both O(log 1/p).
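To make that concrete, here is a minimal CountMin sketch in Python (an illustrative toy, not any library's actual implementation); `width` and `depth` play the roles of the O(k) columns and O(log 1/p) rows above, and the hash family is just a simple illustrative choice:

```python
import random

class CountMin:
    """Minimal CountMin sketch: `width` columns, `depth` rows.
    The hash family here is an illustrative choice, not a tuned one."""

    PRIME = (1 << 61) - 1  # a Mersenne prime, convenient for modular hashing

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        # one random (a, b) pair per row: h(x) = ((a*x + b) mod P) mod width
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _col(self, row, x):
        a, b = self.hashes[row]
        return (a * x + b) % self.PRIME % self.width

    def update(self, x, delta=1):
        # O(depth) = O(log 1/p) work per update: one counter per row
        for r in range(len(self.hashes)):
            self.table[r][self._col(r, x)] += delta

    def query(self, x):
        # point query: the min over rows never underestimates the true
        # count, and overestimates by roughly (stream length)/width w.h.p.
        return min(self.table[r][self._col(r, x)]
                   for r in range(len(self.hashes)))
```

A point query is then just `cm.query(x)`, which is at least the true count of x and rarely much more.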

With branching factor b, the height of the tree is log_b n = log n /
log b. The number of queries you'll make to report the heavy hitters
is k*b times that, i.e. kb * log n / log b (for each level there are
at most O(k) items marked as being heavy, and you have to query each
of their b children). Thus to get failure probability q for heavy
hitters, you need to union bound to make sure all your queries
succeed, so you need to set p < q / (kb log n / log b) = q log b / (kb
log n). Therefore the "log 1/p" term above is Theta(log(b log n / q)).
The update time: you have to update one CM sketch per level, so in
total it's log_b n * log 1/p = (log n / log b) * log(b log n / q), and
the space is O(k) times this and query time is O(kb) times this.
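To make the tree traversal concrete, here is a small Python sketch of the hierarchy idea. For clarity each level uses an exact Counter in place of a CM sketch (the real algorithm substitutes one CM sketch per level, which is where the bounds above come from), and the threshold form phi * |stream| is an illustrative choice:

```python
from collections import Counter

def hierarchical_heavy_hitters(stream, n, b, phi):
    """Find items with count > phi * len(stream) in a universe [0, n),
    via one frequency summary per level of a b-ary tree over the universe.
    Each level is an exact Counter here for clarity; in the real algorithm
    each level is a CountMin sketch, giving the bounds discussed above."""
    height = 0                      # height = ceil(log_b n)
    while b ** height < n:
        height += 1
    levels = [Counter() for _ in range(height + 1)]
    for x in stream:
        # update the ancestor of x at every level (level 0 = root),
        # one summary update per level as in the update-time analysis
        for lvl in range(height + 1):
            levels[lvl][x // b ** (height - lvl)] += 1
    threshold = phi * len(stream)
    # Walk down the tree, expanding only nodes whose estimated count
    # exceeds the threshold: few nodes survive per level, and each
    # contributes b child queries -- the kb-queries-per-level cost above.
    frontier = [0]                  # the root covers the whole universe
    for lvl in range(1, height + 1):
        frontier = [child
                    for node in frontier
                    for child in range(b * node, b * node + b)
                    if levels[lvl][child] > threshold]
    return sorted(frontier)         # leaves are the items themselves
```

With b = 2 this is the usual dyadic trick; raising b flattens the tree, which is exactly the trade-off in the analysis above.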

To be concrete, let's say you want failure probability 1/n. Then you
can set b = n^{1/10} (that exponent was arbitrary; pick n raised to
any small constant). Then the point is log n / log b is Theta(1). So
the update time is Theta(log n), the space is Theta(k log n), and the
query time is Theta(n^{1/10} k log n) (which is at least better than
the naive approach of a for loop through the universe, which would
have query time proportional to n instead of n^{1/10}).
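As a worked check of that parameter setting (with an illustrative universe size n = 2^30):

```python
# With n = 2^30, the branching factor b = n^{1/10} is 8, and the tree
# height log_b n is the constant 10 -- so each update touches 10 CM
# sketches, each at cost Theta(log(1/p)) = Theta(log n).
n = 2 ** 30
b = round(n ** 0.1)        # n^{1/10}
height = 0                 # height = ceil(log_b n)
while b ** height < n:
    height += 1
print(b, height)           # 8 10
```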

-Jelani

On Sat, Aug 3, 2019 at 4:01 PM leerho <[email protected]> wrote:
>
> Folks,
>
> Two items:
>
> 1) The most recent issue of CACM (Aug 2019) has an interesting article on 
> turnstile L_p tail heavy hitters.  It is introduced by a one-page article 
> by Graham Cormode.  Apparently, a preliminary version of the article 
> appeared in IEEE FOCS 2016.
>
> The article discusses both the CountSketch and the CountMin sketch.  When 
> Jelani visited, he mentioned that learning to leverage these data 
> structures might be useful for solving a range of problems similar to 
> what this article talks about: turnstile and tail heavy hitters.
>
> I will have to study the article more closely, but I would need help from you 
> theory guys to figure out a) if this is practical to implement and b) how we 
> could approach it.
>
> 2) Since Justin and Edo (I think) are back in town I suggest we all try to 
> make this next Tuesday's Sketches Research meeting.  We can certainly discuss 
> this paper among other topics.
>
> Cheers,
>
> Lee
>
