Hi,

The CACM article and FOCS paper you mention are authored by me (with
Nguyen, Larsen, and Thorup). There's also a description of how our
algorithm works in this Quanta article:
https://www.quantamagazine.org/best-ever-algorithm-found-for-huge-streams-of-data-20171024/.

It solves a problem that only matters if you're dealing with streams
that have deletions.

Recall the "PointQuery" problem: if you have a histogram x,
PointQuery(i) should return an estimate z_i of x_i such that

|z_i - x_i| < (1/k) * |x|_1 (*)

(often "1/k" is called "epsilon"; also |x|_1 can be improved to
|x_{tail(k)}|_1 in most solutions)
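
To make (*) concrete, here's a minimal CountMin-style sketch in Python.
This is my own illustration, not the data structure from the paper: using
width on the order of k gives the (1/k)*|x|_1 error guarantee, negative
deltas support deletions, and the min-based query assumes the strict
turnstile setting (all x_i >= 0).

```python
import random

class CountMin:
    """Minimal CountMin sketch (illustrative).  With width ~ k, the
    estimate z_i = point_query(i) satisfies x_i <= z_i and, with good
    probability, z_i - x_i <= (1/k) * |x|_1 in the strict turnstile
    setting (all true counts non-negative)."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]
        # one salted hash per row (a stand-in for pairwise independence)
        self.salts = [rng.getrandbits(64) for _ in range(depth)]

    def _hash(self, row, i):
        return hash((self.salts[row], i)) % self.width

    def update(self, i, delta=1):
        # delta < 0 handles deletions, making this a turnstile sketch
        for r in range(self.depth):
            self.table[r][self._hash(r, i)] += delta

    def point_query(self, i):
        # each row overestimates x_i, so the min is the best estimate
        return min(self.table[r][self._hash(r, i)]
                   for r in range(self.depth))
```

(CountSketch replaces the min with a median and adds random signs,
which is what yields the l_2 version of the guarantee.)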

CountSketch solves the version with l_2 norm instead of l_1, but the
form of the guarantee is still like (*). Now, how do you use
PointQuery to solve Heavy Hitters? In insertion-only, you can keep a
minheap of the top O(k) candidates. When you see a new item i in the
stream, get z_i via PointQuery(i). If z_i is larger than the smallest
value in the heap (or the heap isn't full yet), deleteMin() and add i
to the heap as a key with value z_i.

Now, when someone queries for the heavy hitters you just return
everyone in the heap. So you can answer the heavy hitters query in
O(k) time.
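
In code, the insertion-only heap trick looks roughly like this (my
sketch, not the paper's code; `point_query` stands in for whatever
sketch backs PointQuery, and for simplicity items already in the heap
are skipped rather than having their keys refreshed):

```python
import heapq

def heavy_hitters_insertion_only(stream, k, point_query):
    """Keep the top 2k candidates in a min-heap keyed by the
    PointQuery estimate z_i.  Answering a heavy-hitters query is then
    just dumping the heap, which takes O(k) time."""
    heap = []            # entries are (z_i, i)
    in_heap = set()
    for i in stream:
        # (in a real implementation, the sketch is also updated here)
        if i in in_heap:
            continue     # a real implementation would refresh the key
        z = point_query(i)
        if len(heap) < 2 * k:
            heapq.heappush(heap, (z, i))
            in_heap.add(i)
        elif z > heap[0][0]:   # z_i beats the smallest estimate kept
            _, evicted = heapq.heapreplace(heap, (z, i))
            in_heap.discard(evicted)
            in_heap.add(i)
    return [i for _, i in heap]
```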

What about when there are deletions? This heap trick doesn't work,
since someone may delete occurrences of all the heap items until none
of them is heavy anymore, and it's unclear how to find out who the new
heavy hitters are. The most straightforward thing that does work is a
for loop: PointQuery(i) for each i=1...n, then return the 2k items,
say, with the largest z_i. The problem is that this is slow: at least
n time (where n is the universe size). What that CACM/FOCS paper shows is how
to get roughly the same good space and update time as this solution,
while bringing the query time for the list of heavy hitters to
something closer to k (we get k * polylog(n)).
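
The slow-but-correct turnstile baseline is just this (again my
illustration; `point_query` is any sketch satisfying (*)):

```python
import heapq

def heavy_hitters_turnstile_naive(n, k, point_query):
    """Brute-force query after arbitrary insertions and deletions:
    PointQuery every item in the universe [n] and keep the 2k largest
    estimates.  Correct, but the query takes at least n time; the
    CACM/FOCS algorithm brings this down to ~ k * polylog(n)."""
    candidates = ((point_query(i), i) for i in range(n))
    return [i for _, i in heapq.nlargest(2 * k, candidates)]
```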

-Jelani

On Sat, Aug 3, 2019 at 4:01 PM leerho <[email protected]> wrote:
>
> Folks,
>
> Two items:
>
> 1) The most recent issue of CACM (Aug 2019) has an interesting article on 
> turnstile L_p tail heavy-hitters.  It is introduced by a 
> one-page article by Graham Cormode.  Apparently, a preliminary version of the 
> article appeared in IEEE FOCS 2016.
>
> The article discusses both the CountSketch and the CountMin sketch.  When 
> Jelani visited, he mentioned that learning to leverage these data structures 
> might be useful for solving a range of problems similar to what this article 
> talks about: turnstile and tail heavy hitters.
>
> I will have to study the article more closely, but I would need help from you 
> theory guys to figure out a) if this is practical to implement and b) how we 
> could approach it.
>
> 2) Since Justin and Edo (I think) are back in town I suggest we all try to 
> make this next Tuesday's Sketches Research meeting.  We can certainly discuss 
> this paper among other topics.
>
> Cheers,
>
> Lee
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]