Re: [PROPOSAL] Introduce Elastic Bloom Filter For Flink

sihua zhou Tue, 29 May 2018 22:37:24 -0700

Hi,

I did a survey of the variants of Bloom Filter and the Cuckoo filter these
days. Finally, I found 3 of them maybe adaptable for our purpose.

1. standard bloom filter (which we have implemented base on this and used it on
production with a good experience)
2. cuckoo filter, also a very good filter which is a space-efficient data
structures and support fast query(even faster then BF, but the insert maybe a
little slower than BF), addtional it support delete() operation.
3. count bloom filter, a variant of BF, it supports delete()operation, but need
to cost 4-5x memory than the standard bloom filter(so, I'm not sure whether
it's adaptable in practice).

Anyway, these filters are just the smallest storage unit in this "Elastic Bloom
Filter", we can define a general interface, and provide different
implementation of "storage unit" base on different filter if we want. Maybe I
should change the PROPOSAL name to the "Introduce Elastic Filter For Flink",
the ideal of approach that I outlined in the doc is very similar to the paper
"Optimization and Applications of Dynamic Bloom
Filters(http://ijarcs.info/index.php/Ijarcs/article/viewFile/826/814)"(compare
to the paper, the approach I outlined could have a better query performance and
also support the RELAXED TTL), maybe it can help to understand the desgin doc.
Looking forward any feedback!

Best, Sihua
On 05/24/2018 10:36，sihua zhou<[email protected]> wrote：
Hi,
Thanks for your suggestions @Elias! I have a brief look at "Cuckoo Filter" and
"Golumb Compressed Sequence", my first sensation is that maybe "Golumc
Compressed Sequence" is not a good choose, because it seems to require
non-constant lookup time, but Cuckoo Filter maybe a good choose, I should
definitely have a deeper look at it.

Beside, to me, all of this filters seems to a "variant" of the bloom
filter(which is the smallest unit to store data in the current desgin), the
main challenge for introducing BF into flink is the data skewed(which is common
phenomenon on production) problem, could you maybe also have a look at the
solution that I posted on the google doc
https://docs.google.com/document/d/17UY5RZ1mq--hPzFx-LfBjCAw_kkoIrI9KHovXWkxNYY/edit?usp=sharing
for this problem, It would be nice if you could give us some advice on that.

Best, Sihua

On 05/24/2018 07:21，Elias Levy<[email protected]> wrote：
I would suggest you consider an alternative data structures: a Cuckoo
Filter or a Golumb Compressed Sequence.

The GCS data structure was introduced in Cache-, Hash- and Space-Efficient
Bloom Filters
<http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf> by
F. Putze, P. Sanders, and J. Singler. See section 4.

We should discuss which exact implementation of bloom filters are the best
fit.
@Fabian: There are also implementations of bloom filters that use counting
and therefore support
deletes, but obviously this comes at the cost of a potentially higher
space consumption.

Am 23.05.2018 um 11:29 schrieb Fabian Hueske <[email protected]>:
IMO, such a feature would be very interesting. However, my concerns with
Bloom Filter
is that they are insert-only data structures, i.e., it is not possible to
remove keys once
they were added. This might render the filter useless over time.
In a different thread (see discussion in FLINK-8918 [1]), you mentioned
that the Bloom
Filters would be growing.
If we keep them in memory, how can we prevent them from exceeding memory
boundaries over
time?

Re: [PROPOSAL] Introduce Elastic Bloom Filter For Flink

Reply via email to