[jira] [Commented] (FLINK-7465) Add build-in BloomFilterCount on TableAPI&SQL

Jark Wu (JIRA) Mon, 21 Aug 2017 19:25:06 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136174#comment-16136174
 ]


Jark Wu commented on FLINK-7465:
--------------------------------

Hi [~sunjincheng121], 
Regarding to the when to configure the accuracy, I think maybe before the 
function is registered is better. (1). the accuracy parameter can be checked 
before runtime  (2) do not need to check whether the bitarray and hash function 
is initialized when every time the accumulator is called. 

Regarding to the de/serialize, 

- de/serialization bitArray every call the accumulate
   It maybe to expensive as I mentioned before.
- de/serialization bitArray in check point 
  We can't we do as I know. We implement the UDAGG by {{State}} interface. If 
we use Heap backend, the deserialization only happens in checkpoint, but if it 
is RocksDB backend, the de/serialization happens in every update.
- de/serialization bitArray in open/close
  Do you mean not use State in UDAGG? How can we do the exactly-once then? 

I'm designing an implementation which is based on MapView(i.e. MapView<Int, 
Long>) , it acts like {{long[]}} which can be used as bitarray or bitmap, but 
have a better performance. In this way, we only have to deserialize several 
longs in every {{accumulate}} call. What do you think? [~fhueske] 
[~sunjincheng121]

> Add build-in BloomFilterCount on TableAPI&SQL
> ---------------------------------------------
>
>                 Key: FLINK-7465
>                 URL: https://issues.apache.org/jira/browse/FLINK-7465
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Table API & SQL
>            Reporter: sunjincheng
>            Assignee: sunjincheng
>         Attachments: bloomfilter.png
>
>
> In this JIRA. use BloomFilter to implement counting functions.
> BloomFilter Algorithm description:
> An empty Bloom filter is a bit array of m bits, all set to 0. There must also 
> be k different hash functions defined, each of which maps or hashes some set 
> element to one of the m array positions, generating a uniform random 
> distribution. Typically, k is a constant, much smaller than m, which is 
> proportional to the number of elements to be added; the precise choice of k 
> and the constant of proportionality of m are determined by the intended false 
> positive rate of the filter.
> To add an element, feed it to each of the k hash functions to get k array 
> positions. Set the bits at all these positions to 1.
> To query for an element (test whether it is in the set), feed it to each of 
> the k hash functions to get k array positions. If any of the bits at these 
> positions is 0, the element is definitely not in the set – if it were, then 
> all the bits would have been set to 1 when it was inserted. If all are 1, 
> then either the element is in the set, or the bits have by chance been set to 
> 1 during the insertion of other elements, resulting in a false positive.
> An example of a Bloom filter, representing the set {x, y, z}. The colored 
> arrows show the positions in the bit array that each set element is mapped 
> to. The element w is not in the set {x, y, z}, because it hashes to one 
> bit-array position containing 0. For this figure, m = 18 and k = 3. The 
> sketch as follows:
> !bloomfilter.png!
> Reference:
> 1. https://en.wikipedia.org/wiki/Bloom_filter
> 2. 
> https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hive/common/util/BloomFilter.java
> Hi [~fhueske] [~twalthr] I appreciated if you can give me some advice. :-)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (FLINK-7465) Add build-in BloomFilterCount on TableAPI&SQL

Reply via email to