Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11077 )

Change subject: [KUDU-2521] Java Implementation for BloomFilter
......................................................................


Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/11077/1/java/kudu-client/src/main/java/org/apache/kudu/util/BloomFilter.java
File java/kudu-client/src/main/java/org/apache/kudu/util/BloomFilter.java:

http://gerrit.cloudera.org:8080/#/c/11077/1/java/kudu-client/src/main/java/org/apache/kudu/util/BloomFilter.java@22
PS1, Line 22: public class BloomFilter {
Is this meant to be part of the public API or an internal-facing class? I'm 
guessing that consumers of the kudu client need to actually construct the bloom 
filter in some way, so this is public, right?

Either way, we should add an InterfaceAudience and InterfaceStability 
annotation.


http://gerrit.cloudera.org:8080/#/c/11077/1/java/kudu-client/src/main/java/org/apache/kudu/util/BloomFilter.java@40
PS1, Line 40: genBloomKeyProbe
this pattern seems like it will create a new object for each lookup in the 
bloom filter. Although these are short-lived objects, I'm not sure if we can 
count on them being removed by escape analysis. Maybe instead we can make 
BloomKeyProbe a reusable mutable object?


http://gerrit.cloudera.org:8080/#/c/11077/1/java/kudu-client/src/main/java/org/apache/kudu/util/BloomFilter.java@86
PS1, Line 86: CharsetUtil
I think we can use StandardCharsets from java 7 here


http://gerrit.cloudera.org:8080/#/c/11077/1/java/kudu-client/src/main/java/org/apache/kudu/util/BloomFilter.java@139
PS1, Line 139: CityHash
I'm curious why the decision to use CityHash here? We are using murmur hashes 
for partitioning, and internal to the tserver using CityHash for blooms on PK 
lookups.

Is there an advantage to this bloom filter matching the hash used by the blooms 
on the server side? I can't quite see how it would be used. If the choice of 
hash is arbitrary, maybe better to just stick with Murmur since we already 
included the library? Or maybe we can benchmark some alternatives for this use 
case and see which is best? I think performance on small strings is most 
important, even if the hash function is lower "quality". See 
https://bigdata.uni-saarland.de/publications/p249-richter.pdf for some 
interesting discussion

Is there any way we can use the bloom filters to eliminate entire partitions 
while scanning if the bloom is calculated on the partitioning key? I'm not sure 
if the bloom could be used directly, but I wonder if this would be useful for 
joins in some cases (eg by just calculating a simple bitmap of matching 
partitions based on knowledge of the hash partitioning on the build side of the 
join)



--
To view, visit http://gerrit.cloudera.org:8080/11077
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I50c5dce4d24707c068c5ab94e5d311b12b3251b8
Gerrit-Change-Number: 11077
Gerrit-PatchSet: 1
Gerrit-Owner: [email protected]
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Comment-Date: Mon, 30 Jul 2018 16:05:08 +0000
Gerrit-HasComments: Yes

Reply via email to