Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/11077 )
Change subject: [KUDU-2521] Java Implementation for BloomFilter ...................................................................... Patch Set 1: (4 comments) http://gerrit.cloudera.org:8080/#/c/11077/1/java/kudu-client/src/main/java/org/apache/kudu/util/BloomFilter.java File java/kudu-client/src/main/java/org/apache/kudu/util/BloomFilter.java: http://gerrit.cloudera.org:8080/#/c/11077/1/java/kudu-client/src/main/java/org/apache/kudu/util/BloomFilter.java@22 PS1, Line 22: public class BloomFilter { Is this meant to be part of the public API or an internal-facing class? I'm guessing that consumers of the kudu client need to actually construct the bloom filter in some way, so this is public, right? Either way, we should add an InterfaceAudience and InterfaceStability annotation. http://gerrit.cloudera.org:8080/#/c/11077/1/java/kudu-client/src/main/java/org/apache/kudu/util/BloomFilter.java@40 PS1, Line 40: genBloomKeyProbe this pattern seems like it will create a new object for each lookup in the bloom filter. Although these are short-lived objects, I'm not sure if we can count on them being removed by escape analysis. Maybe instead we can make BloomKeyProbe a reusable mutable object? http://gerrit.cloudera.org:8080/#/c/11077/1/java/kudu-client/src/main/java/org/apache/kudu/util/BloomFilter.java@86 PS1, Line 86: CharsetUtil I think we can use StandardCharsets from java 7 here http://gerrit.cloudera.org:8080/#/c/11077/1/java/kudu-client/src/main/java/org/apache/kudu/util/BloomFilter.java@139 PS1, Line 139: CityHash I'm curious why the decision to use CityHash here? We are using murmur hashes for partitioning, and internal to the tserver using CityHash for blooms on PK lookups. Is there an advantage to this bloom filter matching the hash used by the blooms on the server side? I can't quite see how it would be used. If the choice of hash is arbitrary, maybe better to just stick with Murmur since we already included the library? Or maybe we can benchmark some alternatives for this use case and see which is best? I think performance on small strings is most important, even if the hash function is lower "quality". See https://bigdata.uni-saarland.de/publications/p249-richter.pdf for some interesting discussion Is there any way we can use the bloom filters to eliminate entire partitions while scanning if the bloom is calculated on the partitioning key? I'm not sure if the bloom could be used directly, but I wonder if this would be useful for joins in some cases (eg by just calculating a simple bitmap of matching partitions based on knowledge of the hash partitioning on the build side of the join) -- To view, visit http://gerrit.cloudera.org:8080/11077 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I50c5dce4d24707c068c5ab94e5d311b12b3251b8 Gerrit-Change-Number: 11077 Gerrit-PatchSet: 1 Gerrit-Owner: [email protected] Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Comment-Date: Mon, 30 Jul 2018 16:05:08 +0000 Gerrit-HasComments: Yes
