gianm opened a new pull request, #15675: URL: https://github.com/apache/druid/pull/15675
This patch adds a new ImmutableLookupMap, which comes with an ImmutableLookupExtractor. It uses a fastutil open hashmap plus two lists to store its data in such a way that forward and reverse lookups can both be done quickly. I also observed footprint to be somewhat smaller than Java HashMap + MapLookupExtractor for a 1 million row lookup. The main advantage, though, is that reverse lookups can be done much more quickly than MapLookupExtractor (which iterates the entire map for each call to unapplyAll). This speeds up the recently added ReverseLookupRule (#15626) during SQL planning with very large lookups. --- More details on benchmarks and size analysis below. In the benchmark report, `keysPerValue` is the number of keys that map to each value. `lookupType` is either `hashmap` (MapLookupExtractor backed by Java HashMap, as we use today) or `immutable` (the new one). The test cases are `lookupApply` (forward lookup), `lookupUnapplyOne` (reverse lookup of a single value), and `lookupUnapplyOneThousand` (reverse lookup of one thousand values). I also took heap dumps from the benchmark datasets, to measure how much space the two implementations take. Findings: - `apply` is about the same speed in both implementations. - All the reverse lookup calls are substantially faster with `immutable`. Performance is closest for the `keysPerValue: 1000`, `lookupUnapplyOneThousand` case. In this case, the reverse lookup is returning the entire key space, because it's 1,000 values * 1,000 keys per value = 1,000,000 keys. - MapLookupExtractor backed by HashMap had a table with 2097152 entries. It measured 168,388,688 bytes total retained. - ImmutableLookupMap had a `keysToEntry` table with 2097153 entries, and `keys` and `values` lists with 1000000 entries. It measured 154,501,488 bytes total retained. It's smaller because it ends up being three string arrays and one int array, which compares favorably to the array of hashtable bucket objects that HashMap uses. Given these findings, I think it's a good idea to use the immutable version wherever we can. So, the patch also updates the main static lookups (URI and JDBC) to use it always. In a future patch, for getting faster reverse Kafka lookups, it's probably a good idea to implement a mutable lookup that supports faster reverse lookups, and run that only on the Broker. (Assuming it will end up being larger footprint than the `ConcurrentHashMap` we use today, we wouldn't want to run it on every data server. Only the Broker needs fast reverse lookups.) ``` Benchmark (keysPerValue) (lookupType) (numKeys) Mode Cnt Score Error Units LookupExtractorBenchmark.lookupApply 1 hashmap 1000000 thrpt 5 299273576.631 ± 4669084.716 ops/s LookupExtractorBenchmark.lookupApply 1 immutable 1000000 thrpt 5 306273517.396 ± 4563560.722 ops/s LookupExtractorBenchmark.lookupApply 1000 hashmap 1000000 thrpt 5 296967596.838 ± 8820926.069 ops/s LookupExtractorBenchmark.lookupApply 1000 immutable 1000000 thrpt 5 306457023.086 ± 2712603.499 ops/s LookupExtractorBenchmark.lookupUnapplyOne 1 hashmap 1000000 avgt 5 33.949 ± 0.916 ms/op LookupExtractorBenchmark.lookupUnapplyOne 1 immutable 1000000 avgt 5 ≈ 10⁻⁴ ms/op LookupExtractorBenchmark.lookupUnapplyOne 1000 hashmap 1000000 avgt 5 35.255 ± 1.230 ms/op LookupExtractorBenchmark.lookupUnapplyOne 1000 immutable 1000000 avgt 5 0.011 ± 0.001 ms/op LookupExtractorBenchmark.lookupUnapplyOneThousand 1 hashmap 1000000 avgt 5 32.402 ± 0.558 ms/op LookupExtractorBenchmark.lookupUnapplyOneThousand 1 immutable 1000000 avgt 5 0.140 ± 0.006 ms/op LookupExtractorBenchmark.lookupUnapplyOneThousand 1000 hashmap 1000000 avgt 5 33.245 ± 0.916 ms/op LookupExtractorBenchmark.lookupUnapplyOneThousand 1000 immutable 1000000 avgt 5 27.472 ± 2.916 ms/op ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
