jihoonson commented on a change in pull request #9800:
URL: https://github.com/apache/druid/pull/9800#discussion_r421006133
##########
File path:
processing/src/main/java/org/apache/druid/query/filter/InDimFilter.java
##########
@@ -132,31 +145,40 @@ public FilterTuning getFilterTuning()
@Override
public byte[] getCacheKey()
{
- boolean hasNull = false;
- for (String value : values) {
- if (value == null) {
- hasNull = true;
- break;
+ if (cacheKey == null) {
+ final List<String> sortedValues = new ArrayList<>(values);
+ sortedValues.sort(Comparator.nullsFirst(Ordering.natural()));
+ final Hasher hasher = Hashing.sha256().newHasher();
Review comment:
The issue with storing all `values` in the cacheKey is that the cache
key can become huge as the size of `values` grows which can result in pollution
in the cache.
Regarding using SHA-256, I believe academical research is more trustable
than my testing. [According to the probability table, the required number of
hashed elements such that probability of at least one hash collision >= 10^−18
is 4.8×10^29 under assumption that the hash function is
perfect](https://en.wikipedia.org/wiki/Birthday_problem). This is negligible
when it comes to practice even though the SHA-256 hash function is not proven
to be perfect.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]