Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23946 )

Change subject: IMPALA-14623: Optimize memory usage for Iceberg file path hashes
......................................................................


Patch Set 1:

(4 comments)

Thanks for working on this!

http://gerrit.cloudera.org:8080/#/c/23946/1/be/src/util/thash128-util.h
File be/src/util/thash128-util.h:

http://gerrit.cloudera.org:8080/#/c/23946/1/be/src/util/thash128-util.h@20
PS1, Line 20: #include <tuple>
Unnecessary include of 'tuple'.

But we should probably add:

 #include "gen-cpp/CatalogObjects_types.h"


http://gerrit.cloudera.org:8080/#/c/23946/1/common/thrift/CatalogObjects.thrift
File common/thrift/CatalogObjects.thrift:

http://gerrit.cloudera.org:8080/#/c/23946/1/common/thrift/CatalogObjects.thrift@667
PS1, Line 667: Murmur3
If we already change things, we should also switch to XXH3.


http://gerrit.cloudera.org:8080/#/c/23946/1/fe/src/main/java/org/apache/impala/util/IcebergUtil.java
File fe/src/main/java/org/apache/impala/util/IcebergUtil.java:

http://gerrit.cloudera.org:8080/#/c/23946/1/fe/src/main/java/org/apache/impala/util/IcebergUtil.java@748
PS1, Line 748:     // Divide 16 bytes into two longs.
             :     java.nio.ByteBuffer buffer = 
java.nio.ByteBuffer.wrap(hashBytes);
             :     long high = buffer.getLong();
             :     long low = buffer.getLong();
ByteBuffer adds unnecessary allocation overhead. I think we should optimize 
this method as much as we can because it will get invoked many times. We could 
use a manual algorithm:

 long high = 0;
 for (int i = 0; i < 8; i++) {
   high = (high << 8) | (hashBytes[i] & 0xFFL);
 }
 long low = 0;
 for (int i = 8; i < 16; i++) {
   low = (low << 8) | (hashBytes[i] & 0xFFL);
 }

Probably the fastest would be to use OpenHFT and XXH3:

 import net.openhft.hashing.LongTupleHashFunction;

 LongTupleHashFunction hasher = LongTupleHashFunction.xx_h3();
 long[] result = hasher.hashChars(path);
 return new Hash128(result[0], result[1]);

Out of curiosity, would it be possible to do some micro-benchmarking and share 
the results?


http://gerrit.cloudera.org:8080/#/c/23946/1/fe/src/test/java/org/apache/impala/util/Hash128Test.java
File fe/src/test/java/org/apache/impala/util/Hash128Test.java:

http://gerrit.cloudera.org:8080/#/c/23946/1/fe/src/test/java/org/apache/impala/util/Hash128Test.java@20
PS1, Line 20: *
nit: we prefer non-wildcard imports



--
To view, visit http://gerrit.cloudera.org:8080/23946
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie0de793de2434dae3b60c3aa4f87dba203eee3c1
Gerrit-Change-Number: 23946
Gerrit-PatchSet: 1
Gerrit-Owner: Arnab Karmakar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Fri, 06 Feb 2026 10:59:59 +0000
Gerrit-HasComments: Yes

Reply via email to