Zoltán Borók-Nagy created IMPALA-14623:
------------------------------------------
Summary: Use the raw bytes of the 128-bit Murmur hash of Iceberg
file paths
Key: IMPALA-14623
URL: https://issues.apache.org/jira/browse/IMPALA-14623
Project: IMPALA
Issue Type: Improvement
Components: Catalog, Frontend
Reporter: Zoltán Borók-Nagy
Currently, IcebergUtil computes the hash of an Iceberg file path with the
following method:
{noformat}
public static String getFilePathHash(String path) {
  Hasher hasher = Hashing.murmur3_128().newHasher();
  hasher.putUnencodedChars(path);
  return hasher.hash().toString();
}
{noformat}
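The hash is 16 raw bytes, but its hexadecimal String representation takes
2 * 16 = 32 characters. A Java char is 2 bytes, so the character data consumes
2 * 32 = 64 bytes, which is 4 times more than needed. (On JDK 9 and later,
compact strings store such Latin-1-only text at 1 byte per character, i.e.
32 bytes, which is still twice the raw size, before counting the String object
overhead.)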
For tables with a large number of files this can cause significant overhead.
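The size arithmetic above can be checked with a small standalone sketch. It is
not the proposed patch: to stay JDK-only it uses MD5 as a stand-in 128-bit hash
instead of Guava's murmur3_128, and the class name, helper names and sample path
are hypothetical. With Guava, the raw bytes would presumably come from
hasher.hash().asBytes() rather than hasher.hash().toString().

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PathHashSizes {
    // Stand-in 128-bit hash (MD5), chosen only because it ships with the JDK.
    // The issue itself is about Guava's murmur3_128, which also yields 16 bytes.
    static byte[] rawHash(String path) {
        try {
            return MessageDigest.getInstance("MD5")
                .digest(path.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Lowercase hex encoding: two characters per input byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical file path, for illustration only.
        byte[] raw = rawHash("s3a://bucket/tbl/data/file-0001.parquet");
        String hex = toHex(raw);
        System.out.println("raw bytes:    " + raw.length);                       // 16
        System.out.println("hex chars:    " + hex.length());                     // 32
        System.out.println("UTF-16 bytes: " + hex.length() * Character.BYTES);   // 64
    }
}
```

One design point to keep in mind: a byte[] has identity-based equals() and
hashCode(), so if the hash is ever used as a map key or compared by value, the
raw bytes would need a wrapper (for example ByteBuffer.wrap(bytes)) or explicit
Arrays.equals() calls.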