difin commented on code in PR #5792:
URL: https://github.com/apache/hive/pull/5792#discussion_r2069608953
##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergAcidUtil.java:
##########
@@ -243,14 +247,49 @@ public static long getDeleteFilePosition(Record rec) {
     return rec.get(DELETE_FILE_META_COLS.get(MetadataColumns.ROW_POSITION), Long.class);
   }
+  private static long hashObjectArray(Object[] values) {
+    Hasher hasher = Hashing.murmur3_128().newHasher();
+
+    for (Object val : values) {
+      if (val == null) {
+        // Unique constant for null
+        hasher.putInt(0xDEADBEEF);
+      } else if (val instanceof Integer) {
+        hasher.putInt((Integer) val);
+      } else if (val instanceof Long) {
+        hasher.putLong((Long) val);
+      } else if (val instanceof String) {
+        hasher.putString((String) val, StandardCharsets.UTF_8);
+      } else if (val instanceof Boolean) {
+        hasher.putBoolean((Boolean) val);
+      } else if (val instanceof Short) {
+        hasher.putShort((Short) val);
+      } else if (val instanceof Byte) {
+        hasher.putByte((Byte) val);
+      } else if (val instanceof Character) {
+        hasher.putChar((Character) val);
+      } else if (val instanceof Double) {
+        hasher.putDouble((Double) val);
+      } else if (val instanceof Float) {
+        hasher.putFloat((Float) val);
+      } else {
+        // Fallback to object's string representation
+        hasher.putLong(Objects.hash(val));
+      }
+    }
+
+    HashCode hashCode = hasher.hash();
+    return hashCode.asLong();
+  }
+
   public static long computeHash(StructLike struct) {

Review Comment:
I re-implemented this with Iceberg's code as you suggested and tested it. The q-test results for bucketing and partition evolution with nulls were even worse than with the original approach: it leads to collisions between 0 and null.
I pushed this version of the code to a separate branch: https://github.com/difin/hive/commit/a03bf02eeebdee7e2a16400603b85fc8ed12c6fd

Here is a portion of the q.out file showing the duplicates:
```
PREHOOK: query: SELECT * FROM default.srcbucket_big ORDER BY id
PREHOOK: type: QUERY
PREHOOK: Input: default@srcbucket_big
@@ -316,10 +316,52 @@
NULL	val_102	2
105	val_105	5
NULL	NULL	6
101	val_101	7
101	val_101	7
101	val_101	7
101	val_101	7
101	val_101	7
101	val_101	7
...
```

Iceberg's hashing function falls back to the same `Objects::hashCode` method that I'm trying to get rid of in this PR, because it leads to collisions (`Objects.hashCode(null)` returns 0, the same value as the hash code of the integer 0):
```
static <T> JavaHash<T> forType(Type type) {
  switch (type.typeId()) {
    case STRING:
      return JavaHashes.strings();
    case STRUCT:
      return JavaHashes.struct(type.asStructType());
    case LIST:
      return JavaHashes.list(type.asListType());
    default:
      return Objects::hashCode;
  }
}
```
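For reference, the 0/null collision can be reproduced with the JDK alone. This is a minimal sketch, not code from the PR or from Iceberg: the class name `HashCollisionDemo` and the helper `javaHash` are hypothetical, and it assumes a non-string primitive column ends up in the `default` branch quoted above.

```java
import java.util.Objects;

public class HashCollisionDemo {

  // Hypothetical helper that mirrors the `default: return Objects::hashCode` branch.
  static int javaHash(Object value) {
    return Objects.hashCode(value);
  }

  public static void main(String[] args) {
    // Objects.hashCode returns 0 for a null argument.
    System.out.println("null -> " + javaHash(null));
    // Integer.hashCode() is the int value itself, so 0 also hashes to 0.
    System.out.println("0    -> " + javaHash(0));
    // Long.hashCode() XORs the upper and lower 32 bits, so 0L hashes to 0 as well.
    System.out.println("0L   -> " + javaHash(0L));
    System.out.println("collision: " + (javaHash(null) == javaHash(0)));
  }
}
```

Since all three inputs produce the same hash, a row whose key is NULL and a row whose key is 0 cannot be told apart by this hash alone, which is consistent with the duplicates in the q.out excerpt.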