findepi commented on a change in pull request #3966:
URL: https://github.com/apache/iceberg/pull/3966#discussion_r801579852
##########
File path: core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java
##########
@@ -108,54 +108,70 @@ private ZOrderByteUtils() {
* Strings are lexicographically sortable BUT if different byte array
lengths will
* ruin the Z-Ordering. (ZOrder requires that a given column contribute the
same number of bytes every time).
* This implementation just uses a set size to for all output byte
representations. Truncating longer strings
- * and right padding 0 for shorter strings.
+ * and right padding 0 for shorter strings. Requires UTF8 (or ASCII)
encoding for ordering guarantees to hold.
Review comment:
Please don't add a comment. Just replace `String.getBytes()` with
`String.getBytes(java.nio.charset.StandardCharsets#UTF_8)`.
Of course, i don't know the rules of the project, but otherwise i don't see
why we would want to rely on JVM's default encoding. This is z-order, not
sorting, but z-order should be consistent with sorting, and varchar value
ordering is codepoint-based (in the absence of collation which doesn't exist in
Iceberg world). The UTF-8 encoding produces bytes that, if treated as unsigned,
lead to the same order as codepoint-based lexicographical ordering.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]