[GitHub] [iceberg] kbendick commented on a diff in pull request #4937: [Specification] Clarify truncate for Iceberg strings is based on code points

GitBox Wed, 01 Jun 2022 18:29:13 -0700


kbendick commented on code in PR #4937:
URL: https://github.com/apache/iceberg/pull/4937#discussion_r887422988



##########
format/spec.md:
##########
@@ -343,12 +343,13 @@ For hash function details by type, see Appendix B.
 | **`int`**     | `W`, width            | `v - (v % W)`        remainders must be positive     [1]                    | `W=10`: `1` ￫ `0`, `-1` ￫ `-10`  |
 | **`long`**    | `W`, width            | `v - (v % W)`        remainders must be positive     [1]                    | `W=10`: `1` ￫ `0`, `-1` ￫ `-10`  |
 | **`decimal`** | `W`, width (no scale) | `scaled_W = decimal(W, scale(v))` `v - (v % scaled_W)`               [1, 2] | `W=50`, `s=2`: `10.65` ￫ `10.50` |
-| **`string`**  | `L`, length           | Substring of length `L`: `v.substring(0, L)`                     | `L=3`: `iceberg` ￫ `ice`         |
+| **`string`**  | `L`, length           | Substring of length `L`: `v.substring(0, L)` [3]                    | `L=3`: `iceberg` ￫ `ice`         |
 
 Notes:
 
 1. The remainder, `v % W`, must be positive. For languages where `%` can produce negative values, the correct truncate function is: `v - (((v % W) + W) % W)`
 2. The width, `W`, used to truncate decimal values is applied using the scale of the decimal column to avoid additional (and potentially conflicting) parameters.
+3. For strings truncation is based on on unicode code point length (not byte length).

Review Comment:
   Nit: I would say “String truncation is based on Unicode code point length, 
which does not always correspond to byte length.” Code point length often does 
correspond to byte length, so the current wording might add unnecessary 
confusion for developers who have the luxury of only dealing with ASCII text.
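
To make the distinction concrete, here is a minimal Python sketch. It is not Iceberg code: the helper names `truncate_string` and `truncate_utf8_bytes` are made up for illustration, and it relies on the fact that Python's `str` is indexed by code point. For ASCII input the two approaches agree; for non-ASCII input, cutting by bytes can split a character.

```python
def truncate_string(value: str, length: int) -> str:
    """Keep the first `length` Unicode code points of `value`.

    Hypothetical helper for illustration; Python slices str by code point.
    """
    return value[:length]


def truncate_utf8_bytes(value: str, length: int) -> bytes:
    """Contrast case: keep the first `length` bytes of the UTF-8 encoding."""
    return value.encode("utf-8")[:length]


ascii_text = "iceberg"
mixed_text = "h\u00e9llo"  # U+00E9 is one code point but two bytes in UTF-8

# ASCII: code point length and byte length coincide.
print(truncate_string(ascii_text, 3), truncate_utf8_bytes(ascii_text, 3))

# Non-ASCII: two code points occupy three bytes, so a two-byte cut
# stops in the middle of the second character.
print(truncate_string(mixed_text, 2), truncate_utf8_bytes(mixed_text, 2))
```

In languages whose strings are indexed by something other than code points (for example UTF-16 code units), a plain substring call would need the same kind of care.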



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


