kbendick commented on code in PR #4937: URL: https://github.com/apache/iceberg/pull/4937#discussion_r887423432
##########
format/spec.md:
##########
@@ -343,12 +343,13 @@ For hash function details by type, see Appendix B.
 | **`int`** | `W`, width | `v - (v % W)` remainders must be positive [1] | `W=10`: `1` → `0`, `-1` → `-10` |
 | **`long`** | `W`, width | `v - (v % W)` remainders must be positive [1] | `W=10`: `1` → `0`, `-1` → `-10` |
 | **`decimal`** | `W`, width (no scale) | `scaled_W = decimal(W, scale(v))` `v - (v % scaled_W)` [1, 2] | `W=50`, `s=2`: `10.65` → `10.50` |
-| **`string`** | `L`, length | Substring of length `L`: `v.substring(0, L)` | `L=3`: `iceberg` → `ice` |
+| **`string`** | `L`, length | Substring of length `L`: `v.substring(0, L)` [3] | `L=3`: `iceberg` → `ice` |
 
 Notes:
 
 1. The remainder, `v % W`, must be positive. For languages where `%` can produce negative values, the correct truncate function is: `v - (((v % W) + W) % W)`
 2. The width, `W`, used to truncate decimal values is applied using the scale of the decimal column to avoid additional (and potentially conflicting) parameters.
+3. For strings truncation is based on on unicode code point length (not byte length).

Review Comment:
Or possibly reusing the language from here?
https://github.com/apache/iceberg/blob/5009949ba4377ac5a8572ff7ae70e886c9e33bec/api/src/main/java/org/apache/iceberg/util/UnicodeUtil.java#L39-L40
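
To make notes 1 and 3 in the hunk concrete, here is a minimal Python sketch of the two truncate behaviours being discussed. The function names are hypothetical and are not part of the Iceberg API; this only illustrates the formulas quoted in the diff. Python 3 `str` slicing happens to count Unicode code points, which is the behaviour note 3 describes, whereas Java's `String.substring` counts UTF-16 code units, so a Java implementation would need a code-point-aware helper instead.

```python
def truncate_int(v: int, w: int) -> int:
    """Truncate v down to a multiple of w, per note 1 (assumes w > 0)."""
    # ((v % w) + w) % w is non-negative for w > 0, whether the language's
    # `%` follows the sign of the dividend (C, Java) or the divisor (Python).
    return v - (((v % w) + w) % w)


def truncate_string(v: str, length: int) -> str:
    """Keep the first `length` Unicode code points of v, per note 3."""
    # Python 3 str slicing counts code points, not bytes or UTF-16 units.
    return v[:length]


def truncate_utf8_bytes(v: str, length: int) -> bytes:
    """Byte-based truncation, shown only for contrast with note 3."""
    return v.encode("utf-8")[:length]


if __name__ == "__main__":
    print(truncate_int(1, 10), truncate_int(-1, 10))
    print(truncate_string("iceberg", 3))
    print(truncate_string("日本語テキスト", 3))
    print(truncate_utf8_bytes("日本語テキスト", 3))
```

With `L=3`, code point truncation keeps three characters of a multi-byte string, while byte truncation keeps only the first character's three UTF-8 bytes. That difference is why the spec wording needs to say which length is meant.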
