kbendick commented on code in PR #4937: URL: https://github.com/apache/iceberg/pull/4937#discussion_r887422988
##########
format/spec.md:
##########
@@ -343,12 +343,13 @@ For hash function details by type, see Appendix B.
 | **`int`** | `W`, width | `v - (v % W)` remainders must be positive [1] | `W=10`: `1` → `0`, `-1` → `-10` |
 | **`long`** | `W`, width | `v - (v % W)` remainders must be positive [1] | `W=10`: `1` → `0`, `-1` → `-10` |
 | **`decimal`** | `W`, width (no scale) | `scaled_W = decimal(W, scale(v))` `v - (v % scaled_W)` [1, 2] | `W=50`, `s=2`: `10.65` → `10.50` |
-| **`string`** | `L`, length | Substring of length `L`: `v.substring(0, L)` | `L=3`: `iceberg` → `ice` |
+| **`string`** | `L`, length | Substring of length `L`: `v.substring(0, L)` [3] | `L=3`: `iceberg` → `ice` |
 
 Notes:
 
 1. The remainder, `v % W`, must be positive. For languages where `%` can produce negative values, the correct truncate function is: `v - (((v % W) + W) % W)`
 2. The width, `W`, used to truncate decimal values is applied using the scale of the decimal column to avoid additional (and potentially conflicting) parameters.
+3. For strings truncation is based on on unicode code point length (not byte length).

Review Comment:
Nit: I would say “String truncation is based on Unicode code point length, which does not always correspond to byte length.”

It does often correspond to byte length, and this clarification might add some unnecessary confusion for average developers who might have the luxury of only dealing with ASCII text.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
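
For readers following the thread, the truncate rules quoted in the hunk above can be sketched roughly as below. This is only an illustration in Python, not the Iceberg implementation: the function names `truncate_int`, `truncate_decimal`, `truncate_string`, and `utf8_len` are hypothetical, and the decimal variant assumes a finite value whose exponent reflects the column scale. The last helper shows why code point length and byte length coincide for ASCII but diverge for other text.

```python
from decimal import Decimal


def truncate_int(v: int, w: int) -> int:
    # Portable form from note 1. Python's integer % is already non-negative
    # for positive w, so this equals v - (v % w) here.
    return v - (((v % w) + w) % w)


def truncate_decimal(v: Decimal, w: int) -> Decimal:
    # Note 2: apply the width at the scale of the value (assumed finite),
    # e.g. W=50 at scale 2 becomes 0.50.
    scaled_w = Decimal(w).scaleb(v.as_tuple().exponent)
    # Decimal % takes the sign of the dividend, so the positive-remainder
    # form from note 1 is genuinely needed for negative values.
    return v - (((v % scaled_w) + scaled_w) % scaled_w)


def truncate_string(v: str, length: int) -> str:
    # Python str is indexed by Unicode code points, so slicing truncates
    # by code point count rather than by encoded bytes.
    return v[:length]


def utf8_len(v: str) -> int:
    # Byte length of the UTF-8 encoding, for comparison with len(v).
    return len(v.encode("utf-8"))


if __name__ == "__main__":
    print(truncate_int(1, 10), truncate_int(-1, 10))
    print(truncate_decimal(Decimal("10.65"), 50))
    print(truncate_string("iceberg", 3), utf8_len("ice"))
    print(truncate_string("日本語テキスト", 3), utf8_len("日本語"))
```

With ASCII input such as `iceberg`, the three-code-point prefix is also three bytes long; with `日本語テキスト`, the three-code-point prefix occupies nine bytes in UTF-8, which is the case the proposed wording is meant to cover.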
