goel-skd commented on issue #613:
URL: https://github.com/apache/iceberg-cpp/issues/613#issuecomment-4713612373

   **On the library question:** ICU would definitely work, but it feels pretty 
heavy to pull in just for case folding on field names. `utf8proc` might be a 
better fit. It's tiny and used by Arrow already. Worth at least considering 
before committing to ICU.
   
   One thing I'd flag on the "match Python/Java" goal: full parity probably 
isn't reachable. Java lowercases with `Locale.ROOT` and Python uses 
`str.lower()`, and those two don't even agree on every character (the Turkish 
dotted/dotless `I` is the usual example).
   
   And as @nvartolomei noted above, Glue itself isn't consistent about whether 
it lowercases at all. So I think the realistic target is *"be Unicode-aware and 
match Java's `Locale.ROOT` behavior"* rather than byte-for-byte agreement with 
every catalog.
   
   Separately, the current `ToLower` calls `std::tolower` on raw `char` values. 
Where `char` is signed, any non-ASCII byte is negative, and passing it is 
technically UB (the argument has to be representable as `unsigned char` or 
equal `EOF`), so even independent of the Unicode work it'd be good to fix that.
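
   For illustration, a minimal sketch of the narrow UB fix on its own, not the Unicode-aware version. `ToLowerAscii` is a hypothetical name, not the existing iceberg-cpp function, and I'm assuming the input is UTF-8, so bytes `>= 0x80` are left untouched:

```cpp
#include <cctype>
#include <string>
#include <string_view>

// Hypothetical helper for illustration; not the actual iceberg-cpp code.
// Lowercases ASCII letters only and leaves every byte >= 0x80 unchanged,
// so multi-byte UTF-8 sequences are never corrupted.
std::string ToLowerAscii(std::string_view input) {
  std::string result(input);
  for (char& c : result) {
    // Convert to unsigned char before calling std::tolower: a negative
    // char value (other than EOF) is undefined behavior for <cctype>.
    const auto uc = static_cast<unsigned char>(c);
    // Skip non-ASCII bytes so a non-"C" locale cannot rewrite them.
    if (uc < 0x80) {
      c = static_cast<char>(std::tolower(uc));
    }
  }
  return result;
}
```

   This only removes the UB and keeps today's ASCII behavior. Non-ASCII field names would still need proper case folding via `utf8proc` or ICU.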
   
   Happy to take a stab at it if there's agreement on `utf8proc`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

