On 2018-08-17 at 10:36 -0000, Jasen Betts via Exim-dev wrote:
> > and add ulength_1 for being UTF-8 aware?
>
> Would also need utf8-aware also substr and strlen.

Yes, I was using length as an exemplar, not as an exhaustive list. :)

I favored ulength too, but didn't want to just add a slew of new
expansion operators, items and conditions without at least mentioning it
somewhere first.

> is it going to count code-points or glyphs?

Code-points. Exim has no business knowing about how a layout engine
might or might not choose to render code-points to glyphs.

I could see a possibility for normalization handling as another
function, for correct SASLprep for authentication. I'd really rather
not, though. Exim is setuid root and the main system for handling such
things, ICU, does lots of tricky sensitive stuff with a history of
security problems.

> > Look at the top-bit being set and assume UTF-8, or
> > will that break too much with all the places which are still ISO-8859-1?
>
> Just looking at that bit won't tell you enough to count code-points or
> glyphs.

I know, this was a suggestion for determining if the string should be
treated as UTF-8 for changing the current expansion o/i/c features; it
sucks but it was the only viable alternative I could think of and I
wanted to at least present an _idea_ of something else, for inciting
feedback.

I know a fair bit about UTF-8 internals and how to work with the various
aspects in multiple programming languages. :)

> parts of ${utf8clean can probably be re-used.

Yes, I thought of that, when pondering a new `utf8valid` expansion
condition.

> "${lc" "${uc" and "${if eqi" need consideration too

Only if we go the ICU route and include normalization forms. Which ...
is more bloat than I'm happy with in Exim's current architecture.

-Phil
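
For anyone following the thread, here is a rough C sketch of the two ideas discussed above: the "is the top bit set anywhere?" heuristic, and counting code-points rather than bytes. The function names `has_high_bit` and `utf8_codepoints` are made up for illustration and are not existing Exim functions. The counter assumes its input is already valid UTF-8 (say, after something like ${utf8clean has run); it does no validation of its own.

```c
#include <stddef.h>

/* Hypothetical helper, not Exim code: returns 1 if any byte in s[0..n)
   has its top bit set, i.e. the string is not pure 7-bit ASCII. */
static int has_high_bit(const unsigned char *s, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (s[i] & 0x80)
      return 1;
  return 0;
}

/* Hypothetical helper, not Exim code: counts code-points in s[0..n),
   ASSUMING the bytes are valid UTF-8.  Continuation bytes have the form
   10xxxxxx, so every byte that is not a continuation byte starts a new
   code-point. */
static size_t utf8_codepoints(const unsigned char *s, size_t n)
{
  size_t count = 0;
  for (size_t i = 0; i < n; i++)
    if ((s[i] & 0xC0) != 0x80)
      count++;
  return count;
}
```

Given the five bytes "caf\xC3\xA9" (UTF-8), has_high_bit() reports a set top bit and utf8_codepoints() returns 4. Given the four bytes "caf\xE9" (the same word in ISO-8859-1), has_high_bit() also reports a set top bit, which is the weakness of the heuristic: the top bit alone cannot tell the two encodings apart, so some validity check would still be needed before trusting the count.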