On Thu, 15 Dec 2022, Paul Crovella wrote:

> On 12/15/2022 7:34 AM, Derick Rethans wrote:
> > https://wiki.php.net/rfc/unicode_text_processing
>
> A few quick thoughts:
>
> > The constructor will also convert the given text to Unicode Canonical Form.
>
> By this do you mean Normalization Form C (NFC)? "Unicode Canonical Form" isn't
> a phrase I'm familiar with.

Yes. I've seen both phrases used, so I'll add NFC in brackets.

> Assuming so, are modified texts (e.g. via join, replaceText, reverse)
> re-normalized?

Yes, although I do not expect that to change anything, as normalisation
usually happens *in* a grapheme, and not between them. I suspect there
might be some Indian languages where that is proven wrong though.

> > The constructor will also strip out a BOM (Byte-Order-Mark)
> > character, if present.
>
> This is also known as ZWNBSP (Zero Width No-Break Space). Will only a
> leading instance be stripped? If so, how can someone search for it (or
> a substring beginning with it) given that:
>
> > If an argument to any of the methods is listed as string|Text, passing in a
> > string value will have the same semantics as replacing the passed value with
> > new Text($string).
>
> and all the search methods take `string|Text $search`.

I hadn't realised this is now used for both use cases. I've just read[1]

	"If the BOM character appears in the middle of a data stream,
	Unicode says it should be interpreted as a "zero-width
	non-breaking space" (inhibits line-breaking between
	word-glyphs). In Unicode 3.2, this usage is deprecated in favor
	of the "Word Joiner" character, U+2060.[1] This allows U+FEFF to
	be used only as a BOM."

This indicates that this might not be a problem. Would you have a better
suggestion?

> ---
>
> Why is this being introduced directly into PHP core rather than first an
> extension where it's easier to shake out the interface and behavior?

It will be developed as an extension inside the ext/ branch, pretty much
like ext/standard or ext/date; but if it is not in core, very few people
will use it, defeating the whole point of the effort.

cheers,
Derick

[1] https://en.wikipedia.org/wiki/Byte_order_mark#Usage

-- 
https://derickrethans.nl | https://xdebug.org | https://dram.io

Author of Xdebug. Like it? Consider supporting me: https://xdebug.org/support
Host of PHP Internals News: https://phpinternals.news

mastodon: @derickr@phpc.social @xdebug@phpc.social
twitter: @derickr and @xdebug

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
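
The two questions in this message, whether joined texts need
re-normalizing and what happens to a search needle that starts with
U+FEFF, can be tried out with a rough sketch. It is written in Python
because the RFC's Text class is only a proposal; make_text and
join_texts are hypothetical stand-ins for `new Text($string)` and a
re-normalizing join, not the RFC's actual API or behaviour.

```python
import unicodedata

BOM = "\ufeff"  # U+FEFF: byte-order mark, formerly also ZWNBSP


def make_text(s: str) -> str:
    """Hypothetical stand-in for `new Text($string)`.

    Assumption: only a single leading U+FEFF is stripped, then the
    string is converted to Normalization Form C (NFC).
    """
    if s.startswith(BOM):
        s = s[len(BOM):]
    return unicodedata.normalize("NFC", s)


def join_texts(parts) -> str:
    """Hypothetical join that re-normalizes its result to NFC."""
    return unicodedata.normalize("NFC", "".join(parts))


# Each piece is already in NFC on its own ...
base = make_text("e")
accent = make_text("\u0301")  # COMBINING ACUTE ACCENT
assert unicodedata.normalize("NFC", base) == base
assert unicodedata.normalize("NFC", accent) == accent

# ... but the plain concatenation is not: NFC composes the pair.
joined = join_texts([base, accent])
print(len(base + accent), len(joined))  # 2 1
print(joined == "\u00e9")               # True

# A needle made only of U+FEFF becomes empty under this assumption,
# while a U+FEFF in the middle of a string is left alone.
print(make_text(BOM) == "")                   # True
print(make_text("a\ufeffb") == "a\ufeffb")    # True
```

The "e" plus combining accent case is contrived, since a text that
starts with a bare combining mark is unusual, but it shows that NFC is
not closed under concatenation, so re-normalizing after a join is not
always a no-op. The last two lines show the search problem Paul
describes under the assumed leading-only stripping.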