On Thu, 15 Dec 2022, Paul Crovella wrote:

> On 12/15/2022 7:34 AM, Derick Rethans wrote:
> > https://wiki.php.net/rfc/unicode_text_processing
> 
> A few quick thoughts:
> 
> > The constructor will also convert the given text to Unicode Canonical Form.
> 
> By this do you mean Normalization Form C (NFC)? "Unicode Canonical Form" isn't
> a phrase I'm familiar with.

Yes. I've seen both phrases used, so I'll add NFC in brackets.

> Assuming so, are modified texts (e.g. via join, replaceText, reverse)
> re-normalized?

Yes, although I do not expect that to change anything, as normalisation 
usually happens *within* a grapheme, and not between them. I suspect there 
might be some Indian languages where that is proven wrong though.
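
A minimal sketch of one boundary case where re-normalising after a join does change the result: the right-hand piece starts with a combining mark. It uses Python's standard unicodedata module purely for illustration; `join_nfc` is a hypothetical helper, not part of the RFC.

```python
import unicodedata


def join_nfc(parts):
    """Concatenate the parts, then re-normalise the result to NFC."""
    return unicodedata.normalize("NFC", "".join(parts))


left = "e"        # U+0065, already in NFC on its own
right = "\u0301"  # U+0301 COMBINING ACUTE ACCENT, also NFC on its own

raw = left + right                # plain concatenation: two code points
joined = join_nfc([left, right])  # re-normalised: composes to U+00E9

print(len(raw), len(joined))
```

Both inputs are individually in NFC, but their plain concatenation is not, which is why a join would need the extra normalisation pass.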

> > The constructor will also strip out a BOM (Byte-Order-Mark) 
> > character, if present.
> 
> This is also known as ZWNBSP (Zero Width No-Break Space). Will only a 
> leading instance be stripped? If so, how can someone search for it (or 
> a substring beginning with it) given that:
> 
> > If an argument to any of the methods is listed as string|Text, passing in a
> > string value will have the same semantics as replacing the passed value with
> > new Text($string).
> 
> and all the search methods take `string|Text $search`.

I hadn't realised this is now used for both use cases. I've just read [1]:

"If the BOM character appears in the middle of a data stream, Unicode 
says it should be interpreted as a 'zero-width non-breaking space' 
(inhibits line-breaking between word-glyphs). In Unicode 3.2, this usage 
is deprecated in favor of the 'Word Joiner' character, U+2060. This 
allows U+FEFF to be used only as a BOM."

This suggests that stripping U+FEFF might not be a problem in practice. 
Would you have a better suggestion?
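
The search concern raised above can be sketched as follows, assuming that only a single leading U+FEFF is stripped (the thread does not confirm this). `to_text` is a hypothetical stand-in for `new Text($string)` and ignores NFC normalisation; Python is used only for illustration.

```python
BOM = "\ufeff"  # U+FEFF, BOM / ZERO WIDTH NO-BREAK SPACE


def to_text(value: str) -> str:
    """Hypothetical stand-in for the constructor: drop one leading U+FEFF."""
    return value[1:] if value.startswith(BOM) else value


haystack = to_text("foo" + BOM + "bar")  # the interior U+FEFF survives
needle = to_text(BOM + "bar")            # the leading U+FEFF is stripped

# The search now looks for "bar" alone, not for U+FEFF followed by "bar".
print(haystack.find(needle))
```

Under that assumption, a search string beginning with U+FEFF silently loses it, so the match lands one position after the U+FEFF that the caller meant to find.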

> ---
> 
> Why is this being introduced directly into PHP core rather than first an
> extension where it's easier to shake out the interface and behavior?

It will be developed as an extension inside the ext/ directory, pretty much 
like ext/standard or ext/date; but if it is not in core, very few people 
will use it, defeating the whole point of the effort.

cheers,
Derick

[1] https://en.wikipedia.org/wiki/Byte_order_mark#Usage

-- 
https://derickrethans.nl | https://xdebug.org | https://dram.io

Author of Xdebug. Like it? Consider supporting me: https://xdebug.org/support
Host of PHP Internals News: https://phpinternals.news

mastodon: @derickr@phpc.social @xdebug@phpc.social
twitter: @derickr and @xdebug
-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
