On Tue, 2014-10-21 at 21:42 +0100, Rowan Collins wrote:
> On 21/10/2014 08:06, Joe Watkins wrote:
> > Morning internalz,
> >
> > https://wiki.php.net/rfc/ustring
> >
> > This is the result of work done by a few of us, we won't be opening any
> > vote in a fortnight. We have a long time before 7, there is no rush
> > whatever.
> >
> > Now seems like a good time to start the conversation so we can hash out
> > the details, or get on with other things ;)
> >
> > Cheers
> > Joe
> >
> >
>
> I think this looks like a really great start at creating something
> actually useful, rather than getting stuck at the drawing board. I like
> that the scope is quite small initially - where does the "single
> responsibility" of a class that represents a string end, anyway? :)
>
> A few opinions:
>
> 1) Global / static defaults are bad.
>
> The existence of the setDefaultCodepage method feels like an
> anti-pattern to me. It means libraries can't rely on this class working
> the same way in two different host environments, or even at two
> re-entries in the same program. Effectively, if you don't know what the
> second argument to the constructor will default to, you can't actually
> treat it as optional unless you're writing monolithic code. This is a
> common pattern in PHP, but http_build_query() would be so much more
> pleasant if I could safely call it with 1 argument instead of 3.
>
> I think the default should be hard-coded to UTF-8, which according to
> previous discussion is always the default *output* encoding, so would
> mean this would always work:
> $aUString = new UString( (string)$aUString );
> Any other encoding will be dependent on, and known from, the context
> where the object is created - if grabbing data from an HTTP request, a
> header should tell them; if from a database, a connection parameter; and
> so on.
>
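To make the concern above concrete, here is a minimal sketch of how a mutable static default changes what a one-argument constructor call means. DemoString, setDefaultEncoding() and libraryParse() are hypothetical stand-ins for illustration; they are not the RFC's UString API, and no actual transcoding is performed.

```php
<?php
// Hypothetical stand-in for illustration only; this is not the RFC's UString.
final class DemoString
{
    /** Process-wide mutable default, mirroring the setDefaultCodepage idea. */
    private static $defaultEncoding = 'UTF-8';

    private $bytes;
    private $encoding;

    public static function setDefaultEncoding(string $encoding): void
    {
        self::$defaultEncoding = $encoding;
    }

    public function __construct(string $bytes, ?string $encoding = null)
    {
        $this->bytes = $bytes;
        // The meaning of the one-argument form depends on global state.
        $this->encoding = $encoding ?? self::$defaultEncoding;
    }

    public function getEncoding(): string
    {
        return $this->encoding;
    }
}

// A library author writes this expecting the input to be read as UTF-8...
function libraryParse(string $bytes): DemoString
{
    return new DemoString($bytes);
}

$before = libraryParse("caf\xC3\xA9")->getEncoding();

// ...but the host application changes the default somewhere else.
DemoString::setDefaultEncoding('ISO-8859-1');
$after = libraryParse("caf\xC3\xA9")->getEncoding();

// The identical library call is now interpreted under a different encoding.
var_dump($before === $after);
```

With a hard-coded default, or with the encoding always passed explicitly, libraryParse() would behave the same in every host.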
Could be true, it feels quite horrible to me today too, I think someone
else suggested it, but it might have been me. I'll look at doing
something about that ...

> The only case I can see where a default encoding would be sensible would
> be where source code itself is in a different encoding, so that
> u('literal string') works as expected. I guess if we ever went down the
> route of special literal syntax like u'literal string', the declared
> source encoding could be used.
>
> Actually, the u() shortcut function appears to be missing the encoding
> parameter completely; is this deliberate?
>

Fixed that.

> 2) Clarify relationship to a "byte string"
>
> Most of the API acts like this is an abstract object representing a
> bunch of Unicode code points. As such, I'm not sure what getCodepage()
> does - a code page (or more properly encoding) is a property of a stream
> of bytes, so has no meaning in this context, surely? The internal
> implementation could use UTF-8, UTF-16, or some made-up encoding (like
> Perl6's "NFG" system) and the user should never need to know (other than
> to understand performance implications).
>
> On the other hand, when you *do* want a stream of bytes, the class
> doesn't seem to have an explicit way to get one. The (currently
> undocumented) behaviour is apparently to spit out UTF-8 if cast to a
> string, but it would be nice to have an explicit function which could be
> passed a parameter in order to serialise to, say, UTF-16, instead.
>

I reused the terminology used by ICU, it made sense in their
documentation.

So we want a ::getBytes or something like that ... I'll do that ...

> 3) The Grapheme Question
>
> This has been raised a few times, so I won't labour the point, just
> mention my current thinking.
>
> Unicode is complicated. Partly, that's because of a series of
> compromises in its design; but partly, it's because writing systems are
> complicated, and Unicode tries harder than most previous systems to
> acknowledge that.
> So, there's a tradeoff to be made between giving users
> what they think they need, thus hiding the messy details, and giving
> users the power to do things right, in a more complex way.
>
> There is also a namespace mess if you insist on every function and
> property having to declare what level of abstraction it's talking about
> - e.g. $codePointLength instead of $length.
>
> An idea I've been toying with is rather than having one class
> representing the slippery notion of "a Unicode string", having (at
> least) two, closely tied, classes: CodePointString (roughly = UString
> right now) and GraphemeString (a higher level abstraction tied to the
> same internal representation).
>
> I intend to mock this up as a set of interfaces at some point, but the
> basic idea is that you could write this:
>
> // Get an abstract object from a byte string, probably a GraphemeString,
> // parsing the input as UTF-8
> $str = u('some text');
> // Perform an operation that explicitly deals in Code Points
> $str = $str->asCodePoints()->normalise('NFC');
> // Get information using a higher level of abstraction
> $length = $str->asGraphemes()->length;
> // Perform a high-level mutation, then convert right back to a concrete
> // string of bytes
> echo $str->asGraphemes()->reverse()->asByteString('UTF-16');
>
> Calling asGraphemes() on a GraphemeString or asCodePoints() on a
> CodePointString would be legal but a no-op, so it would be safe to
> accept both as input to a function, then switch to whichever level the
> task required.
>
> I'm not sure if this finds a good balance between complexity and
> user-friendliness, and would welcome anyone's thoughts.
>

I'd rather higher level stuff existed at a higher level. I'd rather
solve for ustring the problems that are solved for normal strings, and
leave the rest up to whatever the framework/component/library wants to
do.
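For anyone following the thread, the two-level idea quoted above could be mocked up in userland roughly as follows. This is only a sketch under assumptions: the interface name UnicodeText and the BaseText helper are invented here, length() is a method rather than a property, input is assumed to be valid UTF-8, asByteString() returns UTF-8 only (no transcoding), and normalise() is left out. Grapheme splitting relies on PCRE's \X, which is bundled with PHP.

```php
<?php
// Hypothetical sketch; method names follow the quoted example, not the RFC.
interface UnicodeText
{
    public function asCodePoints(): CodePointString;
    public function asGraphemes(): GraphemeString;
    public function length(): int;
    public function reverse(): UnicodeText;
    public function asByteString(): string; // UTF-8 only in this sketch
}

abstract class BaseText implements UnicodeText
{
    protected $utf8;

    public function __construct(string $utf8)
    {
        $this->utf8 = $utf8;
    }

    /** Split the shared UTF-8 buffer into units at this abstraction level. */
    abstract protected function units(): array;

    public function asCodePoints(): CodePointString
    {
        // No-op when already at the code point level.
        return $this instanceof CodePointString ? $this : new CodePointString($this->utf8);
    }

    public function asGraphemes(): GraphemeString
    {
        // No-op when already at the grapheme level.
        return $this instanceof GraphemeString ? $this : new GraphemeString($this->utf8);
    }

    public function length(): int
    {
        return count($this->units());
    }

    public function reverse(): UnicodeText
    {
        return new static(implode('', array_reverse($this->units())));
    }

    public function asByteString(): string
    {
        return $this->utf8;
    }
}

final class CodePointString extends BaseText
{
    protected function units(): array
    {
        // An empty pattern with /u splits between code points.
        return preg_split('//u', $this->utf8, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    }
}

final class GraphemeString extends BaseText
{
    protected function units(): array
    {
        // \X matches one extended grapheme cluster.
        preg_match_all('/\X/u', $this->utf8, $matches);
        return $matches[0];
    }
}

// 'e' followed by U+0301 COMBINING ACUTE ACCENT, then 'a'.
$str = new GraphemeString("e\xCC\x81a");

$codePointLength = $str->asCodePoints()->length();
$graphemeLength  = $str->asGraphemes()->length();
$reversed        = $str->asGraphemes()->reverse()->asByteString();
```

Here the code point view counts three units and the grapheme view two, and the grapheme-level reverse keeps the accent attached to its base letter.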
> --
> Rowan Collins
> [IMSoP]
>

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php