On Tue, 2014-10-21 at 21:42 +0100, Rowan Collins wrote:
> On 21/10/2014 08:06, Joe Watkins wrote:
> > Morning internalz,
> >
> > https://wiki.php.net/rfc/ustring
> >
> > This is the result of work done by a few of us, we won't be opening any
> > vote in a fortnight. We have a long time before 7, there is no rush
> > whatever.
> >
> > Now seems like a good time to start the conversation so we can hash out
> > the details, or get on with other things ;)
> >
> > Cheers
> > Joe
> >
> >
>
> I think this looks like a really great start at creating something
> actually useful, rather than getting stuck at the drawing board. I like
> that the scope is quite small initially - where does the "single
> responsibility" of a class that represents a string end, anyway? :)
>
> A few opinions:
>
> 1) Global / static defaults are bad.
>
> The existence of the setDefaultCodepage method feels like an
> anti-pattern to me. It means libraries can't rely on this class working
> the same way in two different host environments, or even at two
> re-entries in the same program. Effectively, if you don't know what the
> second argument to the constructor will default to, you can't actually
> treat it as optional unless you're writing monolithic code. This is a
> common pattern in PHP, but http_build_query() would be so much more
> pleasant if I could safely call it with 1 argument instead of 3.
>
> I think the default should be hard-coded to UTF-8, which according to
> previous discussion is always the default *output* encoding, so would
> mean this would always work: $aUString = new UString( (string)$aUString
> ); Any other encoding will be dependent on, and known from, the context
> where the object is created - if grabbing data from an HTTP request, a
> header should tell them; if from a database, a connection parameter; and
> so on.
>
Could be true; it feels quite horrible to me today too. I think someone
else suggested it, but it might have been me.
I'll look at doing something about that ...
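
For illustration only, here is a rough sketch of what a hard-coded UTF-8
default could look like. Utf8DefaultString is a made-up name, not the
RFC's UString class, and the sketch assumes PHP 7+ with the iconv
extension loaded. The point is just that the default encoding is a
constant in the signature, so no global setter can change it between
calls.

```php
<?php
// Hypothetical sketch: NOT the UString class from the RFC.
// Assumes PHP 7+ and the iconv extension.
final class Utf8DefaultString
{
    /** @var string Text held as UTF-8 (an arbitrary choice for this sketch) */
    private $utf8;

    // The default encoding is fixed in the signature; there is no
    // setDefaultCodepage()-style global that could alter it.
    public function __construct(string $bytes, string $encoding = 'UTF-8')
    {
        $converted = iconv($encoding, 'UTF-8', $bytes);
        if ($converted === false) {
            throw new InvalidArgumentException("Input is not valid $encoding");
        }
        $this->utf8 = $converted;
    }

    // Casting to string always yields UTF-8 bytes.
    public function __toString(): string
    {
        return $this->utf8;
    }
}

// Input in a non-default encoding must name that encoding explicitly.
$fromLatin1 = new Utf8DefaultString("caf\xE9", 'ISO-8859-1');

// Round trip through a plain string works with one argument.
$roundTrip = new Utf8DefaultString((string) $fromLatin1);
var_dump((string) $roundTrip === "caf\xC3\xA9"); // bool(true)
```
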
> The only case I can see where a default encoding would be sensible would
> be where source code itself is in a different encoding, so that
> u('literal string') works as expected. I guess if we ever went down the
> route of special literal syntax like u'literal string', the declared
> source encoding could be used.
>
> Actually, the u() shortcut function appears to be missing the encoding
> parameter completely; is this deliberate?
>
Fixed that.
> 2) Clarify relationship to a "byte string"
>
> Most of the API acts like this is an abstract object representing a
> bunch of Unicode code points. As such, I'm not sure what getCodepage()
> does - a code page (or more properly encoding) is a property of a stream
> of bytes, so has no meaning in this context, surely? The internal
> implementation could use UTF-8, UTF-16, or some made-up encoding (like
> Perl6's "NFG" system) and the user should never need to know (other than
> to understand performance implications).
>
> On the other hand, when you *do* want a stream of bytes, the class
> doesn't seem to have an explicit way to get one. The (currently
> undocumented) behaviour is apparently to spit out UTF-8 if cast to a
> string, but it would be nice to have an explicit function which could be
> passed a parameter in order to serialise to, say, UTF-16, instead.
>
I reused the terminology used by ICU; it made sense in their
documentation.
So we want a ::getBytes or something like that ... I'll do that ...
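
As a rough idea of the behaviour being asked for (not the eventual
method): the helper below is a hypothetical free function, its name is
invented for this example, and it assumes PHP 7+ with the iconv
extension. It takes text held as UTF-8 and serialises it to whatever
encoding the caller names.

```php
<?php
// Hypothetical helper, not part of the RFC: serialise UTF-8 text to
// an explicitly named encoding. Assumes PHP 7+ and the iconv extension.
function ustring_sketch_get_bytes(string $utf8, string $encoding = 'UTF-8'): string
{
    $bytes = iconv('UTF-8', $encoding, $utf8);
    if ($bytes === false) {
        throw new InvalidArgumentException("Cannot represent text as $encoding");
    }
    return $bytes;
}

$text = "h\xC3\xA9"; // "h" followed by U+00E9, stored as UTF-8

echo bin2hex(ustring_sketch_get_bytes($text)), "\n";             // 68c3a9
echo bin2hex(ustring_sketch_get_bytes($text, 'UTF-16BE')), "\n"; // 006800e9
```
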
> 3) The Grapheme Question
>
> This has been raised a few times, so I won't labour the point, just
> mention my current thinking.
>
> Unicode is complicated. Partly, that's because of a series of
> compromises in its design; but partly, it's because writing systems are
> complicated, and Unicode tries harder than most previous systems to
> acknowledge that. So, there's a tradeoff to be made between giving users
> what they think they need, thus hiding the messy details, and giving
> users the power to do things right, in a more complex way.
>
> There is also a namespace mess if you insist on every function and
> property having to declare what level of abstraction it's talking about
> - e.g. $codePointLength instead of $length.
>
> An idea I've been toying with is rather than having one class
> representing the slippery notion of "a Unicode string", having (at
> least) two, closely tied, classes: CodePointString (roughly = UString
> right now) and GraphemeString (a higher level abstraction tied to the
> same internal representation).
>
> I intend to mock this up as a set of interfaces at some point, but the
> basic idea is that you could write this:
>
> // Get an abstract object from a byte string, probably a GraphemeString,
> // parsing the input as UTF-8
> $str = u('some text');
> // Perform an operation that explicitly deals in Code Points
> $str = $str->asCodePoints()->normalise('NFC');
> // Get information using a higher level of abstraction
> $length = $str->asGraphemes()->length;
> // Perform a high-level mutation, then convert right back to a concrete
> // string of bytes
> echo $str->asGraphemes()->reverse()->asByteString('UTF-16');
>
> Calling asGraphemes() on a GraphemeString or asCodePoints() on a
> CodePointString would be legal but a no-op, so it would be safe to
> accept both as input to a function, then switch to whichever level the
> task required.
>
> I'm not sure if this finds a good balance between complexity and
> user-friendliness, and would welcome anyone's thoughts.
>
I'd rather higher level stuff existed at a higher level. I'd rather
solve for ustring the problems that are solved for normal strings and
leave the rest up to whatever the framework/component/library wants
to do.
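
To make the code point / grapheme distinction concrete, here is a small
userland sketch. It uses only PCRE's /u mode and \X (extended grapheme
cluster), assumes PHP 7+ for the "\u{...}" escape, and the helper name
is invented for the example. It is not how the RFC or the proposed
classes would implement anything.

```php
<?php
// Hypothetical helper: split UTF-8 text with a pattern, reverse the pieces.
function reverse_units(string $pattern, string $utf8): string
{
    preg_match_all($pattern, $utf8, $matches);
    return implode('', array_reverse($matches[0]));
}

// "n", then "e" followed by U+0301 COMBINING ACUTE ACCENT.
$text = "ne\u{0301}";

// Three levels of "length" for the same text.
var_dump(strlen($text));                   // int(4)  bytes
var_dump(preg_match_all('/./su', $text));  // int(3)  code points
var_dump(preg_match_all('/\X/u', $text));  // int(2)  graphemes

// Reversing by code point detaches the accent from its base letter...
var_dump(reverse_units('/./su', $text) === "\u{0301}en"); // bool(true)

// ...while reversing by grapheme keeps "e" and its accent together.
var_dump(reverse_units('/\X/u', $text) === "e\u{0301}n"); // bool(true)
```
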
> --
> Rowan Collins
> [IMSoP]
>
>
--
PHP Internals - PHP Runtime Development Mailing List