On Monday, 10 March 2014 at 19:48:34 UTC, H. S. Teoh wrote:
On Mon, Mar 10, 2014 at 07:49:04PM +0100, Johannes Pfau wrote:
On Mon, 10 Mar 2014 11:30:07 -0700, Walter Bright <[email protected]> wrote:
> On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:
> > An idea to fix the whole problems I see with char[] being treated
> > specially by phobos: introduce an actual string type, with char[]
> > as backing, that is a dchar range, that actually dictates the
> > rules we want. Then, make the compiler use this type for literals.
>
> Proposals to make a string class for D have come up many times. I
> have a kneejerk dislike for it. It's a really strong feature for D
> to have strings be an array type, and I'll go to great lengths to
> keep it that way.

I'm on the fence about this one. The nice thing about strings being an
array type, is that it is a familiar concept to C coders, and it allows
array slicing for extracting substrings, etc., which fits nicely with
the C view of strings as character arrays. As a C coder myself, I like
it this way too. But the bad thing about strings being an array type, is
that it's a holdover from C, and it allows slicing for extracting
substrings -- malformed substrings by permitting slicing a multibyte
(multiword) character.
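
To make the slicing hazard concrete, here is a minimal sketch in D. The helper name isWellFormed is hypothetical (it is not part of Phobos); it only wraps std.utf.validate, which throws UTFException on malformed input. The sample string is an assumption chosen for illustration.

```d
import std.utf : validate, UTFException;

/// Hypothetical helper (not in Phobos): true if `s` is well-formed UTF-8.
bool isWellFormed(string s)
{
    try
    {
        validate(s);   // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

unittest
{
    string s = "h\u00E9llo";   // U+00E9 occupies two UTF-8 code units
    assert(s.length == 6);     // .length counts code units, not characters

    assert(!isWellFormed(s[0 .. 2]));  // slice ends in the middle of U+00E9
    assert(isWellFormed(s[0 .. 3]));   // slice ends on a code-point boundary
    assert(s[0 .. 3] == "h\u00E9");
}
```

Both slices compile and run without complaint; only the explicit validation step reveals that the first one is no longer valid text.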

Basically, the nice aspects of strings being arrays only apply when
you're dealing with ASCII (or mostly-ASCII) strings. These very same
"nice" aspects turn into problems when dealing with anything non-ASCII.
The only way the user can get it right using only array operations, is
if they understand the whole of Unicode in their head and are willing to
reinvent Unicode algorithms every time they slice a string or do some
operation on it. Since D purportedly supports Unicode by default, it
shouldn't be this way. D should *actually* support Unicode all the way
-- use proper Unicode algorithms for substring extraction, collation,
line-breaking, normalization, etc. Being a systems language, of course,
means that D should allow you to get under the hood and do things
directly with the raw string representation -- but this shouldn't be the
*default* modus operandi. The default should be a properly-encapsulated
string type with Unicode algorithms to operate on it (with the option of
reaching into the raw representation where necessary).
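
As a rough illustration of what such an encapsulated type could look like, here is a minimal sketch. The type UString and its member raw are hypothetical names invented for this example; nothing like it is proposed verbatim in the thread or exists in Phobos. It only iterates by code point using std.utf.decode and std.utf.stride, so it does not cover graphemes, collation, normalization or the other Unicode algorithms mentioned above.

```d
import std.utf : decode, stride;

/// Hypothetical type for illustration only; it is not a Phobos type.
struct UString
{
    private string data;   // raw UTF-8 storage

    this(string s) { data = s; }

    // Input-range primitives that step by whole code points (dchar).
    @property bool empty() { return data.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i);   // decodes the first code point
    }

    void popFront()
    {
        data = data[stride(data, 0) .. $];   // skip one full code point
    }

    // Escape hatch: direct access to the underlying code units.
    @property string raw() { return data; }
}

unittest
{
    auto s = UString("h\u00E9");
    assert(s.raw.length == 3);     // three code units underneath
    assert(s.front == 'h');
    s.popFront();
    assert(s.front == '\u00E9');   // one step consumes both code units
    s.popFront();
    assert(s.empty);
}
```

The design choice being sketched is that the default interface never exposes a position inside a code point, while raw keeps the array view available for code that needs it.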

You started off on the fence, but you seem pretty convinced by the end!