Re: If X sorts before Y, then XZ sorts before YZ ... example of where that's not true?

2013-01-07 Thread Christopher Fynn
On 07/01/2013, Costello, Roger L. coste...@mitre.org wrote:
 Hi Folks,

 In the book, Unicode Demystified (p. xxii) it says:

 An English-speaking  programmer might assume,
 for example, that given the three characters X, Y,
 and Z, that if X sorts before Y, then XZ sorts before
 YZ. This works for English, but fails for many
 languages.

 Would you give an example of where character 1 sorts before character 2 but
 character 1, character 3 does not sort before character 2, character 3?

 /Roger

Look at the collation for Dzongkha or Tibetan:

http://developer.mimer.com/charts/dzongkha.htm

http://developer.mimer.com/charts/tibetan.htm



Re: Why is endianness relevant when storing data on disks but not when in memory?

2013-01-07 Thread Leif Halvard Silli
Doug Ewell, Sun, 6 Jan 2013 20:57:58 -0700:
 We are pretty much going round and round on this. The bottom line for 
 me is, it would be nice if there were a shorthand way of saying 
 big-endian UTF-16, and many people (including you?) feel that 
 UTF-16BE is that way, but it is not. That term has a DIFFERENT 
 MEANING. The following stream:
 
 FE FF 00 48 00 65 00 6C 00 6C 00 6F
 
 is valid big-endian UTF-16, but it is NOT valid UTF-16BE unless the 
 leading U+FEFF is explicitly meant as a zero-width no-break space, 
 which may not be stripped.

I don't remember if the RFC defines one of the 3 MIME charsets as the 
default, but given that UTF-16 is supposed to be used whenever one 
doesn't know the endianness, it seems logical to assume that the 
above example defaults to being treated as UTF-16. But apart from that, 
we can also say that the example is not valid UTF-16 unless 
the U+FEFF is meant as a BOM …

I see the 3 as 3 MIME charsets. 

In any case, it seems like a question of definitions.
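
(For illustration, a minimal sketch in C of the distinction being discussed; 
the names are hypothetical and this is not any particular library's API. Under 
the unmarked UTF-16 encoding scheme a consumer checks for a BOM and otherwise 
defaults to big-endian, whereas under the UTF-16BE label no such check is 
made and a leading FE FF is simply the character U+FEFF.)

#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch: decide byte order for a byte stream labeled "UTF-16".
   A leading BOM (FE FF or FF FE) selects the order and is not part of the
   text; with no BOM, big-endian is the default. */
typedef enum { ORDER_BE, ORDER_LE } byte_order;

static byte_order detect_utf16_order(const uint8_t *b, size_t len,
                                     size_t *text_start)
{
    *text_start = 0;
    if (len >= 2) {
        if (b[0] == 0xFE && b[1] == 0xFF) { *text_start = 2; return ORDER_BE; }
        if (b[0] == 0xFF && b[1] == 0xFE) { *text_start = 2; return ORDER_LE; }
    }
    return ORDER_BE;   /* no BOM: default to big-endian */
}

/* Read the 16-bit code unit at byte offset i using the detected order. */
static uint16_t read_unit(const uint8_t *b, size_t i, byte_order order)
{
    return (order == ORDER_BE) ? (uint16_t)((b[i] << 8) | b[i + 1])
                               : (uint16_t)((b[i + 1] << 8) | b[i]);
}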
-- 
leif h silli




Re: Why is endianness relevant when storing data on disks but not when in memory?

2013-01-07 Thread Leif Halvard Silli
Doug Ewell, Sun, 6 Jan 2013 20:57:58 -0700:

 The bottom line for me is, it would be nice if there were a 
 shorthand way of saying big-endian UTF-16, and many people 
 (including you?) feel that UTF-16BE is that way, but it is not.

One could say "UTF-16, big-endian". Or "big-endian UTF-16". That’s 
pretty short.

 That term has a DIFFERENT MEANING. The following stream:
 
 FE FF 00 48 00 65 00 6C 00 6C 00 6F
 
 is valid big-endian UTF-16, but it is NOT valid UTF-16BE unless the 
 leading U+FEFF is explicitly meant as a zero-width no-break space, 
 which may not be stripped.

I believe I understand this reasonably well. I think we are looking for 
a term that is unaffected by how we label it.
leif halvard silli




Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Markus Scherer
Unicode libraries commonly provide functions that take a code point and
return a value, for example a property value. Such a function normally
accepts the whole range 0..10FFFF (and may even return a default value for
out-of-range inputs).

Also, we commonly read code points from 16-bit Unicode strings, and
unpaired surrogates are returned as themselves and treated as such (e.g.,
in collation). That would not be well-formed UTF-16, but it's generally
harmless in text processing.
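
For illustration only, a hypothetical helper (not any specific library's API) 
showing what such lenient reading of code points from a 16-bit Unicode string 
can look like in C:

#include <stdint.h>
#include <stddef.h>

/* Read the code point starting at index *i of a 16-bit Unicode string.
   A well-formed surrogate pair yields a supplementary code point;
   an unpaired surrogate is returned as itself (lenient handling). */
static uint32_t next_code_point(const uint16_t *s, size_t len, size_t *i)
{
    uint16_t lead = s[(*i)++];
    if (lead >= 0xD800 && lead <= 0xDBFF && *i < len) {
        uint16_t trail = s[*i];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            (*i)++;
            return 0x10000u + (((uint32_t)(lead - 0xD800) << 10) | (trail - 0xDC00));
        }
    }
    return lead;   /* BMP code point, or unpaired surrogate returned as itself */
}

A strict UTF-16 reader would instead signal an error or substitute U+FFFD at 
that last step.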

markus


RE: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Doug Ewell
Markus Scherer markus dot icu at gmail dot com wrote:

 Also, we commonly read code points from 16-bit Unicode strings, and
 unpaired surrogates are returned as themselves and treated as such
 (e.g., in collation). That would not be well-formed UTF-16, but it's
 generally harmless in text processing.

But still non-conformant.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell





Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Markus Scherer
On Mon, Jan 7, 2013 at 10:48 AM, Doug Ewell d...@ewellic.org wrote:

 Markus Scherer markus dot icu at gmail dot com wrote:

  Also, we commonly read code points from 16-bit Unicode strings, and
  unpaired surrogates are returned as themselves and treated as such
  (e.g., in collation). That would not be well-formed UTF-16, but it's
  generally harmless in text processing.

 But still non-conformant.


Not really, that's why there is a definition of a 16-bit Unicode string in
the standard.

markus


Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Mark Davis ☕
 But still non-conformant.

That's incorrect.

The point I was making above is that in order to say that something is
non-conformant, you have to be very clear what it is non-conformant *TO*.

 Also, we commonly read code points from 16-bit Unicode strings, and
 unpaired surrogates are returned as themselves and treated as such
 (e.g., in collation).

   - That *is* conformant for *Unicode 16-bit strings.*
   - That is *not* conformant for *UTF-16*.

There is an important difference.

Mark https://plus.google.com/114199149796022210033
— Il meglio è l’inimico del bene —


On Mon, Jan 7, 2013 at 10:48 AM, Doug Ewell d...@ewellic.org wrote:

 But still non-conformant.


RE: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Doug Ewell
You're right, and I stand corrected. I read Markus's post too quickly.

Mark Davis ☕ mark at macchiato dot com wrote:

 But still non-conformant.

 That's incorrect. 

 The point I was making above is that in order to say that something is 
 non-conformant, you have to be very clear what it is non-conformant TO.

 Also, we commonly read code points from 16-bit Unicode strings, and
 unpaired surrogates are returned as themselves and treated as such
 (e.g., in collation).

 + That is conformant for Unicode 16-bit strings.

 + That is not conformant for UTF-16.

 There is an important difference.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell





Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Philippe Verdy
Well then I don't know why you need a definition of a Unicode 16-bit
string. For me it means exactly the same as a 16-bit string, and the
encoding in it is not relevant, given that you can put anything in it
without even needing to be conformant to Unicode. So a Java string is
exactly the same thing, a 16-bit string. The same goes for Windows API
16-bit strings, or for "wide" strings in a C compiler where "wide" is
mapped by a compiler option to 16-bit code units for wchar_t (or short,
but more safely UINT16 if you don't want to depend on compiler options
or OS environments when compiling and need to manage the exact
memory allocation), or for a U-string in Perl.

Only UTF-16 (not UTF-16BE and UTF-16LE, which are encoding schemes with
concrete byte orders and no leading BOM) is relevant to Unicode,
because a 16-bit string does not itself specify any encoding scheme or
byte order.

One confusion comes with the name UTF-16 when it is also used as an
encoding scheme with a possible leading BOM and an implied default of
UTF-16LE determined by guesses on the first few characters: this
encoding scheme (with support for a BOM and an implicit guess of the
byte order if it is missing) should have been given a distinct encoding
name like 'UTF-16XE', reserving UTF-16 for what the standard discusses
as a 16-bit string, except that it should still require UTF-16
conformance (no unpaired surrogates and no non-characters) plus **no**
BOM supported at this level (which is still not materialized by a
concrete byte order or by an implicit size in storage bits, as long as
it can store distinctly the whole range of code units 0x0000..0xFFFF
minus the few non-characters, enforces all surrogates to be paired, but
does not enforce any character to be allocated).

Note that such a relaxed version of UTF-16 would still allow an internal
alternate representation of 0x0000 for interoperating with various
APIs without changing the storage requirement: 0xFFFF could perfectly
well be used to replace 0x0000 if that last code unit plays a special
role as a string terminator. But even if this is done, a storage unit
like 0xFFFF would still be perceived as if it were really the code unit
0x0000.

In other words, the concept of a completely relaxed "Unicode 16-bit
string" is unneeded, given that its single requirement is that it
defines a length in terms of 16-bit code units, with code units large
enough to store any unsigned 16-bit value (internally they could still
be 18-bit on systems with 6-bit or 9-bit addressable memory cells; the
sizeof() property of this code unit type could still be 2, or 3, or
something else, as long as it is large enough to store the value). On
some devices (not so exotic...) there are memory areas that are 4-bit
addressable or even 1-bit addressable (in that latter case the sizeof()
property for the code unit type would return 16, not 2). Some devices
only have 16-bit or 32-bit addressable memory, and sizeof() would return
1 (and the C types char and wchar_t would most likely be the same).


2013/1/7 Doug Ewell d...@ewellic.org:
 You're right, and I stand corrected. I read Markus's post too quickly.

 Mark Davis ☕ mark at macchiato dot com wrote:

 But still non-conformant.

 That's incorrect.

 The point I was making above is that in order to say that something is 
 non-conformant, you have to be very clear what it is non-conformant TO.

 Also, we commonly read code points from 16-bit Unicode strings, and
 unpaired surrogates are returned as themselves and treated as such
 (e.g., in collation).

 + That is conformant for Unicode 16-bit strings.

 + That is not conformant for UTF-16.

 There is an important difference.

 --
 Doug Ewell | Thornton, CO, USA
 http://ewellic.org | @DougEwell







RE: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Whistler, Ken
Philippe Verdy said:

 Well then I don't know why you need a definition of an Unicode 16-bit
 string. For me it just means exactly the same as 16-bit string, and
 the encoding in it is not relevant given you can put anything in it
 without even needing to be conformant to Unicode. So a Java string is
 exactly the same, a 16-bit string. The same also as Windows API 16-bit
 strings, or wide strings in a C compiler where wide is mapped by a
 compiler option to 16-bit code units for wchar_t ...

And elaborating on Mark's response a little:

[0x0061,0x0062,0x4E00,0xFFFF,0x0410]

Is a Unicode 16-bit string. It contains "a", "b", a Han character, a 
noncharacter, and a Cyrillic character.

Because it is also well-formed as UTF-16, it is also a UTF-16 string, by the 
definitions in the standard. 

[0x0061,0xD800,0x4E00,0xFFFF,0x0410]

Is a Unicode 16-bit string. It contains "a", a high-surrogate code unit, a 
Han character, a noncharacter, and a Cyrillic character.

Because an unpaired high-surrogate code unit is not allowed in UTF-16, this is 
*NOT* a UTF-16 string.

On the other hand, consider:

[0x0061,0x0062,0x88EA,0x8440]

That is *NOT* a Unicode 16-bit string. It contains "a", "b", a Han character, 
and a Cyrillic character. How do I know? Because I know the character set 
context. It is a wchar_t implementation of the Shift-JIS code page 932.

The difference is the declaration of the standard one uses to interpret what 
the 16-bit units mean. In a "Unicode 16-bit string" I go to the Unicode 
Standard to figure out how to interpret the numbers. In a "wide Code Page 932 
string" I go to the specification of Code Page 932 to figure out how to 
interpret the numbers.

This is no different, really, than talking about a Latin-1 string versus a 
KOI-8 string.
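
(As an illustrative sketch, not text from the standard: the extra condition 
that separates a UTF-16 string from an arbitrary Unicode 16-bit string can be 
checked mechanically. The function name below is made up.)

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* A sequence of 16-bit code units is well-formed UTF-16 iff every lead
   surrogate is followed by a trail surrogate and no trail surrogate
   appears on its own. Noncharacters such as U+FFFF do not affect
   well-formedness. */
static bool is_well_formed_utf16(const uint16_t *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF) {          /* lead surrogate */
            if (i + 1 >= len || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return false;
            i++;                                         /* skip the trail */
        } else if (s[i] >= 0xDC00 && s[i] <= 0xDFFF) {   /* lone trail */
            return false;
        }
    }
    return true;
}

/* The two examples above:
   {0x0061,0x0062,0x4E00,0xFFFF,0x0410} -> true  (also a UTF-16 string)
   {0x0061,0xD800,0x4E00,0xFFFF,0x0410} -> false (16-bit string, not UTF-16) */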

--Ken






RE: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Whistler, Ken
Philippe also said:

 ... reserving UTF-16 for what the standard discusses as a
 16-bit string, except that it should still require UTF-16
 conformance (no unpaired surrogates and no non-characters) ...

For those following along, conformance to UTF-16 does *NOT* require the 
absence of noncharacters. Noncharacters are perfectly valid in UTF-16.

--Ken





Are there Unicode processors?

2013-01-07 Thread Costello, Roger L.
Hi Folks,

An XML processor breaks up an XML  document into its parts -- here's a start 
tag, here's element content, here's an end tag, etc. -- and then makes those 
parts (along with information about each part such as this part is a start 
tag and this part is element content) available to XML applications via an 
API. 

Are there Unicode processors?

That is, are there processors that break up Unicode text into its parts -- 
here's a character, here's another character, here's still another character, 
etc. -- and then makes those parts (along with information about each part such 
as this part is the Latin Capital Letter T and this part is the Latin Small 
Letter o) available to Unicode applications (such as XML processors) via an 
API?

I did a Google search for Unicode processor and came up empty so I am 
guessing the answer is that there are no Unicode processors. Or perhaps they go 
by a different name? If there are no Unicode processors, why not?

/Roger




Re: Are there Unicode processors?

2013-01-07 Thread David Starner
On Mon, Jan 7, 2013 at 2:34 PM, Costello, Roger L. coste...@mitre.org wrote:
 Are there Unicode processors?

 That is, are there processors that break up Unicode text into its parts -- 
 here's a character, here's another character, here's still another character, 
 etc. -- and then makes those parts (along with information about each part 
 such as this part is the Latin Capital Letter T and this part is the Latin 
 Small Letter o) available to Unicode applications (such as XML processors) 
 via an API?

 I did a Google search for Unicode processor and came up empty so I am 
 guessing the answer is that there are no Unicode processors. Or perhaps they 
 go by a different name? If there are no Unicode processors, why not?

I don't really think I understand what you want. K&R C had this, at
least for the ASCII subset of Unicode; it has arrays of characters and
you can access each character individually. If you want to know whether
the third character in your array s is the Latin capital letter T, you
write s[2] == 'T'. If you want to know whether it's a letter, you write
isalpha(s[2]). Naturally, full Unicode support is somewhat more
complex, but it's still a matter of sequences of characters and
functions to query their properties. It's plain text; it doesn't have
XML's complex hierarchical features.
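
To spell out that ASCII-level version as a complete (if trivial) program:

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    const char s[] = "CAT";

    /* Is the third character in the array the Latin capital letter T? */
    if (s[2] == 'T')
        printf("s[2] is 'T'\n");

    /* Is it a letter? isalpha() answers this per-character property query
       for the basic character set only; full Unicode properties need a
       library such as ICU. */
    if (isalpha((unsigned char)s[2]))
        printf("s[2] is a letter\n");

    return 0;
}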



-- 
Kie ekzistas vivo, ekzistas espero.




Re: Are there Unicode processors?

2013-01-07 Thread Mark Davis ☕
That is not the typical way that Unicode text is processed.

Typically whatever OS you are using will supply mechanisms for iterating
through any Unicode string, returning each of the code points. It may also
offer APIs for returning information about each character (called 'property
values'), or you can get libraries like ICU (http://site.icu-project.org/)
that have full-featured property support
(http://userguide.icu-project.org/strings/properties).
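
For example, with ICU4C (a sketch assuming a standard ICU installation; see 
the links above for the full property API), iterating a UTF-16 string by code 
point and querying a property looks roughly like this:

#include <unicode/utypes.h>
#include <unicode/utf16.h>   /* U16_NEXT */
#include <unicode/uchar.h>   /* u_isalpha */
#include <stdio.h>

int main(void)
{
    /* "To" followed by U+4E00 (a Han character) */
    static const UChar s[] = { 0x0054, 0x006F, 0x4E00 };
    int32_t i = 0, len = 3;

    while (i < len) {
        UChar32 c;
        U16_NEXT(s, i, len, c);   /* advances i by one or two code units */
        printf("U+%04X  alphabetic: %d\n", (int)c, (int)u_isalpha(c));
    }
    return 0;
}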


Mark https://plus.google.com/114199149796022210033
— Il meglio è l’inimico del bene —


On Mon, Jan 7, 2013 at 2:34 PM, Costello, Roger L. coste...@mitre.org wrote:

 Hi Folks,

 An XML processor breaks up an XML  document into its parts -- here's a
 start tag, here's element content, here's an end tag, etc. -- and then
 makes those parts (along with information about each part such as this
 part is a start tag and this part is element content) available to XML
 applications via an API.

 Are there Unicode processors?

 That is, are there processors that break up Unicode text into its parts --
 here's a character, here's another character, here's still another
 character, etc. -- and then makes those parts (along with information about
 each part such as this part is the Latin Capital Letter T and this part
 is the Latin Small Letter o) available to Unicode applications (such as
 XML processors) via an API?

 I did a Google search for Unicode processor and came up empty so I am
 guessing the answer is that there are no Unicode processors. Or perhaps
 they go by a different name? If there are no Unicode processors, why not?

 /Roger





RE: Are there Unicode processors?

2013-01-07 Thread Phillips, Addison
Unicode processor??

If what you're looking for is code that breaks text into grapheme 
clusters/words/lines/etc., that's called text segmentation and is described 
in:

   http://www.unicode.org/reports/tr29/

But you go on to talk about characters and their properties. If you're 
looking for APIs that provide access to things like Unicode character 
properties, programming languages or libraries provide such capabilities (Java, 
Perl, Python, ICU...) in various appropriate ways. See, for example:

http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html

Or:

http://perldoc.perl.org/5.14.0/perlunicode.html#Unicode-Character-Properties

Or:

http://userguide.icu-project.org/strings/properties


Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.




 -Original Message-
 From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On
 Behalf Of Costello, Roger L.
 Sent: Monday, January 07, 2013 2:35 PM
 To: unicode@unicode.org
 Subject: Are there Unicode processors?
 
 Hi Folks,
 
 An XML processor breaks up an XML  document into its parts -- here's a start
 tag, here's element content, here's an end tag, etc. -- and then makes those
 parts (along with information about each part such as this part is a start 
 tag
 and this part is element content) available to XML applications via an API.
 
 Are there Unicode processors?
 
 That is, are there processors that break up Unicode text into its parts -- 
 here's a
 character, here's another character, here's still another character, etc. -- 
 and
 then makes those parts (along with information about each part such as this
 part is the Latin Capital Letter T and this part is the Latin Small Letter 
 o)
 available to Unicode applications (such as XML processors) via an API?
 
 I did a Google search for Unicode processor and came up empty so I am
 guessing the answer is that there are no Unicode processors. Or perhaps they
 go by a different name? If there are no Unicode processors, why not?
 
 /Roger
 





Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Martin J. Dürst

On 2013/01/08 3:27, Markus Scherer wrote:


Also, we commonly read code points from 16-bit Unicode strings, and
unpaired surrogates are returned as themselves and treated as such (e.g.,
in collation). That would not be well-formed UTF-16, but it's generally
harmless in text processing.


Things like this are called garbage in, garbage-out (GIGO). It may be 
harmless, or it may hurt you later.


Regards,   Martin.



Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Mark Davis ☕
That's not the point (see successive messages).


Mark https://plus.google.com/114199149796022210033
— Il meglio è l’inimico del bene —


On Mon, Jan 7, 2013 at 4:59 PM, Martin J. Dürst due...@it.aoyama.ac.jp wrote:

 On 2013/01/08 3:27, Markus Scherer wrote:

  Also, we commonly read code points from 16-bit Unicode strings, and
 unpaired surrogates are returned as themselves and treated as such (e.g.,
 in collation). That would not be well-formed UTF-16, but it's generally
 harmless in text processing.


 Things like this are called garbage in, garbage-out (GIGO). It may be
 harmless, or it may hurt you later.

 Regards,   Martin.




Q is a Roman numeral?

2013-01-07 Thread Ben Scarborough
This isn't directly related to Unicode, but I thought this would be a
good place to ask.

Specifically, I'm curious about figure 14 (Gordon 1982) from WG2 N3218
[http://std.dkuug.dk/jtc1/sc2/wg2/docs/N3218.pdf], which says:
 Whereas our so-called Arabic numerals
 are ten in number (0–9), the Roman nu-
 merals number nine: I = 1 (one), V = 5, X
 = 10, L = 50, C = 100, Đ = 500 (D reg-
 ularly with middle bar, the modern form
 being simply D), a symbol for 1,000 (see
 below), Q = 500,000, and a rather strange
 symbol for 6: ↅ.

Now that "Q = 500,000" bit seems a little odd to me. I've never seen
that anywhere else. Does anyone know where it came from? Is there real
usage of Q for 500,000?

—Ben Scarborough




RE: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Whistler, Ken
Martin,

The kind of situation Markus is talking about is illustrated particularly well 
in collation. And there is a section 7.1.1 in UTS #10 specifically devoted to 
this issue:

http://www.unicode.org/reports/tr10/#Handline_Illformed

When weighting Unicode 16-bit strings for collation, you can, of course, always 
detect an unpaired surrogate and return an error code or throw an exception, 
but that may not be the best strategy for an implementation.

The problem derives in part from the fact that for sorting, the comparison 
routine is generally buried deep down as a primitive comparison function in 
what may be a rather complicated sorting algorithm. Those algorithms often 
assume that the comparison routine is analogous to strcmp(), and will always 
return -1/0/1 (or negative/0/positive), and that it is not going to fail 
because it decides that some byte value in an input string is not valid in some 
particular character encoding. (Of course, the calling code needs to ensure it 
isn't handing off null pointers or unallocated objects, but that is par for the 
course for any string handling.)

Now if I want to adopt a particular sorting algorithm so it uses a 
UCA-compliant, multi-level collation algorithm for the actual string 
comparison, then by far the easiest way to do so is to build a function 
essentially comparable to strcmp() in structure, e.g. UCA_strcmp(context, 
string1, string2), which also always returns -1/0/1 for any two Unicode 16-bit 
strings. If I introduce a string validation aspect to this comparison routine, 
and return an error code or raise an exception, then I run the risk of 
marginally slowing down the most time-critical part of the sorting loop, as 
well as complicating the adaptation of the sorting code, to deal with extra 
error conditions. It is faster, more reliable and robust, and easier to adapt 
the code, if I simply specify for the weighting exactly what happens to any 
isolated surrogate in input strings, and compare accordingly. Hence the two 
alternative strategies suggested in Section 7.1.1 of UTS #10: either weight 
each maximal ill-formed subsequence as if it were U+FFFD (with a primary 
weight), or weight each 
surrogate code point with a generated implicit weight, as if it were an 
unassigned code point. Either strategy works. And in fact, the conformance 
tests in CollationTest.zip for UCA include some ill-formed strings in the test 
data, so that implementations can test their handling of them, if they choose.

So in this kind of a case, what we are actually dealing with is: garbage in, 
principled, correct results out. ;-)
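
To make the shape of that concrete (a sketch only: the helper name and the 
length parameters are invented for illustration, and all of the real weighting 
machinery is omitted):

#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch: the code point to weight at position *i in a
   Unicode 16-bit string, with each unpaired surrogate treated as if it
   were U+FFFD (the first strategy in UTS #10, Section 7.1.1). */
static uint32_t code_point_for_weighting(const uint16_t *s, size_t len,
                                         size_t *i)
{
    uint16_t u = s[(*i)++];
    if (u >= 0xD800 && u <= 0xDBFF && *i < len &&
        s[*i] >= 0xDC00 && s[*i] <= 0xDFFF)
        return 0x10000u + (((uint32_t)(u - 0xD800) << 10) | (s[(*i)++] - 0xDC00));
    if (u >= 0xD800 && u <= 0xDFFF)
        return 0xFFFD;   /* isolated surrogate: weight as U+FFFD */
    return u;
}

/* Hypothetical strcmp()-style shape: always returns negative/zero/positive,
   never an error code, even for ill-formed 16-bit input. A real version
   would map each code point to its multi-level collation weights. */
int UCA_strcmp(const void *context,
               const uint16_t *s1, size_t len1,
               const uint16_t *s2, size_t len2);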

--Ken

 -Original Message-
 
 On 2013/01/08 3:27, Markus Scherer wrote:
 
  Also, we commonly read code points from 16-bit Unicode strings, and
  unpaired surrogates are returned as themselves and treated as such (e.g.,
  in collation). That would not be well-formed UTF-16, but it's generally
  harmless in text processing.
 
 Things like this are called garbage in, garbage-out (GIGO). It may be
 harmless, or it may hurt you later.
 
 Regards,   Martin.





RE: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Whistler, Ken

 
 http://www.unicode.org/reports/tr10/#Handline_Illformed

Grrr. 

http://www.unicode.org/reports/tr10/#Handling_Illformed

I seem unable to handle ill-formed spelling today. :(

--Ken






RE: Q is a Roman numeral?

2013-01-07 Thread Whistler, Ken
I'm gonna take a wild stab here and assume that this is Q as the medieval 
Latin abbreviation for quingenti, which usually means 500, but also gets 
glossed just as "a big number", as in "milia quingenta" ("thousands upon 
thousands"). Maybe some medieval scribe substituted a Q for |V| (with an 
overscore on the V), which would be the more normal way to write 5,000 and then 
500,000.

--Ken
 
 Now that Q = 500,000 bit seems a little odd to me. I've never seen
 that anywhere else. Does anyone know where it came from? Is there real
 usage of Q for 500,000?
 
 —Ben Scarborough
 





Re: Are there Unicode processors?

2013-01-07 Thread Doug Ewell

Costello, Roger L. wrote:


Are there Unicode processors?


Bottom line, you need to be more specific about what level of 
processing you are talking about. As many have said, parsing a byte 
stream into UTF-{8, 16, 32} characters is everywhere. Converting between 
normalization forms is a bit less common. Intricate text analysis is 
generally the domain of specialized tools.


--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell





Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Stephan Stiller



Things like this are called garbage in, garbage-out (GIGO). It may be
harmless, or it may hurt you later.

So in this kind of a case, what we are actually dealing with is: garbage in, 
principled, correct results out. ;-)


Wouldn't the clean way be to ensure valid strings (only) when they're 
built and then make sure that string algorithms (only) preserve 
well-formedness of input?


Perhaps this is how the system grew, but it seems to me that it's yet 
another legacy of C pointer arithmetic, and about convenience of 
implementation rather than a safety or performance issue.

Stephan




Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Mark Davis ☕
In practice and by design, treating isolated surrogates the same as
reserved code points in processing, and then cleaning up on conversion to
UTFs works just fine. It is a tradeoff that is up to the implementation.

It has nothing to do with a legacy of C pointer arithmetic. It does
represent a pragmatic choice made some time ago, but there is no need to get
worked up about it. Human scripts and their representation on computers are
quite complex enough; in the grand scheme of things, the handling of
surrogates in implementations pales in significance.
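
As a sketch of what "cleaning up on conversion to UTFs" can mean in practice 
(hypothetical helper names; replacement with U+FFFD is one common policy, not 
the only one): isolated surrogates pass through the 16-bit processing 
unchanged and are substituted only at the boundary where the string is 
converted to, say, UTF-8.

#include <stdint.h>
#include <stddef.h>

/* Append code point c (already valid or already replaced) as UTF-8. */
static size_t put_utf8(uint32_t c, uint8_t *out)
{
    if (c < 0x80) {
        out[0] = (uint8_t)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (uint8_t)(0xC0 | (c >> 6));
        out[1] = (uint8_t)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (c >> 12));
        out[1] = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (c & 0x3F));
        return 3;
    }
    out[0] = (uint8_t)(0xF0 | (c >> 18));
    out[1] = (uint8_t)(0x80 | ((c >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (c & 0x3F));
    return 4;
}

/* Convert a Unicode 16-bit string to UTF-8, replacing each unpaired
   surrogate with U+FFFD at this boundary (out must be large enough). */
static size_t utf16_string_to_utf8(const uint16_t *s, size_t len, uint8_t *out)
{
    size_t i = 0, o = 0;
    while (i < len) {
        uint32_t c = s[i++];
        if (c >= 0xD800 && c <= 0xDBFF && i < len &&
            s[i] >= 0xDC00 && s[i] <= 0xDFFF)
            c = 0x10000u + ((c - 0xD800) << 10) + (s[i++] - 0xDC00);
        else if (c >= 0xD800 && c <= 0xDFFF)
            c = 0xFFFD;                      /* clean up unpaired surrogate */
        o += put_utf8(c, out + o);
    }
    return o;
}

Doing the substitution only here keeps the interior 16-bit processing simple 
while guaranteeing that whatever leaves as UTF-8 is well-formed.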


Mark https://plus.google.com/114199149796022210033
— Il meglio è l’inimico del bene —


On Mon, Jan 7, 2013 at 9:43 PM, Stephan Stiller
stephan.stil...@gmail.com wrote:


  Things like this are called garbage in, garbage-out (GIGO). It may be
 harmless, or it may hurt you later.

 So in this kind of a case, what we are actually dealing with is: garbage
 in, principled, correct results out. ;-)


 Wouldn't the clean way be to ensure valid strings (only) when they're
 built and then make sure that string algorithms (only) preserve
 well-formedness of input?

 Perhaps this is how the system grew, but it seems to me that it's yet
 another legacy of C pointer arithmetic, and about convenience of
 implementation rather than a safety or performance issue.

 Stephan