Why is endianness relevant when storing data on disks but not when in memory?

2013-01-05 Thread Costello, Roger L.
Hi Folks,

In the book Fonts & Encodings it says (I think) that endianness is relevant 
only when storing data on disks.

Why is endianness not relevant when data is in memory? 

On page 62 it says:

... when we store ... data on disk, we write 
not 32-bit (or 16-bit) numbers but series of 
four (or two) bytes. And according to the 
type of processor (Intel or RISC), the most 
significant byte will be written either first 
(the little-endian system) or last (the
big-endian system). Therefore we have 
both a UTF-32BE and a UTF-32LE, a UTF-16BE
and a UTF-16LE.

Then, on page 63 it says:

... UTF-16 or UTF-32 ... if we specify one of
   these, either we are in memory, in which case
the issue of representation as a sequence of
bytes does not arise, or we are using a method
that enables us to detect the endianness of the
document.

When data is in memory isn't it important to know whether the most significant 
byte is first or last?

Does this mean that when exchanging Unicode data across the Internet the 
endianness is not relevant?

Are these stated correctly:

When Unicode data is in a file we would say, for example, "The file 
contains UTF-32BE data." 

When Unicode data is in memory we would say, "There is UTF-32 data in 
memory." 

When Unicode data is sent across the Internet we would say, "The UTF-32 
data was sent across the Internet."

/Roger




Re: Why is endianness relevant when storing data on disks but not when in memory?

2013-01-05 Thread Bill Poser
Endianness of data stored in memory is relevant, but only if you are
working at a very low level. Suppose that you have UTF-32 data stored as
unsigned C integers. On pretty much any modern computer, each codepoint
will occupy four 8-bit bytes. So long as you deal with that data via C, as
unsigned 32-bit integers, you don't need to know about endianness. The C
compiler and run-time routines take care of that for you. Endianness is
still relevant, in that your unsigned 32-bit integers could be composed of
bytes in different ways, but unless you work at the byte level, you don't
need to know about it.
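
For instance, a minimal C sketch of that distinction (illustrative only, not
from anyone's actual code):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
        uint32_t cp = 0x20AC;   /* U+20AC EURO SIGN as a UTF-32 code unit */

        /* Working with the value itself: endianness never shows up. */
        printf("cp == 0x20AC on every machine: %d\n", cp == 0x20AC);

        /* Looking behind the curtain at the bytes: now it matters. */
        unsigned char b[4];
        memcpy(b, &cp, sizeof cp);
        printf("in-memory bytes: %02X %02X %02X %02X\n", b[0], b[1], b[2], b[3]);
        /* Prints AC 20 00 00 on a little-endian machine,
           00 00 20 AC on a big-endian one. */
        return 0;
    }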

The reason that endianness is relevant to data stored on disk is that
there is no agreement between disks (and other external storage devices)
and your programming language as to what constitutes an unsigned 32-bit
integer. Whereas your program can ask the system for a 32-bit unsigned
integer from memory, it can't ask the disk for one, because there isn't any
agreement between the disk and your program as to what one of those
consists of. Your program has to ask the disk for four bytes and figure out
how to make them into a 32-bit unsigned integer.
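
Something like this sketch, in other words (assuming the four bytes have
already been read into buf, and that the file is known to be UTF-32BE or
UTF-32LE):

    #include <stdint.h>

    /* Assemble a code point from four bytes read from a UTF-32BE file.
       The explicit shifts spell out the byte order, so the result is the
       same regardless of the endianness of the machine running this. */
    uint32_t utf32be_to_cp(const unsigned char buf[4])
    {
        return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
               ((uint32_t)buf[2] <<  8) |  (uint32_t)buf[3];
    }

    /* For UTF-32LE the same four bytes are combined in the opposite order. */
    uint32_t utf32le_to_cp(const unsigned char buf[4])
    {
        return ((uint32_t)buf[3] << 24) | ((uint32_t)buf[2] << 16) |
               ((uint32_t)buf[1] <<  8) |  (uint32_t)buf[0];
    }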

Generally speaking, if you are working in a programming language that has
notions like "Unicode character" or "32-bit unsigned integer", the system
knows how those notions correspond to what is in memory and you don't have
to worry about it. In general the system cannot know what format some stuff
on an external storage device is in, so you may be forced to deal with the
details of representation.





Re: Why is endianness relevant when storing data on disks but not when in memory?

2013-01-05 Thread Bjoern Hoehrmann
* Costello, Roger L. wrote:
On page 62 it says:

... when we store ... data on disk, we write 
not 32-bit (or 16-bit) numbers but series of 
four (or two) bytes. And according to the 
type of processor (Intel or RISC), the most 
significant byte will be written either first 
(the little-endian system) or last (the
big-endian system). Therefore we have 
both a UTF-32BE and a UTF-32LE, a UTF-16BE
and a UTF-16LE.

Then, on page 63 it says:

... UTF-16 or UTF-32 ... if we specify one of
   these, either we are in memory, in which case
the issue of representation as a sequence of
bytes does not arise, or we are using a method
that enables us to detect the endianness of the
document.

When data is in memory isn't it important to know
whether the most significant byte is first or last?

The idea is that this knowledge is implied because there is only a
single system with a single convention involved, with the assumption
that you do not look behind the curtain (do not access the first
byte of a multi-byte integer, for instance).
-- 
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 



Re: Why is endianness relevant when storing data on disks but not when in memory?

2013-01-05 Thread Martin J. Dürst

On 2013/01/06 7:21, Costello, Roger L. wrote:


Does this mean that when exchanging Unicode data across the Internet the 
endianness is not relevant?

Are these stated correctly:

 When Unicode data is in a file we would say, for example, The file contains 
UTF-32BE data.

 When Unicode data is in memory we would say, There is UTF-32 data in 
memory.

 When Unicode data is sent across the Internet we would say, The UTF-32 data 
was sent across the Internet.


The first is correct. The second is correct. The third is wrong. The 
Internet deals with data as a series of bytes, and by its nature has to 
pass data between big-endian and little-endian machines. Therefore, 
endianness is very important on the Internet. So you would say:


"The UTF-32BE data was sent across the Internet."

Actually, as far as I'm aware, the labels UTF-16BE and UTF-16LE were 
first defined in the IETF; see 
http://tools.ietf.org/html/rfc2781#appendix-A.1.


Because of this, Internet protocols mostly prefer UTF-8 over UTF-16 (or 
UTF-32), and actual data is also heavily UTF-8. So it would be better to 
say:


When Unicode data is sent across the Internet we would say, "The UTF-8 
data was sent across the Internet."


Regards,   Martin.



Re: Why is endianness relevant when storing data on disks but not when in memory?

2013-01-05 Thread Doug Ewell

Martin J. Dürst wrote:


When Unicode data is in a file we would say, for example, The file
contains UTF-32BE data.

When Unicode data is in memory we would say, There is UTF-32 data in
memory.

When Unicode data is sent across the Internet we would say, The
UTF-32 data was sent across the Internet.


The first is correct. The second is correct. The third is wrong. The
Internet deals with data as a series of bytes, and by its nature has
to pass data between big-endian and little-endian machines. Therefore,
endianness is very important on the Internet. So you would say:

The UTF-32BE data was sent across the Internet.


The larger problem here is that most civilians don't understand what is 
truly meant by UTF-32BE and UTF-32LE.


In general, people think these terms simply mean "big-endian UTF-32" and 
"little-endian UTF-32" respectively, without the additional connotation 
(defined in D99 and D100) that U+FEFF at the beginning of a stream 
defined as UTF-32BE or UTF-32LE is supposed to be interpreted, 
against all logic, as a zero-width no-break space.


Because of this, it's not automatically the case that the file contains 
UTF-32BE data. That statement implies that there is no initial U+FEFF, 
or if there is one, that it is meant to be a ZWNBSP. You could just as 
easily have a UTF-32 file, which might have an initial U+FEFF (which 
then defines the endianness of the data) or might not (which means the 
data is big-endian unless a higher-level protocol dictates otherwise).
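
Concretely, a decoder for data labelled plain UTF-32 typically starts with
something like the following sketch (just an illustration of that rule, not
code from the Standard):

    #include <stddef.h>

    typedef enum { UTF32_BE, UTF32_LE } utf32_order;

    /* Look at the first four bytes of a stream labelled "UTF-32".
       Sets *bom_len to 4 if a BOM is present (and should be skipped),
       0 otherwise. With no BOM, and absent a higher-level protocol,
       big-endian is assumed. */
    utf32_order detect_utf32_order(const unsigned char *p, size_t len,
                                   size_t *bom_len)
    {
        *bom_len = 0;
        if (len >= 4 && p[0] == 0x00 && p[1] == 0x00 &&
                        p[2] == 0xFE && p[3] == 0xFF) {
            *bom_len = 4;
            return UTF32_BE;    /* U+FEFF serialized big-endian */
        }
        if (len >= 4 && p[0] == 0xFF && p[1] == 0xFE &&
                        p[2] == 0x00 && p[3] == 0x00) {
            *bom_len = 4;
            return UTF32_LE;    /* U+FEFF serialized little-endian */
        }
        return UTF32_BE;        /* no BOM: default to big-endian */
    }

    /* Under the labels UTF-32BE / UTF-32LE, by contrast, an initial U+FEFF
       is not a BOM at all; it is an actual ZERO WIDTH NO-BREAK SPACE. */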


--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell





Re: Why is endianness relevant when storing data on disks but not when in memory?

2013-01-05 Thread Leif Halvard Silli
Doug Ewell, Sat, 5 Jan 2013 18:11:59 -0700:
 Martin J. Dürst wrote:

 When Unicode data is sent across the Internet we would say, The
 UTF-32 data was sent across the Internet.
 
 The first is correct. The second is correct. The third is wrong. 
 [ snip ] you would say:
 
 The UTF-32BE data was sent across the Internet.
 
 The larger problem here is that most civilians don't understand what 
 is truly meant by UTF-32BE and UTF-32LE.
 
 In general, people think these terms simply mean big-endian UTF-32 
 and little-endian UTF-32 respectively, without the additional 
 connotation (defined in D99 and D100) that U+FEFF at the beginning of 
 a stream defined as UTF-32BE or UTF-32LE is supposed to be 
 interpreted, against all logic, as a zero-width no-break space.

(I agree that it is against logic.)

 Because of this, it's not automatically the case that the file 
 contains UTF-32BE data. That statement implies that there is no 
 initial U+FEFF, or if there is one, that it is meant to be a ZWNBSP. 
 You could just as easily have a UTF-32 file, which might have an 
 initial U+FEFF (which then defines the endianness of the data) or 
 might not (which means the data is big-endian unless a higher-level 
 protocol dictates otherwise).

I believe that even the U+FEFF *itself* is either UTF-32LE or UTF-32BE. 
Thus Martin's statement does not, per se, imply the absence of a 
byte-order mark. Assuming that the label "UTF-32" is defined the same 
way as the label "UTF-16", it is an umbrella label or a macro label 
(hint: macro language) which covers the two *real* encodings, 
UTF-32LE and UTF-32BE.

Just my 5 øre.
-- 
leif halvard silli




Re: What does it mean to not be a valid string in Unicode?

2013-01-05 Thread Stephan Stiller
 If for example I sit on a committee that devises a new encoding form, I
 would need to be concerned with the question which *sequences of Unicode
 code points* are sound. If this is the same as sequences of Unicode
 scalar values, I would need to exclude surrogates, if I read the standard
 correctly (this wasn't obvious to me on first inspection btw). If for
 example I sit on a committee that designs an optimized compression
 algorithm for Unicode strings (yep, I do know about SCSU), I might want to
 first convert them to some canonical internal form (say, my array of
 non-negative integers). If U+surrogate values can be assumed to not
 exist, there are 2048 fewer values a code point can assume; that's good for
 compression, and I'll subtract 2048 from those large scalar values in a
 first step. Etc etc. So I do think there are a number of very general use
 cases where this question arises.


In fact, these questions have arisen in the past and have been answered
before. A present-day use case is if I author a programming language and
need to decide which values for val I accept in a statement like
someEncodingFormIndependentUnicodeStringType str = val (with val specified
in some PL-specific way).
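
For concreteness, the acceptance test such a language needs is small; a C
sketch (the function names are mine):

    #include <stdint.h>
    #include <stdbool.h>

    /* A Unicode scalar value is any code point in U+0000..U+10FFFF
       except the surrogate code points U+D800..U+DFFF. */
    bool is_unicode_scalar_value(uint32_t v)
    {
        return v <= 0x10FFFF && !(v >= 0xD800 && v <= 0xDFFF);
    }

    /* The "2048 fewer values" remapping mentioned above: pack the scalar
       values into the contiguous range 0 .. 0x10F7FF. */
    uint32_t scalar_to_index(uint32_t v)
    {
        return v < 0xD800 ? v : v - 0x800;
    }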

I've looked at the Standard, and I must admit I'm a bit perplexed. Because
of C1, which explicitly states

A process shall not interpret a high-surrogate code point or a
low-surrogate code point as an abstract character.

I do not know why surrogate values are defined as code points in the
first place. It seems to me that surrogates are (or should be) an encoding
form–specific notion, whereas I have always thought of code points as
encoding form–independent. Turns out this was wrong. I have always been
thinking that code point conceptually meant Unicode scalar value, which
is explicitly forbidden to have a surrogate value. Is this only
terminological confusion? I would like to ask: Why do we need the notion of
a surrogate code point; why isn't the notion of surrogate code units [in
some specific encoding form] enough? Conceptually surrogate values are
byte sequences used in encoding forms (modulo endianness). Why would one
define an expression (Unicode code point) that conceptually lumps
Unicode scalar value (an encoding form–independent notion) and surrogate
code point (a notion that I wouldn't expect to exist outside of specific
encoding forms) together?

An encoding form maps only Unicode scalar values (that is all Unicode code
points excluding the surrogate code points), by definition. D80 and what
follows (Unicode string and Unicode X-bit string) exist, as I
understand it, *only* in order for us to be able to have terminology for
discussing ill-formed code unit sequences in the various encoding forms;
but all of this talk seems to me to be encoding form–dependent.

I think the answer to the question I had in mind is that the legal
sequences of Unicode scalar values are (by definition)
({U+0000, ..., U+10FFFF} \ {U+D800, ..., U+DFFF})* .
But then there is the notion of Unicode string, which is conceptually
different, by definition. Maybe this is a terminological issue only. But is
there an expression in the Standard that is defined as sequence of Unicode
scalar values, a notion that seems to me to be conceptually important? I
can see that the Standard defines the various well-formed encoding form
code unit sequence. Have I overlooked something?

Why is it even possible to store a surrogate value in something like the
icu::UnicodeString datatype? In other words, why are we concerned with
storing Unicode *code points* in data structures instead of Unicode *scalar
values* (which can be serialized via encoding forms)?
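
For example (sketched in plain C rather than ICU, but the point is the same):

    #include <stdint.h>

    /* A Unicode 16-bit string, in the sense of D80 and what follows, is just
       a sequence of 16-bit code units. This one contains a lone high
       surrogate, so it is not a well-formed UTF-16 code unit sequence, yet
       the data type stores it without complaint -- exactly the situation the
       question is about. */
    static const uint16_t not_well_formed[] = { 0x0041, 0xD800, 0x0042 };

    /* For comparison, a well-formed sequence: U+10400 encoded as the
       surrogate pair D801 DC00. */
    static const uint16_t well_formed[] = { 0xD801, 0xDC00 };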

Stephan