Why is endianness relevant when storing data on disks but not when in memory?
Hi Folks,

In the book "Fonts & Encodings" it says (I think) that endianness is relevant only when storing data on disks. Why is endianness not relevant when data is in memory?

On page 62 it says: "... when we store ... data on disk, we write not 32-bit (or 16-bit) numbers but series of four (or two) bytes. And according to the type of processor (Intel or RISC), the most significant byte will be written either first (the big-endian system) or last (the little-endian system). Therefore we have both a UTF-32BE and a UTF-32LE, a UTF-16BE and a UTF-16LE."

Then, on page 63 it says: "... UTF-16 or UTF-32 ... if we specify one of these, either we are in memory, in which case the issue of representation as a sequence of bytes does not arise, or we are using a method that enables us to detect the endianness of the document."

When data is in memory, isn't it important to know whether the most significant byte is first or last?

Does this mean that when exchanging Unicode data across the Internet the endianness is not relevant?

Are these stated correctly?

- When Unicode data is in a file we would say, for example, "The file contains UTF-32BE data."
- When Unicode data is in memory we would say, "There is UTF-32 data in memory."
- When Unicode data is sent across the Internet we would say, "The UTF-32 data was sent across the Internet."

/Roger
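To make the question concrete, here is a small C sketch of what BE vs. LE means for one code point serialized as four bytes. The helper names (`utf32be_put`, `utf32le_put`) are my own, not from the book or the thread:

```c
#include <stdint.h>

/* Serialize one code point as four bytes, most significant byte first
   (UTF-32BE). Helper names are illustrative, not standard API. */
static void utf32be_put(uint32_t cp, uint8_t out[4]) {
    out[0] = (uint8_t)(cp >> 24);  /* most significant byte first */
    out[1] = (uint8_t)(cp >> 16);
    out[2] = (uint8_t)(cp >> 8);
    out[3] = (uint8_t)(cp);
}

/* Same code point, least significant byte first (UTF-32LE). */
static void utf32le_put(uint32_t cp, uint8_t out[4]) {
    out[0] = (uint8_t)(cp);        /* least significant byte first */
    out[1] = (uint8_t)(cp >> 8);
    out[2] = (uint8_t)(cp >> 16);
    out[3] = (uint8_t)(cp >> 24);
}
```

For U+1F600, UTF-32BE yields the bytes 00 01 F6 00 and UTF-32LE yields 00 F6 01 00: same number, two byte orders, hence the two labels.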
Re: Why is endianness relevant when storing data on disks but not when in memory?
On Sat, Jan 5, 2013 at 2:21 PM, Costello, Roger L. <coste...@mitre.org> wrote:
> Why is endianness not relevant when data is in memory? [...]

Endianness of data stored in memory is relevant, but only if you are working at a very low level.

Suppose that you have UTF-32 data stored as unsigned C integers. On pretty much any modern computer, each code point will occupy four 8-bit bytes. So long as you deal with that data via C, as unsigned 32-bit integers, you don't need to know about endianness; the C compiler and run-time routines take care of that for you. Endianness is still relevant, in that your unsigned 32-bit integers could be composed of bytes in different ways, but unless you work at the byte level, you don't need to know about it.

The reason that endianness is relevant to data stored on disk is that there is no agreement between disks (or other external storage devices) and your programming language as to what constitutes an unsigned 32-bit integer. Whereas your program can ask the system for a 32-bit unsigned integer from memory, it can't ask the disk for one, because there isn't any agreement between the disk and your program as to what one of those consists of. Your program has to ask the disk for four bytes and figure out how to make them into a 32-bit unsigned integer.

Generally speaking, if you are working in a programming language that has notions like "Unicode character" or "32-bit unsigned integer", the system knows how those notions correspond to what is in memory and you don't have to worry about it. In general the system cannot know what format some stuff on an external storage device is in, so you may be forced to deal with the details of representation.
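The distinction above can be sketched in C: treated as an integer, a value's byte order is invisible, while peeking at its storage reveals it, and assembling bytes read from disk arithmetically works on any host. The helper names are mine, for illustration only:

```c
#include <stdint.h>
#include <string.h>

/* Working with a uint32_t as an integer hides byte order; inspecting
   its storage ("looking behind the curtain") reveals it. */
static int is_little_endian(void) {
    uint32_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);     /* peek at the lowest-addressed byte */
    return first == 1;             /* 1 there means the LSB comes first */
}

/* Reassembling four bytes arithmetically is host-independent: this is
   how a program turns "four bytes from disk" into a 32-bit integer. */
static uint32_t load_be32(const uint8_t b[4]) {
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16)
         | ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}
```

`load_be32` gives the same result on big- and little-endian machines, which is exactly why byte-level I/O code must pick an order explicitly instead of trusting the host's.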
Re: Why is endianness relevant when storing data on disks but not when in memory?
* Costello, Roger L. wrote:
> When data is in memory isn't it important to know whether the most
> significant byte is first or last?

The idea is that this knowledge is implied, because there is only a single system with a single convention involved, with the assumption that you do not look behind the curtain (do not access the first byte of a multi-byte integer, for instance).

-- 
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Re: Why is endianness relevant when storing data on disks but not when in memory?
On 2013/01/06 7:21, Costello, Roger L. wrote:
> Does this mean that when exchanging Unicode data across the Internet
> the endianness is not relevant? Are these stated correctly:
> When Unicode data is in a file we would say, for example, "The file
> contains UTF-32BE data."
> When Unicode data is in memory we would say, "There is UTF-32 data
> in memory."
> When Unicode data is sent across the Internet we would say, "The
> UTF-32 data was sent across the Internet."

The first is correct. The second is correct. The third is wrong.

The Internet deals with data as a series of bytes, and by its nature has to pass data between big-endian and little-endian machines. Therefore, endianness is very important on the Internet. So you would say: "The UTF-32BE data was sent across the Internet."

Actually, as far as I'm aware, the labels UTF-16BE and UTF-16LE were first defined in the IETF; see http://tools.ietf.org/html/rfc2781#appendix-A.1.

Because of this, Internet protocols mostly prefer UTF-8 over UTF-16 (or UTF-32), and actual data is also heavily UTF-8. So it would be better to say: when Unicode data is sent across the Internet, "The UTF-8 data was sent across the Internet."

Regards, Martin.
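One reason UTF-8 sidesteps the whole issue is that it is defined directly as a byte sequence, so there is nothing for byte order to vary over. A minimal encoder for scalar values, as a sketch (the function name `utf8_encode` is my own; this follows the standard UTF-8 bit layout):

```c
#include <stdint.h>
#include <stddef.h>

/* Encode one Unicode scalar value as UTF-8. Returns the number of
   bytes written (1..4), or 0 for surrogates and out-of-range values.
   No BE/LE variant exists: the byte order is fixed by the definition. */
static size_t utf8_encode(uint32_t cp, uint8_t out[4]) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate: not a scalar value */
    if (cp <= 0x7F) { out[0] = (uint8_t)cp; return 1; }
    if (cp <= 0x7FF) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, U+20AC encodes to the bytes E2 82 AC on every machine; there is no "UTF-8BE" or "UTF-8LE" label to choose.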
Re: Why is endianness relevant when storing data on disks but not when in memory?
Martin J. Dürst wrote:
>> When Unicode data is in a file we would say, for example, "The file
>> contains UTF-32BE data."
> [...] So you would say: "The UTF-32BE data was sent across the
> Internet."

The larger problem here is that most civilians don't understand what is truly meant by UTF-32BE and UTF-32LE. In general, people think these terms simply mean "big-endian UTF-32" and "little-endian UTF-32" respectively, without the additional connotation (defined in D99 and D100) that U+FEFF at the beginning of a stream defined as UTF-32BE or UTF-32LE is supposed to be interpreted, against all logic, as a zero-width no-break space.

Because of this, it's not automatically the case that "the file contains UTF-32BE data." That statement implies that there is no initial U+FEFF, or if there is one, that it is meant to be a ZWNBSP. You could just as easily have a UTF-32 file, which might have an initial U+FEFF (which then defines the endianness of the data) or might not (which means the data is big-endian unless a higher-level protocol dictates otherwise).

-- 
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell
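The BOM-based detection described for the plain "UTF-32" label can be sketched as a byte check on the start of the stream. The function name and return convention here are my own (1 = BE, -1 = LE, 0 = no BOM, meaning big-endian absent a higher-level protocol):

```c
#include <stdint.h>
#include <stddef.h>

/* Sniff the endianness of a stream labeled plain "UTF-32" from an
   initial U+FEFF. Return: 1 = big-endian BOM, -1 = little-endian BOM,
   0 = no BOM (default big-endian, unless a protocol says otherwise). */
static int utf32_bom(const uint8_t *buf, size_t len) {
    if (len >= 4) {
        if (buf[0] == 0x00 && buf[1] == 0x00 && buf[2] == 0xFE && buf[3] == 0xFF)
            return 1;   /* U+FEFF serialized big-endian */
        if (buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0x00 && buf[3] == 0x00)
            return -1;  /* U+FEFF serialized little-endian */
    }
    return 0;
}
```

Note that per Doug's point, this check is only appropriate for the plain "UTF-32" label; under the "UTF-32BE"/"UTF-32LE" labels an initial U+FEFF would instead be content (a ZWNBSP).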
Re: Why is endianness relevant when storing data on disks but not when in memory?
Doug Ewell, Sat, 5 Jan 2013 18:11:59 -0700:
> In general, people think these terms simply mean "big-endian UTF-32"
> and "little-endian UTF-32" respectively, without the additional
> connotation (defined in D99 and D100) that U+FEFF at the beginning of
> a stream defined as UTF-32BE or UTF-32LE is supposed to be
> interpreted, against all logic, as a zero-width no-break space.

(I agree that it is against logic.)

> Because of this, it's not automatically the case that "the file
> contains UTF-32BE data." That statement implies that there is no
> initial U+FEFF, or if there is one, that it is meant to be a ZWNBSP.

I believe that even the U+FEFF *itself* is either UTF-32LE or UTF-32BE. Thus there is, per se, no implication of a lack of byte-order mark in Martin's statement.

Assuming that the label UTF-32 is defined the same way as the label UTF-16, then it is an umbrella label or a "macro" label (hint: macro language) which covers the two *real* encodings, UTF-32LE and UTF-32BE.

Just my 5 øre.
-- leif halvard silli
Re: What does it mean to not be a valid string in Unicode?
If, for example, I sit on a committee that devises a new encoding form, I would need to be concerned with the question of which *sequences of Unicode code points* are sound. If this is the same as sequences of Unicode scalar values, I would need to exclude surrogates, if I read the standard correctly (this wasn't obvious to me on first inspection, btw).

If, for example, I sit on a committee that designs an optimized compression algorithm for Unicode strings (yep, I do know about SCSU), I might want to first convert them to some canonical internal form (say, my array of non-negative integers). If surrogate values can be assumed not to exist, there are 2048 fewer values a code point can assume; that's good for compression, and I'll subtract 2048 from those large scalar values in a first step. Etc. etc.

So I do think there are a number of very general use cases where this question arises. In fact, these questions have arisen in the past and have found answers then. A present-day use case is if I author a programming language and need to decide which values for val I accept in a statement like this:

  someEncodingFormIndependentUnicodeStringType str = val   (val specified in some PL-specific way)

I've looked at the Standard, and I must admit I'm a bit perplexed. Because of C1, which explicitly states "A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character," I do not know why surrogate values are defined as code points in the first place. It seems to me that surrogates are (or should be) an encoding form–specific notion, whereas I have always thought of code points as encoding form–independent. Turns out this was wrong: I have always been thinking that "code point" conceptually meant "Unicode scalar value," which is explicitly forbidden to have a surrogate value. Is this only terminological confusion?
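The "which values of val do I accept" decision boils down to a predicate over code points. A minimal sketch, following D76 of the Standard (scalar values are all code points except surrogates); the function name is my own:

```c
#include <stdint.h>

/* D76: a Unicode scalar value is any code point in U+0000..U+10FFFF
   except the surrogate range U+D800..U+DFFF. This predicate is the
   acceptance test a scalar-value-based string type would apply. */
static int is_scalar_value(uint32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}
```

A language whose string type stores scalar values (rather than arbitrary code points) would reject any val for which this predicate is false.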
I would like to ask: why do we need the notion of a "surrogate code point"; why isn't the notion of "surrogate code units" [in some specific encoding form] enough? Conceptually, surrogate values are byte sequences used in encoding forms (modulo endianness). Why would one define an expression ("Unicode code point") that conceptually lumps "Unicode scalar value" (an encoding form–independent notion) and "surrogate code point" (a notion that I wouldn't expect to exist outside of specific encoding forms) together? An encoding form maps only Unicode scalar values (that is, all Unicode code points excluding the surrogate code points), by definition.

D80 and what follows ("Unicode string" and "Unicode X-bit string") exist, as I understand it, *only* so that we have terminology for discussing ill-formed code unit sequences in the various encoding forms; but all of this talk seems to me to be encoding form–dependent.

I think the answer to the question I had in mind is that the legal sequences of Unicode scalar values are (by definition)

  ({U+0000, ..., U+10FFFF} \ {U+D800, ..., U+DFFF})*

But then there is the notion of "Unicode string", which is conceptually different, by definition. Maybe this is a terminological issue only. But is there an expression in the Standard that is defined as "sequence of Unicode scalar values", a notion that seems to me to be conceptually important? I can see that the Standard defines the various "well-formed encoding form code unit sequences". Have I overlooked something?

Why is it even possible to store a surrogate value in something like the icu::UnicodeString datatype? In other words, why are we concerned with storing Unicode *code points* in data structures instead of Unicode *scalar values* (which can be serialized via encoding forms)?

Stephan
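The sense in which surrogates "belong to" UTF-16 can be made concrete: a scalar value above U+FFFF is carried by a pair of surrogate *code units*. A minimal UTF-16 encoder as a sketch (the helper name is my own):

```c
#include <stdint.h>
#include <stddef.h>

/* Encode one Unicode scalar value as UTF-16 code units. Returns the
   number of 16-bit units written (1 or 2), or 0 for non-scalar input.
   Surrogate values appear here only as code units of the encoding
   form, never as input scalar values. */
static size_t utf16_encode(uint32_t cp, uint16_t out[2]) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* not a scalar value */
    if (cp <= 0xFFFF) { out[0] = (uint16_t)cp; return 1; }
    if (cp > 0x10FFFF) return 0;
    cp -= 0x10000;                                /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));     /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));   /* low surrogate */
    return 2;
}
```

For example, U+1F600 becomes the unit pair D83D DE00. This is why an encoding form maps only scalar values: the surrogate range is already spoken for as the pairing mechanism itself.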