Re: "A Programmer's Introduction to Unicode"
On Tue, 14 Mar 2017 08:51:18 +, Alastair Houghton wrote:

> On 14 Mar 2017, at 02:03, Richard Wordingham wrote:
>
>> On Mon, 13 Mar 2017 19:18:00 +, Alastair Houghton wrote:
>>
>> The problem is that UTF-16 based code can very easily overlook the
>> handling of surrogate pairs, and one can very easily get confused over
>> what string lengths mean.
>
> Yet the same problem exists for UCS-4; it could very easily overlook
> the handling of combining characters.

That's a different issue. I presume you mean the issues of canonical
equivalence and detecting text boundaries. Again, there is the problem
of remembering to consider the whole surrogate pair when using UTF-16.
(I suppose this could be largely handled by avoiding the concept of
arrays.)

Now, the supplementary characters where these issues arise are very
infrequently used. An error in UTF-16 code might easily escape
attention, whereas a problem with UCS-4 (or UTF-8) comes to light as
soon as one handles Thai or IPA.

> As for string lengths, string lengths in code points are no more
> meaningful than string lengths in UTF-16 code units. They don’t tell
> you anything about the number of user-visible characters; or anything
> about the width the string will take up if rendered on the display
> (even in a fixed-width font); or anything about the number of glyphs
> that a given string might be transformed into by glyph mapping. The
> *only* thing a string length of a Unicode string will tell you is the
> number of code units.

A string length in code points does have the advantage of being
independent of encoding. I'm actually using an index for UTF-16 text
(I don't know whether it's denominated in code points or code units) to
index into the UTF-8 source code. However, the number of code units is
the more commonly used quantity, as it tells one how much memory is
required for simple array storage.

Richard.
RE: "A Programmer's Introduction to Unicode"
Philippe Verdy wrote:

>>> Well, you do have eleven bits for flags per codepoint, for example.
>>
>> That's not UCS-4; that's a custom encoding.
>>
>> (any UCS-4 code unit) & 0xFFE00000 == 0
>>
>> (changing to "UTF-32" per Ken's observation)
>
> Per definition yes, but UTC-4 is not Unicode.

I guess it's not. What is UTC-4, anyway? Another name for a UWG meeting
held in 1989?

> As well (any UCS-4 code unit) & 0xFFE00000 == 0 (i.e. 21 bits) is not
> Unicode, UTF-32 is Unicode (more restrictive than just 21 bits, which
> would allow 32 planes instead of just the first 17).

I used bitwise arithmetic strictly to address Steffen's premise that
the 11 "unused bits" in a UTF-32 code unit were available to store
metadata about the code point. Of course UTF-32 does not allow planes
0x11 through 0x1F either.

> I suppose he meant 21 bits, not 11 bits, which covers only a small
> part of the BMP.

No, his comment "you do have eleven bits for flags per codepoint"
pretty clearly referred to using the "extra" 11 bits beyond what is
needed to hold the Unicode scalar value.

--
Doug Ewell | Thornton, CO, US | ewellic.org
Re: "A Programmer's Introduction to Unicode"
Per definition yes, but UTC-4 is not Unicode. As well (any UCS-4 code
unit) & 0xFFE00000 == 0 (i.e. 21 bits) is not Unicode; UTF-32 is
Unicode (more restrictive than just 21 bits, which would allow 32
planes instead of just the first 17). I suppose he meant 21 bits, not
11 bits, which covers only a small part of the BMP.

2017-03-14 16:14 GMT+01:00 Doug Ewell:

> Steffen Nurpmeso wrote:
>
>>> I didn’t say you never needed to work with code points. What I said
>>> is that there’s no advantage to UCS-4 as an encoding, and that
>>
>> Well, you do have eleven bits for flags per codepoint, for example.
>
> That's not UCS-4; that's a custom encoding.
>
> (any UCS-4 code unit) & 0xFFE00000 == 0
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
Re: "A Programmer's Introduction to Unicode"
Steffen Nurpmeso wrote:

>> I didn’t say you never needed to work with code points. What I said
>> is that there’s no advantage to UCS-4 as an encoding, and that
>
> Well, you do have eleven bits for flags per codepoint, for example.

That's not UCS-4; that's a custom encoding.

(any UCS-4 code unit) & 0xFFE00000 == 0

--
Doug Ewell | Thornton, CO, US | ewellic.org
Re: "A Programmer's Introduction to Unicode"
Alastair Houghton wrote:

|On 13 Mar 2017, at 21:10, Khaled Hosny wrote:
|> On Mon, Mar 13, 2017 at 07:18:00PM +, Alastair Houghton wrote:
|>> On 13 Mar 2017, at 17:55, J Decker wrote:
|>>>
|>>> I liked the Go implementation of character type - a rune type -
|>>> which is a codepoint, and strings that return runes by index.
|>>> https://blog.golang.org/strings
|>>
|>> IMO, returning code points by index is a mistake. It over-emphasises
|>> the importance of the code point, which helps to continue the notion
|>> in some developers’ minds that code points are somehow “characters”.
|>> It also leads to people unnecessarily using UCS-4 as an internal
|>> representation, which seems to have very few advantages in practice
|>> over UTF-16.
|>
|> But there are many text operations that require access to Unicode code
|> points. Take for example text layout, as mapping characters to glyphs
|> and back has to operate on code points. The idea that you never need to
|> work with code points is too simplistic.
|
|I didn’t say you never needed to work with code points. What I said
|is that there’s no advantage to UCS-4 as an encoding, and that there’s

Well, you do have eleven bits for flags per codepoint, for example.

|no advantage to being able to index a string by code point. As it

With UTF-32 you can take the code point as-is and look it up in the
Unicode classification tables.

|happens, I’ve written the kind of code you cite as an example, including
|glyph mapping and OpenType processing, and the fact is that it’s no
|harder to do it with a UTF-16 string than it is with a UCS-4 string.
|Yes, certainly, surrogate pairs need to be decoded to map to glyphs;
|but that’s a *trivial* matter, particularly as the code point to glyph
|mapping is not 1:1 or even 1:N - it’s N:M, so you already need to cope
|with being able to map multiple code units in the string to multiple
|glyphs in the result.
If you have to iterate over a string to perform some high-level
processing, then UTF-8 is an almost equally fine choice, for the very
same reasons you bring up. And if the usage-pattern "hotness" pictures
shown at the beginning of this thread are correct, then the size
overhead of UTF-8 that the UTF-16 proponents point out turns out not to
matter.

But I for one gave up on making a stand against UTF-16 or BOMs. In
fact I have come to think UTF-16 is a pretty nice in-memory
representation, and it is a small step from it to the real code point
that you need in order to decide what something is and what has to be
done with it.

I don't know whether I would really use it for this purpose, though; I
am pretty sure that my core Unicode functions will (start to /)
continue to use UTF-32, because the code-point-to-code-point(s) mapping
is what is described, and anything else can be implemented on top of
it. I.e., you can store three UTF-32 code points in a single uint64_t,
and I would shoot myself in the foot if I made this accessible via a
UTF-16 or UTF-8 converter, imho; instead, I (will) make it accessible
directly as UTF-32, and that serves all other formats equally well.

Of course, if it is clear that you are UTF-16 all the way through, then
you can save the conversion, but most (widespread) Unices are UTF-8
based and it looks as if that will stay. Yes, you can nonetheless use
UTF-16, but it will most likely not save you anything on the database
side, due to storage alignment requirements and the necessity to be
able to access data somewhere. If you have a single index-lookup array
and a dynamically sized database storage which uses two-byte alignment,
then I can imagine UTF-16 is for the better. I never looked at how ICU
does it, but I have been impressed by sheer data facts ^.^

--steffen
Re: "A Programmer's Introduction to Unicode"
On 14 Mar 2017, at 02:03, Richard Wordingham wrote:

> On Mon, 13 Mar 2017 19:18:00 +, Alastair Houghton wrote:
>
>> IMO, returning code points by index is a mistake. It over-emphasises
>> the importance of the code point, which helps to continue the notion
>> in some developers’ minds that code points are somehow “characters”.
>> It also leads to people unnecessarily using UCS-4 as an internal
>> representation, which seems to have very few advantages in practice
>> over UTF-16.
>
> The problem is that UTF-16 based code can very easily overlook the
> handling of surrogate pairs, and one can very easily get confused over
> what string lengths mean.

Yet the same problem exists for UCS-4; it could very easily overlook
the handling of combining characters. As for string lengths, string
lengths in code points are no more meaningful than string lengths in
UTF-16 code units. They don’t tell you anything about the number of
user-visible characters; or anything about the width the string will
take up if rendered on the display (even in a fixed-width font); or
anything about the number of glyphs that a given string might be
transformed into by glyph mapping. The *only* thing a string length of
a Unicode string will tell you is the number of code units.

Kind regards,

Alastair.

--
http://alastairs-place.net
Re: "A Programmer's Introduction to Unicode"
On 13 Mar 2017, at 21:10, Khaled Hosny wrote:

> On Mon, Mar 13, 2017 at 07:18:00PM +, Alastair Houghton wrote:
>
>> On 13 Mar 2017, at 17:55, J Decker wrote:
>>>
>>> I liked the Go implementation of character type - a rune type -
>>> which is a codepoint, and strings that return runes by index.
>>> https://blog.golang.org/strings
>>
>> IMO, returning code points by index is a mistake. It over-emphasises
>> the importance of the code point, which helps to continue the notion
>> in some developers’ minds that code points are somehow “characters”.
>> It also leads to people unnecessarily using UCS-4 as an internal
>> representation, which seems to have very few advantages in practice
>> over UTF-16.
>
> But there are many text operations that require access to Unicode code
> points. Take for example text layout, as mapping characters to glyphs
> and back has to operate on code points. The idea that you never need
> to work with code points is too simplistic.

I didn’t say you never needed to work with code points. What I said is
that there’s no advantage to UCS-4 as an encoding, and that there’s no
advantage to being able to index a string by code point. As it happens,
I’ve written the kind of code you cite as an example, including glyph
mapping and OpenType processing, and the fact is that it’s no harder to
do it with a UTF-16 string than it is with a UCS-4 string. Yes,
certainly, surrogate pairs need to be decoded to map to glyphs; but
that’s a *trivial* matter, particularly as the code point to glyph
mapping is not 1:1 or even 1:N - it’s N:M, so you already need to cope
with being able to map multiple code units in the string to multiple
glyphs in the result.

Kind regards,

Alastair.

--
http://alastairs-place.net
Re: "A Programmer's Introduction to Unicode"
Ah, it was what I thought you were talking about -- I wasn't aware they
were considered word boundaries :) Thanks for the links!

On Mar 13, 2017 4:54 PM, "Richard Wordingham"
<richard.wording...@ntlworld.com> wrote:

> On Mon, 13 Mar 2017 15:26:00 -0700, Manish Goregaokar wrote:
>
>> Do you have examples of AA being split that way (and further
>> reading)? I think I'm aware of what you're talking about, but would
>> love to read more about it.
>
> Just googling for the three words 'Sanskrit', 'sandhi' and
> 'resolution' brings up plenty of papers and discussion, e.g. Hellwig's
> at http://ltc.amu.edu.pl/book/papers/LRL-1.pdf and a multi-author
> paper at https://www.aclweb.org/anthology/C/C16/C16-1048.pdf. There
> are even technical terms for before and after: unsplit text is
> 'samhita text', and text split into words is 'pada text'.
>
> Richard.