Re: "A Programmer's Introduction to Unicode"
"Doug Ewell"wrote: |Philippe Verdy wrote: |>>> Well, you do have eleven bits for flags per codepoint, for example. |>> |>> That's not UCS-4; that's a custom encoding. |>> |>> (any UCS-4 code unit) & 0xFFE0 == 0 | |(changing to "UTF-32" per Ken's observation) | |> Per definition yes, but UTC-4 is not Unicode. | |I guess it's not. What is UTC-4, anyway? Another name for a UWG meeting |held in 1989? | |> As well (any UCS-4 code unit) & 0xFFE0 == 0 (i.e. 21 bits) is not |> Unicode, UTF-32 is Unicode (more restrictive than just 21 bits which |> would allow 32 planes instead of just the 17 first ones). | |I used bitwise arithmetic strictly to address Steffen's premise that the |11 "unused bits" in a UTF-32 code unit were available to store metadata |about the code point. Of course UTF-32 does not allow 0x11 through |0x1F either. | |> I suppose he meant 21 bits, not 11 bits which covers only a small part |> of the BMP. | |No, his comment "you do have eleven bits for flags per codepoint" pretty |clearly referred to using the "extra" 11 bits beyond what is needed to |hold the Unicode scalar value. It surely is a weak argument for a general string encoding. But sometimes, and for local use cases it surely is valid. You could store the wcwidth(3) plus a graphem codepoint count both in these bits of the first codepoint of a cluster, for example, and, then, that storage detail hidden under an access method interface. --steffen
Re: "A Programmer's Introduction to Unicode"
On Tue, 14 Mar 2017 08:51:18 + Alastair Houghton wrote:

> On 14 Mar 2017, at 02:03, Richard Wordingham wrote:
> >
> > On Mon, 13 Mar 2017 19:18:00 +
> > Alastair Houghton wrote:
>
> > The problem is that UTF-16 based code can very easily overlook the
> > handling of surrogate pairs, and one can very easily get confused over
> > what string lengths mean.
>
> Yet the same problem exists for UCS-4; it could very easily overlook
> the handling of combining characters.

That's a different issue. I presume you mean the issues of canonical equivalence and detecting text boundaries. Again, there is the problem of remembering to consider the whole surrogate pair when using UTF-16. (I suppose this could be largely handled by avoiding the concept of arrays.) Now, the supplementary characters where these issues arise are very infrequently used. An error in UTF-16 code might easily not come to attention, whereas a problem with UCS-4 (or UTF-8) comes to light as soon as one handles Thai or IPA.

> As for string lengths, string
> lengths in code points are no more meaningful than string lengths in
> UTF-16 code units. They don’t tell you anything about the number of
> user-visible characters; or anything about the width the string will
> take up if rendered on the display (even in a fixed-width font); or
> anything about the number of glyphs that a given string might be
> transformed into by glyph mapping. The *only* thing a string length
> of a Unicode string will tell you is the number of code units.

A string length in codepoints does have the advantage of being independent of encoding. I'm actually using an index for UTF-16 text (I don't know whether it's denominated in codepoints or code units) to index into the UTF-8 source code. However, the number of code units is the more commonly used quantity, as it tells one how much memory is required for simple array storage.

Richard.
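The index translation Richard mentions (a position expressed for UTF-16 text, used to index into UTF-8 source) can be sketched as follows, assuming the index is denominated in UTF-16 code units and the text is well-formed. The function name `utf16_to_utf8_offset` is made up for this illustration.

```python
def utf16_to_utf8_offset(text, utf16_offset):
    """Map an offset in UTF-16 code units to an offset in UTF-8 bytes.

    Assumes `text` is a well-formed Python str (no lone surrogates).
    Raises ValueError if the offset lands inside a surrogate pair or
    beyond the end of the text.
    """
    units = 0    # UTF-16 code units consumed so far
    nbytes = 0   # UTF-8 bytes consumed so far
    for ch in text:
        if units >= utf16_offset:
            break
        # Supplementary-plane characters take two UTF-16 code units.
        units += 2 if ord(ch) > 0xFFFF else 1
        nbytes += len(ch.encode("utf-8"))
    if units != utf16_offset:
        raise ValueError("offset is inside a surrogate pair or out of range")
    return nbytes


# 'a', THAI CHARACTER KO KAI, GRINNING FACE (supplementary), 'b'
sample = "a\u0e01\U0001F600b"
offsets = [utf16_to_utf8_offset(sample, i) for i in (0, 1, 2, 4, 5)]
```

If the index were denominated in code points instead, the loop would count one per character regardless of plane, which is exactly the encoding independence pointed out above.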
RE: "A Programmer's Introduction to Unicode"
Philippe Verdy wrote:

>>> Well, you do have eleven bits for flags per codepoint, for example.
>>
>> That's not UCS-4; that's a custom encoding.
>>
>> (any UCS-4 code unit) & 0xFFE0 == 0

(changing to "UTF-32" per Ken's observation)

> Per definition yes, but UTC-4 is not Unicode.

I guess it's not. What is UTC-4, anyway? Another name for a UWG meeting held in 1989?

> As well (any UCS-4 code unit) & 0xFFE0 == 0 (i.e. 21 bits) is not
> Unicode, UTF-32 is Unicode (more restrictive than just 21 bits which
> would allow 32 planes instead of just the 17 first ones).

I used bitwise arithmetic strictly to address Steffen's premise that the 11 "unused bits" in a UTF-32 code unit were available to store metadata about the code point. Of course UTF-32 does not allow 0x11 through 0x1F either.

> I suppose he meant 21 bits, not 11 bits which covers only a small part
> of the BMP.

No, his comment "you do have eleven bits for flags per codepoint" pretty clearly referred to using the "extra" 11 bits beyond what is needed to hold the Unicode scalar value.

--
Doug Ewell | Thornton, CO, US | ewellic.org
Re: "A Programmer's Introduction to Unicode"
Per definition yes, but UTC-4 is not Unicode.

As well (any UCS-4 code unit) & 0xFFE0 == 0 (i.e. 21 bits) is not Unicode, UTF-32 is Unicode (more restrictive than just 21 bits which would allow 32 planes instead of just the 17 first ones).

I suppose he meant 21 bits, not 11 bits which covers only a small part of the BMP.

2017-03-14 16:14 GMT+01:00 Doug Ewell:

> Steffen Nurpmeso wrote:
>
> >> I didn’t say you never needed to work with code points. What I said
> >> is that there’s no advantage to UCS-4 as an encoding, and that
> >
> > Well, you do have eleven bits for flags per codepoint, for example.
>
> That's not UCS-4; that's a custom encoding.
>
> (any UCS-4 code unit) & 0xFFE0 == 0
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
Re: "A Programmer's Introduction to Unicode"
Steffen Nurpmeso wrote:

>> I didn’t say you never needed to work with code points. What I said
>> is that there’s no advantage to UCS-4 as an encoding, and that
>
> Well, you do have eleven bits for flags per codepoint, for example.

That's not UCS-4; that's a custom encoding.

(any UCS-4 code unit) & 0xFFE0 == 0

--
Doug Ewell | Thornton, CO, US | ewellic.org
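As a hedged illustration of the bitwise test, the sketch below assumes the intended mask is the upper 11 bits of a 32-bit unit, i.e. 0xFFE00000; the archived text shows 0xFFE0, which read literally covers bits 5 to 15 and would reject even U+00E9. The function names are invented for this example. It also shows that "fits in 21 bits" is a weaker condition than "is a Unicode scalar value", since planes 0x11 through 0x1F and the surrogate range are excluded as well.

```python
UPPER_11_BITS = 0xFFE00000   # assumed intended mask for a 32-bit code unit
MASK_AS_ARCHIVED = 0xFFE0    # the mask as it appears in the archived message


def fits_in_21_bits(unit):
    """True if none of the upper 11 bits of a 32-bit unit are set."""
    # Parentheses matter in C-like languages, where == binds tighter than &.
    return (unit & UPPER_11_BITS) == 0


def is_scalar_value(unit):
    """True if `unit` is a Unicode scalar value, hence a valid UTF-32 code unit."""
    return 0 <= unit <= 0x10FFFF and not (0xD800 <= unit <= 0xDFFF)


checks = {
    "U+00E9 passes the 32-bit mask": fits_in_21_bits(0x00E9),
    "U+00E9 passes the archived mask": (0x00E9 & MASK_AS_ARCHIVED) == 0,
    "0x1FFFFF fits in 21 bits": fits_in_21_bits(0x1FFFFF),
    "0x1FFFFF is a scalar value": is_scalar_value(0x1FFFFF),
}
```

Any unit carrying flags in its upper bits fails `fits_in_21_bits`, which is the sense in which such storage is a custom encoding.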
Re: "A Programmer's Introduction to Unicode"
Alastair Houghton wrote:
 |On 13 Mar 2017, at 21:10, Khaled Hosny wrote:
 |> On Mon, Mar 13, 2017 at 07:18:00PM +, Alastair Houghton wrote:
 |>> On 13 Mar 2017, at 17:55, J Decker wrote:
 |>>>
 |>>> I liked the Go implementation of character type - a rune type - \
 |>>> which is a codepoint. and strings that return runes from by index.
 |>>> https://blog.golang.org/strings
 |>>
 |>> IMO, returning code points by index is a mistake. It over-emphasises
 |>> the importance of the code point, which helps to continue the notion
 |>> in some developers’ minds that code points are somehow “characters”.
 |>> It also leads to people unnecessarily using UCS-4 as an internal
 |>> representation, which seems to have very few advantages in practice
 |>> over UTF-16.
 |>
 |> But there are many text operations that require access to Unicode code
 |> points. Take for example text layout, as mapping characters to glyphs
 |> and back has to operate on code points. The idea that you never need to
 |> work with code points is too simplistic.
 |
 |I didn’t say you never needed to work with code points. What I said \
 |is that there’s no advantage to UCS-4 as an encoding, and that there’s \

Well, you do have eleven bits for flags per codepoint, for example.

 |no advantage to being able to index a string by code point. As it \

With UTF-32 you can take the very codepoint and look up Unicode classification tables.

 |happens, I’ve written the kind of code you cite as an example, including \
 |glyph mapping and OpenType processing, and the fact is that it’s no \
 |harder to do it with a UTF-16 string than it is with a UCS-4 string. \
 | Yes, certainly, surrogate pairs need to be decoded to map to glyphs; \
 |but that’s a *trivial* matter, particularly as the code point to glyph \
 |mapping is not 1:1 or even 1:N - it’s N:M, so you already need to cope \
 |with being able to map multiple code units in the string to multiple \
 |glyphs in the result.

If you have to iterate over a string to perform some high-level processing then UTF-8 is a choice almost equally fine, for the very same reasons you bring in. And if the usage pattern "hotness" pictures that this thread has shown up at the beginning are correct, then the size overhead of UTF-8 that the UTF-16 proponents point out turns out to be a flop.

But i for one gave up on making a stand against UTF-16 or BOMs. In fact i have turned to think UTF-16 is a pretty nice in-memory representation, and it is a small step to get from it to the real codepoint that you need to decide what something is, and what has to be done with it. I don't know whether i would really use it for this purpose, though; i am pretty sure that my core Unicode functions will (start to /) continue to use UTF-32, because the codepoint to codepoint(s) mapping is what is described, and is what anything else can be implemented on top of. I.e., you can store three UTF-32 codepoints in a single uint64_t, and i would shoot myself in the foot if i would make this accessible via a UTF-16 or UTF-8 converter, imho; instead, i (will) make it accessible directly as UTF-32, and that serves all other formats equally well.

Of course, if it is clear that you are UTF-16 all-through-the-way then you can save the conversion, but (the) most (widespread) Uni(x|ces) are UTF-8 based and it looks as if that would stay. Yes, yes, you can nonetheless use UTF-16, but it will most likely not save you anything on the database side due to storage alignment requirements, and the necessity to be able to access data somewhere. You can have a single index-lookup array and a dynamically sized database storage which uses two-byte alignment, of course; then i can imagine UTF-16 is for the better. I never looked at how ICU does it, but i have been impressed by sheer data facts ^.^

--steffen
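One reading of "three UTF-32 codepoints in a single uint64_t" is that each code point needs at most 21 bits, so three of them occupy 63 of the 64 bits. The sketch below illustrates that arithmetic only; the helper names `pack3` and `unpack3` and the field order are assumptions, not Steffen's actual data layout.

```python
BITS_PER_CODE_POINT = 21
CODE_POINT_MASK = (1 << BITS_PER_CODE_POINT) - 1   # 0x1FFFFF


def pack3(a, b, c):
    """Pack three code points into one 64-bit word (3 * 21 = 63 bits used)."""
    for cp in (a, b, c):
        if not 0 <= cp <= 0x10FFFF:
            raise ValueError("not a Unicode code point: %r" % (cp,))
    return a | (b << BITS_PER_CODE_POINT) | (c << (2 * BITS_PER_CODE_POINT))


def unpack3(word):
    """Recover the three code points directly, with no UTF-8/UTF-16 detour."""
    return (
        word & CODE_POINT_MASK,
        (word >> BITS_PER_CODE_POINT) & CODE_POINT_MASK,
        (word >> (2 * BITS_PER_CODE_POINT)) & CODE_POINT_MASK,
    )


# Example: a hypothetical one-to-three mapping result stored in one word.
word = pack3(0x0053, 0x0053, 0x10FFFF)
```

Entries of this kind would suit a code point to code point(s) table such as a case-folding result, where each value is at most three code points long.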
Re: "A Programmer's Introduction to Unicode"
On 14 Mar 2017, at 02:03, Richard Wordingham wrote:
>
> On Mon, 13 Mar 2017 19:18:00 +
> Alastair Houghton wrote:
>
>> IMO, returning code points by index is a mistake. It over-emphasises
>> the importance of the code point, which helps to continue the notion
>> in some developers’ minds that code points are somehow “characters”.
>> It also leads to people unnecessarily using UCS-4 as an internal
>> representation, which seems to have very few advantages in practice
>> over UTF-16.
>
> The problem is that UTF-16 based code can very easily overlook the
> handling of surrogate pairs, and one can very easily get confused over what
> string lengths mean.

Yet the same problem exists for UCS-4; it could very easily overlook the handling of combining characters. As for string lengths, string lengths in code points are no more meaningful than string lengths in UTF-16 code units. They don’t tell you anything about the number of user-visible characters; or anything about the width the string will take up if rendered on the display (even in a fixed-width font); or anything about the number of glyphs that a given string might be transformed into by glyph mapping. The *only* thing a string length of a Unicode string will tell you is the number of code units.

Kind regards,

Alastair.

--
http://alastairs-place.net
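To make the point about string lengths concrete, here is a small sketch that counts the same text in code points and in UTF-8, UTF-16 and UTF-32 code units. The sample strings are chosen for this illustration; none of the counts equals the number of user-perceived characters in general.

```python
def lengths(s):
    """Return the length of `s` in several units; none is a 'character' count."""
    return {
        "code_points": len(s),
        "utf8_code_units": len(s.encode("utf-8")),
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        "utf32_code_units": len(s.encode("utf-32-le")) // 4,
    }


# 'e' + COMBINING MACRON + COMBINING ACUTE ACCENT + COMBINING DIAERESIS:
# one user-perceived character built from four code points.
stacked_e = "e\u0304\u0301\u0308"

# GRINNING FACE: one code point, but two UTF-16 code units (a surrogate pair).
emoji = "\U0001F600"

stacked_lengths = lengths(stacked_e)
emoji_lengths = lengths(emoji)
```

The first string is a single user-perceived character in every encoding yet never has length 1; the second has length 1 only when counted in code points or UTF-32 code units.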
Re: "A Programmer's Introduction to Unicode"
On 13 Mar 2017, at 21:10, Khaled Hosny wrote:
>
> On Mon, Mar 13, 2017 at 07:18:00PM +, Alastair Houghton wrote:
>> On 13 Mar 2017, at 17:55, J Decker wrote:
>>>
>>> I liked the Go implementation of character type - a rune type - which is a
>>> codepoint. and strings that return runes from by index.
>>> https://blog.golang.org/strings
>>
>> IMO, returning code points by index is a mistake. It over-emphasises
>> the importance of the code point, which helps to continue the notion
>> in some developers’ minds that code points are somehow “characters”.
>> It also leads to people unnecessarily using UCS-4 as an internal
>> representation, which seems to have very few advantages in practice
>> over UTF-16.
>
> But there are many text operations that require access to Unicode code
> points. Take for example text layout, as mapping characters to glyphs
> and back has to operate on code points. The idea that you never need to
> work with code points is too simplistic.

I didn’t say you never needed to work with code points. What I said is that there’s no advantage to UCS-4 as an encoding, and that there’s no advantage to being able to index a string by code point. As it happens, I’ve written the kind of code you cite as an example, including glyph mapping and OpenType processing, and the fact is that it’s no harder to do it with a UTF-16 string than it is with a UCS-4 string. Yes, certainly, surrogate pairs need to be decoded to map to glyphs; but that’s a *trivial* matter, particularly as the code point to glyph mapping is not 1:1 or even 1:N - it’s N:M, so you already need to cope with being able to map multiple code units in the string to multiple glyphs in the result.

Kind regards,

Alastair.

--
http://alastairs-place.net
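The surrogate-pair decoding referred to above is indeed short. The following sketch walks a sequence of UTF-16 code units and yields code points; the function name and the policy of passing unpaired surrogates through unchanged are choices made for this illustration, not anything taken from Alastair's code.

```python
def decode_utf16_units(units):
    """Yield code points from a sequence of UTF-16 code units (ints).

    A high surrogate followed by a low surrogate is combined into one
    supplementary code point; anything else is yielded unchanged, so an
    unpaired surrogate is passed through rather than rejected.
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            high = u - 0xD800
            low = units[i + 1] - 0xDC00
            yield 0x10000 + (high << 10) + low
            i += 2
        else:
            yield u
            i += 1


# 'A' followed by the surrogate pair for U+1F600 GRINNING FACE.
code_points = list(decode_utf16_units([0x0041, 0xD83D, 0xDE00]))
```

A layout engine that already maps N code units to M glyphs needs a loop of this shape anyway, which is the argument made in the message.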
Re: "A Programmer's Introduction to Unicode"
Ah, it was what I thought you were talking about -- I wasn't aware they were considered word boundaries :) Thanks for the links! On Mar 13, 2017 4:54 PM, "Richard Wordingham" < richard.wording...@ntlworld.com> wrote: On Mon, 13 Mar 2017 15:26:00 -0700 Manish Goregaokarwrote: > Do you have examples of AA being split that way (and further reading)? > I think I'm aware of what you're talking about, but would love to read > more about it. Just googling for the three words 'Sanskrit', 'sandhi' and 'resolution' brings up plenty of papers and discussion, e.g. Hellwig's at http://ltc.amu.edu.pl/book/papers/LRL-1.pdf and a multi-author paper at https://www.aclweb.org/anthology/C/C16/C16-1048.pdf. There are even technical terms for before and after. Unsplit text is 'samhita text', and text split into words is 'pada text'. Richard.
Re: "A Programmer's Introduction to Unicode"
On Mon, 13 Mar 2017 19:18:00 + Alastair Houghton wrote:

> IMO, returning code points by index is a mistake. It over-emphasises
> the importance of the code point, which helps to continue the notion
> in some developers’ minds that code points are somehow “characters”.
> It also leads to people unnecessarily using UCS-4 as an internal
> representation, which seems to have very few advantages in practice
> over UTF-16.

The problem is that UTF-16 based code can very easily overlook the handling of surrogate pairs, and one can very easily get confused over what string lengths mean.

Richard.
Re: "A Programmer's Introduction to Unicode"
On Mon, 13 Mar 2017 20:20:25 -0400 "Mark E. Shoulson" wrote:

> Sanskrit external vowel sandhi is comparatively
> straightforward (compared to consonant sandhi), and it frequently
> loses information. A *or* AA plus I is E; A *or* AA plus U is O (you
> need A + O to get AU).

Indeed, E can not only be A or AA plus I or II: it can also be E + A. In the latter case avagraha is usual, at least in European practice. (Would that generally be locale sa_Deva_GB?) I'd like advice on modern Indian practice, and on the spacing and syllable division. I've seen a claim that avagraha always belongs with the preceding vowel, but I'm not sure that that rule applies in this case.

In a similar fashion, O can be -AS + A-, an interesting case of visarga sandhi.

However, I'm not sure that one would want to *divide* the E or O.

Richard.
Re: "A Programmer's Introduction to Unicode"
A word ending in A *or* AA preceding a word beginning in A *or* AA will all coalesce to a single AA in Sanskrit. That's four possibilities, and that doesn't count a word ending in a consonant preceding a word beginning in AA, which would be written the same.

My memory is rusty, so I should actually be looking things up, but I think these are valid constructions:

न + अगच्छत् → नागच्छत्
न + आगच्छत् → नागच्छत्

(and indeed, आगच्छत् is the upasarga आ plus अगच्छत्, so there too the A + AA coalesced.)

I should probably find you examples for all the other possibilities.

Sanskrit external vowel sandhi is comparatively straightforward (compared to consonant sandhi), and it frequently loses information. A *or* AA plus I is E; A *or* AA plus U is O (you need A + O to get AU).

~mark

On 03/13/2017 06:26 PM, Manish Goregaokar wrote:
> Do you have examples of AA being split that way (and further reading)?
> I think I'm aware of what you're talking about, but would love to read
> more about it.
>
> -Manish
>
> On Mon, Mar 13, 2017 at 2:47 PM, Richard Wordingham wrote:
>> On Mon, 13 Mar 2017 23:10:11 +0200
>> Khaled Hosny wrote:
>>
>>> But there are many text operations that require access to Unicode code
>>> points. Take for example text layout, as mapping characters to glyphs
>>> and back has to operate on code points. The idea that you never need
>>> to work with code points is too simplistic.
>>
>> There are advantages to interpreting and operating on text as though it
>> were in form NFD. However, there are still cases where one needs
>> fractions of a character, such as word boundaries in Sanskrit, though I
>> think the locations are liable to be specified in a language-specific
>> form. U+093E DEVANAGARI VOWEL SIGN AA can have a word boundary in it
>> in at least 4 ways.
>>
>> Richard.
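The information loss can be seen at the code point level with a deliberately tiny sketch. It models only the one rule stated above (final A or AA plus initial A or AA gives AA) on Devanagari strings; the function `join_a_sandhi` is invented for this illustration, it assumes the first word ends either in a bare consonant letter (inherent A) or in the vowel sign AA, and it is in no way a sandhi implementation.

```python
LETTER_A = "\u0905"    # DEVANAGARI LETTER A
LETTER_AA = "\u0906"   # DEVANAGARI LETTER AA
SIGN_AA = "\u093e"     # DEVANAGARI VOWEL SIGN AA


def join_a_sandhi(first, second):
    """Toy rule: (A or AA) + (A or AA) coalesce into a single vowel sign AA.

    Assumes `first` ends in a consonant letter carrying the inherent A,
    or in VOWEL SIGN AA, and `second` begins with LETTER A or LETTER AA.
    """
    if second[:1] not in (LETTER_A, LETTER_AA):
        raise ValueError("second word must begin with A or AA")
    stem = first[:-1] if first.endswith(SIGN_AA) else first
    return stem + SIGN_AA + second[1:]


NA = "\u0928"                                          # na
AGACCHAT = "\u0905\u0917\u091a\u094d\u091b\u0924\u094d"   # agacchat
AAGACCHAT = "\u0906\u0917\u091a\u094d\u091b\u0924\u094d"  # aagacchat

joined_1 = join_a_sandhi(NA, AGACCHAT)
joined_2 = join_a_sandhi(NA, AAGACCHAT)
```

Both inputs produce the same joined string, so the single U+093E in the output cannot tell you which of the source vowels it came from, and any word boundary falls inside that one code point.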
Re: "A Programmer's Introduction to Unicode"
On Mon, 13 Mar 2017 15:26:00 -0700 Manish Goregaokarwrote: > Do you have examples of AA being split that way (and further reading)? > I think I'm aware of what you're talking about, but would love to read > more about it. Just googling for the three words 'Sanskrit', 'sandhi' and 'resolution' brings up plenty of papers and discussion, e.g. Hellwig's at http://ltc.amu.edu.pl/book/papers/LRL-1.pdf and a multi-author paper at https://www.aclweb.org/anthology/C/C16/C16-1048.pdf. There are even technical terms for before and after. Unsplit text is 'samhita text', and text split into words is 'pada text'. Richard.
Re: "A Programmer's Introduction to Unicode"
Do you have examples of AA being split that way (and further reading)? I think I'm aware of what you're talking about, but would love to read more about it. -Manish On Mon, Mar 13, 2017 at 2:47 PM, Richard Wordinghamwrote: > On Mon, 13 Mar 2017 23:10:11 +0200 > Khaled Hosny wrote: > >> But there are many text operations that require access to Unicode code >> points. Take for example text layout, as mapping characters to glyphs >> and back has to operate on code points. The idea that you never need >> to work with code points is too simplistic. > > There are advantages to interpreting and operating on text as though it > were in form NFD. However, there are still cases where one needs > fractions of a character, such as word boundaries in Sanskrit, though I > think the locations are liable to be specified in a language-specific > form. U+093E DEVANAGARI VOWEL SIGN AA can have a word boundary in it > in at least 4 ways. > > Richard.
Re: "A Programmer's Introduction to Unicode"
On Mon, 13 Mar 2017 23:10:11 +0200 Khaled Hosny wrote:

> But there are many text operations that require access to Unicode code
> points. Take for example text layout, as mapping characters to glyphs
> and back has to operate on code points. The idea that you never need
> to work with code points is too simplistic.

There are advantages to interpreting and operating on text as though it were in form NFD. However, there are still cases where one needs fractions of a character, such as word boundaries in Sanskrit, though I think the locations are liable to be specified in a language-specific form. U+093E DEVANAGARI VOWEL SIGN AA can have a word boundary in it in at least 4 ways.

Richard.
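A short sketch of what "operating on text as though it were in form NFD" buys you, using Python's standard `unicodedata` module. The helper name `canonically_equal` is made up here. The last line illustrates the limit Richard describes: U+093E has no canonical decomposition, so normalization cannot expose a boundary inside it.

```python
import unicodedata


def canonically_equal(a, b):
    """Compare two strings after bringing both into NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT

raw_equal = composed == decomposed                   # code point comparison
nfd_equal = canonically_equal(composed, decomposed)  # comparison under NFD

# VOWEL SIGN AA is unchanged by NFD: there is nothing smaller to split it into.
sign_aa_nfd = unicodedata.normalize("NFD", "\u093e")
```

So NFD gives a uniform view of combining sequences, but sub-character positions such as the Sanskrit case still need a language-specific representation.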
Re: "A Programmer's Introduction to Unicode"
On Mon, Mar 13, 2017 at 07:18:00PM +, Alastair Houghton wrote: > On 13 Mar 2017, at 17:55, J Deckerwrote: > > > > I liked the Go implementation of character type - a rune type - which is a > > codepoint. and strings that return runes from by index. > > https://blog.golang.org/strings > > IMO, returning code points by index is a mistake. It over-emphasises > the importance of the code point, which helps to continue the notion > in some developers’ minds that code points are somehow “characters”. > It also leads to people unnecessarily using UCS-4 as an internal > representation, which seems to have very few advantages in practice > over UTF-16. But there are many text operations that require access to Unicode code points. Take for example text layout, as mapping characters to glyphs and back has to operate on code points. The idea that you never need to work with code points is too simplistic. Regards, Khaled
Re: "A Programmer's Introduction to Unicode"
On 13 Mar 2017, at 17:55, J Decker wrote:
>
> I liked the Go implementation of character type - a rune type - which is a
> codepoint. and strings that return runes from by index.
> https://blog.golang.org/strings

IMO, returning code points by index is a mistake. It over-emphasises the importance of the code point, which helps to continue the notion in some developers’ minds that code points are somehow “characters”. It also leads to people unnecessarily using UCS-4 as an internal representation, which seems to have very few advantages in practice over UTF-16.

> Doesn't solve the problem for composited codepoints though...
>
> texel looks to be defined as a graphic element already. TEXture ELement.

Yes, but I thought the proposal was “textel”, with the extra “t”. Re-using “texel” would be quite inappropriate; there are certainly people who work on rendering software who would strongly object to that, for very good reasons.

I would caution, however, that there’s already a lot of terminology associated with Unicode, perhaps for understandable reasons, but if the word “textel” is going to have a definition that differs from (say) an extended grapheme cluster, I think a great deal of consideration should be given to what exactly that definition should be. We already have “characters”, code units, code points, combining sequences, graphemes, grapheme clusters, extended grapheme clusters and probably other things I’ve missed off that list. Merely adding yet another bit of terminology isn’t going to fix the problem of developers misunderstanding or simply not being aware of the correct terminology or of some aspect of Unicode’s behaviour.

Kind regards,

Alastair.

--
http://alastairs-place.net
Re: "A Programmer's Introduction to Unicode"
I liked the Go implementation of character type - a rune type - which is a codepoint. and strings that return runes from by index. https://blog.golang.org/strings Doesn't solve the problem for composited codepoints though... texel looks to be defined as a graphic element already. TEXture ELement. On Mon, Mar 13, 2017 at 10:15 AM, Janusz S. Bienwrote: > Quote/Cytat - Asmus Freytag (Mon 13 Mar 2017 > 06:00:08 PM CET): > > [...] > > This (or similar) scenarios indicate the impossibility to come to a > single, universal definition of a "textel" -- the main reason why this > term is of lower utility than "pixel". > > I agree that it is impossible to come to a single, universal definition > of text elements, but it seems possible to reach a consensus on a kind of > the least common denominator of them and call it "textel" or something else. > > > Best regards > > Janusz > > -- > Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra > Lingwistyki Formalnej) > Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department) > jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~ > jsbien/ > >
Re: "A Programmer's Introduction to Unicode"
Quote/Cytat - J Decker(Mon 13 Mar 2017 06:55:18 PM CET): texel looks to be defined as a graphic element already. TEXture ELement. I'm aware of it, but homonymy/polysemy is something we have to live with. I think there is no risk of confusing texture elements with text elements, despite the fact that 'texture' and 'text' have similar origin. Best regards Janusz -- Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department) jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
Re: "A Programmer's Introduction to Unicode"
Quote/Cytat - Asmus Freytag(Mon 13 Mar 2017 06:00:08 PM CET): [...] This (or similar) scenarios indicate the impossibility to come to a single, universal definition of a "textel" -- the main reason why this term is of lower utility than "pixel". I agree that it is impossible to come to a single, universal definition of text elements, but it seems possible to reach a consensus on a kind of the least common denominator of them and call it "textel" or something else. Best regards Janusz -- Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department) jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
Re: "A Programmer's Introduction to Unicode"
On 3/13/2017 3:31 AM, Janusz S. Bien wrote:
> Just yet another reason for introducing the notion of textel?

The main difference between "textel" and "pixel" is that the unit of processing / displaying text is not uniform and fixed, unlike a pixel.

In other words, different operations may need to look at text differently, and I don't mean the trivial case of storage (byte level) vs. any higher level.

Correspondingly, the discussion of "text element", at least in the early versions of the Unicode Standard, left the particular division of the text into "text elements" unspecified.

There are closely related tasks that might demonstrate this. Assume a script where multiple code points make up a syllable, yet that syllable is the intuitive basic unit of reading and writing.

One task is cursor placement. For that task, you need to be able to divide *any* text so that the cursor ideally does not get positioned in the middle of a syllable. However, the definition of a "syllable" has to allow degenerate and 'defective' cases. Which is which is of no importance, as long as it is possible to find a valid cursor position.

The other task would be to assert that a string contains only well-formed syllables. Here, it is crucially necessary to be able to define which syllables are well-formed. Finding divisions in parts of the string that do not contain well-formed syllables is not necessary.

You may also find that in some cases, even though the syllable is the basic unit, there may be a need to edit it in ways other than as a unit. Some syllables may have some optional marks, signs or symbols added that may need to be edited or traversed explicitly, while a "core" syllable may be more likely to be a unit.

This (or similar) scenarios indicate the impossibility to come to a single, universal definition of a "textel" -- the main reason why this term is of lower utility than "pixel".

A./
Re: "A Programmer's Introduction to Unicode"
Prof. Janusz S. Bień wrote:

> Just yet another reason for introducing the notion of textel?

I opine that it would be a good idea to introduce several new words, of which textel would be one, with each such new word having a precisely-defined meaning so that in precise discussions of programming techniques people could discuss the situation without needing to use any of the words character, code point, grapheme cluster.

How many such new words would be needed?

I remember how in electronics the introduction of the term Hertz to be used instead of cycles per second helped discussions. After the introduction of the term Hertz it became easy to refer to twenty cycles of a fifty Hertz signal without confusion over one's meaning.

So introducing several new precisely-defined words now could help lots of discussions in the future.

Perhaps, apart from textel, the definitions could be produced first and then people can decide, for each such definition, which new word would be a good word to have that definition.

The recent introduction into Unicode of ZWJ sequences for some emoji, and of tag sequences applied to a base character, could mean that introducing such new words becomes of increasing importance due to the programming implications of those recently introduced techniques.

William Overington

Monday 13 March 2017
Re: "A Programmer's Introduction to Unicode"
Quote/Cytat - William_J_G Overington(Mon 13 Mar 2017 12:24:13 PM CET): Prof. Janusz S. Bień wrote: Just yet another reason for introducing the notion of textel? I opine that it would be a good idea to introduce several new words, of which textel would be one, with each such new word having a precisely-defined meaning so that in precise discussions of programming techniques people could discuss the situation without needing to use any of the words character, code point, grapheme cluster. How many such new words would be needed? In my paper (in Polish) http://bc.klf.uw.edu.pl/480/ I propose also the term "texton" meaning a code point from a specific subset, not yet fully defined, but including at least the components of composite characters. Best regards Janusz -- Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department) jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
Re: "A Programmer's Introduction to Unicode"
Quote/Cytat - Richard Wordingham(Sun 12 Mar 2017 09:10:22 PM CET): On Sun, 12 Mar 2017 20:02:28 +0100 "Janusz S. Bien" wrote: If the basic notion has to be referred in a cumbersome way as "extended grapheme cluster" then it is easier to talk about "Unicode characters" despite the fact that they have a rather loose relation to real-life/user-perceived characters. The notion that extended grapheme clusters corresponds to user-perceived characters is also rather dodgy. The idea is not mine, but it appears from time to time on the list in a more or less explicit way. Whereas it may work for French, it is getting very dubious by the time one adds Hebrew cantillation marks or Vedic accentuation. The Thais revolted when their preposed vowels were joined with the following consonant in the same extended grapheme cluster, and Unicode had to revoke that union. Just yet another reason for introducing the notion of textel? Best regards Janusz -- Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department) jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
Re: "A Programmer's Introduction to Unicode"
On Sun, 12 Mar 2017 20:02:28 +0100 "Janusz S. Bien" wrote:

> If the basic notion has to be referred in a cumbersome way as
> "extended grapheme cluster" then it is easier to talk about "Unicode
> characters" despite the fact that they have a rather loose relation
> to real-life/user-perceived characters.

The notion that extended grapheme clusters correspond to user-perceived characters is also rather dodgy. Whereas it may work for French, it is getting very dubious by the time one adds Hebrew cantillation marks or Vedic accentuation. The Thais revolted when their preposed vowels were joined with the following consonant in the same extended grapheme cluster, and Unicode had to revoke that union.

Richard.
Re: "A Programmer's Introduction to Unicode"
Quote/Cytat - Manish Goregaokar(Sun 12 Mar 2017 07:43:22 PM CET): This is just another confirmation that the present Unicode terminology is confusing. I find this to be a symptom of our pedagogy around "characters" in programming; most folks get taught that characters are bytes are code points, especially because many languages try to make this the case. The name "grapheme cluster" could be improved upon, but it's not the primary source of this confusion. I agree that it's not the primary source. However the pedagogy depends on the terminology used. If the basic notion has to be referred in a cumbersome way as "extended grapheme cluster" then it is easier to talk about "Unicode characters" despite the fact that they have a rather loose relation to real-life/user-perceived characters. Best regards Janusz -- Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department) jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
Re: "A Programmer's Introduction to Unicode"
> This is just another confirmation that the present Unicode terminology is confusing.

I find this to be a symptom of our pedagogy around "characters" in
programming; most folks get taught that characters are bytes are code
points, especially because many languages try to make this the case.
The name "grapheme cluster" could be improved upon, but it's not the
primary source of this confusion.

-Manish

On Sat, Mar 11, 2017 at 10:04 PM, Janusz S. Bień wrote:
> On Fri, Mar 10 2017 at 19:55 CET, man...@mozilla.com writes:
>> I recently wrote
>> http://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
>> , which sort of addresses the whole hangup programmers have with
>> treating code points as "characters".
>
> [...]
>
> This is just another confirmation that the present Unicode terminology
> is confusing. Let me recall below a fragment of an old thread about
> "textels".
>
> Best regards
>
> Janusz
>
> On Thu, Sep 15 2016 at 21:12 CEST, jsb...@mimuw.edu.pl writes:
>> On Thu, Sep 15 2016 at 16:36 CEST, john.w.kenn...@gmail.com writes:
>>
>> [...]
>>
>>> In the new Swift programming language, which is white-hot in the Apple
>>> community, Apple is moving toward a model of a transparent, generic
>>> Unicode that can be “viewed” as UTF-8, UTF-16, or UTF-32 if necessary,
>>> but in which a “character” contains however many code points it needs
>>> (“e” with a stacked macron, acute accent, and dieresis is
>>> algorithmically one “character” in Swift). Moreover,
>>> e-with-an-acute-accent and e followed by a combining acute accent, for
>>> example, compare as equal. At present, the underlying code is still
>>> UTF-16LE.
>>
>> For several years I have used the name "textel" (text element, in
>> Polish "tekstel") for such objects. I do it mostly orally in my
>> presentations for my students, but I have also used it in writing,
>> e.g. in http://bc.klf.uw.edu.pl/118/, unfortunately without a proper
>> definition. I gave a rudimentary definition only in my recent paper
>> in Polish: http://bc.klf.uw.edu.pl/480/. It states simply (on p. 69)
>> "an elementary text element independently of its Unicode
>> representation" (meaning in particular composed vs precomposed). I
>> still hope to formulate sooner or later a more satisfactory
>> definition :-)
>>
>> I think Swift confirms that such a notion is really needed.
>>
>> Best regards
>>
>> Janusz
>
> On Wed, Sep 21 2016 at 6:44 CEST, jsb...@mimuw.edu.pl writes:
>> On Tue, Sep 20 2016 at 18:09 CEST, d...@ewellic.org writes:
>>> Janusz Bień wrote:
>>>
>>>> For me it means that Swift's characters are equivalence classes of the
>>>> set of extended grapheme clusters by canonical equivalence relation.
>>>
>>> I still hope we can come to some conclusion on the correct Unicode name
>>> for this concept. I don't think non-Unicode interpretations of terms
>>> like "grapheme" are grounds for throwing out "grapheme cluster,"
>>
>> I agree.
>>
>>> but I can see that the equivalence class itself is lacking a name.
>>
>> I'm glad.
>>
>>> Note that the Swift definition doesn't say that <00E9> and <0065 0301>
>>> are identical entities, only that the language compares them as equal.
>>
>> I'm fully aware of this.
>>
>> Best regards
>>
>> Janusz
>
> --
> Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
> Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
> jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
Re: "A Programmer's Introduction to Unicode"
On Fri, Mar 10 2017 at 19:55 CET, man...@mozilla.com writes:
> I recently wrote
> http://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
> , which sort of addresses the whole hangup programmers have with
> treating code points as "characters".

[...]

This is just another confirmation that the present Unicode terminology
is confusing. Let me recall below a fragment of an old thread about
"textels".

Best regards

Janusz

On Thu, Sep 15 2016 at 21:12 CEST, jsb...@mimuw.edu.pl writes:
> On Thu, Sep 15 2016 at 16:36 CEST, john.w.kenn...@gmail.com writes:
>
> [...]
>
>> In the new Swift programming language, which is white-hot in the Apple
>> community, Apple is moving toward a model of a transparent, generic
>> Unicode that can be “viewed” as UTF-8, UTF-16, or UTF-32 if necessary,
>> but in which a “character” contains however many code points it needs
>> (“e” with a stacked macron, acute accent, and dieresis is
>> algorithmically one “character” in Swift). Moreover,
>> e-with-an-acute-accent and e followed by a combining acute accent, for
>> example, compare as equal. At present, the underlying code is still
>> UTF-16LE.
>
> For several years I have used the name "textel" (text element, in
> Polish "tekstel") for such objects. I do it mostly orally in my
> presentations for my students, but I have also used it in writing,
> e.g. in http://bc.klf.uw.edu.pl/118/, unfortunately without a proper
> definition. I gave a rudimentary definition only in my recent paper
> in Polish: http://bc.klf.uw.edu.pl/480/. It states simply (on p. 69)
> "an elementary text element independently of its Unicode
> representation" (meaning in particular composed vs precomposed). I
> still hope to formulate sooner or later a more satisfactory
> definition :-)
>
> I think Swift confirms that such a notion is really needed.
>
> Best regards
>
> Janusz

On Wed, Sep 21 2016 at 6:44 CEST, jsb...@mimuw.edu.pl writes:
> On Tue, Sep 20 2016 at 18:09 CEST, d...@ewellic.org writes:
>> Janusz Bień wrote:
>>
>>> For me it means that Swift's characters are equivalence classes of the
>>> set of extended grapheme clusters by canonical equivalence relation.
>>
>> I still hope we can come to some conclusion on the correct Unicode name
>> for this concept. I don't think non-Unicode interpretations of terms
>> like "grapheme" are grounds for throwing out "grapheme cluster,"
>
> I agree.
>
>> but I can see that the equivalence class itself is lacking a name.
>
> I'm glad.
>
>> Note that the Swift definition doesn't say that <00E9> and <0065 0301>
>> are identical entities, only that the language compares them as equal.
>
> I'm fully aware of this.
>
> Best regards
>
> Janusz

--
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
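The point about <00E9> and <0065 0301> comparing as equal without being identical can be illustrated outside Swift. What follows is a minimal sketch in Python, assuming that "compare as equal" means canonical equivalence as defined by Unicode normalization; it is not Swift's implementation, and the helper name `canonically_equal` is hypothetical.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """True if both strings have the same canonical decomposition (NFD).

    Hypothetical helper: two strings are canonically equivalent exactly
    when their NFD (or, equally, their NFC) forms are identical.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


precomposed = "\u00E9"       # <00E9>: LATIN SMALL LETTER E WITH ACUTE
decomposed = "\u0065\u0301"  # <0065 0301>: "e" + COMBINING ACUTE ACCENT

# The code point sequences differ, so a plain comparison says "not equal" ...
print(precomposed == decomposed)                   # False
# ... but both belong to the same canonical equivalence class.
print(canonically_equal(precomposed, decomposed))  # True
```

The two strings stay distinct sequences in memory; only the comparison treats them as one element, which is the distinction drawn in the quoted exchange.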
Re: "A Programmer's Introduction to Unicode"
I recently wrote
http://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
, which sort of addresses the whole hangup programmers have with
treating code points as "characters".

I also wrote
http://manishearth.github.io/blog/2017/01/15/breaking-our-latin-1-assumptions/
, which provides a useful list of scripts to check against when figuring
out if your design makes sense uniformly across scripts.

There's also https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/

-Manish

On Fri, Mar 10, 2017 at 9:00 AM, Peter Constable wrote:
> FYI:
>
> http://reedbeta.com/blog/programmers-intro-to-unicode/
>
> The visuals may be the most interesting part. E.g., in the usage heat map,
> Arabic Presentation Forms-B lights up much more than I would have expected –
> as much as a lot of emoji.
>
> Peter
Re: "A Programmer's Introduction to Unicode"
On Fri, Mar 10, 2017 at 05:00:55PM +, Peter Constable wrote:
> FYI:
>
> http://reedbeta.com/blog/programmers-intro-to-unicode/
>
> The visuals may be the most interesting part. E.g., in the usage heat
> map, Arabic Presentation Forms-B lights up much more than I would have
> expected

I often see U+FEFB and other lam-alef ligatures used on social media (I
easily spot them because my default font does not have them, so they end
up rendered in a fallback font). My guess is that this is because some
keyboard layouts (Xorg, Android?) use them for the lam-alef keys on the
keyboard. (I’m guilty of doing this for the Xorg keyboard layout,
because it didn’t handle more than one character per key. The ligature
was then decomposed back inside the XIM input method, but many people
don’t use XIM, so the decomposition does not happen. It was messy
overall.)

Regards,
Khaled
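The decomposition that goes missing when XIM is not in use can be shown with the compatibility mapping in the Unicode Character Database. This is a minimal Python sketch, not what XIM or any keyboard layout actually does; the helper name `fold_presentation_form` is hypothetical. It relies only on the fact that U+FEFB has a compatibility decomposition to lam (U+0644) followed by alef (U+0627).

```python
import unicodedata


def fold_presentation_form(text: str) -> str:
    """Replace compatibility characters by their NFKC equivalents.

    Hypothetical helper. NFKC also folds many other compatibility
    characters, so it is a blunt tool, not a targeted lam-alef fix.
    """
    return unicodedata.normalize("NFKC", text)


lam_alef_isolated = "\uFEFB"

# The character's name and its raw decomposition field from the database.
print(unicodedata.name(lam_alef_isolated))
print(unicodedata.decomposition(lam_alef_isolated))

# After folding, the ligature becomes the two ordinary Arabic letters.
folded = fold_presentation_form(lam_alef_isolated)
print([f"U+{ord(c):04X}" for c in folded])
```

Text that went through such a folding step would no longer light up the Arabic Presentation Forms-B block in a usage count, which fits the guess that the observed usage comes from input that was never decomposed.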