Re: "A Programmer's Introduction to Unicode"

2017-03-15 Thread Steffen Nurpmeso
"Doug Ewell"  wrote:
 |Philippe Verdy wrote:
 |>>> Well, you do have eleven bits for flags per codepoint, for example.
 |>>
 |>> That's not UCS-4; that's a custom encoding.
 |>>
 |>> (any UCS-4 code unit) & 0xFFE00000 == 0
 |
 |(changing to "UTF-32" per Ken's observation)
 |
 |> Per definition yes, but UTC-4 is not Unicode.
 |
 |I guess it's not. What is UTC-4, anyway? Another name for a UWG meeting
 |held in 1989?
 |
 |> As well (any UCS-4 code unit) & 0xFFE00000 == 0 (i.e. 21 bits) is not
 |> Unicode, UTF-32 is Unicode (more restrictive than just 21 bits which
 |> would allow 32 planes instead of just the 17 first ones).
 |
 |I used bitwise arithmetic strictly to address Steffen's premise that the
 |11 "unused bits" in a UTF-32 code unit were available to store metadata
 |about the code point. Of course UTF-32 does not allow 0x11 through
 |0x1F either.
 |
 |> I suppose he meant 21 bits, not 11 bits which covers only a small part
 |> of the BMP.
 |
 |No, his comment "you do have eleven bits for flags per codepoint" pretty
 |clearly referred to using the "extra" 11 bits beyond what is needed to
 |hold the Unicode scalar value.

It surely is a weak argument for a general string encoding.  But
sometimes, for local use cases, it surely is valid.  You could
store both the wcwidth(3) result and a grapheme codepoint count in
these bits of the first codepoint of a cluster, for example, and
then hide that storage detail behind an access method interface.
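
A minimal Python sketch of that idea follows.  The bit layout (2 bits
for the column width, 9 bits for the cluster's codepoint count) and the
names pack, code_point, col_width and cluster_len are assumptions made
for illustration; nothing in the thread specifies them, and the result
is a private format rather than UTF-32.

```python
# Hypothetical layout of one 32-bit unit (an assumption, not a standard):
#   bits  0-20  Unicode code point (21 bits)
#   bits 21-22  column width, 0..3
#   bits 23-31  number of code points in the cluster, 0..511
CP_MASK = (1 << 21) - 1
WIDTH_SHIFT, WIDTH_MASK = 21, 0x3
COUNT_SHIFT, COUNT_MASK = 23, 0x1FF


def pack(cp, width, count):
    """Combine a code point with its per-cluster metadata."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if not 0 <= width <= WIDTH_MASK or not 0 <= count <= COUNT_MASK:
        raise ValueError("metadata does not fit in the spare bits")
    return cp | (width << WIDTH_SHIFT) | (count << COUNT_SHIFT)


# The accessors are the only place that knows the layout.
def code_point(unit):
    return unit & CP_MASK


def col_width(unit):
    return (unit >> WIDTH_SHIFT) & WIDTH_MASK


def cluster_len(unit):
    return (unit >> COUNT_SHIFT) & COUNT_MASK


# U+0E01 starting a two-codepoint cluster that occupies one column.
unit = pack(0x0E01, 1, 2)
```

Because callers only go through the accessors, the layout could be
changed later without touching them.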

--steffen


Re: "A Programmer's Introduction to Unicode"

2017-03-14 Thread Richard Wordingham
On Tue, 14 Mar 2017 08:51:18 +
Alastair Houghton  wrote:

> On 14 Mar 2017, at 02:03, Richard Wordingham
>  wrote:
> > 
> > On Mon, 13 Mar 2017 19:18:00 +
> > Alastair Houghton  wrote:

> > The problem is that UTF-16 based code can very easily overlook the
> > handling of surrogate pairs, and one can very easily get confused over
> > what string lengths mean.  
> 
> Yet the same problem exists for UCS-4; it could very easily overlook
> the handling of combining characters.

That's a different issue.  I presume you mean the issues of canonical
equivalence and detecting text boundaries.  Again, there is the problem
of remembering to consider the whole surrogate pair when using
UTF-16.  (I suppose this could be largely handled by avoiding the
concept of arrays.)  Now, the supplementary characters where these
issues arise are very infrequently used.  An error in UTF-16 code might
easily not come to attention, whereas a problem with UCS-4 (or UTF-8)
comes to light as soon as one handles Thai or IPA.

> As for string lengths, string
> lengths in code points are no more meaningful than string lengths in
> UTF-16 code units.  They don’t tell you anything about the number of
> user-visible characters; or anything about the width the string will
> take up if rendered on the display (even in a fixed-width font); or
> anything about the number of glyphs that a given string might be
> transformed into by glyph mapping.  The *only* thing a string length
> of a Unicode string will tell you is the number of code units.

A string length in codepoints does have the advantage of being
independent of encoding.  I'm actually using an index for UTF-16
text (I don't know whether it's denominated in codepoints or code
units) to index into the UTF-8 source code.  However, the number of code
units is the more commonly used quantity, as it tells one how much
memory is required for simple array storage.
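
A small Python illustration of the two quantities, as a sketch only.
Python 3 strings are indexed by codepoint, so len() gives the
encoding-independent count; the sample string is an arbitrary choice
made here, not one taken from the thread.

```python
# 'a', U+0E01 THAI CHARACTER KO KAI, and U+1F600 (outside the BMP).
s = "a\u0e01\U0001F600"

# Length in code points: the same whatever encoding is used for storage.
code_points = len(s)

# Lengths in code units: these depend on the encoding and give the
# array size needed for simple storage.
utf16_units = len(s.encode("utf-16-le")) // 2   # 1 + 1 + 2
utf8_units = len(s.encode("utf-8"))             # 1 + 3 + 4
```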

Richard.



RE: "A Programmer's Introduction to Unicode"

2017-03-14 Thread Doug Ewell
Philippe Verdy wrote:

>>> Well, you do have eleven bits for flags per codepoint, for example.
>>
>> That's not UCS-4; that's a custom encoding.
>>
>> (any UCS-4 code unit) & 0xFFE00000 == 0

(changing to "UTF-32" per Ken's observation)

> Per definition yes, but UTC-4 is not Unicode.

I guess it's not. What is UTC-4, anyway? Another name for a UWG meeting
held in 1989?

> As well (any UCS-4 code unit) & 0xFFE00000 == 0 (i.e. 21 bits) is not
> Unicode, UTF-32 is Unicode (more restrictive than just 21 bits which
> would allow 32 planes instead of just the 17 first ones).

I used bitwise arithmetic strictly to address Steffen's premise that the
11 "unused bits" in a UTF-32 code unit were available to store metadata
about the code point. Of course UTF-32 does not allow 0x11 through
0x1F either.

> I suppose he meant 21 bits, not 11 bits which covers only a small part
> of the BMP.

No, his comment "you do have eleven bits for flags per codepoint" pretty
clearly referred to using the "extra" 11 bits beyond what is needed to
hold the Unicode scalar value.
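
The two conditions discussed here can be sketched in Python, taking the
mask for the top eleven bits of a 32-bit unit to be 0xFFE00000.  The
function names are hypothetical.  The second check is the stricter
one: a value may fit in 21 bits and still not be a Unicode scalar
value.

```python
def top_eleven_bits_clear(unit):
    """True if the unit uses only its low 21 bits."""
    return (unit & 0xFFE00000) == 0


def is_scalar_value(unit):
    """True if the unit is a Unicode scalar value (valid in UTF-32)."""
    in_range = 0 <= unit <= 0x10FFFF
    is_surrogate = 0xD800 <= unit <= 0xDFFF
    return in_range and not is_surrogate
```

For example, 0x1FFFFF passes the first test but fails the second,
because it lies beyond plane 16.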
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: "A Programmer's Introduction to Unicode"

2017-03-14 Thread Philippe Verdy
Per definition yes, but UTC-4 is not Unicode.
As well (any UCS-4 code unit) & 0xFFE00000 == 0 (i.e. 21 bits) is not
Unicode, UTF-32 is Unicode (more restrictive than just 21 bits which would
allow 32 planes instead of just the 17 first ones).
I suppose he meant 21 bits, not 11 bits which covers only a small part of
the BMP.

2017-03-14 16:14 GMT+01:00 Doug Ewell :

> Steffen Nurpmeso wrote:
>
> >> I didn’t say you never needed to work with code points. What I said
> >> is that there’s no advantage to UCS-4 as an encoding, and that
> >
> > Well, you do have eleven bits for flags per codepoint, for example.
>
> That's not UCS-4; that's a custom encoding.
>
> (any UCS-4 code unit) & 0xFFE00000 == 0
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>


Re: "A Programmer's Introduction to Unicode"

2017-03-14 Thread Doug Ewell
Steffen Nurpmeso wrote:

>> I didn’t say you never needed to work with code points. What I said
>> is that there’s no advantage to UCS-4 as an encoding, and that
>
> Well, you do have eleven bits for flags per codepoint, for example. 

That's not UCS-4; that's a custom encoding.

(any UCS-4 code unit) & 0xFFE00000 == 0
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: "A Programmer's Introduction to Unicode"

2017-03-14 Thread Steffen Nurpmeso
Alastair Houghton  wrote:
 |On 13 Mar 2017, at 21:10, Khaled Hosny  wrote:
 |> On Mon, Mar 13, 2017 at 07:18:00PM +, Alastair Houghton wrote:
 |>> On 13 Mar 2017, at 17:55, J Decker  wrote:
 |>>> 
 |>>> I liked the Go implementation of character type - a rune type - \
 |>>> which is a codepoint, and strings that return runes by index.
 |>>> https://blog.golang.org/strings
 |>> 
 |>> IMO, returning code points by index is a mistake.  It over-emphasises
 |>> the importance of the code point, which helps to continue the notion
 |>> in some developers’ minds that code points are somehow “characters”.
 |>> It also leads to people unnecessarily using UCS-4 as an internal
 |>> representation, which seems to have very few advantages in practice
 |>> over UTF-16.
 |> 
 |> But there are many text operations that require access to Unicode code
 |> points. Take for example text layout, as mapping characters to glyphs
 |> and back has to operate on code points. The idea that you never need to
 |> work with code points is too simplistic.
 |
 |I didn’t say you never needed to work with code points.  What I said \
 |is that there’s no advantage to UCS-4 as an encoding, and that there’s \

Well, you do have eleven bits for flags per codepoint, for example.

 |no advantage to being able to index a string by code point.  As it \

With UTF-32 you can take the very codepoint and look up Unicode
classification tables.
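
As a rough illustration in Python, the standard unicodedata module can
be queried with the codepoint itself.  The helper name classify and
the sample codepoints are choices made here for the example only.

```python
import unicodedata


def classify(cp):
    """Look up a few properties directly from an integer code point."""
    ch = chr(cp)
    return (
        unicodedata.name(ch),        # character name
        unicodedata.category(ch),    # general category, e.g. 'Lu', 'Mn'
        unicodedata.combining(ch),   # canonical combining class
    )


ko_kai = classify(0x0E01)   # a Thai consonant
sara_i = classify(0x0E34)   # a Thai vowel sign (nonspacing mark)
acute = classify(0x0301)    # COMBINING ACUTE ACCENT
```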

 |happens, I’ve written the kind of code you cite as an example, including \
 |glyph mapping and OpenType processing, and the fact is that it’s no \
 |harder to do it with a UTF-16 string than it is with a UCS-4 string. \
 | Yes, certainly, surrogate pairs need to be decoded to map to glyphs; \
 |but that’s a *trivial* matter, particularly as the code point to glyph \
 |mapping is not 1:1 or even 1:N - it’s N:M, so you already need to cope \
 |with being able to map multiple code units in the string to multiple \
 |glyphs in the result.

If you have to iterate over a string to perform some high-level
processing then UTF-8 is a choice almost equally fine, for the
very same reasons you bring in.  And if the usage pattern
"hotness" pictures that this thread has shown up at the beginning
is correct, then the size overhead of UTF-8 that the UTF-16
proponents point out turns out to be a flop.

But i for one gave up on making a stand against UTF-16 or BOMs.
In fact i have turned to think UTF-16 is a pretty nice in-memory
representation, and it is a small step to get from it to the real
codepoint that you need to decide what something is, and what has
to be done with it.  I don't know whether i would really use it
for this purpose, though, i am pretty sure that my core Unicode
functions will (start to /) continue to use UTF-32, because the
codepoint to codepoint(s) is what is described, and onto which
anything else can be implemented.  I.e., you can store three
UTF-32 codepoints in a single uint64_t, and i would shoot myself
in the foot if i would make this accessible via a UTF-16 or UTF-8
converter, imho; instead, i (will) make it accessible directly as
UTF-32, and that serves equally well all other formats.  Of
course, if it is clear that you are UTF-16 all-through-the-way
then you can save the conversion, but (the) most (widespread)
Uni(x|ces) are UTF-8 based and it looks as if that would stay.
Yes, yes, you can nonetheless use UTF-16, but it will most likely
not save you anything on the database side due to storage
alignment requirements, and the necessity to be able to access
data somewhere.  You can have a single index-lookup array and
a dynamically sized database storage which uses two-byte
alignment, of course, then i can imagine UTF-16 is for the better.
I never looked how ICU does it, but i have been impressed by sheer
data facts ^.^
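
The remark about three codepoints per uint64_t works because
3 x 21 = 63 bits.  A minimal Python sketch, with hypothetical
function names and an assumed field order (first codepoint in the
lowest bits):

```python
CP_BITS = 21
CP_MASK = (1 << CP_BITS) - 1


def pack3(a, b, c):
    """Pack three code points into one integer that fits in 63 bits."""
    for cp in (a, b, c):
        if not 0 <= cp <= 0x10FFFF:
            raise ValueError("not a Unicode code point")
    return a | (b << CP_BITS) | (c << (2 * CP_BITS))


def unpack3(word):
    """Recover the three code points in their original order."""
    return (
        word & CP_MASK,
        (word >> CP_BITS) & CP_MASK,
        (word >> (2 * CP_BITS)) & CP_MASK,
    )


word = pack3(0x41, 0x0E01, 0x10FFFF)
```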

--steffen



Re: "A Programmer's Introduction to Unicode"

2017-03-14 Thread Alastair Houghton
On 14 Mar 2017, at 02:03, Richard Wordingham  
wrote:
> 
> On Mon, 13 Mar 2017 19:18:00 +
> Alastair Houghton  wrote:
> 
>> IMO, returning code points by index is a mistake.  It over-emphasises
>> the importance of the code point, which helps to continue the notion
>> in some developers’ minds that code points are somehow “characters”.
>> It also leads to people unnecessarily using UCS-4 as an internal
>> representation, which seems to have very few advantages in practice
>> over UTF-16.
> 
> The problem is that UTF-16 based code can very easily overlook the
> handling of surrogate pairs, and one can very easily get confused over what
> string lengths mean.

Yet the same problem exists for UCS-4; it could very easily overlook the 
handling of combining characters.  As for string lengths, string lengths in 
code points are no more meaningful than string lengths in UTF-16 code units.  
They don’t tell you anything about the number of user-visible characters; or 
anything about the width the string will take up if rendered on the display 
(even in a fixed-width font); or anything about the number of glyphs that a 
given string might be transformed into by glyph mapping.  The *only* thing a 
string length of a Unicode string will tell you is the number of code units.
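
A short Python sketch of the point about lengths; the sample strings
are chosen here for illustration.  Each of them would normally be
perceived as a single character, yet the counts differ by
normalization form and by encoding.

```python
import unicodedata

# Latin e with acute: one code point precomposed, two when decomposed.
e_nfc = "\u00e9"
e_nfd = unicodedata.normalize("NFD", e_nfc)   # 'e' + U+0301

# Thai KO KAI + SARA I: no precomposed form exists, so it stays at two
# code points even after NFC normalization.
thai = "\u0e01\u0e34"
thai_nfc = unicodedata.normalize("NFC", thai)


def utf16_len(s):
    """Length of s in UTF-16 code units."""
    return len(s.encode("utf-16-le")) // 2
```

Neither count says how many columns or glyphs the text will need.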

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: "A Programmer's Introduction to Unicode"

2017-03-14 Thread Alastair Houghton
On 13 Mar 2017, at 21:10, Khaled Hosny  wrote:
> 
> On Mon, Mar 13, 2017 at 07:18:00PM +, Alastair Houghton wrote:
>> On 13 Mar 2017, at 17:55, J Decker  wrote:
>>> 
>>> I liked the Go implementation of character type - a rune type - which is a 
>>> codepoint, and strings that return runes by index.
>>> https://blog.golang.org/strings
>> 
>> IMO, returning code points by index is a mistake.  It over-emphasises
>> the importance of the code point, which helps to continue the notion
>> in some developers’ minds that code points are somehow “characters”.
>> It also leads to people unnecessarily using UCS-4 as an internal
>> representation, which seems to have very few advantages in practice
>> over UTF-16.
> 
> But there are many text operations that require access to Unicode code
> points. Take for example text layout, as mapping characters to glyphs
> and back has to operate on code points. The idea that you never need to
> work with code points is too simplistic.

I didn’t say you never needed to work with code points.  What I said is that 
there’s no advantage to UCS-4 as an encoding, and that there’s no advantage to 
being able to index a string by code point.  As it happens, I’ve written the 
kind of code you cite as an example, including glyph mapping and OpenType 
processing, and the fact is that it’s no harder to do it with a UTF-16 string 
than it is with a UCS-4 string.  Yes, certainly, surrogate pairs need to be 
decoded to map to glyphs; but that’s a *trivial* matter, particularly as the 
code point to glyph mapping is not 1:1 or even 1:N - it’s N:M, so you already 
need to cope with being able to map multiple code units in the string to 
multiple glyphs in the result.
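
For reference, the surrogate pair decoding step can be sketched in a
few lines of Python.  The function name is hypothetical, and passing
unpaired surrogates through unchanged is a simplification chosen for
this sketch, not a recommendation.

```python
def decode_utf16(units):
    """Turn a list of UTF-16 code units into a list of code points."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Combine the 10 payload bits of each half, then add 0x10000.
            low = units[i + 1]
            out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP code unit (or an unpaired surrogate, passed through).
            out.append(u)
            i += 1
    return out


decoded = decode_utf16([0x61, 0xD83D, 0xDE00])
```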

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: "A Programmer's Introduction to Unicode"

2017-03-14 Thread Manish Goregaokar
Ah, it was what I thought you were talking about -- I wasn't aware they
were considered word boundaries :)

Thanks for the links!

On Mar 13, 2017 4:54 PM, "Richard Wordingham" <
richard.wording...@ntlworld.com> wrote:

> On Mon, 13 Mar 2017 15:26:00 -0700
> Manish Goregaokar  wrote:
>
> > Do you have examples of AA being split that way (and further reading)?
> > I think I'm aware of what you're talking about, but would love to read
> > more about it.
>
> Just googling for the three words 'Sanskrit', 'sandhi' and 'resolution'
> brings up plenty of papers and discussion, e.g. Hellwig's at
> http://ltc.amu.edu.pl/book/papers/LRL-1.pdf and a multi-author paper at
> https://www.aclweb.org/anthology/C/C16/C16-1048.pdf.
>
> There are even technical terms for before and after.  Unsplit text is
> 'samhita text', and text split into words is 'pada text'.
>
> Richard.


Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Richard Wordingham
On Mon, 13 Mar 2017 19:18:00 +
Alastair Houghton  wrote:

> IMO, returning code points by index is a mistake.  It over-emphasises
> the importance of the code point, which helps to continue the notion
> in some developers’ minds that code points are somehow “characters”.
> It also leads to people unnecessarily using UCS-4 as an internal
> representation, which seems to have very few advantages in practice
> over UTF-16.

The problem is that UTF-16 based code can very easily overlook the
handling of surrogate pairs, and one can very easily get confused over what
string lengths mean.

Richard.



Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Richard Wordingham
On Mon, 13 Mar 2017 20:20:25 -0400
"Mark E. Shoulson"  wrote:

> Sanskrit external vowel sandhi is comparatively 
> straightforward (compared to consonant sandhi), and it frequently
> loses information.  A *or* AA plus I is E; A *or* AA plus U is O (you
> need A + O to get AU).

Indeed, E can not only be A or AA plus I or II: it can also be E + A.
In the latter case avagraha is usual, at least in European practice.
(Would that generally be locale sa_Deva_GB?) I'd like advice on modern
Indian practice, and on the spacing and syllable division. I've seen a
claim that avagraha always belongs with the preceding vowel, but I'm
not sure that that rule applies in this case.

In a similar fashion, O can be -AS + A-, an interesting case of visarga
sandhi. However, I'm not sure that one would want to *divide* the E or
O.

Richard.


Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Mark E. Shoulson
A word ending in A *or* AA preceding a word beginning in A *or* AA will 
all coalesce to a single AA in Sanskrit.  That's four possibilities, and 
that doesn't count a word ending in a consonant preceding a word 
beginning in AA, which would be written the same.  My memory is rusty, 
so I should actually be looking things up, but I think these are valid 
constructions:


न + अगच्छत्  →  नागच्छत्
न + आगच्छत्  → नागच्छत्

(and indeed, आगच्छत् is the upasarga आ plus अगच्छत्, so there too the A 
+ AA coalesced.)  I should probably find you examples for all the other 
possibilities.  Sanskrit external vowel sandhi is comparatively 
straightforward (compared to consonant sandhi), and it frequently loses 
information.  A *or* AA plus I is E; A *or* AA plus U is O (you need A + 
O to get AU).
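
The information loss can be made concrete with a small Python table.
It encodes only the rules stated in this message, using the same
transliteration labels, and is not a complete description of Sanskrit
vowel sandhi; the names SANDHI and SOURCES are hypothetical.

```python
# (final vowel, initial vowel) -> merged vowel, per the rules above.
SANDHI = {
    ("A", "A"): "AA",
    ("A", "AA"): "AA",
    ("AA", "A"): "AA",
    ("AA", "AA"): "AA",
    ("A", "I"): "E",
    ("AA", "I"): "E",
    ("A", "U"): "O",
    ("AA", "U"): "O",
    ("A", "O"): "AU",
}

# Invert the table: each merged vowel maps back to every pair that
# could have produced it, which is where the ambiguity shows up.
SOURCES = {}
for pair, merged in SANDHI.items():
    SOURCES.setdefault(merged, []).append(pair)
```

A merged AA has four candidate splits under these rules alone, so
resolving it needs knowledge of the words involved.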


~mark


On 03/13/2017 06:26 PM, Manish Goregaokar wrote:
> Do you have examples of AA being split that way (and further reading)?
> I think I'm aware of what you're talking about, but would love to read
> more about it.
> -Manish
>
> On Mon, Mar 13, 2017 at 2:47 PM, Richard Wordingham
>  wrote:
>> On Mon, 13 Mar 2017 23:10:11 +0200
>> Khaled Hosny  wrote:
>>
>>> But there are many text operations that require access to Unicode code
>>> points. Take for example text layout, as mapping characters to glyphs
>>> and back has to operate on code points. The idea that you never need
>>> to work with code points is too simplistic.
>>
>> There are advantages to interpreting and operating on text as though it
>> were in form NFD.  However, there are still cases where one needs
>> fractions of a character, such as word boundaries in Sanskrit, though I
>> think the locations are liable to be specified in a language-specific
>> form.  U+093E DEVANAGARI VOWEL SIGN AA can have a word boundary in it
>> in at least 4 ways.
>>
>> Richard.





Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Richard Wordingham
On Mon, 13 Mar 2017 15:26:00 -0700
Manish Goregaokar  wrote:

> Do you have examples of AA being split that way (and further reading)?
> I think I'm aware of what you're talking about, but would love to read
> more about it.

Just googling for the three words 'Sanskrit', 'sandhi' and 'resolution'
brings up plenty of papers and discussion, e.g. Hellwig's at
http://ltc.amu.edu.pl/book/papers/LRL-1.pdf and a multi-author paper at
https://www.aclweb.org/anthology/C/C16/C16-1048.pdf.

There are even technical terms for before and after.  Unsplit text is
'samhita text', and text split into words is 'pada text'.

Richard.


Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Manish Goregaokar
Do you have examples of AA being split that way (and further reading)?
I think I'm aware of what you're talking about, but would love to read
more about it.
-Manish


On Mon, Mar 13, 2017 at 2:47 PM, Richard Wordingham
 wrote:
> On Mon, 13 Mar 2017 23:10:11 +0200
> Khaled Hosny  wrote:
>
>> But there are many text operations that require access to Unicode code
>> points. Take for example text layout, as mapping characters to glyphs
>> and back has to operate on code points. The idea that you never need
>> to work with code points is too simplistic.
>
> There are advantages to interpreting and operating on text as though it
> were in form NFD.  However, there are still cases where one needs
> fractions of a character, such as word boundaries in Sanskrit, though I
> think the locations are liable to be specified in a language-specific
> form.  U+093E DEVANAGARI VOWEL SIGN AA can have a word boundary in it
> in at least 4 ways.
>
> Richard.


Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Richard Wordingham
On Mon, 13 Mar 2017 23:10:11 +0200
Khaled Hosny  wrote:
 
> But there are many text operations that require access to Unicode code
> points. Take for example text layout, as mapping characters to glyphs
> and back has to operate on code points. The idea that you never need
> to work with code points is too simplistic.

There are advantages to interpreting and operating on text as though it
were in form NFD.  However, there are still cases where one needs
fractions of a character, such as word boundaries in Sanskrit, though I
think the locations are liable to be specified in a language-specific
form.  U+093E DEVANAGARI VOWEL SIGN AA can have a word boundary in it
in at least 4 ways.
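
A small Python check of why normalization does not help here.  The
sample word is the Sanskrit form नागच्छत् (cited elsewhere in this
thread as न + अगच्छत् or न + आगच्छत्), written with escapes so the
codepoints are explicit.  NFD leaves it unchanged, and the vowel sign
that contains the word boundary is a single codepoint.

```python
import unicodedata

# NA, VOWEL SIGN AA, GA, CA, VIRAMA, CHA, TA, VIRAMA
word = "\u0928\u093e\u0917\u091a\u094d\u091b\u0924\u094d"

# None of these code points has a canonical decomposition.
nfd = unicodedata.normalize("NFD", word)

# The second code point is the vowel sign in question.
vowel_sign = word[1]
vowel_name = unicodedata.name(vowel_sign)
```

Any index into the string, whether in codepoints or code units, falls
before or after U+093E and never inside it.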

Richard.


Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Khaled Hosny
On Mon, Mar 13, 2017 at 07:18:00PM +, Alastair Houghton wrote:
> On 13 Mar 2017, at 17:55, J Decker  wrote:
> > 
> > I liked the Go implementation of character type - a rune type - which is a 
> > codepoint, and strings that return runes by index.
> > https://blog.golang.org/strings
> 
> IMO, returning code points by index is a mistake.  It over-emphasises
> the importance of the code point, which helps to continue the notion
> in some developers’ minds that code points are somehow “characters”.
> It also leads to people unnecessarily using UCS-4 as an internal
> representation, which seems to have very few advantages in practice
> over UTF-16.

But there are many text operations that require access to Unicode code
points. Take for example text layout, as mapping characters to glyphs
and back has to operate on code points. The idea that you never need to
work with code points is too simplistic.

Regards,
Khaled


Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Alastair Houghton
On 13 Mar 2017, at 17:55, J Decker  wrote:
> 
> I liked the Go implementation of character type - a rune type - which is a 
> codepoint, and strings that return runes by index.
> https://blog.golang.org/strings

IMO, returning code points by index is a mistake.  It over-emphasises the 
importance of the code point, which helps to continue the notion in some 
developers’ minds that code points are somehow “characters”.  It also leads to 
people unnecessarily using UCS-4 as an internal representation, which seems to 
have very few advantages in practice over UTF-16.

> Doesn't solve the problem for composited codepoints though... 
> 
> texel looks to be defined as a graphic element already.  TEXture ELement.

Yes, but I thought the proposal was “textel”, with the extra “t”.  Re-using 
“texel” would be quite inappropriate; there are certainly people who work on 
rendering software who would strongly object to that, for very good reasons.

I would caution, however, that there’s already a lot of terminology associated 
with Unicode, perhaps for understandable reasons, but if the word “textel” is 
going to have a definition that differs from (say) an extended grapheme 
cluster, I think a great deal of consideration should be given to what exactly 
that definition should be.  We already have “characters”, code units, code 
points, combining sequences, graphemes, grapheme clusters, extended grapheme 
clusters and probably other things I’ve missed off that list.  Merely adding 
yet another bit of terminology isn’t going to fix the problem of developers 
misunderstanding or simply not being aware of the correct terminology or of 
some aspect of Unicode’s behaviour.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread J Decker
I liked the Go implementation of character type - a rune type - which is a
codepoint, and strings that return runes by index.
https://blog.golang.org/strings

Doesn't solve the problem for composited codepoints though...

texel looks to be defined as a graphic element already.  TEXture ELement.



On Mon, Mar 13, 2017 at 10:15 AM, Janusz S. Bien 
wrote:

> Quote/Cytat - Asmus Freytag  (Mon 13 Mar 2017
> 06:00:08 PM CET):
>
> [...]
>
> This (or similar) scenarios indicate the impossibility to come to a
> single, universal definition of a "textel" -- the main reason why this
> term is of lower utility than "pixel".
>
> I agree that it is impossible  to come to a single, universal definition
> of text elements, but it seems possible to reach a consensus on a kind of
> the least common denominator of them and call it "textel" or something else.
>
>
> Best regards
>
> Janusz
>
> --
> Prof. dr hab. Janusz S. Bień -  Uniwersytet Warszawski (Katedra
> Lingwistyki Formalnej)
> Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
> jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
>
>


Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Janusz S. Bien

Quote/Cytat - J Decker  (Mon 13 Mar 2017 06:55:18 PM CET):

> texel looks to be defined as a graphic element already.  TEXture ELement.


I'm aware of it, but homonymy/polysemy is something we have to live  
with. I think there is no risk of confusing texture elements with text  
elements, despite the fact that 'texture' and 'text' have similar  
origin.


Best regards

Janusz

--
Prof. dr hab. Janusz S. Bień -  Uniwersytet Warszawski (Katedra  
Lingwistyki Formalnej)

Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/



Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Janusz S. Bien
Quote/Cytat - Asmus Freytag  (Mon 13 Mar 2017
06:00:08 PM CET):

> [...]
>
> This (or similar) scenarios indicate the impossibility to come to a
> single, universal definition of a "textel" -- the main reason why this
> term is of lower utility than "pixel".

I agree that it is impossible  to come to a single, universal  
definition of text elements, but it seems possible to reach a  
consensus on a kind of the least common denominator of them and call  
it "textel" or something else.


Best regards

Janusz

--
Prof. dr hab. Janusz S. Bień -  Uniwersytet Warszawski (Katedra  
Lingwistyki Formalnej)

Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/



Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Asmus Freytag

  
  
On 3/13/2017 3:31 AM, Janusz S. Bien wrote:

> Just yet another reason for introducing the notion of textel?

The main difference between "textel" and "pixel" is that the unit of
processing/displaying text is not uniform and fixed, unlike a pixel.
In other words, different operations may need to look at text
differently, and I don't mean the trivial case of storage (byte level)
vs. any higher level.

Correspondingly, the discussion of "text element", at least in the early
versions of the Unicode Standard, left the particular division of the
text into "text elements" unspecified.

There are closely related tasks that might demonstrate this. Assume a
script where multiple code points make up a syllable, yet that syllable
is the intuitive basic unit of reading and writing.

One task is cursor placement. For that task, you need to be able to
divide *any* text so that the cursor ideally does not get positioned in
the middle of a syllable. However, the definition of a "syllable" has to
allow degenerate and 'defective' cases. Which is which is of no
importance, as long as it is possible to find a valid cursor position.

The other task would be to assert that a string contains only
well-formed syllables. Here, it is crucially necessary to be able to
define which syllables are well-formed. Finding divisions in parts of
the string that do not contain well-formed syllables is not necessary.

You may also find that in some cases, even though the syllable is the
basic unit, there may be a need to edit it in ways other than as a unit.
Some syllables may have some optional marks, signs or symbols added that
may need to be edited or traversed explicitly, while a "core" syllable
may be more likely to be a unit.

This (or similar) scenarios indicate the impossibility to come to a
single, universal definition of a "textel" -- the main reason why this
term is of lower utility than "pixel".

A./
  
  



Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread William_J_G Overington
Prof. Janusz S. Bień wrote:

> Just yet another reason for introducing the notion of textel?

I opine that it would be a good idea to introduce several new words, of which 
textel would be one, with each such new word having a precisely-defined meaning 
so that in precise discussions of programming techniques people could discuss 
the situation without needing to use any of the words character, code point, 
grapheme cluster.

How many such new words would be needed?

I remember how in electronics the introduction of the term Hertz to be used 
instead of cycles per second helped discussions.

After the introduction of the term Hertz it became easy to refer to twenty 
cycles of a fifty Hertz signal without confusion over one's meaning.

So introducing several new precisely-defined words now could help lots of 
discussions in the future.

Perhaps, apart from textel, the definitions could be produced first and then 
people can decide, for each such definition, which new word would be a good 
word to have that definition.

The recent introduction into Unicode of ZWJ sequences for some emoji and the 
introduction into Unicode of tag sequences applied to a base character 
could mean that the introduction of such new words becomes of increasing 
importance due to the programming implications of those recently introduced 
techniques. 
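
To show what such a sequence looks like to a program, here is a small
Python sketch.  The particular emoji ZWJ sequence (man, woman, girl)
is an example chosen here, not one named in the message.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER

# MAN + ZWJ + WOMAN + ZWJ + GIRL: typically displayed as one glyph
# when the font supports the sequence.
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"

code_points = len(family)
utf16_units = len(family.encode("utf-16-le")) // 2
joiners = family.count(ZWJ)
```

One displayed element thus spans five code points, or eight UTF-16
code units, which is the kind of gap the proposed terms would have to
name precisely.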

William Overington

Monday 13 March 2017




Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Janusz S. Bien
Quote/Cytat - William_J_G Overington  (Mon
13 Mar 2017 12:24:13 PM CET):

> Prof. Janusz S. Bień wrote:
>
>> Just yet another reason for introducing the notion of textel?
>
> I opine that it would be a good idea to introduce several new words,
> of which textel would be one, with each such new word having a
> precisely-defined meaning so that in precise discussions of
> programming techniques people could discuss the situation without
> needing to use any of the words character, code point, grapheme
> cluster.
>
> How many such new words would be needed?


In my paper (in Polish)

http://bc.klf.uw.edu.pl/480/

I propose also the term "texton" meaning a code point from a specific  
subset, not yet fully defined, but including at least the components  
of composite characters.


Best regards

Janusz

--
Prof. dr hab. Janusz S. Bień -  Uniwersytet Warszawski (Katedra  
Lingwistyki Formalnej)

Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/



Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Janusz S. Bien
Quote/Cytat - Richard Wordingham
(Sun 12 Mar 2017 09:10:22 PM CET):

> On Sun, 12 Mar 2017 20:02:28 +0100
> "Janusz S. Bien"  wrote:
>
>> If the basic notion has to be referred to in a cumbersome way as
>> "extended grapheme cluster" then it is easier to talk about "Unicode
>> characters" despite the fact that they have a rather loose relation
>> to real-life/user-perceived characters.
>
> The notion that extended grapheme clusters correspond to
> user-perceived characters is also rather dodgy.

The idea is not mine, but it appears from time to time on the list in
a more or less explicit way.

> Whereas it may work
> for French, it is getting very dubious by the time one adds Hebrew
> cantillation marks or Vedic accentuation.  The Thais revolted when
> their preposed vowels were joined with the following consonant in the
> same extended grapheme cluster, and Unicode had to revoke that union.


Just yet another reason for introducing the notion of textel?

Best regards

Janusz


--
Prof. dr hab. Janusz S. Bień -  Uniwersytet Warszawski (Katedra  
Lingwistyki Formalnej)

Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/



Re: "A Programmer's Introduction to Unicode"

2017-03-12 Thread Richard Wordingham
On Sun, 12 Mar 2017 20:02:28 +0100
"Janusz S. Bien"  wrote:

> If the basic notion has to be referred to in a cumbersome way as  
> "extended grapheme cluster" then it is easier to talk about "Unicode  
> characters" despite the fact that they have a rather loose relation
> to real-life/user-perceived characters.

The notion that extended grapheme clusters corresponds to
user-perceived characters is also rather dodgy.  Whereas it may work
for French, it is getting very dubious by the time one adds Hebrew
cantillation marks or Vedic accentuation.  The Thais revolted when
their preposed vowels were joined with the following consonant in the
same extended grapheme cluster, and Unicode had to revoke that union.

Richard.
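
As a rough illustration of the point above, the Python sketch below
groups each base code point with the combining marks that follow it.
This is only an approximation of extended grapheme clusters, not the
segmentation rules of UAX #29; the helper name rough_clusters and the
sample strings are assumptions made for the example.

```python
import unicodedata


def rough_clusters(text):
    """Group each code point with the combining marks that follow it.

    Approximation only: real extended grapheme clusters follow UAX #29.
    """
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me all start with "M".
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Hebrew BET + DAGESH + QAMATS + cantillation mark ETNAHTA
hebrew = "\u05D1\u05BC\u05B8\u0591"
# Thai preposed vowel SARA E followed by the consonant KO KAI
thai = "\u0E40\u0E01"

hebrew_clusters = rough_clusters(hebrew)
thai_clusters = rough_clusters(thai)
print(len(hebrew_clusters), len(hebrew_clusters[0]))
print(len(thai_clusters))
```

Under this approximation the Hebrew letter with its points and
cantillation mark stays one unit of four code points, while the Thai
preposed vowel and the following consonant remain two separate units.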


Re: "A Programmer's Introduction to Unicode"

2017-03-12 Thread Janusz S. Bien
Quote/Cytat - Manish Goregaokar  (Sun 12 Mar 2017  
07:43:22 PM CET):



>> This is just another confirmation that the present Unicode terminology
>> is confusing.
>
> I find this to be a symptom of our pedagogy around "characters" in
> programming; most folks get taught that characters are bytes are code
> points, especially because many languages try to make this the case.
> The name "grapheme cluster" could be improved upon, but it's not the
> primary source of this confusion.


I agree that it's not the primary source. However the pedagogy depends  
on the terminology used.


If the basic notion has to be referred in a cumbersome way as  
"extended grapheme cluster" then it is easier to talk about "Unicode  
characters" despite the fact that they have a rather loose relation to  
real-life/user-perceived characters.


Best regards

Janusz

--
Prof. dr hab. Janusz S. Bień -  Uniwersytet Warszawski (Katedra  
Lingwistyki Formalnej)

Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/



Re: "A Programmer's Introduction to Unicode"

2017-03-12 Thread Manish Goregaokar
> This is just another confirmation that the present Unicode terminology
> is confusing.

I find this to be a symptom of our pedagogy around "characters" in
programming; most folks get taught that characters are bytes are code
points, especially because many languages try to make this the case.
The name "grapheme cluster" could be improved upon, but it's not the
primary source of this confusion.
-Manish
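
To make the "characters are bytes are code points" confusion concrete,
here is a small Python sketch that measures the same text three ways.
The helper name sizes and the two sample strings are assumptions chosen
for illustration only.

```python
def sizes(text):
    """Return the length of text in three different units."""
    return {
        "code_points": len(text),                           # Python str length
        "utf8_bytes": len(text.encode("utf-8")),            # UTF-8 code units
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # UTF-16 code units
    }


# 'e' followed by COMBINING ACUTE ACCENT: one user-perceived character
accented = "e\u0301"
# GRINNING FACE, a code point outside the BMP
emoji = "\U0001F600"

print(sizes(accented))
print(sizes(emoji))
```

None of the three numbers equals the count of user-perceived characters
for both strings, which is one in each case.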


On Sat, Mar 11, 2017 at 10:04 PM, Janusz S. Bień  wrote:
> On Fri, Mar 10 2017 at 19:55 CET, man...@mozilla.com writes:
>> I recently wrote
>> http://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
>> , which sort of addresses the whole hangup programmers have with
>> treating code points as "characters".
>
> [...]
>
> This is just another confirmation that the present Unicode terminology
> is confusing. Let me remind below a fragment of an old thread about
> "textels".
>
> Best regards
>
> Janusz
>
>
> On Thu, Sep 15 2016 at 21:12 CEST, jsb...@mimuw.edu.pl writes:
>> On Thu, Sep 15 2016 at 16:36 CEST, john.w.kenn...@gmail.com writes:
>>
>> [...]
>>
>>> In the new Swift programming language, which is white-hot in the Apple
>>> community, Apple is moving toward a model of a transparent, generic
>>> Unicode that can be “viewed” as UTF-8, UTF-16, or UTF-32 if necessary,
>>> but in which a “character” contains however many code points it needs
>>> (“e” with a stacked macron, acute accent, and dieresis is
>>> algorithmically one “character” in Swift). Moreover,
>>> e-with-an-acute-accent and e followed by a combining acute accent, for
>>> example, compare as equal. At present, the underlying code is still
>>> UTF-16LE.
>>
>> For several years I use the name "textel" (text element, in Polish
>> "tekstel") for such objects. I do it mostly orally in my presentations
>> for my students, but I used it also in writing e.g. in
>> http://bc.klf.uw.edu.pl/118/, unfortunately without a proper
>> definition. A rudimentary definition was provided for me only in my
>> recent paper in Polish: http://bc.klf.uw.edu.pl/480/. It states simply
>> (on p. 69) "an elementary text element independently of its Unicode
>> representation" (meaning in particular composed vs precomposed). I still
>> hope to formulate sooner or later a more satisfactory definition :-)
>>
>> I think Swift confirms that such a notion is really needed.
>>
>> Best regards
>>
>> Janusz
>
> On Wed, Sep 21 2016 at  6:44 CEST, jsb...@mimuw.edu.pl writes:
>> On Tue, Sep 20 2016 at 18:09 CEST, d...@ewellic.org writes:
>>> Janusz Bień wrote:
>>>
>>>> For me it means that Swift's characters are equivalence classes of the
>>>> set of extended grapheme clusters by canonical equivalence relation.
>>>
>>> I still hope we can come to some conclusion on the correct Unicode name
>>> for this concept. I don't think non-Unicode interpretations of terms
>>> like "grapheme" are grounds for throwing out "grapheme cluster,"
>>
>> I agree.
>>
>>> but I can see that the equivalence class itself is lacking a name.
>>
>> I'm glad.
>>
>>>
>>> Note that the Swift definition doesn't say that <00E9> and <0065 0301>
>>> are identical entities, only that the language compares them as equal.
>>
>> I'm fully aware of this.
>>
>> Best regards
>>
>> Janusz
>
>
> --
> Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki 
> Formalnej)
> Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
> jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
>



Re: "A Programmer's Introduction to Unicode"

2017-03-11 Thread Janusz S. Bień
On Fri, Mar 10 2017 at 19:55 CET, man...@mozilla.com writes:
> I recently wrote
> http://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
> , which sort of addresses the whole hangup programmers have with
> treating code points as "characters".

[...]

This is just another confirmation that the present Unicode terminology
is confusing. Let me remind below a fragment of an old thread about
"textels".

Best regards

Janusz


On Thu, Sep 15 2016 at 21:12 CEST, jsb...@mimuw.edu.pl writes:
> On Thu, Sep 15 2016 at 16:36 CEST, john.w.kenn...@gmail.com writes:
>
> [...]
>
>> In the new Swift programming language, which is white-hot in the Apple
>> community, Apple is moving toward a model of a transparent, generic
>> Unicode that can be “viewed” as UTF-8, UTF-16, or UTF-32 if necessary,
>> but in which a “character” contains however many code points it needs
>> (“e” with a stacked macron, acute accent, and dieresis is
>> algorithmically one “character” in Swift). Moreover,
>> e-with-an-acute-accent and e followed by a combining acute accent, for
>> example, compare as equal. At present, the underlying code is still
>> UTF-16LE.
>
> For several years I use the name "textel" (text element, in Polish
> "tekstel") for such objects. I do it mostly orally in my presentations
> for my students, but I used it also in writing e.g. in
> http://bc.klf.uw.edu.pl/118/, unfortunately without a proper
> definition. A rudimentary definition was provided for me only in my
> recent paper in Polish: http://bc.klf.uw.edu.pl/480/. It states simply
> (on p. 69) "an elementary text element independently of its Unicode
> representation" (meaning in particular composed vs precomposed). I still
> hope to formulate sooner or later a more satisfactory definition :-)
>
> I think Swift confirms that such a notion is really needed.
>
> Best regards
>
> Janusz

On Wed, Sep 21 2016 at  6:44 CEST, jsb...@mimuw.edu.pl writes:
> On Tue, Sep 20 2016 at 18:09 CEST, d...@ewellic.org writes:
>> Janusz Bień wrote:
>>
>>> For me it means that Swift's characters are equivalence classes of the
>>> set of extended grapheme clusters by canonical equivalence relation.
>>
>> I still hope we can come to some conclusion on the correct Unicode name
>> for this concept. I don't think non-Unicode interpretations of terms
>> like "grapheme" are grounds for throwing out "grapheme cluster,"
>
> I agree.
>
>> but I can see that the equivalence class itself is lacking a name.
>
> I'm glad.
>
>>
>> Note that the Swift definition doesn't say that <00E9> and <0065 0301>
>> are identical entities, only that the language compares them as equal.
>
> I'm fully aware of this.
>
> Best regards
>
> Janusz


-- 
Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki 
Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
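
The distinction quoted above, between being identical and comparing as
equal, can be sketched in Python with the standard unicodedata module.
This is not how Swift implements its comparison; the helper name
canonically_equal is an assumption, and normalizing both sides to NFD
is just one simple way to test canonical equivalence.

```python
import unicodedata


def canonically_equal(a, b):
    """True if a and b are canonically equivalent (compared after NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


precomposed = "\u00E9"       # <00E9>, e with acute as one code point
decomposed = "\u0065\u0301"  # <0065 0301>, e plus combining acute accent

# Different code point sequences, so plain comparison reports inequality.
print(precomposed == decomposed)
# The same equivalence class under canonical equivalence.
print(canonically_equal(precomposed, decomposed))
```

A normalized form such as NFC or NFD can serve as a representative of
the equivalence class, even though the class itself has no agreed name.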



Re: "A Programmer's Introduction to Unicode"

2017-03-10 Thread Manish Goregaokar
I recently wrote
http://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
, which sort of addresses the whole hangup programmers have with
treating code points as "characters".

I also wrote 
http://manishearth.github.io/blog/2017/01/15/breaking-our-latin-1-assumptions/
that provides a useful list of scripts to check against when figuring
out if your design makes sense uniformly across scripts.


There's also https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/
-Manish


On Fri, Mar 10, 2017 at 9:00 AM, Peter Constable  wrote:
> FYI:
>
> http://reedbeta.com/blog/programmers-intro-to-unicode/
>
> The visuals may be the most interesting part. E.g., in the usage heat map,
> Arabic Presentation Forms-B lights up much more than I would have expected –
> as much as a lot of emoji.
>
> Peter



Re: "A Programmer's Introduction to Unicode"

2017-03-10 Thread Khaled Hosny
On Fri, Mar 10, 2017 at 05:00:55PM +, Peter Constable wrote:
> FYI:
> 
> http://reedbeta.com/blog/programmers-intro-to-unicode/
> 
> The visuals may be the most interesting part. E.g., in the usage heat
> map, Arabic Presentation Forms-B lights up much more than I would have
> expected

I often see U+FEFB and other lam-alef ligatures used on social media (I
easily spot it because my default font does not have them so they end up
using fallback font).

My guess is that might be because some keyboard layouts (Xorg, Android?)
use them for the lam-alef keys on the keyboard (I’m guilty of doing this
for Xorg keyboard layout because it didn’t handle more than one
character per key, this was then decomposed back inside XIM input
method, but many people don’t use XIM and the decomposition does not
happen, it was messy overall).

Regards,
Khaled
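
The decomposition mentioned above can be sketched in Python with the
standard unicodedata module. This only illustrates the Unicode
compatibility mapping of U+FEFB; it is not the XIM code, and the
variable names are assumptions made for the example.

```python
import unicodedata

# U+FEFB ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM,
# from the Arabic Presentation Forms-B block
ligature = "\uFEFB"

# Compatibility normalization maps the presentation form back to the
# ordinary letters LAM (U+0644) and ALEF (U+0627).
decomposed = unicodedata.normalize("NFKC", ligature)

# Canonical normalization alone leaves the ligature untouched.
unchanged = unicodedata.normalize("NFC", ligature)

print([f"U+{ord(c):04X}" for c in decomposed])
print(unchanged == ligature)
```

Text that is never passed through a compatibility normalization step
keeps the presentation form, which fits the explanation that these code
points survive when the decomposition does not happen.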