Re: "A Programmer's Introduction to Unicode"

2017-03-14 Thread Richard Wordingham
On Tue, 14 Mar 2017 08:51:18 +
Alastair Houghton  wrote:

> On 14 Mar 2017, at 02:03, Richard Wordingham
>  wrote:
> > 
> > On Mon, 13 Mar 2017 19:18:00 +
> > Alastair Houghton  wrote:

> > The problem is that UTF-16 based code can very easily overlook the
> > handling of surrogate pairs, and one can very easily get confused
> > over what string lengths mean.  
> 
> Yet the same problem exists for UCS-4; it could very easily overlook
> the handling of combining characters.

That's a different issue.  I presume you mean the issues of canonical
equivalence and detecting text boundaries.  Again, there is the problem
of remembering to consider the whole surrogate pair when using
UTF-16.  (I suppose this could be largely handled by avoiding the
concept of arrays.)  Now, the supplementary characters where these
issues arise are very infrequently used.  An error in UTF-16 code might
easily escape notice, whereas a problem with UCS-4 (or UTF-8)
comes to light as soon as one handles Thai or IPA.

> As for string lengths, string
> lengths in code points are no more meaningful than string lengths in
> UTF-16 code units.  They don’t tell you anything about the number of
> user-visible characters; or anything about the width the string will
> take up if rendered on the display (even in a fixed-width font); or
> anything about the number of glyphs that a given string might be
> transformed into by glyph mapping.  The *only* thing a string length
> of a Unicode string will tell you is the number of code units.

A string length in codepoints does have the advantage of being
independent of encoding.  I'm actually using an index for UTF-16
text (I don't know whether it's denominated in codepoints or code
units) to index into the UTF-8 source code.  However, the number of code
units is the more commonly used quantity, as it tells one how much
memory is required for simple array storage.
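The distinction is easy to see in practice; a quick illustrative Python comparison of the same string's "length" under each encoding form:

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so it needs a
# surrogate pair in UTF-16 and four bytes in UTF-8.
s = "G\U0001D11E"  # 'G' plus U+1D11E

code_points = len(s)                           # Python counts code points
utf8_units = len(s.encode("utf-8"))            # UTF-8 code units (bytes)
utf16_units = len(s.encode("utf-16-le")) // 2  # UTF-16 code units
utf32_units = len(s.encode("utf-32-le")) // 4  # UTF-32 code units

print(code_points, utf8_units, utf16_units, utf32_units)  # 2 5 3 2
```

Only the code-unit counts tell you how much memory a simple array of that encoding's units will occupy; the code-point count is the one that stays the same across all three.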

Richard.



RE: "A Programmer's Introduction to Unicode"

2017-03-14 Thread Doug Ewell
Philippe Verdy wrote:

>>> Well, you do have eleven bits for flags per codepoint, for example.
>>
>> That's not UCS-4; that's a custom encoding.
>>
>> (any UCS-4 code unit) & 0xFFE00000 == 0

(changing to "UTF-32" per Ken's observation)

> Per definition yes, but UTC-4 is not Unicode.

I guess it's not. What is UTC-4, anyway? Another name for a UWG meeting
held in 1989?

> As well, (any UCS-4 code unit) & 0xFFE00000 == 0 (i.e. 21 bits) is not
> Unicode; UTF-32 is Unicode (more restrictive than just 21 bits, which
> would allow 32 planes instead of just the first 17).

I used bitwise arithmetic strictly to address Steffen's premise that the
11 "unused bits" in a UTF-32 code unit were available to store metadata
about the code point. Of course UTF-32 does not allow 0x110000 through
0x1FFFFF either.

> I suppose he meant 21 bits, not 11 bits which covers only a small part
> of the BMP.

No, his comment "you do have eleven bits for flags per codepoint" pretty
clearly referred to using the "extra" 11 bits beyond what is needed to
hold the Unicode scalar value.
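Mechanically, Steffen's premise is simple enough to sketch; here is a purely illustrative Python snippet (names and layout are mine, not anyone's actual API) packing 11 flag bits above the 21 bits a Unicode scalar value needs:

```python
# Hypothetical: store 11 bits of per-codepoint metadata in the bits of a
# 32-bit unit that a Unicode scalar value (max U+10FFFF) never uses.
SCALAR_MASK = 0x001FFFFF   # low 21 bits: the code point itself
FLAG_SHIFT = 21            # flags occupy bits 21..31

def pack(cp: int, flags: int) -> int:
    assert cp <= 0x10FFFF and flags < (1 << 11)
    return (flags << FLAG_SHIFT) | cp

def unpack(unit: int) -> tuple[int, int]:
    return unit & SCALAR_MASK, unit >> FLAG_SHIFT

unit = pack(0x1D11E, 0b101)
assert unpack(unit) == (0x1D11E, 0b101)
# A *conforming* UTF-32 code unit, by contrast, has all 11 of those
# bits clear:
assert 0x1D11E & 0xFFE00000 == 0
```

Which is exactly the point: the instant any of those bits is set, the result is no longer UTF-32 but a custom encoding.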
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: "A Programmer's Introduction to Unicode"

2017-03-14 Thread Philippe Verdy
Per definition yes, but UTC-4 is not Unicode.
As well, (any UCS-4 code unit) & 0xFFE00000 == 0 (i.e. 21 bits) is not
Unicode; UTF-32 is Unicode (more restrictive than just 21 bits, which would
allow 32 planes instead of just the first 17).
I suppose he meant 21 bits, not 11 bits which covers only a small part of
the BMP.

2017-03-14 16:14 GMT+01:00 Doug Ewell :

> Steffen Nurpmeso wrote:
>
> >> I didn’t say you never needed to work with code points. What I said
> >> is that there’s no advantage to UCS-4 as an encoding, and that
> >
> > Well, you do have eleven bits for flags per codepoint, for example.
>
> That's not UCS-4; that's a custom encoding.
>
> (any UCS-4 code unit) & 0xFFE00000 == 0
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>


Re: "A Programmer's Introduction to Unicode"

2017-03-14 Thread Doug Ewell
Steffen Nurpmeso wrote:

>> I didn’t say you never needed to work with code points. What I said
>> is that there’s no advantage to UCS-4 as an encoding, and that
>
> Well, you do have eleven bits for flags per codepoint, for example. 

That's not UCS-4; that's a custom encoding.

(any UCS-4 code unit) & 0xFFE00000 == 0
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: "A Programmer's Introduction to Unicode"

2017-03-14 Thread Steffen Nurpmeso
Alastair Houghton  wrote:
 |On 13 Mar 2017, at 21:10, Khaled Hosny  wrote:
 |> On Mon, Mar 13, 2017 at 07:18:00PM +, Alastair Houghton wrote:
 |>> On 13 Mar 2017, at 17:55, J Decker  wrote:
 |>>> 
 |>>> I liked the Go implementation of character type - a rune type - \
 |>>> which is a codepoint, and strings that return runes by index.
 |>>> https://blog.golang.org/strings
 |>> 
 |>> IMO, returning code points by index is a mistake.  It over-emphasises
 |>> the importance of the code point, which helps to continue the notion
 |>> in some developers’ minds that code points are somehow “characters”.
 |>> It also leads to people unnecessarily using UCS-4 as an internal
 |>> representation, which seems to have very few advantages in practice
 |>> over UTF-16.
 |> 
 |> But there are many text operations that require access to Unicode code
 |> points. Take for example text layout, as mapping characters to glyphs
 |> and back has to operate on code points. The idea that you never need to
 |> work with code points is too simplistic.
 |
 |I didn’t say you never needed to work with code points.  What I said \
 |is that there’s no advantage to UCS-4 as an encoding, and that there’s \

Well, you do have eleven bits for flags per codepoint, for example.

 |no advantage to being able to index a string by code point.  As it \

With UTF-32 the code unit *is* the codepoint, so you can look it up
directly in the Unicode classification tables.
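That directness is easy to demonstrate; a small illustrative Python example (using the standard `unicodedata` module, not anyone's code from this thread):

```python
import unicodedata

# With a full code point in hand, a property lookup is a direct query;
# no surrogate-decoding step has to happen first.
cp = 0x0E01  # THAI CHARACTER KO KAI
ch = chr(cp)
print(unicodedata.name(ch))      # THAI CHARACTER KO KAI
print(unicodedata.category(ch))  # Lo (Letter, other)
```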

 |happens, I’ve written the kind of code you cite as an example, including \
 |glyph mapping and OpenType processing, and the fact is that it’s no \
 |harder to do it with a UTF-16 string than it is with a UCS-4 string. \
 | Yes, certainly, surrogate pairs need to be decoded to map to glyphs; \
 |but that’s a *trivial* matter, particularly as the code point to glyph \
 |mapping is not 1:1 or even 1:N - it’s N:M, so you already need to cope \
 |with being able to map multiple code units in the string to multiple \
 |glyphs in the result.

If you have to iterate over a string to perform some high-level
processing, then UTF-8 is an almost equally fine choice, for the
very same reasons you bring up.  And if the usage-pattern
"hotness" pictures shown at the beginning of this thread are
correct, then the size overhead of UTF-8 that the UTF-16
proponents point out turns out to be a non-issue.

But I for one gave up on making a stand against UTF-16 or BOMs.
In fact I have come to think UTF-16 is a pretty nice in-memory
representation, and it is a small step to get from it to the real
codepoint that you need in order to decide what something is and
what has to be done with it.  I don't know whether I would really
use it for this purpose, though; I am pretty sure that my core
Unicode functions will (start to /) continue to use UTF-32,
because the codepoint-to-codepoint(s) mapping is what the standard
describes, and anything else can be implemented on top of that.
E.g., you can store three UTF-32 codepoints in a single uint64_t,
and I would shoot myself in the foot if I made that accessible
only via a UTF-16 or UTF-8 converter, imho; instead, I (will) make
it accessible directly as UTF-32, and that serves all other
formats equally well.  Of course, if it is clear that you are
UTF-16 all the way through, then you can save the conversion, but
the most widespread Uni(x|ces) are UTF-8 based and it looks as if
that will stay so.  Yes, you can nonetheless use UTF-16, but it
will most likely not save you anything on the database side, due
to storage alignment requirements and the necessity to be able to
access the data somewhere.  If you have a single index-lookup
array and a dynamically sized database storage which uses two-byte
alignment, then I can imagine UTF-16 comes out for the better.
I never looked at how ICU does it, but I have been impressed by
sheer data facts ^.^
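The three-codepoints-in-a-uint64_t idea works because 3 × 21 = 63 bits; a sketch of one possible packing (illustrative Python, my layout rather than Steffen's actual code):

```python
# Pack three 21-bit Unicode scalar values into one 64-bit integer.
# 3 * 21 = 63 bits, so all three fit with one bit to spare.
def pack3(a: int, b: int, c: int) -> int:
    for cp in (a, b, c):
        assert 0 <= cp <= 0x10FFFF
    return a | (b << 21) | (c << 42)

def unpack3(word: int) -> tuple[int, int, int]:
    m = 0x1FFFFF  # 21-bit mask
    return word & m, (word >> 21) & m, (word >> 42) & m

w = pack3(0x41, 0x1D11E, 0x0E01)
assert w < (1 << 64)
assert unpack3(w) == (0x41, 0x1D11E, 0x0E01)
```

Exposing such a structure via a UTF-8 or UTF-16 view would force a decode on every access, which is presumably the foot-shooting Steffen has in mind.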

--steffen



Re: "A Programmer's Introduction to Unicode"

2017-03-14 Thread Alastair Houghton
On 14 Mar 2017, at 02:03, Richard Wordingham  
wrote:
> 
> On Mon, 13 Mar 2017 19:18:00 +
> Alastair Houghton  wrote:
> 
>> IMO, returning code points by index is a mistake.  It over-emphasises
>> the importance of the code point, which helps to continue the notion
>> in some developers’ minds that code points are somehow “characters”.
>> It also leads to people unnecessarily using UCS-4 as an internal
>> representation, which seems to have very few advantages in practice
>> over UTF-16.
> 
> The problem is that UTF-16 based code can very easily overlook the
> handling of surrogate pairs, and one can very easily get confused over
> what string lengths mean.

Yet the same problem exists for UCS-4; it could very easily overlook the 
handling of combining characters.  As for string lengths, string lengths in 
code points are no more meaningful than string lengths in UTF-16 code units.  
They don’t tell you anything about the number of user-visible characters; or 
anything about the width the string will take up if rendered on the display 
(even in a fixed-width font); or anything about the number of glyphs that a 
given string might be transformed into by glyph mapping.  The *only* thing a 
string length of a Unicode string will tell you is the number of code units.
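A concrete illustration of that last point (Python, illustrative only): no length you can compute from the code units predicts what a user sees.

```python
# "é" written as base letter + combining acute: one user-visible
# character, yet every programmatic "length" says something else.
s = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(len(s))                          # 2 code points
print(len(s.encode("utf-16-le")) // 2) # 2 UTF-16 code units
print(len(s.encode("utf-8")))          # 3 UTF-8 code units
# User-visible characters (grapheme clusters): 1 -- and no built-in
# length function reports that.
```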

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: "A Programmer's Introduction to Unicode"

2017-03-14 Thread Alastair Houghton
On 13 Mar 2017, at 21:10, Khaled Hosny  wrote:
> 
> On Mon, Mar 13, 2017 at 07:18:00PM +, Alastair Houghton wrote:
>> On 13 Mar 2017, at 17:55, J Decker  wrote:
>>> 
>>> I liked the Go implementation of character type - a rune type - which is a 
>>> codepoint, and strings that return runes by index.
>>> https://blog.golang.org/strings
>> 
>> IMO, returning code points by index is a mistake.  It over-emphasises
>> the importance of the code point, which helps to continue the notion
>> in some developers’ minds that code points are somehow “characters”.
>> It also leads to people unnecessarily using UCS-4 as an internal
>> representation, which seems to have very few advantages in practice
>> over UTF-16.
> 
> But there are many text operations that require access to Unicode code
> points. Take for example text layout, as mapping characters to glyphs
> and back has to operate on code points. The idea that you never need to
> work with code points is too simplistic.

I didn’t say you never needed to work with code points.  What I said is that 
there’s no advantage to UCS-4 as an encoding, and that there’s no advantage to 
being able to index a string by code point.  As it happens, I’ve written the 
kind of code you cite as an example, including glyph mapping and OpenType 
processing, and the fact is that it’s no harder to do it with a UTF-16 string 
than it is with a UCS-4 string.  Yes, certainly, surrogate pairs need to be 
decoded to map to glyphs; but that’s a *trivial* matter, particularly as the 
code point to glyph mapping is not 1:1 or even 1:N - it’s N:M, so you already 
need to cope with being able to map multiple code units in the string to 
multiple glyphs in the result.
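Decoding a surrogate pair really is a couple of shifts and adds; a minimal illustrative sketch (Python, not Alastair's actual code):

```python
# Combine a UTF-16 surrogate pair into a Unicode scalar value.
def decode_surrogate_pair(high: int, low: int) -> int:
    assert 0xD800 <= high <= 0xDBFF, "not a high surrogate"
    assert 0xDC00 <= low <= 0xDFFF, "not a low surrogate"
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+1D11E MUSICAL SYMBOL G CLEF encodes as D834 DD1E in UTF-16.
assert decode_surrogate_pair(0xD834, 0xDD1E) == 0x1D11E
```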

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: "A Programmer's Introduction to Unicode"

2017-03-14 Thread Manish Goregaokar
Ah, it was what I thought you were talking about -- I wasn't aware they
were considered word boundaries :)

Thanks for the links!

On Mar 13, 2017 4:54 PM, "Richard Wordingham" <
richard.wording...@ntlworld.com> wrote:

On Mon, 13 Mar 2017 15:26:00 -0700
Manish Goregaokar  wrote:

> Do you have examples of AA being split that way (and further reading)?
> I think I'm aware of what you're talking about, but would love to read
> more about it.

Just googling for the three words 'Sanskrit', 'sandhi' and 'resolution'
brings up plenty of papers and discussion, e.g. Hellwig's at
http://ltc.amu.edu.pl/book/papers/LRL-1.pdf and a multi-author paper at
https://www.aclweb.org/anthology/C/C16/C16-1048.pdf.

There are even technical terms for before and after.  Unsplit text is
'samhita text', and text split into words is 'pada text'.

Richard.