Arranging Hieroglyphics (was: A sign/abbreviation for "magister")

2018-11-04 Thread Richard Wordingham via Unicode
On Sat, 3 Nov 2018 22:55:17 +0100
Philippe Verdy via Unicode wrote:

> I can also cite the case of Egyptian hieroglyphs: there's still no
> way to render them correctly because we lack the development of a
> stable orthography that would drive the encoding of the missing
> **semantic** characters (for this reason Egyptian hieroglyphs still
> require an upper layer protocol, as there's still no accepted
> orthographic norm that successfully represents all possible semantic
> variations, but also because the research on Old Egyptian
> hieroglyphs is still very incomplete).

If you study the document register, you'll find that layout
control characters are being added.  I think semantic characters would
have depended on the font to select the rendering consequences; this
will now not happen.  What we're getting is a more rigorous version of
the Manuel de Codage.

Richard.
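
To make that concrete: the relation between Manuel de Codage markup and the
new layout controls can be sketched in a few lines of Python. This is only a
sketch under stated assumptions: the sign map below contains just U+13000
EGYPTIAN HIEROGLYPH A001, and only the two basic MdC layout operators are
handled, mapped to the proposed U+13430 EGYPTIAN HIEROGLYPH VERTICAL JOINER
and U+13431 EGYPTIAN HIEROGLYPH HORIZONTAL JOINER format controls.

import re

# Two basic Manuel de Codage layout operators, mapped to the proposed
# Egyptian hieroglyph format controls from the document register.
MDC_TO_CONTROL = {
    ':': '\U00013430',  # vertical stacking
    '*': '\U00013431',  # horizontal juxtaposition within a group
}

def mdc_group_to_unicode(group, sign_map):
    # Split on the operators but keep them as tokens.
    tokens = [t for t in re.split(r'([:*])', group) if t]
    return ''.join(MDC_TO_CONTROL.get(t, sign_map.get(t, t)) for t in tokens)

print(ascii(mdc_group_to_unicode('A1*A1:A1', {'A1': '\U00013000'})))
# '\U00013000\U00013431\U00013000\U00013430\U00013000'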


Re: UCA unnecessary collation weight 0000

2018-11-04 Thread Philippe Verdy via Unicode
So you finally admit that I was right... and that the specs include
requirements that are not even needed to make UCA work, and that are not
even used by well-known implementations. These are old artefacts which are
now really confusing (instructing programmers to adopt the old deprecated
behavior, before realizing that this was bad advice which just complicated
their task). UCA can be implemented **conformantly** without these, even
for the simplest implementations (where using a complex package like ICU is
not an option, and rewriting it is not one either for much simpler goals),
where these incorrect requirements in fact suggest being more inefficient
than really needed.
There's not a lot of work needed to edit and fix the specs to remove these
polluting 0000 "pseudo-weights".

On Sun, Nov 4, 2018 at 09:27, Mark Davis ☕️ wrote:

> Philippe, I agree that we could have structured the UCA differently. It
> does make sense, for example, to have the weights be simply decimal values
> instead of integers. But nobody is going to go through the substantial
> work of restructuring the UCA spec and data file unless there is a very
> strong reason to do so. It takes far more time and effort than people
> realize to change the algorithm/data while making sure that everything
> lines up without inadvertent changes being introduced.
>
> It is just not worth the effort. There are so, so many things we can do
> in Unicode (encoding, properties, algorithms, CLDR, ICU) that have a higher
> benefit.
>
> You can continue flogging this horse all you want, but I'm muting this
> thread (and I suspect I'm not the only one).
>
> Mark
>
>
> On Sun, Nov 4, 2018 at 2:37 AM Philippe Verdy via Unicode <
> unicode@unicode.org> wrote:
>
>> On Fri, Nov 2, 2018 at 22:27, Ken Whistler wrote:
>>
>>>
>>> On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote:
>>>
>>> I was replying not about the notational representation of the DUCET data
>>> table (using [....] unnecessarily) but about the text of UTR#10 itself.
>>> Which remains highly confusing, and contains completely unnecessary steps,
>>> and just complicates things with absolutely no benefit at all by
>>> introducing confusion about these "0000".
>>>
>>> Sorry, Philippe, but the confusion that I am seeing introduced is what
>>> you are introducing to the unicode list in the course of this discussion.
>>>
>>>
>>> UTR#10 still does not explicitly state that its use of "0000" does not
>>> mean it is a valid "weight", it's a notation only
>>>
>>> No, it is explicitly a valid weight. And it is explicitly and
>>> normatively referred to in the specification of the algorithm. See UTS10-D8
>>> (and subsequent definitions), which explicitly depend on a definition of "A
>>> collation weight whose value is zero." The entire statement of what are
>>> primary, secondary, tertiary, etc. collation elements depends on that
>>> definition. And see the tables in Section 3.2, which also depend on those
>>> definitions.
>>>
>>> (but the notation is used for TWO distinct purposes: one is for
>>> presenting the notation format used in the DUCET
>>>
>>> It is *not* just a notation format used in the DUCET -- it is part of
>>> the normative definitional structure of the algorithm, which then
>>> percolates down into further definitions and rules and the steps of the
>>> algorithm.
>>>
>>
>> I insist that this is NOT NEEDED at all for the definition; it is
>> absolutely NOT structural. The algorithm still guarantees the SAME result.
>>
>> It is ONLY used to explain the format of the DUCET and the fact that this
>> format does NOT use 0000 as a valid weight, and so can use it as a notation
>> (in fact only a presentational feature).
>>
>>
>>> itself to present how collation elements are structured, the other one
>>> is for marking the presence of a possible, but not always required,
>>> encoding of an explicit level separator for encoding sort keys).
>>>
>>> That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It
>>> is not part of the *notation* for collation elements, but instead is a
>>> magic value chosen for the level separator precisely because zero values
>>> from the collation elements are removed during sort key construction, so
>>> that zero is then guaranteed to be a lower value than any remaining weight
>>> added to the sort key under construction. This part of the algorithm is not
>>> rocket science, by the way!
>>>
>>
>> Here again you are conflating two things: a sort key MAY use them as
>> separators if it wants to compress keys by reencoding weights per level:
>> that's the only case where you may want to introduce an encoding pattern
>> starting with 0, while the rest of the encoding for weights in that level
>> must use patterns not starting with this 0 (the number of bits used to
>> encode this 0 does not matter: it is only part of the encoding used on
>> this level, which does not necessarily have to use 16-bit code units per
>> weight).
>>
>>>
>>> Even the example tables can be 
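
To make the disputed step concrete, here is a minimal Python sketch of
sort-key formation as described in UTS #10, Section 7.3, with made-up
collation weights (not real DUCET data): zero weights are simply omitted
level by level, and a zero is appended as the level separator. Note that the
sketch keeps one buffer per level, as argued above, and only concatenates at
the end.

# Minimal sketch of UTS #10 sort-key formation (Section 7.3).
# Illustrative (primary, secondary, tertiary) weights -- not DUCET data.

LEVEL_SEPARATOR = 0  # sorts below any nonzero weight by construction

def form_sort_key(collation_elements):
    # One buffer per level; zero weights never reach the key.
    levels = [[ce[level] for ce in collation_elements if ce[level] != 0]
              for level in range(3)]
    key = []
    for i, weights in enumerate(levels):
        if i > 0:
            key.append(LEVEL_SEPARATOR)
        key.extend(weights)
    return tuple(key)

# "ab" vs "Ab" with made-up weights: equal primaries and secondaries,
# a tertiary (case-like) difference decides the order.
ab = [(0x28, 0x20, 0x02), (0x2A, 0x20, 0x02)]
Ab = [(0x28, 0x20, 0x08), (0x2A, 0x20, 0x02)]
assert form_sort_key(ab) < form_sort_key(Ab)
print(form_sort_key(ab))  # (40, 42, 0, 32, 32, 0, 2, 2)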

Re: Encoding

2018-11-04 Thread Philippe Verdy via Unicode
I can take another example of what I call "legacy encoding" (which
really means that such an encoding is just an "approximation" from which no
semantics can be clearly inferred, except by using a non-deterministic
heuristic, which can frequently make "false guesses").

Consider the case of the legacy Hangul "half-width" jamos: they were kept
in Unicode (as compatibility characters) but are not recommended for
encoding natural Korean text, because their semantics are not clear when
they are used in sequences: it's impossible to know clearly where
semantically significant syllable breaks occur, because they don't
distinguish the "leading" and "trailing" consonants, and so it is not even
possible to clearly infer that any Hangul "half-width" vowel jamo is
logically attached to the same syllable as the "half-width" consonant (or
consonant+vowel) jamo that is encoded just before it. As a consequence,
you cannot safely convert Korean texts using these "half-width" jamos into
normal jamos: a heuristic can only attempt to determine the syllable
breaks and then infer the "leading" or "trailing" semantic of consonants.
This last semantic ("leading" or "trailing") is exactly like a letter-case
distinction in Latin, so it can be said that the Korean alphabet is
bicameral for consonants, but only monocameral for vowels: each Hangul
syllable normally starts with an "uppercase-like" consonant, or with a
consonant filler which is also "uppercase-like", and all other consonants
and all vowels are "lowercase-like". The heuristic that transforms the
legacy "half-width" jamos into normal jamos does just the same thing as
the heuristic used in Latin that attempts to capitalize some leading
letters in words: it works frequently, but it also fails, and that
heuristic is lossy in Latin, just like it is lossy in Korean!
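
To illustrate the kind of guessing involved, here is a toy Python sketch of
such a heuristic, with a deliberately tiny mapping table (real text needs
the full half-width jamo range U+FFA0..U+FFDC): a half-width consonant is
guessed to be "leading" (U+1100 block) when a vowel follows, and "trailing"
(U+11A8 block) otherwise, and the guess can fail exactly as described above.

# Toy heuristic: half-width jamos to conjoining jamos. Tiny tables,
# illustrative only.
LEAD  = {'\uFFA1': '\u1100', '\uFFB1': '\u1106'}  # KIYEOK, MIEUM leading
TRAIL = {'\uFFA1': '\u11A8', '\uFFB1': '\u11B7'}  # KIYEOK, MIEUM trailing
VOWEL = {'\uFFC2': '\u1161', '\uFFCC': '\u1169'}  # A, O (medial vowels)

def to_conjoining(text):
    out = []
    for i, ch in enumerate(text):
        if ch in VOWEL:
            out.append(VOWEL[ch])
        elif i + 1 < len(text) and text[i + 1] in VOWEL:
            out.append(LEAD[ch])   # consonant before a vowel: guess leading
        else:
            out.append(TRAIL[ch])  # otherwise: guess trailing -- may be wrong
    return ''.join(out)

# MIEUM, O, KIYEOK is guessed as one syllable "mok": U+1106 U+1169 U+11A8
print(ascii(to_conjoining('\uFFB1\uFFCC\uFFA1')))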

The same can be said about the heuristics that attempt to infer an
abbreviation semantic from existing superscript letters (either encoded in
Unicode, or encoded as plain letters modified by superscripting style in
CSS or HTML, or in word processors for example): they fail to give the
correct guess most of the time if there's no user to confirm the actual
intended meaning.
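
The lossiness is easy to see with the already-encoded modifier letters; a
quick Python check shows that NFKC folds U+02B3 MODIFIER LETTER SMALL R to
a plain "r", and nothing can map it back:

import unicodedata

# U+02B3 MODIFIER LETTER SMALL R could mean an abbreviation ("Mr"), a
# phonetic annotation, or mere styling -- the code point does not say which.
abbreviated = 'M\u02B3'  # "Mʳ"
folded = unicodedata.normalize('NFKC', abbreviated)
assert folded == 'Mr'    # the superscripting, and any semantics, is gone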

Such confirmation is the job of spell correctors in word processors: they
must clearly inform the user and let them decide; all that spell checkers
can do is provide visual hints to the user editing the document, such as
the common wavy red underline, indicating that several interpretations are
possible or that this is not the preferred encoding to use to convey the
correct semantic.

A spell checker may be instructed to do the conversion automatically, while
text is being typed, but there must be a way for the user to cancel this
transform and make his own decision about the real meaning, even if
canceling the automatic transform causes the "wavy red underline" to
appear; the user may type "Mr.", then the wavy line will appear under these
3 characters, and the spell checker will propose to encode it as an
abbreviation <M, r, combining abbreviation mark> or leave "Mr." unchanged
(and no longer signaled), in which case the dot remains a regular
punctuation and the "r" is not modified. Then the user may choose to style
the "r" with superscripting or underlining, and a new wavy red underline
will appear below the three characters "Mr.", proposing to transform only
the styled "r" into <r, combining abbreviation mark>; even when the user
accepts one of these suggestions it will remain possible to infer the
semantics of an abbreviation (and to propose replacing or keeping the dot
after it). Or the user may do nothing and cancel these suggestions (to hide
the wavy red underline hint added by the spell checker), or instruct the
spell checker that the meaning of the superscript "r" is that of a
mathematical exponent, or a chemical notation.

In all cases, the user/author has full control of the intended meaning of
his text, and an informed decision is made where all cases are now
distinguished. "Legacy" encoding can be kept as is (in Unicode), even if
it's no longer recommended, just like Unicode has documented that
half-width Hangul is deprecated (it just offers a "compatibility
decomposition" for NFKD or NFKC, but this is lossy and cannot be done
automatically without a human decision).

And the user/author can now freely and easily compose any abbreviation he
wishes in natural languages, without being limited by the reduced "legacy"
set of superscript letters encoded in Unicode (which should no longer be
extended, except for use as distinct plain letters needed in the alphabets
of actual natural languages, or as possible new IPA symbols), and without
using the styling tricks (of HTML/CSS, or of word-processor documents,
spreadsheets, and presentation documents allowing "rich text" formats on
top of "plain text") which are best suited for "free styling" of any human
text without any additional semantics (or as a legacy but insufficient
trick for maths and chemical notations).



On Sun, Nov 4, 2018 at 20:51, Philippe Verdy wrote:

> Note 

Re: Encoding (was: Re: A sign/abbreviation for "magister")

2018-11-04 Thread Marcel Schneider via Unicode

Sorry, I didn’t truncate the subject line, it was my mail client.

On 04/11/2018 17:45, Philippe Verdy wrote:


Note that I actually propose not just one rendering for the
<combining abbreviation mark> but two possible variants (that would
be equally valid without preference). Use it after any base cluster
(including with diacritics if needed, like combining underlines).

- the first one can be to render the previous cluster as superscript
(very easy to implement synthetically by any text renderer)

- the second one can be to render it as an abbreviation dot (also
very easy to do)

Fonts can provide their own mapping (e.g. to offer alternate glyph
forms or kerning for the superscript; they can also reuse the letter
forms used for other existing encoded superscript letters, or
position the abbreviation dot with negative kerning, for example
after a T), in which case the renderer does not have to synthesize
the rendering for any combining sequence not mapped in the
font.

Allowing this variation from the start will:

- allow renderers to support it fast (so a rapid adoption for
encoding texts in human languages, instead of the few legacy
superscript letters).

- allow font designers to develop and provide reasonable mappings if
needed (to adjust the position or size of the superscript) in updated
fonts (no requirement for them to add new glyphs if it's just to map
the same glyphs used by existing superscript letters)

- also prohibit the abuse of this mark for every text that one would
want to write in superscript (these cases can still use the few
existing superscript letters/digits/signs that are already encoded),
so this is not suitable for example for marking mathematical
exponents (e.g. "x²": if it's encoded as <x, 2, combining abbreviation
mark> it could validly be rendered as "x2."); exponents must use the
superscript (either the already encoded ones, or using external
styles like in HTML/CSS, or in LaTeX which uses the notation "x^2",
both as a style, but also with the intended semantic of an exponent
and certainly not the intended semantic of an abbreviation)


Unicode always (or in principle) aims at polyvalence, making characters
reusable and repurposable, while the combining abbreviation mark does
not solve the problems around representing chemicals better in plain
text, as seen in the parent thread, for example. I don’t advocate
this use case, as I’m only lobbying for natural languages’ support as
specified in the Standard,* but it shouldn’t be forgotten, given there is
some point in not disfavoring chemistry compared to mathematics, which is
already widely favored over chemistry when looking at the symbol blocks,
while chemistry is denied three characters because they are subscript
forms of already encoded letters.

Beyond that, the problem with *COMBINING ABBREVIATION MARK is that it
needs OpenType support to work, while directly encoding preformatted
superscripts and using them as abbreviation indicators for an interoperable
digital representation of natural languages does not.

Best regards,

Marcel
* As already repeatedly stated, I’m invoking the one bit where TUS states
that all natural languages shall be given a semantically unambiguous (i.e.
not introducing new ambiguity) and interoperable digital representation.




Re: Encoding (was: Re: A sign/abbreviation for "magister")

2018-11-04 Thread Philippe Verdy via Unicode
Note that I actually propose not just one rendering for the <combining
abbreviation mark> but two possible variants (that would be equally valid
without preference). Use it after any base cluster (including with
diacritics if needed, like combining underlines).
- the first one can be to render the previous cluster as superscript (very
easy to implement synthetically by any text renderer)
- the second one can be to render it as an abbreviation dot (also very easy
to do)
Fonts can provide their own mapping (e.g. to offer alternate glyph forms or
kerning for the superscript; they can also reuse the letter forms used for
other existing encoded superscript letters, or position the
abbreviation dot with negative kerning, for example after a T), in which
case the renderer does not have to synthesize the rendering for any
combining sequence not mapped in the font.

Allowing this variation from the start will:
- allow renderers to support it fast (so a rapid adoption for encoding
texts in human languages, instead of the few legacy superscript letters).
- allow font designers to develop and provide reasonable mappings if
needed (to adjust the position or size of the superscript) in updated fonts
(no requirement for them to add new glyphs if it's just to map the same
glyphs used by existing superscript letters)
- also prohibit the abuse of this mark for every text that one would want
to write in superscript (these cases can still use the few existing
superscript letters/digits/signs that are already encoded), so this is not
suitable for example for marking mathematical exponents (e.g. "x²": if it's
encoded as <x, 2, combining abbreviation mark> it could validly be rendered
as "x2."); exponents must use the superscript (either the already encoded
ones, or using external styles like in HTML/CSS, or in LaTeX which uses the
notation "x^2", both as a style, but also with the intended semantic of an
exponent and certainly not the intended semantic of an abbreviation)
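
As a rough illustration only: no COMBINING ABBREVIATION MARK is encoded, so
the Python sketch below borrows U+E000 (Private Use Area) as a stand-in,
treats the preceding base cluster as a single character for simplicity, and
assumes a font-provided mapping table. It shows nothing more than the two
proposed rendering variants.

# Hypothetical sketch: U+E000 (PUA) stands in for the proposed
# COMBINING ABBREVIATION MARK, which is NOT an encoded character.
CAM = '\uE000'
# Assumed font-provided mapping, reusing existing superscript letters.
SUPERSCRIPT_FORMS = {'r': '\u02B3', 'o': '\u00BA'}

def render(text, variant):
    out = []
    for ch in text:
        if ch != CAM:
            out.append(ch)
        elif variant == 'superscript' and out and out[-1] in SUPERSCRIPT_FORMS:
            out[-1] = SUPERSCRIPT_FORMS[out[-1]]  # variant 1: superscript base
        else:
            out.append('.')                       # variant 2: abbreviation dot
    return ''.join(out)

mr = 'Mr' + CAM  # <M, r, combining abbreviation mark>
print(render(mr, 'superscript'))  # "Mʳ"
print(render(mr, 'dot'))          # "Mr."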



On Sun, Nov 4, 2018 at 09:34, Marcel Schneider via Unicode <
unicode@unicode.org> wrote:

> On 03/11/2018 23:50, James Kass via Unicode wrote:
> >
> > When the topic being discussed no longer matches the thread title,
> > somebody should start a new thread with an appropriate thread title.
> >
>
> Yes, that is what the OP also called for, but my last reply, though
> taking me some time to write, was sent without checking for new mail,
> so unfortunately it didn’t acknowledge that. So let’s start this new
> thread to account for Philippe Verdy’s proposal to encode a new format
> control.
>
> But all that I can add so far, prior to probably stepping out of this
> discussion, is that the industry does not seem to be interested in this
> initiative. Why do I think so? As already discussed on this List, even
> the long-existing FRACTION SLASH U+2044 has not been implemented by
> major vendors, except that HarfBuzz does implement it and makes its
> specified behavior available in environments using HarfBuzz, among
> which some major vendors’ products are actually available with
> HarfBuzz support.
>
> As a result, the Polish abbreviation of Magister as found on the
> postcard, and all other abbreviations using superscript that have
> been put in parallel in the parent thread, cannot be reliably
> encoded without using preformatted superscript, so far as the goal
> is a plain-text backbone with the benefit of reliable rendering
> support, rather than a semantics-centered coding that may be easier
> to parse by special applications but lacks wider industry support.
>
> If, nevertheless, <combining abbreviation mark> is encoded and then
> gains traction, or rather reversely: if it gains traction and is then
> encoded (I don’t know which way around to put it, given U+2044 has
> been encoded but still cannot be called widely implemented), I would
> surely add it on keyboard layouts if I am still maintaining any in
> that era.
>
> Best regards,
>
> Marcel
>


Re: UCA unnecessary collation weight 0000

2018-11-04 Thread Mark Davis ☕️ via Unicode
Philippe, I agree that we could have structured the UCA differently. It
does make sense, for example, to have the weights be simply decimal values
instead of integers. But nobody is going to go through the substantial work
of restructuring the UCA spec and data file unless there is a very strong
reason to do so. It takes far more time and effort than people realize to
change the algorithm/data while making sure that everything lines up
without inadvertent changes being introduced.

It is just not worth the effort. There are so, so many things we can do in
Unicode (encoding, properties, algorithms, CLDR, ICU) that have a higher
benefit.

You can continue flogging this horse all you want, but I'm muting this
thread (and I suspect I'm not the only one).

Mark


On Sun, Nov 4, 2018 at 2:37 AM Philippe Verdy via Unicode <
unicode@unicode.org> wrote:

> On Fri, Nov 2, 2018 at 22:27, Ken Whistler wrote:
>
>>
>> On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote:
>>
>> I was replying not about the notational representation of the DUCET data
>> table (using [....] unnecessarily) but about the text of UTR#10 itself.
>> Which remains highly confusing, and contains completely unnecessary steps,
>> and just complicates things with absolutely no benefit at all by
>> introducing confusion about these "0000".
>>
>> Sorry, Philippe, but the confusion that I am seeing introduced is what
>> you are introducing to the unicode list in the course of this discussion.
>>
>>
>> UTR#10 still does not explicitly state that its use of "0000" does not
>> mean it is a valid "weight", it's a notation only
>>
>> No, it is explicitly a valid weight. And it is explicitly and normatively
>> referred to in the specification of the algorithm. See UTS10-D8 (and
>> subsequent definitions), which explicitly depend on a definition of "A
>> collation weight whose value is zero." The entire statement of what are
>> primary, secondary, tertiary, etc. collation elements depends on that
>> definition. And see the tables in Section 3.2, which also depend on those
>> definitions.
>>
>> (but the notation is used for TWO distinct purposes: one is for
>> presenting the notation format used in the DUCET
>>
>> It is *not* just a notation format used in the DUCET -- it is part of the
>> normative definitional structure of the algorithm, which then percolates
>> down into further definitions and rules and the steps of the algorithm.
>>
>
> I insist that this is NOT NEEDED at all for the definition; it is
> absolutely NOT structural. The algorithm still guarantees the SAME result.
>
> It is ONLY used to explain the format of the DUCET and the fact that this
> format does NOT use 0000 as a valid weight, and so can use it as a notation
> (in fact only a presentational feature).
>
>
>> itself to present how collation elements are structured, the other one is
>> for marking the presence of a possible, but not always required, encoding
>> of an explicit level separator for encoding sort keys).
>>
>> That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It
>> is not part of the *notation* for collation elements, but instead is a
>> magic value chosen for the level separator precisely because zero values
>> from the collation elements are removed during sort key construction, so
>> that zero is then guaranteed to be a lower value than any remaining weight
>> added to the sort key under construction. This part of the algorithm is not
>> rocket science, by the way!
>>
>
> Here again you are conflating two things: a sort key MAY use them as
> separators if it wants to compress keys by reencoding weights per level:
> that's the only case where you may want to introduce an encoding pattern
> starting with 0, while the rest of the encoding for weights in that level
> must use patterns not starting with this 0 (the number of bits used to
> encode this 0 does not matter: it is only part of the encoding used on
> this level, which does not necessarily have to use 16-bit code units per
> weight).
>
>>
>> Even the example tables can be made without using these "0000" (for
>> example in tables showing how to build sort keys, one can present the list
>> of weights split into separate columns, one column per level, without any
>> "0000"). The implementation does not necessarily have to create a buffer
>> containing all weight values in a row, when separate buffers for each
>> level are far superior (and even more efficient, as this can save space
>> in memory).
>>
>> The UCA doesn't *require* you to do anything particular in your own
>> implementation, other than come up with the same results for string
>> comparisons.
>>
> Yes I know, but the algorithm also does not require me to use these
> invalid 0000 pseudo-weights, which the algorithm itself will always
> discard (in a completely needless step)!
>
>
>> That is clearly stated in the conformance clause of UTS #10.
>>
>> https://www.unicode.org/reports/tr10/tr10-39.html#Basic_Conformance
>>
>> The step "S3.2" in the UCA 

Encoding (was: Re: A sign/abbreviation for "magister")

2018-11-04 Thread Marcel Schneider via Unicode

On 03/11/2018 23:50, James Kass via Unicode wrote:


When the topic being discussed no longer matches the thread title,
somebody should start a new thread with an appropriate thread title.



Yes, that is what the OP also called for, but my last reply, though
taking me some time to write, was sent without checking for new mail,
so unfortunately it didn’t acknowledge that. So let’s start this new
thread to account for Philippe Verdy’s proposal to encode a new format
control.

But all that I can add so far, prior to probably stepping out of this
discussion, is that the industry does not seem to be interested in this
initiative. Why do I think so? As already discussed on this List, even
the long-existing FRACTION SLASH U+2044 has not been implemented by
major vendors, except that HarfBuzz does implement it and makes its
specified behavior available in environments using HarfBuzz, among
which some major vendors’ products are actually available with
HarfBuzz support.
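
For reference, the specified U+2044 behavior is the one NFKC itself points
at; a quick Python check: precomposed vulgar fractions decompose to digit
sequences joined by FRACTION SLASH, which a supporting engine such as
HarfBuzz may then display as a built-up fraction.

import unicodedata

# U+00BD VULGAR FRACTION ONE HALF decomposes to 1 + U+2044 + 2 under NFKC.
half = unicodedata.normalize('NFKC', '\u00BD')
assert half == '1\u20442'
print([hex(ord(c)) for c in half])  # ['0x31', '0x2044', '0x32']

# Any digit run around U+2044 is specified to render as a fraction, but as
# noted above, few engines outside HarfBuzz actually do so.
print('3\u204416')  # "3⁄16" as a built-up fraction only with renderer support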

As a result, the Polish abbreviation of Magister as found on the
postcard, and all other abbreviations using superscript that have
been put in parallel in the parent thread, cannot be reliably
encoded without using preformatted superscript, so far as the goal
is a plain-text backbone with the benefit of reliable rendering
support, rather than a semantics-centered coding that may be easier
to parse by special applications but lacks wider industry support.

If, nevertheless, <combining abbreviation mark> is encoded and then
gains traction, or rather reversely: if it gains traction and is then
encoded (I don’t know which way around to put it, given U+2044 has
been encoded but still cannot be called widely implemented), I would
surely add it on keyboard layouts if I am still maintaining any in
that era.

Best regards,

Marcel