Arranging Hieroglyphics (was: A sign/abbreviation for "magister")
On Sat, 3 Nov 2018 22:55:17 +0100 Philippe Verdy via Unicode wrote: > I can also cite the case of Egyptian hieroglyphs: there's still no > way to render them correctly because we lack the development of a > stable orthography that would drive the encoding of the missing > **semantic** characters (for this reason Egyptian hieroglyphs still > require an upper-layer protocol, as there's still no accepted > orthographic norm that successfully represents all possible semantic > variations, but also because the research on old Egyptian > hieroglyphs is still very incomplete). If you study the document register, you'll find that layout control characters are being added. I think semantic characters would have depended on the font to select the rendering consequences; this will now not happen. What we're getting is a more rigorous version of the Manuel de Codage. Richard.
Re: UCA unnecessary collation weight 0000
So you finally admit that I was right... and that the specs include requirements that are not even needed to make UCA work, and that are not even used by well-known implementations. These are old artefacts which are now really confusing (instructing programmers to adopt the old deprecated behavior, before realizing that this was bad advice which just complicated their task). UCA can be implemented **conformingly** without these, even for the simplest implementations (where using complex packages like ICU is not an option, and rewriting it is not one either for much simpler goals), where these incorrect requirements in fact suggest being more inefficient than really needed. It would not take a lot of work to edit and fix the specs without these polluting "pseudo-weights". On Sun, Nov 4, 2018 at 09:27, Mark Davis ☕️ wrote: > Philippe, I agree that we could have structured the UCA differently. It > does make sense, for example, to have the weights be simply decimal values > instead of integers. But nobody is going to go through the substantial > work of restructuring the UCA spec and data file unless there is a very > strong reason to do so. It takes far more time and effort than people > realize to make changes in the algorithm/data while making sure that everything > lines up without inadvertent changes being introduced. > > It is just not worth the effort. There are so, so, many things we can do > in Unicode (encoding, properties, algorithms, CLDR, ICU) that have a higher > benefit. > > You can continue flogging this horse all you want, but I'm muting this > thread (and I suspect I'm not the only one). > > Mark > > > On Sun, Nov 4, 2018 at 2:37 AM Philippe Verdy via Unicode < > unicode@unicode.org> wrote: > >> On Fri, Nov 2, 2018 at 22:27, Ken Whistler wrote: >> >>> >>> On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote: >>> >>> I was replying not about the notational representation of the DUCET data >>> table (using [....] 
unnecessarily) but about the text of UTR#10 itself, >>> which remains highly confusing, and contains completely unnecessary steps, >>> and just complicates things with absolutely no benefit at all by >>> introducing confusion about these "0000". >>> >>> Sorry, Philippe, but the confusion that I am seeing introduced is what >>> you are introducing to the unicode list in the course of this discussion. >>> >>> >>> UTR#10 still does not explicitly state that its use of "0000" does not >>> mean it is a valid "weight", it's a notation only >>> >>> No, it is explicitly a valid weight. And it is explicitly and >>> normatively referred to in the specification of the algorithm. See UTS10-D8 >>> (and subsequent definitions), which explicitly depend on a definition of "A >>> collation weight whose value is zero." The entire statement of what are >>> primary, secondary, tertiary, etc. collation elements depends on that >>> definition. And see the tables in Section 3.2, which also depend on those >>> definitions. >>> >>> (but the notation is used for TWO distinct purposes: one is for >>> presenting the notation format used in the DUCET >>> >>> It is *not* just a notation format used in the DUCET -- it is part of >>> the normative definitional structure of the algorithm, which then >>> percolates down into further definitions and rules and the steps of the >>> algorithm. >>> >> >> I insist that this is NOT NEEDED at all for the definition; it is >> absolutely NOT structural. The algorithm still guarantees the SAME result. >> >> It is ONLY used to explain the format of the DUCET and the fact that this >> format does NOT use 0000 as a valid weight, and so can use it as a notation >> (in fact only a presentational feature). >> >> >>> itself to present how collation elements are structured, the other one >>> is for marking the presence of a possible, but not always required, >>> encoding of an explicit level separator for encoding sort keys). 
>>> >>> That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It >>> is not part of the *notation* for collation elements, but instead is a >>> magic value chosen for the level separator precisely because zero values >>> from the collation elements are removed during sort key construction, so >>> that zero is then guaranteed to be a lower value than any remaining weight >>> added to the sort key under construction. This part of the algorithm is not >>> rocket science, by the way! >>> >> >> Here again you are confusing things: a sort key MAY use them as separators if >> it wants to compress keys by re-encoding weights per level: that's the only >> case where you may want to introduce an encoding pattern starting with 0, >> while the rest of the encoding for weights in that level must use >> patterns not starting with this 0 (the number of bits used to encode this 0 does >> not matter: it is only part of the encoding used on this level, which does >> not necessarily have to use 16-bit code units per weight). >> >>> >>> Even the example tables can be
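The sort-key step being argued over (UTS #10 Section 7.3, Form Sort Keys) can be sketched in a few lines. This is a toy illustration with invented collation weights, not the DUCET or any ICU code; it shows why 0x0000 can serve as a level separator: zero weights are dropped during key construction, so any zero remaining in the key can only be a separator, and it sorts below every real weight.

```python
# Toy sketch of UTS #10 Form Sort Keys, assuming collation elements are
# already looked up as (primary, secondary, tertiary) triples. Zero
# weights are omitted (step S3.2), which is what frees 0x0000 to act as
# an unambiguous level separator.

def form_sort_key(collation_elements, levels=3, separator=True):
    key = []
    for level in range(levels):
        if separator and level > 0:
            key.append(0x0000)  # separator: lower than any real weight
        for ce in collation_elements:
            w = ce[level]
            if w != 0:          # omit zero weights
                key.append(w)
    return key

# Invented weights: second element has a zero primary (it is skipped at
# the primary level but still contributes at the secondary level).
ces = [(0x1C47, 0x0020, 0x0002), (0x0000, 0x0035, 0x0002)]
key = form_sort_key(ces)
```

Two strings then compare by plain lexicographic comparison of their keys; whether the separator is actually emitted is an implementation choice, which is the crux of the disagreement above.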
Re: Encoding
I can take another example of what I call "legacy encoding" (which really means that such encoding is just an "approximation" from which no semantic can be clearly inferred, except by using a non-deterministic heuristic, which can frequently make "false guesses"). Consider the case of the legacy Hangul "half-width" jamos: they were kept in Unicode (as compatibility characters) but are not recommended for encoding natural Korean text, because their semantic is not clear when they are used in sequences: it's impossible to know clearly where semantically significant syllable breaks occur, because they don't distinguish the "leading" and "trailing" consonants, and so it is not even possible to clearly infer that any Hangul "half-width" vowel jamo is logically attached to the same syllable as the "half-width" consonant (or consonant+vowel) jamo that is encoded just before it. As a consequence, you cannot safely convert Korean texts using these "half-width" jamos into normal jamos: a heuristic can only attempt to determine the syllable breaks and then infer the "leading" or "trailing" semantic of consonants. This last semantic ("leading" or "trailing") is exactly like a letter-case distinction in Latin, so it can be said that the Korean alphabet is bicameral for consonants, but only monocameral for vowels, where each Hangul syllable normally starts with an "uppercase-like" consonant, or with a consonant filler which is also "uppercase-like", and all other consonants and all vowels are "lowercase-like": the heuristic that transforms the legacy "half-width" jamos into normal jamos does just the same thing as the heuristic used in Latin that attempts to capitalize some leading letters in words: it works frequently, but it also fails, and that heuristic is lossy in Latin, just like it is lossy in Korean! 
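The syllable-break heuristic described above can be sketched with a deliberately simplified alphabet ('C' for any half-width consonant jamo, 'V' for any vowel); real code-point mappings are omitted, and the decision rule (a consonant is "leading" exactly when a vowel follows it) is one plausible heuristic, not a specification:

```python
# Simplified sketch of the lossy leading/trailing inference the text
# describes. 'C' stands for any half-width consonant, 'V' for any
# vowel; real Hangul code points and clusters are deliberately omitted.

def guess_roles(jamos):
    roles = []
    for i, j in enumerate(jamos):
        if j == 'V':
            roles.append('vowel')
        else:
            # Guess: a consonant is leading iff a vowel follows it.
            nxt = jamos[i + 1] if i + 1 < len(jamos) else None
            roles.append('leading' if nxt == 'V' else 'trailing')
    return roles

# For C V C C V the heuristic guesses one syllable break between the
# two middle consonants, but other segmentations were possible in the
# original text, which is exactly the lossiness being described.
print(guess_roles(['C', 'V', 'C', 'C', 'V']))
```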
The same can be said about the heuristics that attempt to infer an abbreviation semantic from existing superscript letters (either encoded in Unicode, or encoded as plain letters modified by superscripting style in CSS or HTML, or in word processors for example): they fail to give the correct guess most of the time if there's no user to confirm the actual intended meaning. Such confirmation is the job of spell correctors in word processors: they must clearly inform the user and let them decide; all that spell checkers can do is provide visual hints to the user editing the document, such as the common wavy underline in red, that several interpretations are possible, or that this is not the preferred encoding to use to convey the correct semantic. A spell checker may be instructed to do the conversion automatically, while typing text, but there must be a way for the user to cancel this transform and make his own decision about the real meaning, even if canceling the automatic transform causes the "wavy red underline" to appear; the user may type "Mr.", then the wavy line will appear under these 3 characters, and the spell checker will propose to encode it as an abbreviation "Mr" or leave "Mr." unchanged (and no longer signaled), in which case the dot remains a regular punctuation and the "r" is not modified. Then the user may choose to style the "r" with superscripting or underlining, and a new wavy red underline will appear below the three characters, proposing to transform the styled "r" into one of the proposed encodings; even when the user accepts one of these suggestions, the result remains a form where it is still possible to infer the semantics of an abbreviation (propose to replace or keep the dot after it). Or the user can do nothing and cancel these suggestions (to hide the wavy red underline hint added by the spell checker), or instruct the spell checker that the meaning of the superscript r is that of a mathematical exponent, or a chemical notation. 
In all cases, the user/author has full control of the intended meaning of his text, and an informed decision is made where all cases are now distinguished. "Legacy" encoding can be kept as is (in Unicode), even if it's no longer recommended, just like Unicode has documented that half-width Hangul is deprecated (it just offers a "compatibility decomposition" for NFKD or NFKC, but this is lossy and cannot be done automatically without a human decision). And the user/author can now freely and easily compose any abbreviation he wishes in natural languages, without being limited by the reduced "legacy" set of superscript letters encoded in Unicode (which should no longer be extended, except for use as distinct plain letters needed in alphabets of actual natural languages, or as possibly new IPA symbols), and without using the styling tricks (of HTML/CSS, or of word-processor documents, spreadsheets, and presentation documents allowing "rich text" formats on top of "plain text") which are better suited to "free styling" of any human text without any additional semantics (or as a legacy but insufficient trick for maths and chemical notations). On Sun, Nov 4, 2018 at 20:51, Philippe Verdy wrote: > Note
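The lossy compatibility decomposition mentioned above can be observed directly with Python's standard unicodedata module: the normalization table fixes one conjoining reading for each half-width jamo regardless of the context in which it appeared, which is exactly why the conversion cannot be fully automatic.

```python
# NFKD maps a half-width jamo through its compatibility decomposition.
# The mapping is a fixed table entry: it cannot know whether, in the
# original running text, this consonant was a syllable-leading or a
# syllable-trailing one.
import unicodedata

hw = '\uFFA1'  # HALFWIDTH HANGUL LETTER KIYEOK
print(unicodedata.normalize('NFKD', hw))
```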
Re: Encoding (was: Re: A sign/abbreviation for "magister")
Sorry, I didn’t truncate the subject line, it was my mail client. On 04/11/2018 17:45, Philippe Verdy wrote: Note that I actually propose not just one rendering for the proposed combining abbreviation mark but two possible variants (that would be equally valid, without preference). Use it after any base cluster (including with diacritics if needed, like combining underlines). - the first one can be to render the previous cluster as superscript (very easy to implement synthetically by any text renderer) - the second one can be to render it as an abbreviation dot (also very easy to do) Fonts can provide their own mapping (e.g. to offer alternate glyph forms or kerning for the superscript; they can also reuse the letter forms used for other existing and encoded superscript letters, or position the abbreviation dot with negative kerning, for example after a T), in which case the renderer does not have to synthesize the rendering for the combining sequence not mapped in the font. Allowing this variation from the start will: - allow renderers to support it fast (so a rapid adoption for encoding texts in human languages, instead of the few legacy superscript letters); - allow font designers to develop and provide reasonable mappings if needed (to adjust the position or size of the superscript) in updated fonts (no requirement for them to add new glyphs if it's just to map the same glyphs used by existing superscript letters); - also prohibit the abuse of this mark for every text that one would want to write in superscript (these cases can still use the few existing superscript letters/digits/signs that are already encoded), so it is not suitable for example for marking mathematical exponents (e.g. 
"x²", if it's encoded as "x", "2" plus the combining abbreviation mark, it could validly be rendered as "x2."): exponents must use the superscript (either the already encoded ones, or external styles like in HTML/CSS, or in LaTeX, which uses the notation "x^2", both as a style but also with the intended semantic of an exponent and certainly not the intended semantic of an abbreviation) Unicode always (or in principle) aims at polyvalence, making characters reusable and repurposable, while the combining abbreviation mark does not solve the problems around making chemicals better represented in plain text, as seen in the parent thread, for example. I don’t advocate this use case, as I’m only lobbying for natural languages’ support as specified in the Standard,* but it shouldn’t be forgotten given there is some point in not disfavoring chemistry compared to mathematics, which is already widely favored over chemistry when looking at the symbol blocks, while chemistry is denied three characters because they are subscript forms of already encoded letters. Beyond that, the problem with *COMBINING ABBREVIATION MARK is that it needs OpenType support to work, while direct encoding of preformatted superscripts and their use as abbreviation indicators for an interoperable digital representation of natural languages does not. Best regards, Marcel * As already repeatedly stated, I’m relying on the one bit where TUS states that all natural languages shall be given a semantically unambiguous (i.e. not introducing new ambiguity) and interoperable digital representation.
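The two fallback renderings proposed for the (hypothetical, not yet encoded) combining abbreviation mark can be sketched with a toy renderer. Everything here is invented for illustration: the placeholder ABBR is a private-use stand-in for the unencoded mark, and the tiny superscript mapping table is ad hoc; a real implementation would live in the shaping engine or font, as Marcel notes.

```python
# Toy sketch of the two proposed fallback renderings for the
# hypothetical combining abbreviation mark: either superscript the
# preceding base character, or append an abbreviation dot.

ABBR = '\uE000'  # private-use stand-in for the unencoded mark

# Ad hoc mapping reusing already-encoded modifier letters, as the
# proposal suggests fonts could do.
SUPERSCRIPTS = {'r': '\u02B3', 'o': '\u1D52'}  # ʳ, ᵒ

def render(text, variant='superscript'):
    out = []
    for ch in text:
        if ch == ABBR:
            if variant == 'superscript' and out and out[-1] in SUPERSCRIPTS:
                out[-1] = SUPERSCRIPTS[out[-1]]  # variant 1: superscript
            else:
                out.append('.')                  # variant 2: dot
        else:
            out.append(ch)
    return ''.join(out)

# "Mr" + mark: variant 1 gives a superscripted r, variant 2 "Mr.";
# both are readable as the same abbreviation, which is the point of
# allowing the variation.
```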
Re: Encoding (was: Re: A sign/abbreviation for "magister")
Note that I actually propose not just one rendering for the proposed combining abbreviation mark but two possible variants (that would be equally valid, without preference). Use it after any base cluster (including with diacritics if needed, like combining underlines). - the first one can be to render the previous cluster as superscript (very easy to implement synthetically by any text renderer) - the second one can be to render it as an abbreviation dot (also very easy to do) Fonts can provide their own mapping (e.g. to offer alternate glyph forms or kerning for the superscript; they can also reuse the letter forms used for other existing and encoded superscript letters, or position the abbreviation dot with negative kerning, for example after a T), in which case the renderer does not have to synthesize the rendering for the combining sequence not mapped in the font. Allowing this variation from the start will: - allow renderers to support it fast (so a rapid adoption for encoding texts in human languages, instead of the few legacy superscript letters); - allow font designers to develop and provide reasonable mappings if needed (to adjust the position or size of the superscript) in updated fonts (no requirement for them to add new glyphs if it's just to map the same glyphs used by existing superscript letters); - also prohibit the abuse of this mark for every text that one would want to write in superscript (these cases can still use the few existing superscript letters/digits/signs that are already encoded), so it is not suitable for example for marking mathematical exponents (e.g. "x²", if it's encoded as "x", "2" plus the combining abbreviation mark, it could validly be rendered as "x2."): exponents must use the superscript (either the already encoded ones, or external styles like in HTML/CSS, or in LaTeX, which uses the notation "x^2", both as a style but also with the intended semantic of an exponent and certainly not the intended semantic of an abbreviation) On Sun, Nov 4, 2018 at 09:34, Marcel Schneider via Unicode < unicode@unicode.org> wrote: > On 03/11/2018 23:50, James Kass via Unicode wrote: > > > > When the topic being discussed no longer matches the thread title, > > somebody should start a new thread with an appropriate thread title. > > > > Yes, that is what the OP also called for, but my last reply, though > taking me some time to write, was sent without checking the new mail, > so unfortunately it didn’t acknowledge that. So let’s start this new thread > to account for Philippe Verdy’s proposal to encode a new format control. > > But all that I can add so far, prior to probably stepping out of this > discussion, is that the industry does not seem to be interested in this > initiative. Why do I think so? As already discussed on this List, even > the long-existing FRACTION SLASH U+2044 has not been implemented by > major vendors, except that HarfBuzz does implement it and makes its > specified behavior available in environments using HarfBuzz, among > which some major vendors’ products are actually available with > HarfBuzz support. > > As a result, the Polish abbreviation of Magister as found on the > postcard, and all other abbreviations using superscript that have > been put in parallel in the parent thread, cannot be reliably > encoded without using preformatted superscript, so far as the goal > is a plain-text backbone benefiting from reliable rendering > support, rather than a semantic-centered coding that may be easier > to parse by special applications but lacks wider industrial support. > > If nevertheless the combining abbreviation mark is encoded and > gains traction, or rather reversely: if it gains traction and gets > encoded (I don’t know which way around to put it, given U+2044 has > been encoded but one still cannot call it widely > implemented), I would surely add it on keyboard layouts if I am > still maintaining any in that era. > > Best regards, > > Marcel >
Re: UCA unnecessary collation weight 0000
Philippe, I agree that we could have structured the UCA differently. It does make sense, for example, to have the weights be simply decimal values instead of integers. But nobody is going to go through the substantial work of restructuring the UCA spec and data file unless there is a very strong reason to do so. It takes far more time and effort than people realize to make changes in the algorithm/data while making sure that everything lines up without inadvertent changes being introduced. It is just not worth the effort. There are so, so, many things we can do in Unicode (encoding, properties, algorithms, CLDR, ICU) that have a higher benefit. You can continue flogging this horse all you want, but I'm muting this thread (and I suspect I'm not the only one). Mark On Sun, Nov 4, 2018 at 2:37 AM Philippe Verdy via Unicode < unicode@unicode.org> wrote: > On Fri, Nov 2, 2018 at 22:27, Ken Whistler wrote: > >> >> On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote: >> >> I was replying not about the notational representation of the DUCET data >> table (using [....] unnecessarily) but about the text of UTR#10 itself, >> which remains highly confusing, and contains completely unnecessary steps, >> and just complicates things with absolutely no benefit at all by >> introducing confusion about these "0000". >> >> Sorry, Philippe, but the confusion that I am seeing introduced is what >> you are introducing to the unicode list in the course of this discussion. >> >> >> UTR#10 still does not explicitly state that its use of "0000" does not >> mean it is a valid "weight", it's a notation only >> >> No, it is explicitly a valid weight. And it is explicitly and normatively >> referred to in the specification of the algorithm. See UTS10-D8 (and >> subsequent definitions), which explicitly depend on a definition of "A >> collation weight whose value is zero." The entire statement of what are >> primary, secondary, tertiary, etc. collation elements depends on that >> definition. 
And see the tables in Section 3.2, which also depend on those >> definitions. >> >> (but the notation is used for TWO distinct purposes: one is for >> presenting the notation format used in the DUCET >> >> It is *not* just a notation format used in the DUCET -- it is part of the >> normative definitional structure of the algorithm, which then percolates >> down into further definitions and rules and the steps of the algorithm. >> > > I insist that this is NOT NEEDED at all for the definition; it is > absolutely NOT structural. The algorithm still guarantees the SAME result. > > It is ONLY used to explain the format of the DUCET and the fact that this > format does NOT use 0000 as a valid weight, and so can use it as a notation > (in fact only a presentational feature). > > >> itself to present how collation elements are structured, the other one is >> for marking the presence of a possible, but not always required, encoding >> of an explicit level separator for encoding sort keys). >> >> That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It >> is not part of the *notation* for collation elements, but instead is a >> magic value chosen for the level separator precisely because zero values >> from the collation elements are removed during sort key construction, so >> that zero is then guaranteed to be a lower value than any remaining weight >> added to the sort key under construction. This part of the algorithm is not >> rocket science, by the way! >> > > Here again you are confusing things: a sort key MAY use them as separators if > it wants to compress keys by re-encoding weights per level: that's the only > case where you may want to introduce an encoding pattern starting with 0, > while the rest of the encoding for weights in that level must use > patterns not starting with this 0 (the number of bits used to encode this 0 does > not matter: it is only part of the encoding used on this level, which does > not necessarily have to use 16-bit code units per weight). 
> >> >> Even the example tables can be made without using these "0000" (for >> example in tables showing how to build sort keys, they can present the list >> of weights split into separate columns, one column per level, without any >> "0000"). The implementation does not necessarily have to create a buffer >> containing all weight values in a row, when separate buffers for each level >> are far superior (and even more efficient, as this can save space in memory). >> >> The UCA doesn't *require* you to do anything particular in your own >> implementation, other than come up with the same results for string >> comparisons. >> > Yes, I know, but the algorithm also does not require me to use these > invalid pseudo-weights, which the algorithm itself will always discard > (in a completely needless step)! > > >> That is clearly stated in the conformance clause of UTS #10. >> >> https://www.unicode.org/reports/tr10/tr10-39.html#Basic_Conformance >> >> The step "S3.2" in the UCA
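The per-level-buffer alternative argued for above can also be sketched: comparing weights level by level never materializes a sort key, so no in-band 0000 separator is ever needed, yet the ordering matches separator-delimited keys (conformance only requires identical comparison results). The weights below are invented for illustration.

```python
# Toy sketch of level-by-level comparison with no materialized sort
# key and no 0000 separator. Zero weights are dropped per level, as in
# sort-key formation; collation elements are (primary, secondary,
# tertiary) triples with invented values.

def compare(ces_a, ces_b, levels=3):
    def level_weights(ces, lvl):
        return [ce[lvl] for ce in ces if ce[lvl] != 0]
    for lvl in range(levels):
        a, b = level_weights(ces_a, lvl), level_weights(ces_b, lvl)
        if a != b:
            # Python list comparison is lexicographic, matching
            # weight-by-weight comparison of the per-level buffers.
            return -1 if a < b else 1
    return 0  # equal at every level

# Same primaries, differing secondaries: decided at level 1.
x = [(0x1C47, 0x0020, 0x0002)]
y = [(0x1C47, 0x0035, 0x0002)]
```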
Encoding (was: Re: A sign/abbreviation for "magister")
On 03/11/2018 23:50, James Kass via Unicode wrote: When the topic being discussed no longer matches the thread title, somebody should start a new thread with an appropriate thread title. Yes, that is what the OP also called for, but my last reply, though taking me some time to write, was sent without checking the new mail, so unfortunately it didn’t acknowledge that. So let’s start this new thread to account for Philippe Verdy’s proposal to encode a new format control. But all that I can add so far, prior to probably stepping out of this discussion, is that the industry does not seem to be interested in this initiative. Why do I think so? As already discussed on this List, even the long-existing FRACTION SLASH U+2044 has not been implemented by major vendors, except that HarfBuzz does implement it and makes its specified behavior available in environments using HarfBuzz, among which some major vendors’ products are actually available with HarfBuzz support. As a result, the Polish abbreviation of Magister as found on the postcard, and all other abbreviations using superscript that have been put in parallel in the parent thread, cannot be reliably encoded without using preformatted superscript, so far as the goal is a plain-text backbone benefiting from reliable rendering support, rather than a semantic-centered coding that may be easier to parse by special applications but lacks wider industrial support. If nevertheless the combining abbreviation mark is encoded and gains traction, or rather reversely: if it gains traction and gets encoded (I don’t know which way around to put it, given U+2044 has been encoded but one still cannot call it widely implemented), I would surely add it on keyboard layouts if I am still maintaining any in that era. Best regards, Marcel