Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
At 04:22 -0500 2003-06-27, [EMAIL PROTECTED] wrote: I just have a hard time believing that 50 years from now our grandchildren won't look back, What were they thinking? So it took them a couple of years to figure out canonical ordering and normalization; why on earth didn't they work that out first before setting things in stone, rather than saddling us with this hodgepodge of ad hoc workarounds? How short sighted. As Rick said, I know this will get shot down; don't bother telling me so. I agree with you, Peter. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
At 04:22 -0500 2003-06-27, [EMAIL PROTECTED] wrote: Are we saying that ISO doesn't give a rip for implementation issues? Duplication of characters is not the way to fix (forgive me, UTC) *Unicode's* error in combining characters. Or that their notion of ordering distinctions is different from Unicode's such that *any* differently ordering permutation of some given set of characters is considered a distinct representation? Are we saying that the voting members of WG2 are not already aware of the issue that has been discussed and incapable of understanding an explanation of these issues addressed to them? You might submit your paper to WG2. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
At 04:53 -0500 2003-06-27, [EMAIL PROTECTED] wrote: If they're so unaware of combining classes, might it not seem reasonable to think the dialog might continue as follows? - [gives explanation of combining classes and the related problem for Hebrew] ISO: So, you're saying you're coming to us asking for duplicates of existing characters because of an error the Unicode Consortium made with some of those character properties they define? - Well, yes, that's basically it. ISO: Then, obviously they need to correct their errors. I mean, it's not like the wrong characters got encoded or something. Tell them to just fix the errors; that can't be difficult to do, and is obviously the right thing to do. This is exactly my view. Who is it who will kill the Unicode Consortium if UAX #15 were to be revised? Did it occur to anyone to *ask* about the possible revision of classes for the dozen or so instances that would be affected? -- Michael Everson * * Everson Typography * * http://www.evertype.com
[cowan: Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)]
Michael Everson scripsit: Who is it who will kill the Unicode Consortium if UAX #15 were to be revised? Did it occur to anyone to *ask* about the possible revision of classes for the dozen or so instances that would be affected? The IETF, for one. IETF is already very wary of Unicode, even though they recognize the practical necessity of using it, but with the existing stability guarantees about normalization, they have managed to swallow it. Stability *even if wrong* is really, really important to protocol people -- just think of all the nonfunctional stubs in the world of *diplomatic* protocol, maintained in the name of not changing anything. The W3C would also hit the roof if Unicode normalization changed radically. Neither party is at all happy with even the four (I think) characters that have already changed, and are already beginning to turn into optimistic pessimists (people who smile brightly, nod their heads, and say happily, See, things are every bit as bad as I predicted!). Since the use of non-ASCII characters in things like XML and the DNS depends on the good will of these folks, it is very very dangerous to alienate them, and *they do not care* whether the case is a corner case or not -- _stare decisis_ is everything to them, the actual details little or nothing. Change the character classes in Unicode 4.1, and they *might* decide to freeze support at, say, Unicode 3.0. -- John Cowan [EMAIL PROTECTED] I am a member of a civilization. --David Brin
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
On Fri, 27 Jun 2003 04:22:30 -0500, [EMAIL PROTECTED] wrote: I just have a hard time believing that 50 years from now our grandchildren won't look back, What were they thinking? So it took them a couple of years to figure out canonical ordering and normalization; why on earth didn't they work that out first before setting things in stone, rather than saddling us with this hodgepodge of ad hoc workarounds? How short sighted. As Rick said, I know this will get shot down; don't bother telling me so. I have to agree 100% with Peter on this. The potential fiasco with regards to Mongolian Free Variation Selectors is another area where our grandchildren are going to be weeping with despair if we are not careful. The standardized variants for Mongolian were set in stone by Unicode based on an unfortunate but understandable misunderstanding of the infamous TR170, and now that it is apparent from Chinese and Mongolian sources that Unicode had got hold of completely the wrong end of the stick (the defined standardized variants are actually intended for use in isolation only, and the same MFVS that selects one variant form in isolation may be used to select a completely different variant within running text ... which of course it can't according to the Standardized Variants document), instead of just wiping the slate clean and redefining a new and consistant set of standardized variants that correspond to actual usage within China and Mongolia, Unicode is determined to preserve the original erroneous standardised variants come hell or high water - even though no-one has ever seriously used them yet (well, the Chinese and Mongolians will go ahead and do it their way whatever Unicode decides). And before Peter suggests it, I have already suggested elsewhere that if Unicode can't fix past errors, the only course might be for Unicode to deprecate the MFVSs, and start again from scratch - didn't go down too well! Andrew
Re: [cowan: Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)]
On Friday, June 27, 2003 1:29 PM, John Cowan [EMAIL PROTECTED] wrote: Michael Everson scripsit: Change the character classes in Unicode 4.1, and they *might* decide to freeze support at, say, Unicode 3.0. Or they may simply opt to define their *OWN* normalization standard, distinct from the Unicode NF* forms, and designated in a separate reference document, removing *all* references to UAX#15 from XML and IDNA references, only to guarantee the stability that Unicode would be unable to offer. Let's not let this happen! The IDNA protocol authors already made a lot of concessions to Unicode, but they may simply abandon the intent to support the idea of Unicode to normalize old scripts that they clearly don't need. This would mean that modern scripts that are still not encoded would not fit before long within XML or IDNA frameworks... And this would be dramatic for those languages (and very frustrating for their writers, who have few resources and could not influence the maintainers of other protocol specifications at the same time as Unicode) that are active but would be excluded from use in modern technologies such as XML and IDNA. If the supporters of these languages finally consider it more important to make them usable in modern technologies (notably XML), they will prefer collaborating with the W3C and ISO10646 and will completely ignore Unicode's attempt to define abusive character properties. Unicode will then have no voice in the standardization of those languages, and will have to endorse the character repertoire registered at ISO10646 without any discussion, even if the XML usage contradicts Unicode normative rules. There's no other choice than maintaining the stability. If this means using special characters for combining sequences, that's something that Unicode will have to do and document clearly... -- Philippe.
Re: [cowan: Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)]
Michael Everson scripsit: Oh, come on. Let's not put words in people's mouths. Ifs and mights are not facts. Expressed attitudes are facts, and it's reasonable to extrapolate people's future behaviors, at least the general trend thereof, from their expressed attitudes. When someone draws a line in the sand, it's not unreasonable to expect that crossing it will be taken as a declaration of war. -- John Cowan [EMAIL PROTECTED] http://www.reutershealth.com http://www.ccil.org/~cowan Yes, chili in the eye is bad, but so is your ear. However, I would suggest you wash your hands thoroughly before going to the toilet. --gadicath
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
At 04:22 -0500 2003-06-27, [EMAIL PROTECTED] wrote: I just have a hard time believing that 50 years from now our grandchildren won't look back [...] I am in complete agreement with the spirit of what Peter says, though realistically, 50 years from now, this is likely to be all neither here nor there... (?) I can't address all the technical details of the issue(s) at hand; however, from the point of view of computing systems generally speaking, I think the following is true:
1. Everyone is more or less agreed that the present combining class rules as they apply to BH contain mistakes. The clearly preferential way to deal with mistakes in any technological/computing software environment is to FIX them. Several people have expressed reasons why this can't be (practically) be done--which mainly seem to stem from political concerns.
2. Consequently ANY OTHER solution than 'FIX the obvious mistake(s)' is a kludge (contra Philippe's (?) recent comment). One *pays* for all kludges, one way or the other. If one is going to do this clearly undesirable thing, one had better face that, acknowledge it, and be prepared to live with it, and not try to talk one's way out of it being a kludge.
3. In that case, the question is, which kludge will cause less damage in the end? (Because kludges will ALWAYS cause some problem one hasn't foreseen. It is their nature, since they involve adding twists into an otherwise plain approach and complicating the algorithms in ways that are mystifying even to the experts, after a while.)
4. Creating a whole new set of characters whose combining classes can be redefined from scratch 'correctly' would seem to be undesirable, for a host of reasons: one can't justify duplicating existing characters (specially, if I understand it correctly, in the ISO environment which doesn't have all these other superset systems?), and to some extent, one (perhaps?) runs the risk of duplicating the present mess yet again, if one makes another mistake.
5. Inserting some kind of other character in the chain (perhaps even a different one depending on the case, whether double vowels or metheg or whatever--that is not the issue just now) is clearly a kludge too... but then the sub-issue becomes whether to overload new semantics on existing characters (e.g. ZWJ etc.) with the potential of adding exponentially more twists in the system. Would it not be preferable, in that case, to create a new character (with the appropriate attributes that I really can't comment on) whose semantic is specific to addressing the current problem? New (clean) rules would then have to be defined to cope with this. This keeps the mess to a minimum.
Now, Q: I take it the combining classes are linked to the script, rather than say to a dialect--e.g. one can't define BH as a separate dialect from MH with its own set of rules? (I assume this is the case because otherwise someone would have proposed it already.) I REALLY think that option 1 should be beaten to death with a stick, then beaten to death again, before settling for one of the others. Hoping this didn't sound like a pointless diatribe but rather that taking a step back from the details might help? K
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
(Regret I hadn't yet read this post prior to my last post) Peter said, in response to Ken: Why is it a kludge to insert some cc=0 control character into the text for the sole purpose of preventing reordering during canonical ordering of two combining marks that do interact typographically and so should but nevertheless do not have the same combining class; and, moreover, to do so using a control character that was not created for that purpose? The answer seems so obvious, I wouldn't know how to begin responding. And the fact that it achieves some desired effect has no bearing on being described as a kludge -- every kludge achieves some desired effect. If it were otherwise, the given practice would never have been conceived. Exactly correct. I am surprised Ken posed the question. If we want to insert a control character to prevent reordering under canonical ordering, I think it would be preferable to create a new control character for just that purpose: that would give a character that could be used elsewhere for the very same purpose without needing to worry about what unanticipated and undesirable effects might result by hijacking a control created for some completely unrelated purpose. For instance, you suggested RLM. Suppose next week we discover a very similar issue in a LTR script; do we want to insert RLM to prevent mark reordering in that case? [...] Very fine cases in point of what I was trying to say in more general terms. K
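The reordering problem and the "insert a cc=0 character" workaround discussed above can be reproduced with Python's standard unicodedata module. This is only a minimal sketch: the choice of lamed + patah + hiriq as the test sequence and of CGJ (U+034F) as the blocking character are assumptions made for illustration (CGJ is one of several candidates named in this thread), not a recommendation and not anyone's agreed solution.

```python
import unicodedata

# Illustrative code points (assumed example, not taken from the thread):
LAMED = "\u05DC"   # HEBREW LETTER LAMED, combining class 0
PATAH = "\u05B7"   # HEBREW POINT PATAH, combining class 17
HIRIQ = "\u05B4"   # HEBREW POINT HIRIQ, combining class 14
CGJ = "\u034F"     # COMBINING GRAPHEME JOINER, combining class 0


def classes(text):
    """Return the canonical combining class of each code point."""
    return [unicodedata.combining(ch) for ch in text]


# The order a Biblical Hebrew text might need: patah first, then hiriq.
plain = LAMED + PATAH + HIRIQ
# Same text with a class-0 character placed between the two points.
blocked = LAMED + PATAH + CGJ + HIRIQ

# Canonical ordering sorts adjacent marks by class, so hiriq (14)
# moves in front of patah (17) and the intended order is lost.
plain_nfd = unicodedata.normalize("NFD", plain)

# A class-0 character ends the run of reorderable marks, so the
# two points can no longer be swapped across it.
blocked_nfd = unicodedata.normalize("NFD", blocked)

print(classes(plain))
print(plain_nfd == LAMED + HIRIQ + PATAH)
print(blocked_nfd == blocked)
```

Whether the blocking character should be an existing control or a new one is exactly the point in dispute above; the sketch only shows why any class-0 character has the blocking effect.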
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
On Friday, June 27, 2003 3:23 PM, Karljürgen Feuerherm [EMAIL PROTECTED] wrote: At 04:22 -0500 2003-06-27, [EMAIL PROTECTED] wrote: Now, Q: I take it the combining classes are linked to the script, rather than say to a dialect--e.g. one can't define BH as a separate dialect from MH with its own set of rules? (I assume this is the case because otherwise someone would have proposed it already.) I REALLY think that option 1 should be beaten to death with a stick, then beaten to death again, before settling for one of the others. Hoping this didn't sound like a pointless diatribe but rather that taking a step back from the details might help? Do you then propose to create a specific character, for use within the Hebrew script only, as a way to specify an alternate order for hebrew cantillation? In that case, it would be more appropriate to define new standard variants of these cantillation marks, and list them in the supported variants, to be used specially for Biblic Hebrew. The rule for their use must be however simple: the variant selector must be made legal before any cantillation mark, even if it is not strictly necessary (for example between a base Hebrew character and a Hebrew point, or between two hebrew points whose normalization combining order is not defective). This would allow writing a simple transcoding algorithm for the existing encoded texts (using only the ISO10646 encoding rules), and allow further optimizations of the transformed text, to remove Variant selectors when they are not strictly necessary. This way, we won't override the semantic of the existing ZWJ or CGJ characters that were initially created to be used only before a base character to join combining sequences in the renderer or to disallow a candidate break. The breaking algorithms are already complex enough to avoid adding special semantics to these characters. 
By contrast, variant selectors are much cleaner, and the extra optimization for their superfluous use can be added to UAX#15, simply because variant selectors are only legal (and thus stable) for the predefined sequences. Variant selectors do not break the stability pact, because under this pact, a VS, character sequence is considered (for XML and other related standards) as distinct from the isolated character without the variant selector, and thus can have distinct character properties. This also has the advantage that there is absolutely no need to recode all the existing documents written with modern Hebrew, and the problem can be isolated to just the few already encoded historic documents. -- Philippe.
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
Karljürgen Feuerherm scripsit: 1. Everyone is more or less agreed that the present combining class rules as they apply to BH contain mistakes. The clearly preferential way to deal with mistakes in any technological/computing software environment is to FIX them. Not so. Sometimes stability is more important than correctness. The use of the backslash character in DOS/Windows systems as a path separator is arguably a mistake (paths were borrowed from Unix into DOS 2.0, but the slash was already in use for command-line options, something inherited from CP/M and the ancestral CLI running back through DEC operating systems), but fixing it is out of the question. Several people have expressed reasons why this can't be (practically) be done--which mainly seem to stem from political concerns. All concerns involving human beings -- ho bios politikos -- are political in some sense. 2. Consequently ANY OTHER solution than 'FIX the obvious mistake(s)' is a kludge (contra Philippe's (?) recent comment). One *pays* for all kludges, One pays for all *choices*. -- Do NOT stray from the path! John Cowan [EMAIL PROTECTED] --Gandalf http://www.ccil.org/~cowan
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
At 10:40 -0400 2003-06-27, John Cowan wrote: Karljürgen Feuerherm scripsit: 1. Everyone is more or less agreed that the present combining class rules as they apply to BH contain mistakes. The clearly preferential way to deal with mistakes in any technological/computing software environment is to FIX them. Not so. Sometimes stability is more important than correctness. And sometimes not, then. What four characters have been corrected so far? Were they important characters to some company? Are there no Christians or Jews in the IETF who might care about a problem like this, where a simple solution might be effected? Particularly if it involves only a handful of characters, and the precedent for making such corrections has been set? Or is our standard, which as I have said many times, will be used for CENTURIES, going to be hobbled by silliness like this forever? Hm? The use of the backslash character in DOS/Windows systems as a path separator is arguably a mistake (paths were borrowed from Unix into DOS 2.0, but the slash was already in use for command-line options, something inherited from CP/M and the ancestral CLI running back through DEC operating systems), but fixing it is out of the question. This is not analogous to the present situation, it seems to me. In the first place, what else is the \ for? :-) No one who wants to use the \ is prevented from doing so except maybe in filenames, in systems which don't allow it. (The colon is disallowed in Apple filenames.) All concerns involving human beings -- ho bios politikos -- are political in some sense. And some have more sense than others, it seems. (Sorry, couldn't resist.) -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: [cowan: Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)]
Michael Everson scripsit: But you might trot on over with a white flag to parley about a problem. They're only human beings over there, just as we are over here. Michael, I *am* the guy carrying the white flag to the W3C, and I have made promises about what the Unicode Consortium will and won't do based on its published stability policies. If those are changed now, I'm left twisting in the wind. As for the IETF, membership on the IETF is defined as being subscribed to an IETF mailing list and discussing a problem. Anyone can do it, anyone at all. -- John Cowan [EMAIL PROTECTED] http://www.reutershealth.com http://www.ccil.org/~cowan Yakka foob mog. Grug pubbawup zink wattoom gazork. Chumble spuzz. -- Calvin, giving Newton's First Law in his own words
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
Andrew C. West andrewcwest at alumni dot princeton dot edu wrote: I have to agree 100% with Peter on this. The potential fiasco with regards to Mongolian Free Variation Selectors is another area where our grandchildren are going to be weeping with despair if we are not careful. The standardized variants for Mongolian were set in stone by Unicode based on an unfortunate but understandable misunderstanding of the infamous TR170, and now that it is apparent from Chinese and Mongolian sources that Unicode had got hold of completely the wrong end of the stick (the defined standardized variants are actually intended for use in isolation only, and the same MFVS that selects one variant form in isolation may be used to select a completely different variant within running text ... which of course it can't according to the Standardized Variants document), instead of just wiping the slate clean and redefining a new and consistant set of standardized variants that correspond to actual usage within China and Mongolia, Unicode is determined to preserve the original erroneous standardised variants come hell or high water - even though no-one has ever seriously used them yet (well, the Chinese and Mongolians will go ahead and do it their way whatever Unicode decides). Just a day or two ago we had a discussion about fast-tracking or short-circuiting the standardization process, or otherwise using things that were partway through the process before they had received final approval. Without expressing an opinion on Unicode's handling of Mongolian, or Hebrew, or Tibetan, I think this thread shows clearly why decisions must be thought out carefully and not rushed. The perception that Unicode got it wrong, whether real or imagined, can cause great damage to the credibility and acceptance of the standard. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
On Friday, June 27, 2003 4:40 PM, John Cowan [EMAIL PROTECTED] wrote: Not so. Sometimes stability is more important than correctness. Very well answered. I don't see why we need to sacrifice stability when correcting something. As the error is not in ISO10646, it is definitely not reasonable to have ISO10646 endorse the error made by Unicode due to its stability pact. For now, the only good solution is to use existing Unicode-only resources that will not impact the normalization pact, and the ISO10646 unification work. If this requires defining some additional Unicode semantics or properties for some language-significant markup characters, this can be done with variants (if ISO10646 accepts it), or with a request to ISO10646 for a dedicated new *invisible* diacritic in the Hebrew block. Maybe Unicode should be more prudent with Normalization Forms: if new characters are added, their combining classes should be documented as informative until there is a consensus and experimentation. This will not break the stability pact with XML, which will simply not accept the new characters before they are stabilized by Unicode. So the characters can be standardized by Unicode and ISO10646, but be used with caution with XML, which can restrict the set of characters supported to only those for which the canonicalization is finished. Why not then document these critical normative properties to make them clearly informative if needed? For example informative canonical decompositions could be noted with canon (and thus only recognized by compatibility decompositions until further notice). And proposed combining classes could be noted with an additional symbol in the CC column of the UCD (for example a ?).
This would prevent using the character within XML compliant applications, but it could allow a more rapid development of fonts and renderers or layout engines, and allow experimentation in encoding actual new documents with some safeguards regarding the actual character properties. This would give the IETF and W3C a warning: this character has an informative combining class or decomposition. Normalization at this step is dangerous, and documents should be considered as already normalized for those characters. These potentially unstable Unicode-encoded documents will then be labelled with the Unicode version, as a future revision may require verifying whether the informative properties have become enforceable. If there's a change in the properties, existing documents can then be tested to see if they still respect the proposed normalization, and corrected. If there is no change after say 1 year, a revision annex publishes these properties as normative and an incremental version of Unicode is added, that allows interchange and conservation of the encoded documents without an explicit Unicode version label.
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
Philippe said on June 27, 2003 at 10:25 AM On Friday, June 27, 2003 3:23 PM, Karljürgen Feuerherm [EMAIL PROTECTED] wrote: I REALLY think that option 1 [FIX the combining classes] should be beaten to death with a stick, then beaten to death again, before settling for one of the others. Do you then propose to create a specific character, for use within the Hebrew script only, as a way to specify an alternate order for Hebrew cantillation? In that case, it would be more appropriate to define new standard variants of these cantillation marks, and list them in the supported variants, to be used specially for Biblic Hebrew. To be honest, I'm out of my depth with the details of the technical solution, so I will leave it to the properly knowledgeable like e.g. John Hudson and so on to reply to your analysis of my general conception. Basically, I simply wanted to make a 'general principle' comment based on my experience in other areas of software development because at times one can get very involved in the gory details and I felt that a step back and global summary of what I'm hearing by and large might be helpful. (And one learns by interacting, at a certain point. I'm bound to make mistakes in the process.) Essentially, I understand and appreciate John Cowan's concern/WG2's intransigence (?) about stability, and the promises (however it was done) by Unicode in that regard and so on, and I don't deprecate that in the least. But, I agree with Michael that one should at least ask the appropriate persons if possible, and if there is no way to get concession (one should aim for a general principle, in case this sort of concern comes up in another area later, so as not to have to go to bat ANOTHER time), THEN one should go to one of these other, in principle less desirable 'solutions'. (But one can still dialogue about them in the interim.) And in any case this should NOT muck things up which aren't broken, like MH. K
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
Philippe Verdy scripsit: May be Unicode should be more prudent with Normalization Forms: if new characters are added, their combining classes should be documented as informative before there is a consensus and experimentation. This will not break the stability pact with XML, which will simply not accept the new characters before they are stabilized by Unicode. XML has gone with a preacceptance approach. All possible Unicode characters in all 17 planes are already accepted as text, and most of them will be accepted (in XML 1.1) as name characters as well, pending Unicode actually creating them. The problem is that normalization can't deal with a known character whose CC is unknown -- unknown is the same as zero. These potentially instable unicode-encoded documents will then be labelled with the unicode version, as a future revision may require verigying if the informative properties have become enforcable. This is precisely the nightmare that we wish to avoid. -- John Cowan [EMAIL PROTECTED] You need a change: try Canada You need a change: try China --fortune cookies opened by a couple that I know
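The point that "unknown is the same as zero" can be made concrete with a toy canonical-ordering routine. This is a hedged sketch, not the normative algorithm text: the token names, the table, and in particular NEWMARK with its later class of 12 are hypothetical, invented only to show what happens when a class that was treated as 0 is filled in afterwards.

```python
def canonical_order(tokens, ccc):
    """Reorder marks by combining class, treating unknown tokens as class 0.

    A mark moves left past its neighbour only while the neighbour has a
    strictly higher class and the mark itself has a non-zero class, so
    class-0 tokens (known or unknown) act as barriers.
    """
    out = list(tokens)
    for i in range(1, len(out)):
        j = i
        while (
            j > 0
            and ccc.get(out[j], 0) > 0
            and ccc.get(out[j - 1], 0) > ccc.get(out[j], 0)
        ):
            out[j - 1], out[j] = out[j], out[j - 1]
            j -= 1
    return out


# Classes as published today; NEWMARK is a hypothetical character whose
# class is "not yet decided", so it is simply absent from the table.
old_table = {"PATAH": 17, "HIRIQ": 14}

# The same table after a later revision assigns the hypothetical class.
new_table = {"PATAH": 17, "HIRIQ": 14, "NEWMARK": 12}

text = ["LAMED", "PATAH", "NEWMARK", "HIRIQ"]

# Under the old table NEWMARK behaves as class 0 and blocks reordering.
before = canonical_order(text, old_table)

# Under the new table all three marks are sorted together.
after = canonical_order(text, new_table)

print(before)
print(after)
```

A document that was already "normalized" under the old table is no longer normalized under the new one, which is the versioning problem being objected to here.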
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
On Friday, June 27, 2003 5:05 PM, Michael Everson [EMAIL PROTECTED] wrote: At 10:40 -0400 2003-06-27, John Cowan wrote: Karljürgen Feuerherm scripsit: 1. Everyone is more or less agreed that the present combining class rules as they apply to BH contain mistakes. The clearly preferential way to deal with mistakes in any technological/computing software environment is to FIX them. Not so. Sometimes stability is more important than correctness. And sometimes not, then. What four characters have been corrected so far? Were they important characters to some company? Are there no Christians or Jews in the IETF who might care about a problem like this, where a simple solution might be effected? Particularly if it involves only a handful of characters, and the precedent for making such corrections has been set? Or is our standard, which as I have said many times, will be used for CENTURIES, going to be hobbled by silliness like this forever? Hm? So this change must be done by proposing several alternatives to correct it, with a formal approval process with those with which Unicode made a promise: the IETF, and the W3C XML committee, or the SGML group and you should give them enough time to consult their members. I do think that the IETF will be quite open: after all its impact is limited in a few domains like IRI and IDNA which is still not used for domain names assigned to registrants, at least not for the Biblic Hebrew language. The experimentations at ICANN and IANA for IRI are still not closed and they have still not approved all the ISO10646 repertoire for all supported languages... From the acceptable solutions, ISO10646 will certainly follow the decision of the XML committee for practical reasons: the intent of ISO is to facilitate the implementation of a coherent repertoire, not to brake implementers in their developments. This requires an official poll to solve this problem, and Unicode will not be able to decide alone...
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
On Friday, June 27, 2003 5:53 PM, Karljürgen Feuerherm [EMAIL PROTECTED] wrote: And in any case this should NOT muck things up which aren't broken, like MH. Not breaking Modern Hebrew means not changing the combining classes of the characters it uses. Adding a distinct set for Traditional Hebrew may then be the only practical solution: after all there are many such concessions in ISO10646, which did not try to unify Greek and Cyrillic even though these two scripts are closely related... With Unicode, there is for now no solution, so scholars will need to develop their own legacy encoding with distinct mappings to a future ISO10646 and Unicode standard, and for interoperability with the existing documents using this legacy 8-bit encoding, there will then come the need to map this encoding to a distinct set in ISO10646 and Unicode. This would be the end of the nightmare. What Unicode will then publish is a set of *compatibility* equivalences between the new diacritics for Traditional Hebrew and the existing diacritics for Modern Hebrew. I'm curious to see how legacy 8-bit documents are used with Biblical texts... Are the current conversion tables (informative in the Unicode database) for the ISO and Windows charsets correct from that perspective? If so, the conversion from these 8-bit encodings to Unicode would be less simple than plain mappings, as it would require looking at the place of diacritics in the 8-bit encoding to see if they can safely be normalized once in Unicode according to their relative combining classes. -- Philippe.
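The check described at the end of this message, looking at the position of diacritics after conversion to see whether normalization would move them, can be sketched as follows. This is an assumption-laden illustration: the function name reordering_hazards is hypothetical, and the check compares adjacent code points only, which is adequate for the Hebrew points because they have no canonical decompositions, but would not be a complete test for arbitrary text.

```python
import unicodedata


def reordering_hazards(text):
    """Return indexes of marks that canonical ordering would move left.

    A mark is flagged when it has a non-zero combining class and the
    code point just before it has a strictly higher class.
    """
    hazards = []
    for i in range(1, len(text)):
        prev_cc = unicodedata.combining(text[i - 1])
        cur_cc = unicodedata.combining(text[i])
        if cur_cc and prev_cc > cur_cc:
            hazards.append(i)
    return hazards


# lamed + patah + hiriq: the hiriq (class 14) follows patah (class 17),
# so normalization would swap the two points.
converted = "\u05DC\u05B7\u05B4"

# lamed + hiriq + patah: already in canonical order, nothing to flag.
already_ordered = "\u05DC\u05B4\u05B7"

print(reordering_hazards(converted))
print(reordering_hazards(already_ordered))
```

A converter could use such a list to decide where a legacy text cannot be normalized without losing its original mark order.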
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
Karljürgen Feuerherm scripsit: The use of the backslash character in DOS/Windows systems as a path separator is arguably a mistake I hardly think so. It was a matter of a necessary alternative. It could only be viewed as a mistake on the assumption that somehow the Unix way was de facto 'correct'. Pick your own mistake, then. Another good case I thought of this morning are the national boundaries in Africa, which have little or nothing to do with the realities on the ground. But (with one exception) all African nation-states treat them as sacred, because the results of full-scale border rectification would be nothing less than a world war. Several people have expressed reasons why this can't be (practically) be done--which mainly seem to stem from political concerns. All concerns involving human beings -- ho bios politikos -- are political in some sense. Of course. But that just trivializes the comment. I took your reference to political concerns to be trivializing the concerns, and pointed out that the very notion of concern is a political one. If there were no stakes, we could change Unicode daily according to the best current notion of technical excellence. Truth cannot conflict with truth, but interest can and commonly does conflict with interest. Indeed. And for some more than others. Kludges tend to be, in my experience, penny-wise and pound foolish. So if you like, I'll restate my point as 'pay me now, or pay me (probably more) later'. Alienate major customers now, or alienate a relatively small customer now. -- John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan But that, he realized, was a foolish thought; as no one knew better than he that the Wall had no other side. --Arthur C. Clarke, The Wall of Darkness
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
Michael Everson scripsit: No, but you're not making a technical argument, either. The life of [Unicode] has not been logic but experience. --Oliver Wendell Holmes, somewhat mutated Not when their core values -- correctness vs. stability -- are made to be at odds. And shifting a METEG in a normalization versioning is going to cause what technical problem? If we can change METEG today, we might change COMBINING ACUTE tomorrow. -- Knowledge studies others / Wisdom is self-known; John Cowan Muscle masters brothers / Self-mastery is bone; [EMAIL PROTECTED] Content need never borrow / Ambition wanders blind; www.ccil.org/~cowan Vitality cleaves to the marrow / Leaving death behind.--Tao 33 (Bynner)
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
At 02:53 AM 6/27/2003, [EMAIL PROTECTED] wrote: ISO: Then, obviously they need to correct their errors. I mean, it's not like the wrong characters got encoded or something. Tell them to just fix the errors; that can't be difficult to do, and is obviously the right thing to do. That seems to be exactly what Michael 'as a member of WG2' is saying. What if the request to change the Hebrew combining classes came *from* W3C and/or IETF? I'm not saying that this is likely, but I'm wondering whether they might, in fact, not insist on stability for characters for which normalisation is currently broken anyway? John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] If you browse in the shelves that, in American bookstores, are labeled New Age, you can find there even Saint Augustine, who, as far as I know, was not a fascist. But combining Saint Augustine and Stonehenge -- that is a symptom of Ur-Fascism. - Umberto Eco
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
At 03:12 AM 6/27/2003, Michael Everson wrote: Who is it who will kill the Unicode Consortium if UAX #15 were to be revised? Did it occur to anyone to *ask* about the possible revision of classes for the dozen or so instances that would be affected? My understanding is that stability promises have been made to W3C and IETF (any others?). I'm also leaning toward asking these organisations if an exception can be made to fix a broken normalisation for Hebrew, given that the present normalisation is not useful. If the UTC doesn't want to make the enquiry, perhaps a consortium of Biblical Hebraicist academic organisations, publishers and software developers could take the matter to W3C and IETF and try to obtain a statement that would allow Unicode to make the change? Perhaps WG2 would also support such a petition? John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] If you browse in the shelves that, in American bookstores, are labeled New Age, you can find there even Saint Augustine, who, as far as I know, was not a fascist. But combining Saint Augustine and Stonehenge -- that is a symptom of Ur-Fascism. - Umberto Eco
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
John Cowan said on June 27, 2003 at 12:48 PM Karljürgen Feuerherm scripsit: Several people have expressed reasons why this can't (practically) be done--which mainly seem to stem from political concerns. All concerns involving human beings -- ho bios politikos -- are political in some sense. Of course. But that just trivializes the comment. I took your reference to political concerns to be trivializing the concerns, That was not the intention, though perhaps it sounded that way. I take your concerns every bit as seriously as I do those of the Biblical Hebrew community. If there were no stakes, we could change Unicode daily according to the best current notion of technical excellence. 'We' do in fact change Unicode (nearly) daily every time there is a revision. The question at hand is where to draw the line. Your position is crystal clear, and I am not questioning the reason why you make it, or the value in it. But Alienate major customers now, or alienate a relatively small customer now. the relatively small customer has every right to argue his/her case, and to hope for an implementation which will address his/her needs. The cost of kludges vs. corrections is not in the least analogous to your statement: both customers can--at least in theory--be satisfied, with some give on both sides. If Unicode 'botched it up the first time' (not that I'm necessarily saying it did, but let's say so for the sake of argument), is it reasonable for the major customers to insist that the solution lies in botching it further? (and so on...) I agree that stability is sometimes preferable to (not necessarily better than) correctness. But a stable product which does not address the purpose for which it was created is definitely not preferable to one which is corrected to suit the purpose (the risks therein being acknowledged).
Of course, one may object that the present implementation was not created for or to include BH in the first place, and that may be (I'm happy to be informed if that is so. But that isn't the impression the discussion thus far has made). (If not, then one must ask whether the present implementation can be reasonably extended (thus preserving the stability of the existing platform) or whether one must create a new, parallel implementation for the new purpose, [or some combination of the two] which is where most of the discussion seems centred.) K
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
John Hudson scripsit: What if the request to change the Hebrew combining classes came *from* W3C and/or IETF? I'm not saying that this is likely, but I'm wondering whether they might, in fact, not insist on stability for characters for which normalisation is currently broken anyway? The normalization is not broken from the point of view of the stability community. They consider it more important that there be a fixed rule, than what the content of the rule is. Google for stare decisis for much more on this point of view in general. -- John Cowan [EMAIL PROTECTED] www.ccil.org/~cowan www.reutershealth.com If I have seen farther than others, it is because I am surrounded by dwarves. --Murray Gell-Mann
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
At 05:48 AM 6/27/2003, Michael Everson wrote: The W3C would also hit the roof if Unicode normalization changed radically. I don't think anyone is proposing a *radical* change. I have uploaded the relevant draft pages of the SBL Hebrew user manual to http://www.tiro.com/transfer/SBLappendixB.pdf This appendix provides suggested combining classes for customised normalisation routines, compared with Unicode normalisation routines. This has been tested by Libronix/Logos with the Michigan-Claremont electronic text of the _Biblia Hebraica Stuttgartensia_. There are 17 marks whose combining class value should be corrected, of which the vowels and meteg are most important. John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] If you browse in the shelves that, in American bookstores, are labeled New Age, you can find there even Saint Augustine, who, as far as I know, was not a fascist. But combining Saint Augustine and Stonehenge -- that is a symptom of Ur-Fascism. - Umberto Eco
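For readers who want to see the values under discussion, here is a minimal sketch that prints the canonical combining classes currently assigned to a few of the Hebrew points named in this thread. It assumes Python's standard unicodedata module; the MARKS table and the combining_classes helper are illustrative names, not anything taken from the SBL appendix, and the sketch shows only the existing Unicode values, not the corrected ones the appendix proposes.

```python
import unicodedata

# A few Hebrew points mentioned in the thread (illustrative selection only).
MARKS = {
    "HIRIQ": "\u05B4",
    "PATAH": "\u05B7",
    "QAMATS": "\u05B8",
    "METEG": "\u05BD",
}

def combining_classes(marks):
    """Return the canonical combining class of each mark, keyed by name."""
    return {name: unicodedata.combining(ch) for name, ch in marks.items()}

classes = combining_classes(MARKS)

# List the marks in the order canonical ordering would sort them.
for name, ccc in sorted(classes.items(), key=lambda item: item[1]):
    print(f"{name:7} ccc={ccc}")
```

Because each point has its own distinct class, any two of them on the same consonant are sorted into this fixed order by normalisation, whatever order the text was typed in.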
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
John Cowan said on June 27, 2003 at 12:56 PM Michael Everson had said: This is not analogous to the present situation, it seems to me. In the first place, what else is the \ for? :-) Escaping special characters, since you ask. But in a completely different. K
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
John Cowan wrote on 06/27/2003 08:24:35 AM: The IETF has an explicit contract with Unicode: We'll use your normalization algorithm if you promise NEVER, NEVER to change the normalization status of a single character. Unicode has already broken that promise four times, so its credibility is shaky. Yeah, but what I don't get is that IETF doesn't set anything in stone until there are working implementations, but Unicode's canonical combining classes have to be set in stone for IETF's benefit before there are working implementations. I just have a hard time understanding that. So far I have not heard any compelling objections to CGJ except that invisible characters are fuggly. I just sent a message discussing this. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
Philippe Verdy said on June 27, 2003 at 12:38 PM Subject: Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels) On Friday, June 27, 2003 5:53 PM, Karljürgen Feuerherm [EMAIL PROTECTED] wrote: And in any case this should NOT muck things up which aren't broken, like MH. Not breaking Modern Hebrew means not changing the combining classes of the characters it uses. Adding a distinct set for Traditional Hebrew may then be the only practical solution That was in effect basically what I was wondering about with my question. Thanks K
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
Philippe said on June 27, 2003 at 10:25 AM Do you then propose to create a specific character, for use within the Hebrew script only, as a way to specify an alternate order for Hebrew cantillation? In that case, it would be more appropriate to define new standard variants of these cantillation marks, and list them in the supported variants, to be used specially for Biblical Hebrew. The cantillation marks are pretty much okay: they will not be re-ordered during normalisation. There are three that should ideally have a postpositional combining class (see http://www.tiro.com/transfer/SBLappendixB.pdf), but the rest are fine. The problem is with the vowels. John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] If you browse in the shelves that, in American bookstores, are labeled New Age, you can find there even Saint Augustine, who, as far as I know, was not a fascist. But combining Saint Augustine and Stonehenge -- that is a symptom of Ur-Fascism. - Umberto Eco
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
Peter replied: Karljürgen Feuerherm wrote on 06/27/2003 08:23:08 AM: Now, Q: I take it the combining classes are linked to the script, rather than say to a dialect They're linked to the character. --e.g. one can't define BH as a separate dialect from MH with its own set of rules? No, not unless BH is encoded with a separate set of characters. I see. Not desirable, to say the least. Thanks K
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
At 10:20 AM 6/27/2003, John Cowan wrote: What if the request to change the Hebrew combining classes came *from* W3C and/or IETF? I'm not saying that this is likely, but I'm wondering whether they might, in fact, not insist on stability for characters for which normalisation is currently broken anyway? The normalization is not broken from the point of view of the stability community. They consider it more important that there be a fixed rule, than what the content of the rule is. Google for stare decisis for much more on this point of view in general. Fair enough. I made my suggestion before reading all of your exchange with Michael. John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] If you browse in the shelves that, in American bookstores, are labeled New Age, you can find there even Saint Augustine, who, as far as I know, was not a fascist. But combining Saint Augustine and Stonehenge -- that is a symptom of Ur-Fascism. - Umberto Eco
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
Peter responded: Kenneth Whistler wrote on 06/26/2003 05:36:34 PM: Why is making use of the existing behavior of existing characters a groanable kludge, if it has the desired effect and makes the required distinctions in text? Why is it a kludge to insert some cc=0 control character into the text for the sole purpose of preventing reordering during canonical ordering of two combining marks that do interact typographically and so should but nevertheless do not have the same combining class; and, moreover, to do so using a control character that was not created for that purpose? The answer seems so obvious, I wouldn't know how to begin responding. And others apparently had the same feeling. But I contend that the reason this seems odd is because of the way you present it to yourself and others. It isn't a matter of my text is o.k. the way I entered it, but now I have to insert some invisible control character into the text for the sole purpose of preventing reordering -- which wasn't something I wanted to have happen in the first place. Instead, it is that for Biblical Hebrew, the following textual conventions are adopted: A sequence of patah followed by hiriq is represented by patah, CGJ, hiriq A sequence of hiriq followed by patah is represented by hiriq, CGJ, patah Then you build keyboards (or other abstractions) that obey those textual conventions. You stop telling the Biblical Scholars that their text is screwed up because of Unicode and they have to fix it by inserting crazy control codes they don't know about, and chances are they will stop believing that their text is screwed up. :-) This isn't really any stranger than telling someone that for Twi, the following textual convention is adopted: An open o with an acute tone mark is represented by open-o, combining acute As long as the pieces stay firmly attached for entry, display, and searching, everybody is happy and nobody needs to be the wiser about what gimmicks the programmers are using under the covers. 
And why should it be any stranger that maintenance of vowel point order in Biblical Hebrew cases with multiple points requires judicious use of an invisible combining mark like CGJ, when maintenance of visible directional layout distinctions for any Hebrew requires a boatload of invisible format controls? If we want to insert a control character to prevent reordering under canonical ordering, I think it would be preferable to create a new control character for just that purpose: How would that be less of a kludge? I contend that inventing another invisible character *just* to do this is even more of a kludge than what I have suggested, when use of an existing character already has the desired effect. The end effect of the impulse you are describing here would be an attempt to create atomistic controls for each conceivable text effect, and I think the UTC has already given up on heading that direction. It is already bad enough trying to keep straight all the possible interactions for the ones already created, as demonstrated by the discoveries we just made when trying to consider what happens if a ZWJ gets plunked down *between* two combining marks. that would give a character that could be used elsewhere for the very same purpose without needing to worry about what unanticipated and undesirable effects might result by hijacking a control created for some completely unrelated purpose. This was a more applicable criticism for the suggestions of RLM, ZWJ, or WJ, since their very status as format controls instead of as combining marks had undesirable effects on the combining character sequences in question. I don't think the criticism applies to CGJ, however, since that character doesn't have any defined behavior other than what is needed here. And, as I indicated in a separate response, I do not think using CGJ for the purpose described in Biblical Hebrew is unrelated to its intent. 
It is just that nobody had yet thought through a scenario where it would prove useful between combining marks. --Ken
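The textual convention Ken describes can be checked directly. The following is a minimal sketch, assuming Python's standard unicodedata module; the variable names are illustrative. It shows that a bare patah + hiriq sequence is reordered by normalization, while the same sequence with U+034F COMBINING GRAPHEME JOINER between the points comes through unchanged.

```python
import unicodedata

LAMED = "\u05DC"   # HEBREW LETTER LAMED, ccc=0
HIRIQ = "\u05B4"   # HEBREW POINT HIRIQ, ccc=14
PATAH = "\u05B7"   # HEBREW POINT PATAH, ccc=17
CGJ = "\u034F"     # COMBINING GRAPHEME JOINER, ccc=0

def nfd(text):
    """Canonical decomposition, which includes canonical reordering."""
    return unicodedata.normalize("NFD", text)

plain = LAMED + PATAH + HIRIQ          # order as the scribe wrote it
guarded = LAMED + PATAH + CGJ + HIRIQ  # Ken's proposed convention

reordered = nfd(plain)     # hiriq (14) is sorted ahead of patah (17)
preserved = nfd(guarded)   # the ccc=0 CGJ splits the run, so nothing moves

print([f"U+{ord(c):04X}" for c in reordered])
print([f"U+{ord(c):04X}" for c in preserved])
```

This only demonstrates the normalization side of the argument. Whether a given rendering engine still positions both points correctly on the lamed is a separate question that the code cannot answer.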
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
At 12:43 AM 6/26/2003, [EMAIL PROTECTED] wrote: The problem of combinations of vowels with meteg could be amenable to a similar approach. OR, one could propose just one additional meteq/silluq character, to make it possible to distinguish (in plain text) instances of left-side and right-side meteq placement, for example. And the third position of meteg with hataf vowels? Introduce *two* additional meteg/silluq characters? No, that's a glyph ligation matter however you look at it. It could be made to work with either just a left meteg or also with a new right meteg, and can be inhibited with ZWNJ. This is not to say that I think encoding a distinct right meteg character is the best solution, only that it doesn't affect the medial meteg shaping. John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] If you browse in the shelves that, in American bookstores, are labeled New Age, you can find there even Saint Augustine, who, as far as I know, was not a fascist. But combining Saint Augustine and Stonehenge -- that is a symptom of Ur-Fascism. - Umberto Eco
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
At 10:09 AM 6/26/2003, [EMAIL PROTECTED] wrote: The Meteg is a completely different issue. There is a small number of places where the Meteg is placed differently. Since it does not behave the same as the regular Meteg, and is thus visually distinguishable, it should be possible to add a character, as long as it is clearly named. That is a potential solution, though it would have to be *two* additional metegs. Can you explain your thinking here, Peter? I agree that if the intention is to encode new Biblical Hebrew marks with revised combining classes, then two new metegs would be necessary if we want one left and one right. But if one were to accept the text encoding hack of a ZERO-WIDTH CANONICAL ORDERING INHIBITOR -- which seems less and less like a good idea, and more and more like a long term embarrassment and, like ZWJ and ZWNJ, a pain in the neck for users who have every right to expect a sensible encoding that doesn't require such gymnastics --, then I think one would only need a new HEBREW POINT RIGHT METEG character, and let it be assumed that the existing meteg character is the left position form (its current combining class puts it after all vowels, I believe). John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] If you browse in the shelves that, in American bookstores, are labeled New Age, you can find there even Saint Augustine, who, as far as I know, was not a fascist. But combining Saint Augustine and Stonehenge -- that is a symptom of Ur-Fascism. - Umberto Eco
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
Another consequence is that it separates the sequence into two combining sequences, not one. Don't know if this is a serious problem, especially since we are concerned with a limited domain with non-modern usage, but I wanted to mention it. Mark __ http://www.macchiato.com Eppur si muove - Original Message - From: Kenneth Whistler [EMAIL PROTECTED] To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Thursday, June 26, 2003 13:41 Subject: Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels) Peter replied to Karljürgen: Karljürgen Feuerherm wrote on 06/25/2003 08:31:41 PM: I was going to suggest something very similar, a ZW-pseudo-consonant of some kind, which would force each vowel to be associated with one consonant. An invisible *consonant* doesn't make sense because the problem involves more than just multiple written vowels on one consonant; I agree that we don't want to go inventing invisible consonants for this. BTW, there's already an invisible vowel (in fact a pair of them) that is unwanted by the stakeholders of the script it was originally invented for: U+17B4 KHMER VOWEL INHERENT AQ This is also (cc=0), so would serve to block canonical reordering if placed between two Hebrew vowel points. But I'm sure that if Peter thought the suggestion of the ZWJ for this was a groanable kludge, Biblical Hebraicists would probably not take lightly to the importation of an invisible Khmer character into their text representations. ;-) in fact, that is a small portion of the general problem. If we want such a character, it would notionally be a zero-width-canonical-ordering-inhibitor, and nothing more. The fact is that any of the zero-width format controls has the side-effect of inhibiting (or rather interrupting) canonical reordering if inserted in the middle of a target sequence, because of their own class (cc=0). I'm not particularly campaigning for ZWJ, by the way. ZWNJ or even U+FEFF ZWNBSP would accomplish the same. 
I just suggested ZWJ because it seemed in the ballpark. ZWNBSP would likely have fewer possible other consequences, since notionally it means just don't break here, which you wouldn't do in the middle of a Hebrew combining character sequence, anyway. And I don't particularly want to think about what happens when people start sticking this thing into sequences other than Biblical Hebrew (in Unicode, any sequence is legal). But don't forget that these cc=0 zero width format controls already can be stuck into sequences other than Biblical Hebrew. In some instances they have defined semantics there (as for Arabic and Indic scripts), but in all cases they would *already* have the effect of interrupting canonical reordering of combining character sequences if inserted there. --Ken
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
Peter responded: Ken Whistler wrote on 06/25/2003 06:57:56 PM: People could consider, for example, representation of the required sequence: lamed, qamets, hiriq, final mem as: lamed, qamets, ZWJ, hiriq, final mem So, we want to introduce yet *another* distinct semantic for ZWJ? Actually, no, I don't. That was just the first candidate that came to mind. We've got one for Indic, another for Arabic, another for ligatures (similar to that for Arabic, but slightly different). Now another that is don't effect any visual change, just be there to inhibit reordering under canonical ordering / normalization? As I pointed out in a separate response, just putting the ZWJ there would *already* interrupt the reordering of the sequence. There is nothing new about that. The problem is that you might not be able to count on it not effecting a visual change, because the generic meaning of ZWJ is now intended to be ligation requesting, which does have visual consequences. I now like better the suggestions of RLM or WJ for this. Both of those format controls, by *definition*, should have no impact on visual display in this context, the RLM because it would be inserted between two NSM's that pick up strong R-to-L directionality from the consonant, and the WJ because it would be inserted at a position where there already is no word/line break opportunity. But either of them, by their current definition and properties, would break the sequences for canonical reordering. So they already have the semantics of the putative new control in question: no effect on visual display, while inhibiting the canonical reordering of the point sequence. The presence of a ZWJ (cc=0) in the sequence would block the canonical reordering of the sequence to hiriq before qamets. 
If that is the essence of the problem needing to be addressed, then this is a much simpler solution which would impact neither the stability of normalization nor require mass cloning of vowels in order to give them new combining classes. Yes, it would accomplish all that; and is a groanable kludge. Why is making use of the existing behavior of existing characters a groanable kludge, if it has the desired effect and makes the required distinctions in text? If there is not some rendering system or font lookup showstopper here, I'm inclined to think it's a rather elegant way out of the problem. At least with having distinct vowel characters for Biblical Hebrew, we'd come to a point we could forget about it, and wouldn't be wincing every time we considered it. Au contraire. We'll be wincing forever for this one. There's no way of getting around the fact that this is merely a cloning of the whole set of points in order to have candidates for a reassigned set of combining classes. You're stuck between a rock and a hard place on this one. The UTC cannot entertain merely fixing the existing combining class assignments, because it breaks the normalization stability guarantee. We've all come to acknowledge and most to accept that, even though it still elicits groans. But in the 10646 WG2 context, coming in with a duplicate set of Hebrew points is not going to make any sense, because, as someone (John Cowan?) has already pointed out, 10646 doesn't assign combining classes, and so trying to justify character cloning on the basis of distinct combining class assignments isn't going to make any sense there. You can always come in with the proposal to encode BIBLICAL HEBREW POINT PATAH and say, even though the glyph is identical, see, the name is different, so the character is different. But this is a pretty thin disguise, and is vulnerable to simple questioning: What is it for? Well, to point Biblical Hebrew texts. But what was U+05B7 HEBREW POINT PATAH for? 
Well, to point Biblical Hebrew texts (or any Hebrew text, for that matter...). Well, then, what is the difference? Uh, the combining classes for the two are different. What is a combining class? ... and so on. I'm trying to find a way, using existing characters and a simple set of text representational conventions, to make the distinctions and preserve the order relations that you need for decent font lookup, without the whole enterprise washing up on either of those two rocks. --Ken
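To make the mechanism behind this exchange concrete, here is a minimal sketch of canonical ordering written out by hand, assuming Python's standard unicodedata module. The function name canonical_order is hypothetical, and the sketch assumes its input is already fully decomposed (true of the Hebrew sequences used here). It is an illustration of the rule, not a replacement for a real normalizer.

```python
import unicodedata

LAMED, FINAL_MEM = "\u05DC", "\u05DD"   # base letters, ccc=0
HIRIQ, QAMATS = "\u05B4", "\u05B8"      # points, ccc=14 and ccc=18
ZWJ = "\u200D"                          # format control, ccc=0

def canonical_order(text):
    """Stable-sort each run of non-zero-ccc characters by combining class.

    Any character with ccc=0 ends the current run, which is why a ccc=0
    character placed between two points keeps them from being swapped.
    """
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            out.extend(sorted(run, key=unicodedata.combining))  # sorted() is stable
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

written = LAMED + QAMATS + HIRIQ + FINAL_MEM
with_zwj = LAMED + QAMATS + ZWJ + HIRIQ + FINAL_MEM

swapped = canonical_order(written)     # hiriq moves ahead of qamats
untouched = canonical_order(with_zwj)  # ZWJ ends the run after qamats
```

For these inputs the hand-written routine agrees with unicodedata.normalize("NFD", ...), which is a quick way to check the sketch against a real implementation.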
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
At 02:45 PM 6/26/2003, Mark Davis wrote: Another consequence is that it separates the sequence into two combining sequences, not one. Don't know if this is a serious problem, especially since we are concerned with a limited domain with non-modern usage, but I wanted to mention it. It is a serious problem if separate combining sequences means, as it seems to in all the current apps I have tested, that marks separated by one of these control characters cannot be correctly positioned relative to a preceding consonant. Insertion of any zero-width control character between two marks applied to the same Hebrew consonant results in a loss of interaction between the marks (i.e. the first mark is not repositioned to accommodate the second) and the second mark loses all positioning intelligence and falls between the consonant and the next one. My guess is that the layout engine (Uniscribe in this case) makes the reasonable assumption that the two combining sequences do not interact. John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] If you browse in the shelves that, in American bookstores, are labeled New Age, you can find there even Saint Augustine, who, as far as I know, was not a fascist. But combining Saint Augustine and Stonehenge -- that is a symptom of Ur-Fascism. - Umberto Eco
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
At 15:36 -0700 2003-06-26, Kenneth Whistler wrote: I now like better the suggestions of RLM or WJ for this. ZZZT. Thank you for playing. RLM is for forcing the right behaviour for stops and parentheses and question marks and so on. Introducing it between two combining characters in Hebrew text would break all kinds of things, and would be horrible, horrible, horrible. Invent a new control character for this weird property-killer, if you must, but don't use an ordering mark for it. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
At 03:36 PM 6/26/2003, Kenneth Whistler wrote: Why is making use of the existing behavior of existing characters a groanable kludge, if it has the desired effect and makes the required distinctions in text? If there is not some rendering system or font lookup showstopper here, I'm inclined to think it's a rather elegant way out of the problem. I think assumptions about not breaking combining mark sequences may, in fact, be a showstopper. If base+mark+mark becomes base+mark+CtrlChar+mark, it is reasonable to think that this will not only inhibit mark re-ordering but also mark combining and mark interaction. Unfortunately, this seems to be the case with every control character I have been able to test, using two different rendering engines (Uniscribe and InDesign ME -- although the latter already has some problems with double marks in Biblical Hebrew). Perhaps we should have a specific COMBINING MARK SEQUENCE CONTROL character? All that said, I disagree with Ken that this is anything like an elegant way out of the problem. Forcing awkward, textually illogical and easily forgettable control character usage onto *users* in order to solve a problem in the Unicode Standard is not elegant, and it is unlikely to do much for the reputation of the standard. Q: 'Why do I have to insert this control character between these points?' A: 'To prevent them from being re-ordered.' Q: 'But why would they be re-ordered anyway? Why wouldn't they just stay in the order I put them in?' A: 'Because Unicode normalisation will automatically re-order the points.' Q: 'But why? Points shouldn't be re-ordered: it breaks the text.' A: 'Yes, but the people who decided how normalisation should work for Hebrew didn't know that.' Q: 'Well can't they fix it?' A: 'They have: they've told you that you have to insert this control character...' Q: 'But *I* didn't make the mistake. Why should I have to be the one to mess around with this annoying control character?' ... and so on. 
Much as the duplication of Hebrew mark encoding may be distasteful, and even considering the work that will need to be done to update layout engines, fonts and documents to work with the new mark characters, I agree with Peter Constable that this is by far the best long term solution, especially from a *user* perspective. Over the past two months I have been over this problem in great detail with the Society of Biblical Literature and their partners in the SBL Font Foundation. They understand the problems with the current normalisation, and they understand that any solution is going to require document and font revisions; they're resigned to this, and they've worked hard to come up with combining class assignments that would actually work for all consonant + mark(s) sequences encountered in Biblical Hebrew. This work forms the basis of the proposal submitted by Peter Constable. Encoding of new Biblical Hebrew mark characters provides a relatively simple update path for both documents and fonts, since it largely involves one-to-one mappings from old characters to new. Conversely, insisting on using control characters to manage mark ordering in texts will require analysis to identify those sequences that will be subject to re-ordering during normalisation, and individual insertion of control characters. The fact that these control characters are invisible and not obvious to users transcribing text, puts an additional burden on application and font support, and adds another level of complexity to using what are already some of the most complicated fonts in existence (how many fonts do you know that come with 18 page user manuals?). I think it is unreasonable to expect Biblical scholars to understand Unicode canonical ordering to such a deep level that they are able to know where to insert control characters to prevent a re-ordering that shouldn't be happening in the first place. 
John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] If you browse in the shelves that, in American bookstores, are labeled New Age, you can find there even Saint Augustine, who, as far as I know, was not a fascist. But combining Saint Augustine and Stonehenge -- that is a symptom of Ur-Fascism. - Umberto Eco
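The "analysis to identify those sequences that will be subject to re-ordering" that John mentions can be sketched in a few lines. This is a hedged illustration only, assuming Python's standard unicodedata module; protect_mark_order is a hypothetical helper name, and the choice of U+034F as the inserted character follows Ken's suggestion elsewhere in the thread rather than any agreed convention. It assumes the input is in the order the author intends and contains no characters with canonical decompositions.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, ccc=0

def protect_mark_order(text):
    """Insert CGJ before any mark that canonical ordering would move
    ahead of the mark immediately preceding it."""
    out = []
    prev_ccc = 0
    for ch in text:
        ccc = unicodedata.combining(ch)
        # A non-zero class lower than the previous one is out of canonical
        # order, so normalization would swap the two marks.
        if ccc != 0 and prev_ccc > ccc:
            out.append(CGJ)
        out.append(ch)
        prev_ccc = ccc
    return "".join(out)

# lamed + patah (ccc=17) + hiriq (ccc=14): hiriq would be moved forward.
at_risk = "\u05DC\u05B7\u05B4"
protected = protect_mark_order(at_risk)

# lamed + hiriq + patah is already in canonical order and is left alone.
safe = "\u05DC\u05B4\u05B7"
unchanged = protect_mark_order(safe)
```

A routine like this could sit inside a keyboard or conversion tool so that the scholar never types the control character by hand, which is roughly what Ken proposes. It does nothing about the rendering problems John reports when a character sits between two marks.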
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
Michael wrote:

At 15:36 -0700 2003-06-26, Kenneth Whistler wrote: I now like better the suggestions of RLM or WJ for this.

ZZZT. Thank you for playing. RLM is for forcing the right behaviour for stops and parentheses and question marks and so on. Introducing it between two combining characters in Hebrew text would break all kinds of things,

True, apparently, but not for the reasons you surmise. RLM does not force behavior on things. It is a strong right-to-left context that can change the resolved directionality of neutrals or weak types next to it. In between two characters that are already R, the presence or absence of an RLM is basically a no-op for bidi. Just considering the bidi algorithm, a sequence:

  lamed, patah, RLM, hiriq
  R      NSM    R    NSM

would have the resolved directions: R, R, R, R, effectively no different than the resolved directions: R, R, R of the sequence without the RLM. The problem arises when you go to consider the graphic application of the combining mark to its base form, and for that, the issue is apparently the same for the WJ, ZWJ, or any other format control in such a position. So this is nothing to do with the bidi function of RLM.

and would be horrible, horrible, horrible. Invent a new control character for this weird property-killer, if you must, but don't use an ordering mark for it

If you invent a new control character for this weird property-killer (which it wouldn't be, since in any case, I'm just talking about inserting a (cc=0) character in between two other characters, not changing or killing any properties), you still end up with exactly the same problem of graphic application, because the presence of any format control creates a defective combining character sequence which applications (apparently) won't display.

--Ken
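The bidi classes Ken lists can be checked against the Unicode Character Database. What follows is a minimal sketch, assuming Python's standard `unicodedata` module (nobody in the thread used it; it is chosen here only for illustration). The helper name `bidi_classes` is hypothetical. It reports the Bidi_Class property of each character and does not run the bidirectional algorithm itself.

```python
import unicodedata

# lamed, patah, RLM, hiriq: the sequence from the message above
LAMED, PATAH, RLM, HIRIQ = "\u05DC", "\u05B7", "\u200F", "\u05B4"

def bidi_classes(text):
    """Return the Bidi_Class of each code point in text (illustrative helper)."""
    return [unicodedata.bidirectional(ch) for ch in text]

with_rlm = bidi_classes(LAMED + PATAH + RLM + HIRIQ)
without_rlm = bidi_classes(LAMED + PATAH + HIRIQ)

print(with_rlm)     # ['R', 'NSM', 'R', 'NSM']
print(without_rlm)  # ['R', 'NSM', 'NSM']
```

Since the marks are NSM and their neighbours are already R, the RLM adds no new strong direction, which matches the "no-op for bidi" remark.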
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
John,

At 03:36 PM 6/26/2003, Kenneth Whistler wrote: Why is making use of the existing behavior of existing characters a groanable kludge, if it has the desired effect and makes the required distinctions in text? If there is not some rendering system or font lookup showstopper here, I'm inclined to think it's a rather elegant way out of the problem.

I think assumptions about not breaking combining mark sequences may, in fact, be a showstopper. If base+mark+mark becomes base+mark+CtrlChar+mark, it is reasonable to think that this will not only inhibit mark re-ordering but also mark combining and mark interaction. Unfortunately, this seems to be the case with every control character I have been able to test, using two different rendering engines (Uniscribe and InDesign ME -- although the latter already has some problems with double marks in Biblical Hebrew). Perhaps we should have a specific COMBINING MARK SEQUENCE CONTROL character?

Actually, in casting around for the solution to the problem of introduction of format controls creating defective combining character sequences, it finally occurred to me that:

  U+034F COMBINING GRAPHEME JOINER

has the requisite properties. It is non-visible, does not affect the display of neighboring characters (except incidentally, if processes choose to recognize sequences containing it and process them distinctly), *AND* it is a *combining mark*, not a format control. Hence, the sequence:

  lamed, patah, CGJ, hiriq
  0      17     0    14

is *not* a defective combining character sequence, by the definitions in the standard. The entire sequence of three combining marks would have to apply to the lamed, but the fact that CGJ has (cc=0) prevents the patah from reordering around the hiriq under normalization.

Could this finally be the missing killer app for the CGJ?

All that said, I disagree with Ken that this is anything like an elegant way out of the problem.
Forcing awkward, textually illogical and easily forgettable control character usage onto *users* in order to solve a problem in the Unicode Standard is not elegant, and it is unlikely to do much for the reputation of the standard.

I don't understand this contention. There is no reason, in principle, why this has to be surfaced to end users of Biblical Hebrew, any more than messy details of embedding override controls have to be surfaced to end users in order to make an interface which will support end user control over direction in bidirectional text.

If CGJ is the one, then the only *real* implementation requirement would be that CGJ be consistently inserted (for Biblical Hebrew) between any pair of points applied to the same consonant. Depending on the particular application, this could either be hidden behind the input method/keyboard and be actively managed by the software, or it could be applied as a filter to an export format, when exporting to contexts that might neutralize intended contrasts or result in the wrong display by the application of normalization.

Q: 'Why do I have to insert this control character between these points?'
A: 'To prevent them from being re-ordered.'
Q: 'But why would they be re-ordered anyway? Why wouldn't they just stay in the order I put them in?'
A: 'Because Unicode normalisation will automatically re-order the points.'
Q: 'But why? Points shouldn't be re-ordered: it breaks the text.'
A: 'Yes, but the people who decided how normalisation should work for Hebrew didn't know that.'
Q: 'Well can't they fix it?'
A: 'They have: they've told you that you have to insert this control character...'

And that whole dialogue should be limited to the *programmers* only, whose job it is then to hide the details of how they get the magic to work from people who would find those details just confusing.

Q: 'But *I* didn't make the mistake. Why should I have to be the one to mess around with this annoying control character?' ... and so on.
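The export-filter idea described above (CGJ inserted between any pair of points on the same consonant) could look roughly like the sketch below. It is only an illustration under stated assumptions: it uses Python's standard `unicodedata` module, the function names `is_hebrew_point` and `insert_cgj` are hypothetical, and "point" is approximated as any character in U+0591..U+05C7 with a non-zero combining class. It says nothing about whether rendering engines will display the result correctly, which is the open question in this thread.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0

def is_hebrew_point(ch):
    """Approximation: a Hebrew-block character with a non-zero combining class."""
    return "\u0591" <= ch <= "\u05C7" and unicodedata.combining(ch) != 0

def insert_cgj(text):
    """Insert CGJ between every pair of adjacent Hebrew points (hypothetical filter)."""
    out = []
    for ch in text:
        if out and is_hebrew_point(out[-1]) and is_hebrew_point(ch):
            out.append(CGJ)
        out.append(ch)
    return "".join(out)

lamed, patah, hiriq = "\u05DC", "\u05B7", "\u05B4"
original = lamed + patah + hiriq
protected = insert_cgj(original)

# Without CGJ, normalization sorts hiriq (cc=14) ahead of patah (cc=17).
reordered = unicodedata.normalize("NFC", original)
# With CGJ between the points, normalization leaves the order alone.
preserved = unicodedata.normalize("NFC", protected)

print(reordered == lamed + hiriq + patah)  # True
print(preserved == protected)              # True
```

Because CGJ itself is not a Hebrew point under this test, running the filter a second time adds nothing, so it is safe to apply on every export.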
Much as the duplication of Hebrew mark encoding may be distasteful, and even considering the work that will need to be done to update layout engines, fonts and documents to work with the new mark characters, I agree with Peter Constable that this is by far the best long term solution, especially from a *user* perspective.

I have to disagree. It should be largely irrelevant to the user perspective. In this case (as in others) the users are the experts about what their expected requirements are for text behavior, and in particular, what distinctions need to be maintained. But they should not be expected to define the technical means for fulfilling those requirements, nor lean over the shoulders of the engineers to tell them how to write the software to accomplish it.

Over the past two months I have been over this problem in great detail with the Society of Biblical Literature and their partners in the SBL Font Foundation. They understand the problems with the current
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
At 04:57 PM 6/25/2003, Kenneth Whistler wrote: And I hate to have to continue being Mr. Negativity on this list, but I remain unconvinced that the proposed solution (of cloning 14 Hebrew points and vowels) just to fix an unpreferred canonical reordering result represents the sole remaining alternative. In this case, I believe the side-effects of the proposed medicine are worse than the disease itself.

I didn't say I like the proposed solution, only that I've not heard of another one that works and is acceptable to the UTC.

For example, the alleged problem of the vocalization order of the Masoretes might be amenable to a much less drastic solution. People could consider, for example, representation of the required sequence:

  lamed, qamets, hiriq, final mem

as:

  lamed, qamets, ZWJ, hiriq, final mem

and then map qamets, ZWJ, hiriq to the required glyph to get the hiriq to display to the left (and partly under the following final mem).

There are a few problems with this scenario. One is that control characters are unreliable agents in glyph-level processing. Most applications do not paint control character glyphs, which means that they do not appear in glyph strings so cannot be used in glyph substitution lookups. This seems to be a pretty much universal assumption about control characters. MS Word offers the option of turning on display of control characters, but then the purpose is to be able to see them in text, not to affect the text by toggling the display option. Arguably, there are implementation options that would overcome this problem, but they are complicated and the present assumption seems pretty universal.

That said, I would be willing to explore this idea further, since I don't think it is necessary to get into glyph substitution involving ZWJ if the presence of ZWJ in the character string always blocks canonical reordering. In the example I gave, simply preventing, e.g. the hiriq from being re-ordered should be enough to make it correctly render under the right side of the final mem. However, this example is something of an exceptional rendering, currently involving a special /HiriqFinalMem/ glyph. I would need to check all the other affected sequences to confirm whether inserting ZWJ causes mark positioning problems (I know it will in some applications, simply because support for ZWJ isn't always very good). The frustration is that although ZWJ cannot be reliably used in glyph substitution lookups, its presence can break glyph positioning lookups. Thanks for the idea, though. I think it is worth exploring.

The problem of combinations of vowels with meteg could be amenable to a similar approach. OR, one could propose just one additional meteg/silluq character, to make it possible to distinguish (in plain text) instances of left-side and right-side meteg placement, for example.

Yes, that is an option for the meteg/silluq regardless of how the vowel ordering problem is addressed.

John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED]
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
John Hudson wrote:

At 02:36 PM 6/25/2003, Michael Everson wrote: Write it up with glyphs and minimal pairs and people will see the problem, if any. Or propose some solution. (That isn't add duplicate characters.)

Peter Constable has written this up and submitted a proposal to the UTC.

And I hate to have to continue being Mr. Negativity on this list, but I remain unconvinced that the proposed solution (of cloning 14 Hebrew points and vowels) just to fix an unpreferred canonical reordering result represents the sole remaining alternative. In this case, I believe the side-effects of the proposed medicine are worse than the disease itself.

For example, the alleged problem of the vocalization order of the Masoretes might be amenable to a much less drastic solution. People could consider, for example, representation of the required sequence:

  lamed, qamets, hiriq, final mem

as:

  lamed, qamets, ZWJ, hiriq, final mem

and then map qamets, ZWJ, hiriq to the required glyph to get the hiriq to display to the left (and partly under the following final mem).

The presence of a ZWJ (cc=0) in the sequence would block the canonical reordering of the sequence to hiriq before qamets. If that is the essence of the problem needing to be addressed, then this is a much simpler solution which would impact neither the stability of normalization nor require mass cloning of vowels in order to give them new combining classes.

Effectively what would be needed would be an agreement by Biblical Hebraicists on a text representational convention using existing characters. By doing so, they would gain both the required orderings and the ability to make the distinctions they want. If use of a ZWJ (or something similar) seems alien to Hebrew specialists, then, as always, the details can be hidden behind the details of input method and keyboard front ends.
The use of a ZWJ should not impact searches on data (if the searches are properly implemented), unless the search is explicitly concerned about the distinctions -- in which case there actually *is* a difference in the text representation which can be searched for.

The problem of combinations of vowels with meteg could be amenable to a similar approach. OR, one could propose just one additional meteg/silluq character, to make it possible to distinguish (in plain text) instances of left-side and right-side meteg placement, for example.

--Ken
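The reordering described in this message, and the way a cc=0 character blocks it, can be reproduced with a few lines. This is a minimal sketch assuming Python's standard `unicodedata` module; the helper name `classes` is hypothetical, and qamets is taken to be U+05B8 HEBREW POINT QAMATS.

```python
import unicodedata

LAMED, QAMATS, HIRIQ, FINAL_MEM = "\u05DC", "\u05B8", "\u05B4", "\u05DD"
ZWJ = "\u200D"  # ZERO WIDTH JOINER, combining class 0

def classes(text):
    """Canonical combining class of each code point (illustrative helper)."""
    return [unicodedata.combining(ch) for ch in text]

plain = LAMED + QAMATS + HIRIQ + FINAL_MEM
joined = LAMED + QAMATS + ZWJ + HIRIQ + FINAL_MEM

print(classes(plain))   # [0, 18, 14, 0]
print(classes(joined))  # [0, 18, 0, 14, 0]

# Adjacent non-zero classes are sorted, so hiriq (14) moves before qamats (18).
plain_nfd = unicodedata.normalize("NFD", plain)
# A cc=0 character between the two points ends the run, so nothing moves.
joined_nfd = unicodedata.normalize("NFD", joined)

print(plain_nfd == LAMED + HIRIQ + QAMATS + FINAL_MEM)  # True
print(joined_nfd == joined)                             # True
```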
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
For example, the alleged problem of the vocalization order of the Masoretes might be amenable to a much less drastic solution. People could consider, for example, representation of the required sequence:

  lamed, qamets, hiriq, final mem

as:

  lamed, qamets, ZWJ, hiriq, final mem

and then map qamets, ZWJ, hiriq to the required glyph to get the hiriq to display to the left (and partly under the following final mem).

There are a few problems with this scenario. One is that control characters are unreliable agents in glyph-level processing. Most applications do not paint control character glyphs, which means that they do not appear in glyph strings so cannot be used in glyph substitution lookups.

Even if the ZWJ is stripped by the application before the actual low-level paint API is called, so that instead of

  lamed, qamets, ZWJ, hiriq, final mem

the renderer just sees

  lamed, qamets, hiriq, final mem

you still end up with the order you need to make the distinction.

The only problem would be if an application first stripped the ZWJ and then *before* calling the paint operation, proceeded to normalize the control-stripped glyph string. That would, however, strike me as being arguably non-conformant with the intent of the standard and the intent of normalization. You might end up with very strange behavior if applications started normalizing glyph strings after stripping them of format controls.

--Ken
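The order-of-operations concern in this message can be made concrete. The sketch below is an assumption-laden illustration, not a model of any real rendering pipeline: it uses Python's standard `unicodedata` module, and `strip_format_controls` is a hypothetical stand-in for whatever an application does before painting.

```python
import unicodedata

ZWJ = "\u200D"
LAMED, QAMATS, HIRIQ, FINAL_MEM = "\u05DC", "\u05B8", "\u05B4", "\u05DD"
stored = LAMED + QAMATS + ZWJ + HIRIQ + FINAL_MEM

def strip_format_controls(text):
    """Drop characters of general category Cf (format controls) such as ZWJ."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

# Normalize first, strip last: the renderer still sees qamats before hiriq.
safe = strip_format_controls(unicodedata.normalize("NFC", stored))

# Strip first, normalize last: the ZWJ is gone, so the points get reordered.
unsafe = unicodedata.normalize("NFC", strip_format_controls(stored))

print(safe == LAMED + QAMATS + HIRIQ + FINAL_MEM)    # True
print(unsafe == LAMED + HIRIQ + QAMATS + FINAL_MEM)  # True
```

The only difference between the two results is which step runs last, which is the scenario the message warns against.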
Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)
At 06:22 PM 6/25/2003, Kenneth Whistler wrote: Even if the ZWJ is stripped by the application before the actual low-level paint API is called, so that instead of

  lamed, qamets, ZWJ, hiriq, final mem

the renderer just sees

  lamed, qamets, hiriq, final mem

you still end up with the order you need to make the distinction.

Yes. That works. My biggest worry with the ZWJ is that it may affect positioning lookups; this requires some experiment.

Whatever solution is finally adopted, changing texts and fonts is relatively simple. Changing layout engines and applications is harder and takes longer. We're pretty much resigned to updating texts and fonts. I'll discuss the ZWJ idea with our project partners, but if it does affect positioning lookups it is not something that will get adopted until that is resolved.

The potential problem is that the ZWJ is used for so many different things now, it is difficult to know exactly what applications will do with it. If the intent, as in this instance, is simply to prevent character re-ordering, should we really be using something so loaded? Perhaps we need a control character specifically as a canonical order override: something with cc=0 but no other behaviour associated with it.

John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED]
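For comparing the existing cc=0 candidates named in this thread (RLM, ZWJ, WJ and CGJ), their basic properties can be listed side by side. This is a minimal sketch assuming Python's standard `unicodedata` module; the `CANDIDATES` table and the `describe` helper are hypothetical names. It shows only general category and combining class, and says nothing about how a layout engine treats each character.

```python
import unicodedata

# Characters suggested at various points in the thread as order-blockers.
CANDIDATES = {
    "RLM": "\u200F",
    "ZWJ": "\u200D",
    "WJ": "\u2060",
    "CGJ": "\u034F",
}

def describe(ch):
    """Return (name, general category, canonical combining class) for ch."""
    return (unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch))

summary = {label: describe(ch) for label, ch in CANDIDATES.items()}

for label, (name, category, ccc) in summary.items():
    print(f"{label:4} {category} cc={ccc} {name}")
```

All four have combining class 0, so any of them stops canonical reordering; only CGJ has category Mn (a combining mark) rather than Cf (a format control).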