Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread Michael Everson
At 04:22 -0500 2003-06-27, [EMAIL PROTECTED] wrote:

I just have a hard time believing that 50 years from now our 
grandchildren won't look back: "What were they thinking? So it took 
them a couple of years to figure out canonical ordering and 
normalization; why on earth didn't they work that out first before 
setting things in stone, rather than saddling us with this 
hodgepodge of ad hoc workarounds? How short-sighted." As Rick said, 
I know this will get shot down; don't bother telling me so.
I agree with you, Peter.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com


Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread Michael Everson
At 04:22 -0500 2003-06-27, [EMAIL PROTECTED] wrote:

Are we saying that ISO doesn't give a rip for implementation issues?
Duplication of characters is not the way to fix (forgive me, UTC) 
*Unicode's* error in combining characters.

Or that their notion of ordering distinctions is different from 
Unicode's, such that *any* differently ordered permutation of some 
given set of characters is considered a distinct representation? 
Are we saying that the voting members of WG2 are not already aware 
of the issue that has been discussed and are incapable of understanding 
an explanation of these issues addressed to them?
You might submit your paper to WG2.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com


Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread Michael Everson
At 04:53 -0500 2003-06-27, [EMAIL PROTECTED] wrote:

If they're so unaware of combining classes, might it not seem 
reasonable to think that the dialog might continue as follows?

- [gives explanation of combining classes and the related problem for Hebrew]
ISO: So, you're saying you're coming to us asking for duplicates of 
existing characters because of an error the Unicode Consortium made 
with some of those character properties they define?
- Well, yes, that's basically it.
ISO: Then, obviously they need to correct their errors. I mean, it's 
not like the wrong characters got encoded or something. Tell them to 
just fix the errors; that can't be difficult to do, and is obviously 
the right thing to do.
This is exactly my view.

Who is it who will kill the Unicode Consortium if UAX #15 were to be 
revised? Did it occur to anyone to *ask* about the possible revision 
of classes for the dozen or so instances that would be affected?
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



[cowan: Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)]

2003-06-27 Thread John Cowan
Michael Everson scripsit:

 Who is it who will kill the Unicode Consortium if UAX #15 were to be 
 revised? Did it occur to anyone to *ask* about the possible revision 
 of classes for the dozen or so instances that would be affected?

The IETF, for one.  The IETF is already very wary of Unicode; even though
they recognize the practical necessity of using it, it is only with the existing
stability guarantees about normalization that they have managed to swallow it.
Stability *even if wrong* is really, really important to protocol people --
just think of all the nonfunctional stubs in the world of *diplomatic*
protocol, maintained in the name of not changing anything.

The W3C would also hit the roof if Unicode normalization changed radically.
Neither party is at all happy with even the four (I think) characters
that have already changed, and both are already beginning to turn into
optimistic pessimists (people who smile brightly, nod their heads, and say
happily, "See, things are every bit as bad as I predicted!").

Since the use of non-ASCII characters in things like XML and the DNS
depends on the good will of these folks, it is very very dangerous
to alienate them, and *they do not care* whether the case is a corner
case or not -- _stare decisis_ is everything to them, the actual
details little or nothing.

Change the character classes in Unicode 4.1, and they *might* decide to
freeze support at, say, Unicode 3.0.

-- 
John Cowan
[EMAIL PROTECTED]
I am a member of a civilization. --David Brin



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread Andrew C. West
On Fri, 27 Jun 2003 04:22:30 -0500, [EMAIL PROTECTED] wrote:

 I just have a hard time believing that 50 years from now our grandchildren 
 won't look back: "What were they thinking? So it took them a couple of 
 years to figure out canonical ordering and normalization; why on earth 
 didn't they work that out first before setting things in stone, rather 
 than saddling us with this hodgepodge of ad hoc workarounds? How 
 short-sighted." As Rick said, I know this will get shot down; don't bother 
 telling me so.

I have to agree 100% with Peter on this. The potential fiasco with regard to
Mongolian Free Variation Selectors is another area where our grandchildren are
going to be weeping with despair if we are not careful. The standardized
variants for Mongolian were set in stone by Unicode based on an unfortunate but
understandable misunderstanding of the infamous TR170, and now that it is
apparent from Chinese and Mongolian sources that Unicode had got hold of
completely the wrong end of the stick (the defined standardized variants are
actually intended for use in isolation only, and the same MFVS that selects one
variant form in isolation may be used to select a completely different variant
within running text ... which of course it can't according to the Standardized
Variants document), instead of just wiping the slate clean and redefining a new
and consistent set of standardized variants that correspond to actual usage
within China and Mongolia, Unicode is determined to preserve the original
erroneous standardised variants come hell or high water - even though no-one has
ever seriously used them yet (well, the Chinese and Mongolians will go ahead and
do it their way whatever Unicode decides).

And before Peter suggests it, I have already suggested elsewhere that if Unicode
can't fix past errors, the only course might be for Unicode to deprecate the
MFVSs, and start again from scratch - didn't go down too well!

Andrew



Re: [cowan: Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)]

2003-06-27 Thread Philippe Verdy
On Friday, June 27, 2003 1:29 PM, John Cowan [EMAIL PROTECTED] wrote:
 Michael Everson scripsit:
 Change the character classes in Unicode 4.1, and they *might* decide
 to freeze support at, say, Unicode 3.0.

Or they may simply opt to define their *OWN* normalization standard, distinct from 
the Unicode NF* forms and described in a separate reference document, removing *all* 
references to UAX#15 from the XML and IDNA specifications, just to guarantee the stability 
that Unicode would be unable to offer.

Let's not let this happen!

The IDNA protocol authors have already made a lot of concessions to Unicode, but they may 
simply abandon any intent to support Unicode's idea of normalizing old scripts that 
they clearly don't need. This would mean that modern scripts that are still not 
encoded would not fit into the XML or IDNA frameworks any time soon...

And this would be dramatic for those languages (and very frustrating for their writers, 
who have few resources and could not influence the maintainers of other protocol 
specifications at the same time as Unicode) that are in active use but would be excluded from 
use in modern technologies such as XML and IDNA.

If the supporters of these languages finally consider it more important to get them 
usable in modern technologies (notably XML), they will prefer collaborating with 
the W3C and ISO10646 and will completely ignore Unicode's attempt to define abusive 
character properties. Unicode will then have no voice in the standardization of those 
languages, and will have to endorse the character repertoire registered at ISO10646 
without any discussion, even if the XML usage contradicts Unicode's normative rules.

There's no other choice than maintaining the stability. If this means using special 
characters for combining sequences, that's something that Unicode will have to do and 
document clearly...

-- Philippe.




Re: [cowan: Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)]

2003-06-27 Thread John Cowan
Michael Everson scripsit:

 Oh, come on. Let's not put words in people's mouths. Ifs and mights 
 are not facts.

Expressed attitudes are facts, and it's reasonable to extrapolate people's
future behaviors, at least the general trend thereof, from their expressed
attitudes.  When someone draws a line in the sand, it's not unreasonable
to expect that crossing it will be taken as a declaration of war.

-- 
Yes, chili in the eye is bad, but so is yourJohn Cowan
ear.  However, I would suggest you wash your[EMAIL PROTECTED]
hands thoroughly before going to the toilet.http://www.reutershealth.com
--gadicath  http://www.ccil.org/~cowan



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread Karljürgen Feuerherm
 At 04:22 -0500 2003-06-27, [EMAIL PROTECTED] wrote:

 I just have a hard time believing that 50 years from now our
 grandchildren  won't look back [...]

I am in complete agreement with the spirit of what Peter says, though
realistically, 50 years from now, this is all likely to be neither here nor
there... (?)

I can't address all the technical details of the issue(s) at hand; however,
from the point of view of computing systems generally, I think the
following is true:

1. Everyone is more or less agreed that the present combining class rules as
they apply to BH contain mistakes. The clearly preferential way to deal with
mistakes in any technological/computing software environment is to FIX them.
Several people have expressed reasons why this can't (practically) be
done--which mainly seem to stem from political concerns.

2. Consequently ANY OTHER solution than 'FIX the obvious mistake(s)' is a
kludge (contra Philippe's (?) recent comment). One *pays* for all kludges,
one way or the other. If one is going to do this clearly undesirable thing,
one had better face that, acknowledge it, and be prepared to live with it,
and not try to talk one's way out of it being a kludge.

3. In that case, the question is, which kludge will cause less damage in the
end? (Because kludges will ALWAYS cause some problem one hasn't foreseen. It
is their nature, since they involve adding twists into an otherwise plain
approach and complicating the algorithms in ways that are mystifying even to
the experts, after a while.)

4. Creating a whole new set of characters whose combining classes can be
redefined from scratch 'correctly' would seem to be undesirable, for a host
of reasons: one can't justify duplicating existing characters (especially, if
I understand it correctly, in the ISO environment, which doesn't have all
these other superset systems?), and to some extent, one (perhaps?) runs the
risk of duplicating the present mess yet again, if one makes another
mistake...

5. Inserting some kind of other character in the chain (perhaps even a
different one depending on the case, whether double vowels or metheg or
whatever--that is not the issue just now) is clearly a kludge too... but
then the sub-issue becomes whether to overload new semantics on existing
characters (e.g. ZWJ etc.) with the potential of adding exponentially more
twists in the system. Would it not be preferable, in that case, to create a
new character (with the appropriate attributes that I really can't comment
on) whose semantic is specific to addressing the current problem? New
(clean) rules would then have to be defined to cope with this. This keeps
the mess to a minimum.

Now, Q: I take it the combining classes are linked to the script, rather
than say to a dialect--e.g. one can't define BH as a separate dialect from
MH with its own set of rules? (I assume this is the case because otherwise
someone would have proposed it already.)

I REALLY think that option 1 should be beaten to death with a stick, then
beaten to death again, before settling for one of the others.

Hoping this didn't sound like a pointless diatribe but rather that taking a
step back from the details might help?

K





Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread Karljürgen Feuerherm
(Regret I hadn't yet read this post prior to my last post)

Peter said, in response to Ken:

 Why is it a kludge to insert some cc=0 control character into the text for
 the sole purpose of preventing reordering during canonical ordering of two
 combining marks that do interact typographically and so should, but 
 nevertheless do not, have the same combining class; and, moreover, to do so 
 using a control character that was not created for that purpose?

 The answer seems so obvious, I wouldn't know how to begin responding.

 And the fact that it achieves some desired effect has no bearing on being
 described as a kludge -- every kludge achieves some desired effect. If it
 were otherwise, the given practice would never have been conceived.

Exactly correct. I am surprised Ken posed the question.

 If we want to insert a control character to prevent reordering under
 canonical ordering, I think it would be preferable to create a new control
 character for just that purpose: that would give a character that could be
 used elsewhere for the very same purpose without needing to worry about
 what unanticipated and undesirable effects might result by hijacking a
 control created for some completely unrelated purpose. For instance, you
 suggested RLM. Suppose next week we discover a very similar issue in a LTR
 script; do we want to insert RLM to prevent mark reordering in that case?
[...]

Very fine cases in point of what I was trying to say in more general terms.

K





Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread Philippe Verdy
On Friday, June 27, 2003 3:23 PM, Karljürgen Feuerherm [EMAIL PROTECTED] wrote:
  At 04:22 -0500 2003-06-27, [EMAIL PROTECTED] wrote:
 Now, Q: I take it the combining classes are linked to the script,
 rather than say to a dialect--e.g. one can't define BH as a separate
 dialect from MH with its own set of rules? (I assume this is the case
 because otherwise someone would have proposed it already.)
 
 I REALLY think that option 1 should be beaten to death with a stick,
 then beaten to death again, before settling for one of the others.
 
 Hoping this didn't sound like a pointless diatribe but rather that
 taking a step back from the details might help?

Do you then propose to create a specific character, for use within the Hebrew script 
only, as a way to specify an alternate order for Hebrew cantillation? In that case, it 
would be more appropriate to define new standard variants of these cantillation marks, 
and list them in the supported variants, to be used specifically for Biblical Hebrew.

The rule for their use must however be simple: the variant selector must be made legal 
before any cantillation mark, even if it is not strictly necessary (for example 
between a base Hebrew character and a Hebrew point, or between two Hebrew points whose 
normalized combining order is not defective).

This would allow writing a simple transcoding algorithm for the existing encoded texts 
(using only the ISO10646 encoding rules), and allow further optimizations of the 
transformed text, to remove Variant selectors when they are not strictly necessary.

This way, we won't override the semantics of the existing ZWJ or CGJ characters, which 
were initially created to be used only before a base character to join combining 
sequences in the renderer or to disallow a candidate break. The breaking algorithms 
are already complex enough without adding special semantics to these characters.

By contrast, variant selectors are much cleaner, and the extra optimization for 
their superfluous use can be added to UAX#15, simply because variant selectors are 
only legal (and thus stable) for the predefined sequences.

Variant selectors do not break the stability pact, because under this pact a 
character-plus-variant-selector sequence is considered (for XML and other related standards) as distinct 
from the isolated character without the variant selector, and thus can have distinct 
character properties.

This also has the advantage that there is absolutely no need to recode all the existing 
documents written in Modern Hebrew, and the problem can be isolated to just the few 
already encoded historic documents.

-- Philippe.




Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread John Cowan
Karljürgen Feuerherm scripsit:

 1. Everyone is more or less agreed that the present combining class rules as
 they apply to BH contain mistakes. The clearly preferential way to deal with
 mistakes in any technological/computing software environment is to FIX them.

Not so.  Sometimes stability is more important than correctness.  The use of
the backslash character in DOS/Windows systems as a path separator is
arguably a mistake (paths were borrowed from Unix into DOS 2.0, but the
slash was already in use for command-line options, something inherited from
CP/M and the ancestral CLI running back through DEC operating systems),
but fixing it is out of the question.

 Several people have expressed reasons why this can't (practically) be
 done--which mainly seem to stem from political concerns.

All concerns involving human beings -- ho bios politikos -- are political
in some sense.

 2. Consequently ANY OTHER solution than 'FIX the obvious mistake(s)' is a
 kludge (contra Philippe's (?) recent comment). One *pays* for all kludges,

One pays for all *choices*.

-- 
Do NOT stray from the path! John Cowan [EMAIL PROTECTED]
--Gandalf   http://www.ccil.org/~cowan



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread Michael Everson
At 10:40 -0400 2003-06-27, John Cowan wrote:
Karljürgen Feuerherm scripsit:

 1. Everyone is more or less agreed that the present combining class rules as
 they apply to BH contain mistakes. The clearly preferential way to deal with
 mistakes in any technological/computing software environment is to FIX them.
Not so.  Sometimes stability is more important than correctness.
And sometimes not, then. What four characters have been corrected so 
far? Were they important characters to some company? Are there no 
Christians or Jews in the IETF who might care about a problem like 
this, where a simple solution might be effected? Particularly if it 
involves only a handful of characters, and the precedent for making 
such corrections has been set? Or is our standard, which as I have 
said many times, will be used for CENTURIES, going to be hobbled by 
silliness like this forever? Hm?

The use of the backslash character in DOS/Windows systems as a path 
separator is arguably a mistake (paths were borrowed from Unix into 
DOS 2.0, but the
slash was already in use for command-line options, something inherited from
CP/M and the ancestral CLI running back through DEC operating systems),
but fixing it is out of the question.
This is not analogous to the present situation, it seems to me. In 
the first place, what else is the \ for? :-) No one who wants to use 
the \ is prevented from doing so except maybe in filenames, in 
systems which don't allow it. (The colon is disallowed in Apple 
filenames.)

All concerns involving human beings -- ho bios politikos -- are political
in some sense.
And some have more sense than others, it seems. (Sorry, couldn't resist.)
--
Michael Everson * * Everson Typography *  * http://www.evertype.com


Re: [cowan: Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)]

2003-06-27 Thread John Cowan
Michael Everson scripsit:

 But you might trot on over with a white flag to parley about a problem.
 
 They're only human beings over there, just as we are over here.

Michael, I *am* the guy carrying the white flag to the W3C, and I have
made promises about what the Unicode Consortium will and won't do based
on its published stability policies.  If those are changed now, I'm left
twisting in the wind.

As for the IETF, membership on the IETF is defined as being subscribed
to an IETF mailing list and discussing a problem.  Anyone can do it, anyone
at all.

-- 
John Cowan[EMAIL PROTECTED] 
http://www.reutershealth.com  http://www.ccil.org/~cowan
Yakka foob mog.  Grug pubbawup zink wattoom gazork.  Chumble spuzz.
-- Calvin, giving Newton's First Law in his own words



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread Doug Ewell
Andrew C. West andrewcwest at alumni dot princeton dot edu wrote:

 I have to agree 100% with Peter on this. The potential fiasco with
 regard to Mongolian Free Variation Selectors is another area where
 our grandchildren are going to be weeping with despair if we are
 not careful. The standardized variants for Mongolian were set in
 stone by Unicode based on an unfortunate but understandable
 misunderstanding of the infamous TR170, and now that it is apparent
 from Chinese and Mongolian sources that Unicode had got hold of
 completely the wrong end of the stick (the defined standardized
 variants are actually intended for use in isolation only, and the same
 MFVS that selects one variant form in isolation may be used to select
 a completely different variant within running text ... which of course
 it can't according to the Standardized Variants document), instead of
 just wiping the slate clean and redefining a new and consistent set of
 standardized variants that correspond to actual usage within China
 and Mongolia, Unicode is determined to preserve the original erroneous
 standardised variants come hell or high water - even though no-one has
 ever seriously used them yet (well, the Chinese and Mongolians will go
 ahead and do it their way whatever Unicode decides).

Just a day or two ago we had a discussion about fast-tracking or
short-circuiting the standardization process, or otherwise using things
that were partway through the process before they had received final
approval.

Without expressing an opinion on Unicode's handling of Mongolian, or
Hebrew, or Tibetan, I think this thread shows clearly why decisions must
be thought out carefully and not rushed.  The perception that Unicode
got it wrong, whether real or imagined, can cause great damage to the
credibility and acceptance of the standard.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread Philippe Verdy
On Friday, June 27, 2003 4:40 PM, John Cowan [EMAIL PROTECTED] wrote:
 Not so.  Sometimes stability is more important than correctness.

Very well answered. I don't see why we would need to sacrifice stability when
correcting something. As the error is not in ISO10646, it is definitely not
reasonable to have ISO10646 endorse the error made by Unicode due
to its stability pact.

For now, the only good solution is to use existing Unicode-only resources
that will impact neither the normalization pact nor the ISO10646 unification
work. If this requires defining some additional Unicode semantics or
properties for some language-significant markup characters, this can be
done with variants (if ISO10646 accepts it), or with a request to ISO10646
for a dedicated new *invisible* diacritic in the Hebrew block.

Maybe Unicode should be more prudent with Normalization Forms: if
new characters are added, their combining classes should be
documented as informative before there is a consensus and
experimentation. This will not break the stability pact with XML, which
will simply not accept the new characters before they are stabilized
by Unicode.

So the characters can be standardized by Unicode and ISO10646, but
be used with caution with XML, which can restrict the set of characters
supported to only those for which the canonicalization is finished.

Why not then document these critical normative properties to make
them clearly informative if needed?
For example, informative canonical decompositions could be noted with
a "canon" tag (and thus only recognized by compatibility decompositions
until further notice).

And proposed combining classes could be noted with an additional
symbol in the CC column of the UCD (for example, a "?").

This would prevent using the character within XML-compliant
applications, but it could allow more rapid development of fonts,
renderers and layout engines, and allow experimentation in encoding
actual new documents with some safeguards regarding the
actual character properties.

This would give IETF and W3C a warning: "this character has
an informative combining class or decomposition; normalization
at this step is dangerous, and documents should be considered
as already normalized for those characters."

These potentially unstable Unicode-encoded documents will then
be labelled with the Unicode version, as a future revision may
require verifying whether the informative properties have become
enforceable. If there's a change in the properties, existing
documents can then be tested to see if they still respect the
proposed normalization, and corrected. If there is no change
after, say, one year, a revision annex publishes these properties
as normative and an incremental version of Unicode is released,
which allows interchange and conservation of the encoded
documents without an explicit Unicode version label.




Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread Karljürgen Feuerherm
Philippe said on June 27, 2003 at 10:25 AM

 On Friday, June 27, 2003 3:23 PM, Karljürgen Feuerherm
[EMAIL PROTECTED] wrote:
  I REALLY think that option 1 [FIX the combining classes] should be
beaten to death with a stick,
  then beaten to death again, before settling for one of the others.

 Do you then propose to create a specific character, for use within the
Hebrew script only, as a way to specify an alternate order for hebrew
cantillation? In that case, it would be more appropriate to define new
standard variants of these cantillation marks, and list them in the
supported variants, to be used specially for Biblic Hebrew.

To be honest, I'm out of my depth with the details of the technical
solution, so I will leave it to the properly knowledgeable like e.g. John
Hudson and so on to reply to your analysis of my general conception.

Basically, I simply wanted to make a 'general principle' comment based on my
experience in other areas of software development because at times one can
get very involved in the gory details and I felt that a step back and global
summary of what I'm hearing by and large might be helpful. (And one learns
by interacting, at a certain point. I'm bound to make mistakes in the
process.)

Essentially, I understand and appreciate John Cowan's concern/WG2's
intransigence (?) about stability, and the promises (however they were made)
by Unicode in that regard and so on, and I don't deprecate that in the
least. But I agree with Michael that one should at least ask the
appropriate persons if possible, and if there is no way to get a concession
(one should aim for a general principle, in case this sort of concern comes
up in another area later, so as not to have to go to bat ANOTHER time), THEN
one should go to one of these other, in principle less desirable,
'solutions'. (But one can still dialogue about them in the interim.)

And in any case this should NOT muck things up which aren't broken, like MH.

K





Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread John Cowan
Philippe Verdy scripsit:

 Maybe Unicode should be more prudent with Normalization Forms: if
 new characters are added, their combining classes should be
 documented as informative before there is a consensus and
 experimentation. This will not break the stability pact with XML, which
 will simply not accept the new characters before they are stabilized
 by Unicode.

XML has gone with a preacceptance approach.  All possible Unicode
characters in all 17 planes are already accepted as text, and most of them
will be accepted (in XML 1.1) as name characters as well, pending Unicode
actually creating them.  The problem is that normalization can't deal
with a known character whose CC is unknown -- unknown is the same as zero.
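(As a quick illustration of that last point, a minimal Python sketch using the
standard unicodedata module; the unassigned code point chosen below is arbitrary,
and the class reported for hiriq is the one in the Unicode Character Database:)

    import unicodedata

    # An assigned combining mark reports its canonical combining class...
    print(unicodedata.combining("\u05B4"))      # HEBREW POINT HIRIQ -> 14

    # ...but a code point unassigned in this library's UCD snapshot gets the
    # default class 0, so a normalizer treats "unknown" exactly like a starter.
    print(unicodedata.combining("\U000EFFFD"))  # unassigned -> 0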

 These potentially instable unicode-encoded documents will then
 be labelled with the unicode version, as a future revision may
 require verigying if the informative properties have become
 enforcable. 

This is precisely the nightmare that we wish to avoid.

-- 
John Cowan   [EMAIL PROTECTED]
You need a change: try Canada  You need a change: try China
--fortune cookies opened by a couple that I know



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread Philippe Verdy
On Friday, June 27, 2003 5:05 PM, Michael Everson [EMAIL PROTECTED] wrote:

 At 10:40 -0400 2003-06-27, John Cowan wrote:
  Karljürgen Feuerherm scripsit:
  
1. Everyone is more or less agreed that the present combining
class rules as they apply to BH contain mistakes. The clearly
preferential way to deal with mistakes in any
   technological/computing software environment is to FIX them. 
  
  Not so.  Sometimes stability is more important than correctness.
 
 And sometimes not, then. What four characters have been corrected so
 far? Were they important characters to some company? Are there no
 Christians or Jews in the IETF who might care about a problem like
 this, where a simple solution might be effected? Particularly if it
 involves only a handful of characters, and the precedent for making
 such corrections has been set? Or is our standard, which as I have
 said many times, will be used for CENTURIES, going to be hobbled by
 silliness like this forever? Hm?

So this change must be done by proposing several alternatives to correct
it, with a formal approval process involving those to whom Unicode made
a promise: the IETF, the W3C XML committee, or the SGML
group; and you should give them enough time to consult their members.

I do think that the IETF will be quite open: after all, its impact is limited
to a few domains like IRI and IDNA, which are still not used for domain
names assigned to registrants, at least not for the Biblical Hebrew
language. The experiments at ICANN and IANA for IRI are still
not closed, and they have still not approved all of the ISO10646 repertoire
for all supported languages...

Among the acceptable solutions, ISO10646 will certainly follow the decision
of the XML committee for practical reasons: the intent of ISO is to facilitate
the implementation of a coherent repertoire, not to hold implementers back in
their developments.

This requires an official poll to solve this problem, and Unicode will not
be able to decide alone...




Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread Philippe Verdy
On Friday, June 27, 2003 5:53 PM, Karljürgen Feuerherm [EMAIL PROTECTED] wrote:
 And in any case this should NOT muck things up which aren't broken,
 like MH. 

Not breaking Modern Hebrew means not changing the combining classes
of the characters it uses.

Adding a distinct set for Traditional Hebrew may then be the only practical
solution: after all, there are many such concessions in ISO10646, which
did not try to unify Greek and Cyrillic even though these two scripts are
closely related...

With Unicode, there is for now no solution, so scholars will need to develop
their own legacy encoding with distinct mappings to a future ISO10646
and Unicode standard. Then, for interoperability with the existing documents
using this legacy 8-bit encoding, the need will come to map this
encoding to a distinct set in ISO10646 and Unicode.

This would be the end of the nightmare.

What Unicode will then publish is a set of *compatibility* equivalences
between the new diacritics for Traditional Hebrew and the existing diacritics
for Modern Hebrew.

I'm curious to see how legacy 8-bit documents are used for Biblical texts...
Are the current conversion tables (informative in the Unicode database) for
the ISO and Windows charsets correct from that perspective?

If so, the conversion from these 8-bit encodings to Unicode would be less
simple than a plain one-to-one mapping, as it would require looking at the placement
of diacritics in the 8-bit encoding to see whether they can safely be normalized
once in Unicode according to their relative combining classes.

-- Philippe.



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread John Cowan
Karljürgen Feuerherm scripsit:

  The use of
  the backslash character in DOS/Windows systems as a path separator is
  arguably a mistake
 
 I hardly think so. It was a matter of a necessary alternative. It could only
 be viewed as a mistake on the assumption that somehow the Unix way was
 de facto 'correct'.

Pick your own mistake, then.  Another good case I thought of this morning
is the national boundaries in Africa, which have little or nothing to
do with the realities on the ground.  But (with one exception) all African
nation-states treat them as sacred, because the results of full-scale
border rectification would be nothing less than a world war.

   Several people have expressed reasons why this can't (practically) be
   done--which mainly seem to stem from political concerns.
 
  All concerns involving human beings -- ho bios politikos -- are political
  in some sense.
 
 Of course. But that just trivializes the comment.

I took your reference to political concerns to be trivializing the
concerns, and pointed out that the very notion of concern is a political
one.  If there were no stakes, we could change Unicode daily according to
the best current notion of technical excellence.

Truth cannot conflict with truth, but interest can and commonly does
conflict with interest.

 Indeed. And for some more than others. Kludges tend to be, in my experience,
 penny-wise and pound foolish. So if you like, I'll restate my point as 'pay
 me now, or pay me (probably more) later'.

Alienate major customers now, or alienate a relatively small customer now.

-- 
But that, he realized, was a foolishJohn Cowan
thought; as no one knew better than he  [EMAIL PROTECTED]
that the Wall had no other side.http://www.ccil.org/~cowan
--Arthur C. Clarke, The Wall of Darkness



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread John Cowan
Michael Everson scripsit:

 No, but you're not making a technical argument, either.

The life of [Unicode] has not been logic but experience.
--Oliver Wendell Holmes, somewhat mutated

 Not when their core values -- correctness vs. stability -- are made to
 be at odds.
 
 And shifting a METEG in a normalization versioning is going to cause 
 what technical problem?

If we can change METEG today, we might change COMBINING ACUTE tomorrow.

-- 
Knowledge studies others / Wisdom is self-known;  John Cowan
Muscle masters brothers / Self-mastery is bone;   [EMAIL PROTECTED]
Content need never borrow / Ambition wanders blind;   www.ccil.org/~cowan
Vitality cleaves to the marrow / Leaving death behind.--Tao 33 (Bynner)



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread John Hudson
At 02:53 AM 6/27/2003, [EMAIL PROTECTED] wrote:

ISO: Then, obviously they need to correct their errors. I mean, it's not
like the wrong characters got encoded or something. Tell them to just fix
the errors; that can't be difficult to do, and is obviously the right
thing to do.
That seems to be exactly what Michael 'as a member of WG2' is saying.

What if the request to change the Hebrew combining classes came *from* W3C 
and/or IETF? I'm not saying that this is likely, but I'm wondering whether 
they might, in fact, not insist on stability for characters for which 
normalisation is currently broken anyway?

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
If you browse in the shelves that, in American bookstores,
are labeled New Age, you can find there even Saint Augustine,
who, as far as I know, was not a fascist. But combining Saint
Augustine and Stonehenge -- that is a symptom of Ur-Fascism.
- Umberto Eco



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread John Hudson
At 03:12 AM 6/27/2003, Michael Everson wrote:

Who is it who will kill the Unicode Consortium if UAX #15 were to be 
revised? Did it occur to anyone to *ask* about the possible revision of 
classes for the dozen or so instances that would be affected?
My understanding is that stability promises have been made to W3C and IETF 
(any others?). I'm also leaning toward asking these organisations if an 
exception can be made to fix a broken normalisation for Hebrew, given that 
the present normalisation is not useful. If the UTC doesn't want to make 
the enquiry, perhaps a consortium of Biblical Hebraicist academic 
organisations, publishers and software developers could take the matter to 
W3C and IETF and try to obtain a statement that would allow Unicode to make 
the change? Perhaps WG2 would also support such a petition?

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
If you browse in the shelves that, in American bookstores,
are labeled New Age, you can find there even Saint Augustine,
who, as far as I know, was not a fascist. But combining Saint
Augustine and Stonehenge -- that is a symptom of Ur-Fascism.
- Umberto Eco



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread Karljürgen Feuerherm
John Cowan said on June 27, 2003 at 12:48 PM

 Karljürgen Feuerherm scripsit:

   Several people have expressed reasons why this can't (practically) be
   done--which mainly seem to stem from political concerns.
  
   All concerns involving human beings -- ho bios politikos -- are
political
   in some sense.
 
  Of course. But that just trivializes the comment.

 I took your reference to political concerns to be trivializing the
 concerns,

That was not the intention, though perhaps it sounded that way. I take
your concerns every bit as seriously as I do those of the Biblical Hebrew
community.

If there were no stakes, we could change Unicode daily according to
 the best current notion of technical excellence.

'We' do in fact change Unicode (nearly) daily every time there is a
revision. The question at hand is where to draw the line. Your position is
crystal clear, and I am not questioning the reason why you make it, or the
value in it. But

 Alienate major customers now, or alienate a relatively small customer
now.

the relatively small customer has every right to argue his/her case, and to
hope for an implementation which will address his/her needs. The cost of
kludges vs. corrections is not in the least analogous to your statement:
both customers can--at least in theory--be satisfied, with some give on both
sides.

If Unicode 'botched it up the first time' (not that I'm necessarily saying
it did, but let's say so for the sake of argument), is it reasonable for the
major customers to insist that the solution lies in botching it further?
(and so on...)

I agree that stability is sometimes preferable to (not necessarily better
than) correctness. But a stable product which does not address the purpose
for which it was created is definitely not preferable to one which is
corrected to suit the purpose (the risks therein being acknowledged). Of
course, one may object that the present implementation was not created for
or to include BH in the first place, and that may be so (I'm happy to be
informed if it is, but that isn't the impression the discussion thus
far has given).

(If not, then one must ask whether the present implementation can be
reasonably extended (thus preserving the stability of the existing platform)
or whether one must create a new, parallel implementation for the new
purpose, [or some combination of the two] which is where most of the
discussion seems centred.)

K





Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread John Cowan
John Hudson scripsit:

 What if the request to change the Hebrew combining classes came *from* W3C 
 and/or IETF? I'm not saying that this is likely, but I'm wondering whether 
 they might, in fact, not insist on stability for characters for which 
 normalisation is currently broken anyway?

The normalization is not broken from the point of view of the stability
community.  They consider it more important that there be a fixed rule,
than what the content of the rule is.  Google for "stare decisis" for
much more on this point of view in general.

-- 
John Cowan  [EMAIL PROTECTED]  www.ccil.org/~cowan  www.reutershealth.com
If I have seen farther than others, it is because I am surrounded by dwarves.
--Murray Gell-Mann



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread John Hudson
At 05:48 AM 6/27/2003, Michael Everson wrote:

The W3C would also hit the roof if Unicode normalization changed radically.
I don't think anyone is proposing a *radical* change.
I have uploaded the relevant draft pages of the SBL Hebrew user manual to

http://www.tiro.com/transfer/SBLappendixB.pdf

This appendix provides suggested combining classes for customised 
normalisation routines, compared with Unicode normalisation routines. This 
has been tested by Libronix/Logos with the Michigan-Claremont electronic 
text of the _Biblia Hebraica Stuttgartensia_.

There are 17 marks whose combining class value should be corrected, of 
which the vowels and meteg are most important.
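(For the curious, a minimal Python sketch of what the reordering step of such a
customised normalisation routine might look like; the override value below is a
placeholder for illustration only, not one of the classes actually suggested in
the appendix:)

    import unicodedata

    # Hypothetical override table -- the real suggested classes are in the
    # appendix linked above; the value used here is just a placeholder.
    CUSTOM_CCC = {0x05BD: 10}   # e.g. give HEBREW POINT METEG a different class

    def ccc(ch):
        """Canonical combining class, with the local overrides applied."""
        return CUSTOM_CCC.get(ord(ch), unicodedata.combining(ch))

    def canonical_reorder(text):
        """Canonical Ordering Algorithm (exchange sort) over the custom classes."""
        chars = list(text)
        i = 1
        while i < len(chars):
            if ccc(chars[i]) != 0 and ccc(chars[i - 1]) > ccc(chars[i]):
                chars[i - 1], chars[i] = chars[i], chars[i - 1]
                i = max(i - 1, 1)
            else:
                i += 1
        return "".join(chars)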

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
If you browse in the shelves that, in American bookstores,
are labeled New Age, you can find there even Saint Augustine,
who, as far as I know, was not a fascist. But combining Saint
Augustine and Stonehenge -- that is a symptom of Ur-Fascism.
- Umberto Eco



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread Karljürgen Feuerherm
John Cowan said on June 27, 2003 at 12:56 PM

Michael Everson had said:
  This is not analogous to the present situation, it seems to me. In 
  the first place, what else is the \ for? :-)
 
 Escaping special characters, since you ask.

But in a completely different context.

K




Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread Peter_Constable
John Cowan wrote on 06/27/2003 08:24:35 AM:

 The IETF has an explicit contract with Unicode: "We'll use your
 normalization algorithm if you promise NEVER, NEVER to change
 the normalization status of a single character."  Unicode has already
 broken that promise four times, so its credibility is shaky.

Yeah, but what I don't get is that IETF doesn't set anything in stone 
until there are working implementations, but Unicode's canonical combining 
classes have to be set in stone for IETF's benefit before there are 
working implementations. I just have a hard time understanding that.


 So far I have not heard any compelling objections to CGJ except that
 invisible characters are fuggly.

I just sent a message discussing this.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485




Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread Karljürgen Feuerherm
Philippe Verdy said on June 27, 2003 at 12:38 PM
Subject: Re: Biblical Hebrew (Was: Major Defect in Combining Classes of
Tibetan Vowels)


 On Friday, June 27, 2003 5:53 PM, Karljürgen Feuerherm
[EMAIL PROTECTED] wrote:
  And in any case this should NOT muck things up which aren't broken,
  like MH.

 Not breaking Modern Hebrew means not changing the combining classes
 of the characters it uses.

 Adding a distinct set for Traditional Hebrew may then be the only practical
 solution

That was in effect basically what I was wondering about with my question.
Thanks

K





Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread John Hudson
Philippe said on June 27, 2003 at 10:25 AM

Do you then propose to create a specific character, for use within the
Hebrew script only, as a way to specify an alternate order for hebrew
cantillation? In that case, it would be more appropriate to define new
standard variants of these cantillation marks, and list them in the
supported variants, to be used specially for Biblic Hebrew.
The cantillation marks are pretty much okay: they will not be re-ordered 
during normalisation. There are three that should ideally have a 
postpositional combining class (see 
http://www.tiro.com/transfer/SBLappendixB.pdf), but the rest are fine.

The problem is with the vowels.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
If you browse in the shelves that, in American bookstores,
are labeled New Age, you can find there even Saint Augustine,
who, as far as I know, was not a fascist. But combining Saint
Augustine and Stonehenge -- that is a symptom of Ur-Fascism.
- Umberto Eco



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread Karljürgen Feuerherm
Peter replied:


 Karljürgen Feuerherm wrote on 06/27/2003 08:23:08 AM:

  Now, Q: I take it the combining classes are linked to the script, rather
  than say to a dialect

 They're linked to the character.

  --e.g. one can't define BH as a separate dialect from
  MH with its own set of rules?

 No, not unless BH is encoded with a separate set of characters.

I see. Not desirable, to say the least.

Thanks

K





Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread John Hudson
At 10:20 AM 6/27/2003, John Cowan wrote:

 What if the request to change the Hebrew combining classes came *from* W3C
 and/or IETF? I'm not saying that this is likely, but I'm wondering whether
 they might, in fact, not insist on stability for characters for which
 normalisation is currently broken anyway?
The normalization is not broken from the point of view of the stability
community.  They consider it more important that there be a fixed rule,
than what the content of the rule is.  Google for "stare decisis" for
much more on this point of view in general.
Fair enough. I made my suggestion before reading all of your exchange with 
Michael.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
If you browse in the shelves that, in American bookstores,
are labeled New Age, you can find there even Saint Augustine,
who, as far as I know, was not a fascist. But combining Saint
Augustine and Stonehenge -- that is a symptom of Ur-Fascism.
- Umberto Eco



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread Kenneth Whistler
Peter responded:
 
 Kenneth Whistler wrote on 06/26/2003 05:36:34 PM:
 
  Why is making use of the existing behavior of existing characters
  a groanable kludge, if it has the desired effect and makes
  the required distinctions in text?
 
 Why is it a kludge to insert some cc=0 control character into the text for 
 the sole purpose of preventing reordering during canonical ordering of two 
 combining marks that do interact typographically and so should, but 
 nevertheless do not, have the same combining class; and, moreover, to do so 
 using a control character that was not created for that purpose?
 
 The answer seems so obvious, I wouldn't know how to begin responding.

And others apparently had the same feeling. But I contend that
the reason this seems odd is because of the way you present
it to yourself and others.

It isn't a matter of "my text is o.k. the way I entered it, but
now I have to insert some invisible control character into the
text for the sole purpose of preventing reordering -- which wasn't
something I wanted to have happen in the first place."

Instead, it is that for Biblical Hebrew, the following textual
conventions are adopted:

   A sequence of patah followed by hiriq is represented by
   patah, CGJ, hiriq
   A sequence of hiriq followed by patah is represented by
   hiriq, CGJ, patah
   
Then you build keyboards (or other abstractions) that obey
those textual conventions.

You stop telling the Biblical Scholars that their text is
screwed up because of Unicode and they have to fix it by
inserting crazy control codes they don't know about, and
chances are they will stop believing that their text is
screwed up. :-)

This isn't really any stranger than telling someone that for Twi, the
following textual convention is adopted:

   An open o with an acute tone mark is represented by
   open-o, combining acute
   
As long as the pieces stay firmly attached for entry, display,
and searching, everybody is happy and nobody needs to be
the wiser about what gimmicks the programmers are
using under the covers.

And why should it be any stranger that maintenance of vowel
point order in Biblical Hebrew cases with multiple points
requires judicious use of an invisible combining mark like CGJ,
when maintenance of visible directional layout distinctions
for any Hebrew requires a boatload of invisible format controls?
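(A minimal sketch of the effect being described, in Python, using the standard
unicodedata module and the lamed + qamets + hiriq example from earlier in the
thread; the combining classes are those in the Unicode Character Database, and
CGJ is U+034F COMBINING GRAPHEME JOINER, class 0:)

    import unicodedata

    LAMED, QAMATS, HIRIQ, CGJ = "\u05DC", "\u05B8", "\u05B4", "\u034F"

    # Without CGJ, canonical ordering swaps the two points,
    # because ccc(hiriq) = 14 is lower than ccc(qamats) = 18.
    plain = LAMED + QAMATS + HIRIQ
    print([f"{ord(c):04X}" for c in unicodedata.normalize("NFC", plain)])
    # ['05DC', '05B4', '05B8']  -- hiriq now precedes qamats

    # With CGJ (class 0) between the points, the reordering is blocked.
    marked = LAMED + QAMATS + CGJ + HIRIQ
    print([f"{ord(c):04X}" for c in unicodedata.normalize("NFC", marked)])
    # ['05DC', '05B8', '034F', '05B4']  -- the original order is preserved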

 If we want to insert a control character to prevent reordering under 
 canonical ordering, I think it would be preferable to create a new control 
 character for just that purpose: 

How would that be less of a kludge? I contend that inventing
another invisible character *just* to do this is even more of
a kludge than what I have suggested, when use of an existing
character already has the desired effect.

The end effect of the impulse you are describing here would
be an attempt to create atomistic controls for each conceivable
text effect, and I think the UTC has already given up on
heading that direction. It is already bad enough trying
to keep straight all the possible interactions for the ones
already created, as demonstrated by the discoveries we just
made when trying to consider what happens if a ZWJ gets
plunked down *between* two combining marks.

 that would give a character that could be 
 used elsewhere for the very same purpose without needing to worry about 
 what unanticipated and undesirable effects might result by hijacking a 
 control created for some completely unrelated purpose. 

This was a more applicable criticism for the suggestions of RLM,
ZWJ, or WJ, since their very status as format controls instead
of as combining marks had undesirable effects on the combining
character sequences in question. I don't think the criticism applies
to CGJ, however, since that character doesn't have any
defined behavior other than what is needed here. And, as I
indicated in a separate response, I do not think using CGJ
for the purpose described in Biblical Hebrew is unrelated to
its intent. It is just that nobody had yet thought through a
scenario where it would prove useful between combining marks.

--Ken




Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread John Hudson
At 12:43 AM 6/26/2003, [EMAIL PROTECTED] wrote:

 The problem of combinations of vowels with meteg could be
 amenable to a similar approach. OR, one could propose just
 one additional meteq/silluq character, to make it possible
 to distinguish (in plain text) instances of left-side and
 right-side meteq placement, for example.
And the third position of meteg with hataf vowels? Introduce *two*
additional meteg/silluq characters?
No, that's a glyph ligation matter however you look at it. It could be made 
to work with either just a left meteg or also with a new right meteg, and 
can be inhibited with ZWNJ. This is not to say that I think encoding a 
distinct right meteg character is the best solution, only that it doesn't 
affect the medial meteg shaping.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
If you browse in the shelves that, in American bookstores,
are labeled New Age, you can find there even Saint Augustine,
who, as far as I know, was not a fascist. But combining Saint
Augustine and Stonehenge -- that is a symptom of Ur-Fascism.
- Umberto Eco



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread John Hudson
At 10:09 AM 6/26/2003, [EMAIL PROTECTED] wrote:

 The Meteg is a completely different issue. There is a small number of places
 where the Meteg is placed differently. Since it does not behave the same as
 the regular Meteg, and is thus visually distinguishable, it should be
 possible to add a character, as long as it is clearly named.

That is a potential solution, though it would have to be *two* additional
metegs.
Can you explain your thinking here, Peter? I agree that if the intention is 
to encode new Biblical Hebrew marks with revised combining classes, then 
two new metegs would be necessary if we want one left and one right. But if 
one were to accept the text encoding hack of a ZERO-WIDTH CANONICAL 
ORDERING INHIBITOR -- which seems less and less like a good idea, and more 
and more like a long-term embarrassment and, like ZWJ and ZWNJ, a pain in 
the neck for users who have every right to expect a sensible encoding that 
doesn't require such gymnastics --, then I think one would only need a new 
HEBREW POINT RIGHT METEG character, and let it be assumed that the existing 
meteg character is the left position form (its current combining class 
puts it after all vowels, I believe).

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
If you browse in the shelves that, in American bookstores,
are labeled New Age, you can find there even Saint Augustine,
who, as far as I know, was not a fascist. But combining Saint
Augustine and Stonehenge -- that is a symptom of Ur-Fascism.
- Umberto Eco



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread Mark Davis
Another consequence is that it separates the sequence into two
combining sequences, not one. Don't know if this is a serious problem,
especially since we are concerned with a limited domain with
non-modern usage, but I wanted to mention it.

Mark
__
http://www.macchiato.com
  Eppur si muove 

- Original Message - 
From: Kenneth Whistler [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Thursday, June 26, 2003 13:41
Subject: Re: Biblical Hebrew (Was: Major Defect in Combining Classes
of Tibetan Vowels)


 Peter replied to Karljürgen:

  Karljürgen Feuerherm wrote on 06/25/2003 08:31:41 PM:
 
   I was going to suggest something very similar, a ZW-pseudo-consonant of
   some kind, which would force each vowel to be associated with one
   consonant.
 
  An invisible *consonant* doesn't make sense because the problem involves
  more than just multiple written vowels on one consonant;

 I agree that we don't want to go inventing invisible consonants for
 this.

 BTW, there's already an invisible vowel (in fact a pair of them)
 that is unwanted by the stakeholders of the script it was
 originally invented for:

 U+17B4 KHMER VOWEL INHERENT AQ

 This is also (cc=0), so would serve to block canonical reordering
 if placed between two Hebrew vowel points. But I'm sure that if
 Peter thought the suggestion of the ZWJ for this was a groanable
 kludge, Biblical Hebraicists would probably not take lightly
 to the importation of an invisible Khmer character into their
 text representations. ;-)

  in fact, that is
  a small portion of the general problem. If we want such a character, it
  would notionally be a zero-width-canonical-ordering-inhibiter, and nothing
  more.

 The fact is that any of the zero-width format controls has the
 side-effect of inhibiting (or rather interrupting) canonical reordering
 if inserted in the middle of a target sequence, because of their
 own class (cc=0).

 I'm not particularly campaigning for ZWJ, by the way. ZWNJ or even
 U+FEFF ZWNBSP would accomplish the same. I just suggested ZWJ because
 it seemed in the ballpark. ZWNBSP would likely have fewer possible
 other consequences, since notionally it means just "don't break here,"
 which you wouldn't do in the middle of a Hebrew combining character
 sequence, anyway.

  And I don't particularly want to think about what happens when people
  start sticking this thing into sequences other than Biblical Hebrew (in
  Unicode, any sequence is legal).

 But don't forget that these cc=0 zero-width format controls already
 can be stuck into sequences other than Biblical Hebrew. In some
 instances they have defined semantics there (as for Arabic and
 Indic scripts), but in all cases they would *already* have the
 effect of interrupting canonical reordering of combining character
 sequences if inserted there.

 --Ken








Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread Kenneth Whistler
Peter responded:

 Ken Whistler wrote on 06/25/2003 06:57:56 PM:
 
  People could consider, for example, representation
  of the required sequence:
  
lamed, qamets, hiriq, final mem
  
  as:
  
lamed, qamets, ZWJ, hiriq, final mem
 
 So, we want to introduce yet *another* distinct semantic for ZWJ?

Actually, no, I don't. That was just the first candidate that
came to mind.
 
 We've 
 got one for Indic, another for Arabic, another for ligatures (similar to 
 that for Arabic, but slightly different). Now another that is 'don't 
 affect any visual change, just be there to inhibit reordering under 
 canonical ordering / normalization'?

As I pointed out in a separate response, just putting the ZWJ
there would *already* interrupt the reordering of the sequence.
There is nothing new about that. The problem is that you might
not be able to count on it not effecting a visual change,
because the generic meaning of ZWJ is now intended to be
a ligation request, which does have visual consequences.
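
That interruption is easy to verify with a minimal Python sketch (standard
unicodedata module only; the lamed/qamets/hiriq sequence is the example
from earlier in the thread):

    import unicodedata

    LAMED, QAMATS, HIRIQ, ZWJ = "\u05DC", "\u05B8", "\u05B4", "\u200D"

    plain = LAMED + QAMATS + HIRIQ
    print([f"{ord(c):04X}" for c in unicodedata.normalize("NFD", plain)])
    # -> ['05DC', '05B4', '05B8']: hiriq (cc=14) reorders before qamats (cc=18)

    joined = LAMED + QAMATS + ZWJ + HIRIQ
    print([f"{ord(c):04X}" for c in unicodedata.normalize("NFD", joined)])
    # -> ['05DC', '05B8', '200D', '05B4']: ZWJ (cc=0) interrupts the
    #    reordering, so the qamats-before-hiriq order survives normalization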

I now like better the suggestions of RLM or WJ for this. Both
of those format controls, by *definition*, should have no
impact on visual display in this context, the RLM because it
would be inserted between two NSM's that pick up strong
R-to-L directionality from the consonant, and the WJ
because it would be inserted at a position where there already
is no word/line break opportunity. But either of them,
by their current definition and properties, would break the
sequences for canonical reordering. So they already have
the semantics of the putative new control in question: no
effect on visual display, while inhibiting the canonical
reordering of the point sequence.

  The presence of a ZWJ (cc=0) in the sequence would block
  the canonical reordering of the sequence to hiriq before
  qamets. If that is the essence of the problem needing to
  be addressed, then this is a much simpler solution which would
  impact neither the stability of normalization nor require
  mass cloning of vowels in order to give them new combining
  classes.
 
 Yes, it would accomplish all that; and it is a groanable kludge.

Why is making use of the existing behavior of existing characters
a groanable kludge, if it has the desired effect and makes
the required distinctions in text? If there is not some
rendering system or font lookup showstopper here, I'm inclined
to think it's a rather elegant way out of the problem.

 At least with
 having distinct vowel characters for Biblical Hebrew, we'd come to a point
 where we could forget about it, and wouldn't be wincing every time we
 considered it.

Au contraire. We'll be wincing forever over this one. There's
no way of getting around the fact that this is merely a cloning
of the whole set of points in order to have candidates for
a reassigned set of combining classes.

You're stuck between a rock and a hard place on this one.

The UTC cannot entertain merely fixing the existing combining
class assignments, because that would break the normalization stability
guarantee. We've all come to acknowledge this, and most of us to accept
it, even though it still elicits groans.

But in the 10646 WG2 context, coming in with a duplicate set
of Hebrew points is not going to make any sense, because, as
someone (John Cowan?) has already pointed out, 10646 doesn't
assign combining classes, and so trying to justify character
cloning on the basis of distinct combining class assignments
isn't going to make any sense there. You can always come in
with the proposal to encode BIBLICAL HEBREW POINT PATAH and
say, 'even though the glyph is identical, see, the name is
different, so the character is different.' But this is a pretty
thin disguise, and it is vulnerable to simple questioning:
'What is it for?' 'Well, to point Biblical Hebrew texts.' 'But
what was U+05B7 HEBREW POINT PATAH for?' 'Well, to point Biblical
Hebrew texts (or any Hebrew text, for that matter...).' 'Well
then, what is the difference?' 'Uh, the combining classes for
the two are different.' 'What is a combining class?' ... and
so on.

I'm trying to find a way, using existing characters and a
simple set of text representational conventions, to make
the distinctions and preserve the order relations that you
need for decent font lookup, without the whole enterprise
washing up on either of those two rocks.

--Ken




Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread John Hudson
At 02:45 PM 6/26/2003, Mark Davis wrote:

Another consequence is that it separates the sequence into two
combining sequences, not one. Don't know if this is a serious problem,
especially since we are concerned with a limited domain with
non-modern usage, but I wanted to mention it.
It is a serious problem if 'separate combining sequences' means, as it seems 
to in all the current apps I have tested, that marks separated by one of 
these control characters cannot be correctly positioned relative to a 
preceding consonant. Insertion of any zero-width control character between 
two marks applied to the same Hebrew consonant results in a loss of 
interaction between the marks (i.e. the first mark is not repositioned to 
accommodate the second), and the second mark loses all positioning 
intelligence and falls between the consonant and the next one. My guess is 
that the layout engine (Uniscribe in this case) makes the reasonable 
assumption that the two combining sequences do not interact.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
If you browse in the shelves that, in American bookstores,
are labeled New Age, you can find there even Saint Augustine,
who, as far as I know, was not a fascist. But combining Saint
Augustine and Stonehenge -- that is a symptom of Ur-Fascism.
- Umberto Eco



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread Michael Everson
At 15:36 -0700 2003-06-26, Kenneth Whistler wrote:

I now like better the suggestions of RLM or WJ for this.
ZZZT. Thank you for playing.

RLM is for forcing the right behaviour for stops and parentheses and 
question marks and so on. Introducing it between two combining 
characters in Hebrew text would break all kinds of things, and would 
be horrible, horrible, horrible. Invent a new control character for 
this weird property-killer, if you must, but don't use an ordering 
mark for it.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread John Hudson
At 03:36 PM 6/26/2003, Kenneth Whistler wrote:

Why is making use of the existing behavior of existing characters
a groanable kludge, if it has the desired effect and makes
the required distinctions in text? If there is not some
rendering system or font lookup showstopper here, I'm inclined
to think it's a rather elegant way out of the problem.
I think assumptions about not breaking combining mark sequences may, in 
fact, be a showstopper. If base+mark+mark becomes 
base+mark+CtrlChar+mark, it is reasonable to think that this will not 
only inhibit mark re-ordering but also mark combining and mark 
interaction. Unfortunately, this seems to be the case with every control 
character I have been able to test, using two different rendering engines 
(Uniscribe and InDesign ME -- although the latter already has some problems 
with double marks in Biblical Hebrew). Perhaps we should have a specific 
COMBINING MARK SEQUENCE CONTROL character?

All that said, I disagree with Ken that this is anything like an elegant 
way out of the problem. Forcing awkward, textually illogical and easily 
forgettable control character usage onto *users* in order to solve a problem 
in the Unicode Standard is not elegant, and it is unlikely to do much for 
the reputation of the standard.

Q: 'Why do I have to insert this control character between these points?'
A: 'To prevent them from being re-ordered.'
Q: 'But why would they be re-ordered anyway? Why wouldn't they just stay in 
the order I put them in?'
A: 'Because Unicode normalisation will automatically re-order the points.'
Q: 'But why? Points shouldn't be re-ordered: it breaks the text.'
A: 'Yes, but the people who decided how normalisation should work for 
Hebrew didn't know that.'
Q: 'Well can't they fix it?'
A: 'They have: they've told you that you have to insert this control 
character...'
Q: 'But *I* didn't make the mistake. Why should I have to be the one to 
mess around with this annoying control character?'

... and so on.

Much as the duplication of Hebrew mark encoding may be distasteful, and 
even considering the work that will need to be done to update layout 
engines, fonts and documents to work with the new mark characters, I agree 
with Peter Constable that this is by far the best long term solution, 
especially from a *user* perspective. Over the past two months I have been 
over this problem in great detail with the Society of Biblical Literature 
and their partners in the SBL Font Foundation. They understand the problems 
with the current normalisation, and they understand that any solution is 
going to require document and font revisions; they're resigned to this, and 
they've worked hard to come up with combining class assignments that would 
actually work for all consonant + mark(s) sequences encountered in Biblical 
Hebrew. This work forms the basis of the proposal submitted by Peter 
Constable. Encoding of new Biblical Hebrew mark characters provides a 
relatively simple update path for both documents and fonts, since it 
largely involves one-to-one mappings from old characters to new.

Conversely, insisting on using control characters to manage mark ordering 
in texts will require analysis to identify those sequences that will be 
subject to re-ordering during normalisation, and individual insertion of 
control characters. The fact that these control characters are invisible 
and not obvious to users transcribing text, puts an additional burden on 
application and font support, and adds another level of complexity to using 
what are already some of the most complicated fonts in existence (how many 
fonts do you know that come with 18 page user manuals?). I think it is 
unreasonable to expect Biblical scholars to understand Unicode canonical 
ordering to such a deep level that they are able to know where to insert 
control characters to prevent a re-ordering that shouldn't be happening in 
the first place.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
If you browse in the shelves that, in American bookstores,
are labeled New Age, you can find there even Saint Augustine,
who, as far as I know, was not a fascist. But combining Saint
Augustine and Stonehenge -- that is a symptom of Ur-Fascism.
- Umberto Eco



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread Kenneth Whistler
Michael wrote:

 At 15:36 -0700 2003-06-26, Kenneth Whistler wrote:
 
 I now like better the suggestions of RLM or WJ for this.
 
 ZZZT. Thank you for playing.
 
 RLM is for forcing the right behaviour for stops and parentheses and 
 question marks and so on. Introducing it between two combining 
 characters in Hebrew text would break all kinds of things,

True, apparently, but not for the reasons you surmise.

RLM does not force behavior on things. It is a strong
right-to-left context that can change the resolved directionality
of neutrals or weak types next to it. In between two
characters that are already R, the presence or absence of an
RLM is basically a no-op for bidi.

Just considering the bidi algorithm, a sequence:

  lamed, patah, RLM, hiriq
    R     NSM    R    NSM
  
would have the resolved directions: R, R, R, R, effectively no
different than the resolved direction: R, R, R of the sequence
without the RLM.
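
The underlying Bidi_Class values can be checked with a minimal Python
sketch (standard unicodedata only; it shows the raw character classes,
not the UAX #9 resolution step, which the standard library does not
implement):

    import unicodedata

    for name, ch in [("lamed", "\u05DC"), ("patah", "\u05B7"),
                     ("RLM",   "\u200F"), ("hiriq", "\u05B4")]:
        print(name, unicodedata.bidirectional(ch))
    # lamed R
    # patah NSM
    # RLM   R
    # hiriq NSM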

The problem arises when you go to consider the graphic application
of the combining mark to its base form, and for that, the issue
is apparently the same for the WJ, ZWJ, or any other format
control in such a position. So this is nothing to do with the
bidi function of RLM. 

 and would 
 be horrible, horrible, horrible. Invent a new control character for 
 this weird property-killer, if you must, but don't use an ordering 
 mark for it

If you invent a new control character for this weird property-killer
(which it wouldn't be, since in any case, I'm just talking about
inserting a (cc=0) character in between two other characters, not
changing or killing any properties), you still end up with exactly
the same problem of graphic application, because the
presence of any format control creates a defective combining
character sequence which applications (apparently) won't display.

--Ken





Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread Kenneth Whistler
John,

 At 03:36 PM 6/26/2003, Kenneth Whistler wrote:
 
 Why is making use of the existing behavior of existing characters
 a groanable kludge, if it has the desired effect and makes
 the required distinctions in text? If there is not some
 rendering system or font lookup showstopper here, I'm inclined
 to think it's a rather elegant way out of the problem.
 
 I think assumptions about not breaking combining mark sequences may, in 
 fact, be a showstopper. If base+mark+mark becomes 
 base+mark+CtrlChar+mark, it is reasonable to think that this will not 
 only inhibit mark re-ordering but also mark combining and mark 
 interraction. Unfortunately, this seems to be the case with every control 
 character I have been able to test, using two different rendering engines 
 (Uniscribe and InDesign ME -- although the latter already has some problems 
 with double marks in Biblical Hebrew). Perhaps we should have a specific 
 COMBINING MARK SEQUENCE CONTROL character?

Actually, in casting around for the solution to the problem of
introduction of format controls creating defective combining
character sequences, it finally occurred to me that:

U+034F COMBINING GRAPHEME JOINER

has the requisite properties.

It is non-visible, does not affect the display of neighboring
characters (except incidentally, if processes choose to recognize
sequences containing it and process them distinctly), *AND*
it is a *combining mark*, not a format control.

Hence, the sequence:

lamed, patah, CGJ, hiriq
  0     17     0    14

is *not* a defective combining character sequence, by the
definitions in the standard. The entire sequence of three
combining marks would have to apply to the lamed, but
the fact that CGJ has (cc=0) prevents the patah from reordering
around the hiriq under normalization.

Could this finally be the missing killer app for the CGJ?
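
A minimal Python sketch (standard unicodedata only) of the behaviour being
claimed for CGJ here:

    import unicodedata

    LAMED, PATAH, HIRIQ, CGJ = "\u05DC", "\u05B7", "\u05B4", "\u034F"

    # CGJ is a combining mark with class 0, sitting between patah (17)
    # and hiriq (14):
    print(unicodedata.combining(PATAH),
          unicodedata.combining(CGJ),
          unicodedata.combining(HIRIQ))     # 17 0 14

    no_cgj   = unicodedata.normalize("NFC", LAMED + PATAH + HIRIQ)
    with_cgj = unicodedata.normalize("NFC", LAMED + PATAH + CGJ + HIRIQ)
    print([f"{ord(c):04X}" for c in no_cgj])
    # -> ['05DC', '05B4', '05B7']: hiriq reorders before patah
    print([f"{ord(c):04X}" for c in with_cgj])
    # -> ['05DC', '05B7', '034F', '05B4']: order preserved across the CGJ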

 
 All that said, I disagree with Ken that this is anything like an elegant 
 way out of the problem. Forcing awkward, textually illogical and easily 
 forgetable control character usage onto *users* in order to solve a problem 
 in the Unicode Standard is not elegant, and it is unlikely to do much for 
 the reputation of the standard.

I don't understand this contention. There is no reason, in principle,
why this has to be surfaced to end users of Biblical Hebrew, any
more than the messy details of embedding and override controls have to be
surfaced to end users in order to make an interface that supports end-user
control over direction in bidirectional text.

If CGJ is the one, then the only *real* implementation requirement would
be that CGJ be consistently inserted (for Biblical Hebrew) between
any pair of points applied to the same consonant. Depending on the
particular application, this could either be hidden behind the
input method/keyboard and be actively managed by the software, or
it could be applied as a filter to an export format, when exporting
to contexts that might neutralize intended contrasts or result in
the wrong display by the application of normalization.
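
A rough sketch of what such an export filter might look like (Python,
standard unicodedata only; the helper names and the exact set of marks it
separates are illustrative choices, not part of any proposal):

    import unicodedata

    CGJ = "\u034F"

    def is_hebrew_point(ch):
        # Hebrew points (sheva through sin dot) carry the fixed-position
        # combining classes 10..25 and live in the Hebrew block.
        return ("\u0590" <= ch <= "\u05FF"
                and 10 <= unicodedata.combining(ch) <= 25)

    def insert_cgj(text):
        """Insert CGJ between adjacent Hebrew points on the same consonant,
        so canonical reordering cannot change their relative order."""
        out = []
        for ch in text:
            if out and is_hebrew_point(ch) and is_hebrew_point(out[-1]):
                out.append(CGJ)
            out.append(ch)
        return "".join(out)

    # With the CGJ inserted, normalization no longer reorders the points:
    sample = "\u05DC\u05B8\u05B4"        # lamed, qamets, hiriq
    print(unicodedata.normalize("NFC", sample) == sample)       # False
    protected = insert_cgj(sample)
    print(unicodedata.normalize("NFC", protected) == protected) # True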

 
 Q: 'Why do I have to insert this control character between these points?'
 A: 'To prevent them from being re-ordered.'
 Q: 'But why would they be re-ordered anyway? Why wouldn't they just stay in 
 the order I put them in?'
 A: 'Because Unicode normalisation will automatically re-order the points.'
 Q: 'But why? Points shouldn't be re-ordered: it breaks the text.'
 A: 'Yes, but the people who decided how normalisation should work for 
 Hebrew didn't know that.'
 Q: 'Well can't they fix it?'
 A: 'They have: they've told you that you have to insert this control 
 character...'

And that whole dialogue should be limited to the *programmers* only,
whose job it is then to hide the details of how they get the
magic to work from people who would find those details just confusing.

 Q: 'But *I* didn't make the mistake. Why should I have to be the one to 
 mess around with this annoying control character?'
 
 ... and so on.
 
 Much as the duplication of Hebrew mark encoding may be distasteful, and 
 even considering the work that will need to be done to update layout 
 engines, fonts and documents to work with the new mark characters, I agree 
 with Peter Constable that this is by far the best long term solution, 
 especially from a *user* perspective. 

I have to disagree. It should be largely irrelevant to the user perspective.
In this case (as in others) the users are the experts about what their
expected requirements are for text behavior, and in particular, what
distinctions need to be maintained. But they should not be expected
to define the technical means for fulfilling those requirements, nor
lean over the shoulders of the engineers to tell them how to write
the software to accomplish it.

 Over the past two months I have been 
 over this problem in great detail with the Society of Biblical Literature 
 and their partners in the SBL Font Foundation. They understand the problems 
 with the current 

Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-25 Thread John Hudson
At 04:57 PM 6/25/2003, Kenneth Whistler wrote:

And I hate to have to continue being Mr. Negativity on this
list, but I remain unconvinced that the proposed solution
(of cloning 14 Hebrew points and vowels) just to fix an
unpreferred canonical reordering result represents the
sole remaining alternative. In this case, I believe the
side-effects of the proposed medicine are worse than the
disease itself.
I didn't say I like the proposed solution, only that I've not heard of 
another one that works and is acceptable to the UTC.

For example, the alleged problem of the vocalization order of
the Masoretes might be amenable to a much less drastic
solution. People could consider, for example, representation
of the required sequence:
  lamed, qamets, hiriq, final mem

as:

  lamed, qamets, ZWJ, hiriq, final mem

and then map qamets, ZWJ, hiriq to the required glyph
to get the hiriq to display to the left (and
partly under the following final mem).
There are a few problems with this scenario. One is that control characters 
are unreliable agents in glyph-level processing. Most applications do not 
paint control character glyphs, which means that they do not appear in 
glyph strings and so cannot be used in glyph substitution lookups. This seems to 
be a pretty much universal assumption about control characters. MS Word 
offers the option of turning on display of control characters, but then the 
purpose is to be able to see them in text, not to affect the text by 
toggling the display option. Arguably, there are implementation options 
that would overcome this problem, but they are complicated and the present 
assumption seems pretty universal.

That said, I would be willing to explore this idea further, since I don't 
think it is necessary to get into glyph substitution involving ZWJ if the 
presence of ZWJ in the character string always blocks canonical reordering. 
In the example I gave, simply preventing the hiriq, for example, from being 
re-ordered should be enough to make it render correctly under the right 
side of the final mem. However, this example is something of an exceptional 
rendering, currently involving a special /HiriqFinalMem/ glyph. I would 
need to check all the other affected sequences to confirm whether inserting 
ZWJ causes mark positioning problems (I know it will in some applications, 
simply because support for ZWJ isn't always very good). The frustration is 
that although ZWJ cannot be reliably used in glyph substitution lookups, 
its presence can break glyph positioning lookups.

Thanks for the idea, though. I think it is worth exploring.

The problem of combinations of vowels with meteg could be
amenable to a similar approach. OR, one could propose just
one additional meteq/silluq character, to make it possible
to distinguish (in plain text) instances of left-side and
right-side meteq placement, for example.
Yes, that is an option for the meteg/silluq regardless of how the vowel 
ordering problem is addressed.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
If you browse in the shelves that, in American bookstores,
are labeled New Age, you can find there even Saint Augustine,
who, as far as I know, was not a fascist. But combining Saint
Augustine and Stonehenge -- that is a symptom of Ur-Fascism.
- Umberto Eco



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-25 Thread Kenneth Whistler
John Hudson wrote:

 At 02:36 PM 6/25/2003, Michael Everson wrote:
 
 Write it up with glyphs and minimal pairs and people will see the problem, 
 if any. Or propose some solution. (That isn't add duplicate characters.)
 
 Peter Constable has written this up and submitted a proposal to the UTC. 

And I hate to have to continue being Mr. Negativity on this
list, but I remain unconvinced that the proposed solution
(of cloning 14 Hebrew points and vowels) just to fix an
unpreferred canonical reordering result represents the
sole remaining alternative. In this case, I believe the
side-effects of the proposed medicine are worse than the 
disease itself.

For example, the alleged problem of the vocalization order of
the Masoretes might be amenable to a much less drastic
solution. People could consider, for example, representation
of the required sequence:

  lamed, qamets, hiriq, final mem
  
as:

  lamed, qamets, ZWJ, hiriq, final mem
  
and then map qamets, ZWJ, hiriq to the required glyph
to get the hiriq to display to the left (and
partly under the following final mem).

The presence of a ZWJ (cc=0) in the sequence would block
the canonical reordering of the sequence to hiriq before
qamets. If that is the essence of the problem needing to
be addressed, then this is a much simpler solution which would
impact neither the stability of normalization nor require
mass cloning of vowels in order to give them new combining
classes.

Effectively what would be needed would be an agreement by
Biblical Hebraicists on a text representational convention
using existing characters. By doing so, they would gain both
the required orderings and the ability to make the distinctions
they want.

If use of a ZWJ (or something similar) seems alien to
Hebrew specialists, then, as always, the details can be
hidden behind input method and keyboard
front ends. The use of a ZWJ should not impact searches
on data (if the searches are properly implemented), unless
the search is explicitly concerned about the distinctions --
in which case there actually *is* a difference in the text
representation which can be searched for. 
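
For instance, a properly implemented search might simply fold the invisible
character away before comparing (a minimal Python sketch; the set of
ignored characters and the helper name are illustrative):

    import unicodedata

    # Invisible separators a point-order-agnostic search could ignore:
    # ZWNJ, ZWJ, WJ, ZWNBSP, CGJ.
    IGNORED = {"\u200C", "\u200D", "\u2060", "\uFEFF", "\u034F"}

    def search_key(text):
        """Drop the invisible separators, then normalize, so that text
        encoded with and without them compares equal."""
        stripped = "".join(ch for ch in text if ch not in IGNORED)
        return unicodedata.normalize("NFD", stripped)

    a = "\u05DC\u05B8\u200D\u05B4"   # lamed, qamets, ZWJ, hiriq
    b = "\u05DC\u05B8\u05B4"         # the same letters without the ZWJ
    print(search_key(a) == search_key(b))   # True

    # A search that *does* care about the distinction just compares the
    # normalized strings without stripping, and the two differ:
    print(unicodedata.normalize("NFD", a) ==
          unicodedata.normalize("NFD", b))  # False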

The problem of combinations of vowels with meteg could be
amenable to a similar approach. OR, one could propose just
one additional meteq/silluq character, to make it possible
to distinguish (in plain text) instances of left-side and
right-side meteq placement, for example.

--Ken




Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-25 Thread Kenneth Whistler

 For example, the alleged problem of the vocalization order of
 the Masoretes might be amenable to a much less drastic
 solution. People could consider, for example, representation
 of the required sequence:
 
lamed, qamets, hiriq, final mem
 
 as:
 
lamed, qamets, ZWJ, hiriq, final mem
 
 and then map qamets, ZWJ, hiriq to the required glyph
 to get the hiriq to display to the left (and
 partly under the following final mem).
 
 There are a few problems with this scenario. One is that control characters 
 are unreliable agents in glyph-level processing. Most applications do not 
 paint control character glyphs, which means that they do not appear in 
 glyph strings and so cannot be used in glyph substitution lookups. 

Even if the ZWJ is stripped by the application before the actual
low-level paint API is called, so that instead of

lamed, qamets, ZWJ, hiriq, final mem

the renderer just sees

lamed, qamets, hiriq, final mem

you still end up with the order you need to make the distinction.

The only problem would be if an application first stripped
the ZWJ and then *before* calling the paint operation, proceeded
to normalize the control-stripped glyph string. That would, however,
strike me as being arguably non-conformant with the intent
of the standard and the intent of normalization. You might end
up with very strange behavior if applications started normalizing
glyph strings after stripping them of format controls.
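
The difference between the two orders of operations is easy to see in a
minimal Python sketch (standard unicodedata only):

    import unicodedata

    ZWJ = "\u200D"
    encoded = "\u05DC\u05B8\u200D\u05B4"    # lamed, qamets, ZWJ, hiriq

    # Strip the format control first, then normalize: the intended
    # qamets-before-hiriq order is lost.
    stripped_first = unicodedata.normalize("NFC", encoded.replace(ZWJ, ""))
    print([f"{ord(c):04X}" for c in stripped_first])
    # -> ['05DC', '05B4', '05B8']: hiriq has moved before qamets

    # Normalize the character stream first, then strip before painting:
    # the original order survives.
    normalized_first = unicodedata.normalize("NFC", encoded).replace(ZWJ, "")
    print([f"{ord(c):04X}" for c in normalized_first])
    # -> ['05DC', '05B8', '05B4']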

--Ken




Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-25 Thread John Hudson
At 06:22 PM 6/25/2003, Kenneth Whistler wrote:

Even if the ZWJ is stripped by the application before the actual
low-level paint API is called, so that instead of
lamed, qamets, ZWJ, hiriq, final mem

the renderer just sees

lamed, qamets, hiriq, final mem

you still end up with the order you need to make the distinction.
Yes. That works. My biggest worry with the ZWJ is that it may affect 
positioning lookups; this requires some experimentation.

Whatever solution is finally adopted, changing texts and fonts is 
relatively simple. Changing layout engines and applications is harder and 
takes longer. We're pretty much resigned to updating texts and fonts.

I'll discuss the ZWJ idea with our project partners, but if it does affect 
positioning lookups it is not something that will get adopted until that is 
resolved. The potential problem is that the ZWJ is used for so many 
different things now that it is difficult to know exactly what applications 
will do with it. If the intent, as in this instance, is simply to prevent 
character re-ordering, should we really be using something so loaded? 
Perhaps we need a control character specifically as a canonical order 
override: something with cc=0 but no other behaviour associated with it.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
If you browse in the shelves that, in American bookstores,
are labeled New Age, you can find there even Saint Augustine,
who, as far as I know, was not a fascist. But combining Saint
Augustine and Stonehenge -- that is a symptom of Ur-Fascism.
- Umberto Eco