At 16:15 06/07/2003, Peter Kirk wrote:

> I have a couple of points to make now on this issue. First, it might
> help to get an idea of the scale of the problem. In the WTS encoded text
> of the BHS Hebrew Bible, which comes to 5.25 MB in UTF-8, so a million
> or so vowel points, there are just 637 instances of two vowel points on
> one consonant. Of these, 636 are the word Yerushala(y)im, in four
> slightly different forms including two with the directional he suffix.
> The one additional instance is in the word mittaxat in Exodus 20:4,
> which has a double vowel for a rather different reason - alternative
> pronunciations of the word.

Thanks for the thoughtful analysis, Peter. Eli Evans and I have been documenting all of the unique mark sequences in the Michigan-Claremont text and WTS morphology database that are potentially incorrectly re-ordered in Unicode normalisation (I say potentially, because the fixed position combining classes may, by chance, not reorder some combinations of vowels). In addition to the <patah, hiriq> and <qamats, hiriq> double vowel sequences for Yerushala(y)im, the example you cite from Exodus 20:4 involves two vowels with an interposed cantillation mark -- <qamats, etnahta, patah> -- which needs to be renderable both with and without the cantillation. The WTS morphology database also includes a <tsadi, sheva, hiriq> sequence (in 2 Ch 13:14, last word) that is not attested in either BHS or BHL; Peter Constable enquired about this, since it seemed that it might be an error, but the WTS editors assured him that it was intentional. One thing we have not checked yet is whether there are any attested examples of cantillation marks that normally appear to the left of vowels occurring to the right. This seems unlikely, but nothing would surprise me about Biblical manuscripts, and such mark ordering would be affected by normalisation, so it should be checked and, hopefully, confirmed not to be an issue.
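
To make the re-ordering concrete, here is a minimal sketch using Python's standard unicodedata module. It checks which of the sequences mentioned above are changed by normalisation. The helper names are hypothetical, and the lamed used as the base for the Exodus 20:4 sequence is only a placeholder consonant, not the letter in the actual text. Results depend on the Unicode data shipped with the Python build, although the Hebrew combining classes are fixed values.

```python
import unicodedata

# Code points for the characters named in the discussion.
LAMED, TSADI = "\u05DC", "\u05E6"
SHEVA, HIRIQ, PATAH, QAMATS = "\u05B0", "\u05B4", "\u05B7", "\u05B8"
ETNAHTA = "\u0591"  # cantillation mark

def combining_classes(seq):
    """Return the canonical combining class of each character in seq."""
    return [unicodedata.combining(ch) for ch in seq]

def is_reordered(seq):
    """True if NFC normalisation changes the encoded order of seq."""
    return unicodedata.normalize("NFC", seq) != seq

# Placeholder base consonant (lamed) for the last sequence: an assumption.
samples = {
    "patah, hiriq": LAMED + PATAH + HIRIQ,
    "qamats, hiriq": LAMED + QAMATS + HIRIQ,
    "tsadi, sheva, hiriq": TSADI + SHEVA + HIRIQ,
    "qamats, etnahta, patah": LAMED + QAMATS + ETNAHTA + PATAH,
}

for name, seq in samples.items():
    print(name, combining_classes(seq), "reordered:", is_reordered(seq))
```

Marks are sorted by ascending combining class, so <patah, hiriq> (17, 14) is swapped, while <sheva, hiriq> (10, 14) happens to be in ascending order already and survives. That is the "by chance" case.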


While I agree that the number of textual instances (in the known Ben Asher texts, at least) that are affected by the combining class problem is very small, and that re-encoding Hebrew vowels may be overkill as a solution, I'm not crazy about the proposed CGJ solution, because I'm not convinced that I'm going to see CGJ support any time soon. Given the small number of attested sequences that would be adversely affected by normalisation re-ordering, I'm beginning to favour the idea of encoding these sequences as individual characters. We'd probably only need three or four, plus a right meteg, to solve the problem, and rendering would work fine with existing font and layout engine technologies.
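
For reference, the CGJ approach can be sketched as follows, assuming it means placing U+034F COMBINING GRAPHEME JOINER between the two vowels. CGJ has combining class 0, so normalisation does not move marks across it. This shows only the encoding side, with hypothetical variable names. Whether fonts and layout engines render the guarded sequence correctly is a separate question, and the one at issue here.

```python
import unicodedata

LAMED, HIRIQ, PATAH = "\u05DC", "\u05B4", "\u05B7"
CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0

plain = LAMED + PATAH + HIRIQ          # vowels in the intended order
guarded = LAMED + PATAH + CGJ + HIRIQ  # same, with CGJ between the vowels

# Without CGJ the vowels are swapped; with CGJ the order is preserved.
plain_survives = unicodedata.normalize("NFC", plain) == plain
guarded_survives = unicodedata.normalize("NFC", guarded) == guarded

print("plain survives NFC:", plain_survives)
print("guarded survives NFC:", guarded_survives)
```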

Of course, I still hold out the faint hope that bodies like W3C and the IETF will say it is okay for Unicode to correct the existing combining classes and actually fix the problem at source.

John Hudson

Tiro Typeworks          www.tiro.com
Vancouver, BC           [EMAIL PROTECTED]

The sight of James Cox from the BBC's World at One,
interviewing Robin Oakley, CNN's man in Europe,
surrounded by a scrum of furiously scribbling print
journalists will stand for some time as the apogee of
media cannibalism.
                        - Emma Brockes, at the EU summit



