[quoted lines by Lee Maschmeyer on 2014/04/11 at 19:25 -0400]
>In my naiveté I thought a font had to do with how a character was
>displayed on the screen, not with what the character itself is; and I
>thought brltty snagged the character before it went to the font. I
>sure _hope_ that's the way it is. :-))
That's not exactly how it works with Unicode. Maybe, though, my use of the term "font" isn't technically correct; I'm not sure what the Unicode term for it is. All those different types of letters, for example, as listed in my previous message, are defined within distinct Unicode codepoint ranges, each of which defines a particular style to be used for the letters. Unicode knows that they're compatible sets of characters, and there's a way to do what Unicode calls string normalization. That, in theory, is the solution to this problem, except for one (to us) very important limitation.

Unicode can define a character in one of two ways: composed and decomposed. This is particularly significant, for example, for languages which use letters with accents. You used the word "naiveté", for example, above. The last letter in that word is a lowercase e with an acute accent. Its composed form is the single codepoint U+00E9, and its decomposed form is the two-codepoint sequence U+0065 U+0301 (a plain lowercase e followed by a "combining" acute accent). The text on the screen, or the text in a file, can use either scheme for any character. In other words, if the text contains more than one composite character, some may be composed while others are decomposed.

What we'd need to do is normalize the text before contracting it by forcing all the characters to be composed. That's easy enough to do with standard functions, except that those functions don't return offset information. In other words, they don't make it easy for us to map the start of a decomposed character in the source text to its corresponding composed character in the normalized text. Unless I find a way, what I may end up having to do is figure out how to do our own normalization so that we can keep track of the offset information.
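For what it's worth, the composed/decomposed distinction, and the styled-letter folding, can be seen directly with Python's standard unicodedata module (just an illustration; BRLTTY itself is written in C and wouldn't use this):

```python
import unicodedata

# The word "naiveté" with its final letter in decomposed form:
# a plain "e" (U+0065) followed by a combining acute accent (U+0301).
decomposed = "naivete\u0301"

# NFC normalization composes the pair into the single codepoint U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))              # 8 codepoints
print(len(composed))                # 7 codepoints
print(composed == "naivet\u00e9")   # True

# Compatibility normalization (NFKC) also folds the styled letters back
# to the ordinary ones, e.g. MATHEMATICAL BOLD CAPITAL A (U+1D400) -> "A".
print(unicodedata.normalize("NFKC", "\U0001D400"))  # A
```

Note that the two strings print identically on screen even though their codepoint counts differ, which is exactly why the offset-mapping problem described above arises.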
Something I think I'll experiment with is skipping over combining characters in both strings after the normalization is done since, in theory, both strings should contain exactly the same number of base characters. An added efficiency, which would cover the common case by far, might be to compare the two strings and, if they're the same, skip the (expensive) offset mapping entirely. If this approach works then all we should need to do is define our tables using only composed and compatibility characters, which, I suspect, is already the case. Contraction table compilation could check for this anyway, though, just to be sure.

-- 
Dave Mielke           | 2213 Fox Crescent | The Bible is the very Word of God.
Phone: 1-613-726-0014 | Ottawa, Ontario   | http://Mielke.cc/bible/
EMail: [email protected] | Canada K2A 1H7    | http://FamilyRadio.com/

_______________________________________________
This message was sent via the BRLTTY mailing list.
To post a message, send an e-mail to: [email protected]
For general information, go to: http://mielke.cc/mailman/listinfo/brltty
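The skip-the-combining-characters idea in the message above might be sketched roughly as follows. This is only an illustration of the technique in Python, not BRLTTY code; map_offsets is a hypothetical helper name, and the sketch assumes both strings really do contain the same sequence of base characters (which NFC normalization guarantees):

```python
import unicodedata

def map_offsets(source, normalized):
    """Map the offset of each base character in source to the offset of
    the corresponding base character in normalized, by walking both
    strings in parallel and skipping over combining characters."""
    mapping = {}
    i = j = 0
    while i < len(source):
        if unicodedata.combining(source[i]):
            i += 1  # skip a combining mark in the source text
            continue
        while j < len(normalized) and unicodedata.combining(normalized[j]):
            j += 1  # skip a combining mark in the normalized text
        mapping[i] = j  # both indices now point at a base character
        i += 1
        j += 1
    return mapping

source = "naivete\u0301"  # final é in decomposed form
normalized = unicodedata.normalize("NFC", source)
if source == normalized:
    # the cheap common-case check: identical strings need no mapping
    pass
else:
    print(map_offsets(source, normalized))  # {0: 0, 1: 1, ..., 6: 6}
```

The early equality check mirrors the proposed optimization: when no character in the text was decomposed, normalization returns an identical string and the expensive mapping pass can be skipped.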
