Re: sequences and stuff
On Thu, 30 Nov 2000, Brendan Murray/DUB/Lotus wrote:

There are similar situations in many languages. Possibly more complicated is the use of graphemes which usually contract but don't in some cases. For example, the "aa" sequence as in "gaard" in Danish is traditionally sorted as å (a-ring), after ø (o-slash), but in other situations, particularly in names, the "aa" is really "a"+"a", and should be sorted before "b". How can this be catered for algorithmically?

My guess is that there are only two possible solutions:
1. use an exceptions list, or
2. break the grapheme with some marker like ZWNJ to prevent the contraction.

Obviously the first creates a maintenance nightmare, and the latter has to be somehow tagged to store the data correctly. In any case there's no simple solution.

The situation is somewhat worse with Persian. The letter U+0622 (Alef With Madda Above), when it occurs in the middle of a word, is sorted according to its root. Although it is pronounced the same either way, it may be a letter of its own (with a Persian root), or it may be a Hamza+Alef, in which case it is treated like a ligature when sorting. Librarians, who know the meaning of the words, have no problem sorting them, but the poor computer programs, you know. Any ideas for different markup?

If you need examples, take "MEEM ALEF-MADDA KHAH THAL", which is sorted like "MEEM HAMZA ALEF KHAH THAL" (Hamza is sorted after Alef in Persian), and "MEEM FARSI-YEH REH ALEF-MADDA BEH", in which the Alef-Madda is considered a single unit, sorted before Alef.

--roozbeh
Re: sequences and stuff
Branislav, We're working on this; actually I am writing a paper which deals with some of the proposed solutions. That should be ready in a day or so. In the meantime, can you give me an example of a Czech or Slovak word in which ch is a grapheme, and another in which ch meet at a morpheme boundary? It would help me quite a lot. Michael Everson ** Everson Gunn Teoranta ** http://www.egt.ie 15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland Mob +353 86 807 9169 ** Fax +353 1 478 2597 ** Vox +353 1 478 2597 27 Páirc an Fhéithlinn; Baile an Bhóthair; Co. Átha Cliath; Éire
Re: sequences and stuff
Branislav Tichy wrote:

b) there are compound words which have these sequences on a word border, and in this case they stand for two separate graphemes and _are_ sorted as c+h, d+z and so forth. The proper collation algorithm would therefore have to realise (imho) whether there is one grapheme or two (whether the word is compound)!

There are similar situations in many languages. Possibly more complicated is the use of graphemes which usually contract but don't in some cases. For example, the "aa" sequence as in "gaard" in Danish is traditionally sorted as å (a-ring), after ø (o-slash), but in other situations, particularly in names, the "aa" is really "a"+"a", and should be sorted before "b". How can this be catered for algorithmically?

My guess is that there are only two possible solutions:
1. use an exceptions list, or
2. break the grapheme with some marker like ZWNJ to prevent the contraction.

Obviously the first creates a maintenance nightmare, and the latter has to be somehow tagged to store the data correctly. In any case there's no simple solution.

Brendan
Re: sequences and stuff
On Thu, Nov 30, 2000 at 05:18:59AM -0800, Brendan Murray/DUB/Lotus wrote:

Branislav Tichy wrote:

b) there are compound words which have these sequences on a word border, and in this case they stand for two separate graphemes and _are_ sorted as c+h, d+z and so forth. The proper collation algorithm would therefore have to realise (imho) whether there is one grapheme or two (whether the word is compound)!

There are similar situations in many languages. Possibly more complicated is the use of graphemes which usually contract but don't in some cases. For example, the "aa" sequence as in "gaard" in Danish is traditionally sorted as å (a-ring), after ø (o-slash), but in other situations, particularly in names, the "aa" is really "a"+"a", and should be sorted before "b". How can this be catered for algorithmically?

Yes, the Slovak problem may look like the Danish "aa" problem. Just for the record, "aa" normally means "å" in Danish names, e.g. Søndergaard is the last name of one of the people who have been responsible for SC2 matters at Danish Standards. "gaard" is pronounced like "gård". I have no examples off the top of my head of Danish names where "aa" actually means two a's, pronounced as two sounds.

The rule from the Danish orthography book is that if the two a's are pronounced as two sounds, they are also sorted as two sounds, as two A's. If the pair is pronounced as one sound, then it is sorted as an "å" (irrespective of whether the sound is an "a" sound).

My guess is that there are only two possible solutions:
1. use an exceptions list, or
2. break the grapheme with some marker like ZWNJ to prevent the contraction.

Obviously the first creates a maintenance nightmare, and the latter has to be somehow tagged to store the data correctly. In any case there's no simple solution.

The two a sounds occur in compound words, like ekstraarbejde (extra work). The recommendation from Danish Standards is to introduce a soft hyphen (SHY) between the A's. This also works for ISO-8859-1.

Keld
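The two approaches discussed above (contract "aa" to "å" by default, and suppress the contraction with an explicit marker such as SHY or ZWNJ) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name danish_key, the lower-casing, and the single-level ranking are hypothetical simplifications, not part of any Danish standard or collation library.

```python
SHY = "\u00ad"    # SOFT HYPHEN, the marker recommended in the message above
ZWNJ = "\u200c"   # ZERO WIDTH NON-JOINER, the marker suggested earlier in the thread
BREAKERS = {SHY, ZWNJ}

# Danish alphabet order: a..z, then æ, ø, å (assumption: primary level only).
ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def danish_key(word):
    """Hypothetical sort key: 'aa' contracts to 'å' unless a breaker separates the a's."""
    s = word.lower()
    key = []
    i = 0
    while i < len(s):
        ch = s[i]
        if ch in BREAKERS:
            # The marker itself carries no weight; it only blocks the contraction.
            i += 1
            continue
        if ch == "a" and s[i + 1:i + 2] == "a":
            key.append(RANK["å"])  # contraction: sort the pair as å
            i += 2
            continue
        # Unknown characters sort after the alphabet in this sketch.
        key.append(RANK.get(ch, len(ALPHABET)))
        i += 1
    return key


words = ["gaard", "gul", "gøre"]
print(sorted(words, key=danish_key))  # "gaard" sorts with å, after "gøre"
```

With a soft hyphen stored between the two a's, "ekstra" + SHY + "arbejde" yields the same key as the plain letter sequence, so it sorts under "a" rather than "å". The data still has to carry the marker, which is the tagging cost mentioned above.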
Re: sequences and stuff
Keld Jørn Simonsen wrote:

I have no examples off the top of my head of Danish names where "aa" actually means two a's, pronounced as two sounds.

I know of at least one - what about "Haageman"? That's pronounced (using English) "Hay-e-man".

Brendan
Re: sequences and stuff
On Thu, Nov 30, 2000 at 07:52:37AM -0800, Brendan Murray/DUB/Lotus wrote:

Keld Jørn Simonsen wrote:

I have no examples off the top of my head of Danish names where "aa" actually means two a's, pronounced as two sounds.

I know of at least one - what about "Haageman"? That's pronounced (using English) "Hay-e-man".

I have not seen that name in Danish before. I would pronounce it "Håueman" if I saw it on a Danish name list. Of course you can choose to pronounce your name in a special way. Anyway, you may have been fooled by the "g", which may be silent, or pronounced like a short "u". So it is:

Haa-ge-man
Hå  ue man

Kind regards
Keld
Re: sequences and stuff
The soft hyphen is not sufficient, since in other languages the case where two letters must be distinguished in collation may not fall on a syllable boundary, or allow hyphenation between them.

The UTC looked at all the possible existing boundary-control characters; none of them really works for this problem, since they all have other functions that may conflict. There was a proposal for a grapheme-break and grapheme-join pair of additional "Cf" characters. The UTC accepted the second one, and will be working with WG2 on it.

Mark

IMO, both are useful in different situations. The grapheme-break is more useful in the situation you cite: marking the exceptional words having characters that should not be considered a single grapheme in collation (and, perhaps, in pronunciation: e.g. "Bathill").

- Original Message -
From: "Keld Jørn Simonsen"
To: "Unicode List"
Cc: "Unicode List"
Sent: Thursday, November 30, 2000 07:43
Subject: Re: sequences and stuff
Re: sequences and stuff
Keld Jørn Simonsen wrote:

Anyway, you may have been fooled by the "g", which may be silent, or pronounced like a short "u". So it is:

Haa-ge-man
Hå  ue man

Nope - the first syllable in this surname *is* pronounced as the English "hay" rather than "hoe". And I used this example from an ex-colleague who was the EDB-chef at a major department store in Kbh, so it's a real-life example.

Brendan
Re: sequences and stuff
On Thu, Nov 30, 2000 at 09:22:54AM -0800, Brendan Murray/DUB/Lotus wrote:

Keld Jørn Simonsen wrote:

Anyway, you may have been fooled by the "g", which may be silent, or pronounced like a short "u". So it is:

Haa-ge-man
Hå  ue man

Nope - the first syllable in this surname *is* pronounced as the English "hay" rather than "hoe". And I used this example from an ex-colleague who was the EDB-chef at a major department store in Kbh, so it's a real-life example.

Are you sure it was the Danish pronunciation? Many Danes try to "internationalize" the pronunciation of their names (that is, make it easier for English-speaking people to pronounce). "Hay" is not the normal pronunciation of "ha" in Danish, so he was not pronouncing a Danish "a" sound. "a" would be pronounced like the first "a" in "advanced". What happened to the "ge"?

Keld
Re: sequences and stuff
On Thu, Nov 30, 2000 at 04:55:15AM -0800, Michael Everson wrote:

We're working on this; actually I am writing a paper which deals with some of the proposed solutions. That should be ready in a day or so. In the meantime, can you give me an example of a Czech or Slovak word in which ch is a grapheme, and another in which ch meet at a morpheme boundary? It would help me quite a lot.

Wow, someone else has introduced the topic I have raved about repeatedly. :)

Anyway, in Slovak, ch is always a single unit. But Braňo has a point: a text may be multilingual, in which case some words may use 'ch' as a grapheme which should be sorted after 'h', while others may use it as two separate characters.

As for 'dz' and 'dž', they are not really a problem, simply because each, taken as a single unit, is sorted lexicographically exactly the same as when taken as two separate characters.

Incidentally, when transliterating from Greek (chi), ch is really a single unit in other languages as well.

Adam
--
Life is not just a matter of holding good cards, but sometimes of playing a poor hand well. -- Robert Louis Stevenson
Re: sequences and stuff
On Thu, Nov 30, 2000 at 03:44:00AM -0800, Branislav Tichy wrote:

Hello, this subject (or something like it) has probably been discussed already, but let me ask one more question about it: sequences vs. collating.

I have recently read the page //www.unicode.org/unicode/standard/where/ and I basically agree with the listed reasons (for not including all possible sequences...) except one. Let me explain it using Slovak. The page actually gives an example of one possible Slovak sequence (may I call it a digraph?): 'ch', or 0063 0068. Other possibilities are 'dz', 'dž' (written 'd3' in ASCII; d + z-caron, 0064 017E or 0064 007A 030C), 'ia', 'ie', 'iu', 'ou'.

The problem is that:

a) they _are not_ sorted as c+h, d+z... when standing for one grapheme (the order is ...d, dz, dž, e, ... h, ch, i, ...)

b) there are compound words which have these sequences on a word border, and in this case they stand for two separate graphemes and _are_ sorted as c+h, d+z and so forth.

The proper collation algorithm would therefore have to realise (imho) whether there is one grapheme or two (whether the word is compound)!

Suggestion: one possible solution could be to use the codes 200B (ZWSP), 200C (ZWNJ) or 200D (ZWJ) to distinguish digraphs (as in the example with the fi ligature). Or maybe there could be some 'digraph gluing' code? Or maybe a code for the word border in compound words? Or could it be handled by the 009A (single character introducer) code? This way sorting could be done by a low-level algorithm without any need for word dictionaries (I can't think of any other means of distinguishing a compound word and its parts properly).

I would suggest you use something like SHY (soft hyphen) between the combined words. That way you also have an indication of where to hyphenate. Sorting is well understood for Slovak, and special rules have been in place for these digraphs for a long time.

Keld
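To make the "low-level algorithm without word dictionaries" idea concrete, here is a rough Python sketch of a table-driven tokenizer that takes the longest matching collation element ('ch' before 'c') and lets an explicit marker split a digraph. Everything in it is an assumption for illustration: the names ELEMENTS, tokenize and slovak_key are hypothetical, ZWNJ is just one of the candidate markers named above, and the element list is an unaccented subset of the Slovak order rather than the real alphabet.

```python
BREAK = "\u200c"  # ZWNJ, one of the marker candidates proposed in the message above

# Assumption: simplified, unaccented subset of the Slovak order.
# The digraphs are listed where they sort: "dz" after "d", "ch" after "h".
ELEMENTS = [
    "a", "b", "c", "d", "dz", "e", "f", "g", "h", "ch", "i", "j", "k", "l",
    "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
]
RANK = {e: i for i, e in enumerate(ELEMENTS)}
MAX_LEN = max(len(e) for e in ELEMENTS)


def tokenize(word):
    """Split a word into collation elements, longest match first."""
    s = word.lower()
    out = []
    i = 0
    while i < len(s):
        if s[i] == BREAK:
            # The marker only prevents a digraph from forming across it.
            i += 1
            continue
        for n in range(MAX_LEN, 0, -1):
            cand = s[i:i + n]
            if BREAK not in cand and cand in RANK:
                out.append(cand)
                i += len(cand)
                break
        else:
            # Character not in the table: keep it as its own element.
            out.append(s[i])
            i += 1
    return out


def slovak_key(word):
    """Hypothetical sort key: rank of each element, unknowns last."""
    return [RANK.get(tok, len(ELEMENTS)) for tok in tokenize(word)]


print(tokenize("chata"))            # digraph kept as one element
print(tokenize("c" + BREAK + "h"))  # marker forces two separate letters
print(sorted(["idea", "chata", "hora", "cesta"], key=slovak_key))
```

The design choice here is that the table, not the code, decides which sequences are digraphs, so the same loop would serve 'dz' or any other sequence. It does not decide by itself where a compound boundary lies; someone still has to put the marker into the data.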