Re: sequences and stuff

2000-12-01 Thread Roozbeh Pournader


On Thu, 30 Nov 2000, Brendan Murray/DUB/Lotus wrote:

 There are similar situations in many languages. Possibly more complicated
 is the use of graphemes which usually contract but don't in some cases. For
 example, the "aa" sequence as in "gaard" in Danish is traditionally sorted
 as å (a-ring), after ø (o-slash), but in other situations, particularly in
 names, the "aa" is really "a"+"a", and should be sorted before "b". How can
 this be catered for algorithmically?
 
 My guess is that there are only two possible solutions:
1. use an exceptions list, or
2. break the grapheme with some marker like ZWNJ to prevent the
contraction.
 
 Obviously the first creates a maintenance nightmare, and the latter has to
 be somehow tagged to store the data correctly. In any case there's no
 simple solution.

The situation is somewhat worse with Persian. The letter U+0622, "Alef
With Madda Above", when it occurs in the middle of a word, is sorted
according to the word's root. Although it is always pronounced the same,
it may be a letter in its own right (in words of Persian root), or it may
be Hamza+Alef, in which case it is treated like a ligature for sorting.
Librarians, who know the meaning of the words, have no problem sorting
them, but computer programs have no such knowledge. Any ideas for
different markup? If you need examples, you can
take "MEEM ALEF-MADDA KHAH THAL", which is sorted like "MEEM HAMZA ALEF
KHAH THAL" (Hamza is sorted after Alef in Persian), and "MEEM FARSI-YEH REH
ALEF-MADDA BEH", in which the Alef-Madda is considered a single unit,
sorted before Alef.

--roozbeh





Re: sequences and stuff

2000-11-30 Thread Michael Everson

Branislav,

We're working on this; actually I am writing a paper which deals with some
of the proposed solutions. That should be ready in a day or so. In the
meantime, can you give me an example of a Czech or Slovak word in which
ch is a single grapheme, and another in which c and h meet at a morpheme
boundary? It would help me quite a lot.

Michael Everson  **  Everson Gunn Teoranta  **   http://www.egt.ie
15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
Mob +353 86 807 9169 ** Fax +353 1 478 2597 ** Vox +353 1 478 2597
27 Páirc an Fhéithlinn;  Baile an Bhóthair;  Co. Átha Cliath; Éire





Re: sequences and stuff

2000-11-30 Thread Brendan Murray/DUB/Lotus


Branislav Tichy [EMAIL PROTECTED] wrote:
 b) there are compound words which have these sequences at a word boundary,
 and in this case they stand for two separate graphemes and _are_ sorted
 as c+h, d+z and so forth.
 a proper collation algorithm would therefore have to recognise (imho)
 whether there is one grapheme or two (whether the word is a compound)!

There are similar situations in many languages. Possibly more complicated
is the use of graphemes which usually contract but don't in some cases. For
example, the "aa" sequence as in "gaard" in Danish is traditionally sorted
as å (a-ring), after ø (o-slash), but in other situations, particularly in
names, the "aa" is really "a"+"a", and should be sorted before "b". How can
this be catered for algorithmically?

My guess is that there are only two possible solutions:
   1. use an exceptions list, or
   2. break the grapheme with some marker like ZWNJ to prevent the
   contraction.

Obviously the first creates a maintenance nightmare, and the latter has to
be somehow tagged to store the data correctly. In any case there's no
simple solution.
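
A minimal sketch of both options in Python, for illustration only. It assumes a toy Danish alphabet (a-z, then æ, ø, å), treats either SHY or ZWNJ as the break marker, and uses a made-up one-word exceptions list; the names sort_key, BREAKERS and EXCEPTIONS are hypothetical, and this is not a full collation implementation.

```python
# Toy Danish sort key: "aa" contracts to "å" unless an exception or a
# break marker says otherwise. Sketch only, not a real collation engine.
SHY = "\u00ad"    # SOFT HYPHEN
ZWNJ = "\u200c"   # ZERO WIDTH NON-JOINER
BREAKERS = {SHY, ZWNJ}

ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"   # å sorts last, after ø
WEIGHT = {ch: i for i, ch in enumerate(ALPHABET)}

# Option 1: a (hypothetical) exceptions list of words whose "aa" is a + a.
EXCEPTIONS = frozenset({"ekstraarbejde"})


def sort_key(word, exceptions=frozenset()):
    """Return a tuple of weights for one word."""
    text = word.lower()
    contract = text not in exceptions   # option 1: whole-word lookup
    key = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in BREAKERS:
            # Option 2: the marker has no weight of its own; it only
            # keeps the two a's from being adjacent.
            i += 1
        elif contract and text.startswith("aa", i):
            key.append(WEIGHT["å"])     # "aa" sorts as "å"
            i += 2
        else:
            # Unknown characters sort after the alphabet in this sketch.
            key.append(WEIGHT.get(ch, len(ALPHABET)))
            i += 1
    return tuple(key)
```

With this, sort_key("gaard") equals sort_key("gård"), while "ekstra" + SHY + "arbejde" gets the same key as the plain word looked up in the exceptions list. The marker approach moves the knowledge into the data; the list approach keeps the data clean but has to be maintained.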

Brendan




Re: sequences and stuff

2000-11-30 Thread Keld Jørn Simonsen

On Thu, Nov 30, 2000 at 05:18:59AM -0800, Brendan Murray/DUB/Lotus wrote:
 
 Branislav Tichy [EMAIL PROTECTED] wrote:
  b) there are compound words which have these sequences at a word boundary,
  and in this case they stand for two separate graphemes and _are_ sorted
  as c+h, d+z and so forth.
  a proper collation algorithm would therefore have to recognise (imho)
  whether there is one grapheme or two (whether the word is a compound)!
 
 There are similar situations in many languages. Possibly more complicated
 is the use of graphemes which usually contract but don't in some cases. For
 example, the "aa" sequence as in "gaard" in Danish is traditionally sorted
 as å (a-ring), after ø (o-slash), but in other situations, particularly in
 names, the "aa" is really "a"+"a", and should be sorted before "b". How can
 this be catered for algorithmically?

Yes, the Slovak problem may look like the Danish "aa" problem.
Just for the record, "aa" normally means "å" in Danish names,
e.g. Søndergaard is the last name of one of the people who
have been responsible for SC2 matters in Danish Standards.
"gaard" is pronounced like "gård". I have no examples off the top of my
head of Danish names where "aa" actually means two a's, pronounced as two
sounds.

The rule from the Danish orthography book is that if the two
a's are pronounced as two sounds, they are also sorted as two letters, as
two A's. If the pair is pronounced as one sound, then it is sorted as an "å"
(irrespective of whether the sound is an "a" sound).

 My guess is that there are only two possible solutions:
1. use an exceptions list, or
2. break the grapheme with some marker like ZWNJ to prevent the
contraction.
 
 Obviously the first creates a maintenance nightmare, and the latter has to
 be somehow tagged to store the data correctly. In any case there's no
 simple solution.
 
The two a sounds occur in compound words, like ekstraarbejde (extra work).
The recommendation from Danish Standards is to insert a soft hyphen (SHY)
between the A's. This also works for ISO-8859-1.

Keld



Re: sequences and stuff

2000-11-30 Thread Brendan Murray/DUB/Lotus


Keld Jørn Simonsen [EMAIL PROTECTED] wrote:
 I have no examples off the top of my head of Danish names
 where "aa" actually means two a's, pronounced as two sounds.

I know of at least one - what about "Haageman"? That's pronounced (using
English) "Hay-e-man".

Brendan




Re: sequences and stuff

2000-11-30 Thread Keld Jørn Simonsen

On Thu, Nov 30, 2000 at 07:52:37AM -0800, Brendan Murray/DUB/Lotus wrote:
 
 Keld Jørn Simonsen [EMAIL PROTECTED] wrote:
  I have no examples off the top of my head of Danish names
  where "aa" actually means two a's, pronounced as two sounds.
 
 I know of at least one - what about "Haageman"? That's pronounced (using
 English) "Hay-e-man".

I have not seen that name in Danish before. I would pronounce it
"Håueman" if I saw it on a Danish name list. Of course you can
choose to pronounce your name in a special way.

Anyway, you may have been fooled by the "g", which may be silent
or pronounced like a short "u". So it is:

Haa-ge-man
Hå  ue man


Kind regards
Keld



Re: sequences and stuff

2000-11-30 Thread Mark Davis

The soft hyphen is not sufficient, since in other languages the case where
two letters must be distinguished in collation may not fall on a syllable
boundary, or allow hyphenation between them.

The UTC looked at all the possible existing boundary-control characters;
none of them really work for this problem since they all have other
functions that may conflict. There was a proposal for a grapheme-break and
grapheme-join pair of additional "Cf" characters. The UTC accepted the
second one, and will be working with WG2 on it.

Mark

IMO, both are useful in different situations. The grapheme-break is more
useful in the situation you cite: marking the exceptional words having
characters that should not be considered a single grapheme in collation
(and, perhaps, in pronunciation: e.g. "Bathill").






Re: sequences and stuff

2000-11-30 Thread Brendan Murray/DUB/Lotus


Keld Jørn Simonsen [EMAIL PROTECTED] wrote:
 Anyway, you may have been fooled by the "g", which may be silent
 or pronounced like a short "u". So it is:

 Haa-ge-man
 Hå  ue man

Nope - the first syllable in this surname *is* pronounced as the English
"hay" rather than "hoe". And I used this example from an ex-colleague who
was the EDB-chef at a major department store in Kbh, so it's a real-life
example.

Brendan





Re: sequences and stuff

2000-11-30 Thread Keld Jørn Simonsen

On Thu, Nov 30, 2000 at 09:22:54AM -0800, Brendan Murray/DUB/Lotus wrote:
 
 Keld Jørn Simonsen [EMAIL PROTECTED] wrote:
  Anyway, you may have been fooled by the "g", which may be silent
  or pronounced like a short "u". So it is:
 
  Haa-ge-man
  Hå  ue man
 
 Nope - the first syllable in this surname *is* pronounced as the English
 "hay" rather than "hoe". And I used this example from an ex-colleague who
 was the EDB-chef at a major department store in Kbh, so it's a real-life
 example.

Are you sure it was the Danish pronunciation?
Many Danes try to "internationalize" the pronunciation of their names
(that is, make it easier for English-speaking people to pronounce).

"Hay" is not a normal pronunciation of "ha" in Danish. So he was
not pronouncing a Danish "a" sound. "a" would be pronounced like
the first "a" in "advanced".

What happened to the "ge" ?

Keld



Re: sequences and stuff

2000-11-30 Thread G. Adam Stanislav

On Thu, Nov 30, 2000 at 04:55:15AM -0800, Michael Everson wrote:
We're working on this; actually I am writing a paper which deals with some
of the proposed solutions. That should be ready in a day or so. In the
meantime, can you give me an example of a Czech or Slovak word in which
ch is a single grapheme, and another in which c and h meet at a morpheme
boundary? It would help me quite a lot.

Wow, someone else has introduced the topic I have raved about repeatedly. :)

Anyway, in Slovak, ch is always a single unit. But Braňo has a point: a
text may be multilingual, in which case some words may use 'ch' as a
grapheme which should be sorted after 'h', while others may use it as
two separate characters.

As for 'dz' and 'dž', it's not really a problem, simply because taken
as a single unit each is sorted lexicographically exactly the same as
when taken as two separate characters.

Incidentally, when transliterating from Greek (chi), ch is really a
single unit in other languages as well.

Adam

-- 
Life is not just a matter of holding good cards,
but sometimes of playing a poor hand well.
-- Robert Louis Stevenson



Re: sequences and stuff

2000-11-30 Thread Keld Jørn Simonsen

On Thu, Nov 30, 2000 at 03:44:00AM -0800, Branislav Tichy wrote:
 hello,
 
 this subject (or something like it) has probably been discussed already,
 but let me ask one more question about it: sequences vs. collation.
 i have recently read the page //www.unicode.org/unicode/standard/where/
 and i basically agree with the reasons listed (for not including all
 possible sequences...) except one. let me explain it using Slovak.
 there actually is an example of one possible Slovak sequence (may i call it
 a digraph?): 'ch' or 0063 0068. other possibilities are 'dz' 'd3' (d+z
 caron 0064 017e | 0064 007a 030c) 'ia' 'ie' 'iu' 'ou'. the problem is
 that
 a) they _are not_ sorted as c+h, d+z... when standing for one grapheme
 (the order is ...d,dz,d3,e...h,ch,i,...)
 b) there are compound words which have these sequences at a word boundary,
 and in this case they stand for two separate graphemes and _are_ sorted
 as c+h, d+z and so forth.
 a proper collation algorithm would therefore have to recognise (imho)
 whether there is one grapheme or two (whether the word is a compound)!
 
 suggestion:
 one possible solution could be using the codes
   200b  zwsp
 or
   200c  zwnj
   200d  zwj
 to distinguish digraphs (as in the example with the fi ligature).
 
 or maybe there could be some 'digraph gluing' code?
 or maybe a code for the word boundary in compound words?
 or it could be handled by the 009a (single character introducer) code?
 
 this way sorting could be done by a low-level algorithm without any need
 for word dictionaries (i can't think of any other means of distinguishing
 a compound word and its parts properly)

I would suggest you use something like SHY (soft hyphen) between
the combined words. That way you also have
an indication of where to hyphenate.

Sorting is well understood for Slovak and special rules have been in
place for these digraphs for a long time.
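
As a rough illustration of how a low-level sort could honour such a marker, here is a small Python sketch. It assumes a simplified alphabet (unaccented letters only, with 'ch' placed after 'h' as in the order quoted above) and uses SHY as the break character; slovak_key, ORDER and the sample strings are hypothetical, not taken from any standard.

```python
# Toy digraph-aware sort key: "ch" is one unit after "h" unless a
# break character separates the c from the h. Sketch only.
SHY = "\u00ad"   # SOFT HYPHEN, used here as the break marker

# Simplified order: ... h, ch, i ...; accented letters are left out.
ORDER = ["a", "b", "c", "d", "e", "f", "g", "h", "ch", "i", "j", "k",
         "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w",
         "x", "y", "z"]
WEIGHT = {unit: i for i, unit in enumerate(ORDER)}


def slovak_key(word, breaker=SHY):
    """Return a tuple of weights, contracting 'ch' where it is unbroken."""
    text = word.lower()
    key = []
    i = 0
    while i < len(text):
        if text[i] == breaker:
            i += 1                      # marker carries no weight
        elif text.startswith("ch", i):
            key.append(WEIGHT["ch"])    # digraph as a single unit
            i += 2
        else:
            # Anything outside the toy alphabet sorts last.
            key.append(WEIGHT.get(text[i], len(ORDER)))
            i += 1
    return tuple(key)
```

Sorting "idea", "chata", "hora", "cena" with this key puts "chata" between "hora" and "idea", while a string with SHY between the c and the h sorts as plain c+h, right after "cena". No dictionary is needed, but the marker has to be present in the stored text.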

Keld