Re: [A12n-Collab] Latin alpha (Re: Public Review Issues Update)

2004-08-31 Thread Philippe Verdy
From: John Hudson [EMAIL PROTECTED]
 Donald Z. Osborn wrote:

  According to data from R. Hartell (1993), the latin alpha is used in
Fe'efe'e (a
  dialect of Bamileke) in Cameroon. See
  http://www.bisharat.net/A12N/CAM-table.htm (full ref. there; Hartell
names her
  sources in her book). Not sure offhand of other uses, but I thought it
was
  proposed for Latin transcription of Tamashek in Mali at one point (I'll
try to
  check later). In any event it would seem easy to confuse the latin alpha
with
  the standard a, which would seem to either require exaggerated forms
(of the
  alpha, to clarify the difference) or limit its usefulness in practice.

 The Latin alpha is usually distinguished from the regular Latin lowercase
a by making the
 latter a 'double-storey' form, whereas the alpha is a single-storey form.
Of course, this
 means that the distinction cannot be adequately made in typefaces with a
single-storey
 lowercase a, such as Futura.

I agree with you but almost all font designs make a clear distinction
between lowercase alpha (latin or greek), and lowercase a: the alpha is a
single continuous stroke, whereas the Latin letter a is almost always (in
either single-eye or double-eye forms) a closed circle/ellipse and a
tangeant vertical stroke on the right.

I was speaking about the distinction between single-eye and double eye forms
of the Latin letter a (excluding Latin alpha), where:
- the single-eye form is generally an x-height circle or vertical ellipse
and a x-height tangeant vertical stroke (possibly curved on the lower end to
become tangeant to the baseline to become the start of a connecting edge),
- and the double-eye form is generally an half-x-height flat ellipse and a
x-height vertical tangeant curved above the ellipse to become tangeant to
the x-height horizontal line; so it has two eyes (one closed below, one open
eye above it).

The Latin small letter alpha is always a single-eye form, but sometimes
there's a second open eye on the right of the closed eye (which should not
be a ellipse, but should present some angle on its right edge).

My question was about the distinction of letter a only, even if there are
some fonts where it will be difficult to see the difference between the
single-eye letter a and the small letter alpha.




Re: Deseret in use (?) by micronation Molossia

2004-09-07 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED]
Antnio Martins-Tuvlkin antonio at tuvalkin dot web dot pt wrote:
Deseret in use (?) by micronation Molossia: It is explained at
 http://www.molossia.org/alphabet.html , but they put GIFs on-line,
making no use of the U+10400 block...
I visited their site, wondering if they could use some assistance with
transcription and Unicode from an American foreign national who reads
and writes Deseret (and has been to the area recently).  But they seem
more interested in relations with residents of other micronations than
with Americans.   ,  .
Seriously, do these kings and emperors, that reign on these lands and 
claim various disputed places around the world, or even on the Moon or Mars, 
and can issue currency by buying Monopoly(tm) game bills, be taken 
seriously?
On these micronations, you'll find so many consitutional changes, for so 
few peoples (most often not more than a handful), that I doubt these peoples 
can claim create a standard.
What is real is that they have found a legal way to escape from their 
hosting country to put them out of laws, but also out of assistance. None of 
them are recognized internationally, except between themselves in a virtual 
forum (if they can pay for their Internet access used abroad...)
We can accept their volonty of independance, but they have also to accept 
what this implies. Most of these self-claimed lands have disappeared after 
less than a dozen of years (divorce, family conflicts, or simply poverty 
caused by lack of local job and resources; none of these lands could live 
without tourism, if people visiting them accept to pay their tax in 
full-fledged US dollars or Euros...)
I remember such self-claimed country by someone who bought an old oil 
platform, and anchored it in the international waters North of Europe. This 
platform is under International laws, but is not a homeland (nobody lives 
there), despite it is out of laws of the neighboring country. This has 
allowed the man owning it to escape from taxes and to found a financial 
company with black money that resists from fiscal inspections... until the 
status of the waters was resolved at the UN, by an agreement between 
neighbouring countries; 




markup on combining characters (was: Compatibility mappings for new Hebrew points)

2004-09-07 Thread Philippe Verdy
From: Peter Kirk [EMAIL PROTECTED]
By the way, any suggestion of making the QQ distinction with markup is 
ruled out by the principle recently expounded on the main Unicode list 
that separate markup cannot be applied to combining characters.
Isn't this need of allowing separate markup on combining characters 
addressed by the current proposal to encode a invisible base character 
(IBC), so that markup can be applied to a non defective combining sequence?

I understand that this proposed new character would more likely be used to 
allow rendering isolated combining marks, without needing to encode their 
spacing variant, but the sequence IBC,combining mark (now possibly 
enclosed in markup) could become a candidate for possible ligaturing by 
preceding it by a ZWJ, or for word-wrap exclusion with a leading WJ...




Re: markup on combining characters

2004-09-08 Thread Philippe Verdy
From: Jony Rosenne [EMAIL PROTECTED]
Peter Kirk
You mean, you would represent a black e with a red acute accent as
something like e, ZWJ, red, IBC, acute, /red? That
looks like
a nightmare for all kinds of processing and a nightmare for rendering.
No, it is more like forecolor:black, combiningcolor:red e acute
And there is no Unicode decision against it.
And still no decision if this invisible base character will be added or not. 
It's just a public review for now, to address the first issue of rendering 
isolated non-spacing combining marks that currently don't have a spacing 
variant (I think it's a good idea as it would avoid adding most of the 
missing ones, notably for the non-generic L/G/C combining marks).

Note that your suggestion of:
  forecolor:black, combiningcolor:red e acute
should also work with any normalized form of the same text, i.e. with:
  forecolor:black, combiningcolor:red e with acute
where the combining mark is composed. The issue here is that this becomes 
tricky for renderers that will need to redecompose strings in normalized 
forms, before applying style.
Basically I prefer the Peter solution with:
  e, ZWJ?, red, IBC, acute, /red
which is more independant of the normalization form. Then the question is 
whever the text within red.../red markup should combine visually when 
rendered.

For now I see the proposed IBC (no name for it for now) only as a way to 
transform non-spacing combining marks in spacing non-combining variants, 
when they dont exist separately in Unicode (so this would not be recommanded 
for the non-spacing acute accent which already has a spacing version that 
does not require using a leading IBC.)
Technically, if an IBC character is added, a renderer will not necessarily 
render IBC, non-spacing combining acute the same way as spacing 
non-combining acute accent, even if it should better do so.
In this past sentence, the should means that the existing spacing 
non-combining marks are left as the standard legacy way to encode them, and 
they normally don't combine when rendered after a base letter, even if 
there's markup around them (except if this markup explicitly says that they 
should combine):

If I take the above example,
   e, ZWJ?, red, IBC, acute, /red
the same rich-text should also be renderable without the markup in 
plain-text as if it was:
   e, ZWJ?, IBC, acute
i.e. (with the should above) like if it was also:
   e, ZWJ?, spacing acute
I have placed the ? symbol after ZWJ to exhibit the fact that something 
would be necessary to allow this last text to remove the non-combining 
non-spacing behavior of the spacing acute character. Without it, the text:
   e, spacing acute
or equivalently (with the should above):
   e, IBC, combining acute
would not be allowed to render a combined e with an accute, and two separate 
glyphs would be rendered, and two separate character entities interpreted 
(as they are today in legacy plain-texts).

So the question remains about how to add markup on combining marks: the 
proposed IBC alone cannot solve such problems, unless there's an agreement 
that ZWJ immediately followed by IBC should be rendered as if they were not 
present (but in that case, a spacing acute becomes semantically and 
graphically distinct from IBC, combining acute: this is what will happen 
in any case with normalization forms due to the Unicode stability policy, as 
existing spacing marks must remain undecomposable in NFD or NFKD forms).

I also note that IBC is intended to replace the need to use a standard SPACE 
as the base character for building a spacing variant of combining marks when 
there's no standard spacing variant encoded in Unicode (this is a legacy 
hack, which causes various problems because of whitespace normalization in 
many plain-text formats or applications, or in XML and HTML, and the special 
word-breaking behavior of spaces). I don't see it as a way to deprecate the 
existing block of spacing marks.




Re: markup on combining characters

2004-09-08 Thread Philippe Verdy
From: Asmus Freytag [EMAIL PROTECTED]
At 12:49 AM 9/8/2004, Philippe Verdy wrote:
And still no decision if this invisible base character will be added or 
not. It's just a public review for now,
Well, hold your horses for a bit here.
If something's out of review, there won't be a decision until the review 
is over.

Anything that has this much potential exposure is something we should move 
very slowly on, to make sure we get it right.
Isn't the public review there specially to think about such things?
It's not too soon to discuss it now, because the most serious issues will 
hapen when the new character will be encoded, possibly with missing or 
incorrect properties.

I don't know if a formal proposal has been sent to ISO/IEC WG too. May be 
this review is there to allow creating such formal proposal, to be encoded 
later by ISO if it accepts to give it a codepoint, and then accepted too by 
UTC when properties are fixed and usage is properly documented. 




Re: [BULK] - Re: markup on combining characters

2004-09-10 Thread Philippe Verdy
From: Asmus Freytag [EMAIL PROTECTED]
On the other hand, all aspects to *coloring* of characters
do not belong in the plain text stream - but that was not
the question.
I think suggested solutions that define markup that apply to
combining characters but place that markup outside of the
combining sequence would be a better answer than protocols
trying to put markup inside the combining character sequence.
My personal take is that the UTC might make a recommendation
to that effect, but it's not part of the standard proper.
It's not clear that the issue has practical urgency - if
I should be mistaken on that, I'd like to find out how and why.
Placing markup out of the combining sequence seems attractive, apparently, 
but exposes to other difficulties about how to refer to parts of combining 
sequences (I did not say parts of characters, because I agree that 
combining characters are not part of characters, but effectively true 
abstract characters per the Unicode definition), when combining sequences 
are themselves subject to transformations like normalization.

A solution would be to specify in the markup which normalization to apply to 
the combining sequence before refering to its component characters, with 
some syntax like:
   font style=color:red nfd(2,1);ecombining-acute;/font
which would resist to normalization of the document such as NFC in:
   font style=color:red nfd(2,1);e-with-acute;/font
Here some syntax in the markup style indicates an explicit NFD normalization 
to apply to the plain-text fragment encoded in the text element, before 
specifying a range of characters to which the style applies (Here it says 
that color:red applies to only 1 character starting at the second one in the 
surrounded text fragment, after it has been forced to NFD normalization.

May be this seems tricky, but other simplified solutions may be implemented 
in a style language, such as providing more basic restrictions using new 
markup attributes:
   font style=combining-color:rede-with-acute;/font
where the new combining-color attribute implies such prenormalization and 
automatic selection of character ranges to which to apply coloring. May be 
there are better solutions, that will not imply augmenting the style 
language schema with lots of new attribute names, such as in:
   font style=color:combining(red)e-with-acute;/font
Here also, Unicode itself is not affected. But markup languages and 
renderers are seriously modified to take new markup property names or values 
into account. 




Re: Questions about diacritics

2004-09-13 Thread Philippe Verdy
From: Gerd Schumacher [EMAIL PROTECTED]
2. Another invisible diacritics carrier
I also found an acute on diphtongs, placed on the boundary of both letters
(au, ei, eu, oe, and ui).
Wouldn't such diacritic be hold by the currently proposed invisible base 
character (in the Public Review section of the Unicode website), by encoding 
for example:
   a,INVISIBLE LETTER,combining acute,u
If you think there's a grapheme cluster here, I suggest using ZWJ to attach 
the three default grapheme clusters:
   a,ZWJ,INVISIBLE LETTER,combining acute,ZWJ,u
to create a kerning ligature between the two vowels.

The invisible letter in PR-41 is also intended to support the INV
character found in ISCII for standard Brahmic scripts of India, but with
probable interoperability problems.
But I currently do not see indication for its correct usage in the Latin 
script, except as a way to transform a combining diacritic into a 
non-combining one in isolation, when the legacy use of SPACE causes 
interoperability problems such as in XML and HTML or with word-breaking 
algorithms.

As the intent is to create a spacing diacritic, not using a 
joining/ligaturing control before and after it would not create the desired 
effect, as the acute above would be shown on a blank space between 'a' and 
'u', as wide as the acute accent itself.

The PR-41 proposal document suggests that the typical use of the Invisible 
Letter would be to display a isolated spacing diacritic between two spaces 
(or punctuations), a case where the XML/HTML treatment of whitespace 
sequences is to collapse them before rendering or interpreting them.

Your request is quite similar to the case of double diacritics already 
encoded in Unicode, except that double diacritics are displayed on the whole 
display width above the two letters, when your usage would just be to put a 
standard width diacritic centered on the kerning space between them. For an 
acute accent, it's unlikely that doubling its width would be very readable, 
where it could be confused with a macron. May be your centered diacritic 
should be encoded like the other double diacritics.




Re: Questions about diacritics

2004-09-13 Thread Philippe Verdy
From: Peter Kirk [EMAIL PROTECTED]
Surely the intention is for INVISIBLE LETTER, combining acute to be 
equivalent (although it cannot be canonically equivalent) to spacing 
acute, U+00B4? But then would this kind of ligature mechanism with ZWNJ 
and U+00B4 be appropriate? I would think not.
INVISIBLE LETTER,combining acute will not be canonically equivalent 
effectively, depite it should produce and behave like the spacing acute.

As ZWJ is intended to indicate that there's effectively a ligature 
opportunity between two grapheme clusters, I don't see why one would not 
support a,ZWJ,SPACING ACUTE to kern the spacing acute on the right side of 
a. It won't create an accent *centered* above the letter, but it now allows 
the accent to move within the spacing area of the preceding letter.

I accept the fact that this is just a ligature opportunity for renderers, 
with no different semantics than in absence of the joiner. But I wonder if 
the digraph with the centered accent above is not simply that: the accent is 
a notation that does not change the semantics of the surrounding two vowels, 
with no orthographic consideration.

In that case, this is really a rendering feature, and using ZWJ could be 
appropriate here, notably because IL,combining acute will remain 
canonically distinct from U+00B4, which also has the wrong character 
properties (not a letter, this is a symbol and a word-breaker by itself...). 
Most uses of isolated diacritics however are mainly symbolic rather than 
orthographic. The IL however changes this, and becomes appropriate within 
the middle of words. 




Re: Questions about diacritics

2004-09-13 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED]
Philippe Verdy verdy underscore p at wanadoo dot fr wrote:
I also found an acute on diphtongs, placed on the boundary of both
letters (au, ei, eu, oe, and ui).
Wouldn't such diacritic be hold by the currently proposed invisible
base character (in the Public Review section of the Unicode website),
by encoding for example:
a,INVISIBLE LETTER,combining acute,u
If you think there's a grapheme cluster here, I suggest using ZWJ to
attach the three default grapheme clusters:
a,ZWJ,INVISIBLE LETTER,combining acute,ZWJ,u
to create a kerning ligature between the two vowels.
I thought one of the unstated, beneficial side effects of INVISIBLE
LETTER was that it might reduce the need for non-intuitive ZWJ and ZWNJ
sequences.  I may be wrong, though; I haven't followed the INVISIBLE
LETTER debate very closely.
In the (short) PR-41 document, the intent is really to substitute the SPACE 
character by another one to serve as a base character for isolated 
diacritics. (SPACE is known to cause problems in HTML/XML due to whitespace 
compression and in text parsers such as word-breakers). It won't deprecate 
the existing spacing diacritics block, but will avoid adding new spacing 
variants for the existing or future diacritics that may need them.

The current semantics of SPACE and its reuse to serve as a base for 
diacritics requires changing the character properties of SPACE when a 
diacritic follows it, and this is really a bad exception to the general 
framework where a combining sequence should inherit almost all its 
properties from its base character. I don't see the proposal as a way to 
avoid any use of joiners/non-joiners in its current form.




Re: Questions about diacritics

2004-09-14 Thread Philippe Verdy
Good point, but is the ZWNJ control supposed to be used as a base character 
with a defined height? I thought it was just a control for indicating where 
ligatures are preferably to avoid when rendering, leaving it fully ignorable 
if the renderer has no other option than rendering the ligature. For this 
application, the following character was a base character.
Other uses of ZWNJ before diacritics are in Indic scripts, or in the Hebrew 
proposals (in Public Review for Meteg), to control the meaning of the 
following character.

So I do think that the LateX2e compound word mark should map to 
ZWNJ,INVISIBLE LETTER rather than just ZWNJ...
The (-)burg abbreviation as (-)bg (with a non-spacing but non-combining 
breve) should then be encoded with the invisible letter, in combination with 
ZWNJ to make it non-spacing.)

- Original Message - 
From: Jrg Knappen [EMAIL PROTECTED]
To: Philippe Verdy [EMAIL PROTECTED]
Cc: Doug Ewell [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Tuesday, September 14, 2004 6:06 PM
Subject: Re: Questions about diacritics


In LaTeX2e with the Cork coding (for TeXnicians: \usepackage[T1]{fontenc})
there is a so-called compound word mark. It has the functions of
teh ZERO WIDTH NON JOINER in the UCS: It breaks ligatures, it can be used
to produce a final s in the middle of a word.
By design, it has zero width but x height. So it can be used to carry
accents to be placed in the middle between two characters.
My classic for this situation is the german -burg abbreviature often seen
in cartography: It is -bg. with breve between b and g. The abbreviature
-bg. without accent means -berg.
--Jorg Knappen



Re: Questions about diacritics

2004-09-14 Thread Philippe Verdy
Since INVISIBLE LETTER is spacing, wouldn't it make more sense to define
Isn't rather INVISIBLE LETTER *non-spacing* (zero-width minimum), even 
though it is *not combining* ?
I mean here that its width would be zero unless a visible diacritic expands 
it. It is then distinct from other whitespaces which have a non-zero minimum 
width, but still expand too with a diacritic above them (width expansion is 
normally part of the job for the renderer or positioning/ligating tables of 
characters in fonts).

I would expect that an INVISIBLE LETTER not followed by any diacritic will 
*really* be invisible, and will not alter the positioning of subsequent base 
characters (and would not even prevent their kerning into the previous base 
letter such as in CAPITAL LETTER V, INVISIBLE LETTER, CAPITAL LETTER A, 
where A can still kern onto the baseline below V.




Historic scripts for Albanian: Elsaban and Beitha Kukju

2004-09-16 Thread Philippe Verdy
This page:
http://www.omniglot.com/writing/albanian.htm
shows two historic scripts that have been used to write Albanian (Shqip):
- the Elsaban script in the 18th century, which looks like Old Greek for the 
language Tosk variant. However there are lots of unique letter forms, and 
mapping to Old Greek is not straightforward.
- the Beitha Kukju script invented in 1840 and named after its inventor. 
This second one looks very like a modified version of the Latin script (the 
scans reproduce handwriting), but with major changes in the letterforms and 
some unique letters for 'j, d-with-stroke, th, kj, ng, ks, tsj, and ts. It 
is quite hard to read for Latin readers, and some forms may cause confusion 
for Latin readers (notably the letters for e, d, d-with-stroke, h, y and ü; 
so I think it's a distinct script rather than a variant of the Latin script.

Are these alphabets represented in Unicode?
The page also gives the modern Latin alphabet (including Latin digraphs), 
based on Western European Latin letters.




Re: Questions about diacritics

2004-09-17 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED]
In the case of INVISIBLE LETTER, it seems likely -- based on the
comments of experts -- that the benefits outweigh the disadvantages.
But new control characters (and quasi-controls like IL) have tended to
cause more problems and confusion for Unicode in the past than new
graphically visible characters.  The possibility of misuse has to be
evaluated, and the rules do have to be stated clearly.  Combinations
involving IL plus SPACING ACCENT, or IL plus ZW(N)J, or whatever, should
be part of the rules; what effect should such combinations have, and are
they discouraged?  For IL, that is probably good enough.
The most important misuse of IL could be avoided by saying in the standard 
that a renderer should make this character visible if it is not followed by 
a combining character that it expects. This would avoid possible spoofing by 
including it within some critical texts such as people and company names in 
signatures. A candidate rendering would be the dotted circle and square as 
seen in the proposal, or a dotted square with IL letters inside. This 
glyph would appear even if visible controls editing mode is not enabled.




Re: Unibook 4.0.1 available

2004-09-17 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED]
Marion Gunn mgunn at egt dot ie wrote:
Is it really so hard
to make multi-platform, open-office-type utilities?
Actually, yes, it is.  Mac users don't want an application to be too
Windows-like, Windows users don't want an application to be too Mac-like
(we'll see how the latest version of Photoshop goes over), and isolating
all the differences in platform-specific modules while leaving the core
functionality in common modules is a lot of work.  If it were easy, it
would be done more often.
Isn't Java hiding most of these platform details, by providing unified 
support for platform-specific look and feel? Aren't there now many PLAF and 
themes manager available with automatic default selection of the look and 
feel of each platform?
Aren't there enough system properties in these development tools so that the 
application can simply consult these properties to autoadapt to the platform 
differences?
Some known issues were related to filesystem differences, but even on MacOS 
X, Linux or Windows, these systems have to manage multiple filesystems 
simulaneously, so a good application made only for one platform needs to 
consult filesystem properties to get naming conventions, etc... On Linux 
only, and now also on Solaris and AiX, the need to support multiple window 
managers also influences any single-platform development.
Also softwares have to adapt to various versions and localizations of the OS 
kernel and core libraries to get a wider compatible audience.
Whatever we do today, we nearly always need to separate the core modules of 
the application from its system integration layer, using various wrappers. 
Not doing that will greatly limit the compatibility of the application, and 
even customers don't know the exact details of how to setup the application 
to work in his environment.
It's certainly not easy, and there are tons of options, but writing a system 
wrapper once avoids many customer support costs later when a customer is 
furious of having paid for a product that does not work on his host. We are 
speaking here about software development, not about ad-hoc services for 
deployment on a unified platform (but even today, the cost of licences and 
upgrades makes that nearly nobody has a standard platform to deploy an 
application).




Re: Unicode Shorthand?

2004-09-18 Thread Philippe Verdy
From: Chris Jacobs [EMAIL PROTECTED]
- Original Message - 
From: Christopher Fynn [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Sunday, September 19, 2004 12:08 AM
Subject: Unicode  Shorthand?


Is there any plan to include sets of shorthand (Pitman, Gregg etc.)
symbols in Unicode? Or are they something which is specifically excluded?
Pitman and Gregg are common in English-speaking countries, but most of these 
shorthand methods work well only with a particular language and are specific 
to it.

Note that shorthand transcription is still not dead today, because of the 
natural speed of writing with it (more than 120 words/minute, instead of 
roughly 60 words/minute with stenotype or dactylography), and also because 
the quality of transcriptions from magnetic tapes or audio records is still 
highly discutable, notably when the audio environment is noisy (for 
juridical applications, it can be a big problem when one answer by a witness 
can't be understood clearly from the tape record).

One solution used today (because stenographs become rare and old) is that 
the stenotypist or dactylograph that transcript a conversation must be 
present when the tape is created.

I don't know if it is excluded. A reason to exclude it would be if it were 
a cipher of something already in.

The only set of shorthand I know something of, dutch Groote, follows the 
pronounciation of the words rather than the spelling.

Can shorthand be seen as a cipher of IPA ?
Not at all. Most shorthand do not reflect the same level of precision found 
in IPA, and the same sign represent several phonemes.

See for example the wellknown French stenographie Prévost-Delaunay method, 
with a small online presentation and initiation on 
http://perso.wanadoo.fr/lepetitstenographe/index.html
In this method, most signs have multiple meanings, and there are 
abbreviations for phonemic elements commonly found at end of words, plus 
specialized signs for common semantics or words that are specific to the 
French language.

It's not impossible to create a rendering system for such stenographic 
system, however the general layout is more complex than with traditional 
alphabets, because the layout of characters is highly dependant of the 
context of previous letters, and the system includes glyphic differences for 
initial, medial and final forms, and special joining rules that alter the 
glyph form, just to ease its fast transcription without holding up the 
drawing pen.




Re: Unicode Shorthand?

2004-09-18 Thread Philippe Verdy
From: D. Starner [EMAIL PROTECTED]
Christopher Fynn wrote:
Is there any plan to include sets of shorthand (Pitman, Gregg etc.)
symbols in Unicode? Or are they something which is specifically excluded?
They're a form of handwriting, which is generally excluded. Why do
they need to be encoded in a computer? General practice, at least,
is to transcribe them into standard writing first.
Don't forget that shorthand methods are still taught today, with methods 
published in books. Books are published today using special encodings or 
using image scans. Scanned images are often hard to create cleanly, and this 
is often a problem for the first readers of such publications, when the 
system requires carefully drawn signs, that would benefit from numeric 
composition.

There are good reasons why a shorthand-written text should be encoded as 
such, without going through transcription to the normal alphabetic system. 




Re: Unicode Shorthand?

2004-09-18 Thread Philippe Verdy
From: Christopher Fynn [EMAIL PROTECTED]
Philippe Verdy wrote:

It's not impossible to create a rendering system for such stenographic 
system, however the general layout is more complex than with traditional 
alphabets, because the layout of characters is highly dependant of the 
context of previous letters, and the system includes glyphic differences 
for initial, medial and final forms, and special joining rules that alter 
the glyph form,  
Sounds a bit like Arabic...
Not really, because the actual rendering is bidimensionnal, not linear. It's 
difficult to predict the line height, as the baseline changes according to 
the context of previous characters in the word, and its writing direction 
(forward or backward). 




Re: Unicode Shorthand?

2004-09-19 Thread Philippe Verdy
From: Christopher Fynn [EMAIL PROTECTED]
Philippe Verdy wrote:
Not really, because the actual rendering is bidimensionnal, not linear. 
It's difficult to predict the line height, as the baseline changes 
according to the context of previous characters in the word, and its 
writing direction (forward or backward).
Phillipe
As Werner mentioned, this is like Nastaleeq. All the things you mention 
are rendering issues - not character encoding issues - and not very 
different from things necessary to render some other complex
scripts. As long all these changes are based on contextual rules they can 
be handled with a fairly simple encoding once the essential characters 
that make up the script are determined.

regards
I do agree that this shorthand method looks very much like Arabic, but my 
answer was really about making a difference with IPA. This is clearly not a 
pure phonetic notation, it has its own orthographic conventions, as well as 
very unique rendering rules, which make systems capable of rendering Arabic 
not enough to render shorthands.

Your precision about Nastaleeq is correct, as this is the nearest script 
with which the standard French shorthand script looks like. But even with 
Nastaleeq there's a clear concept of a baseline that helps rendering it in 
an acceptable way. Rendering French shorthand on a constant baseline would 
be inacceptable: There's a baseline defined only for the begining of words, 
not for each individual character, and this left-to-right baseline is 
visible for the whole text, but only because words are very often 
abbreviated (there are also specific symbols for common abbreviations, and 
often articles or particles are not written, but if needed some functional 
suffixes are added).

My mother learned that script in the early 60's and used that throughout her 
carrier for her work as a secretary in a juridic domain. In many cases, most 
words are abbreviated by noting only the first 1 or 2 lexemes, and adding 
eventually a functional suffix. Not all words need to be noted, and she also 
used some personal abbreviated symbols for her most recurrent terms.

The script is really compact: a single A5 sheet of handdrawn shorthand was 
enough to note more than 2 A4 page of typesetted text (in Times our Courrier 
12 points). She was able to note more than 120 words per minute, to note 
conversations with several participants such as public meetings, discussions 
about juridic problems, negociations... She still used a magnetic tape, for 
the case where she would have forgotten to note some terms or if there were 
cases where she could not remember the exact meaning of some items, but she 
rarely needed to use it to transcript the noted text back to a typesetted 
form (using the shorthand notes was even more practical than using 
dictaphones when typesetting it in a word processor later, as she had an 
immediate global view of the sentences to type).

I have always been impressed by her ability of noting so many things in a so 
compact form and so fast.




Re: [OT] Decode Unicode!

2004-09-25 Thread Philippe Verdy
From: Curtis Clark [EMAIL PROTECTED]
on 2004-09-24 10:05 Peter Constable did quote:
After the DNA, the ASCII-Code is the most successful code on this
planet.
Things get more and more complex. DNA is a 2-bit code.
Not completely true. It is a bit less than 2 bits, due to its replication 
chains, and the presence of insertion points where cross-overs are possible. 
But the effective code is a bit more complex than just the ATCG system, as 
some studies have demonstrated that the DNA alone has no function out of its 
substrate, whose nature influence its decoding.

There are some extra pieces of information that are not coded directly in 
the DNA, and the DNA itself has a 3D structure which cannot be modeled 
completely with just this alphabet (try computing the position of sulfurs 
and oxidations only from this chain!).

Research on DNA solves this problem by isolating active subchains of the DNA 
whose behavior does not depend significantly on the substrate. The DNA is 
splitted by locus points where variation can occur. And not all of the DNA 
is actively coding useful information; large fragments are simply there to 
consolidate its structure, or to recover from replication damages.

In fact you can determine much more things from ARN fragments than from ADN 
itself. Simply because ARN is not only the replication of ADN, but also the 
result of its structuration in the substrate, with which it will help 
synthetize proteinic chains. Other information are also contained in the 
mediators that help transform the ARN information into proteins. Some of 
these mediators are sometimes external to the cell, or may come from 
parasitic agents (bacteries, virus), or live in synbiotic condition with the 
cell that need this pollution to live itself. Suppress those parasitic or 
synbiotic agents and the DNA alone will not allow the cell to survive...




Re: UTF-8 stress test file?

2004-10-11 Thread Philippe Verdy
From: Terje Bless [EMAIL PROTECTED]
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
Theodore H. Smith [EMAIL PROTECTED] wrote:
I'd like to see a UTF-8 stress test file.
The top result on Google for the query UTF-8 Stress Test is
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt.
This test file is out of date and incorrect: it uses Unicode incorrectly, 
where it should relate to the old RFC definition of UTF-8 referenced by 
previous versions of ISO/IEC 10646: in that file, all UTF-8 sequences with 5 
bytes or more are invalid (they are not boundary cases).
So the list of impossible bytes is longer than documented there.
The more exact definition of UTF-8, shared now by Unicode and by the current 
version of ISO/IEC 10646 is documented in the conformance section of the 
Unicode standard.
Still, this file will be useful to determine if your browser or editor 
effectively shows substitutes (like ?) where it should for all invalid 
sequences. But if your browser just says that this is not a UTF-8 encoded 
file, it will be right, if it does not display it at all:
- the file mixes UTF-8 and UTF-16
- invalid sequences may raise an exception that informs the user that the 
file can't be decoded.
- a browser or text editor may as well attempt to trigger its 
charset-autodetection mechanism to try finding another charset. If the file 
is then displayed assuming ISO-8859-1 and showing each byte of UTF-8 or 
UTF-16 sequences as if they were ISO-8859-1 characters, it will not be a 
conformance problem for the browser or text editor.




Re: UTF-8 stress test file?

2004-10-12 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED]
Theodore H. Smith delete at elfdata dot com wrote:
- the file mixes UTF-8 and UTF-16
Does this file mix UTF-8 and UTF-16? I thought it just had surrogates
encoded into UTF-8? Of course a surrogate should never exist in UTF-8.
You are right.  Philippe's statement was incorrect, and also puzzling.
Have you read the file content? It clearly and explicitly speaks about
UTF-16, which has nothing to do in a text file for UTF-8, unless the file
was used as a test for CESU-8 (which is not UTF-16 as well, and not even
UTF-8). My statement was correct: it is based on the fact that the test file
was created for the older (RFC version) of UTF-8 used in old versions of ISO
10646, and never referenced (at least explicitly until the v4.01
clarification) by Unicode in any version.



Re: UTF-8 stress test file?

2004-10-12 Thread Philippe Verdy
From: Clark Cox [EMAIL PROTECTED]
unless the file was used as a test for CESU-8
The whole point of the CESU-8-like section is that it is not legal UTF-8.
Except that the document does not even cite CESU-8 but only UTF-16! The 
text itself is puzzling as well as nearly all its suggestions about 
conformance levels or the way the text should be rendered, or the way a 
parser should recover after encoding violations...




Re: UTF-8 stress test file?

2004-10-12 Thread Philippe Verdy
From: Philipp Reichmuth [EMAIL PROTECTED]
Don't you think you are stretching things a bit?  This is an UTF-8 parser 
stress test file.  If an application opens it in a different encoding, 
well, of course the results will be different, and things will not look 
UTF-8-ish.  Again, this is a non-issue.  It's like distributing a Linux 
binary for testing something and then getting complaints that it doesn't 
work under DOS and that it shouldn't make assumptions on operating 
systems.
That's not the good point I wanted to focus. Things CANNOT look UTF-8-ish 
in a UTF-8 conforming editor or browser that will correctly detect all 
encoding errors in that file, and thus will never properly present the text 
properly aligned. What a conforming editor or browser *may* eventually do is 
to recover and mandatorily signal to the user the positions of errors 
(possibly by using a replacement glyph as if each error was coding a U+FFFD 
substitute), but how many errors will you signal given that the error 
recovery level is not defined in the Unicode/ISO/IEC UTF-8 standard?
Even in the old ISO/IEC10646 standard, recovery is only possible after 
errors only if uninterpretable byte sequences were still properly parsed 
into sub-sequences (of unspecified length) where a substiture could be used.

The problem is in the length of each invalid byte sequence; for example, if 
there's a 4-bytes old UTF-8 encoding sequence (or longer) the error will be 
detected at the first byte, recovery will take place at the second byte 
after the first byte as been interpreted as a invalid sequence represented 
by a substitute glyph, but then each of the immediately following trailing 
byte will signal an error.

Suppose that the parser recovers until it can find a new starter byte, it 
will still need to parse this byte to see if its a leading byte for a longer 
sequence, so the recovery is not necessarily immediately possible after the 
first invalid byte, or after the supposed end of the byte sequence. Now if 
the parser will reover by skipping all bytes until a valid sequence is 
found, there will be only 1 encoding error thrown on the leading byte, and 
only 1 substitution glyph.

We are navigating within unspecified areas where error recovery after 
decoding errors is not defined in the current UTF-8 standard itself (not 
even in the old RFC version with ISO/IEC 10646-1:2000)

And as I said, the document itself is not complete enough, because it 
forgets other invalid sequences for non-characters.




Re: internationalization assumption

2004-09-30 Thread Philippe Verdy
From: Antoine Leca [EMAIL PROTECTED]
On Tuesday, September 28th, 2004 03:22 Tom wrote:
Let's say.  The test engineer ensures the functionality and validates
the input and output on major Latin 1 languages, such as German,
French, Spanish, Italian,
Just a side point: French cannot be fully addressed with Latin 1.
True, due to the missing (but rare) oe or OE ligature (which is present in
the newer Latin 9, as well as in the Windows ANSI codepage 1252 for western
European languages).
Anyway, no French users actually complain of this omission: either they use
ISO-8859-1 and the ligatures will simply be replaced by separate vowels
(which is still correct for French collation, even though the strict French
orthograph requires using a ligature when *rendering*; in addition, French
keyboards typically never include a key to enter these ligatures, which are
only entered with assisted word processors with on-the-fly
autocorrection), or they will use the Windows 1252 codepage without seeing
that these characters were added to Latin 1 by Microsoft in its Windows
codepage.
A few common sample words that use these ligatures are oeil (english:
eye), oeuf (english: egg) and boeuf (english: beef), and coeur
(english: heart). (Note that this message does not use the mandatory
ligature).
There are some other words, but they are really uncommon in French
conversations (most of them are in the medical and botanic vocabulary).
This ligature cannot be automated so simply in renderers, because there are
exceptions: see coexister where the two vowels are clearly voiced
separately and must never be ligated. But one way to determine if oe must
be ligated in French is when it is followed by another vowel (normally an
'i' or 'u'), and if the e has no accent.
The ae ligature is used in French, but not in the common language (I think
it is used only in some technical juridic or religious terms, inherited from
Latin, or in some medical and botanic jargon): I can't even remember of one
French word that uses it; that's why there were some fonts designed for
French where the oe and OE ligatures replaced the ae and AE
ligatures.
(Note that I say ligature and not vowel, because it is their actual
usage in French, that also matches its collation rules).
With those considerations, would a software that only supports the 
ISO-8859-1 character set be considered not ready for French usage? I think 
not, and even today most French texts are coded with this limited subset, 
without worrying about the absence of a rare ligature, whose absence is 
easily infered by readers.




Re: internationalization assumption

2004-09-30 Thread Philippe Verdy
About the French ligatures 'oe' (and 'ae'), I should have noted this 
excellent summary page (in French) on its usage and history:
http://fr.wikipedia.org/wiki/Ligature_(typographie)
Note that Latin- or Greek-inherited words use the ligature when the vowels 
are not to be pronounced separately, but with the etymological 'o' not 
vocalized. So it remains only the final 'e' vowel, sometimes pronounced like 
'é', or more recently and very commonly like the digraph 'eu'.

The French page on Wikipedia is more complete than the corresponding English 
page; but the German page contains interesting information about ligatures 
in German or other central european languages.




Re: internationalisation assumption

2004-10-01 Thread Philippe Verdy
I use my own keyboard with the standard AZERTY French layout with some 
extensions. I would not use the QWERTY-based Swedish layout.
Can be downloaded for Windows on
http://www.rodage.org/pub/French-Sahel.html
(built with MSKLC, available for free under LGPL).
Its layout is shown on http://www.rodage.org/pub/French-Sahel.pdf

- Original Message - 
From: Stefan Persson [EMAIL PROTECTED]
To: Unicode Mailing List [EMAIL PROTECTED]
Sent: Thursday, September 30, 2004 5:05 PM
Subject: Re: internationalisation assumption


Philippe Verdy wrote:
in addition, French
keyboards typically never include a key to enter these ligatures, which 
are
only entered with assisted word processors with on-the-fly
autocorrection
In that case, I'd recommend French people to use my Swedish Linux keyboard 
which *does* contain the  ligature.  For Swedish people, the  ligature 
fills a much smaller usage; it's only used by Swedes who need to write 
documents in French, or use English transcriptions of ancient Greek names 
such as dipus.

Stefan



Re: Grapheme clusters

2004-10-06 Thread Philippe Verdy
From: Chris Harvey [EMAIL PROTECTED]
The users seem determined to put the entire alphabet into the PUA, thus 
making a single character for ng, kw, ii etc. I would like to be 
able to present them with something that works and avoid this kind of 
catastrophe.
A better alternative to PUAs, which would require specific fonts and no 
interopable solution would be to use controls that make explicit grapheme 
clusters: ZWJ notably, and make sure that the editor handles it effectively 
as a single cluster, including for backspace.

Or, may be using existing combining modifier letters, even if they look like 
superscript in existing fonts (if you are ready to go to PUAs, you would 
need to develop a font for them), but as we don't know the whole extents of 
the alphabet, it's hard to determine which solution is best.

I am assuming (I'm possibly wrong) that you'll need it to support some 
African languages, and if so, there are existing proposals to increase their 
support in Unicode with pending new Latin letters. Using PUAs could be an 
interim solution, before new characters are introduced, notably if you need 
combining modifier letters to act with the base letter as a single cluster.

If you need that to support the Latin transliteration of Native North 
American languages that you support on your web site, as a convenient tool 
allowing a reverse transliteration to the native script (which has 
constraints on its syllabic structure), and a convenient way to fix the 
Latin orthography in order to create richer contents transliterated 
appropriately and automatically into the native script, may be you need 
really a specific editor that can check and enforce the Latin orthography.

For example you cite the case of Pacific coast schwas, raised consonants and 
ejectives (like  kw q), or Hawayian long vowels (with macrons, rarely 
supported in fonts) which are difficult to enter with existing keyboards and 
fonts. Using a more basic ASCII-based orthography seems like an input method 
for such languages, and an intermediate before the production of actual 
existing Unicode characters using the proper combining or modifier letters 
(in that case, Unicode itself is not the issue, and you may wonder how to 
create an input method editor which can show a simplified ASCII-only 
transliteration which can reliably be converted to the more exact 
orthography.




Re: internationalization assumption

2004-10-07 Thread Philippe Verdy
RE: internationalization assumptionWell the main issue for 
internationalization of software is not the character sets with which it was 
tested. It is in fact trivial today to make an application compliant with 
Unicode text encoding.

What is more complicate is to make sure that the text will be properly 
displayed. The main issues that cause most of the problems come in the 
following area:

- dialogs and GUI interfaces need to be resized according to text lengths
- a GUI may have been built with a limited set of fonts, all of them with 
the same line height for the same point size; if you have to display Thai 
characters, you'll need a larger line height for the same point size.

- some scripts are not readable at small point sizes, notably Han sinograms 
or Arabic

- the GUI layout should be preferably reversed for RTL languages.
- you need to be aware BiDi algorithm and you'll have to manage the case of 
mixed directions each time you have to include portions of texts from a 
general LTR script within a RTL interface (for Hebrew or Arabic notably): 
ignoring that, your application will not insert the appropriate BiDi 
controls that are needed to properly order the rendered text, notably for 
mirrored characters such as parentheses. For some variable inclusions in a 
RTL resource string, you may need to insert some surrounding RLE/PDF pair so 
that the embedded Latin items will display correctly.

- The GUI controls such as input boxes need should be properly aligned so 
that input will be performed from the correct side.

- Tabular data may have to be presented with distinct alignments, notably if 
items are truncated in narrow but extensible columns (traditionally, tabular 
text items are aligned on the left and truncated on the right, but for 
Hebrew or Arabic, they should be aligned and truncated in the opposite 
direction)

- You have to be aware of the variation of scripts that may be used even in 
a pure RTL interface: a user may need to enter sections of texts in another 
script, most often Latin. You have to wonder how these foreign text items 
will be handled.

- In editable parts of the GUI, mouse selection will be more complex than 
what you think, notably with mixed RTL/LTR scripts.

- You can't assume that all text will be readable with a fixed-width font. 
Some scripts require using variable-width letters.

- You have to worry about grapheme clusters, notably in Hebrew, Arabic, and 
nearly all Indian scripts. This is more complex than what you think for 
Latin, Gree, Cyrillic, Han, Hiragana or Katakana texts. Even with the Latin 
script, you can't assume that all grapheme clusters will be made of only 1 
character. For various reasons, common texts will be entered using combining 
characters, without the possibility to make precomposed clusters (this is 
specially true for modern Vietnamese that uses multiple diacritics on the 
same letter).

- Text handling routines, that change the presentation of text (such as 
capitalisation) will not work properly or will not be reversible: even in 
the Latin script, there are some characters which are available with only 1 
case. Titlecasing is another issue. Such automated presentation effects 
should be avoided, unless you are aware of the problem.

- Plain-text searches often need to support indifferent case. This issue is 
closely related to collation order, which is sensitive to local linguistic 
conventions, and not only to the used script. For example, plain-text search 
in Hebrew will often need to support searches with or without vowel marks, 
which are combining characters, simply because they are optional in the 
language. When this is used to search and match identifiers such as 
usernames or filenames, various options will be exposed to you. In addition, 
there are lots of legacy text that are not coded with the most accurate 
Unicode character, simply because they are entered with more restricted 
input methods or keyboards, or were coded with more restricted legacy 
charsets (the 'oe' ligature in French is typical: it is absent from 
ISO-8859-1 and from standard French keyboards, although it is a mandatory 
character for the language; however it is present in Windows codepage 1252, 
and may be present in texts coded with it, because itwill be entered through 
assisted editors or word processors that can perform autocorrection of 
ligatures on the fly)

- GUI keyboard accelerators may not be workable with some scripts: you can't 
assume that the displayed menu items will contain a matching ASCII letter, 
so you'll need some way to allow keyboard navigation of the interface. This 
issue is related to accessibility guidelines: you need to offer a way for 
users to see which keyboard accelerators they can use to navigate easily in 
your interface. Don't assume that accelerators for one language will be used 
as easily for another language.

- toolbar buttons should avoid graphic icons with text elements, unless 
these items are also 

Polytonic Greek pneuma letters (spirits) and half-eta glyphs

2004-10-07 Thread Philippe Verdy
This page on the French version of wikipedia notes that Polytonic Greek used 
in the 3rd century B.C. alternate letters to denote the initial spirits 
(pneuma dasú for the hard spirit, and pneuma psílon for the soft 
spirit), rather than the modern 9-shaped combining accents.

http://fr.wikipedia.org/wiki/Diacritiques_de_l%27alphabet_grec
(Note: to see all letters in Internet Explorer, you have to configure it to 
use the Arial Unicode MS font from Office or the free Code2000 font, and 
to indicate to Internet Explorer, in the Accessibility options, to ignore 
the fonts styles selected on web pages: the default font selected in the 
Wikipedia CSS stylesheet for Internet Explorer forces the Arial font which 
does not contain glyphs for all these characters; apparently Wikipedia has 
problems to find a reliable way to configure their stylesheets to work with 
various versions of Windows or IE).

These letters were noted initially by Aristophane with a variant of the 
historic H letter that noted the /h/ sound (but was later borrowed when it 
became unused to note the sound /è/ with eta), by cutting the H (eta) 
glyph in two half-parts (and sometimes found with L-shaped glyphs without 
the lower part of the vertical). These historic phonemes subsist today only 
as diacritics for modern polytonic greek, but this is not the case of 
historic texts where they may still be pronounced /h/ on initial vowels or 
diphtongues or rho.

The same page gives an encoding for the latest non-combining form where 
these spirits are represented by upper tacks (before they became 
diacritics). My question is: can these historic half-eta letters be unified 
with these tacks, or are they distinct letters?

Are there variants encoded for these historic half-eta letters, to mean that 
they should not be shown with the upper tack glyphs but with the historic 
half-eta glyphs?




Re: text-transform

2004-10-23 Thread Philippe Verdy
From: fantasai [EMAIL PROTECTED]
Comments on CSS (but not how-to questions) should be directed to
the www-style mailing list at w3.org, not unicode:
  http://lists.w3.org/Archives/Public/www-style/
OK for the numeric versus capitalize|uppercase|lowercase remark, which 
is related to form validation and probably has nothing to do with the 
Unicode list. But the general discussion of the behavior of BiDi with 
vertical scripts, or with horizontal scripts rendered vertically (or even in 
boustrophedon), is still something that the Unicode BiDi algorithm does not 
solve completely. There are bidirectional properties that are inherent to 
scripts and their characters, and those are in the direct focus of Unicode 
standardization.

Although this was discussed in relation to CSS3, it is still a big issue 
for Unicode, because it is not a problem specific to CSS: it directly affects 
any rendering of plain text.

The CSS3 article was very interesting to read because it really speaks about 
problems that exist today with scripts already in Unicode, and for which the 
BiDi properties do not seem sufficient to write a generic renderer 
for all of them (including the interaction of Latin/Greek/Cyrillic with 
Han/Hiragana/Katakana, or the special interaction of Hiragana/Katakana within 
Han text).

I bet that if the proposed CSS3 model works, it will demonstrate which 
properties Unicode needs to add to its standard, for use in other, 
non-CSS-based applications. Maybe this will require new BiDi controls and a 
more complex algorithm to handle them. For now, the only safe way to do 
this is to base the augmented properties on the script property of 
characters (but still with ambiguity problems for general-purpose characters 
like punctuation and spaces).




Re: basic-hebrew RtL-space ?

2004-11-01 Thread Philippe Verdy
From: kefas [EMAIL PROTECTED]
Inserting unicode/basic-hebrew results in a convenient
RtL, right-to-left, advance of the cursor, but the
space-character jumps to the far right.  Is there an
RtL-space?
In MS-Word and OpenOffice I can only change whole
paragraphs to RtL-entry.  But quoting just a few
words in Hebrew WITHIN a paragraph would be helpful to
many.
And this is exactly what the embedding controls are made for (see the small 
sketch after this list):
- surround an RTL subtext (Hebrew, Arabic, ...) within LTR paragraphs 
(Latin, ...) with an RLE/PDF pair;
- surround an LTR subtext (Latin, ...) within RTL paragraphs (Hebrew, ...) 
with an LRE/PDF pair.
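As a concrete illustration, here is a minimal Java sketch of the first case 
(the Hebrew word and the English sentence are invented sample data; U+202B is 
RLE and U+202C is PDF):

    public class BidiEmbeddingDemo {
        public static void main(String[] args) {
            // A Hebrew word (shalom) quoted inside an LTR English sentence.
            String hebrew = "\u05E9\u05DC\u05D5\u05DD";
            // RLE (U+202B) opens the embedded RTL run, PDF (U+202C) closes it.
            String sentence = "The word \u202B" + hebrew + "\u202C is Hebrew.";
            System.out.println(sentence);
        }
    }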

There's no need for a separate RTL space, given that the regular ASCII SPACE 
(U+0020) character is used in all RTL texts as the standard default word 
separator; it has a weak directionality that does not force a direction 
break but is inherited from the surrounding text.

A good question, however, is whether the space should inherit its direction 
from the previous text or from the next one (see the small sketch after this list):
- If the previous text has a strong directionality, then the space should 
inherit its direction. This should be the case every time you are entering 
text with a space at the end: it's very disturbing to see this new space shift 
to the opposite side when entering some space-separated Hebrew words within 
a Latin text, because the editor assumes that no more Hebrew will be added 
on the same line (this causes surprising editing errors, for example when 
creating a translation resource file where translated resources are prefixed 
by an ASCII key, such as when editing a .po file for GNU programs using 
gettext()).
- If the previous text in the same paragraph has no directionality, then the 
space inherits its direction from the text after it (if that text has a strong 
directionality);
- if that does not work either, then a global context for the whole text should be 
used, or alternatively the directionality of the end of the previous 
paragraph (this influences where the cursor goes to align such a 
weakly-directed paragraph with the previous paragraph, including the default 
start margin position).
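The resolved levels can be inspected with the standard java.text.Bidi class; 
in this small sketch (the ASCII key and the Hebrew word are invented sample 
data, mimicking the .po scenario above) the trailing space resolves to the 
LTR paragraph level rather than to the Hebrew level, which is why it appears 
to jump while typing:

    import java.text.Bidi;

    public class TrailingSpaceLevels {
        public static void main(String[] args) {
            // An LTR line ending with a Hebrew word and a freshly typed space.
            String text = "key = \u05E9\u05DC\u05D5\u05DD ";
            Bidi bidi = new Bidi(text, Bidi.DIRECTION_LEFT_TO_RIGHT);
            for (int i = 0; i < text.length(); i++) {
                // Even level = LTR, odd level = RTL.
                System.out.println(i + ": level " + bidi.getLevelAt(i));
            }
            // The Hebrew letters get level 1; the trailing space stays at level 0.
        }
    }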

The regular BiDi algorithm should be used to render a complete text, but 
strict BiDi rules should not always be obeyed while composing a text, 
where the current cursor position should act as a sentence break with a 
strong inherited directionality: the text can then be redirected at this 
position once the cursor moves to other parts of the text.

I don't think this is an issue of renderers but of editors (notably Notepad, 
where you won't know exactly where to enter a space while editing, 
unless you use the contextual menu that allows switching the global default 
directionality and swapping the alignment to the side margins; sometimes, when 
you want to know where the LRE/RLE and PDF BiDi controls are, it's nearly 
impossible to determine visually in Notepad, unless you use an external 
tool such as native2ascii, from the Java SDK, to change the encoding into 
clearly visible marks). That is unfortunate, given that Notepad (since Windows 
XP) offers a directly accessible contextual menu to enter BiDi controls 
and to change the global direction and alignment to the side margins. (But 
Notepad lacks a visible-controls editing mode that would resolve such ambiguities.)

Related: The other Hebrew characters in the alphabetic
presentation forms insert themselves in LtR-fashion?
Why this difference?
I read about Logical and Visual entry, but don't see
how that answers my 2 questions above.
Visual entry should never be used. It was used with some legacy encodings to 
render text on devices that don't implement the BiDi algorithm and can only 
render text as LTR. Nobody enters RTL text in pseudo-visual LTR order; 
only the logical input order is needed.

But don't confuse the input order and the encoding order, as they can be 
different (they should not be if the text is converted and stored in Unicode, 
where only the logical order is legal for any mix of Latin, Greek, Cyrillic, 
Hebrew, and Arabic).

The case of Thai is different because its input order is (historically) 
visual rather than logical, and the text is then encoded using the same 
(visual) order. This was not changed for Thai in Unicode, to keep its 
compatibility with the national Thai standard TIS-620 (and its further 
revisions). So even though Thai uses a non-logical order, its input order 
and encoding order are the same.

The difference between encoding orders is known mainly for historic texts created 
for modern Hebrew, and more rarely Arabic, or for texts encoded in a private 
pre-press encoding used to prepare the global layout of pages (such texts 
are more easily and quickly processed in complex page layouts if they are 
prepared in visual order before being flowed into the page layout template; 
such applications use specific encodings in a richer rendering context than 
just plain text, so this is out of the scope of the Unicode standard itself).




Re: Opinions on this Java URL?

2004-11-13 Thread Philippe Verdy
From: A. Vine [EMAIL PROTECTED]
I'm just curious about the \0 thing. What problems would having a \0 in 
UTF-8 present, that are not presented by having \0 in ASCII? I can't see 
any advantage there.
Beats me, I wasn't there.  None of the Java folks I know were there 
either.
The problem is in the way strings passed to JNI via the legacy 
*UTF() APIs are accessed: there is no indicator of the string length, so it 
would be impossible to know whether a \0 terminates the string if that byte 
were allowed in the content of the string data.
The 0xC0 0x80 encoding is a way to escape this character, so that it can be 
passed to JNI using the legacy *UTF() APIs that have existed since Java 1.0.
This encoding is also part of the Java class file format, where string 
constants are encoded the same way. Note that the Java String object allows 
storing ANY UTF-16 code unit, including invalid ones (0xFFFE and 0xFFFF), as 
well as isolated or unpaired surrogates. So Java internally does not use 
UTF-16 strictly. Using a plain UTF-8 representation would have prevented the 
class format from supporting such string instances, which are invalid for 
Unicode, but not in Java. Using CESU-8 would not work either.
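A small sketch of that escaping, using the standard DataOutputStream.writeUTF() 
(which produces the same modified UTF form as the legacy JNI *UTF() APIs and 
the class file constant pool); the sample string is invented:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class ModifiedUtfDemo {
        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            // writeUTF() emits a 2-byte length prefix, then modified UTF-8,
            // in which U+0000 becomes the overlong pair 0xC0 0x80 instead of 0x00.
            out.writeUTF("A\u0000B");
            for (byte b : bytes.toByteArray()) {
                System.out.printf("%02X ", b);
            }
            // Prints: 00 04 41 C0 80 42
        }
    }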

There are legacy Java applications that use the String object to store 
unrestricted arrays of unsigned 16-bit integers (the Java native type char), 
without any implication that these represent valid characters. Such a 
representation has the advantage of allowing fast loading of classes containing 
large constant pools: these classes don't run long class initialization code, 
like the code executed when initializing an array of an integer type, but 
directly use the String constant pool, which is decoded and loaded into chars 
by native CPU code in the JVM rather than by interpreted bytecode that will 
never be compiled. This may seem a bad programming practice, but the Java 
language spec allows it, and Sun will not remove the possibility without 
breaking compatibility with those programs.

This modified UTF should then be regarded as a specific encoding scheme 
that supports the unrestricted encoding form used by Java String instances 
(extended UTF-16, more exactly UCS-2) which, by initial design, can 
represent and store *more* than just valid Unicode strings.

The newer JNI interface allows reading/returning String instance data 
directly in the UCS-2 encoding form, without using the specific modified UTF 
encoding scheme: there is an API parameter for the actual string 
length, so the interface is binary safe. Applications can then use it to 
pass any valid Unicode string, or even invalid ones (with invalid code units 
or unpaired surrogates) if they wish. There is no requirement that this data 
represent only true characters. Note that even Windows uses an unrestricted 
UCS-2 representation in its Unicode-enabled Win32 APIs.

The newer UCS-2 interface is enough for JNI extensions to generate true 
UTF-8 if they wish. I don't see the point of adding additional support 
for true UTF-8 in JNI, given that this support is trivial to implement using 
either the null-terminated *UTF() JNI APIs or the UCS-2-based JNI APIs. In 
addition, this support is not really needed for performance: the UCS-2 
interface is the fastest one for JNI, as it avoids the need for the JNI 
extension to allocate internal work buffers when talking to native OS APIs 
that can also use UCS-2 directly, without extra code converters.




Re: U+0000 in C strings (was: Re: Opinions on this Java URL?)

2004-11-15 Thread Philippe Verdy
- Original Message - 
From: John Cowan [EMAIL PROTECTED]
To: Doug Ewell [EMAIL PROTECTED]
Cc: Unicode Mailing List [EMAIL PROTECTED]; Philippe Verdy 
[EMAIL PROTECTED]; Peter Kirk [EMAIL PROTECTED]
Sent: Monday, November 15, 2004 7:05 AM
Subject: Re: U+0000 in C strings (was: Re: Opinions on this Java URL?)


Doug Ewell scripsit:
As soon as you can think of one, let me know.  I can think of plenty of
*binary* protocols that require zero bytes, but no *text* protocols.
Most languages other than C define a string as a sequence of characters
rather than a sequence of non-null characters.  The repertoire of 
characters
than can exist in strings usually has a lower bound, but its full 
magnitude
is implementation-specific.  In Java, exceptionally, the repertoire is
defined by the standard rather than the implementation, and it includes
U+0000.  In any case, I can think of no language other than C which does
not support strings containing U+0000 in most implementations.
It is exactly this inclusion of U+0000 as a valid character in Java strings 
that requires the character to be preserved in the JNI interface and in 
String serializations.

Some here think this is broken behavior, but there is no other 
simple way to represent this character when passing a Java String instance 
to and from a JNI interface, or through serializations such as in class files.

My opinion is that the Java behavior does not define a new encoding; it is 
rather a transfer encoding syntax (TES), so that it can effectively 
serialize String instances (which are UCS-2 encoded using the 16-bit char 
Java datatype, and not only the UTF-16 restriction of UCS-2 which also 
requires paired surrogates, but which does not make the '\uFFFF' and '\uFFFE' char 
or code unit illegal, as they are simply mapped to the U+FFFF and U+FFFE code 
points, even though these code points are permanently assigned as noncharacters 
in Unicode and ISO/IEC 10646).

The internal working storage of Java Strings is not a character set (CCS or 
CES), and these strings are not necessarily bound to Unicode (even if Java 
provides lots of Unicode-based character properties and character set 
conversion libraries), as they can just as well store other charsets, using 
charset encoding/decoding libraries other than those found in the java.io.* and 
java.text.* packages. Once you admit that, Java String instances are just 
arrays of code units, not arrays of code points, their interpretation as 
encoded characters being left to other layers.

Should there exist any successor to Unicode (or a preference in a Chinese 
implementation for handling String instances internally with GB18030), with 
different mappings from code units to code points and characters, the 
working model of Java String instances and the char datatype would not be 
affected. This would still conform to the Java specifications, provided the 
standard java.text.*, java.io.*, and java.nio.* packages that perform the 
various mappings between code units and code points, characters and byte 
streams are not modified: new alternate packages could be used, without 
changing the String object and the unsigned 16-bit integer char datatype.

In Java 1.5, Sun chose to support supplementary characters without changing 
the char and String representations, but the Character class was extended 
with static methods that represent code points as 32-bit int values 
and map any Unicode code point in the 17 planes to and from 
char code units. The String class has also been extended to allow parsing 
char-encoded strings by int code points (so with automatic support 
and detection of surrogate pairs), but the legacy interface was preserved. 
In ICU4J, the UCharacter class does not use a static representation but 
stores code points directly as int, unlike Character, whose instances 
still store only a single 16-bit char and which offers only static support 
for code points: there is still no Character(int codepoint) constructor, 
only a Character(char codeunit), because Character keeps its past 
serialization for compatibility, and Character is also bound to the 16-bit 
char datatype for object boxing (automatic boxing only exists in Java 1.5; 
explicit boxing in previous and current versions is still supported).
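A minimal sketch of those Java 1.5 additions (the sample string is invented; 
codePointCount, codePointAt, and Character.toChars are the standard methods 
being described):

    public class CodePointDemo {
        public static void main(String[] args) {
            // U+1D11E MUSICAL SYMBOL G CLEF is stored as a surrogate pair of chars.
            String s = "clef: \uD834\uDD1E";
            System.out.println(s.length());                       // 8 char code units
            System.out.println(s.codePointCount(0, s.length()));  // 7 code points
            int cp = s.codePointAt(6);       // reads the pair as one int code point
            System.out.printf("U+%04X%n", cp);                    // U+1D11E
            char[] units = Character.toChars(cp);                 // back to 2 code units
            System.out.println(units.length);                     // 2
        }
    }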

If Java needs any further extension, it is to include the ICU4J UCharacter 
class, which would allow storing 32-bit int code points, or building a 
UCharacter from a char-coded surrogate pair of code units, or from a 
Character instance; and also to add a UString class using internally 
arrays of int-coded code units, with converters between String and 
UString. Such an extension would not need any change in the JVM, just new 
supported packages.

But even with all these extensions, the U+0000 Unicode character would 
remain valid and supported, and there would still remain the need to support 
it in JNI and in internal JVM serializations of String instances. I really 
don't like the idea of some people

Re: Opinions on this Java URL?

2004-11-15 Thread Philippe Verdy
From: Christopher Fynn [EMAIL PROTECTED]
Isn't it already deprecated?  The URL that started this thread
http://java.sun.com/j2se/1.5.0/docs/api/java/io/DataInput.html
is marked as part of the Deprecated API
Deprecated does not mean that it is not used. This interface remains 
accessible when working with the internal class file format. However, I don't 
understand why the storage format of the string constant pool was not changed 
when the class format was updated in Java 1.5.

(Classes compiled for Java 1.5 won't run on previous versions of Java, due 
to the addition of new class interface elements like annotations and 
generics; however, classes that don't use these new features can still be 
compiled in Java 1.5 for compatibility with Java 1.4 and lower, and they 
will still run in Java 1.5. This means that Java 1.5 still needs to 
recognize the legacy class format that uses the modified UTF serialization 
of the String constant pool. As Java 1.4.1 also introduced support for 
supplementary characters, it might have been useful for Sun to change its 
modified UTF encoding in class files at the same time, encoding supplementary 
characters as 4 bytes where possible, i.e. when they are represented in the 
String instance as a valid surrogate pair, instead of the 6 bytes used today 
for the separate encoding of the two surrogates, to reduce the size of String 
constant pools containing them. I don't know whether this has been done in the 
new compact distribution format that replaces the legacy zipped JAR format.)




Re: Eudora 6.2 has been released

2004-11-19 Thread Philippe Verdy
From: Peter Kirk [EMAIL PROTECTED]
On the contrary, it is your mobile sync software which is of no use if 
communication with the outside world is required, if it doesn't support 
standards-conformant mail clients like Thunderbird, but only communicates 
in non-standardised ways with the products of a single company.
Note that some PDAs come bundled with synchronization software for the PC that 
supports only Outlook (not even Outlook Express), and with a CD-ROM and 
license to install Outlook (Toshiba PDAs, for example, running on Windows 
CE).

The synchronization uses Microsoft ActiveSync, which natively supports 
Outlook local folders only (it does not work if you have anything other 
than a standard POP3 account or a private Exchange account configured, 
because it won't synchronize other types of Outlook folders, like HTTP folders 
on MSN or Hotmail... even though those are also Microsoft products).

And there's nothing for Mac users.
Well, you're free to buy other PDAs, or to buy and install other 
synchronization software for your PDA. Synchronization still lacks good 
standards, as do instant messaging and chat, or the management of personal 
calendars and contact lists...

Maybe there's something in Sun's OpenOffice that can connect you to 
Windows CE PDAs or other types of PDAs?




Re: Unicode HTML, download

2004-11-20 Thread Philippe Verdy
From: Edward H. Trager [EMAIL PROTECTED]
Hi, Elaine,
There is of course no limit to how many writing systems
one can have on a Unicode-encoded HTML page.
My recommendations would be to:
(3) Use Cascading Style Sheet (CSS) classes to control display of fonts
...
   A better CSS class would additionally specify the font-family,
   for example, something like the SIL Ezra font
(http://scripts.sil.org/cms/scripts/page.php?site_id=nrsiid=EzraSIL_Home)
(4) Since your readers may not have certain fonts, In the case of legally
downloadable fonts like SIL Ezra, I would definitely put a link to the
download site so readers can download the (Hebrew) fonts if they need 
it to view
your page.
Probably bad advice here: Elaine is speaking about a technical glossary, which 
would probably be written in modern Hebrew, where there is not much 
complication with traditional accents.

So any suitable font for modern Hebrew (on Windows XP, the default fonts 
provided are suitable: Arial, Tahoma, Times New Roman, David, David 
Transparent, Myriam, Myriam Transparent; with Office installed: Arial 
Unicode MS) could be preferred by users and configured in their browser.
Why force them to use SIL Ezra in the CSS stylesheet?

At least you should tell Elaine to use a font-family list with multiple font 
names, in order of preference, separated by commas, and quoted 
when a font name is not a single identifier:


<!DOCTYPE ...>
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>some title</title>
<style type="text/css"><!--
.he {
   font-family: "SIL Ezra", "Arial Unicode MS", David, Myriam, Tahoma,
Arial, sans-serif;
   direction: rtl;
}
.r {
 text-align: right;
 margin-right: 2em;
}
--></style>

</head><body>
<p class="he r">(some hebrew text goes here)</p>
</body></html>

(Note that, as in the above example, you can specify multiple class names 
separated by spaces in the class attribute, so it's possible to create 
style rules for localized font families that can be reused independently 
with other style classes. This may be useful notably if the document 
displays multiple languages in a tabular format where many attributes in a 
column should be set nearly identically to those of another column, differing 
only in the font families to use for each language/script.)




Re: Unicode HTML, download

2004-11-20 Thread Philippe Verdy
From: E. Keown [EMAIL PROTECTED]
Great idea!  I code in the seldom-seen AHTML ('Archaic
HTML'), as you all suspected.
A friend tested a page I wrote last month and found it
wouldn't work on any of his 5 browsersoh well.
Well, Elaine, if you want maximum compatibility, you would do better to use 
XHTML, which adds more restrictions than it adds features. It's old HTML that 
causes most trouble across different browsers, due to its ambiguities or 
differences of implementation (frames, table formats with non-zero cell 
spacing and cell padding, backgrounds, column widths in percentages 
specified by HTML
   width="x%"
attributes instead of by CSS
   style="width: x%"
attributes or stylesheet rules...).

So:
(1) enforce the XML rules: close all tags (notably <p>...</p> paragraphs and 
<li>...</li> list items, and <br />, <img ... />, and <meta ... /> empty 
elements), and make them all properly nested.

(2) use only the standard subset of HTML elements and attributes. And make 
sure you don't include HTML block elements within HTML inline elements (for 
example <font> elements surrounding <p> paragraphs...).

(3) use simple CSS stylesheets, with only one rule per element or class. And 
don't overuse advanced CSS2 or CSS3 style features. Keep some tolerance 
for table column widths (make sure that font sizes can be reduced or 
increased for accessibility).

(4) test your pages in IE 6, Firefox 1.0 (excellent!), Netscape 4 (old...), 
and on Mac Safari if you can: that should be enough to work well with most 
other browsers (Netscape 6+ should behave mostly like IE 6 and Firefox on 
Windows, as long as you don't need JavaScript).




Re: [even more increasingly OT-- into Sunday morning] Re: Unicode HTML, download

2004-11-21 Thread Philippe Verdy
From: Christopher Fynn [EMAIL PROTECTED]
I'd also like to figure out a way to trigger this kind of behavior  in 
other browsers as well as in IE (using Java Script or Java rather than VB) 
as not quite everyone uses IE - (but I guess you are not going to give me 
any more clues on how to do that :-) )
If only there were a portable way in JavaScript to determine that a string 
can be rendered with the existing fonts, or to enumerate the installed fonts 
and get some of their properties... we could prompt the user to install some 
fonts or change their browser settings, or we could auto-adapt the CSS style 
rules, notably the list of fonts inserted in the font-family: or 
abbreviated font: CSS properties...

There are limited controls with the CSS @-rules that allow building 
virtual font names, but not enough to tune the font selection by script 
or by code point range. And JavaScript is of little help to compensate.
Certainly there is a need to include, in a refined standard DOM for styles, the 
properties needed to manage preferred font stacks associated with a virtual 
font name (for example, in a way similar to what Java2D 1.5 allows), which 
could then be referenced directly from legacy HTML <font> names 
or in CSS "font-family: virtualname" properties (some examples of virtual 
font names are already standardized: serif, sans-serif, monospace; 
Java2D or AWT adds dialog and dialoginput; but other virtual names could 
be defined as well, like decorated, handscript, or ocr).

The key issue here is to create documents that refer to font families 
according to their usage rather than their exact appearance and the limited 
set of languages and scripts they support.

Another possibility would be to create a portable but easily tunable font 
format (XML-based, so that such fonts could be created or tuned by scripting 
through the DOM?) which would be a list of references to various external but 
actual fonts or glyph collections, plus parameters that allow selecting from 
them with various priorities. For now this is not implemented in font 
technologies (OpenType, Graphite, ...) but within vendor-specific renderer 
APIs (which contain some rules for creating such font mappings).




Re: Unicode HTML, download

2004-11-21 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED]
The best advice for Elaine's situation becomes simpler.  To maximize the
likelihood that readers will see the right glyphs, add a font-family
style line that lists a variety of available fonts, in decreasing order
of coverage and attractiveness.
My bad advice actually comes from confusion between two SIL-related 
fonts: one with a legacy encoding (handled in browsers as if it were ISO-8859-1 
encoded, so that you need to insert text into the HTML page using only the 
code points of the Latin-1 page starting at U+0000, even though they do not 
represent the correct Unicode characters), and the other coded with Unicode 
(for which you need to encode your text with Hebrew code points...).

But your advice, Doug, still won't work when multiple fonts in the 
font-family style use distinct encodings: mixing SIL Ezra with Arial or 
similar Unicode-encoded fonts will never produce the intended fallbacks if 
users don't have SIL Ezra actually installed and selectable in their 
browser environment.

Legacy encoded fonts only contain a codepage/charset identifier (most often 
ISO-8859-1) and no character-to-glyph translation table; they also don't work 
properly with browsers configured for accessibility, where only the 
user-defined preferred fonts are allowed and fonts specified in HTML pages 
must be ignored by the browser, user styles having been set to higher 
priority (even if one uses the !important CSS rule marker), 
unless the default font mapping associated with the codepage/charset 
identifier effectively corresponds to what would be found in a regular 
char-to-glyph mapping table present in that font.




Re: Unicode HTML, download

2004-11-21 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED]
Cryptically naming these two CSS classes .he and .heb, which
provides no indication of which is the Unicode encoding and which is the
Latin-1 hack, merely makes a bad suggestion worse.
It was not cryptic: .he was meant for Hebrew (generic, properly 
Unicode encoded, suitable for any modern Hebrew), and .heb for Biblical 
Hebrew, where a legacy encoding may still be needed in the absence of workable 
Unicode support for now: this is not the same language, however, so a 
change of encoding may be justified. I was not advocating mixing 
encodings within the same text for the same language...

But I was nearly sure that a technical jargon in Hebrew would probably not 
need Biblical Hebrew, except for illustration purposes within small delimited 
block quotes or spans, where there will simultaneously be changes of:
- language level,
- needed character set, some characters not being encodable with Unicode,
- encoding (from Unicode to a Latin-1 override hack),
- the specific font needed to render the legacy encoding.
In that case, it is acceptable to have the general text in modern Hebrew 
properly coded with Unicode, even if the small illustrative quotes remain 
fully in a non-standard mapping and won't appear correctly without the 
necessary font.

Note that PDF files DO mix encodings within the embedded fonts that PDF 
writers dynamically create for only the necessary glyphs. These encodings 
are specific to the document, for each embedded font... This is why PDF 
files can encode text that still has no Unicode character mappings. You 
can see this when you attempt to copy/paste text fragments from PDF files in 
sections using embedded fonts: the pasted text will not reproduce the same 
characters as what you see in the PDF reader; copy/pasting does work, however, 
for PDF files using external fonts with standard mappings.




Re: [increasingly OT--but it's Saturday night] Re: Unicode HTML, download

2004-11-21 Thread Philippe Verdy
From: E. Keown [EMAIL PROTECTED]
Dear Doug Ewell, fantasai and List:
I will try to sort out these diverse pieces of advice.
What's the point, really, of going far beyond, even
beyond CSS, into XHTML, where few computational
Hebraists have gone before?
You're right Helen, the web is full of non-XHTML-conforming documents. You 
probably don't need full XHTML conformance either, but having your document 
respect the XML nesting and closure of elements is certainly a must today, 
because it avoids most interoperability problems in browsers.

So: make sure all your HTML elements and attributes are lowercase, and close 
ALL elements (even empty elements, which should be closed with " />" instead of 
just ">", for example <br /> instead of <br>, and even <li>...</li> or 
<p>...</p>).
And then don't embed structural block elements
   (like <p>...</p> or <div>...</div> or <blockquote>...</blockquote>
   or <li>...</li> or <table>...</table>)
within inline elements
   (like <b>...</b> or <font>...</font> or <a href=...>...</a>
   or <span>...</span>).
Note that most inline elements are related to style, and their job is better 
done outside of the body, by assigning style classes to the structural elements 
(most of which are block elements).

XHTML has deprecated most inline style elements in favor of external 
specification of style through the class attribute added to structural block 
elements. XHTML has excellent interoperability with a wide range of 
browsers, including old ones, except for the effective rendering of some CSS 
styles.

The cost to convert an HTML file to full XML well-formedness is minor for 
you, but this allows you to use XML editors to make sure the document is 
properly nested, a pre-condition that will greatly help its interoperable 
interpretation.

If you have FrontPage XP or 2003, you can use its apply XML formatting 
rules option to make this job nearly automatically, and make sure that all 
elements are properly nested and closed.




Re: Ezra

2004-11-21 Thread Philippe Verdy
From: Edward H. Trager [EMAIL PROTECTED]
Are you saying the difference in names is SIL Ezra vs. Ezra SIL ?
That's too confusing!
You're not alone in being confused. I had completely forgotten the existence of 
two versions of the same font design. I may have just seen that it used 
PUAs, so I did not install it (I did not remember that it used PUAs, and the 
wording of the sentence that introduced it in this discussion made me think 
that it was NOT using Unicode, and thus not PUAs, which are Unicode things; 
that's why I supposed it was using some legacy Latin-1 override or similar 
hack found in some special-purpose fonts, or in legacy non-TrueType-based 
font formats, like PostScript mappings within a 0-based indexed vector or 
hashed dictionary of glyph names...)




Re: My Querry

2004-11-23 Thread Philippe Verdy
From: Antoine Leca [EMAIL PROTECTED]
I do not know what does mean fully compatible in such a context. For
example, ASCII as designed allowed (please note I did not write was
designed to allow) the use of the 8th bit as parity bit when transmitted 
as
octet on a telecommunication line; I doubt such use is compatible with
UTF-8.
The parity bit is not data; it's a framing bit used for transport/link 
purposes only.

ASCII is 7-bit only, so even if a parity bit is added (a parity bit can be 
added to 8-bit quantities as well...), it won't be part of the effective 
data, because once the transport unit is received and checked, it has to be 
cleared (so an '@' character will effectively be equal to 64 in ASCII, not 
to 192 with an even parity bit added).

Saying that UTF-8 is fully compatible with ASCII means that any ASCII-only 
encoded file needs no re-encoding of its bytes to become valid UTF-8.
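A minimal Java sketch of that claim (the sample text is invented and uses 
only ASCII characters):

    import java.io.UnsupportedEncodingException;

    public class AsciiIsUtf8 {
        public static void main(String[] args) throws UnsupportedEncodingException {
            // Any ASCII-only byte sequence is already valid UTF-8:
            // decoding it as US-ASCII or as UTF-8 gives the same string.
            byte[] ascii = "Hello @ World 123".getBytes("US-ASCII");
            String asAscii = new String(ascii, "US-ASCII");
            String asUtf8 = new String(ascii, "UTF-8");
            System.out.println(asAscii.equals(asUtf8)); // true
        }
    }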

Note that this is only true for the US version of ASCII (ASCII 
normally designates only the last standard US variant of ISO 646; other 
standard national variants or proprietary variants of ISO 646 should not be 
called ASCII but, more accurately, for example, ISO 646-FR:1989, or without 
the ISO prefix if it is a proprietary charset rather than an approved charset 
published in the ISO 646 standard).




Re: Shift-JIS conversion.

2004-11-25 Thread Philippe Verdy



You just need a mapping table from Unicode code points to Shift-JIS code 
positions, and a very simple code point parser to translate UTF-8 into 
Unicode code points.
You'll find a mapping table in the Unicode UCD, on its FTP server. The UTF-8 
form is fully documented in the Conformance section of the Unicode standard 
and requires no table to convert UTF-8 to 21-bit Unicode code points.

There are existing tools that perform this for you, because they integrate 
both:

- Java (international edition) has a Shift-JIS mapping to Unicode which is 
reversible. It is used with the Charset support in the java.io.* and 
java.nio.* packages and classes. You can even use the prebuilt tool 
native2ascii (from the Java SDK) to do it:

  native2ascii -encoding UTF-8 < filename.UTF-8.txt |
  native2ascii -reverse -encoding SHIFT-JIS > filename.SHIFT-JIS.txt

- GNU recode on Linux/Unix may do it for you too.

- The open-sourced ICU offered by IBM has an API and supports mappings for 
lots of charsets.
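For completeness, here is a rough Java sketch of the conversion itself (the 
file names reuse those from the native2ascii example above; characters with 
no Shift-JIS mapping are replaced by the encoder's default replacement unless 
you configure a CharsetEncoder yourself):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.Reader;
    import java.io.Writer;

    public class Utf8ToShiftJis {
        public static void main(String[] args) throws Exception {
            // Decode the input as UTF-8 and re-encode it as Shift_JIS.
            Reader in = new InputStreamReader(
                new FileInputStream("filename.UTF-8.txt"), "UTF-8");
            Writer out = new OutputStreamWriter(
                new FileOutputStream("filename.SHIFT-JIS.txt"), "Shift_JIS");
            char[] buffer = new char[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            out.close();
            in.close();
        }
    }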


  - Original Message - 
  From: pragati
  To: [EMAIL PROTECTED]
  Sent: Thursday, November 25, 2004 6:00 AM
  Subject: Shift-JIS conversion.

  Hello,

  Can anyone please tell me how to convert from UTF-8 to Shift-JIS?
  Please let me know if there is any formula to do it other than using the 
  ready-made functions provided by Perl, because those functions do not 
  provide mappings for all characters.

  Warm Regards,
  Pragati Desai.

  Cybage Software Private Ltd.
  ph(0)- 020-4044700 Extn: 302
  mailto: [EMAIL PROTECTED]
  


Re: Misuse of 8th bit [Was: My Querry]

2004-11-25 Thread Philippe Verdy
From: Antoine Leca [EMAIL PROTECTED]
On Wednesday, November 24th, 2004 22:16Z Asmus Freytag va escriure:
I'm not seeing a lot in this thread that adds to the store of
knowledge on this issue, but I see a number of statements that are
easily misconstrued or misapplied, including the thoroughly
discredited practice of storing information in the high
bit, when piping seven-bit data through eight-bit pathways. The
problem  with that approach, of course, is that the assumption
that there were never going to be 8-bit data in these same pipes
proved fatally wrong.
Since I was the person who did introduce this theme into the thread, I 
feel
there is an important point that should be highlighted here. The widely
discredited practice of storing information in the high bit is in fact 
like
the Y2K problem, a bad consequence of past practices. Only difference is
that we do not have a hard time limit to solve it.
Whether an application chooses to use the 8th (or even 9th...) bit of a 
storage, memory, or networking byte that also stores an ASCII-coded 
character as a zero, as an even or odd parity bit, or for any other purpose 
is the choice of the application. It does not change the fact that this 
extra bit (or bits) is not used to code the character itself.
I see this usage as a data structure that *contains* (I don't say *is*) a 
character code. This is completely outside the topic of the ASCII encoding 
itself, which is only concerned with the codes assigned to characters, and 
only with characters.
In ASCII, as in all other ISO 646 charsets, code positions are ALL in the 
range 0 to 127. Nothing is defined outside this range, exactly as 
Unicode does not define or mandate anything for code points larger than 
0x10FFFF, whether they are stored or handled in memory with 21-, 24-, 32-, or 
64-bit code units, more or less packed according to architecture or network 
framing constraints.
So the question of whether an application can or cannot use the extra bits is 
left to the application, and this has no influence on the standard charset 
encoding or on the encoding of Unicode itself.

So a good question to ask is how to handle values of variables or instances 
that are supposed to contain a character code but whose internal storage 
can fit values outside the designed range. For 
me this is left to the application, but many applications will simply assume 
that such a datatype is meant to accept one unique code per designated 
character. Using the extra storage bits for something else breaks this 
legitimate assumption, so applications must be specially prepared to 
handle this case, by filtering values before checking for character 
identity.

Neither Unicode nor US-ASCII nor ISO 646 defines what an application can do 
there. The code positions or code points they define are *unique* only in 
their *definition domain*. If you use larger domains for values, nothing in 
Unicode, ISO 646, or ASCII defines how to interpret the value: these 
standards will NOT assume that the low-order bits can safely be used to 
index equivalence classes, because such equivalence classes cannot be 
defined strictly within the definition domain of these standards.

So I see no valid rationale behind requiring applications to clear the extra 
bits, or to leave the extra bits unaffected, or to force these applications 
to necessarily interpret the low-order bits as valid code points.
We are outside the definition domain, so any larger domain is 
application-specific, and applications may as well use ASCII or Unicode 
within storage code units that add some offset, or multiply the standard 
codes by a constant, or apply a reordering transformation (permutation) to 
them and to the other possible non-character values.

When ASCII, and ISO 646 in general, define a charset with 128 unique code 
positions, they don't say how this information will be stored (an 
application may as well need to use 7 distinct bytes (or other 
structures...), not necessarily consecutive, to *represent* the unique codes 
that stand for ASCII or ISO 646 characters), and they don't restrict the 
usage of these codes separately from any other independent information (such 
as parity bits, or anything else). Any storage structure that preserves 
the identity and equivalences of the original standard code in its 
definition domain is equally valid as a representation of the standard, but 
this structure is out of the scope of the charset definition.




Re: Shift-JIS conversion.

2004-11-25 Thread Philippe Verdy
- Original Message - 
From: Addison Phillips [wM]
To: pragati ; [EMAIL PROTECTED]
Sent: Thursday, November 25, 2004 6:21 PM
Subject: RE: Shift-JIS conversion.

Dear Pragati,
You can write your own conversion, of course. The mapping tables of 
Unicode-SJIS are readily availably. You should note that there are several 
vendor specific variations in the mapping tables. Notably Microsoft code 
page 932, which is often called Shift-JIS, has more characters in its 
character set than standard Shift-JIS (and it maps a few characters 
differently too...)

The important fact that you should be aware of: Shift-JIS is an encoding 
of the JIS X0208 character set.
UTF-8 is an encoding of the Unicode character set.
More exactly, UTF-8 is an encoding of the ISO/IEC 10646 character set (the 
character set here designates the set of characters, i.e. the repertoire 
that describes characters with a name, a representative glyph, and some 
annotations, and to which a numeric code, the code point, is then assigned).

Unicode by itself is not a character set, only an implementation of the 
ISO/IEC 10646 character set, in which the Unicode standard assigns 
additional properties and behaviors to the characters allocated in ISO/IEC 
10646. The link between Unicode and ISO/IEC 10646 is the assigned code point 
and character name, which are now common to the two standards.

Of course the Unicode Technical Committee may propose new assignments to 
ISO/IEC, but it is still ISO/IEC 10646 that maintains the repertoire and 
approves or rejects the proposals. A new character proposal may be rejected 
by Unicode but accepted by ISO/IEC 10646, and it is the ISO/IEC 10646 vote 
that prevails (so Unicode will have to accept that ISO/IEC decision, even if 
it voted against it earlier).

Conversely, ISO/IEC 10646 says nothing about character properties or 
behaviors. It can make suggestions, but the Unicode committee makes its own 
decisions about the character properties and behaviors that it chooses to 
standardize. If Unicode wants its decisions to be widely accepted by all 
users of the ISO/IEC 10646 repertoire, it is in Unicode's interest to 
try to make these decisions in conformance with other existing national 
or international standards, to maximize the interoperability of national or 
international applications based on the ISO/IEC 10646 character set.




Re: Misuse of 8th bit [Was: My Querry]

2004-11-26 Thread Philippe Verdy
From: Antoine Leca [EMAIL PROTECTED]
On Thursday, November 25th, 2004 08:05Z Philippe Verdy va escriure:
In ASCII, or in all other ISO 646 charsets, code positions are ALL in
the range 0 to 127. Nothing is defined outside of this range, exactly
like Unicode does not define or mandate anything for code points
larger than 0x10, should they be stored or handled in memory with
21-, 24-, 32-, or 64-bit code units, more or less packed according to
architecture or network framing constraints.
So the question of whever an application can or cannot use the extra
bits is left to the application, and this has no influence on the
standard charset encoding or on the encoding of Unicode itself.
What you seem to miss here is that given computers are nowadays based on
8-bit units, there have been a strong move in the '80s and the '90s to
_reserve_ ALL the 8 bits of the octet for characters. And what was asking 
A.
Freitag was precisely to avoid bringing different ideas about 
possibilities
to encode other class of informations inside the 8th bit of a ASCII-based
storage of a character.
This is true, for example, of an API that just says that a char (or whatever 
datatype is used in some convenient language) contains an ASCII code or 
Unicode code point, and expects the datatype instance to be equal to the 
ASCII code or Unicode code point.
In that case, the assumption of such an API is that you can compare char 
instances for equality instead of comparing only the effective code points, 
and this greatly simplifies the programming.
So an API that says that a char contains ASCII code positions should 
always assume that only the instance values 0 to 127 will be used; the same 
goes for an API that says that an int contains a Unicode code point.

The problem lies only in using the same datatype to also store 
something else (even if it's just a parity bit or a bit forced to 1).

As long as this is not documented with the API itself, it should not be 
done, in order to preserve the rational assumption that char identities match 
code identities.

So for me, a protocol that adds a parity bit to the ASCII code of a 
character is doing that on purpose, and this should be isolated in a 
documented part of its API. If the protocol wants to send this data to an API 
or interface that does not document this use, it should remove/clear the 
extra bit, to make sure that the character identity is preserved and 
interpreted correctly (I can't see how such a protocol implementation could 
expect that an '@' character coded as 192 will be correctly interpreted by 
a simpler interface that expects all '@' instances to be 
equal to 64; see the small sketch below).
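To make the '@' example concrete, here is a minimal Java sketch (the 
even-parity-in-the-8th-bit framing is just the hypothetical scenario discussed 
above, not any particular protocol):

    public class ParityFramingDemo {
        public static void main(String[] args) {
            int ascii = '@';                                    // 0x40 = 64
            // Even parity: set bit 7 so the total number of 1 bits is even.
            int parity = (Integer.bitCount(ascii) % 2 != 0) ? 0x80 : 0x00;
            int framed = ascii | parity;                        // 0xC0 = 192 for '@'
            // A receiver that documents this framing must clear the extra bit
            // before handing the value to an interface that expects plain ASCII.
            int restored = framed & 0x7F;                       // back to 64
            System.out.println(framed + " -> " + restored);     // 192 -> 64
        }
    }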

In safe programming, any unused field in a storage unit should be given a 
mandatory default. As the simplest form that preserves the code identity in 
ASCII, or the code point identity in Unicode, is the one that uses 0 as this 
default, extra bits should be cleared. If not, anything can happen on 
the recipient's side:

- the recipient may interpret the value as something other than a character, 
behaving as if the character data were absent (so there will be data loss, in 
addition to unexpected behavior). Bad practice, given that it is not 
documented in the recipient API or interface.

- the recipient may interpret the value as another character, or may not 
recognize the expected character. It's not clearly a bad programming 
practice for recipients, because it is the simplest form of handling for 
them. However the recipient will not behave the way expected by the sender, 
and it is the sender's fault, not the recipient's fault.

- the recipient may take additional unexpected actions in addition to the 
normal handling of the character without the extra bits. This would be a bad 
programming practice for recipients if the specific behavior is not 
documented, so senders should not need to care about it.

- the recipient may filter/ignore the value completely... resulting in data 
loss; this may be sometimes a good practice, but only if this recipient 
behavior is documented.

- the recipient may filter/ignore the extra bits (for example by masking); 
for me it's a bad programming practice for recipients...

- the recipient may substitute the incorrect value by another one (such as a 
SUB ASCII control or a U+FFFD Unicode substitute to mark the presence of an 
error, without changing the string length).

- an exception may be raised (so the interface will fail) because the given 
value does not belong to the expected ASCII code range or Unicode code point 
range (the safest practice for recipients working under the 
design-by-contract model is to check the domain value range of all their 
incoming data or parameters, to force the senders to obey the contract).

Don't blindly expect that any interface capable of accepting ASCII codes in 
8-bit code units will also transparently accept all values outside of the 
restricted ASCII code range, unless this behavior is explicitly documented.

Re: Relationship between Unicode and 10646 (was: Re: Shift-JIS conversion.)

2004-11-26 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED]
My impression is that Unicode and ISO/IEC 10646 are two distinct
standards, administered respectively by UTC and ISO/IEC JTC1/SC2/WG2,
which have pledged to work together to keep the standards perfectly
aligned and interoperable, because it would be destructive to both
standards to do otherwise.  I don't think of it at all as the slave and
master relationship Philippe describes.
Probably not with the connotations one may attach to slave and master, but 
it's still true that there can be only one standards body for the character 
repertoire, and one formal process for additions of new characters, even if 
two standards bodies are *working* (I don't say *deciding*) in cooperation.

The alternative would have been for UTC and WG2 each to be allocated some 
code space in which to make the allocations they want, but with the risk of 
duplicate assignments. I really prefer to see the system as a master 
and slave relationship, because it gives a simpler view of how characters 
can be assigned in the common repertoire.

For example, Unicode has no more rights than the national standardization 
bodies involved at ISO/IEC WG2. All of them make proposals, amend proposals, 
suggest modifications, or negotiate to create a final specification from the 
informal drafts. All I see in the Unicode standardization process is that it 
will finally approve a proposal, but Unicode cannot declare it a standard 
until there has been a formal agreement at ISO/IEC WG2, which really rules on 
the effective allocations in the common repertoire, even if most of the 
preparation work creating the finalized proposal will have been heavily 
discussed within UTC, with Unicode partners, or with ISO/IEC members.

At the same time, ISO/IEC WG2 also studies the proposals made by other 
standardization bodies, including the specifications prepared by other ISO 
working groups or by national standardization bodies. Unicode is not the 
only approved source of proposals and specifications for ISO/IEC WG2 (and I 
tend to think that Unicode best represents the interests of private 
companies, whilst national bodies are most often better represented by their 
permanent membership at ISO, where they have full rights to vote on or veto 
proposals according to their national interests...).

The Unicode standard itself agrees to obey the ISO/IEC 10646 allocations in 
the repertoire (character names, representative glyphs, code points, and 
code blocks), but in exchange, ISO/IEC has agreed with Unicode not to decide 
about character properties or behavior (which are defined either by Unicode, 
or by national standards based on the ISO/IEC 10646 coded repertoire, for 
example the P.R. Chinese GB18030 standard, or by other ISO standards like ISO 
646 and ISO 8859).

So, even if the UTC decides to veto a proposal submitted by Unicode members, 
nothing prevents the same members from finding allies within national 
standards bodies, so that they submit the (modified) proposal to ISO/IEC 
10646 directly, bypassing a Unicode that refuses to transmit it.

Let me give a recent example: the UTC voted against 
the allocation of a new invisible character with the properties of a 
letter, zero width, and the same break-opportunity allowances as 
letters, considering that the existing NBSP was enough, despite the 
various complexities this causes related to the normative properties of NBSP 
used as a base character for combining diacritics. This proposal (which was 
previously under informal discussion) was rejected by the UTC, but this leaves 
Indian and Israeli standards with complex problems for which Unicode offers no 
easy solution.

So nothing prevents India and Israel from reformulating the proposal at ISO/IEC 
WG2, which may then accept it, even if Unicode previously voted against it. 
If ISO/IEC WG2 accepts the proposal, Unicode will have no other choice than 
to accept it in the repertoire, and thus to give the new character some 
correct properties. Such a proposal will be easily accepted by ISO/IEC WG2 if 
India and Israel demonstrate that the allocation allows making distinctions 
which are tricky, computationally difficult, or ambiguous to resolve when 
using NBSP. With a new distinct character, on the other hand, ISO/IEC 10646 
members can demonstrate to Unicode that defining its Unicode 
properties is not difficult and that it simplifies the problem of correctly 
representing complex cases found in large text corpora.

Unicode may think that this is a duplicate allocation, because there will 
exist cases where two encodings are possible, but without the same 
difficulties for implementations of applications like full-text search, 
collation, or determination of break opportunities, notably in the many 
cases where the current Unicode rules already contradict the 
normative behavior of existing national standards (like ISCII in India). My 
opinion is that the

Re: CGJ , RLM

2004-11-26 Thread Philippe Verdy
From: Mark Davis [EMAIL PROTECTED]
I want to correct some misperceptions about CGJ; it should not be used for
ligatures.
True. CGJ is a combining character that extends the grapheme cluster started 
before it, but it does not imply any linking with the next grapheme cluster 
starting at a base character.

So, even if one encodes A+CGJ+E, there will still be two distinct grapheme 
clusters, A+CGJ and E, and the exact role of the trailing CGJ in A+CGJ is 
probably just pollution, given that this CGJ has no influence on the 
collation order (the sequence A+CGJ+E collates like A+E), and it does not 
influence the rendering either.

A correct ligaturing would be A+ZWJ+E, with the effect of creating three 
default grapheme clusters, that can be rendered as a single ligature, or as 
separate A and E glyphs if the ZWJ is ignored.

For example, a ligaturing opportunity can be encoded explicitly in the 
French word efficace:
ef+ZWJ+f+ZWJ+icace.

Note, however, that the ZWJ prohibits breaking, even though in French there is 
a possible hyphenation at the first occurrence, where it is also a syllable 
break, but not at the second occurrence, which falls in the middle of the 
second syllable.

I don't know how one can encode an explicit ligaturing opportunity while 
also encoding the possibility of hyphenation (where the sequence above 
would be rendered as if the first ZWJ had been replaced by a hyphen 
followed by a newline).

To encode the hyphenation opportunity, normally I would use the SHY format 
control (soft hyphen):
ef+SHY+fi+SHY+ca+SHY+ce

If I want to encode explicit ligatures for the ffi cluster, if it is not 
hyphenated, I need to add ZWJ:
ef+ZWJ+SHY+f+ZWJ+i+SHY+ca+SHY+ce(1)

The problem is whether ZWJ will have the expected role of enabling a ligature 
when it is inserted between a letter and a SHY instead of directly between the 
two ligated glyphs. In any case, the ligature should not be rendered if 
hyphenation does occur; otherwise the SHY should be ignored. So two renderings 
are to be generated, depending on the presence or absence of the conditional 
syllable break:
- if the syllable break occurs, render as ef-+NL+f+ZWJ+icace, i.e. with a 
ligature only for the fi pair, but not for the ff pair and not even for 
the generated f+hyphen...
- if the syllable break does not occur, render as ef+ZWJ+f+ZWJ+icace, i.e. 
with the 3-letter ffi ligature...

I am not sure if the string coded as (1) above has the expected behavior, 
including for collation where it should still collate like the unmarked word 
efficace...




Re: CGJ , RLM

2004-11-26 Thread Philippe Verdy
Which statements? My message mostly reads as a question, not as an 
affirmation... I also took the precaution of using phrases like not sure 
if... or I don't know if..., which mean that this is a problem for which I 
can't find easy solutions, i.e. the interaction of ligature opportunities 
and hyphenation (syllable-break opportunities), and how a document can be 
prepared to allow both in renderers without breaking the semantics and 
collation of words in the document (notably if one wants to preserve 
full-text search capabilities for such prepared documents)...

- Original Message - 
From: Mark Davis [EMAIL PROTECTED]
To: Philippe Verdy [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Friday, November 26, 2004 9:09 PM
Subject: Re: CGJ , RLM


The statements below are incorrect, but I don't have the time to correct
them all.




Re: CGJ , RLM

2004-11-26 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED]
Perhaps a better question to ask would be why you need to indicate both
hyphenation points and ligation points in text that is going to be
collated.
Because one would want to:
- prepare documents for correct rendering (including both ligature and 
hyphenation capabilities, easily rendered in simple text browsers without 
any lexical analysis) and use such a prepared document as the preferred 
form for archiving, and then
- have such a prepared corpus still usable for full-text searches...




Re: CGJ , RLM

2004-11-26 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED]
Philippe Verdy verdy underscore p at wanadoo dot fr wrote:
If I want to encode explicit ligatures for the ffi cluster, if it is
not hyphenated, I need to add ZWJ:
ef+ZWJ+SHY+f+ZWJ+i+SHY+ca+SHY+ce(1)
Great Scott!  You can use ZWJ to suggest a ligation opportunity, and SHY
to suggest a hyphenation opportunity, but if you need to suggest both
within the same word, let alone *between the same pair of letters*, you
have probably stepped over the plain-text line.
If encoding a ligation opportunity is not plain text, why then have it in 
Unicode?
If a hyphenation opportunity is not plain text, why then have it in Unicode?

Both exist in Unicode, and I don't think they are considered non-plain-text. 
So why would you want to restrict their usage so that they can only be used 
separately?

The ZWJ and SHY format controls for these two purposes are added deliberately 
when preparing documents for later rendering. They shouldn't affect the 
collation of the text and will not change its semantics, and this 
transformation of text cannot be fully automated without complex 
lexical and linguistic knowledge. That's why they should be allowed in texts 
kept for archiving.

If you later want to use those prepared texts with simpler renderers and 
parsers, you can still ignore and filter out the ZWJ and SHY very easily (see 
the small sketch below), so this preparation work, most often performed by 
typists, is normally reversible.
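A minimal sketch of that filtering step (the prepared string is sequence (1) 
from the earlier message; ZWJ is U+200D and SHY is U+00AD):

    public class StripLayoutControls {
        // Remove the two format controls discussed above (ZWJ U+200D, SHY U+00AD)
        // so a prepared text matches its unmarked form again.
        static String strip(String s) {
            return s.replaceAll("[\u200D\u00AD]", "");
        }

        public static void main(String[] args) {
            String prepared = "ef\u200D\u00ADf\u200Di\u00ADca\u00ADce"; // (1)
            System.out.println(strip(prepared).equals("efficace"));     // true
        }
    }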

Nobody is required to use them, but if one wants to do it for better 
rendering of prepared documents, why would Unicode forbid it? Was my 
question really so stupid?




Re: No Invisible Character - NBSP at the start of a word

2004-11-27 Thread Philippe Verdy
From: Jony Rosenne [EMAIL PROTECTED]
One of the problems in this context is the phrase original meaning. What
we have is a juxtaposition of two words, which is indicated by writing the
letters of one with the vowels of the other. In many cases this does not
cause much of a problem, because the vowels fit the letters, but sometimes
they do not. Except for the most frequent cases, there normally is a note 
in
the margin with the alternate letters - I hope everyone agrees that notes 
in
the margin are not plain text.
Are you drawing a parallel here with the annotations added above or below 
ideographs in Asian texts, using ruby notation (for example in HTML), which 
may also be represented in plain-text Unicode with the interlinear 
annotation characters?

Are you arguing that interlinear annotations are not plain text? If so, why 
were they introduced in Unicode?

The notations in question are not merely presentation features; they have 
their own semantics, which merit being treated as plain text, because their 
structure also resembles a linguistic grammar, not far from the other 
common annotations found in Latin text as phrases between parentheses 
or em-dashes.

Plain text has always been used to embed several linguistic levels, which are 
often also marked in the spoken language by variations of tone. The content 
of these annotations is also plain text. The graphic representation itself is 
not that important; it is just there to make visible the relations that exist 
between one level of the written language and the annotation level.

If a text appears to mix these levels, there's no reason not to represent it. 
These annotations are present in the text, so there must be a way to 
represent them in its encoding, even if that implies encoding mixed words 
belonging to different interpretation levels (such as Qere and Ketiv readings 
in Biblical Hebrew).

You are arguing against millennia of written-language practice, too focused 
on common Latin usage, for which many concessions to your intuitive model 
have already been integrated into Unicode (think of the various characters 
that have been added as symbols or special punctuation, or of other 
annotations added on top of Latin letters, such as mathematical arrows...).

I see fewer problems with the correct representation of Ketiv and Qere 
annotations mixed within plain text, and rendered as supplementary letters on 
top of or around the core Hebrew letters, than with the representations 
granted to the Latin script for various usages (including technical 
annotations, punctuation, and formatting controls...).




Re: (base as a combing char)

2004-11-27 Thread Philippe Verdy
From: Addison Phillips [wM] [EMAIL PROTECTED]
For example, Dutch sometimes treats the sequence ij as a single letter 
(it turns out that there are characters for the letter 'ij' in Unicode 
too, but they are for compatibility with an ancient non-Unicode character 
set). Software must be modified or tailored to provide behavior consistent 
with the specific language and context.
Not sure about that: not all Dutch ij letter pairs are a single grapheme, so 
there are cases where the two letters must be treated as distinct and not as 
a single letter. For this reason, Dutch would need a distinct ij letter, 
coded as a single character, with its own capitalization rules (the uppercase 
or titlecase form of ij is the single letter IJ, not two letters and not 
Ij; there are also cases where diacritics are added on top of the ij letter, 
which then behaves even more like a single letter than like a simple digraph).
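
A minimal sketch of that capitalization rule, assuming the two-letter 
spelling i+j (the helper name is mine; this is not a full locale-aware 
titlecasing routine):

    public class DutchTitleCase {
        // Capitalize a Dutch word: a leading "ij" digraph is capitalized as a
        // whole ("IJsselmeer"), never as "Ijsselmeer".
        static String titleCase(String word) {
            if (word.startsWith("ij")) {
                return "IJ" + word.substring(2);
            }
            return word.isEmpty() ? word
                    : Character.toTitleCase(word.charAt(0)) + word.substring(1);
        }
        public static void main(String[] args) {
            System.out.println(titleCase("ijsselmeer")); // IJsselmeer
            System.out.println(titleCase("amsterdam"));  // Amsterdam
        }
    }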

This distinction is also often visible in the typography: the single-letter 
ij digraph is shown with the leg of the j kerned deeply below (and sometimes 
to the left of) the leading i, whereas when they are treated as two letters 
no such kerning occurs (the i sits completely to the left of the bottom-left 
leg of the j). It is even more evident in the uppercase style, where the 
standard small distance is kept between the I and J glyphs when they are two 
distinct letters, but where the uppercase I may be drawn in the middle of the 
left leg of the J.

Note the very close resemblance of the ij single letter to a y with 
diaeresis (so you'll also find Dutch texts that use y with diaeresis instead 
of the correct ij letter, notably in texts coded with legacy charsets). This 
distinction is also preserved in uppercase, where the missing IJ single 
letter appears encoded as Y with diaeresis...

The cases in Dutch where there's a distinction between the single-letter 
digraph and two letters are rare, so it is often acceptable to encode the 
digraph with two letters, without creating linguistic ambiguities (in most 
cases...), or with y with diaeresis (which is otherwise not a letter used in 
Dutch).

To me, your allusion to legacy charsets is about the deprecated use of y with 
diaeresis, not about the use of a distinct IJ letter, which is needed for 
Dutch and should be treated as distinct from the letter pair I followed by J.




Re: Relationship between Unicode and 10646

2004-11-27 Thread Philippe Verdy
From: Peter Kirk [EMAIL PROTECTED]
I don't want to go along with Philippe entirely on this, but surely he 
must be right on this last point. Formally, Unicode is effectively the 
agent of just one national body in this decision-making process.
To be honest, Peter, I never said that Unicode was a national body, because I 
know that several non-US governments are full members of Unicode and vote at 
the UTC, and because I know that the official representative of the US in ISO 
is ANSI, not the Unicode Consortium.

But it's true that the United States has several times delegated its official 
international representation to the Unicode Consortium, acting on behalf of 
the US government for some decisions or some limited domains (this is 
possible because Unicode is incorporated in the US, a necessary condition for 
representing the US government in international organizations); this is a 
private contractual arrangement between Unicode and the official US 
representative, and it does not change Unicode's rights at ISO.

So the true representative of the US at ISO (and also at ITU) is certainly 
not Unicode, but ANSI, or any other US-incorporated organization that the US 
government chooses to represent it (other US private organizations have been 
given a US mandate for the management of some public resources or standards, 
like IANA, ARIN, ICANN, and IEEE, even though these organizations also 
include international voting members).




Re: CGJ , RLM

2004-11-27 Thread Philippe Verdy
I'm not the one who proposed encoding an AE ligature as A+ZWJ+E. I was only 
speaking about true typographical ligatures like ffi. I do know that AE or ae 
in French is better encoded with its own single code point, even though 
French treats this letter as two letters (which might otherwise justify 
encoding it as A+ZWJ+E, for which no collation tailoring is needed).

The current practice in most French texts is to use either the two separate 
vowels A+E or a+e, or the separate code point for the ae or AE letter (and 
then use a tailored collation so that the single letters ae and AE sort with 
a+e and A+E, i.e. between a+e and a+f).

I've never seen any French text coded with A+ZWJ+E or a+ZWJ+e...
The same remark applies to the French oe and OE ligatures (which, like ae and 
AE, are orthographic, not typographical like ffi) that French also treats 
(i.e. collates, sorts) as the two letters o+e or O+E.
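
A sketch of that tailoring, expanding the orthographic ligatures before 
comparing with the French collator (the expansion helper is mine; recent UCA 
tables may already give a similar result without it):

    import java.text.Collator;
    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.Locale;

    public class FrenchLigatureSort {
        // Expand æ/Æ and œ/Œ to their two-letter equivalents so that, e.g.,
        // "cæcum" sorts exactly where "caecum" would (between "cadre" and "café").
        static String expand(String s) {
            return s.replace("\u00E6", "ae").replace("\u00C6", "AE")
                    .replace("\u0153", "oe").replace("\u0152", "OE");
        }
        public static void main(String[] args) {
            Collator fr = Collator.getInstance(Locale.FRENCH);
            Comparator<String> cmp =
                    Comparator.comparing(FrenchLigatureSort::expand, fr::compare);
            String[] words = {"café", "cæcum", "cadre"};
            Arrays.sort(words, cmp);
            System.out.println(Arrays.toString(words)); // [cadre, cæcum, café]
        }
    }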

- Original Message - 
From: Asmus Freytag [EMAIL PROTECTED]
PS: since we have a perfectly fine AE as a character, there seems little 
gained in attempting a ligature. My suspicion would be that fonts would 
not provide the necessary mappings since the character code is available.



Re: (base as a combing char)

2004-11-27 Thread Philippe Verdy
From: John Cowan [EMAIL PROTECTED]
the need to encode Dutch
ij as a single character, which is neither necessary nor practical.
(U+0132 and U+0133 are encoded for compatibility only.)  In cases where
ij is a digraph in Dutch text, i+ZWNJ+j will be effective.
I suppose you meant the rare cases in Dutch where ij is NOT a digraph for a 
single letter, and for which i+ZWNJ+j could be effective... if only it were 
not opposed to tradition (and to many legacy encodings and keyboards), which 
generate U+0132 and U+0133, or a y/Y with diaeresis, when it is a digraph, 
and treat i+j in the other case as two distinct letters rather than a digraph.

The ambiguity will remain in Dutch for a long time, simply because ISO-8859-1 
(U+0000 to U+00FF) is too often the only subset offered to Dutch typists, in 
which neither U+0132 nor U+0133 is present, nor ZWNJ. In that case, those who 
want the distinction often use a y with diaeresis for lowercase and don't 
mark the difference for uppercase (as there's no uppercase Y with diaeresis 
in ISO-8859-1), which occurs much more rarely (Windows users can, however, 
use an uppercase Y with diaeresis, U+0178, to mark the single-letter digraph, 
because it is present in Windows codepage 1252 at code position 0x9F).

I doubt we'll ever see a ZWNJ key mapped on standard Dutch keyboards, given 
that most occurrences of the non-digraph two-letter i+j come from a few 
imported (originally non-Dutch) rare words. (Windows Notepad and some Windows 
text-input components do, however, include a contextual menu to insert this 
formatting control...)

The problem with ZWNJ is that it encodes only a typographic distinction, not 
the semantic one that Dutch users would expect: it has no semantics of its 
own, and its rendering is optional. Those who want a strong distinction will 
more likely use U+0132 and U+0133 in their word processors, assisted by Dutch 
spelling checkers, so that they just type i then j and let the word processor 
substitute the ij ligated letter for the two letters where appropriate, 
leaving other instances unchanged.

As the ij ligated letter is almost certainly the most frequent case when 
entering Dutch text, it may be the default behavior of a Dutch input method, 
and the assisting dictionary then only needs to list the rare cases where the 
substitution must not occur (the substitution would not apply within text 
sections marked as belonging to another language, and users could also cancel 
the automatic substitution with backspace in their word processor).
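
A minimal sketch of such a substitution step, with a purely hypothetical 
exception list (the words and names here are illustrative, not a real Dutch 
lexicon):

    import java.util.Set;

    public class IjSubstitution {
        // Hypothetical exception list: words in which i and j are two letters.
        static final Set<String> EXCEPTIONS = Set.of("bijou");
        // Replace the digraph by U+0133/U+0132 unless the word is an exception.
        static String normalize(String word) {
            if (EXCEPTIONS.contains(word.toLowerCase(java.util.Locale.ROOT))) {
                return word;
            }
            return word.replace("IJ", "\u0132")
                       .replace("Ij", "\u0132")
                       .replace("ij", "\u0133");
        }
        public static void main(String[] args) {
            System.out.println(normalize("ijs").length());   // 2: U+0133 + s
            System.out.println(normalize("bijou").length()); // 5: left unchanged
        }
    }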

Other, less capable word processors, without assisting dictionaries, may 
instead substitute U+0132/U+0133 for the occurrences of y/Y with diaeresis 
typed by users (a solution which may be quite easy for Belgian and French 
users, who can readily use the diaeresis dead key, also useful for entering 
French text)...

This means that documents from modern word processors will contain lots of 
U+0132/U+0133, clearly distinct from the other cases where i and j are left 
separate; and ZWNJ will not be needed!




Re: Re: Relationship between Unicode and 10646]

2004-11-29 Thread Philippe Verdy
From: Patrick Andries [EMAIL PROTECTED]
Finally, I am no longer so sure that American companies still consider 
Unicode to be something strategic; it is mostly a matter of individual 
efforts by enthusiastic technicians within those companies, enthusiasts who 
are still allowed to carry on, no doubt because it creates a good stock of 
multicultural goodwill.
That is precisely what makes me doubt more and more the value of continuing 
to support Unicode, if it no longer even serves economic objectives judged 
useful by the only American members able to sustain its development solely 
from the United States, while Unicode is still not ready for many other 
countries which, for their part, have economic imperatives to support their 
own languages.

If there is not much left to do for the Latin or Cyrillic scripts, and if the 
Chinese ideographs are now left to the management of the Ideographic 
Rapporteur working in the Far East, it might be worth considering that 
Unicode's work on African or Middle Eastern scripts take place somewhere more 
appropriate than the United States, particularly where decisions are 
concerned.

Europe seemingly offers meeting places better suited to these alphabets that 
are poorly supported by Unicode, whose decisions are based on distant 
reports, without serious economic involvement from the companies still 
participating (if they continue to support and pay their colleagues still 
engaged in this labor of love).

It seems that many European, Middle Eastern or African companies and 
organizations could participate more easily on the subject of the languages 
dear to them if these decision meetings were held in a more central location.

It is also a pity, in the age of virtual communications, that Unicode still 
insists, for the final vote, on doing this only in restricted committees in 
the United States, as if electronic voting did not exist! That would not 
prevent discussion or arbitration meetings from being held in various places, 
but Unicode and those who support it would save quite a bit of money by 
working in a less centralized way and by agreeing to delegate part of its 
work.

It is symptomatic, for example, that half of Unicode's potential voters never 
use the online electronic resources (which nothing prevents from being 
organized according to Unicode's own administrative procedures), making their 
decisions only on the basis of printed documents (expensive to produce and 
distribute) at conventions (also expensive to attend, because of travel and 
lodging costs, and of extra working hours paid solely for this purpose!), and 
that important documents can therefore escape their review...




Re: CGJ , RLM

2004-11-29 Thread Philippe Verdy
From: Otto Stolz [EMAIL PROTECTED]
Note that there is no algorithm to reliably derive the position of the
syllable break from the spelling of a word. You could even concoct pairs
of homographs that differ only in the position of the syllable break
(and, consequently, in their respective meaning). So far, I have only
found the somewhat silly example
- Brief+SHY+lasche (letter flap) vs.
- Brie+SHY+flasche (bottle to keep Brie cheese in),
but I am sure I could find better examples if I tried in earnest.
French hyphenation does not work reliably based only on orthographic rules.
It works quite well, but with many exceptions that require a hyphenation
dictionary. I think this is true of almost all alphabet-based languages, and
even of some languages written with so-called syllabic scripts, probably as a
matter of style, where separate vocal syllables must not be broken when those
breaks are not the best according to meaning (notably for compound words).
The case of German is that there are many possible compound words, and
breaks preferably occur between radical words rather than between syllables,
with exceptions:
- due to other stylistic constraints, or
- on short particles that should better not be detached from their
respective radical (but where do you best break the verbs hereinzugehen or
simply zugehen?),
- also because not all verb particles are detachable, as some belong to the
radical (many examples with the be particle or radical prefix).
Even if you allow hyphenation only between lexical units, there will be
exceptions that can't be resolved without understanding the semantics.
Such compound words with no separator are extremely rare in English, and
very rare in French.
(French examples: there's a clear vocal syllable break in millionce after 
-li- and before -on-, pronounced with separate vowels, but in million no 
break occurs within -lions, which is a single syllable pronounced with a 
diphthong; neither of these examples is a compound word.)

But hyphenation is still more desirable in German than word breaks alone (on
spaces), because of the average length of compound words, whose margin
alignment may look ugly and be hard to read in narrow columns such as in
newspapers or dictionaries. In Dutch there's more freedom in the creation of
compounds, which can often be written with or without a separator (a modern
Dutch style prefers using separators, or not creating a compound at all and
separating the words with spaces, but historically Dutch used the German
style still in use today despite its possible semantic ambiguities).
I think a German writer who sees a possible ambiguity will often tolerate an
unconditional hyphen to create compound words (in your example, he would
write Brief-Lasche or Brie-Flasche but not Brieflasche, whose
interpretation is problematic because there's no easy way to determine it
even with the funny semantics of the two alternatives; unless the author is
sure that ligatures are correctly handled, with an fl ligature for the
interpretation as Brie-Flasche, and no ligature, with narrow spacing,
between f and l for the interpretation as Brief-Lasche).
(Historically, German texts were full of ligatures -- much more often than 
other Latin-based written languages -- though those ligatures now tend to 
disappear from most modern publications; with the German rule that a ligature 
should not occur across a syllable boundary but should be present within the 
same radical, it's easy to see how ligatures are part of the orthographic 
system and carry a semantic value that helps the correct understanding of the 
text. It would therefore be even more important to use ZWNJ or ZWJ in German 
words rather than letting a renderer do this job automatically but 
inaccurately; for simplicity, I think that a ZWNJ inserted between radicals 
to prevent their ligation would be easier to manage than a ZWJ between two 
ligatable letters that must be kept in the same syllable.)
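
A tiny sketch of that use of ZWNJ (U+200C) at the compound boundary, so a 
renderer will not form the fl ligature across Brief|lasche; the strings are 
just the example from the discussion:

    public class GermanLigatureControl {
        public static void main(String[] args) {
            // ZWNJ between the radicals "Brief" and "lasche" blocks an fl
            // ligature across the boundary; without it, a naive renderer may
            // ligate "fl" as if the word were Brie+Flasche.
            String briefLasche = "Brief\u200Clasche";
            // Stripping the control recovers the plain spelling unchanged.
            System.out.println(briefLasche.replace("\u200C", "")); // Brieflasche
        }
    }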




fl/fi ligature examples

2004-11-29 Thread Philippe Verdy
From: Otto Stolz [EMAIL PROTECTED]
Just because the st ligature is so uncommon (and the long  with its
t ligature is almost extinct), I was looking for an example involving
fl, or fi).
with ff :
   affable, baffe, biffer, Buffy, affriolant, effaroucher, effacer, ...
with ffl :
   effleurer, baffle, affligeant, ...
with fl :
   afleurer, flower, fleur, floral, floraison, inflation, déflation, flic, 
infliger...
with ffi :
   traffic, efficace, effilocher, officier, affiche, affine, ...
with fi :
   fi, fin, final, fil, fils, filature, filin, firme, firmament, 
aficionados, défi, figure...

Many more examples of modern and widely used words (at least in English and 
French, but probably also in most Romance languages and other European 
languages that include Latin roots)...
Other widely used ligatures include st and ct: est, test, acte, octet...




Re: Ideograph?!?

2004-11-29 Thread Philippe Verdy
From: Michael Norton (a.k.a. Flarn) [EMAIL PROTECTED]
What's an ideograph? Also, what's a radical?
Are they the same thing?
Some radicals (in the Han script) may be ideographs, but most ideographs are 
not radicals: they often (not always) combine one or more radicals with one 
or more strokes that are not radicals themselves.

Radicals in the Han script serve for classification and help users locate 
ideographs in dictionaries, which also consider the additional strokes 
(radicals are themselves made of a well-known number of strokes).

Ideographs rarely represent a concept or word on their own; most often they 
represent a single syllable. In Chinese many words are short and consist of 
two syllables, and so are written with two ideographs.

We could call these characters syllabographs instead of ideographs, but this 
might be confused with the concept of syllabaries, which are much simpler; 
Han ideographs can each represent very complex syllables (with diphthongs, 
multiple consonants, and distinctive tones), and sometimes (in fact rarely) a 
concept or word (which may be spelled with more than one syllable, depending 
on local dialects).

Many words are created from two ideographs, and the concept behind each 
ideograph is unrelated, or sometimes very far removed from, the meaning of 
the whole word. In that case the pair of ideographs is chosen mostly because 
the concepts are pronounced similarly in some dialect of Chinese (sometimes 
old dialects), and so they can be read phonetically (for example, Beijing is 
written with the two ideographs for bei and jing, but you may wonder why bei 
and jing were used, which concepts they represent, and what their relation is 
to the name of the city...).

For these reasons, some linguists prefer to speak of sinographs (a reference 
to Chinese), or sometimes pictographs (because of their visual form rather 
than their meaning)...




Re: Keyboard Cursor Keys

2004-11-30 Thread Philippe Verdy
From: Peter R. Mueller-Roemer [EMAIL PROTECTED]
Doug Ewell wrote:
Robert Finch wrote:
'm trying to implement a Unicode keyboard device, and I'd rather have
keyboard processing dealing with genuine Unicode characters for the
cursor keys, rather than having to use a mix of keyboard scan codes
and Unicode characters.
This will quickly spiral out of control as you move past the easy
cases like adding character codes for cursor control functions.
the easy cases like adding character codes for cursor control functions
are not so easy when you have a short phrase or R-text (Right-to Left)
embedded in a line of English (L-text).
(...)
This is not related to Robert's concern about why he has to use a mix of scan
codes and Unicode characters in a keyboard driver.
Scan codes are effectively not characters, but they are how a keyboard
communicates with the OS, before the OS translates those scan codes into
characters according to a keyboard map.
When no plain-text character is associated with a key, the keyboard map will
not map a character, but will leave the scan code mostly intact (in fact,
Windows drivers translate some physical scan codes to logical scan codes to
handle functions that have various positions depending on the keyboard, but
that should be treated as equivalent).
With MSKLC you won't notice these changes in the generated customized table,
because this translation is performed either in the BIOS, or in the keyboard
hardware, or in a default scan-code map within the generic keyboard driver,
so the effective keyboard mapping is from a pair of a virtual (translated)
scan code and a keyboard mode, to characters. The virtual scan codes are
simpler and also hide some details, such as the special byte 0x00 that can
prefix some extension keys before their actual scan code.
Also, the physical scan codes are defined on 7 bits only, the 8th bit
corresponding to key-press or key-release status; keyboards may generate
multiple key-press scan codes at regular intervals as long as the key is held
down (the rate of this auto-repetition is not specified in the keyboard
mapping but by an external setting, depending on user preferences, sent to
the running keyboard driver, which may pass this rate to the hardware as a
configuration command).
Several scan-code translations are performed in the generic keyboard driver,
such as recognizing the AltGr key of European keyboards as equivalent to
Ctrl+Alt (if this is enabled by a flag set in the custom keyboard map), or
translating Alt+digits on the numeric pad to compose either local OEM
characters or local ANSI characters. Note however that the generic keyboard
driver included with MSKLC has no support for composing characters from their
Unicode hexadecimal code point; for such a thing you need custom driver code,
not just a simple mapping table. The same goes for more complex input modes
(for example for Asian character sets) that can't be represented easily with
a simple table of pairs combining a current state mode and a logical scan
code.
Keyboard drivers also contain several other hardware-specific commands to set
some advanced keyboard features; MSKLC will not let you program them, but the
generic driver it contains includes several standard features, enabled
through a physical keyboard interface driver.
What remains for the application is a set of key-press/key-release events
with a virtual (translated) scan code, which an application may trap if it
does not want those events to be translated through the keyboard mapping in
the generic driver. The generic driver then intercepts the untrapped events
and translates them into characters according to the key mapping table you
have created in MSKLC.
Most applications will then only be interested in trapping some function keys
that never generate a character in the mapping table, and will leave the
other virtual scan codes (the VK_* codes) to be translated by the installed
local keymap, which will generate character events.
The form of the generated character events depends on the receiving
application: if the application with the keyboard focus is Unicode-aware, it
will wait for Unicode character events, and the keyboard mapping will send
those characters; if the application with the keyboard focus is waiting for
characters through some legacy interface emulation (BIOS, DOS, or Windows
ANSI), the mapping will still be used, but the Unicode characters in the
keymap will first be transcoded into the appropriate charset. If a Unicode
character can't be converted into that charset, the application will receive
an error event, which by default takes the form of a sound, or the generation
of a default character, or sometimes a combination of both. Such events won't
occur for virtual keys that have no mapping to a Unicode character in the
localized keymap.
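
As a purely conceptual sketch (not the actual Windows driver data 
structures), the effective mapping described above can be pictured as a table 
from a (virtual key, modifier state) pair to a Unicode character; the key 
values and the AltGr example below are illustrative assumptions:

    import java.util.Map;

    public class KeymapSketch {
        record Key(int virtualKey, int modifiers) {}
        static final int MOD_NONE = 0, MOD_SHIFT = 1, MOD_ALTGR = 2; // AltGr ~ Ctrl+Alt
        // Illustrative entries: VK 0x41 = 'A' key, VK 0x45 = 'E' key.
        static final Map<Key, Character> KEYMAP = Map.of(
                new Key(0x41, MOD_NONE),  'a',
                new Key(0x41, MOD_SHIFT), 'A',
                new Key(0x45, MOD_ALTGR), '\u20AC'); // euro sign on many layouts
        public static void main(String[] args) {
            System.out.println(KEYMAP.get(new Key(0x45, MOD_ALTGR))); // €
        }
    }
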
Note: virtual keys are what allow many models of keyboard to be unified
behind the same logical key codes.

Re: Relationship between Unicode and 10646

2004-11-30 Thread Philippe Verdy
From: Peter Kirk [EMAIL PROTECTED]
On 30/11/2004 19:53, John Cowan wrote:
Your main misunderstanding seems to be your belief that WG2 is a
democratic body; that is, that it makes decisions by majority vote. ...
Thank you, John. This was in fact my question: will the amendment be 
passed automatically if there is a majority in favour, or does it go back 
for further discussion until a consensus is reached? You have clarified 
that the latter is true. And I am glad to hear it.
Probably the WG2 will now consider alternatives and examine how Phoenician 
can be represented. The current proposal may be voted down for reasons other 
than a formal opposition to the idea of encoding it as a separate script: 
possibly because the proposal is still incomplete, or does not resolve 
significant issues, or does not actually help in working with Phoenician 
texts on computers...

There may be arguments caused by the difficulty of treating the several 
variations of Phoenician, or possibly a misrepresentation of what the new 
script is supposed to cover (given that Phoenician sits at the connecting 
node of several scripts, and may cause specific difficulties where some 
variations lean toward the later Greek, Hebrew or Arabic scripts).

If the script itself is not well delimited, there's no reason to encode it 
yet; it is preferable to approach it from one of the existing branches. How 
the various branches converge back to the original script may raise lots of 
unresolved questions, and other, more complex problems if Phoenician is not 
the root of the tree and has predecessors of its own.

So maybe it's too soon to encode Phoenician now, given that its immediate 
successors are still not encoded and a formal model for them is still 
missing.

In addition, there may already be several alternatives for its 
representation, with strong and antagonistic arguments from either Hellenists 
or Semitists, who have adopted distinct models for the same source texts, 
based on the models they have established for its successors.

So there's possibly a need to reconcile (unify) these models, even if this 
requires encoding some well-identified letters with distinct codes, depending 
on their later semantic evolution, or on the set of variants they should 
cover.

My opinion is that Semitists are satisfied today when handling Phoenician 
text as if it were a historic variant of Hebrew, and Hellenists are satisfied 
treating it as a historic variant of Greek (which itself could be written 
alternatively RTL, LTR, or boustrophedon).

One way to reconcile those approaches would be a transliteration scheme. 
Until such a working transliteration scheme, specifying the matching rules, 
is created, it may be hard to define prematurely the set of letters needed to 
represent Phoenician texts.

My view does not exclude a future encoding of Phoenician, to avoid constant 
transliteration of the same texts, but for now the need to do it is neither 
justified nor urgent.

In the interim, fonts can be built for Phoenician according to the encoding 
of Hebrew, or according to the encoding of Greek, and this can fit the 
respective work of the two groups of researchers.

If both later agree on the same set of base letters and variants, they could 
create a more definitive set of representative letters and variants, and 
formulate a future proposal for a separate script encoding, from which an 
easy transliteration from the legacy Hebrew or Greek representations will be 
possible.

What do you think of this answer?



Re: Nicest UTF

2004-12-02 Thread Philippe Verdy
There's no *universal* best encoding.
UTF-8, however, is certainly today the best encoding for portable 
communications and data storage (but it now competes with SCSU, a compressed 
form in which, on average, each Unicode character is represented by one byte 
in most documents; other schemes also exist that apply deflate compression on 
top of UTF-8).

The problem with UTF-16 and UTF-32 is byte ordering, where byte is meant in 
terms of portable networking and file storage, i.e. 8 bits in almost all 
current technologies. With UTF-16 and UTF-32 you need a way to determine how 
bytes are ordered within a code unit when reading from a byte-oriented 
stream. With UTF-8 you don't.

The problem with UTF-8 is that it is often inefficient or awkward to work 
with inside applications and libraries, which find it easier to access 
strings and count characters using fixed-width code units.

Although UTF-16 is not strictly fixed-width, it is quite easy to work with, 
and is often more efficient than UTF-32 because of smaller memory 
allocations.

UTF-32, however, is the easiest solution when applications really want to 
handle each character encoded on one Unicode code point as a single code 
unit.
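
To make the trade-off concrete, here is a small sketch comparing the encoded 
sizes of one sample string (the sample text is mine; the UTF-32 charsets are 
assumed to be available in the runtime, as they are in the standard JDK):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class UtfSizes {
        public static void main(String[] args) {
            // Mostly Latin text plus one supplementary character (U+10330, Gothic).
            String sample = "Unicode text \uD800\uDF30";
            System.out.println(sample.getBytes(StandardCharsets.UTF_8).length);      // 17
            System.out.println(sample.getBytes(StandardCharsets.UTF_16BE).length);   // 30
            System.out.println(sample.getBytes(Charset.forName("UTF-32BE")).length); // 56
            // The byte-order problem: the unmarked "UTF-16" charset has to prepend
            // a byte order mark when encoding, which UTF-8 never needs.
            System.out.println(sample.getBytes(StandardCharsets.UTF_16).length);     // 32
        }
    }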

All UTF encodings (including the SCSU compressed encoding, BOCU-1, and now 
also the Chinese GB18030 standard, which is a valid representation of 
Unicode) have their pros and cons.

Choose among them because they are widely documented and offer good 
interoperability with lots of libraries that handle them with similar 
semantics.

If these encodings do not satisfy your application, you may even create your 
own (as Sun did when modifying UTF-8 to allow representing any Unicode string 
within a null-terminated C string, and also to allow any sequence of 16-bit 
code units, even invalid ones with unpaired surrogates, to be represented on 
8-bit streams). If you do that, don't expect the encoding to be easily 
portable and recognized by other systems, unless you document it with a 
complete specification and make it available for free alternate 
implementations by others.

- Original Message - 
From: Arcane Jill [EMAIL PROTECTED]
To: Unicode [EMAIL PROTECTED]
Sent: Thursday, December 02, 2004 2:19 PM
Subject: RE: Nicest UTF


Oh for a chip with 21-bit wide registers!
:-)
Jill
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Behalf Of Antoine Leca
Sent: 02 December 2004 12:12
To: Unicode Mailing List
Subject: Re: Nicest UTF
There are other factors that might influence your choice.
For example, the relative cost of using 16-bit entities: on a Pentium it 
is
cheap, on more modern X86 processors the price is a bit higher, and on 
some
RISC chips it is prohibitive (that is, short may become 32 bits; 
obviously,
in such a case, UTF-16 is not really a good choice). On the other extreme,
you have processors where byte are 16 bits; obviously again, then UTF-8 is
not optimum there. ;-)






Re: Nicest UTF

2004-12-02 Thread Philippe Verdy
If you need immutable strings that take as little space as possible in memory 
for your running app, then consider using SCSU for the internal storage of 
the string object, and have a method return an indexed array of code points, 
or a UTF-32 string, when you need to mutate the string object into another 
one.

SCSU is excellent for immutable strings, and is a *very* tiny overhead above 
ISO-8859-1 (note that the conversion from ISO-8859-1 to SCSU is extremely 
trivial, maybe even simpler than to UTF-8!)
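
A sketch of why that conversion is trivial: in SCSU's default single-byte 
mode, printable Latin-1 bytes (plus the NUL, TAB, LF, CR controls) are 
already a valid SCSU stream, so only other C0 controls would need quoting. 
The helper below handles only that easy case and is an assumption of mine, 
not a full SCSU encoder:

    import java.nio.charset.StandardCharsets;

    public class Latin1ToScsu {
        // Returns the SCSU encoding of a Latin-1-only string, valid only when no
        // C0 control other than NUL, TAB, LF, CR is present (those bytes pass
        // through unchanged in SCSU's initial single-byte mode, window 0 = U+0080).
        static byte[] encode(String s) {
            byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
            for (byte b : latin1) {
                int v = b & 0xFF;
                boolean passThrough = v >= 0x20 || v == 0x00 || v == 0x09
                        || v == 0x0A || v == 0x0D;
                if (!passThrough) {
                    throw new IllegalArgumentException("needs SCSU quoting: 0x"
                            + Integer.toHexString(v));
                }
            }
            return latin1; // byte-for-byte identical to the ISO-8859-1 form
        }
        public static void main(String[] args) {
            System.out.println(encode("café crème").length); // 10 bytes
        }
    }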

From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
For internals of my language Kogut I've chosen a mixture of ISO-8859-1
and UTF-32. Normalized, i.e. a string with characters which fit in
narrow characters is always stored in the narrow form.
I've chosen representations with fixed size code points because
nothing beats the simplicity of accessing characters by index, and the
most natural thing to index by is a code point.
Strings are immutable, so there is no need to upgrade or downgrade a
string in place, so having two representations doesn't hurt that much.
Since the majority of strings is ASCII, using UTF-32 for everything
would be wasteful.
Mutable and resizable character arrays use UTF-32 only.




Re: Nicest UTF

2004-12-03 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED]
I appreciate Philippe's support of SCSU, but I don't think *even I*
would recommend it as an internal storage format.  The effort to encode
and decode it, while by no means Herculean as often perceived, is not
trivial once you step outside Latin-1.
I said: for immutable strings, which means strings instantiated for the long 
term and multiple reuses. In that sense, what really matters is decoding, not 
the effort to encode (which is minimal for ISO-8859-1 source texts, or for 
UTF-encoded texts that only use characters from the first page).

Decoding SCSU is very straightforward, even though it is stateful (at the 
internal character level). But for immutable strings there's no need to 
handle various initial states, and the state associated with each component 
character of the string does not matter (strings being immutable, only the 
decoding of the string as a whole makes sense).

The stateful decoding of SCSU can be part of an accessor in a storage class, 
which can also easily be optimized to avoid multiple reallocations of the 
decoded buffer.

SCSU is only a complication if you want mutable strings; however, mutable 
strings are needed only if you intend to transform a source text and work on 
its content. If this is a temporary need while creating other immutable 
strings, you can still use SCSU to encode the final results, and work with 
UTFs for intermediate results.

In a text editor, where you constantly need to work at the character level, 
the text is not immutable, and SCSU is effectively not a good encoding to 
work on (but all UTFs, including UTF-8 or GB18030, are easy to work with at 
this level).

In practice, a text editor often needs to split the edited text into 
manageable fragments encoded separately, for performance reasons (text 
insertion and deletion in a large buffer is a lengthy and costly operation). 
Given that UTFs can increase the memory needed, it is not completely stupid 
to consider a compression scheme for individual fragments of a large text 
file; the cost of encoding/decoding SCSU, if it limits the number of VM swaps 
to disk needed to access more fragments, can be an interesting optimization, 
as the total size on disk will be smaller, reducing the number of I/O 
operations and so improving the program's responsiveness to user commands.

(Note that such compression schemes already exist even within filesystems 
that support editable but still compressed files... SCSU is not the option 
used in this case, because it is too specific to Unicode texts; they use much 
more complex compression schemes, most often derived from Lempel-Ziv-Welch 
algorithms, and this does not significantly increase the total load time, 
given that it also significantly reduces the frequency of disk I/O, which is 
a much longer and costlier operation...)

The bad thing about SCSU is that the compression scheme is not deterministic: 
you can't easily compare two instances of strings encoded with SCSU (because 
several alternative encodings are possible) without actually decoding them 
before performing their collation (with standard UTFs, including the Chinese 
GB18030 standard, the encoding is deterministic and allows comparing encoded 
strings without first decoding them).

But this is also true of almost all compression schemes, even the well-known 
deflate algorithm, very basic compressors like RLE, or the newer bzip2 
compression (depending on the compressor implementation used, some tunable 
parameters, and the number of alternatives and size of internal dictionaries 
considered during compression).

The advantage of SCSU over generic data compressors like deflate is that it 
does not require a large and complex state (all the SCSU decoding state fits 
in a very limited number of fixed-sized variables), so its decompression can 
easily be hard-coded and heavily optimized, to the point where the cost of 
decompression is nearly invisible to almost all applications: the most 
significant costs will most often be in collators or text parsers; a 
compliant UCA collation algorithm is much more complex to implement and 
optimize than a SCSU decompressor, and it is more CPU- and resource-intensive.




Re: Nicest UTF

2004-12-03 Thread Philippe Verdy
From: Lars Kristan
I agree. But not for reasons you mentioned. There is one other important 
advantage:
UTF-8 is stored in a way that permits storing invalid sequences. I will 
need to
elaborate that, of course.
Not true for UTF-8. UTF-8 can only store valid sequences of code points, in 
the valid range from U+0000 to U+D7FF and U+E000 to U+10FFFF (so excluding 
the surrogate code points).

But it's true that there are non-standard extensions of UTF-8 (such as Sun's, 
for Java) that escape some byte values normally generated by standard UTF-8 
(notably the single byte 0x00 representing U+0000), or that allow 
representing isolated or incorrectly paired surrogate code points which may 
be present in a normally invalid Unicode string, or that represent non-BMP 
characters with 6 bytes, where each group of 3 bytes represents a surrogate 
code unit (not a code point!).

Only the CESU-8 variant of UTF-8 is documented and standardized (non-BMP 
characters are represented by encoding, in two groups of 3 bytes, the two 
surrogate code units that would be used in UTF-16 to represent the same 
character). CESU-8 is less efficient than UTF-8, but even then it does not 
allow representing invalid Unicode strings containing surrogate *code 
points*, which are not characters (I did not say *code units*), even if they 
appear to be correctly paired (the concept of paired surrogates only exists 
within the UTF-16 encoding scheme, which represents strings not as streams of 
characters coded with code points, but as streams of 16-bit code units).
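
For illustration, Java's own modified UTF-8 (as written by 
DataOutputStream.writeUTF, which also prepends a 2-byte length) shows both 
deviations: NUL becomes the two bytes 0xC0 0x80, and a supplementary 
character is written as two 3-byte surrogate sequences instead of one 4-byte 
sequence. The sample string is mine:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    public class ModifiedUtf8Demo {
        public static void main(String[] args) throws IOException {
            // 'A', NUL, and U+10330 (a Gothic letter, outside the BMP).
            String s = "A\u0000\uD800\uDF30";
            byte[] standard = s.getBytes(StandardCharsets.UTF_8);
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);
            byte[] modified = bos.toByteArray();
            System.out.println(standard.length); // 6: 1 + 1 + 4
            System.out.println(modified.length); // 11: 2 (length) + 1 + 2 + 3 + 3
        }
    }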

If you need extensions like this, it is because you need to represent data 
which is not valid Unicode text. Such an extended scheme is not a UTF, but a 
serialization format for that type of data (even if the type can represent 
all instances of valid Unicode text).




Re: OpenType vs TrueType (was current version of unicode-font)

2004-12-03 Thread Philippe Verdy
From: Gary P. Grosso [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, December 03, 2004 5:10 PM
Subject: RE: OpenType vs TrueType (was current version of unicode-font)

Hi Antoine, others,
Questions about OpenType vs TrueType come up often in my work, so perhaps 
the list will suffer a couple of questions in that regard.

First, I see an O icon, not an OT icon in Windows' Fonts folder for 
some fonts and a TT icon for others.  Nothing looks like OT to me, so 
are we talking about the same thing?
See www.opentype.org: OpenType is a trademark of Microsoft Corporation 
(bottom of page).
The hand-drawn-like O is a logo used by Microsoft as the icon representing 
OpenType fonts.

However, the OpenType web site apparently now consists only of this 
presentation page, with a single link to the Monotype Corporation, not to the 
previous documentation hosted by Microsoft.

Is Microsoft stopping its support of OpenType and about to sell the 
technology to the Monotype font foundry?




Re: Nicest UTF

2004-12-03 Thread Philippe Verdy
From: Asmus Freytag [EMAIL PROTECTED]
A simplistic model of the 'cost' for UTF-16 over UTF-32 would consider
1) 1 extra test per character (to see whether it's a surrogate)
2) special handling every 100 to 1000 characters (say 10 instructions)
3) additional cost of accessing 16-bit registers (per character)
4) reduction in cache misses (each the equivalent of many instructions)
5) reduction in disk access (each the equivalent of many many
instructions)
(...)
For 4 and 5, the multiplier is somewhere in the 100s or 1000s, for each
occurrence depending on the architecture. Their relative weight depends
not only on cache sizes, but also on how many other instructions per
character are performed. For text scanning operations, their cost
does predominate with large data sets.
I tend to disagree with you on points 4 and 5: cache misses and disk accesses 
(more commonly referred to as data locality in computing performance) really 
favor UTF-16 over UTF-32, simply because UTF-16 will be more compact for 
almost every text you need to process, unless you are working on texts that 
only contain characters from a script *not present at all* in the BMP (this 
excludes Han, even though there are tons of ideographs outside the BMP, 
because those ideographs are almost never used alone, but appear sparsely 
among tons of other conventional Han characters in the BMP).

Given that those scripts are all historic, or were encoded for technical 
purposes with very specific usage, a very large majority of texts will not 
use a significant number of characters outside the BMP, so surrogates will 
remain a minority in UTF-16 data. In any case, even for texts made only of 
characters outside the BMP, UTF-16 can't be larger than UTF-32.

The only case where it would be worse than UTF-32 is the internal 
representation of strings in memory on a machine where 16-bit code units 
can't be stored in 16 bits, for example if memory cells are not individually 
addressable below units of at least 32 bits and the CPU architecture is very 
inefficient when working with 16-bit bitfields within 32-bit memory units or 
registers, due to the extra shift and mask operations needed to pack and 
unpack 16-bit bitfields into a single 32-bit memory cell.

I doubt such an architecture would be very successful, given that too many 
standard protocols depend on being able to work with data streams made of 
8-bit bytes: on such an architecture, all data I/O would need to store 8-bit 
bytes in separate addressable 32-bit memory cells, which would make really 
poor use of the available central memory (such an architecture would require 
much more RAM for equivalent data-I/O performance, and even the very costly 
fast RAM caches would need to grow a lot, meaning higher hardware costs).

So even on such 32-bit-only (or 64-bit-only...) architectures (where, for 
example, the C datatype char would be 32 or 64 bits), there would be 
efficient CPU instructions for packing and unpacking bytes in 32-bit (or 
64-bit) memory cells (or at least, at the register level, instructions for 
working efficiently with such bitfields).




Re: Nicest UTF

2004-12-03 Thread Philippe Verdy
From: Theo [EMAIL PROTECTED]
From: Asmus Freytag [EMAIL PROTECTED]
So, despite it being  UTF-8 case insensitive, it was totally blastingly 
fast. (One person reported counting words at 1MB/second of pure text, from 
within a mixed Basic / C environment). You'll need to keep in mind, that 
the counter must look up through thousands of words (Every single word its 
come across in the text), on every single word lookup.

Anyhow, from my experience, UTF-8 is great for speed and RAM.
Probably true for English or most Western European Latin-based languages 
(plus Greek and Coptic).

But for other languages that still use lots of characters in the range 
U+0000 to U+03FF (C0 and C1 controls, Basic Latin, Latin-1 Supplement, Latin 
Extended-A and -B, IPA Extensions, Spacing Modifier Letters, Combining 
Diacritical Marks, Greek and Coptic), UTF-8 and UTF-16 may be nearly as 
efficient as each other.

For all others, which need lots of characters outside the range U+0000 to 
U+03FF (Cyrillic, Armenian, Hebrew, Arabic, and all Asian, Native-American or 
African scripts, or even PUAs), UTF-16 is better (more compact in memory, so 
faster).

UTF-32 will be better only for historic texts written almost completely with 
characters outside the BMP (for now, only Old Italic, Gothic, Ugaritic, 
Deseret, Shavian, Osmanya, and Cypriot Syllabary), and only if C0 controls 
(such as TAB, CR and LF), ASCII SPACE, and NBSP are a minority.




Re: OpenType vs TrueType (was current version of unicode-font)

2004-12-03 Thread Philippe Verdy
From: Peter Constable [EMAIL PROTECTED]
Why would you think the creation of this site might suggest that
Microsoft is selling off its IP in relation to OpenType to Monotype? If
Motorola created a site www.pentium4.org, would you jump to the
conclusion that they were selling off that IP?
What alarmed me is that this domain previously referenced Microsoft's 
documentation.
Also the fact that Monotype was sold by Agfa, with its name changed.
Also the fact that Microsoft's presentation of OpenType (previously TrueType 
Open, previously TrueType) has removed the reference to Apple's contributions 
to TrueType, leaving only Microsoft as the owner of the trademark and 
technology (also partly attributed to Adobe).
With Apple now supporting other layout tables that are not referenced in the 
Microsoft documentation for OpenType, this really suggested to me a branch 
split after a disagreement (reinforced by the new status of Monotype).
What is also strange is that the www.opentype.org web site is a page whose 
title refers to Arial Unicode MS. Isn't that a Microsoft font? All these 
things combined are very intriguing.
Is there a way outside OpenType for system vendors other than Microsoft and 
Apple? This standard looks more and more proprietary... 




Re: Nicest UTF

2004-12-04 Thread Philippe Verdy
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Philippe Verdy [EMAIL PROTECTED] writes:
Random access by code point index means that you don't use strings
as immutable objects,
No. Look at Python, Java and C#: their strings are immutable (don't
change in-place) and are indexed by integers (not necessarily by code
points, but it doesn't change the point).
Those strings are not really indexed. They are just accessible through 
methods or accessors that act *as if* they were arrays. Nothing requires the 
string storage to use the same exposed array, and in fact you can work with 
immutable strings as if they were vectors of code points, or vectors of code 
units, or sometimes vectors of bytes.

Note, for example, the difference between the .length property of Java 
arrays and the .length() method of Java String instances...
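
A small illustration of that distinction between exposed units and code 
points (the sample string, with a combining accent and one supplementary 
character, is mine):

    public class CodePointDemo {
        public static void main(String[] args) {
            // 'e' + COMBINING ACUTE ACCENT + U+10330 (encoded as a surrogate pair).
            String s = "e\u0301\uD800\uDF30";
            System.out.println(s.length());                      // 4 UTF-16 code units
            System.out.println(s.codePointCount(0, s.length())); // 3 code points
            s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
        }
    }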

Note also that converting an array of bytes, code units or code points to a 
String requires distinct constructors, and that the storage is copied rather 
than simply referenced (the main reason being that indexed vectors or arrays 
are mutable in their indexed content, but String instances are not, which is 
what makes them sharable).

Anyway, each time you use an index to access some component of a String, the 
returned value is not an immutable String but a mutable character or code 
unit or code point, from which you can build *other* immutable Strings 
(using, for example, mutable StringBuffers or StringBuilders or similar 
objects in other languages). When you do that, nothing about the returned 
character or code unit or code point guarantees that you'll build valid 
Unicode strings. In fact, such a character-level interface is not enough to 
work with and transform Strings (for example, it is not sufficient to perform 
correct transformation of letter case, or to manage grapheme clusters). The 
most powerful (and universal) transformations are those that don't use these 
interfaces directly, but that take complete Strings and return complete 
Strings.

The character-level APIs are conveniences for very basic legacy 
transformations, but they do not solve most internationalization problems on 
their own; rather, they are used as a protected interface on which more 
powerful String-to-String transformations are built.

Once you realize that, which UTF you use to handle immutable String objects 
is not important, because it becomes part of the black-box implementation of 
String instances. If you consider the UTF a black box, then the real argument 
for one UTF or another depends on the set of String-to-String transformations 
you want to use (because it conditions the implementation of those 
transformations), but more importantly it affects the efficiency of the 
String storage allocation.

For this reason, the black box can itself determine which UTF or internal 
encoding is best for performing those transformations: the total volume of 
immutable string instances to keep in memory, and the frequency of their 
instantiation, determine which representation to use (because large String 
volumes put pressure on the memory manager and seriously affect overall 
application performance).

Using SCSU for such a String black box can be a good option if it effectively 
helps store many strings in a compact (for global performance) but still very 
fast (for transformations) representation.

Unfortunately, the immutable String implementations in Java, C# or Python do 
not allow the application designer to decide which representation will be 
best (they are implemented as concrete classes instead of virtual interfaces 
with possible multiple implementations, as they should be; an alternative to 
interfaces would have been class-level methods allowing the application to 
negotiate the tuning parameters with the black-box class implementation).

There are other classes and libraries in which such multiple representations 
are possible and easily and transparently convertible from one to the other. 
(Note that this discussion concerns the UTF used to represent code points; 
but today there is also a need to work on strings within grapheme cluster 
boundaries, including the various normalization forms, and a few libraries do 
exist in which the normalization can be changed without affecting the 
immutable nature of Strings, the complexity being that Strings do not always 
represent plain text...)




Re: Nicest UTF

2004-12-05 Thread Philippe Verdy
- Original Message - 
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Sunday, December 05, 2004 1:37 AM
Subject: Re: Nicest UTF


Philippe Verdy [EMAIL PROTECTED] writes:
There's nothing that requires the string storage to use the same
exposed array,
The point is that indexing should better be O(1).
SCSU is also O(1) in terms of indexing complexity... simply because it keeps 
an exact equivalence with code points, and requires a *fixed* (and small) 
number of steps to decode to code points, but also because the decoder state 
uses a *fixed* (and small) number of variables for its internal context 
(unlike more powerful compression algorithms such as dictionary-based, 
Lempel-Ziv-Welch-like algorithms like deflate).

Not having a constant size per code point requires one of three things:
1. Using opaque iterators instead of integer indices.
2. Exposing a different unit in the API.
3. Living with the fact that indexing is not O(1) in general; perhaps
  with clever caching it's good enough in common cases.
Altough all three choices can work, I would prefer to avoid them.
If I had to, I would probably choose 1. But for now I've chosen a
representation based on code points.
Anyway, each time you use an index to access to some components of a
String, the returned value is not an immutable String, but a mutable
character or code unit or code point, from which you can build
*other* immatable Strings
No, individual characters are immutable in almost every language.
But individual characters do not always carry a semantic value on their own. 
For languages, the relevant unit is almost always the grapheme cluster, not 
the character (and so not its code point...). As grapheme clusters have 
variable lengths, an algorithm that could only work with fixed-width units 
would not work internationally, or would cause serious problems for the 
correct analysis or transformation of real languages.
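
As a sketch of iterating by that unit rather than by code point, 
java.text.BreakIterator's character instance approximates grapheme cluster 
boundaries (the sample string is mine):

    import java.text.BreakIterator;
    import java.util.Locale;

    public class GraphemeDemo {
        public static void main(String[] args) {
            // "é" spelled as e + combining acute, followed by "a": 3 code points,
            // but only 2 user-perceived characters.
            String s = "e\u0301a";
            BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
            it.setText(s);
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE;
                    start = end, end = it.next()) {
                System.out.println(s.substring(start, end)); // prints "é" then "a"
            }
        }
    }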

Assignment to a character variable can be thought as changing the
reference to point to a different character object, even if it's
physically implemented by overwriting raw character code.
When you do that, the returned character or code unit or code point
does not guarantee that you'll build valid Unicode strings. In fact,
such character-level interface is not enough to work with and
transform Strings (for example it does not work to perform correct
transformation of lettercase, or to manage grapheme clusters).
This is a different issue. Indeed transformations like case mapping
work in terms of strings, but in order to implement them you must
split a string into some units of bounded size (code points, bytes,
etc.).
Yes, but why should that intermediate unit be the code point? Such algorithms 
can be developed over any UTF, or even over compressed encoding schemes, 
through accessor or enumerator methods...

All non-trivial string algorithms boil down to working on individual
units, because conditionals and dispatch tables must be driven by
finite sets. Any unit of a bounded size is technically workable, but
they are not equally convenient. Most algorithms are specified in
terms of code points, so I chose code points for the basic unit in
the API.
Most is the right word here: this is not a requirement, and the fact that 
it is the simplest way to implement such an algorithm does not make it the 
most efficient in terms of performance or resource allocation. Experience 
shows that the most efficient algorithms are often also complex to implement.

Code points are probably the easiest way to describe what a text algorithm is 
supposed to do, but this is not a requirement for applications (in fact many 
libraries have been written that correctly implement the Unicode algorithms 
without ever dealing with code points, working only with in-memory code units 
of UTF-16, or even UTF-8 or GB18030, or directly with serialization bytes of 
UTF-16LE, UTF-8, SCSU or other encoding schemes).

Which representation will be best is left to implementers, but I really think 
that compressed schemes are often introduced to increase application 
performance and reduce the resources needed, both in memory and for I/O, but 
also in networking, where interoperability across systems and bandwidth 
optimization are also important design goals...




Re: script complexity, was Re: OpenType vs TrueType

2004-12-05 Thread Philippe Verdy
Richard Cook rscook at socrates dot berkeley dot edu wrote:
Script complexity is not so easily quantified. Has anyone tried to
sort scripts by complexity? In terms of the present discussion, Han
would be viewed as a simple script, and yet it is simple only in
terms of the script model in which ideographs are the smallest unit.
In a stroke-based Han script model, Han is at least as complex as any.
If Han had not been encoded with an ideograph-based model, maybe(?) we would 
have needed far fewer code points. However, the main immediate problem would 
have been that the layout of the composing radicals and strokes within the 
ideographic square is very complex, highly contextual, and in fact far too 
variable across dialects and script forms for a layout algorithm to be 
designed and standardized.

At best one could have standardized a Han strokes-to-square layout system, 
but it would have required a huge dictionary with many dialect-specific 
sections to handle the variant forms and placements of the composing strokes. 
In addition, the square model is not imperative in Han, because there are 
various styles for writing it in which the usual square model is much 
relaxed, or simply not observed in actual documents.

To model such variations in a stroke-based model, it would have been 
necessary to encode:
- the strokes themselves (all of them, not just the radicals!),
- stroke variants,
- descriptive composition pseudo-characters (like the existing IDCs in 
Unicode),
- dialectal composition rules,
and then to create a very complex specification describing each ideograph 
according to this model, allowing a renderer to redraw the ideographs from 
such composition grapheme clusters.
The second problem is that the GB* and Big5 encodings already existed as 
widely used standards, and there was no concrete and interoperable solution 
for representing Han characters with such composed sequences.

This modeling was possible for Hangul, but with a simplification: the encoded 
jamos sometimes represent several strokes (considered as letters, also because 
they have a clear phonetic value, but sometimes grouped within the same jamo 
to simplify the design of the Hangul layout system, notably for the 
double-consonant SANG* jamos). A simpler system of jamos would still have been 
possible (for example, it was easy to model the double-consonant jamos as two 
successive simpler jamos, and then update the Hangul syllable model 
accordingly).




Re: Unicode for words?

2004-12-05 Thread Philippe Verdy
From: Ray Mullan [EMAIL PROTECTED]
I don't see how the one million available codepoints in the Unicode 
Standard could possibly accommodate a grammatically accurate vocabulary of 
all the world's languages.
You have misread the message from Tim: he wanted to use code points above 
U+10FFFF, within the full 32-bit space (meaning more than 4 billion code 
points, when Unicode and ISO/IEC 10646 only allow a little over 1.1 million...)

He wanted to use that to encode words as single code points, as a possible 
compression scheme. But he forgets that a word can have its component letters 
affected by style or during rendering.

Also, a font or renderer would be unable to draw the text without having 
the equivalent of an indexed dictionary of all words on the planet!

If compression is the goal, he forgets that the space gain offered by such 
compression would be very modest compared to more generic data compressors 
like deflate or bzip2, which can compress the represented texts more 
efficiently without even needing such a large dictionary (one that is in 
perpetual evolution by every speaker of every language, without any prior 
standard agreement anywhere!).
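
For comparison, a generic compressor needs no shared word list at all; a 
minimal sketch using the JDK's java.util.zip.Deflater on the UTF-8 form of a 
text (the sample string and class name are just illustrations of mine):

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.Deflater;

    // Sketch: compress the UTF-8 form of a text with deflate. No dictionary of
    // words is needed; the compressor discovers the repetitions by itself.
    public final class DeflateDemo {
        static byte[] deflate(String text) {
            byte[] input = text.getBytes(StandardCharsets.UTF_8);
            Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));
            }
            deflater.end();
            return out.toByteArray();
        }

        public static void main(String[] args) {
            String sample = "the cat is on the mat and the dog is at the door";
            System.out.println(sample.getBytes(StandardCharsets.UTF_8).length
                    + " UTF-8 bytes -> " + deflate(sample).length + " deflated bytes");
        }
    }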

Forget his idea; it is technically impossible to do. At best you could create 
some protocols that compact some widely used words (this is what WAP does for 
widely used HTML elements and attributes), but this is still not a standard 
outside of that limited context.

Suppose that Unicode encoded the common English words the, an, is, etc.; 
then a protocol could decide that these words are not important and filter 
them out. What would happen if these words appeared in non-English languages 
where they are semantically significant? These words would go missing. To work 
around this problem, the code points would have to designate only the words 
used in one language and not the others, so an would have different codes 
depending on whether it is used in English or in another language.

The last problem is that too many languages do not have well-established and 
computerized lexical dictionaries, and the grammatical rules that allow 
composing words are not always known. Nor can the number of words in a single 
language be bounded by a known maximum (a good example is German, where 
compound words are virtually unlimited!).

So forget this idea: Unicode will not create a standard to encode words. 
Words will be represented after modeling them in a script system made of 
simpler sets of letters or ideographs, punctuation and diacritics. The 
representation of words with those letters is an orthographic system, specific 
to each language, that Unicode will not standardize.




Re: Nicest UTF

2004-12-05 Thread Philippe Verdy
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Philippe Verdy [EMAIL PROTECTED] writes:
The point is that indexing should better be O(1).
SCSU is also O(1) in terms of indexing complexity...
It is not. You can't extract the nth code point without scanning the
previous n-1 code points.
The question is why you would need to extract the nth codepoint so blindly. 
If you have such reasons, because you know the context in which this index 
is valid and usable, then you can as well extract a sequence using an index 
in the SCSU encoding itself using the same knowledge.

Linguistically, extracting a substring or characters at an arbitrary index in 
a sequence of code points will only cause you problems. In general, you will 
more likely use an index as a way to mark a known position that you have 
already parsed sequentially in the past.

However, it is true that if you have determined a good index position to 
allow future extraction of substrings, SCSU will be more complex, because you 
need to remember not only the index but also the current state of the SCSU 
decoder, to allow decoding the characters encoded starting at that index. This 
is not needed for the UTFs and most legacy character encodings, national 
standards, or GB18030, which looks like a valid UTF even though it is not 
part of the Unicode standard itself.
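
In other words, a resumable position in an SCSU stream has to capture the 
decoder state alongside the byte offset; here is a sketch of what such a 
bookmark would need to hold (the class and field names are mine, the state 
items follow the decoder model of UTS #6):

    // Hypothetical bookmark allowing SCSU decoding to resume at a saved position.
    // The byte offset alone is not enough: the active window, the (redefinable)
    // dynamic window positions and the current mode are all part of the state.
    final class ScsuBookmark {
        final long byteOffset;          // position in the SCSU byte stream
        final boolean unicodeMode;      // single-byte mode vs. Unicode mode
        final int activeWindow;         // currently selected dynamic window
        final int[] dynamicWindowStart; // the 8 dynamic window start offsets

        ScsuBookmark(long byteOffset, boolean unicodeMode,
                     int activeWindow, int[] dynamicWindowStart) {
            this.byteOffset = byteOffset;
            this.unicodeMode = unicodeMode;
            this.activeWindow = activeWindow;
            this.dynamicWindowStart = dynamicWindowStart.clone();
        }
    }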

But remember the context in which this discussion was introduced: which UTF 
would be the best to represent (and store) large sets of immutable strings. 
The discussion about indexes into substrings is not relevant in that 
context.




Re: Unicode for words?

2004-12-05 Thread Philippe Verdy
Don't misinterpret my words or arguments here: the purpose of the question 
was strictly about which UTF or other transformation would be good for 
interoperability and storage, and whether it would be a good idea to encode 
words with standard codes.

So in my view, it is completely unnecessary to create such standard codes for 
common words, if these words are in a natural human language (it may make 
sense for computer languages, but this is specific to the implementation of 
such a language, and should be part of its specification rather than being 
standardized in a general-purpose encoding like Unicode code points, made to 
also fit all the needs of the representation of human languages, which are 
NOT standardized and are constantly evolving). Creating such standard codes 
for human words would not only be an endless task, but also a work that would 
rapidly become obsolete, given the very variable uses of human languages. 
Let's keep Unicode simple without attempting to encode words (even for 
Chinese, we encode ideographic characters, but not words, which are often made 
of two characters each representing a single syllable).

If you want to encode words, you create an encoding based on a pictographic 
representation of human languages, and you go the opposite way from the path 
followed, over a very long history of evolution, by the inventors of script 
systems. You would be returning to the first ages of humanity... when people 
had great difficulty understanding each other and transmitting their acquired 
knowledge.

This does not exclude using other UTF representations to implement algorithms, 
but only as an intermediate form which eases the processing. However, you are 
not required to create an actual instance of the other UTF to work with it, 
and there are many examples where you can work perfectly well with a compact 
representation that fits marvelously in memory with excellent performance, and 
where the decompressed form is only used locally.

In *many* cases, notably if the text data to manage like this is large, 
adding an object representation with just an API to access a temporary 
decompressed form will improve the global performance of the system, due to 
reduced internal processing resource needs. Code that decompresses SCSU to 
UTF-32 can fit in less than 1 KB of memory, yet it allows saving as many 
megabytes of memory as you wish for your large database, given that SCSU takes 
an average of nearly one byte per character (or code point) instead of 4 with 
UTF-32.

Such examples exist in real-world applications, notably in spelling and 
grammar checkers, whose performance depends completely on the total size of 
the information they have in their database, and on the level at which this 
information is compressed (to minimize the impact on system resources, which 
is mostly determined by the quantity of information you can fit into fast 
memory without swapping between fast memory and slow disk storage). 
The most efficient checkers use very compact forms with very specific 
compression and indexing schemes, through a transparent class managing the 
conversion between this compact form and the usual representation of text as 
a linear stream of characters.
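
A sketch of such a transparent class, using deflate from java.util.zip as a 
stand-in for the SCSU or custom scheme an actual checker would use (class and 
method names are mine):

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.DataFormatException;
    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    // Sketch: keep the text compressed in memory, expose a plain String only on
    // demand. Deflate stands in for whatever compact scheme is really used.
    final class CompressedText {
        private final byte[] compressed;

        CompressedText(String text) {
            byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
            Deflater d = new Deflater();
            d.setInput(utf8);
            d.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!d.finished()) out.write(buf, 0, d.deflate(buf));
            d.end();
            this.compressed = out.toByteArray();
        }

        String decompress() {
            try {
                Inflater inf = new Inflater();
                inf.setInput(compressed);
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                while (!inf.finished()) out.write(buf, 0, inf.inflate(buf));
                inf.end();
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            } catch (DataFormatException e) {
                throw new IllegalStateException(e);
            }
        }
    }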

Other examples exist in some RDBMSs, to improve the speed of query processing 
for large databases or the speed of full-text searches, or in their networking 
connectors to reduce the bandwidth taken by result sets. The benefit of data 
compression becomes immediate as soon as the data to process must go through 
any kind of channel (networking links, file storage, database tables) with 
lower throughput than the fast but expensive or restricted internal processing 
memory (including memory caches, if we consider data locality).

From: D. Starner [EMAIL PROTECTED]
Philippe Verdy writes:
Suppose that Unicode encodes the common English words the, an, is, 
etc... then a protocol
could decide that these words are not important and will filter them.
Drop the part of the sentence before then. A protocol could delete 
the, an, etc. right
now. In fact, I suspect several library systems do drop the, etc. right 
now. Not that this
makes it a good idea, but that's a lousy argument.
If such a library does this based only on the presence of the encoded words, 
without wondering in which language the text is written, that kind of text 
processing will be seriously inefficient or inaccurate when processing 
languages other than the English for which such a library was built.

For plain text (which is what Unicode deals with), even the words an, the, 
is (and so on...) are just as important as other parts of the text. 
Encoding frequent words with a single compact code may be effective for a 
limited set of applications, but it will not be as effective as a more general 
compression scheme (deflate, bzip2, and so on...) which works best 
independently of the language, and without needing (when

Re: Nicest UTF

2004-12-05 Thread Philippe Verdy
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Now consider scanning forwards. We want to strip a beginning of a
string. For example the string is an irc message prefixed with a
command and we want to take the message only for further processing.
We have found the end of the prefix and we want to produce a string
from this position to the end (a copy, since strings are immutable).
None of these is a demonstration: decoding IRC commands or similar things 
does not establish a need to encode large sets of texts. In your examples, you 
show applications that need to handle locally some strings made for computer 
languages.

Texts of human languages, or even a collection of person names, or places 
are not like this, and have a much wider variety, but with huge 
possibilities for data compression (inherent to the phonology of human 
languages and their overall structure, but also due to repetitive 
conventions spread throughout the text to allow easier reading and 
understanding).

Scanning a person name or human text backward is possibly needed locally, 
but such text has a strong forward directionality without which it does not 
make sense. The same applies if you scan such text starting at random 
positions: you could make many false interpretations of the text by extracting 
random fragments like this.

Anyway, if you have a large database of texts to process or even to index, 
you will in the end need to scan this text linearly, first from the beginning 
to the end, if only to create an index for accessing it randomly later. You 
will still need to store the indexed text somewhere, and in order to maximize 
the performance or responsiveness of your application, you'll need to minimize 
its storage: that's where compression takes place. This does not change or 
remove the semantics of the text; it is simply an optimization, which does not 
prevent later access through a more easily parsable representation as 
stateless streams of characters, via surjective (sometimes bijective) 
converters between the compressed and uncompressed forms.

My conclusion: there's no best representation to fit all needs. Each 
representation has its merits in its domain. The Unicode UTFs are excellent 
only for local processing of limited texts, but they are not necessarily the 
best for long term storage or for large text sets.

And even for texts that will be accessed frequently, compressed schemes can 
still constitute optimizations, even if these texts need to be decompressed 
repeatedly each time they are needed. I am clearly against "one scheme fits 
all needs" arguments, even if you think that UTF-32 is the only viable 
long-term solution.




Fw: Nicest UTF

2004-12-05 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED]
Here is a string, expressed as a sequence of bytes in SCSU:
05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E
See how long it takes you to decode this to Unicode code points.  (Do
not refer to UTN #14; that would be cheating. :-)
Without looking at it, it's easy to see that this stream is separated into
three sections, initiated by 05 1C, then 05 1D, then 12. I can't remember
without looking at the UTN what they do (i.e. which Unicode code point ranges
they select), but the other bytes are simple offsets relative to the start of
the selected ranges. Also, the third section is ended by a regular dot (2E) in
the ASCII range selected for the low half-page, and the other bytes are
offsets into the script block initiated by 12.

Immediately I can identify this string, without looking at any table:
?Moscow? is ??????.
where each ? around Moscow is some opening or closing quotation mark, and
where each other ? replaces a character that I can't decipher only through my
defective memory. (I don't need to remember the details of the standard table
of ranges, because I know that this table is complete in a small and easily
available document.)
A computer can do this much better than I can (it can also know much better
than I can what corresponds to a given code point like U+6327, if it is
effectively assigned; I'd have to look into a specification or use a charmap
tool if I'm not used to entering this character in my texts).
The decoder part of SCSU still remains extremely trivial to implement, given
the small but complete list of codes that can alter the state of the decoder,
because there's no choice in their interpretation and because the set of
variables storing the decoder state is very limited, as is the number of
decision tests at each step. This is a basic finite state automaton.
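
To illustrate how small that automaton is, here is a sketch of a decoder that 
handles only the tags occurring in the example above (ASCII pass-through, SQn 
quotes and SCn window selection); the window start values are those recalled 
from UTS #6 and should be checked against it, and a complete decoder must of 
course also handle SDn, SQU, SCU and the Unicode mode:

    // Partial SCSU decoder sketch: ASCII pass-through, SQn (quote one character
    // from a static or dynamic window) and SCn (select dynamic window) only.
    final class MiniScsuDecoder {
        private static final int[] STATIC_START =
            {0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000};
        private static final int[] DYNAMIC_START =      // initial (default) positions
            {0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00};

        static String decode(byte[] scsu) {
            StringBuilder out = new StringBuilder();
            int window = 0;                              // active dynamic window
            for (int i = 0; i < scsu.length; i++) {
                int b = scsu[i] & 0xFF;
                if (b >= 0x01 && b <= 0x08) {            // SQn: quote next byte from window n
                    int n = b - 0x01, q = scsu[++i] & 0xFF;
                    out.appendCodePoint(q < 0x80 ? STATIC_START[n] + q
                                                 : DYNAMIC_START[n] + (q - 0x80));
                } else if (b >= 0x10 && b <= 0x17) {     // SCn: switch to dynamic window n
                    window = b - 0x10;
                } else if (b < 0x80) {                   // plain ASCII (incl. TAB, CR, LF)
                    out.append((char) b);
                } else {                                 // high byte: offset in active window
                    out.appendCodePoint(DYNAMIC_START[window] + (b - 0x80));
                }
            }
            return out.toString();
        }

        public static void main(String[] args) {
            byte[] sample = {0x05, 0x1C, 0x4D, 0x6F, 0x73, 0x63, 0x6F, 0x77, 0x05, 0x1D,
                             0x20, 0x69, 0x73, 0x20, 0x12, (byte) 0x9C, (byte) 0xBE,
                             (byte) 0xC1, (byte) 0xBA, (byte) 0xB2, (byte) 0xB0, 0x2E};
            System.out.println(decode(sample));
        }
    }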

Only the encoder may be a bit complex to write (if one wants to generate the
optimal, smallest result size), but even a moderately experienced programmer
could find a simple and working scheme with a still excellent compression rate
(around 1 to 1.2 bytes per character on average for any Latin text, and around
1.2 to 1.5 bytes per character for Asian texts, which would still be a good
application of SCSU compared to UTF-32 or even UTF-8).




Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

2004-12-06 Thread Philippe Verdy
- Original Message - 
From: Arcane Jill [EMAIL PROTECTED]
Probably a dumb question, but how come nobody's invented UTF-24 yet? I 
just made that up, it's not an official standard, but one could easily 
define UTF-24 as UTF-32 with the most-significant byte (which is always 
zero) removed, hence all characters are stored in exactly three bytes and 
all are treated equally. You could have UTF-24LE and UTF-24BE variants, 
and even UTF-24 BOMs. Of course, I'm not suggesting this is a particularly 
brilliant idea, but I just wonder why no-one's suggested it before.
UTF-24 already exists as an encoding form (it is identical to UTF-32), if you 
just consider that encoding forms only need to be able to represent the valid 
code range within a single code unit.
UTF-32 is not meant to be restricted to 32-bit representations.

However, it's true that UTF-24BE and UTF-24LE could be useful as encoding 
schemes for serialization to byte-oriented streams, suppressing one 
unnecessary byte per code point.
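
Such a serialization is trivial to sketch (assuming the input is a valid 
Unicode scalar value, i.e. at most 0x10FFFF and not a surrogate; the class 
name is mine):

    // Sketch of a hypothetical "UTF-24BE": write a code point as 3 big-endian
    // bytes and read it back. Input is assumed to be a valid scalar value.
    final class Utf24 {
        static void writeBE(int codePoint, byte[] out, int offset) {
            out[offset]     = (byte) (codePoint >>> 16);
            out[offset + 1] = (byte) (codePoint >>> 8);
            out[offset + 2] = (byte)  codePoint;
        }

        static int readBE(byte[] in, int offset) {
            return ((in[offset]     & 0xFF) << 16)
                 | ((in[offset + 1] & 0xFF) << 8)
                 |  (in[offset + 2] & 0xFF);
        }
    }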

(And then of course, there's UTF-21, in which blocks of 21 bits are 
concatenated, so that eight Unicode characters will be stored in every 21 
bytes - and not to mention UTF-20.087462841250343, in which a plain text 
document is simply regarded as one very large integer expressed in radix 
1114112, and whose UTF-20.087462841250343 representation is simply that 
number expressed in binary. But now I'm getting /very/ silly - please 
don't take any of this seriously.)  :-)
I don't think that UTF-21 would be useful as an encoding form, but possibly 
as an encoding scheme where 3 always-zero bits would be stripped, providing a 
tiny compression level, which would only be justified for transmission over 
serial or network links.

However, I do think that such an optimization would have the effect of 
removing the byte alignment on which more powerful compressors rely. If you 
really need more effective compression, use SCSU or apply some deflate or 
bzip2 compression to UTF-8, UTF-16, or UTF-24/32... (there's not much 
difference between compressing UTF-24 or UTF-32 with generic compression 
algorithms like deflate or bzip2).

The UTF-24 thing seems a reasonably sensible question though. Is it just 
that we don't like it because some processors have alignment restrictions 
or something?
There do exist, even today, 4-bit processors and 1-bit processors, where the 
smallest addressable memory unit is smaller than 8 bits. They are used for 
low-cost micro-devices, notably to build automated robots for industry, or 
even for many home/kitchen appliances. I don't know whether they need Unicode 
to represent international text, given that they often have a very limited 
user interface, incapable of inputting or outputting text, but who knows? 
Maybe they are used in some mobile phones, or within smart keyboards or 
tablets or other input devices connected to PCs...

There also exist systems where the smallest addressable memory cell is a 
9-bit byte. This is more of an issue here, because the Unicode standard does 
not specify whether encoding schemes (which serialize code points to bytes) 
should set the 9th bit of each byte to 0, or should fill all 9 bits of memory, 
even if this means that the 8-bit bytes of UTF-8 will not stay aligned with 
the 9-bit memory bytes.

Somebody already introduced UTF-9 in the past for 9-bit systems.
A 36-bit processor could as well address memory in cells of 36 bits, where 
the 4 highest bits would either be used for CRC control bits (generated and 
checked automatically by the processor or a memory bus interface within memory 
regions where this behavior is allowed), or be used to store supplementary 
bits of actual data (in unchecked regions that fit in reliable and fast 
memory, such as the internal memory cache of the CPU, or static CPU 
registers).

For such things, the impact of the transformation of addressable memory 
widths through interfaces is for now not discussed in Unicode, which 
supposes that internal memory is necessarily addressed in a power of 2 and a 
multiple of 8 bits, and then interchanged or stored using this byte unit.

Today we are witnessing the constant expansion of bus widths to allow parallel 
processing instead of multiplying the working frequency (and the energy spent 
and heat produced, which generates other environmental problems), so why would 
the 8-bit byte remain the most efficient universal unit? If you look at IEEE 
floating-point formats, they are often implemented in FPUs working on 80-bit 
units, and an 80-bit memory cell could as well become tomorrow a standard 
(compatible with the increasingly used 64-bit architectures of today) which 
would no longer be a power of 2 (even if it stays a multiple of 8 bits).

On an 80-bit system, the easiest solution for handling UTF-32 without using 
too much space would be a unit of 40 bits (i.e. two code points per 80-bit 
memory cell). But if you consider that only 21 bits are used in Unicode, 

Re: proposals I wrote (and also, didn't write)

2004-12-06 Thread Philippe Verdy
From: E. Keown [EMAIL PROTECTED]
I wrote 3 Hebrew diacritics proposals between
May-July. (...)
1.  Proposal to add Samaritan Pointing to the UCS
http://www.lashonkodesh.org/samarpro.pdf
WG2 number:  N2748
2. Proposal to add Palestinian Pointing to ISO/IEC 10646
http://www.lashonkodesh.org/palpro.pdf
3. Proposal to add Babylonian Pointing to ISO/IEC 10646
http://www.lashonkodesh.org/bavelpro.pdf
(...)
Other Items Supporting the Pointing Proposals Above:
Letter Requesting 'Hebrew Extended' Block (7/2004)
http://www.lashonkodesh.org/roadm08.pdf
The Aramaic and Hebrew Character Sets (June 2004)
http://www.lashonkodesh.org/hprelist.doc
Hello Ellaine,
In all your searches and in your proposals, did you try to segregate the 
proposed additional characters into two separate categories: those needed 
for inclusion within many modern studies, and those only used in very old 
scripts with many unknown or ambiguous properties?

I ask you that because not all the Hebrew Extended characters may need an 
allocation in the BMP (in row U+08xx as suggested), and some may be placed 
in the SMP, in a separate Hebrew-Aramaic-Mandaic Extended block (including 
notably some punctuation signs or old numerals, or other diacritics needed 
for Phoenician and other extinct branches or variants).

Philippe.



Re: Nicest UTF

2004-12-07 Thread Philippe Verdy
From: D. Starner [EMAIL PROTECTED]
If you're talking about a language that hides the structure of strings
and has no problem with variable length data, then it wouldn't matter
what the internal processing of the string looks like. You'd need to
use iterators and discourage the use of arbitrary indexing, but arbitrary
indexing is rarely important.
I fully concur with this point of view. Almost all (if not all) string 
processing can be performed in terms of sequential enumerators, instead of 
through random indexing (which also has the big disadvantage of not working 
well with rich context-dependent processing behaviors, something you can't 
ignore when handling international texts).
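
In Java, for instance, the internal UTF-16 code units never need to be indexed 
directly: sequential enumeration by code point and by user-perceived character 
is already available in the standard library (a sketch; how finely 
BreakIterator groups combining marks depends on the JDK's boundary rules):

    import java.text.BreakIterator;

    // Sketch: process a string only through sequential enumerators, by code point
    // and by "user character" boundary, without indexing code units arbitrarily.
    final class EnumerationDemo {
        public static void main(String[] args) {
            String s = "e\u0301\uD83D\uDE00";   // 'e' + combining acute, then an emoji

            s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));

            BreakIterator it = BreakIterator.getCharacterInstance();
            it.setText(s);
            for (int start = it.first(), end = it.next();
                 end != BreakIterator.DONE;
                 start = end, end = it.next()) {
                System.out.println("cluster: " + s.substring(start, end));
            }
        }
    }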

So the internal storage of a string does not matter for the programming 
interface of parsable string objects. In terms of efficiency and global 
application performance, using compressed encoding schemes is highly 
recommended for large databases of text, because the negative impact of the 
decompression overhead is extremely small compared to the huge benefits you 
get from reducing the load on system resources, on data locality and memory 
caches, on the system memory allocator, on the memory fragmentation level, on 
VM swaps, and on file or database I/O (which will be the only effective 
limitation for large databases).




Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

2004-12-07 Thread Philippe Verdy
From: Kenneth Whistler [EMAIL PROTECTED]
Yes, and pigs could fly, if they had big enough wings.
Once again, this is a creative comment. As if Unicode had to be bound by 
architectural constraints such as the requirement of representing code units 
(which are architectural for a system) only as 16-bit or 32-bit units, 
ignoring the fact that technologies do evolve and will not necessarily keep 
this constraint. 64-bit systems already exist today, and even if they have, 
for now, the architectural capability of efficiently handling 16-bit and 
32-bit code units so that they can be addressed individually, this will 
possibly not be the case in the future.

When I look at encoding forms such as UTF-16 and UTF-32, they just define the 
value ranges in which code units will be valid, but not necessarily their 
size. You are mixing this up with encoding schemes, which are what is needed 
for interoperability, and where other factors such as bit or byte ordering are 
also important in addition to the value range.

I would see nothing wrong if a system were set up so that UTF-32 code units 
are stored in 24-bit or even 64-bit memory cells, as long as they respect and 
fully represent the value range defined by the encoding form, and if the 
system also provides an interface to convert them, via encoding schemes, to 
interoperable streams of 8-bit bytes.

Are you saying that UTF-32 code units need to be able to represent any 32-bit 
value, even if the valid range is limited, for now, to the first 17 planes?
An API on a 64-bit system that says it requires strings to be stored as UTF-32 
would also define how UTF-32 code units are represented. As long as the valid 
range 0 to 0x10FFFF can be represented, this interface will be fine. If this 
system is designed so that two or three code units will be stored in a single 
64-bit memory cell, no violation of the valid range will occur.

More interestingly, there already exist systems where memory is addressable in 
units of 1 bit, and on these systems a UTF-32 code unit will work perfectly if 
code units are stored at steps of 21 bits of memory. On 64-bit systems, the 
possibility of addressing any group of individual bits will become an 
interesting option, notably when handling complex data structures such as 
bitfields, data compressors, bitmaps, ... No more need for costly shifts and 
masking. Nothing would prevent such a system from offering interoperability 
with 8-bit-byte-based systems (note also that recent memory technologies use 
fast serial interfaces instead of parallel buses, so that memory granularity 
is less important).

The only cost of bit-addressing is that it requires 3 more bits of address, 
but with a 64-bit address this cost seems very low, because the globally 
addressable space would still be... more than 2.3*10^18 bytes, much more than 
any computer will manage in a single process for the next century (according 
to Moore's law, which doubles computing capabilities every 3 years). Even such 
a scheme would not limit performance, given that memory caches are paged and 
these caches keep growing, eliminating most of the costs and problems related 
to data alignment experienced today on bus-based systems.

Other territories are also still unexplored in microprocessors, notably the 
possibility of using non-binary numeric systems (think about optical or 
magnetic systems which could outperform the current electronic systems due to 
the reduced power and heat caused by currents of electrons through molecular 
substrates, replacing them by shifts of atomic states caused by light rays, 
and the computing possibilities offered by light diffraction through 
crystals). The lowest granularity of information in some future may be larger 
than a dual-state bit, meaning that today's 8-bit systems would need to be 
emulated using other numerical systems...
(Note for example that to store the range 0..0x10FFFF, you would need 13 
digits in a ternary system, and to store the range of 32-bit integers, you 
would need 21 ternary digits; memory technologies for such systems might use 
byte units made of 6 ternary digits, so programmers would have the choice 
between 3 ternary bytes, i.e. 18 ternary digits, to store our 21-bit code 
units, or 4 ternary bytes, i.e. 24 ternary digits or about 38 binary bits, to 
be able to store the whole 32-bit range.)

Nothing there is impossible for the future (when it will become more and 
more difficult to increase the density of transistors, or to reduce further 
the voltage, or to increase the working frequency, or to avoid the 
inevitable and random presence of natural defects in substrates; escaping 
from the historic binary-only systems may offer interesting opportunities 
for further performance increase).




Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Philippe Verdy
I know what you mean here: most Linux/Unix filesystems (as well as many legacy 
filesystems for Windows and MacOS...) do not track the encoding with which 
filenames were encoded and, depending on the local user preferences when a 
user created a file, filenames on such systems seem to have unpredictable 
encodings.

However, the problem comes, most often, when interchanging data from one 
system to another, through removable or shared volumes.

Needless to say, these systems were badly designed at their origin, and newer 
filesystems (and OS APIs) offer a much better alternative, by either storing 
explicitly on the volume which encoding it uses, or by forcing all 
user-selected encodings to a common kernel encoding such as the Unicode 
encoding schemes (this is what FAT32 and NTFS do for filenames created under 
Windows, since Windows 98 or NT).

I understand that there may exist situations, such as Linux/Unix UFS-like 
filesystems where it will be hard to decide which encoding was used for 
filenames (or simply for the content of plain-text files). For plain-text 
files, which have long-enough data in them, automatic identification of the 
encoding is possible, and used with success in many applications (notably in 
web browsers).

But for filenames, which are generally short, automatic identification is 
often difficult. However, UTF-16 remains easy to identify, most often, due to 
the very unusual frequency of low values at every even or odd byte position. 
UTF-8 is also easy to identify due to its strict rules (without these strict 
rules, which forbid some sequences, automatic identification of the encoding 
becomes very risky).
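
A rough sketch of such an identification heuristic (the class name and the 
threshold are arbitrary illustrations of mine, not values from any standard):

    // Rough heuristic sketch: count zero bytes at even and odd positions. UTF-16
    // text in scripts whose code points have a zero high byte (Latin, etc.) shows
    // a very skewed distribution, unlike UTF-8 or legacy 8-bit encodings.
    final class Utf16Sniffer {
        static boolean looksLikeUtf16(byte[] data) {
            if (data.length < 4 || data.length % 2 != 0) return false;
            int zeroEven = 0, zeroOdd = 0;
            for (int i = 0; i < data.length; i++) {
                if (data[i] == 0) { if (i % 2 == 0) zeroEven++; else zeroOdd++; }
            }
            double half = data.length / 2.0;
            return Math.max(zeroEven, zeroOdd) / half > 0.5;   // arbitrary threshold
        }
    }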

If the encoding cannot be identified precisely and explicitly, I think that 
UTF-16 is much better than UTF-8 (and it also offers a better compromise in 
total size for names in any modern language). However, it's true that UTF-16 
cannot be used on Linux/Unix due to the presence of null bytes. The 
alternative is then UTF-8, but it is often larger than legacy encodings.

An alternative can then be a mixed encoding selection (a sketch follows the 
steps below).
When creating filenames:
- choose a legacy encoding that will most often be able to represent valid 
filenames without loss of information (for example ISO-8859-1, or Cp1252);
- encode the filename with it;
- try to decode the result with a *strict* UTF-8 decoder, as if it were UTF-8 
encoded;
- if there's no failure, then you must re-encode the filename with UTF-8 
instead, even if the result is longer;
- if the strict UTF-8 decoding fails, you can keep the filename in the first 
8-bit encoding.
When parsing filenames:
- try decoding them with *strict* UTF-8 rules; if this does not fail, then the 
filename was effectively encoded with UTF-8;
- if the decoding failed, decode the filename with the legacy 8-bit 
encoding.
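
A sketch of the detection step using the JDK's strict CharsetDecoder 
(ISO-8859-1 stands in here for whatever legacy charset the system actually 
uses; the class name is mine):

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    // Sketch of the mixed scheme described above: a byte sequence that decodes
    // under *strict* UTF-8 rules is treated as UTF-8, anything else is decoded
    // with the legacy charset (ISO-8859-1 here, as a stand-in).
    final class FilenameDecoder {
        static boolean isStrictUtf8(byte[] name) {
            try {
                StandardCharsets.UTF_8.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(name));
                return true;
            } catch (CharacterCodingException e) {
                return false;
            }
        }

        static String decodeFilename(byte[] name) {
            return new String(name, isStrictUtf8(name)
                    ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1);
        }
    }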

But even with this scheme, you will find interoperability problems because 
some applications will only expect the legacy encoding, or only the UTF-8 
encoding, without deciding...




Re: Re: Word dividers, was: proposals I wrote (and also, didn't write)

2004-12-08 Thread Philippe VERDY
 De : Michael Everson 
   But there is already in the pipeline a PHOENICIAN WORD SEPARATOR 
 [...] The glyphs for
   all of these seem indistinguishable, and so are the functions. The only
   difference seems to be the scripts they are associated with, but
   punctuation marks are supposed to be not tied to individual scripts.
 
 Read the proposal. It is not always a dot.
 
 John said:
 
 We already have gobs of dots. It's one of those things: on the 
 other hand, Unicode unifies all the Indic dandas, for example.
 
 Not for long, one hopes. And other Brahmic dandas are not unified.

Why would there be too many dots in Unicode? Unicode does not encode glyphs, 
but abstract characters, nearly independently of their glyph. The need to encode 
them is justified by distinct semantics, distinct layout rules, and the need to 
make each encoded script coherent with itself, with appropriate character 
properties not wildly and abusively borrowed from other scripts that have their 
own rules...

This is true with the exception of Latin/Greek/Cyrillic or Hiragana/Katakana, 
which have so many interactions that they share the same set of diacritics (for 
now these are in a block considered generic, but in fact I really think that this 
genericity should not be abused, and that Unicode could possibly define more 
precisely to which script family they apply; I see, for example, little interest 
in considering the COMBINING DOT ABOVE useful for anything other than 
Greek/Cyrillic/Latin (and possibly a few other historic scripts), and if another 
script needs a combining dot above, it should be encoded separately for that 
script, with its own name and its own properties).

There are probably lots of missing properties for combining characters, notably 
layout interaction properties that are not accurately represented by combining 
classes (which accurately define the canonical equivalences, but not the 
significant equivalences). For me it's part of Unicode's job to document and 
standardize them. The same goes for Hangul jamos (notably the historic ones, but 
also the SSANG- letters), which should have additional normative properties 
related to their actual composition and layout.




Re: IUC27 Unicode, Cultural Diversity, and Multilingual Computing / Africa is forgotten once again.

2004-12-08 Thread Philippe Verdy



Probably the first thing to do for Africa is to 
extend the localization of software with content that can ALREADY be 
produced with the existing encoded scripts. But even there, software companies are 
not progressing much, even though this poses no technical problem with the 
existing Unicode repertoire (for example: Xholof, Yoruba, Kenyarwanda, ... and 
even Arabic, or the Latin-based transliterations of these languages already in 
use).

If only such localization efforts were made, there 
would be business opportunities in Africa to support other native scripts as 
well. When you see that even the famous libraries in rich countries can't 
support the cost of maintaining their databases or conserving so many books and 
works of art, imagine what African countries can do when there's not even a version of 
Windows or Linux supporting these languages for the common user interface needed 
by everyone at the first basic stages of literacy and computer 
knowledge.

Thankfully, Microsoft has now opened its system to 
African languages (it was long awaited). I won't blame the richest man on 
earth for giving money to support literacy and the development of culture in Africa, 
as a fundamental step toward the economic development of these areas, but also as a 
way to fight the ignorance which has caused so much damage in Africa (in 
terms of security with wars and abuses against children, in terms of freedom with 
the condition of women, or in terms of health with the tragic pandemics of 
AIDS, tuberculosis...).

I really think that the conditions for the 
development of Africa will come from educating Africa with tools and methods 
made for and by African users. But instead of only selling arms, giving 
military assistance, or giving food, we in rich countries should be able to 
promote donations to support education with now very cheap technologies, and 
donations to cheap cultural programs such as the localization of 
software.

There's no gain, for now, in trying to sell costly 
solutions in Africa and overprotecting them (even if this means that we 
should tolerate software piracy in Africa, in order to let its population get 
their basic right to knowledge). Whether these countries choose Windows or 
Linux does not matter (I think that even promoting Linux usage in Africa would 
expand the market for proprietary software like Windows or Unix distributions; 
Africa is not Asia, and the conditions for a parallel development are still not 
there).

So let's think about really getting out of our rich-country 
ghettos, and make some effort to organize technological events and 
meetings in places which are less costly for African communities. Some 
places are favorable, without major conflicts or security risks, with 
reasonable equipment and comfortable accessibility by airlines (Morocco, 
Tunisia, Egypt, South Africa), but also in the Middle East (Arab Emirates, 
Oman?); it's probably too difficult to organize something for now in the 
currently insecure Western Africa despite its cultural interest (however, 
West African communities are very present in Europe).

But more than temporary events, there's a need for 
a more permanent working group in this area. Why not seek collaboration with 
the newly created AfriNIC, with its permanent bureaux in South Africa, Egypt and 
Mauritius?


  - Original Message - 
  From: Azzedine Ait Khelifa
  To: [EMAIL PROTECTED]
  Sent: Wednesday, December 08, 2004 11:08 PM
  Subject: IUC27 Unicode, Cultural Diversity, and Multilingual Computing / Africa is forgotten once again.

  Hello All,
  The subject of this conference is really interesting and very useful. But once
  again Africa is forgotten.
  I want to know if we can have the same conference, "Africa Oriented",
  scheduled. If not, what should we do to have this conference scheduled in a
  city accessible for the African community (like Paris)? Thank you all.
  AAK
  
  


Re: Nicest UTF

2004-12-09 Thread Philippe Verdy
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Ok, so it's the conversion from raw text to escaped character
references which should treat combining characters specially.
What about < with combining acute, which doesn't have a precomposed
form? A broken opening tag or a valid text character?
Also a broken opening tag for HTML/XML documents (which are NOT plain-text 
documents, and must first be parsed as HTML/XML before the many text sections 
contained in text elements, element names, attribute names, attribute values, 
etc. are parsed as plain text under the restrictions specified in the HTML or 
XML specifications, which contain restrictions for example on which characters 
are allowed in names).

The XML/HTML core syntax is defined with fixed behavior for some individual 
characters like '<', '>' and quotation marks, and with special behavior for 
spaces. This core structure is not plain text, and cannot be overridden, even 
by Unicode grapheme clusters.

Note that HTML/XML do NOT mandate the use or even the support of Unicode, 
just the support of a character repertoire that contains some required 
characters, and the acceptance of at least the ISO/IEC 10646 repertoire under 
some conditions. However, the mapping to code points itself is not required 
for anything other than numeric character references, which are more symbolic, 
in a way similar to other named character entities in SGML, than absolute in 
the sense of implying the required support of the repertoire with a single 
code!

So you can just as well create fully conforming HTML or XML documents using a 
character set which includes characters not even defined in Unicode/ISO/IEC 
10646, or characters defined only symbolically with just a name. Whether this 
name maps or not to one or more Unicode characters does not change the 
validity of the document itself.

And all the XML/HTML behavior ignores almost all Unicode properties 
(including normalization properties, because XML and HTML treat different 
strings, which are still canonically equivalent, as completely distinct; an 
important feature for cases like XML Signatures, where normalization of 
documents should not be applied blindly as it would break the data 
signature).

If you want to normalize XML documents, you should not do it with a 
normalizer working on the whole document as if it were plain text. Instead, 
you must normalize the individual strings that are in the XML InfoSet, as 
accessible when browsing the nodes of its DOM tree, and then you can serialize 
the normalized tree to create a new document (using CDATA sections and/or 
character references, if needed, to escape syntactic characters reserved by 
XML that would be present in the string data of DOM tree nodes).
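
A sketch of that node-by-node approach with the standard org.w3c.dom and 
java.text.Normalizer APIs, normalizing only the character data carried by the 
tree and never the serialized markup (and only for documents where 
normalization is actually wanted, e.g. not signed ones; the class name is 
mine):

    import java.text.Normalizer;
    import org.w3c.dom.Attr;
    import org.w3c.dom.NamedNodeMap;
    import org.w3c.dom.Node;

    // Sketch: normalize the character data of a DOM tree (text/CDATA nodes and
    // attribute values) to NFC, leaving the markup structure untouched.
    final class DomNormalizer {
        static void toNfc(Node node) {
            short type = node.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                node.setNodeValue(Normalizer.normalize(node.getNodeValue(),
                                                       Normalizer.Form.NFC));
            }
            NamedNodeMap attrs = node.getAttributes();
            if (attrs != null) {
                for (int i = 0; i < attrs.getLength(); i++) {
                    Attr a = (Attr) attrs.item(i);
                    a.setValue(Normalizer.normalize(a.getValue(), Normalizer.Form.NFC));
                }
            }
            for (Node child = node.getFirstChild(); child != null;
                 child = child.getNextSibling()) {
                toNfc(child);
            }
        }
    }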

Note also that an XML document containing references to Unicode 
non-characters would still be well-formed, because these characters may be 
part of a non-Unicode charset.

XML document validation is a separate and optional step from XML parsing, 
which checks well-formedness and builds a DOM tree: validation is only 
performed when matching the DOM tree against a schema definition, DTD or XSD, 
in which additional restrictions on allowed characters may be checked, or in 
which additional symbolic-only characters may be defined and used in the XML 
document with parsable named entities similar to &gt;.

(An example: the schema may contain a definition for a character 
representing a private company logo, mapped to a symbolic name; the XML 
document can contain such references, but the DTD may also define an 
encoding for it in a private charset, so that the XML document will directly 
use that code; the Apple logo in Macintosh charsets is an example, for which 
an internal mapping to Unicode PUAs is not sufficient to allow correct 
processing of multiple XML documents, where PUAs used in each XML documents 
have no equivalence; the conversion of such documents to Unicode with these 
PUAs is a lossy conversion, not suitable for XML data processing).




Re: Nicest UTF

2004-12-09 Thread Philippe Verdy
From: D. Starner [EMAIL PROTECTED]
Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] writes:
If it's a broken character reference, then what about A&#769; (769 is
the code for combining acute if I'm not mistaken)?
Please start adding spaces to your entity references or
something, because those of us reading this through a web interface
are getting very confused.
No confusion is possible if you use any classic mail reader.
Blame your ISP (and other ISPs as well, like AOL, that don't respect the 
interoperable standards for plain-text emails) for its poor webmail interface, 
which does not properly escape the characters used in the plain-text emails 
you receive (and which do NOT contain any HTML entities), but inserts them 
blindly within the HTML page it creates in its webmail interface.

Not only is such a webmail interface bogus, it is also dangerous, as it allows 
arbitrary HTML code to run from plain-text emails. Ask for support and press 
your ISP to correct its server-side scripts so that it will correctly support 
plain-text emails!




Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-09 Thread Philippe Verdy
From: Antoine Leca [EMAIL PROTECTED]
Err, not really. MS-DOS *need to know* the encoding to use, a bit like a
*nix application that displays filenames need to know the encoding to use
the correct set of glyphs (but constrainst are much more heavy.) Also
Windows NT Unicode applications know it, because it can't be changed :-).
But when it comes to other Windows applications (still the more common) 
that
happen to operate in 'Ansi' mode, they are subject to the hazard of 
codepage
translations. Even if Windows 'knows' the encoding used for the filesystem
(as when it uses NTFS or Joliet, or VFAT on NT kernels; in the other cases
it does not even know it, much like with *nix kernels), the only usable 
set
is the _intersection_ of the set used to write and the set used to read;
that is, usually, it is restricted to US ASCII, very much like the usable
set in *nix cases...
True, but this applies to FAT-only filesystems, which happen to store 
filenames with an OEM charset that is not stored explicitly on the volume. 
This is a known caveat even for Unix, when you look at the tricky details of 
the support of Windows file sharing through Samba, when the client requests 
a file with a short 8.3 name, which a partition used by Windows is supposed 
to support.

In fact, this nightmare comes from Windows' support for compatibility with 
legacy DOS applications which don't know the details and don't use the Win32 
APIs with Unicode support. Note that DOS applications use an OEM charset which 
is part of the user settings, not part of the system settings (see the effects 
of the CHCP command in a DOS command prompt).

FAT32 and NTFS help reconcile these incompatible charsets because these 
filesystems also store an LFN (Long File Name) for the same files (in that 
case the short name, encoded in some ambiguous OEM charset, is just an 
alias, acting exactly like a hard link on Unix created in the same directory 
and referencing the same file). LFN names are UTF-16 encoded and support 
mostly the same names as on NTFS volumes.

However, on FAT32 volumes, the short names are mandatory, unlike on NTFS 
volumes where they can be created on the fly by the filesystem driver, 
according to the current user settings for the selected OEM charset, without 
storing them explicitly on the volume. Windows contains, in CHKDSK, a way to 
verify that short names of FAT32 filesystems are properly encoded with a 
coherent OEM charset, using the UTF-16 encoded LFN names as a reference. If 
needed, corrections for the OEM charset can be applied...

This nightmare of incompatible OEM charsets does happen on Windows 98/98SE/ME, 
when the autoexec.bat file that defines the current user profile does not 
execute the proper CHCP command as it should, or when this autoexec.bat 
file has been modified or erased: in that case, the default OEM charset 
(codepage 437) is used, and short filenames are incorrectly encoded.

Another complexity is that Win32 applications which use a fixed (not 
user-settable) ANSI charset and don't use the Unicode API depend on 
the conversion from the ANSI charset to the current OEM charset. But if a 
file is handled through directory shares via multiple hosts that have 
distinct ANSI charsets (e.g. Windows hosts running different localizations of 
Windows, such as a US installation and a French version on the same LAN), 
the charsets viewed by these hosts will create incompatible encodings on the 
same shared volume.

So the only stable subset for short names, one that is not affected by OS 
localization or user settings, is the intersection of all possible ANSI and 
OEM charsets that can be set in all versions of Windows! Needless to say, 
this designates only the printable ASCII charset for short 8.3 names. Long 
filenames are not affected by this problem.

Conclusion: to use international characters outside of ASCII in filenames used 
by Windows, make sure that the name is not in an 8.3 short format, so 
that a long filename, in UTF-16, will be created on FAT32 filesystems or on 
SMBFS shares (Samba on Unix/Linux, Windows servers)... Or use NTFS (but then 
resolve the interoperability problems with Linux/Unix client hosts, which 
can't, for now, reliably access these filesystems, and which are not 
completely emulated by Unix filesystems used by Samba, due to the limitations 
of the LanMan sharing protocol, and the limitations of Unix filesystems as 
well, which rarely use UTF-8 as their preferred encoding...)




Re: Software support costs (was: Nicest UTF

2004-12-10 Thread Philippe Verdy
From: Carl W. Brown [EMAIL PROTECTED]
Philippe,
Also a broken opening tag for HTML/XML documents
In addition to not having endian problems UTF-8 is also useful when 
tracing
intersystem communications data because XML and other tags are usually in
the ASCII subset of UTF-8 and stand out making it easier to find the
specific data you are looking for.
If you are working on XML documents without parsing them first, at least at 
the DOM level (I don't mean after validation), then any generic string 
handling will likely fail, because you may break the well-formedness of 
the document.

Note, however, that you are not required to split the document into many 
string objects: you could as well create a DOM tree with nodes referencing 
pairs of offsets into the source document, if you did not also have to convert 
the numeric character references.

If you do not do so, you'll need to create subnodes within text elements, i.e. 
work at a level below the normal leaf level in DOM. But anyway, this is 
what you need to do when there are references to named entities that break 
the text level; and for simplicity, you would still need to parse CDATA 
sections to recreate single nodes that may be split by CDATA end/start 
markers inserted in a text stream that contains the ]]> sequence of three 
characters.

Clearly, the normative syntax of XML comes before any other interpretation of 
the data in individual parsed nodes as plain text. So in this case, you'll 
need to create new string instances to store the parsed XML nodes in the DOM 
tree. Under this consideration, the encoding of the XML document itself plays 
a very small role, and as you'll need to create a separate copy for the parsed 
text, the encoding you choose for the parsed nodes with which you build a DOM 
tree can become independent of the encoding actually used in the source XML 
data, notably because XML allows many distinct encodings in multiple documents 
that have cross-references.

This means that implementing a conversion of the source encoding to the 
working encoding for DOM tree nodes cannot be avoided, unless you are 
limiting your parser to handle only some classes of XML documents (remember 
that XML uses UTF-8 as the default encoding, so you can't ignore it in any 
XML parser, even if you later decide to handle the parsed node data with 
UTF-16 or UTF-32).

Then a good question is which preferred central encoding you'll use for the 
parsed nodes: this depends on the parser API you use. If this API is written 
for C with byte-oriented null-terminated strings, UTF-8 will be the best 
representation (you may choose GB18030). If this API uses a wide-char C 
interface, UTF-16 or UTF-32 will most often be the only easy solution. In 
both cases, because the XML document may contain nodes with null bytes 
(represented by numeric character references like &#0;), your API will need 
to return an actual string length.

Then what your application will do with the parsed nodes (i.e. whether it 
will build a DOM tree, or use nodes on the fly to create another 
document) is the application's choice. If a DOM tree is built, an important 
factor will be the size of the XML documents that you can represent and work 
with in memory for the global DOM tree nodes. Whether these nodes, built by 
the application, are left in UTF-8 or UTF-16 or UTF-32, or stored in a 
more compact representation like SCSU, is an application design choice.

If XML documents are very large, the size of the DOM tree will also become 
very large, and if your application then needs to perform complex 
transformations on the DOM tree, the constant need to navigate the tree 
will mean frequent random accesses to the tree nodes. If 
the whole tree does not fit well in memory, this may put a lot of pressure on 
the system memory manager, meaning many swaps to disk. Compressing nodes will 
help reduce the I/O overhead and improve data locality, meaning 
that the overhead of decompression will become much lower than the 
gain in performance obtained by reduced system resource usage.

However, within the program itself UTF-8 presents a problem when looking 
for
specific data in memory buffers.  It is nasty, time consuming and error
prone.  Mapping UTF-16 to code points is a snap as long as you do not have 
a
lot of surrogates.  If you do then probably UTF-32 should be considered.
This is not demonstrated by experience. Parsing UTF-8 or UTF-16 is not 
complex, even in the case of random accesses to the text data, because you 
always have a small, bounded limit on the number of steps needed to find 
the starting offset of a fully encoded code point: for UTF-16, this means 
at most 1 range test and 1 possible backward step; for UTF-8, this limit for 
random accesses is at most 3 range tests and 3 possible backward steps. 
UTF-8 and UTF-16 very easily support backward and forward 
enumerators; so what else do you need to perform any string algorithm?
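
A sketch of that backward resynchronization (given an arbitrary offset that 
may fall inside a multi-unit sequence, step back to the start of the enclosing 
code point; the class name is mine):

    // Sketch: UTF-8 needs at most 3 backward steps (skip continuation bytes);
    // UTF-16 needs at most 1 (skip a trailing surrogate).
    final class Resync {
        static int utf8Start(byte[] s, int i) {
            while (i > 0 && (s[i] & 0xC0) == 0x80) i--;      // 10xxxxxx = continuation
            return i;
        }

        static int utf16Start(char[] s, int i) {
            if (i > 0 && Character.isLowSurrogate(s[i])) i--;  // trailing surrogate
            return i;
        }
    }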

Re: Nicest UTF

2004-12-10 Thread Philippe Verdy
From: Philippe Verdy [EMAIL PROTECTED]
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Philippe Verdy [EMAIL PROTECTED] writes:
The XML/HTML core syntax is defined with fixed behavior of some
individual characters like '<', '>', quotation marks, and with special
behavior for spaces.
The point is: what characters mean in this sentence. Code points?
Combining character sequences? Something else?
See the XML character model document... XML ignores combining sequences. 
But for Unicode and for XML, a character is an abstract character with a 
single code allocated in a *finite* repertoire. The repertoire of all 
possible combining character sequences is already infinite in Unicode, as is 
the number of default grapheme clusters they can represent.
Note that there are several differently relaxed definitions of what 
constitutes a character for XML.
If you look at XML 1.0 Second Edition, it specifies that the document is text 
(defined only as a sequence of characters, which may represent markup or 
character data) and will only contain characters in this set:
Char   ::=   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | 
[#x10000-#x10FFFF]

But the comment following it specifies:
any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
which is considerably weaker (because it would include ALL basic controls in 
the range #x0 to #x1F, and not only TAB, LF, CR); the restrictive definition 
of Char above also includes the whole range of C1 controls (#x80..#x9F), 
so I can't understand why the Char definition is so restrictive on controls; 
in addition, the definition of Char also *includes* many non-characters (it 
only excludes surrogates, and U+FFFE and U+FFFF, but forgets to exclude 
U+1FFFE and U+1FFFF, U+2FFFE and U+2FFFF, ..., U+10FFFE and U+10FFFF).

So XML does allow Unicode/ISO/IEC 10646 non-characters... but not all of them. 
Apparently many XML parsers simply ignore the restrictions of the Char 
production above, notably in CDATA sections.
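
A direct transcription of that XML 1.0 Char production into a code point check 
(a sketch; the class name is mine):

    // Sketch: well-formedness test of a single code point against the XML 1.0
    // (Second Edition) Char production quoted above.
    final class Xml10Char {
        static boolean isChar(int cp) {
            return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20    && cp <= 0xD7FF)
                || (cp >= 0xE000  && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
        }
    }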

The alternative is then to use numeric character references, as defined by 
this even weaker production (in 4.1, Character and Entity References):

CharRef ::= '&#' [0-9]+ ';'
| '&#x' [0-9a-fA-F]+ ';'
but with this definition:
A character reference refers to a specific character in the ISO/IEC 10646 
character set, for example one not directly accessible from available input 
devices.

Which is exactly the purpose of encoding something like &#1; to encode a 
SOH character U+0001 (which after all is a valid Unicode/ISO/IEC 10646 
character), or even a NUL character.

The CharRef production however is annotated by a Well-Formedness 
Constraint, Legal Character:
Characters referred to using character references must match the production 
for Char.

Note however that nearly all XML parsers don't seem to honor this constraint 
(like SGML parsers...)!

This was later amended in an erratum for XML 1.0, which now says that the list 
of code points whose use is *discouraged* (but explicitly *not* forbidden) 
for the Char production is:
[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].
This clause is not really normative, but just adds to the confusion...

Then comes XML 1.1, which extends the restrictive Char production:
Char   ::=   [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
with the same comment, any Unicode character, excluding the surrogate blocks, 
FFFE, and FFFF. So in XML 1.0 the comment was accurate, not the formal 
production... In XML 1.1, all C0 and C1 controls (except NUL) are now allowed, 
but the use of some of them is restricted in some cases:

RestrictedChar   ::=   [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | 
[#x86-#x9F]

What is even worse is that XML 1.1 now re-allows NUL for system identifiers 
and URIs, through escaping mechanisms. Clearly, the XML specification is 
inconsistent there, and this would explain why most XML parsers are more 
permissive than what is given in the Char production of the XML 
specification, and why they simply refer to the definition of valid 
code points for Unicode and ISO/IEC 10646, excluding only surrogate code 
points (a valid code point can be a non-character, and can also be 
NUL...): the XML parser will accept those code points, but will leave the 
validity check to the application using the parsed XML data, or will offer 
some tuning options to enable this Char filter (which depends on the XML 
version...).

See also the various errata for XML 1.1 related to RestrictedChar, 
or to the list of characters whose use is discouraged (meaning explicitly 
not forbidden, so allowed...):

[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5

Re: Please RSVP... (was: US-ASCII)

2004-12-10 Thread Philippe Verdy
From: Kenneth Whistler [EMAIL PROTECTED]
That it has been morphological reanalyzed is demonstrated by the
fact that it takes regular English verb endings, as in:
 I RSVPed yesterday, right after I got the email.
As I said, it is now a bona fide English verb, and most
English speakers will treat it as such.
Didn't know that. Is this a very recent use?
In France, I think that RSVP was introduced and widely used at the end of 
telegraphic messages (which contained lots of conventional acronyms); it 
survived into the era of telex, and it is now revived in SMS messages on 
cellular phones, but it is rarely used in emails.

Maybe this was introduced into English in the old days of the telegraph as a 
useful abbreviation, but with a different meaning when it is used as a verb 
meaning "reply as requested"?




Re: Nicest UTF

2004-12-11 Thread Philippe Verdy
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Regarding A, I see three choices:
1. A string is a sequence of code points.
2. A string is a sequence of combining character sequences.
3. A string is a sequence of code points, but it's encouraged
  to process it in groups of combining character sequences.
I'm afraid that anything other than a mixture of 1 and 3 is too
complicated to be widely used. Almost everybody is representing
strings either as code points, or as even lower-level units like
UTF-16 units. And while 2 is nice from the user's point of view,
it's a nightmare from the programmer's point of view:
Consider that the normalized forms are trying to approach choice number 
2, to create more predictable combining character sequences which can still 
be processed by algorithms as just streams of code points.
Remember that the total number of possible code points is finite, but the 
total number of possible combining sequences is not, meaning that text handling 
will necessarily have to make decisions based on a limited set of 
properties.

Note however that for most Unicode strings, the properties of the composite 
character are those of the base character of the sequence. Note also that 
for some languages/scripts, the linguistically correct unit of work is the 
grapheme cluster; Unicode just defines "default grapheme clusters", which 
can span several combining sequences (see for example the Hangul script, 
written with clusters made of multiple combining sequences, where the base 
character is a Unicode jamo, itself sometimes made of multiple simpler jamos 
that Unicode does not allow to be decomposed as canonically equivalent strings, 
even though this decomposition is inherent to the structure of the script 
itself, and is not bound to the language, which Unicode will not 
standardize).
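As a small illustration of the three levels of processing units discussed in 
the quoted choices (a sketch in Java; java.text.BreakIterator only 
approximates the default grapheme cluster boundaries of UAX #29):

import java.text.BreakIterator;

public class ProcessingUnits {
    public static void main(String[] args) {
        // "e" + COMBINING ACUTE ACCENT, followed by U+1D50A (a supplementary code point).
        String s = "e\u0301\uD835\uDD0A";

        System.out.println("UTF-16 code units: " + s.length());             // 4
        System.out.println("code points:       " + s.codePoints().count()); // 3

        // Iterate by grapheme clusters (which here coincide with combining sequences).
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int clusters = 0;
        while (it.next() != BreakIterator.DONE) clusters++;
        System.out.println("grapheme clusters: " + clusters);               // 2
    }
}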

It's hard to create a general model that will work for all scripts encoded 
in Unicode. There are too many differences. So Unicode just standardizes 
a higher level of processing, with combining sequences and normalization 
forms that better approach the linguistics and semantics of the scripts. 
Consider this level as an intermediate tool that helps simplify the 
identification of processing units.

The reality is that a written language is actually more complex than what 
can be captured by a single definition of processing units. For many other 
similar reasons, the "ideal" working model is one with simple and 
enumerable abstract characters, with a finite number of code points, from 
which actual, non-enumerable characters can be composed.

But the situation is not ideal for some scripts, notably ideographic ones, 
due to their very complex and often inconsistent composition rules or 
layout, which require allocating many code points, one for each 
combination. Working with ideographic scripts requires many more character 
properties than other scripts (see for example the huge and varied 
properties defined in Unihan, which are still not standardized due to the 
difficulty of representing them and the slow discovery of errors, omissions, 
and contradictions in the various sources for this data...)




Re: Roundtripping in Unicode

2004-12-11 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED]
Lars Kristan wrote:
I am sure one of the standardizers will find a Unicodally
correct way of putting it.
I can't even understand that paragraph, let alone paraphrase it.
My understanding of his question, and my response to his problem, is that you 
MUST NOT use VALID Unicode code points to represent INVALID byte sequences 
found in some text with an alleged UTF encoding.

The only way is to use INVALID code points, outside the Unicode space, and 
then design an encoding scheme that contains and extends the Unicode UTF, 
and make sure that there will be no possible interaction between such 
encoded binary data and encoded plain text (so the conversion between the 
encoding scheme of the byte stream and the encoding form with code units or 
code points in memory must be fully bijective; this is hard to design if you 
also have to support multiple UTF encoding schemes, because the invalid byte 
sequences of these UTF schemes are not the same, and must then be 
represented with distinct invalid code points or code units for each external 
UTF!)

I won't support the idea of reserving some valid code point in the Unicode 
space to store something which is already considered invalid 
character data, notably because the Unicode standard is evolving, and such a 
private encoding form which works now could become incompatible with a 
later version of the Unicode standard, or with a later standardized Unicode 
encoding scheme, meaning that interoperability would be lost...

The only thing for which you have a guarantee that Unicode will not assign a 
mandatory behavior is the code point space after U+10FFFF (I'm not sure about 
the permanent invalidity of some code unit ranges in the UTF-8 and UTF-16 
encoding forms; also, I'm not sure that there will be enough free space in 
later standard encoding forms or schemes, see for example SCSU or BOCU-1, or 
other already-used private encoding forms like the "modified UTF-8" 
encoding scheme defined by Sun in Java).




Re: Please RSVP... (was: US-ASCII)

2004-12-11 Thread Philippe Verdy
From: Séamas Ó Brógáin [EMAIL PROTECTED]
John wrote:
As far as I know, they were first used in formal invitations (to 
weddings,
funerals, dances, etc.) in the corner of the card, as both shorter and
more fancy than the older phrase The favor of your reply is requested.
This is correct. The practice dates from the end of the nineteenth 
century.
At that time, transmission of text over long distances was done with telegraphic 
systems, where texts had to be short because they were expensive and 
because the available bandwidth was very limited and had to support many 
customers, notably for long-distance and international communications.

I would not be surprised if this acronym was defined in some internationally 
accepted set of abbreviations used by telegraphists, so that their clients 
became exposed to these acronyms when reading telegrams received from their 
local post office, which did not take the time to reconvert these acronyms to 
full words...

I have read some articles about the existence in telegraphic standards of 
such lists of abbreviations. Isn't there a remaining, possibly deprecated, 
ISO standard about them? (For example, there was the 5-bit system, 
because it was important to limit the available charset and to limit the 
bandwidth required to transmit messages, at a time when research on data 
compression was not as advanced and successful as today, and the computing 
resources or human capabilities needed to decode complex compression schemes 
would have been too expensive or impossible to satisfy on a large scale.)




Re: infinite combinations, was Re: Nicest UTF

2004-12-11 Thread Philippe Verdy
From: Peter R. Mueller-Roemer [EMAIL PROTECTED]
For a fixed length of combining character sequence (base + 3 combining 
marks is the most I have seen graphically distinguishable) the repertore 
is still finite.
I do think that you are underestimating the repertoire. Also, Unicode does 
NOT define an upper bound on the length of combining sequences, nor 
on the length of default grapheme clusters (which can be composed of 
multiple combining sequences, for example in the Hangul or Tibetan scripts). 
Your estimation also ignores various layouts found in Asian texts, and the 
particular structure of historic texts, which can use many diacritics on 
top of a single base letter starting a combining sequence. The model of 
these scripts (for example Hebrew) implies the juxtaposition of up to 13 or 15 
levels of diacritics on the same base letter!

In practice, it's impossible to enumerate all existing combinations (and to 
ensure that each will be assigned a unique code within a reasonably limited 
code point space), and that's why a simpler model based on more basic but 
combinable code points is used in Unicode: it frees Unicode from having to 
encode all of them (this is already a difficult task for the Han script, 
which could have been encoded with combining sequences if the algorithms 
needed to create the necessary layout had not required the use of so many 
complex rules and so many exceptions...)




Re: Please RSVP... (was: US-ASCII)

2004-12-11 Thread Philippe Verdy
From: Michael Everson [EMAIL PROTECTED]
Nonsense. You might as well try to explain SPQR on the same basis.
I won't. I know that SPQR was used on architectural constructions as a 
symbol of the Roman Empire, and it was a well-known acronym of a Latin 
expression.

It largely predates the invention of the telegraph. My only comment was 
related to the date of origin of the acronym. It's a coincidental, maybe 
accidental, analysis.

And it ignores the fact that RSVP was printed on posted invitation cards; 
such invitations were not, as a rule, sent by telegraph.
And another site gives other historical context for this expression: the 
etiquette of the French court of King Louis XIV in the 17th century, and the 
use of French etiquette throughout Europe and in the United States up to 
the 19th century:
http://people.howstuffworks.com/question450.htm
So the etiquette phrase would have continued to be used, as a well-known 
acronym and a convenience, when telegrams were invented.

I just discovered, after some searching, an old notice of the French Poste 
with acronyms and abbreviations to be used preferably by telegraphists... 
RSVP is present in that list, among other abbreviations used to encode the 
routing and delivery options of the telegram itself. Probably an interesting 
example of the first communication protocol standards, intended to limit false 
interpretations.




Re: Roundtripping in Unicode

2004-12-11 Thread Philippe Verdy
My view about this problem of roundtripping is 
that if data supposed to contain only valid UTF-8 sequences contains some 
invalid byte sequences that still need to be roundtripped to some code 
point for internal management, so that they can be converted back later to 
the original invalid byte sequence, then these invalid bytes MUST NOT be 
converted to valid code points.

An implementation based on an internal UTF-32 code unit representation could 
use, privately only, the range which is NOT assigned to valid Unicode 
code points; such an application would need to convert these bytes into code 
points higher than 0x10FFFF, but the same application will no longer be 
conforming to strict UTF-32 requirements: the application will represent, 
this way, binary data which is NOT bound to Unicode rules and which can't be 
valid plain text.
For example, {0xFFFFFF+n}, where n is the byte value to encapsulate. Don't 
call it UTF-32, because it MUST remain for private use only!
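A minimal sketch of such a private escape (hypothetical names; the base value 
is simply the example used above), showing that the original byte round-trips 
exactly:

// Private escape for invalid input bytes, using values above U+10FFFF.
// Not UTF-32; for internal use only, as stressed above.
final class InvalidByteEscape {
    static final int BASE = 0xFFFFFF;   // every escape value lies outside the Unicode space

    // Called by a lenient decoder for each byte it cannot accept as UTF-8.
    static int escape(int invalidByte) {
        return BASE + (invalidByte & 0xFF);
    }

    static boolean isEscape(int value) {
        return value >= BASE && value <= BASE + 0xFF;
    }

    // Called by the matching encoder to re-emit the original byte unchanged.
    static int unescape(int value) {
        if (!isEscape(value)) throw new IllegalArgumentException("not an escape: " + value);
        return value - BASE;
    }

    public static void main(String[] args) {
        int v = escape(0xC0);                                   // 0xC0 can never appear in valid UTF-8
        System.out.println(isEscape(v));                        // true
        System.out.println(Integer.toHexString(unescape(v)));   // c0: the byte round-trips
    }
}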

This will be more complex if the application uses UTF-16 code units, because 
there are only TWO code units that can be used to recognize such 
invalid-text data within a text stream. It is possible to do that, but with 
MUCH care:
For example, encoding 0xFFFE before each byte value converted to some 16-bit 
code unit. The problem is that backward parsing of strings just checks whether 
a code unit is a low surrogate, to see if a second backward step is needed to 
get the leading high surrogate; so U+FFFE would need to be used (privately 
only) as another lead "high surrogate" with a special (internal) meaning for 
round-trip compatibility, and the best choice for the code unit encoding 
the invalid byte value would then be a standard low surrogate storing 
this byte. So a qualifying internal representation would be {0xFFFE, 
0xDC00+n}, where n is the byte value to encapsulate.
Don't call this UTF-16, because it is not UTF-16.
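A sketch of that 16-bit variant (hypothetical helper names): the escape is the 
code unit pair {0xFFFE, 0xDC00+n}, so backward iteration that stops on a low 
surrogate still steps back over both units:

// Private escape in a 16-bit code unit stream: {0xFFFE, 0xDC00 + n}.
// Not UTF-16; for internal use only.
final class Utf16StyleEscape {
    static char[] escape(int invalidByte) {
        return new char[] { '\uFFFE', (char) (0xDC00 + (invalidByte & 0xFF)) };
    }

    // Returns the original byte if the two units at 'pos' form an escape, or -1 otherwise.
    static int unescape(char[] units, int pos) {
        if (pos + 1 < units.length
                && units[pos] == '\uFFFE'
                && units[pos + 1] >= '\uDC00' && units[pos + 1] <= '\uDCFF') {
            return units[pos + 1] - 0xDC00;
        }
        return -1;
    }

    public static void main(String[] args) {
        char[] esc = escape(0xFE);
        System.out.println(Integer.toHexString(unescape(esc, 0)));  // fe
    }
}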

An implementation that uses UTF-8 for valid strings could use the invalid 
lead byte ranges to encapsulate invalid byte values. Note however that the 
invalid bytes you would need to represent have 256 possible values, but 
UTF-8 has only 2 reserved lead byte values (0xC0 and 0xC1), each covering 64 
codes, if you want to use an encoding on two bytes. The alternative would be 
to use the UTF-8 lead byte values which had initially been assigned to byte 
sequences longer than 4 bytes, and that are now unassigned/invalid in 
standard UTF-8. For example: {0xF8+(n/64), 0x80+(n%64)}.
Here also it will be a private encoding, that should NOT be named UTF-8, and 
the application should clearly document that it will accept not only any 
valid Unicode string, but also some invalid data which will have some 
roundtrip compatibility.
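And a sketch of that byte-oriented variant (hypothetical helper names again): 
the invalid byte n becomes the two-byte sequence {0xF8+(n/64), 0x80+(n%64)}, 
built on lead bytes that standard UTF-8 no longer assigns:

// Private escape in a byte stream: {0xF8 + (n / 64), 0x80 + (n % 64)}.
// The lead bytes 0xF8..0xFB once introduced 5-byte sequences and are invalid in standard UTF-8.
final class Utf8StyleEscape {
    static byte[] escape(int invalidByte) {
        int n = invalidByte & 0xFF;
        return new byte[] { (byte) (0xF8 + n / 64), (byte) (0x80 + n % 64) };
    }

    // Returns the original byte if the two bytes at 'pos' form an escape, or -1 otherwise.
    static int unescape(byte[] bytes, int pos) {
        if (pos + 1 >= bytes.length) return -1;
        int lead = bytes[pos] & 0xFF, trail = bytes[pos + 1] & 0xFF;
        if (lead < 0xF8 || lead > 0xFB || trail < 0x80 || trail > 0xBF) return -1;
        return (lead - 0xF8) * 64 + (trail - 0x80);
    }

    public static void main(String[] args) {
        byte[] esc = escape(0xC1);
        System.out.println(Integer.toHexString(unescape(esc, 0)));  // c1
    }
}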

So what is the problem? Suppose that the application, internally, starts to 
generate strings containing occurrences of such private sequences; then 
it will be possible for the application to generate on its output a byte 
stream that would NOT have roundtrip compatibility back to the private 
representation. So roundtripping would only be guaranteed for streams 
converted FROM a UTF-8 source in which some invalid sequences are present and 
must be preserved by the internal representation. So the transformation is 
not as bijective as you would think, and this potentially creates lots of 
possible security issues.

So for such an application, it would be much more appropriate to use different 
datatypes and structures to represent streams of binary bytes and 
streams of characters, and to recognize them independently. The need for a 
bijective representation means that the input stream would have to contain an 
encapsulation to recognize *exactly* whether the stream is text or binary.

If the application is a filesystem storing filenames, and there's no place in 
the filesystem to record whether a filename is binary or text, then you are 
left without any secure solution!

So the best thing you can do to secure your application is to REJECT/IGNORE 
all files whose names do not match the strict UTF-8 encoding rules that your 
application expects (everything will happen as if those files were not 
present, but this may still create security problems, for example if an 
application that does not see any file in a directory wants to delete that 
directory, assuming it is empty... In that case the application must be ready 
to accept the presence of directories without any visible content, and must 
not depend on the presence of a directory to determine that it has some 
contents; anyway, on secured filesystems, such things can happen due to 
access restrictions completely unrelated to the encoding of filenames, and it 
is not unreasonable to prepare the application so that it behaves correctly 
in the face of inaccessible files or directories, so that it will also 
correctly handle the fact that the same filesystem may contain non-plain-text 
and inaccessible filenames).

Anyway, the exposed solutions above demonstrate 

Re: RE: Roundtripping in Unicode

2004-12-13 Thread Philippe VERDY
Lars Kristan wrote: What I was talking about in the paragraph in question is what happens if you want to take unassigned codepoints and give them a new status.
You don't need to do that. No Unicode application should assign semantics to unassigned code points.
If a source sequence is invalid, and you want to preserve it, then this sequence must remain invalid if you change its encoding.
So there's no need for Unicode to assign valid code points for invalid source data.
There's enough space *assigned* as invalid (or assigned to non-characters) in all UTF forms to allow an application to create a local conversion scheme which will perform a bijective conversion of invalid sequences:
- for example in UTF-8: trailing bytes 0x80 to 0xBF isolated or in excess, or even the invalid lead bytes 0xF8 to 0xFF
- for example in UTF-16: 0xFFFE, 0xFFFF
- for example in UTF-32: same as UTF-16, plus all code units above 0x10FFFF
Using PUA space or some unassigned space in Unicode to represent invalid sequences present in a source text would be a severe design error in all cases, because that conversion will not be bijective and could map invalid sequences to valid ones without further notice, changing the status of the original text, which should be kept as incorrectly encoded until it is explicitly corrected or until the source text is reparsed with another, more appropriate encoding.
(In fact I also think that mapping invalid sequences to U+FFFD is an error, because U+FFFD is valid: the presence of the encoding error in the source is lost, and will not throw exceptions in further processing of the remapped text, unless the application constantly checks for the presence of U+FFFD in the text stream, and all modules in the application explicitly forbid U+FFFD within their interfaces...)
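To see that loss concretely with a real decoder: Java's String constructor 
uses the replacing UTF-8 decoder by default, so two different invalid inputs 
collapse to the same text and nothing downstream can tell them apart (a 
sketch, not an endorsement of this decoding mode):

import java.nio.charset.StandardCharsets;

public class ReplacementLoss {
    public static void main(String[] args) {
        byte[] a = { 0x41, (byte) 0xC0 };   // 'A' followed by an invalid lead byte
        byte[] b = { 0x41, (byte) 0xFE };   // 'A' followed by another invalid byte

        // The String(byte[], Charset) constructor replaces malformed input instead of failing.
        String sa = new String(a, StandardCharsets.UTF_8);
        String sb = new String(b, StandardCharsets.UTF_8);

        System.out.println(sa.equals(sb));  // true: both decode to "A\uFFFD"; the original error is lost
    }
}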


Re: Roundtripping in Unicode

2004-12-14 Thread Philippe Verdy
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Lars Kristan [EMAIL PROTECTED] writes:
Hm, here lies the catch. According to UTC, you need to keep
processing the UNIX filenames as BINARY data. And, also according
to UTC, any UTF-8 function is allowed to reject invalid sequences.
Basically, you are not supposed to use strcpy to process filenames.
No: strcpy passes raw bytes, it does not interpret them according to
the locale. It's not an UTF-8 function.
Correct: [wc]strcpy() handles string instances, but not all string 
instances are plain text, so they don't need to obey UTF encoding rules 
(they just obey the convention of null-byte termination, with no 
restriction on the string length, measured as a size in [w]char[_t] units but 
not as a number of Unicode characters).

This is true for the whole standard C/C++ string libraries, as well as in 
Java (String and Character objects or the native char datatype), and 
in almost all string-handling libraries of common programming languages.
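For instance, in Java (a sketch): a String may contain a lone surrogate and 
every String operation will accept it, but it is not valid plain text, and 
encoding it to UTF-8 with the default (replacing) encoder does not round-trip:

import java.nio.charset.StandardCharsets;

public class NotPlainText {
    public static void main(String[] args) {
        String s = "abc\uD800";                    // lone high surrogate: a perfectly legal String value
        System.out.println(s.length());            // 4: the String API does not reject it

        // getBytes() must emit well-formed UTF-8, so the lone surrogate is replaced,
        // and the original string cannot be recovered from the encoded bytes.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        String back = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(back.equals(s));        // false: the value does not survive the round trip
    }
}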

A locale defined as UTF-8 will experience lots of problems because of 
the various ways applications behave when facing encoding errors 
encountered in filenames: exceptions thrown that abort the program, 
substitution by '?' or U+FFFD causing the wrong files to be accessed, some 
files not processed because their name was considered invalid although they 
were effectively created by some user of another locale...

Filenames are identifiers coded as strings, not as plain text (even if most 
of these filename strings are plain text).

The solution is then to use a locale based on a relaxed version of UTF-8 
(some spoke about defining "NOT-UTF-8" and "NOT-UTF-16" encodings to allow 
any sequence of code units, but nobody has thought about how to make 
NOT-UTF-8 and NOT-UTF-16 mutually fully reversible; now add NOT-UTF-32 
to this nightmare and you will see that NOT-UTF-32 needs to encode 2^32 
distinct NOT-Unicode code points, and that they must map bijectively to 
exactly all 2^32 sequences possible in NOT-UTF-16 and NOT-UTF-8; I have not 
found a solution to this problem, and I don't know if such a solution even 
exists; if it does, it is probably quite complex...).



