Philippe,
However, within the program itself UTF-8 presents a
problem when looking for specific data in memory buffers.
It is nasty, time-consuming, and error-prone. Mapping
UTF-16 to code points is a snap as long as you
do not have a lot of surrogates. If you do then probably
UTF-32 should be
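The mapping is simple enough that ICU4C even wraps it in a macro; a minimal sketch of the code point walk using U16_NEXT (the helper name is mine):

#include <stdio.h>
#include <unicode/utf16.h>

/* Walk a UTF-16 buffer code point by code point. U16_NEXT handles the
   surrogate-pair case, so the loop body always sees a whole code point. */
void dump_code_points(const UChar *s, int32_t length)
{
    int32_t i = 0;
    UChar32 c;
    while (i < length) {
        U16_NEXT(s, i, length, c);   /* advances i by 1 or 2 code units */
        printf("U+%04X\n", (unsigned)c);
    }
}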
Jill,
I think that the best practice is to validate input.
Besides the overhead of revalidating, there is the issue of what you do
with data that contains invalid characters. This has to be handled
explicitly. Once validated, all transforms should maintain valid data. If
you also provide a
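A minimal sketch of the validate-once approach in plain C (the function name is mine); production code would also report where the first bad byte is, since that is the part that has to be handled explicitly:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Validate a buffer as well-formed UTF-8: shortest form only, no
   surrogates, nothing above U+10FFFF. */
bool is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t n;        /* continuation bytes expected */
        uint32_t cp;     /* accumulated code point      */
        if (b < 0x80) { i++; continue; }
        else if (b < 0xC2) return false;              /* stray continuation or overlong lead */
        else if (b < 0xE0) { n = 1; cp = b & 0x1F; }
        else if (b < 0xF0) { n = 2; cp = b & 0x0F; }
        else if (b < 0xF5) { n = 3; cp = b & 0x07; }
        else return false;                            /* would exceed U+10FFFF */
        if (i + n >= len) return false;               /* truncated sequence */
        for (size_t k = 1; k <= n; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if ((n == 2 && cp < 0x800) || (n == 3 && cp < 0x10000)) return false;  /* overlong */
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) return false;
        i += n + 1;
    }
    return true;
}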
Philippe,
Also a broken opening tag for HTML/XML documents
In addition to not having endian problems, UTF-8 is also useful when tracing
intersystem communications data, because XML and other tags are usually in
the ASCII subset of UTF-8 and stand out, making it easier to find the
specific data you
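The property that makes this work: every byte of a UTF-8 multibyte sequence has the high bit set, so a byte-level search for an ASCII-only tag can never match the middle of some other character. A sketch (the function name is mine):

#include <string.h>

/* Find an ASCII-only tag in a NUL-terminated UTF-8 buffer. Safe because
   ASCII bytes never occur inside a UTF-8 multibyte sequence, so strstr
   cannot produce a false hit against a fragment of another character. */
const char *find_ascii_tag(const char *utf8, const char *ascii_tag)
{
    return strstr(utf8, ascii_tag);
}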
What amazes me is that no one has addressed numeric input.
Often, to simplify i18n, companies use web servers and browsers for data processing.
Much of that involves forms, and these forms have mixed alphanumeric and numeric-only
fields. To the best of my knowledge nowhere can I specify numeric
Mark,
I am impressed with the data collected but have problems with the structure and some
of the actual data values.
For example, if I want to handle date/time data I need time zone info. I may also need
country information to parse and format the date, as well as language info for things
Eric,
1. Does somebody have more information about that effort?
Eki lists four characters as needed but missing in Unicode (see
http://www.eki.ee/letter/chardata.cgi?lang=tt+Tatar&script=latin).
I had suggested earlier that Tatar be added to the special case rules for
dotted and dotless I
Doug,
The issue of French as spoken in Switzerland versus French as spoken
in Canada is totally unrelated to the issue of Swiss conventions versus
Canadian conventions for sorting, date and time format, decimal
separator, and so forth.
As for time zones, I agree completely with Mark that
Peter,
If I live in Guam I will probably be using an en_US locale.
However the US territory does not contain my time zone.
Probably the best solution for this problem is to add a category
of possessions to the territory information. This allows
applications to enumerate available time zones
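ICU4C can already enumerate the zones tied to a territory code; a sketch using ucal_openCountryTimeZones (whether Guam's "GU" would roll up under a US possessions category is exactly the open question above):

#include <stdio.h>
#include <unicode/ucal.h>
#include <unicode/uenum.h>

/* Print the Olson time zone IDs that ICU associates with a territory,
   e.g. list_zones("US") or list_zones("GU"). */
void list_zones(const char *territory)
{
    UErrorCode status = U_ZERO_ERROR;
    UEnumeration *e = ucal_openCountryTimeZones(territory, &status);
    const char *id;
    int32_t len;

    if (U_FAILURE(status)) return;
    while ((id = uenum_next(e, &len, &status)) != NULL)
        printf("%s\n", id);
    uenum_close(e);
}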
Mark,
Do you know if there is an official list of country possessions?
Carl
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Behalf Of Mark Davis
Sent: Friday, May 07, 2004 5:28 PM
To: Carl W. Brown; Unicode List
Subject: Re: TR35 (was: Standardize TimeZone ID
Mark,
LDML does require the Olson IDs to identify time zones
(as does Unix, Java, ICU,...). See the discussion in
http://www.unicode.org/reports/tr35/.
I found a normalization problem with the IDs. For example you have both
Asia/Istanbul and Europe/Istanbul which are different names for
Mark,
That is not a problem. The Olson IDs are not guaranteed
to be unique, just unambiguous. And there are aliases.
Typically these are de-unified for political
purposes. Thus you may find that two different IDs produce
the same results over
the entire period of time in the database.
Markus,
Rick Cameron wrote:
IMHO, that's a bit misleading. The String class
itself does not appear to be
aware of SMP characters. It clearly uses
UTF-16, and the length it reports
is the number of code units, not the number
of characters or graphemes in the string.
There is no
Benjamin,
Versions up until Windows 2000 use UCS-2 internally. 2000 and XP use
UTF-16, although applications tend to have differing levels of awareness
about surrogates.
You can enable Win2K surrogate support
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicod
Philippe,
its set of proscribed words, included in programs that were designed to
filter the words out of text.
Does this list really exist? Seriously, there's
no word that can be proscribed,
because they are not themselves infamous.
What is infamous or dangerous is their
use to make
James Kass,
U+E000 COMBINING BLACK BLOB? Censors would probably love it.
It is a much more universal solution than the one that the censors really
wanted.
COMBINING EXPLETIVE DELETE
The character would be inserted after all words and delete them if they were
on a proscribed list of forbidden
Marion,
That particular campaign was such a resounding 'success' we went on to
spend thousands of quid each year, for many years, trekking one more
encoding campaign trail after another, in support of many other languages,
as well as our own.
It reminds me of my work on a multi-lingual
Marion,
Irish in Roman script is written with a dotted i;
Irish in traditional script is written with a
dotless i. The current flooding of our local
advertising and publishing markets by various
non-native uncial fonts to write our language goes
against tradition in imposing on us that
Mark,
Markus did a good job of describing the advantages of each. The problem that I see
is that there are applications that are not enabled to do BOM processing and convert
from little-endian to big-endian and the other way around.
Are there any browsers that support Unicode but will not do
Euro-English
The EU announces changes to the spellings of common English words...
European Union commissioners have announced that agreement has been reached
to adopt English as the preferred language for European communications,
rather than German, which was the other possibility.
As part of
Jill,
The dotted and dotless i are distinctly different; however, I like to fold them when
doing searches because I don't know of any cases where it would cause search
problems. For example, if I am searching for Istanbul I want to include the
dotted spelling as well.
Carl
-Original
in each page:
<!-- /* $WEFT -- Created by: Carl W. Brown ([EMAIL PROTECTED]) on
2/17/2002 -- */
@font-face {
font-family: Papyrus;
font-style: normal;
font-weight: normal;
src: url(PAPYRUS3.eot);
}
-->
Carl
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf
Jill,
I know that Unicode does have some
locale-sensitive case mappings (Turkish
uppercase I to dotless lowercase
i, for example), but I was under the impression
that ss to ß was not one of them.
You are correct that SS and ß are the same in case insensitive compares
regardless of locale.
I
Mark,
But there's no official Unicode standard that
I know of (and that isn't saying much) that says that ss and ß have to
compare as equals.
http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
Carl
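The folding is easy to check with ICU4C's u_strCaseCompare, which applies the CaseFolding.txt mappings (U+00DF folds to "ss"); a sketch:

#include <stdio.h>
#include <unicode/ustring.h>

/* Verify that "SS" and the sharp s compare equal under default case
   folding, independent of locale. */
int main(void)
{
    UErrorCode status = U_ZERO_ERROR;
    UChar s1[] = { 0x0053, 0x0053, 0 };  /* "SS"           */
    UChar s2[] = { 0x00DF, 0 };          /* U+00DF sharp s */
    int32_t r = u_strCaseCompare(s1, -1, s2, -1, U_FOLD_CASE_DEFAULT, &status);
    printf("equal: %s\n", (U_SUCCESS(status) && r == 0) ? "yes" : "no");
    return 0;
}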
Doug,
You might remember that I chided Microsoft for
its definition of Unicode in
Windows 2000 Help, where Unicode was described
as a 16-bit standard that was developed between
1988 and 1991, implying that the work was
finished. Even at the time Windows 2000 was being
developed, there
. Brown
Cc: [EMAIL PROTECTED]
Subject: RE: MS Windows and Unicode 4.0 ?
Carl W. Brown wrote:
Doug writes:
You might remember that I chided Microsoft for
its definition of Unicode in
Windows 2000 Help, where Unicode was described
as a 16-bit standard that was developed between
Michael,
This is another straw man argument, isn't it? Nobody on this thread
has said they want monospaced alphanumerics.
No, but the responsible parties have responded and informed the list that
clones of Latin letters A-F will not be entertained.
How 'bout we drop the discussion?
Before dropping
Michael,
Tim Berners-Lee has sent a letter of concern to the president of
ISO about the idea of collecting royalties on...guess what... ISO
language and country codes! According to the letter, the ISO
Commercial Policies Steering Group is proposing a royalty on
commercial use of ISO
Mark,
Right after Ken was so nice to take the beer OT topic off line to a group
message, I got hit with sobig-f: over 1000 messages per day, and I did not open an
attachment.
I know that part of what makes us i18n folks is detail. But too often we carry it too
far even with topics
Mark,
Yes, I am sick and tired of dealing with this horrible
non-decimal measurement system the US has for time: the number of units
per other unit varies all across the board:
60..61 : 1, 60 : 1, 24 : 1, 28..31 : 1, 12 : 1,
365..366 : 1 --
awful. At least with inches, feet, and miles,
John,
A kilosec is a reasonable amount of time to wait for a late appointment
(in some countries, anyhow).
A megasec is enough time to do a small project.
If a marriage lasts a gigasec, it is doing very well.
1 pictun = 20 baktun = 2,880,000 days = approx. 7885 years
1 calabtun = 20
Jay,
Oracle's UTF-8 is not really a valid encoding. It
encodes surrogates as if they were characters. They kept the old Unicode
2.x code that only supports the BMP to provide sort key compatibility for clients
who never upgraded to Unicode 3.0 support and are using 16-bit character
encoding
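For concreteness, U+10400 encodes as F0 90 90 80 in real UTF-8 but as ED A0 81 ED B0 80 in the Oracle-style form (essentially CESU-8), where each UTF-16 surrogate gets its own three-byte sequence. A sketch that computes both:

#include <stdio.h>
#include <stdint.h>

/* Show how U+10400 looks in real UTF-8 versus the surrogates-as-characters
   (CESU-8) form that old UCS-2-era code produces. */
int main(void)
{
    uint32_t cp = 0x10400;

    /* UTF-8: one 4-byte sequence -> F0 90 90 80 */
    printf("UTF-8 : %02X %02X %02X %02X\n",
           0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
           0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));

    /* CESU-8: split into surrogates D801 DC00, encode each as 3 bytes
       -> ED A0 81 ED B0 80 */
    uint16_t su[2];
    su[0] = (uint16_t)(0xD800 + ((cp - 0x10000) >> 10));
    su[1] = (uint16_t)(0xDC00 + ((cp - 0x10000) & 0x3FF));
    printf("CESU-8:");
    for (int k = 0; k < 2; k++)
        printf(" %02X %02X %02X", 0xE0 | (su[k] >> 12),
               0x80 | ((su[k] >> 6) & 0x3F), 0x80 | (su[k] & 0x3F));
    printf("\n");
    return 0;
}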
Thomas,
It's all well and good to change the keyboard layout, but it can be
confusing if it becomes too different from the physical keyboard
(esp. if one has to type something in a totally different alphabet).
Now, if anybody would manufacture keyboards with tiny LCD displays on
each key,
Tim,
The point is not that any potential attendee would actually
travel to the wrong place. It is that advertising the 24th
conference as Atlanta, GA but the 23rd as Prague, Czech
Republic is part of a cultural arrogance in the USA.
We should have the next conferences in San Jose, Costa Rica
I disagree with Philippe's message in that I think that it is based on
Microsoft's determination to follow the idea that browsers are not
applications but part of the OS.
To clarify my statement.
I think Philippe's message was appropriate to this forum. It was far more
pertinent to Unicode
to cover.
Even if the user does not read the language they may be able to recognize
the name.
From one of my sites:
<!-- /* $WEFT -- Created by: Carl W. Brown ([EMAIL PROTECTED]) on
2/17/2002 -- */
@font-face {
font-family: Papyrus;
font-style: normal;
font-weight: normal;
src: url
Chris,
I think that a Klingon web site that uses UTF-8
and the PUA with
your own font is very Unicode savvy.
Carl
It's certainly a lot more savvy than using Latin-1 characters to
encode Klingon.
If nothing else we need to discourage people from using the Latin-1 code
page
Philippe,
From: Carl W. Brown [EMAIL PROTECTED]
It looks to me like UNCODE. Has the UN taken a role in
globalization? Maybe the web page has no scripting but is still savvy.
Wrong! You strip the very visible dot from the i letter, yet you also
refuse to see that there's a ligature
Marco,
No, archaic, American and informal are usage labels, not
translations.
The translation is buon senso. (BTW, it is: Dizionario Garzanti di
inglese, Garzanti Editore, 1997, ISBN 88-11-10212-X)
Webster's has to know, to understand or common sense, understanding. In actuality it
is
Doug,
Most likely because no modern computer uses a 3-byte (24-bit) internal
processing unit, and because it would be false economy for real-world
Unicode text (see (1) and (2) above).
What would be worse is to have an implementation like the old IBM 360 computers, where
the 24-bit addresses
Alan,
IE uses mlang to determine if you have the right fonts for the characters.
http://msdn.microsoft.com/library/default.asp?url=/workshop/misc/mlang/overview/overview.asp
Carl
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of Alan Wood
Sent:
Markus,
There are some more characters that have the same codes in most
EBCDIC codepages, but there are also
some where the Latin letters are not all present. (I think some
old Japanese EBCDIC codepages
replace small Latin letters with Katakana ones.)
That is true. The half width
Barry,
If you think that this is bad try 390 mainframe EBCDIC shift to upper case.
You can shift up to 256 characters at a time with a single machine language
instruction by ORing a line of spaces to your character field. Now that is
bit flipping and is still heavily used.
Carl
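The trick works because EBCDIC places lowercase letters in 0x81-0xA9 and uppercase in 0xC1-0xE9 (with gaps), and the space is 0x40: ORing a space into a letter sets exactly the bit that uppercases it. A sketch of the same idea in C; on the mainframe a single OC (OR characters) instruction does the whole field:

#include <stddef.h>

/* EBCDIC uppercasing by ORing in the EBCDIC space (0x40). Lowercase
   letters live at 0x81..0xA9 and uppercase at 0xC1..0xE9, so setting
   the 0x40 bit maps one range onto the other. Assumes the field holds
   only letters, digits, and spaces, as the mainframe trick does. */
void ebcdic_upper(unsigned char *field, size_t len)
{
    for (size_t i = 0; i < len; i++)
        field[i] |= 0x40;
}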
-Original
Marco,
I agree. I did some basic design work on an Ethiopian system and it was
decided to follow the same implementation system as Thai. We don't encode
every possible Thai glyph.
We felt that if it were ever Unicode encoded we needed to use the decomposed
characters rather than decomposing
Marco,
I was disappointed that Unicode used precomposed encoding for Ethiopic.
Carl
Michael,
I was disappointed that Unicode used precomposed encoding for Ethiopic.
Heavens, why?
I assume that you are being tongue-in-cheek. If not:
Since you key in syllables as consonant+vowel combinations you can keep the
encoding under 256 characters like most other languages with
translating between different
languages that represent different cultures.
Carl
-Original Message-
From: Stefan Persson [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 20, 2002 2:33 PM
To: Carl W. Brown; [EMAIL PROTECTED]
Subject: Re: Morse coded Unicode (was: Morse code
- Original
Tex,
I think that the bigger issue might be how do you extend Morse code to
incorporate the Unicode character set.
Other than an enormous number of dots and dashes per character, there are
other issues.
Without case do you need a German sharp s?
Does the final sigma need two forms?
How do you
Radovan,
I seem to remember that just recently Morse code was dropped and is no
longer used officially. Braille is different.
Unicode does support dead scripts for scholarly use. Do you think that
there will be many scholarly texts that will be written in Morse code?
Carl
-Original
Doug,
However, 16 bit characters were a hard enough sell in the good old
days. If we had started out withug 2bit characters we would still be
dreaming about Unicode.
I think Carl meant with 32-bit characters. I don't know what kind of
word withug is (Old English?), but I like it.
It
Jane,
One of the problems is that early Unicode adopters used the 16-bit UCS-2
encoding of Unicode. Converting to UTF-16 requires surrogate support.
Some of the GB18030 characters require this support. ICU is dedicated to
Unicode support so a lot of effort is put into ICU to keep it up to
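The surrogate support in question is only a little arithmetic; a minimal sketch of both directions (the helper names are mine):

#include <stdint.h>

/* Encode a supplementary code point (U+10000..U+10FFFF) as a UTF-16 pair. */
void to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    cp -= 0x10000;
    *hi = (uint16_t)(0xD800 + (cp >> 10));    /* high (lead) surrogate */
    *lo = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low (trail) surrogate */
}

/* Combine a surrogate pair back into a single code point. */
uint32_t from_surrogates(uint16_t hi, uint16_t lo)
{
    return 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}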
] [mailto:unicode-bounce;unicode.org]On
Behalf Of Markus Scherer
Sent: Thursday, November 14, 2002 9:18 AM
To: unicode
Subject: Re: IBM AIX 5 and GB18030
Carl W. Brown wrote:
Some Unix systems adapted faster because the later Unicode
adopters used 32
bit Unicode characters making the job
Markus,
You seem to suggest that there is a problem with 16-bit Unicode.
It does take some effort to adapt
UCS-2-designed functions for UTF-16, but it's not rocket
science and works very well thanks to the
Unicode allocation practice (common characters in the BMP).
Making UTF-8/32 functions
Jim,
There
already is a Unicode solution for the problem. Check UAX #21. If
search engines use case insensitive compares then it should be no problem.
There are a lot of exceptions to the rule, so you need separate characters for the
forms, but you also need an algorithm that works
Thomas,
It seems that the private use area is abused. If you are sending characters
between two systems that are not a part of the Unicode standard then you can
use the private use area with agreed code points.
With ligatures you scan the text and identify ligature pairs. The resultant
text is
Mark,
Do you know if there is any CSS work on defining field contents? I have run
into a number of cases where I wanted to distinguish between text and
numeric only input fields. The numeric field entry would disable the IME so
that the user could enter standard Latin narrow digits.
With
What level of Unicode does Java currently fully support?
Carl
William,
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of William Overington
Sent: Friday, August 23, 2002 12:55 AM
To: James Kass; Carl W. Brown; Unicode List
Cc: [EMAIL PROTECTED]
Subject: Re: Revised proposal for Missing character glyph
Ken,
The little square boxes do not help much if you want to know exactly what
the missing characters are. I do however feel that any solution to the
problems should be Unicode based. If left to the vendors, they may display
the code page characters and you are guessing again.
The tool idea is
Ken,
This is an alternative to representing bad glyphs with a missing glyph
character. People can implement either.
-Original Message-
From: Kenneth Whistler [mailto:[EMAIL PROTECTED]]
Sent: Friday, August 16, 2002 2:28 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL
Proposed unknown and missing character representation. This would be an
alternative to the method currently described in 5.3.
The missing or unknown character would be represented as a series of
vertical hex digit pairs for each byte of the character. BMP characters
would be represented with 4 hex
of some Latin text. It
would be higher than wide but not as high as the 6 hex digit grouping.
Carl W. Brown
With a bit more thought we might reduce the minimum point size of an
unrenderable character as follows:
The numbers represent a dot position if that bit is a one. It is blank if
the bit is 0.
The XX characters are lines with an inverted wide squared U at the top with
the edges coming down to
Doug,
I agree.
I used to do security consulting and found that the biggest problem was that
people tried to come up with solutions for the wrong problem.
We can go back to the typewriter days when there was no difference between
1 and l, or 0 and O. Do you blame ASCII if you type ST0P instead of
xIUA 3.2 with ICU 2.0 support is available from X.Net, Inc. It is also
compatible with ICU 1.8.1.
http://www.xnetinc.com/xiua/
Upgrade instructions for prior releases are available from X.Net, Inc.
Carl
Roozbeh,
I was told that there was a special (semi official) version of Win98 that
added 4 missing letters in CP1256 by replacing Latin letters to create
CP1256mod. It used LCID 0826.
Carl
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of Roozbeh
patches but I don't think that the MS folks ever did an Urdu patch.
Carl
-Original Message-
From: Michael (michka) Kaplan [mailto:[EMAIL PROTECTED]]
Sent: Friday, October 12, 2001 9:01 AM
To: Carl W. Brown; [EMAIL PROTECTED]
Subject: Re: CP1256 and Persian YEH?
Probably mistaken
Keld Simonsen,
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of Keld Jorn Simonsen
Sent: Wednesday, October 10, 2001 1:07 PM
To: Michael Everson
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: Roadmaps
On Wed, Oct 10, 2001 at 07:53:51PM
Keld,
In the case of ISO 639 there is an online, official, up-to-date registry
available at the Library of Congress site.
It is there because the same codes are used in the MARC standard. However
even though they seem to keep it up to date, it is an unofficial copy of the
standard. Other
Bent Herlevsen,
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of Magda Danish (Unicode)
Sent: Thursday, October 04, 2001 10:00 AM
To: [EMAIL PROTECTED]
Subject: FW: Unicode locale id
-Original Message-
From: [EMAIL PROTECTED]
Addison,
It might be easier to convert the JVM from UCS-2 to UTF-32 so that you do
not have to worry about surrogates. This would more closely match most Unix
implementations (except Sun) where Java is widely used.
Carl
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL
Doug,
I suspect that since it was a phonetic spelling system and the writings
varied with the writer's pronunciation that individualized keyboard layouts
could be a personal preference as well.
Carl
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of
if users
could create sig's to define the layout. Now all I need is the Klingon font,
thanks to this thread I found the Deseret font. - Dave
"Carl W. Brown"
[EMAIL PROTECTED] Sent by: [EMAIL PROTECTED]
10/03/0
MichKa,
And I am sure Apple is hard at work on the Deseret font and
keyboard for Mac
OS 11? :-)
Getting the scripts defined will allow third parties to add support to most
operating systems for specific languages that are not supported by the
standard offerings.
The big deal will be
William,
It looks like if you really want multilingual support that you need to run
your text through a layout engine. If that is the case then you can remap
certain characters or character combinations into the U+FDD0 to U+FDEF
Unicode range and use this special non-character area for what
Michka,
I have also heard that the dollar sign comes from a U superimposed over an S,
with the bottom of the U dropped. This would be hard to do on a
typewriter because the two lines would be so close that they would be
indistinct and would fill with lint from the ribbon. I suspect that the
Tex,
ok i'll quit
I figured that you would drag some GIFTS (Poison) from your MIST (Manure)
ridden mind.
Carl
Tom,
If i can b so bold as 2 pen a pun or 2. Punning is a vocabulary mind set
that even a pica-mind can render.
There is no bad joke like a good pun. It is a great way to lose friends and
make enemies. The only really challenging puns are the multilingual ones.
Now that I have been avoiding
Mike,
The typical situation involves cases where large data sets
are cached in
memory for immediate access. Going to UTF-32 effectively reduces the
cache by a factor of two, with no comparable increase in processing
efficiency to
balance out the extra cache misses. This is because
Tom,
Andy Heninger writes:
Performance tuning is easier with UTF-16. You can optimize for
BMP characters, knowing that surrogate pairs are sufficiently uncommon
that it's OK for them to take a bail-out slow path.
Sure, but if you are using UTF-16 (or any other multibyte encoding)
you
this response.
At 12:49 -0500 2001-09-24, Eric Fischer wrote:
Michael Everson [EMAIL PROTECTED] quotes Carl W. Brown:
This is logical. Originally typewriters had no 1 or 0. You could use
the letters l and O. They look the same, so that was good enough until
computers came along
Mike,
If you think you have the answer to all the problems, then you
don't know all the problems.
I tried to make a point, and apparently made it poorly. I will try
again. It seems that some people are arguing that UTF-16 is the ideal
solution for all computing, and that
Edward,
Typewriters, computer keyboards, and school recitations still put 0 after 9
rather than before 1. Such is Human Stupidity.
This is logical. Originally typewriters had no 1 or 0. You could use the
letters l and O. They look the same, so that was good enough until computers
came along and
When developing xIUA, I designed UTF-8 support to be used two different
ways. One as a form of Unicode and the other as yet another code page. In
either case the two are handled, with few exceptions, in the same manner. The
only difference is when you want to convert from UTF-8 to an underlying
Ken
I have to convert from UTF-8 to UTF-16, before calling ICU
functions (such
as ucol_strcoll() )
I'm worried about the performance overhead of this conversion.
You shouldn't be.
The conversion from UTF-8 to UTF-16 and back is algorithmic and very
fast.
To make this conversion
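A sketch of that conversion in front of ucol_strcoll, using ICU4C's u_strFromUTF8 with fixed stack buffers (real code would size-check or retry on buffer overflow):

#include <unicode/ustring.h>
#include <unicode/ucol.h>

/* Compare two UTF-8 strings with an ICU collator by converting each to
   UTF-16 first. The 256-unit buffers are an assumption of this sketch. */
UCollationResult compare_utf8(UCollator *coll, const char *a, const char *b)
{
    UChar ua[256], ub[256];
    int32_t la, lb;
    UErrorCode status = U_ZERO_ERROR;

    u_strFromUTF8(ua, 256, &la, a, -1, &status);
    u_strFromUTF8(ub, 256, &lb, b, -1, &status);
    if (U_FAILURE(status))
        return UCOL_EQUAL;   /* simplistic error handling for the sketch */
    return ucol_strcoll(coll, ua, la, ub, lb);
}

Later ICU releases also added ucol_strcollUTF8, which avoids the intermediate buffers entirely.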
Ram,
If ISCII is intended as a pan-Indic solution does it also support Urdu?
Carl
Ram,
ISCII has escape sequences which announce the start of a new Indic script.
An ATR char followed by special codepoint forms the escape sequence.
It is possible to support a page that contains different Indic
scripts. There are
problems with the standard; for example, it assumes a default
Doug,
It is true that the *specific* irregular UTF-8 sequences introduced (and
required) by CESU-8 decode to characters above 0x when interpreted as
CESU-8, and to pairs of surrogate code points when (incorrectly)
interpreted
as UTF-8. Since definition D29, arguably my least favorite
Bernard,
Many of your questions have been answered by others but I want to add a few
comments.
1. Why does Unicode say that there are 63486 code
values available to represent characters with single
16 bit values and 2048 available to represent an
additional 1,048,544 characters as
Ken,
Even those who do not know the details of Indic processing know that you
cannot argue both sides of the issue. There was a lot of criticism of the fact
that there were differences in scripts, yet there was no mention that Unicode,
because of its extended code base, does support
MichKa,
Actually, once it's in IANA then it is legal in XML and other places, and
*everyone* will have to support it, whether they want to or not. What is
supposedly private will become quite public. IANA, after all,
does not have
charsets that they register for people to not use and none of
MichKa,
Also, Toby was not attempting to be deceitful, AFAIK. The
original proposal
he submitted (still called UTF-8S) was not in any way
contradictory but many
people objected to various issues within it and the way many things were
presented. The current proposal was a very rushed
Doug,
But if people start compromising their UTF-8 parsers to
accommodate CESU-8
adaptively, it would be a great blow to UTF-8. It would
essentially undo
all the tightening-up that was accomplished by the Corrigendum,
and it would
revive all the old Bruce Schneier-style skepticism about
Mark,
- Just because it is in IANA does *not* mean that everyone will
support it.
There are many encodings in IANA supported by very few people. Nor does it
mean that it is intended for widespread public use. The IANA registry is
also used as a general purpose registry, even for encodings
Addison,
By providing a documented, standard way to refer to legacy
versions of these products and their encodings, I can more
readily rely on having a well-documented range of protocols and
procedures for converting and validating data exchanged with
these systems. The argument that
MichKa,
Many people believe that any rule or law that makes no sense or cannot be
enforced weakens all other laws. I believe that publishing an inconsistent
document that would allow any reasonably intelligent reader to come to the
same conclusions as you did, and the standard itself would
Marcin,
We can't change the past, but I hope that at least UTF-8 processing can
be done without treating surrogates in any special way. Surrogates are
relevant only for UTF-16; by not using UTF-16 you should be free of
surrogate issues, except by having a silly unused area in character
different sort orders.
Let's fix the problem the right way.
Thank you, (Now stepping off the soap box)
Carl
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of Carl W. Brown
Sent: Friday, September 14, 2001 9:40 PM
To: [EMAIL PROTECTED]
Subject: CESU-8
Doug,
This was my solution long ago: fix the code that sorts in UCS-2
order so that
supplementary characters are sorted correctly. In case there is any
disagreement about this, sorting by UCS-2 order has been WRONG ever since
surrogates and UTF-16 were invented.
However, the database
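ICU4C exposes exactly this fix: u_strCompare takes a codePointOrder flag so that supplementary characters compare above every BMP character, even though lead surrogates are numerically low in code unit order. A sketch:

#include <stdio.h>
#include <unicode/ustring.h>

/* Code unit order puts the surrogate pair for U+10000 below U+FFFD;
   code point order puts it above, which is the correct sort. */
int main(void)
{
    UChar a[] = { 0xD800, 0xDC00, 0 };  /* U+10000 as a surrogate pair */
    UChar b[] = { 0xFFFD, 0 };          /* U+FFFD                      */

    int32_t unit_order  = u_strCompare(a, -1, b, -1, 0);  /* negative */
    int32_t point_order = u_strCompare(a, -1, b, -1, 1);  /* positive */
    printf("%d %d\n", (int)unit_order, (int)point_order);
    return 0;
}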
Ken,
I agree.
Anyone who was an original Unicode evangelist with the loose-leaf Unicode
1.0 binder in hand knows that if it were not for UCS-2, Unicode would
not be used today.
It was a risk for MS to use Unicode in NT.
It was a risk for MS to partially implement Unicode in Win95.
It was