You can determine that that particular text is not legal UTF-32*,
since there would be illegal code points in any of the three forms. If
you exclude null code points, again heuristically, that also excludes
UTF-8 and almost all non-Unicode encodings. That leaves UTF-16,
UTF-16BE, and UTF-16LE as the only remaining candidates.
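A sketch of those heuristics in plain C99 (the helper names are mine,
not from any library):

#include <stddef.h>
#include <stdint.h>

/* Returns nonzero if buf could be UTF-32 in the given endianness,
   i.e. every 4-byte unit decodes to a legal code point. */
static int could_be_utf32(const uint8_t *buf, size_t len, int bigEndian) {
    if (len % 4 != 0) return 0;
    for (size_t i = 0; i < len; i += 4) {
        uint32_t c = bigEndian
            ? ((uint32_t)buf[i] << 24) | ((uint32_t)buf[i+1] << 16)
              | ((uint32_t)buf[i+2] << 8) | buf[i+3]
            : ((uint32_t)buf[i+3] << 24) | ((uint32_t)buf[i+2] << 16)
              | ((uint32_t)buf[i+1] << 8) | buf[i];
        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;
    }
    return 1;
}

/* NUL bytes are vanishingly rare in UTF-8 or legacy text, but occur
   in almost every UTF-16/UTF-32 run of ASCII-range characters. */
static int has_nul_byte(const uint8_t *buf, size_t len) {
    for (size_t i = 0; i < len; ++i)
        if (buf[i] == 0) return 1;
    return 0;
}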
| I am surprised by the "must only be used" wording. It seems I am not
| conforming by including a meta statement in the UTF-16 HTML page. I
| should either remove the statement or encode the HTML up to and
| including that statement as ASCII. I'll check on this.
It doesn't make much sense to have
This looks like a nice endorsement of SCSU:
:D
It saves 59% just as a charset,
and it saves almost 20% in a system with real compression.
I am all for SCSU as a charset (after my tools can view it properly), but
that was not the use there. OTOH there is gzip encoding in HTTP 1.1 :)
Since we're on this topic, what about sources for other languages where a
dictionary is needed to do word breaking? I'd be interested in Chinese and
Japanese myself for instance,
YA
If you can process SCSU, and would appreciate a 59% reduction in file
size, try:
http://home.adelphia.net/~dewell/th18057-scsu.txt (135,731 bytes)
Not to knock down SCSU, but if it had been gzipped instead, the resulting
file would be about half that size: 70,912 bytes. (The gzipped
The last time I read the Unicode standard, UTF-16 was big endian
unless a BOM was present, and that's what I expected from a UTF-16
converter.
Conformance requirement C2 (TUS 3.0, p. 37) says:
[And many other good references where TUS does *not* say that :)]
OK, maybe in 2.0, or I made
The reason for ICU's UTF-16 converter not trying to auto-detect the BOM
is that this seems to be something that the _application_ has to decide,
not the _converter_ that the application instantiates.
This converter name is (currently) only a convenience alias for use
with the UTF-16 byte
D43 UTF-16 character encoding scheme: the Unicode CES that serializes
a UTF-16 code unit sequence as a byte sequence in either big-endian or
little-endian format.
* In UTF-16 (the CES), the UTF-16 code unit sequence
004D 0430 4E8C D800 DF02 is serialized as
FE FF 00 4D 04 30 4E 8C D8 00 DF 02.
And of course, I have been complaining about ICU's UTF-16 converter
behavior, but glibc's makes the same assumption that UTF-16 is in the
local endianness:
gabier% echo hello | uconv -t utf-16be | iconv -f utf-16 -t ascii
iconv: illegal input sequence at position 0
gabier%
So fixing one but
So same semantics as before.
Yep. The editorial committee wouldn't be doing its job right
if it were changing the semantics of the standard.
Agreed! Is there any mention that the non-BOM byte sequence is most
significant byte first anywhere else? You know, for the newbies?
Joshua 1.8
This is incorrect. Here is a summary of the meaning of those bytes at
the start of text files with different Unicode encoding forms.
Beginning with bytes FE FF:
- UTF-16 = big endian, BOM omitted from contents
Beginning with bytes FF FE:
- UTF-16 = little endian, BOM omitted from contents
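A minimal sniffing sketch along those lines (the enum and function
names are mine; a fuller detector would also check the UTF-32
signatures 00 00 FE FF / FF FE 00 00 and the UTF-8 signature EF BB BF):

#include <stddef.h>
#include <stdint.h>

typedef enum { ENC_UNKNOWN, ENC_UTF16BE, ENC_UTF16LE } Detected;

/* Checks the first two bytes for a UTF-16 byte order mark. The BOM is
   a signature, not content, so the caller should skip *bomLen bytes. */
static Detected sniff_utf16_bom(const uint8_t *buf, size_t len,
                                size_t *bomLen) {
    *bomLen = 0;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bomLen = 2;
        return ENC_UTF16BE;
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bomLen = 2;
        return ENC_UTF16LE;
    }
    return ENC_UNKNOWN;  /* no signature: default to big endian */
}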
TUS does not prevent anyone from putting noncharacter code points in
Unicode strings. As a matter of fact, p. 23 of TUS 3.0 reads: U+FFFF is
reserved for private program use as a sentinel or other signal. I would
expect this to hold true for the noncharacters that were introduced
later too. It may
Markus Scherer wrote:
How about U+10FFFF?
It is a non-character, which gives it a high (unassigned
character) weight in the UCA. It is the highest code point =
the last character.
That is definitely not what I was looking for. It is an illegal codepoint,
while I was looking for a
The old currencies on the continent (German Mark, Dutch guilder, French
franc) however use a period to divide the groups and a comma as a
decimal sign.
Some use a full stop as the thousands separator and some use a numeric
(nonbreaking) space. Switzerland uses an apostrophe for the thousands
separator.
listing the way I wanted it. *nix systems that start with fr_FR and
then allow you to define fr_FR-EURO or something really aren't much
better; what if I want to deviate from the pre-defined locale in four or
five ways instead of just one?
They do not let you deviate from a pre-defined
On Fri, 1 Mar 2002 11:26:42 +0100, Marco Cimarosti wrote:
French franc amounts were often
written with a single decimal (because the smallest coin was 10 cents).
No, the 5 centime coin remained in use (until the recent demise of the
Franc, of course) and in any case it was very rare to
My page is in Unicode, but does not mention Unicode except in the headers,
and the headers are invisible unless you choose view source in your
browser
My company's service has been in UTF-8 since I joined in 1998. See
http://www.realnames.com/. Another good example, but it's much more
recent:
I'm confused. Do you mean meaningless identifiers? They look
meaningless to me. House numbers in North America (and in France
also, it seems) have a few bits of meaning: the least-significant
(numeric) bit tells you which side of the street the house is on,
and it's often the case that you
Perhaps not as physical currency, but they sure do still exist in data,
and will continue to exist in data until the Apocalypse.
When is that scheduled to occur?
[Alain] Very simple: « la semaine des quatre jeudis » (the week of the 4
Thursdays, as we say in French).
And the exact day
If foo is a US-ASCII string, grep foo file will work fine with any
US-ASCII-superset charset for which non-ASCII characters do not use
bytes below 0x80, including the hypothetical one I described, with no
possibility of a false match. However, grep fóó file will work only
if the current shell
The very fact that most of them can be reduced to ASCII and people
still find the resulting text useful and accurate to the original is a
sign that the important characters in English are in ASCII. And all the
standard transliterations: em-dash -> --, c-cedilla -> c, e-acute and
e-grave -> e,
UTF-8 should *never* contain the BOM.
But as has been pointed out, it is common practice for Microsoft, and
also for ICU's genrb tool, for example, which uses the BOM to
autodetect the encoding. The more examples you see of that, the more
people will use the BOM (now, can't we all use -*-
What do you mean? I've done work for Project Gutenberg, and looked at a
number of books with thoughts of reducing them to ASCII. In my opinion,
Windows-1252 has every character that most English books will need.
Especially those books that you want to reduce to ASCII :-)
YA
An ideal interface should probably automatically and silently select
Unicode (and its default UTF) whenever one or more of the characters in
a document are not representable in the local encoding.
I beg to differ. Silently doing such an unexpected change is guaranteed to
confuse the user,
Moreover, the IDN WG documents are in final call, so if you have comments to
make on them, now is the time. Visit http://www.i-d-n.net/ and sub-scribe
(with a hyphen here so that listar does not interpret my post as a command!)
to their mailing list (and read their archives) before doing so.
The documents in last call are:
1. Internationalizing Domain Names in Applications (IDNA)
Are the actual domain names as stored in the DB going to be canonical
normalized Unicode strings? It seems this would go a long way towards
preventing spoofing ...
Names will be stored according to a normalization called Nameprep. Read
the Stringprep (general framework) and Nameprep (IDN profile) documents.
Well, nothing wrong with Unicode of course. Just means that there will
need
to be an option in your browser to reject any site without a digital
certificate, and perhaps it will need to be turned on by default. So,
Nothing prevents sites running frauds from getting a certificate matching their
As part of the mystery of CJK encodings I notice that IBM ICU's uconv
and SuSE 6.4 Linux iconv differ as to the UTF-8 representation of
table.euc.
Both converters will round-trip with themselves and give a byte-exact
copy of table.euc.
Weirdly, they differ in how they map '\' and '~' in
It is definitely a problem to try to interpret what any given label is
supposed to be. The problem is that MIME labels and others are
ambiguous, and are interpreted different ways on different systems.
Still, in the meantime it does make sense to have EUC-JP associated
with the most common mapping.
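For anyone who wants to see such differences for themselves, here is a
sketch using ICU's C converter API (the function name is mine; error
handling kept minimal):

#include <stdio.h>
#include <unicode/ucnv.h>

/* Print what a converter maps the bytes 0x5C ('\') and 0x7E ('~') to;
   handy for comparing, say, "EUC-JP" across converter libraries. */
static void show_mapping(const char *cnvName) {
    UErrorCode status = U_ZERO_ERROR;
    UConverter *cnv = ucnv_open(cnvName, &status);
    if (U_FAILURE(status)) {
        printf("cannot open %s\n", cnvName);
        return;
    }
    const char bytes[] = { 0x5C, 0x7E };
    for (int i = 0; i < 2; ++i) {
        UChar out[4];
        status = U_ZERO_ERROR;
        int32_t n = ucnv_toUChars(cnv, out, 4, &bytes[i], 1, &status);
        if (U_SUCCESS(status) && n > 0)
            printf("%s: 0x%02X -> U+%04X\n",
                   cnvName, (unsigned char)bytes[i], out[0]);
    }
    ucnv_close(cnv);
}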
quite a lot of space. However, Fraktur is already encoded in the
Mathematical whatever-it's-called block. This variant selector would mean
that lots of characters can be displayed in two *different* ways. I'd
prefer
that Fraktur diacritics were added instead, and that the mathematical
Well, I've seen cases where chat engines have
converted ASCII into emoticon pictures at the wrong
places...
And sometimes you can't turn them off. Grumble. I couldn't give out sample
code in MSIM using foo(c) for a function call w/o getting a cup of coffee
after foo!
YA
Obviously (I advocate in French changing the spelling of common foreign
words so that there would be more consistency).
Le ouiquende?
That would be pronounced wikãd... To respect the English pronunciation
you would have to write it ouiquennde, which would still be a very odd
spelling in French.
http://www.culture.fr/culture/dglf/dispositif-enrichissement.htm
Thanks for the pointer. Though I can't find the exact sentence re: the
substantive use, I found mél referred to as a symbol for messagerie
électronique (electronic mail). I like
1. I have a Geocities page now. I do not know what encoding Geocities
uses, but I think it's Unicode. What I did for the Japanese text on it
was not think about encodings and just type it in with Microsoft's IME
(and do some swearing at the IME in the process). And it comes out
fine, for the
Re: elite-speak generator, I meant the one Edward Cherlin posted:
L33t-5p34k, d00d! 1t'5 3v3rywh3r3. Try the L33t-5p34K Generator!!!### at
http://www.geocities.com/mnstr_2000/translate.html
but the link to the trusty mail archives was enough :) Thanks.
YA
--
Sailing is harder than flying. It's amazing that man learned how to
sail first. -- Burt Rutan.
Now if someone could resend this elite-speak converter link, it was great.
Please...
Thanks!
YA
--
Sailing is harder than flying. It's amazing that man learned how to sail
first. -- Burt Rutan.
It may even be a glyph variant of the w with forward slash...
YA
-----Original Message-----
From: Stefan Persson [mailto:[EMAIL PROTECTED]]
Sent: Sunday, December 02, 2001 3:19 AM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: C with bar for with
- Original Message -
But:
setenv LC_ALL en_US.UTF-8
env LC_ALL=it date
giovedì, 25 ottobre 2001, 11:45:24 EDT
I could not understand why I get the display of the letter ì in the
en_US.UTF-8 locale. My understanding was that the date command was
generating the message in the Italian locale (default encoding
[People were discussing whether one should do some case mappings before
doing normalization, or the other way around, and whether the case
mapping can be naive or must account for what normalization will do/has
done in order not to break assumptions that the resulting string is
both case-folded and normalized.]
About ₤ (L with two bars = Italian lira or Egypt/Cyprus pound) and £
(L with one bar = Pound Sterling or Irish punt), I think that the
Unicode distinction is not valid because:
[...]
For these reasons, I suggest that font designers ignore the distinction
between U+00A3 (POUND SIGN) and U+20A4 (LIRA SIGN).
At the request of someone working with ICU, I regenerated a derived file
that shows the age of Unicode characters -- when they came into Unicode.
Does anyone think this might be useful to have in the UCD?
It is definitely useful information that could go into UNIDATA. Here is a
good use for
UTF-16 - wchar_t*
Wait, be careful here. wchar_t is not an encoding. So, in theory, you
cannot convert between UTF-16 and wchar_t. You can, however, convert
between UTF-16 and wchar_t* ON Win32, since Microsoft declares UTF-16
to be the encoding for wchar_t. And he can also do some
I'm also thinking of 3rd party UTF-8 support such as libutf8, IBM ICU.
There seems to be no good support on NT; what do you think?
We are using ICU for all our Unicode needs, on NT, Windows 2000, and
Unix, and it works perfectly well on all of these.
YA
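For what it's worth, ICU already wraps this up: u_strToWCS() and
u_strFromWCS() convert between UChar (UTF-16) and wchar_t whatever the
platform's wchar_t looks like. A sketch, with error handling reduced to
a return code:

#include <wchar.h>
#include <unicode/ustring.h>

/* Convert a NUL-terminated ICU UTF-16 string to wchar_t*. On Win32,
   where wchar_t is 16-bit and declared to hold UTF-16, this is
   essentially a copy; on Unixes with 32-bit wchar_t, ICU converts. */
static int to_wcs(wchar_t *dst, int32_t dstCap, const UChar *src) {
    UErrorCode status = U_ZERO_ERROR;
    int32_t dstLen = 0;
    u_strToWCS(dst, dstCap, &dstLen, src, -1, &status);
    return U_SUCCESS(status) ? (int)dstLen : -1;
}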
Hi,
I would like to know how the derived files that one can find in the
UNIDATA folder are generated. I am trying to have IBM's ICU library
support older versions of Unicode than the one it currently supports
(3.0.something), specifically Unicode 2.1.x.
ICU needs the following files:
On Thu, Jul 26, 2001 at 01:04:29AM -0700, Yves Arrouye wrote:
If you have a cross platform system you should use RFC 1766
style locales
between systems and convert them to LCIDs on Windows.
RFC 3066 was published in January. Check it out.
http://www.ietf.org/rfc/rfc3066.txt
YA
After considerable and unfortunate delay, the new Ethnologue site,
including the online version of the 14th Edition, is at last
available to
the public: http://www.ethnologue.com/home.asp. There are
still refinements
being made, but all the basics are there and working.
Very nice!
SCSU doesn't look very nice to me. The idea is OK but it's just too
complicated. Various proposed encodings of differences or XORs between
consecutive characters are IMHO technically better: much simpler to
implement, and they work as well.
These differential schemes seem to be the way to go.
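For the curious, the differential idea really is tiny to implement.
A sketch over code point arrays (the varint or entropy coding of the
deltas, where the actual savings come from, is left out):

#include <stddef.h>
#include <stdint.h>

/* Replace each code point with its difference from the previous one.
   Runs of characters from the same alphabet become runs of small
   numbers, which a generic compressor then squeezes well. */
static void delta_encode(const uint32_t *in, int32_t *out, size_t n) {
    uint32_t prev = 0;
    for (size_t i = 0; i < n; ++i) {
        out[i] = (int32_t)(in[i] - prev);
        prev = in[i];
    }
}

/* Exact inverse: unsigned wraparound makes the round trip lossless. */
static void delta_decode(const int32_t *in, uint32_t *out, size_t n) {
    uint32_t prev = 0;
    for (size_t i = 0; i < n; ++i) {
        prev += (uint32_t)in[i];
        out[i] = prev;
    }
}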
SCSU is also registered as an IANA charset, although you are unlikely
to find raw SCSU text on the Internet, due to its use of control
characters (bytes below 0x20).
And what browser supports SCSU, and what is that browser's reach in
terms of population? Because that's usually what
A proposal needs a definition, though:
UTF would mean Unicode Transformation Format
utf would mean Unicode Terrible Farce
untenable total figment?
unable to focus?
utf twisted form?
YA
From: [EMAIL PROTECTED]
Oh yeah, well, I can be more tongue-in-cheek than all of you. I've
already
implemented it.
Quick, quick. Patent it and then open-source it. It will be unstoppable.
YA
Isn't UTF-17 just a sarcastic comment on all of this UTF- discussion?
YA
We have a specific requirement of converting Latin-1 character set
(ISO 8859-1) text to the ASCII character set (a set of only 128
characters). Are there any special utilities available, or service
providers who can do that type of job?
[I am assuming that your ASCII table is the plain 7-bit US-ASCII set.]
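Since the question was about utilities: GNU iconv can already do
iconv -f ISO-8859-1 -t ASCII//TRANSLIT. If you have to roll your own,
a minimal fold might start like the sketch below (deliberately
incomplete; a real table covers all of 0xC0-0xFF, including digraphs
such as 0xC6 AE -> "AE"):

/* Map a few common Latin-1 letters to their nearest ASCII letter. */
static char fold_latin1(unsigned char c) {
    if (c < 0x80) return (char)c;            /* already ASCII */
    if (c >= 0xC0 && c <= 0xC5) return 'A';  /* À Á Â Ã Ä Å */
    if (c == 0xC7) return 'C';               /* Ç */
    if (c >= 0xC8 && c <= 0xCB) return 'E';  /* È É Ê Ë */
    if (c >= 0xE0 && c <= 0xE5) return 'a';  /* à á â ã ä å */
    if (c == 0xE7) return 'c';               /* ç */
    if (c >= 0xE8 && c <= 0xEB) return 'e';  /* è é ê ë */
    if (c >= 0xEC && c <= 0xEF) return 'i';  /* ì í î ï */
    if (c >= 0xF2 && c <= 0xF6) return 'o';  /* ò ó ô õ ö */
    if (c >= 0xF9 && c <= 0xFC) return 'u';  /* ù ú û ü */
    return '?';                              /* no simple equivalent */
}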
Also check out the sites of the IETF IDN WG
(http://www.ietf.org/html.charters/idn-charter.html, and
http://www.i-d-n.net/) for more information than you may have wished
for.
Oops. Sorry, I only saw James's answer. You obviously read these. Well,
I hope my English horn pages were new
Also check out the sites of the IETF IDN WG
(http://www.ietf.org/html.charters/idn-charter.html, and
http://www.i-d-n.net/) for more information than you may have wished for.
Except on English horns, that is; but then you may want to visit
http://www.users.globalnet.co.uk/~gbrowne/geoff9.htm and
So my question is: is the superscript attribute essential in French to
understand these abbreviations (as it is in Italian), or is
it desirable but
optional (as it is in English)?
Not to understand them. While understanding is subjective, it is usually
evident from the context that these
There are also terms like the West or Western (world, languages,
civilization, etc) which have referents that are not completely west of
the Greenwich Meridian, whose usage cannot be simply explained or
justified by it.
Every point can be found west (or east) of the Greenwich Meridian. Not all
BTW, it seems that Metafont is a trademark of Addison Wesley
publishing
company ...
Interesting. Maybe because they published the Metafont book (and its
friend Metafont: The Program) along with the rest of Knuth's Computers
and Typesetting books? This is the bell that Metafont (as you
Peter> normalise both data and search string - delete / ignore all
Peter> characters with general category Mn
It worked well for us too. Someone mentioned to me once though that
U+3099 and U+309A should be preserved in order not to change the
meaning of words, and we do so.
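A sketch of that fold using ICU's C API; note it uses the unorm2 API
from much later ICU versions than this thread, and it keeps the
U+3099/U+309A exception mentioned above:

#include <unicode/uchar.h>
#include <unicode/unorm2.h>
#include <unicode/utf16.h>

/* NFD-normalize, then drop combining marks (general category Mn),
   except the kana voicing marks U+3099/U+309A, which change the
   meaning of Japanese words. Fixed buffer, minimal error handling. */
static int32_t fold_marks(const UChar *src, int32_t srcLen,
                          UChar *dst, int32_t dstCap) {
    UErrorCode status = U_ZERO_ERROR;
    const UNormalizer2 *nfd = unorm2_getNFDInstance(&status);
    UChar tmp[256];
    int32_t n = unorm2_normalize(nfd, src, srcLen, tmp, 256, &status);
    if (U_FAILURE(status)) return -1;
    int32_t out = 0;
    for (int32_t i = 0; i < n && out < dstCap - 1;) {
        UChar32 c;
        U16_NEXT(tmp, i, n, c);
        if (u_charType(c) == U_NON_SPACING_MARK
                && c != 0x3099 && c != 0x309A)
            continue;  /* drop the mark */
        U16_APPEND_UNSAFE(dst, out, c);
    }
    return out;
}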
Kenneth,
Thanks for the explanations.
So I'd suggest you be very careful when trying to do this kind of
folding. If it is just for surface text matching, the number of false
positive matches would likely swamp the number of false negatives
you'd be correcting.
On the other hand, if you
Hi,
If one were to need to pick Katakana versus Hiragana and fold one into
the other (say, to let people match a word or sentence in either of
them), is there one that is preferable to the other? I think that some
Katakana have no Hiragana equivalents; does that mean that it's always
easier to go
To go with Lukas's Perl code, I'll provide a C version, not really tested
either, with ICU, to give him a choice. No error checking etc., just to give
the idea. If you want UTF-16 you'll need to use the macros in
unicode/utf16.h to generate surrogate pairs properly.
#include <stdio.h>
#include <unicode/utf16.h>
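And since kana folding came up just above: for the main kana blocks,
each Katakana letter sits exactly 0x60 above its Hiragana counterpart,
so a minimal Katakana-to-Hiragana fold is one range check. A sketch
(Katakana-only letters such as U+30F7..U+30FA VA/VI/VE/VO and the
length mark U+30FC need their own policy):

#include <unicode/utypes.h>

/* Fold a Katakana letter to its Hiragana counterpart, leaving
   everything else (including Katakana-only letters) untouched. */
static UChar32 kata_to_hira(UChar32 c) {
    if (c >= 0x30A1 && c <= 0x30F6)  /* KATAKANA SMALL A .. SMALL KE */
        return c - 0x60;             /* -> the Hiragana counterpart */
    return c;
}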
I then tried my usual remedy: Bow in precisely the correct
direction (359° 16' 32" N*)
Adjust the bearing for declination (15° 26' E according to my chart of
the bay), and try again compass in hand, maybe? ;-)
YA
BTW, does anybody know how to input characters on Windows using the hex
code point? I know it's good for my brain to do the exercise of going
from hexadecimal to decimal, but it is still a pain to have to type
ALT-DECIMAL when all I have in my book is hex. That would be a reason
for providing the
Then why is ICU mapping UTF-16 to UTF16_PlatformEndian and not
UTF16_BigEndian?
ICU does not do Unicode-signature or other encoding detection
as part of a converter. When you get text from some protocol,
you need to instantiate a converter according to what you
know about the
On Thu, Apr 19, 2001 at 06:24:47PM -0700, Markus Scherer wrote:
On the other hand, if you get a file from your platform and
it is in 16-bit Unicode, then you would appreciate the
convenience of the auto-endian alias.
But nothing should be spitting out platform-endian UTF-16! In the
If you don't have any clue about the byte order, but you know it is
UTF-16, then assume BE.
Then why is ICU mapping UTF-16 to UTF16_PlatformEndian and not
UTF16_BigEndian? I know that was a difference between ICU and my library,
and when I asked this question a while ago I was told that despite
Has this matter already been addressed anywhere?
I think the C standard is in the process of making a decision about
this. If memory serves, we will have escapes like \uXXXX and
\UXXXXXXXX.
I think they made the decision already. It is in the latest editions of the
standards. The only
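For the record, the syntax that made it into C99 is the universal
character name: \u with four hex digits, \U with eight. A compile-time
sketch (whether the glyphs actually display depends on the execution
charset and your terminal):

#include <stdio.h>

int main(void) {
    printf("\u00E9\n");      /* e with acute accent, U+00E9 */
    printf("\U0001D11E\n");  /* musical symbol G clef, U+1D11E */
    return 0;
}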
On Sun, Apr 15, 2001 at 08:10:55PM +0200, Florian Weimer wrote:
Is it sufficient to mandate that all such identifiers
MUST be KC- or
KD-normalized? Does this guarantee print-and-enter round-trip
compatibility?
In general, the problem is unsolvable. There are several look-alikes
(I don't know if email addresses will be internationalized anytime
soon. This is just an example. ;-)
http://www.i-d-n.net/
They have a normalization process that may be used for e-mail someday.
It explicitly does not do anything about similar-looking glyphs. Read
their list archive, I'm
There should be a method to overcome the source separation rule, which
might have saved certain identical characters from unification.
- U+0048 LATIN CAPITAL LETTER H
- U+0397 GREEK CAPITAL LETTER ETA
- U+041D CYRILLIC CAPITAL LETTER EN
- U+13BB CHEROKEE LETTER MI
If
Florian, I respectfully suggest that you look up the various technical
reports that accompany the Unicode standard. It looks like there may be
certain confusion about characters and glyphs.
Oops, got tripped by my native French language. I didn't mean "certain" but
"some". Do not conclude that
We have normalization similar to
the one you're talking about in our Internet Keywords
system. It is built on
top of NFKC. It is good for users, but then it is also very
specific.
Details, details! (Or do you consider that stuff a proprietary
advantage?)
I don't really. That would
Is it sufficient to mandate that all such identifiers MUST be KC- or
KD-normalized? Does this guarantee print-and-enter round-trip
compatibility?
It depends on the accuracy of both the printer and the reader. So I'd
say no. People won't necessarily tell the difference between a middle
dot and
I should not be surprised by your statement, but I am. It is
distressing to think that something that by definition should not be
rocket science -- repertoires of abstract characters mapped directly to
specific bit patterns -- would be subject to such haphazard definition
and even more haphazard
What would really be nice is for glibc-2.2 or any other Unicode-enabled
library to display Unicode characters, etc., by just using the "escape"
sequence \uXXXX, where each X represents a hexadecimal digit.
Make that up to 6 Xs. One of the problems of such escapes when used in
code, a la ISO
Sorry. Intel platform running Red Hat Linux 7.0.
Oops, and regarding your questions about locale files on Linux: they
follow the POSIX format and can easily be modified once you get them in
source form, along with the localedef utility.
YA
Since the U in UTF stands for Unicode, UTF-32 cannot represent more
than what Unicode encodes, which is 1+ million code points. Otherwise,
you're talking about UCS-4. But I thought that one of the latest revs
of ISO 10646 explicitly specified that UCS-4 will never encode more
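To put a number on "1+ million": the Unicode codespace is 17 planes of
65,536 code points each, 0x0000..0x10FFFF, i.e. 1,114,112 values, minus
the surrogate range that UTF-32 may not contain. A trivial check:

#include <stdint.h>

/* Is c a code point that may legally appear in UTF-32? */
static int is_utf32_code_point(uint32_t c) {
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}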
The people doing this are www.xns.org and www.onename.com.
One needs to
visit their sites and read their "white papers" to get a full
picture of
what the purpose is and how they are using the standards.
Note that there are other naming initiatives, including the one driven by my
company,
Recently I've had the dubious pleasure of delving into the details of
the VFAT file system. For long file names, I thought it used UCS-2,
but in looking at the data with a disk editor, it appears to be
byte-swapped (little endian). I thought that UCS-2 was by definition
big endian, thus
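If it helps anyone poking at the same structures: VFAT long-name
entries store their 16-bit code units little-endian, like the rest of
the on-disk FAT structures, and reading them byte by byte keeps the
code correct on any host. A sketch:

#include <stdint.h>

/* Read one little-endian 16-bit code unit from a long-name entry. */
static uint16_t read_u16le(const uint8_t *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}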