RE: UNICODE BOMBER STRIKES AGAIN

2002-04-24 Thread Yves Arrouye
You can determine that that particular text is not legal UTF-32*, since there would be illegal code points in any of the three forms. If you exclude null code points, again heuristically, that also excludes UTF-8, and almost all non-Unicode encodings. That leaves UTF-16, 16BE, 16LE as the only

RE: browsers and unicode surrogates

2002-04-23 Thread Yves Arrouye
| I am surprised by the must only be used. It seems I am not | conforming by including a meta statement in the utf-16 HTML page. I | should either remove the statement or encode the HTML up to and | including that statement as ascii. I'll check on this. It doesn't make much sense to have

RE: SCSU compression (WAS: RE: Thai word list)

2002-04-19 Thread Yves Arrouye
This looks like a nice endorsement of SCSU: :D It saves 59% just as a charset, and it saves almost 20% in a system with a real compression. I am all for SCSU as a charset (after my tools can view it properly), but that was not the use there. OTOH there is gzip encoding in HTTP 1.1 :)

RE: Japanese and Chinese and ... word lists (WAS RE:Thai word list)

2002-04-18 Thread Yves Arrouye
Since we're on this topic, what about sources for other languages where a dictionary is needed to do word breaking? I'd be interested in Chinese and Japanese myself for instance, YA

RE: Thai word list

2002-04-18 Thread Yves Arrouye
If you can process SCSU, and would appreciate a 59% reduction in file size, try: http://home.adelphia.net/~dewell/th18057-scsu.txt(135,731 bytes) Not to knock down SCSU, but if it had been gzipped instead, the resulting file would be about half that size: 70,912 bytes. (The gzipped

RE: Default endianness of Unicode, or not

2002-04-10 Thread Yves Arrouye
The last time I read the Unicode standard UTF-16 was big endian unless a BOM was present, and that's what I expected from a UTF-16 converter. Conformance requirement C2 (TUS 3.0, p. 37) says: [And many other good references where TUS does *not* say that :)] OK, maybe in 2.0, or I made

RE: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Yves Arrouye
The reason for ICU's UTF-16 converter not trying to auto-detect the BOM is that this seems to be something that the _application_ has to decide, not the _converter_ that the application instantiates. This converter name is (currently) only a convenience alias for use the UTF-16 byte

RE: Default endianness of Unicode, or not

2002-04-10 Thread Yves Arrouye
D43 UTF-16 character encoding scheme: the Unicode CES that serializes a UTF-16 code unit sequence as a byte sequence in either big-endian or little-endian format. * In UTF-16 (the CES), the UTF-16 code unit sequence 004D 0430 4E8C D800 DF02 is serialized as FE FF 00 4D
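
The serialization in that definition can be sketched in a few lines of Python. This is only an illustration, not text from the standard: the helper name serialize_utf16 and its parameters are made up for the example, and the code unit sequence is the one quoted above.

```python
import struct

def serialize_utf16(code_units, big_endian=True, with_bom=True):
    """Serialize 16-bit code units to bytes in the chosen byte order.

    Hypothetical helper for illustration; U+FEFF is prepended as the BOM
    when with_bom is true.
    """
    fmt = ">H" if big_endian else "<H"
    units = ([0xFEFF] if with_bom else []) + list(code_units)
    return b"".join(struct.pack(fmt, unit) for unit in units)

# The code unit sequence quoted in the definition above.
units = [0x004D, 0x0430, 0x4E8C, 0xD800, 0xDF02]

# Big-endian with a BOM; the result starts with FE FF 00 4D.
print(serialize_utf16(units).hex(" ").upper())

# Little-endian without a BOM, for comparison.
print(serialize_utf16(units, big_endian=False, with_bom=False).hex(" ").upper())
```

The last two code units, D800 DF02, are a surrogate pair, so the five code units stand for four characters.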

RE: Default endianness of Unicode, or not

2002-04-10 Thread Yves Arrouye
And of course, I have been complaining about ICU's UTF-16 converter behavior, but glibc's one makes the same assumption that UTF-16 is in the local endianness: gabier% echo hello | uconv -t utf-16be | iconv -f utf-16 -t ascii iconv: illegal input sequence at position 0 gabier% So fixing one but
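
The same kind of byte-order mismatch can be reproduced with Python's codecs. This is a rough sketch only: the codecs stand in for uconv and iconv, it makes no claim about what either tool does internally, and the helper to_ascii_or_none is invented for the example.

```python
# BOM-less big-endian UTF-16 bytes, like the output of `uconv -t utf-16be`.
data = "hello\n".encode("utf-16-be")

# Naming the byte order explicitly recovers the text.
assert data.decode("utf-16-be") == "hello\n"

# Reading the same bytes with the opposite byte order yields unrelated
# code points rather than an immediate decoding error.
swapped = data.decode("utf-16-le")
print([f"U+{ord(ch):04X}" for ch in swapped])

def to_ascii_or_none(text):
    """Return ASCII bytes, or None if the text is not pure ASCII."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return None

# The byte-swapped text cannot be converted to ASCII.
print(to_ascii_or_none(swapped))
```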

RE: Default endianness of Unicode, or not

2002-04-10 Thread Yves Arrouye
So same semantics as before. Yep. The editorial committee wouldn't be doing its job right if it were changing the semantics of the standard. Agreed! Is there any mention that the non-BOM byte sequence is most significant byte first anywhere else? You know, for the newbies? Joshua 1.8

RE: MS/Unix BOM FAQ again (small fix)

2002-04-09 Thread Yves Arrouye
This is incorrect. Here is a summary of the meaning of those bytes at the start of text files with different Unicode encoding forms. beginning with bytes FE FF: - UTF-16 = big endian, omitted from contents beginning with bytes FF FE: - UTF-16 = little endian, omitted from contents
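
The two UTF-16 cases listed in that summary can be sketched as a small Python check. The function name sniff_utf16_bom is hypothetical, and only the two signatures quoted above are handled; what to assume when no BOM is present is left to the caller, since that is the point debated in these threads.

```python
def sniff_utf16_bom(data: bytes):
    """Return (byte_order, payload) for data labelled as UTF-16.

    byte_order is "big", "little", or None when no BOM is present.
    The BOM itself is omitted from the returned payload.
    """
    if data[:2] == b"\xfe\xff":
        return "big", data[2:]
    if data[:2] == b"\xff\xfe":
        return "little", data[2:]
    return None, data

order, payload = sniff_utf16_bom(b"\xfe\xff\x00\x4d")
print(order, payload.decode("utf-16-be"))
```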

RE: Collation - last character?

2002-03-22 Thread Yves Arrouye
TUS does not prevent anyone from putting noncharacter code points in Unicode strings. As a matter of fact, p. 23 of TUS 3.0 reads U+ is reserved for private program use as a sentinel or other signal. I would expect this to hold true for the noncharacters that were introduced later too. It may

RE: Collation - last character?

2002-03-19 Thread Yves Arrouye
Markus Scherer wrote: How about U+10? It is a non-character, which gives it a high (unassigned character) weight in the UCA. It is the highest code point = the last character. That is definitely not what I was looking for. It is an illegal codepoint, while I was looking for a

RE: Standard Conventions and euro

2002-03-02 Thread Yves Arrouye
The old currencies on the continent (German Mark, Dutch guilder, French franc) however use a period to divide the groups and a comma as a decimal sign. Some use a full stop as the thousands separator and some use a numeric (nonbreaking) space. Switzerland uses an apostrophe for the

RE: Standard Conventions and euro

2002-03-01 Thread Yves Arrouye
listing the way I wanted it. *nix systems that start with fr_FR and then allow you to define fr_FR-EURO or something really aren't much better; what if I want to deviate from the pre-defined locale in four or five ways instead of just one? They do not let you deviate from a pre-defined

RE: Standard Conventions and euro

2002-03-01 Thread Yves Arrouye
On Fri, 1 Mar 2002 11:26:42 +0100 , Marco Cimarosti wrote: French francs amounts were often written with a single decimal (because the smallest coin was 10 cents) No, the 5 centime coin remained in use (until the recent demise of the Franc, of course) and in any case it was very rare to

RE: Unicode page Web ring?

2002-03-01 Thread Yves Arrouye
My page is in Unicode, but does not mention Unicode except in the headers, and the headers are invisible unless you choose view source in your browser. My company service has been in UTF-8 since I joined in 1998. See http://www.realnames.com/. Another good example, but it's much more recent:

RE: ISO 3166 (country codes) Maintenance Agency Web pages move

2002-02-28 Thread Yves Arrouye
I'm confused. Do you mean meaningless identifiers? They look meaningless to me. House numbers in North America (and in France also, it seems) have a few bits of meaning: the least-significant (numeric) bit tells you which side of the street the house is on, and it's often the case that you

RE: Standard Conventions and euro

2002-02-28 Thread Yves Arrouye
Perhaps not as physical currency, but they sure do still exist in data, and will continue to exist in data until the Apocalypse. When is that scheduled to occur? [Alain] Very simple: « la semaine des quatre jeudis » (the week of the 4 Thursdays, as we say in French). And the exact day

RE: Unicode and end users

2002-02-16 Thread Yves Arrouye
If foo is a US-ASCII string, grep foo file will work fine with any US-ASCII-superset charset for which non-ASCII characters do not use bytes 0x80, including the hypothetical one I described, with no possibility of a false match. However grep fóó file will work only if the current shell

RE: This spoofing and security thread

2002-02-14 Thread Yves Arrouye
The very fact that most of them can be reduced to ASCII and people still find the resulting text useful and accurate to the original is a sign that the important characters in English are in ASCII. And all the standard transliterations - em-dashes -> --, c-cedilla -> c, e-acute, e-grave -> e,

RE: Unicode and end users

2002-02-14 Thread Yves Arrouye
UTF-8 should *never* contain the BOM. But as has been pointed out, it is common practice for Microsoft, and also for ICU's genrb tool, for example, which uses the BOM to autodetect the encoding. The more examples you see of that, the more people will use the BOM (now, can't we all use -*-

RE: This spoofing and security thread

2002-02-13 Thread Yves Arrouye
What do you mean? I've done work for Project Gutenberg, and looked at a number of books with thoughts of reducing them to ASCII. In my opinion, Windows-1252 has every character that most English books will need, Especially those books that you want to reduce to ASCII :-) YA

RE: UTF-16 is not Unicode

2002-02-12 Thread Yves Arrouye
An ideal interface should probably automatically and silently select Unicode (and its default UTF) whenever one or more of the characters in a document are not representable in the local encoding. I beg to differ. Silently doing such an unexpected change is guaranteed to confuse the user,

RE: Unicode and Security: Domain Names

2002-02-08 Thread Yves Arrouye
Moreover, the IDN WG documents are in final call, so if you have comments to make on them, now is the time. Visit http://www.i-d-n.net/ and sub-scribe (with a hyphen here so that listar does not interpret my post as a command!) to their mailing list (and read their archives) before doing so. The

RE: Unicode and Security: Domain Names

2002-02-08 Thread Yves Arrouye
Moreover, the IDN WG documents are in final call, so if you have comments to make on them, now is the time. Visit http://www.i-d-n.net/ and subscribe to their mailing list (and read their archives) before doing so. The documents in last call are: 1. Internationalizing Domain Names in

RE: Unicode and Security: Domain Names

2002-02-08 Thread Yves Arrouye
Are the actual domain names as stored in the DB going to be canonical normalized Unicode strings? It seems this would go a long way towards preventing spoofing ... Names will be stored according to a normalization called Nameprep. Read the Stringprep (general framework) and Nameprep (IDN

RE: Unicode and Security

2002-02-06 Thread Yves Arrouye
Well, nothing wrong with Unicode of course. Just means that there will need to be an option in your browser to reject any site without a digital certificate, and perhaps it will need to be turned on by default. So, nothing prevents sites running frauds from getting a certificate matching their

RE: ICU's uconv vs Linux iconv and UTF-8

2002-02-01 Thread Yves Arrouye
As part of the mystery of CJK encodings I notice that IBM's ICU's uconv and SuSE 6.4 Linux iconv differ as to the UTF-8 representation of table.euc. Both converters will round-trip with themselves and give a byte-exact copy of table.euc. Weirdly they differ in how they map '\' and '~' in

RE: ICU's uconv vs Linux iconv and UTF-8

2002-02-01 Thread Yves Arrouye
It is definitely a problem to try to interpret what any given label is supposed to be. The problem is that MIME labels and others are ambiguous, and are interpreted different ways on different systems. Still, in the meantime it does make sense to have EUC-JP associated to the most common

RE: Introducing the idea of a ROMAN VARIANT SELECTOR (was: Re: Proposing Fraktur)

2002-01-31 Thread Yves Arrouye
quite a lot of space. However, Fraktur is already encoded in the Mathematical whatever-it's-called block. This variant selector would mean that lots of characters can be displayed in two *different* ways. I'd prefer that Fraktur diacritics were added instead, and that the mathematical

RE: POSITIVELY MUST READ! Bytext is here!

2002-01-28 Thread Yves Arrouye
Well, I've seen cases where chat engines have converted ASCII into emoticon pictures at the wrong places... And sometimes you can't turn them off. Grumble. I couldn't give out sample code in MSIM using foo(c) for a function call w/o getting a cup of coffee after foo! YA

RE: [Very-OT] Re: ü

2002-01-23 Thread Yves Arrouye
Obviously (I advocate in French changing the spelling of common foreign words so that there would be more consistency). Le ouiquende? That would be pronounced wikãd... To respect the English pronunciation you would have to write it ouiquennde, which would still be a very odd spelling in

RE: RE: [Very-OT] Re: ü

2002-01-23 Thread Yves Arrouye
http://www.culture.fr/culture/dglf/dispositif-enrichissement.htm Thanks for the pointer. Though I can't find the exact sentence re: the substantive use, I found mél referred to as a symbol for messagerie électronique. I like

RE: Funky characters, Japanese, and Unicode

2002-01-18 Thread Yves Arrouye
1. I have a Geocities page now. I do not know what encoding Geocities uses, but I think it's unicode. What I did for the Japanese text on it was not think about encodings and just type it in with Microsoft's IME (and do some swearing at the IME at the process). And it comes out fine, for the

RE: Off topic: Whut in tarnation is Unicode?

2002-01-16 Thread Yves Arrouye
Re: elite-speak generator, I meant the one Edward Cherlin posted: L33t-5p34k, d00d! 1t'5 3v3rywh3r3. Try the L33t-5p34K Generator!!!### at http://www.geocities.com/mnstr_2000/translate.html but the link to the trusty mail archives was enough :) Thanks. YA -- Sailing is harder than flying.

RE: Off topic: Whut in tarnation is Unicode?

2002-01-15 Thread Yves Arrouye
Now if someone could resend this elite-speak converter link, it was great. Please... Thanks! YA -- Sailing is harder than flying. It's amazing that man learned how to sail first. -- Burt Rutan.

RE: C with bar for with

2001-12-02 Thread Yves Arrouye
It may even be a glyph variant of the w with forward slash... YA -Original Message- From: Stefan Persson [mailto:[EMAIL PROTECTED]] Sent: Sunday, December 02, 2001 3:19 AM To: [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: Re: C with bar for with - Original Message -

RE: Character encoding at the prompt

2001-10-25 Thread Yves Arrouye
But: setenv LC_ALL en_US.UTF-8 env LC_ALL=it echo giovedì, 25 ottobre 2001, 11:45:24 EDT I could not understand why I get the display of the letter ì in the en_US.UTF-8 Locale. My understanding was that the date command was generating the message in the Italian locale (default encoding

RE: normalize before map?

2001-10-04 Thread Yves Arrouye
[People were discussing whether one should do some case mappings before doing normalization, or the other way, and whether the case mapping can be naive or must account for what normalization will do/has done in order not to break assumptions that the resulting string is both case-folded and

RE: Currency symbols (was RE: Shape of the US Dollar Sign)

2001-10-01 Thread Yves Arrouye
About £ (L with two bars = Italian lira or Egypt/Cyprus pound) and £ (L with one bar = Pound Sterling or Irish punt), I think that the Unicode distinction is not valid because: [...] For these reason, I suggest that font designers ignore the distinction between U+00A3 (POUND SIGN) and

RE: DerivedAge.txt

2001-09-26 Thread Yves Arrouye
At the request of someone working with ICU, I regenerated a derived file that shows the age of Unicode characters -- when they came into Unicode. Does anyone think this might be useful to have in the UCD? It is definitely useful information that could go into UNIDATA. Here is a good use for

RE: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread Yves Arrouye
UTF-16 <-> wchar_t* Wait, be careful here. wchar_t is not an encoding. So.. in theory, you cannot convert between UTF-16 and wchar_t. You, however, can convert between UTF-16 and wchar_t* on Win32 since Microsoft declares UTF-16 as the encoding for wchar_t. And he can also do some

RE: UTF-8 on NT

2001-09-04 Thread Yves Arrouye
I'm also thinking of 3rd party UTF-8 support such as libutf8, IBM ICU. They seem no good supports on NT, what do you think? We are using ICU for all our Unicode needs, on NT, Windows 2000, and Unix, and it works perfectly well on all of these. YA

How are the UNIDATA derived files generated?

2001-08-29 Thread Yves Arrouye
Hi, I would like to know how the derived files that one can find in the UNIDATA folder are generated? I am trying to have IBM's ICU library support older versions of Unicode than the one it currently supports (3.0.something), specifically Unicode 2.1.x. ICU needs the following files:

RE: Locale codes (WAS: RE: RTF language codes)

2001-07-27 Thread Yves Arrouye
On Thu, Jul 26, 2001 at 01:04:29AM -0700, Yves Arrouye wrote: If you have a cross platform system you should use RFC 1766 style locales between systems and convert them to LCIDs on Windows. RFC 3066 was published in January. Check it out. http://www.ietf.org/rfc/rfc3066.txt

Locale codes (WAS: RE: RTF language codes)

2001-07-26 Thread Yves Arrouye
If you have a cross platform system you should use RFC 1766 style locales between systems and convert them to LCIDs on Windows. RFC 3066 was published in January. Check it out. http://www.ietf.org/rfc/rfc3066.txt YA

RE: Ethnologue 14 online

2001-07-24 Thread Yves Arrouye
After considerable and unfortunate delay, the new Ethnologue site, including the online version of the 14th Edition, is at last available to the public: http://www.ethnologue.com/home.asp. There are still refinements being made, but all the basics are there and working. Very nice!

RE: More about SCSU (was: Re: A UTF-8 based News Service)

2001-07-21 Thread Yves Arrouye
SCSU doesn't look very nice for me. The idea is OK but it's just too complicated. Various proposals of encodings differences or xors between consecutive characters are IMHO technically better: much simpler to implement and work as well. These differential schemes seem to be the way

RE: More about SCSU (was: Re: A UTF-8 based News Service)

2001-07-13 Thread Yves Arrouye
SCSU is also registered as an IANA charset, although you are unlikely to find raw SCSU text on the Internet, due to its use of control characters (bytes below 0x20). And what browser supports SCSU, and what is that browser's reach in terms of population? Because that's usually what

RE: Playing with Unicode (was: Re: UTF-17)

2001-06-25 Thread Yves Arrouye
A proposal needs a definition, though: UTF would mean Unicode Transformation Format utf would mean Unicode Terrible Farce untenable total figment? unable to focus? utf twisted form? YA

RE: UTF-17

2001-06-25 Thread Yves Arrouye
From: [EMAIL PROTECTED] Oh yeah, well, I can be more tongue-in-cheek than all of you. I've already implemented it. Quick, quick. Patent it and then open-source it. It will be unstoppable. YA

RE: UTF-17

2001-06-22 Thread Yves Arrouye
Isn't UTF-17 just a sarcastic comment on all of this UTF- discussion? YA

RE: converting ISO 8859-1 character set text to ASCII (128) character set

2001-06-20 Thread Yves Arrouye
We have a specific requirement of converting Latin-1 character set (ISO 8859-1) text to ASCII character set (a set of only 128 characters). Is there any special set of utilities available or service providers who can do that type of job. [I am assuming that your ascii table is

RE: UTFs, ACEs, and English horns

2001-06-18 Thread Yves Arrouye
Also check out the sites of the IETF IDN WG (http://www.ietf.org/html.charters/idn-charter.html, and http://www.i-d-n.net/) for more information than you may have wished for. Oops. Sorry, I only saw James's answer. You obviously read these. Well, I hope my English horns pages were new

RE: UTFs, ACEs, and English horns

2001-06-18 Thread Yves Arrouye
Also check out the sites of the IETF IDN WG (http://www.ietf.org/html.charters/idn-charter.html, and http://www.i-d-n.net/) for more information than you may have wished for. Except on English horns, that is; but then you may want to visit http://www.users.globalnet.co.uk/~gbrowne/geoff9.htm and

RE: Missing characters for Italian

2001-06-11 Thread Yves Arrouye
So my question is: is the superscript attribute essential in French to understand these abbreviations (as it is in Italian), or is it desirable but optional (as it is in English)? Not to understand them. While understanding is subjective, it is usually evident from the context that these

RE: Term Asian is not used properly on Computers and NET

2001-06-03 Thread Yves Arrouye
There are also terms like the West or Western (world, languages, civilization, etc) which have referents that are not completely west of the Greenwich Meridian, whose usage cannot be simply explained or justified by it. Every point can be found west (or east) of the Greenwich Meridian. Not all

RE: Metafont [was Re: Single Unicode Font]

2001-05-26 Thread Yves Arrouye
BTW, it seems that Metafont is a trademark of Addison Wesley publishing company ... Interesting. Maybe because they published the Metafont book (and its friend Metafont: the program) along with the rest of Knuth's Computers and Typesetting books? This is the bell that Metafont (as you

RE: search ignoring diacritics

2001-05-21 Thread Yves Arrouye
Peter - normalise both data and search string - delete / ignore all characters with general category Mn It worked well for us too. Someone mentioned to me once though that U+3099 and U+309A should be preserved in order not to change the meaning of words, and we do so. But
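
The recipe in that message can be sketched with Python's standard unicodedata module. Some choices here are assumptions rather than anything stated in the thread: canonical decomposition (NFD) is used as the normalisation step so that accents become separate Mn characters, the result is recomposed with NFC, and the name fold_diacritics is invented for the example.

```python
import unicodedata

# Combining kana voiced and semi-voiced sound marks, kept on purpose so
# that the meaning of Japanese words is not changed.
KEEP = {"\u3099", "\u309A"}

def fold_diacritics(text: str) -> str:
    """Drop nonspacing marks (general category Mn) except those in KEEP."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = [ch for ch in decomposed
            if unicodedata.category(ch) != "Mn" or ch in KEEP]
    return unicodedata.normalize("NFC", "".join(kept))

def matches(needle: str, haystack: str) -> bool:
    """Diacritic-insensitive substring search: fold both sides first."""
    return fold_diacritics(needle) in fold_diacritics(haystack)

print(fold_diacritics("café"), matches("cafe", "un café noir"))
```

Folding both the data and the search string with the same function is what makes the comparison symmetric.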

RE: About Kana folding

2001-05-18 Thread Yves Arrouye
Kenneth, Thanks for the explanations. So I'd suggest you be very careful when trying to do this kind of a folding. If it is just for surface text matching, the number of false positive matches would likely swamp the number of false negatives you'd be correcting. On the other hand, if you

About Kana folding

2001-05-17 Thread Yves Arrouye
Hi, If one were to need to pick Katakana versus Hiragana and fold one into the other (say to let people match a word or sentence in any of them), is there one that is preferable to the other? I think that some Katakana have no Hiragana equivalents, does that mean that it's always easier to go

RE: Help in a HURRY !!!!!!!!!!!!!!!!!!!!!!!

2001-05-15 Thread Yves Arrouye
To go with Lukas's Perl code, I'll provide a C version, not really tested either, with ICU, to give him a choice. No error checking etc., just to give the idea. If you want UTF-16 you'll need to use the macros in unicode/utf16.h to generate surrogate pairs properly. #include <stdio.h> #include

RE: UCD in XML

2001-05-15 Thread Yves Arrouye
I then tried my usual remedy: Bow in precisely the correct direction (359° 16' 32 N*) Adjust the bearing for declination (15° 26' E according to my chart of the bay), and try again compass in hand, maybe? ;-) YA

RE: Using hex numbers considered a geek attitude

2001-05-03 Thread Yves Arrouye
BTW, anybody knows how to input characters on Windows using the hex codepoint? I know it's good for my brain to do the exercise of going from hexadecimal to decimal, but it is still a pain to have to type ALT-DECIMAL when all I have in my book is hex. That would be a reason for providing the

RE: Byte Order Marks

2001-04-20 Thread Yves Arrouye
Then why is ICU mapping UTF-16 to UTF16_PlatformEndian and not UTF16_BigEndian? ICU does not do Unicode-signature or other encoding detection as part of a converter. When you get text from some protocol, you need to instantiate a converter according to what you know about the

RE: Byte Order Marks

2001-04-20 Thread Yves Arrouye
On Thu, Apr 19, 2001 at 06:24:47PM -0700, Markus Scherer wrote: On the other hand, if you get a file from your platform and it is in 16-bit Unicode, then you would appreciate the convenience of the auto-endian alias. But nothing should be spitting out platform-endian UTF-16! In the

RE: Byte Order Marks

2001-04-19 Thread Yves Arrouye
If you don't have any clue about the byte order, but you know it is UTF-16, then assume BE. Then why is ICU mapping UTF-16 to UTF16_PlatformEndian and not UTF16_BigEndian? I know that was a difference between ICU and my library, and when I asked this question a while ago I was told that despite

RE: How will software source code represent 21 bit unicode characters?

2001-04-17 Thread Yves Arrouye
Has this matter already been addressed anywhere? I think the C standard is in the process of making a decision about this. If memory helps, we will have escapes like '\u' and '\U'. I think they made the decision already. It is in the latest editions of the standards. The only

RE: Identifiers

2001-04-16 Thread Yves Arrouye
On Sun, Apr 15, 2001 at 08:10:55PM +0200, Florian Weimer wrote: Is it sufficient to mandate that all such identifiers MUST be KC- or KD-normalized? Does this guarantee print-and-enter round-trip compatibility? In general, the problem is unsolvable. There are several look-alikes

RE: Identifiers

2001-04-16 Thread Yves Arrouye
(I don't know if email addresses will be internationalized anytime soon. This is just an example. ;-) http://www.i-d-n.net/ They have a normalization process that may be used for e-mail someday. It explicitly does not do anything about similar looking glyphs. Read their list archive, I'm

RE: Identifiers

2001-04-16 Thread Yves Arrouye
There should be a method to overcome the source separation rule which might have saved certain identical characters from unification. - U+0048 LATIN CAPITAL LETTER H - U+0397 GREEK CAPITAL LETTER ETA - U+041D CYRILLIC CAPITAL LETTER EN - U+13BB CHEROKEE LETTER MI If

RE: Identifiers

2001-04-16 Thread Yves Arrouye
Florian, I respectfully suggest that you look up the various technical reports that accompany the Unicode standard. It looks like there may be certain confusion about characters and glyphs Oops, got tripped by my native French language. I didn't mean "certain" but "some". Do not conclude that

RE: Identifiers

2001-04-16 Thread Yves Arrouye
We have normalization similar to the one you're talking about in our Internet Keywords system. It is built on top of NFKC. It is good for users, but then it is also very specific. Details, details! (Or do you consider that stuff a proprietary advantage?) I don't really. That would

RE: Identifiers

2001-04-15 Thread Yves Arrouye
Is it sufficient to mandate that all such identifiers MUST be KC- or KD-normalized? Does this guarantee print-and-enter round-trip compatibility? It depends on the accuracy of both the printer and the reader. So I'd say no. People won't necessarily make the difference between a middle dot and

RE: Sun's Java encodings vs IANA's character set registry

2001-04-12 Thread Yves Arrouye
I should not be surprised by your statement, but I am. It is distressing to think that something that by definition should not be rocket science -- repertoires of abstract characters mapped directly to specific bit patterns -- would be subject to such haphazard definition and even more haphazard

RE: Digits in Unicode Names

2001-04-06 Thread Yves Arrouye
What would really be nice, is for glibc-2.2 or any other unicode enabled library to display unicode characters,etc by just using the "escape" sequence \u, where X represents a hexadecimal value.. Make that up to 6 Xs. One of the problems of such escapes when used in code, a la ISO

RE: locale files....

2001-03-30 Thread Yves Arrouye
sorry. Intel platform running Redhat Linux 7.0.. Oops, and regarding your questions about locale files on Linux. They follow the POSIX format and can easily be modified once you get them in source form along with the localedef util. YA

Re: UTF8 vs. Unicode (UTF16) in code

2001-03-09 Thread Yves Arrouye
Since the U in UTF stands for Unicode, UTF-32 cannot represent more than what Unicode encodes, which is 1+ million code points. Otherwise, you're talking about UCS-4. But I thought that one of the latest revs of ISO 10646 explicitly specified that UCS-4 will never encode more
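
The "1+ million" figure can be checked with a little arithmetic. The sketch below relies on two facts that are not in the message itself: Unicode's code space is 17 planes of 65,536 code points each, and Python exposes the highest code point as sys.maxunicode.

```python
import sys

PLANES = 17                      # plane 0 (BMP) plus 16 supplementary planes
CODE_POINTS_PER_PLANE = 0x10000  # 65,536 code points per plane

total = PLANES * CODE_POINTS_PER_PLANE
highest = total - 1

print(f"{total:,} code points, highest is U+{highest:X}")

# Python's own limit agrees, and the highest code point fits in one
# 32-bit UTF-32 code unit.
assert sys.maxunicode == highest
print(chr(highest).encode("utf-32-be").hex(" ").upper())
```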

RE: New Name Registry Using Unicode

2000-09-29 Thread Yves Arrouye
The people doing this are www.xns.org and www.onename.com. One needs to visit their sites and read their "white papers" to get a full picture of what the purpose is and how they are using the standards. Note that there are other naming initiatives, including the one driven by my company,

RE: Unicode in VFAT file system

2000-07-20 Thread Yves Arrouye
Recently I've had the dubious pleasure of delving into the details of the VFAT file system. For long file names, I thought it used UCS-2, but in looking at the data with a disk editor, it appears to be byte-swapping (little endian). I thought that UCS-2 was by definition big endian, thus