Philippe Verdy [EMAIL PROTECTED] writes:
[...]
This was later amended in an errata for XML 1.0 which now says that
the list of code points whose use is *discouraged* (but explicitly
*not* forbidden) for the Char production is now:
[...]
Ugh, it's a mess...
IMHO Unicode is partially to blame,
Title: RE: When to validate?
Antoine Leca wrote:
As a result, your strings are likely to be some structures.
Then, it is pretty easy to add some s_valid flag, and you are done.
Is that a proven technique? I'd say not. The flag would only be valid for as long as the string is not changed. You
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Arcane Jill responded:
Windows filesystems do know what encoding they use.
Err, not really. MS-DOS *needs to know* the encoding to use, a bit like a
*nix application that displays filenames needs to know the encoding to use
the
At 17:38 -0800 2004-12-10, Asmus Freytag wrote:
Other examples of apparent redundancy, are
Cakes - Keks (German), plural Kekse
Baby - bebis (Swedish), plural bebissar
and there are many more such examples.
In Ireland sometime in the early nineties, the Allied Irish Bank
became AIB Bank, the
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
John Cowan wrote:
However, although they are *technically* octet sequences, they
are *functionally* character strings. That's the issue.
Nicely put! But UTC does not seem to care.
The point I'm making is that *whatever* you do,
Am 11.12.2004 um 04:32 schrieb Clark Cox:
There are always the classics: ATM Machine and PIN Number
Here in Germany, they say ASCII-Code. :-)
Johannes
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Kenneth Whistler wrote:
Lars responded:
... Whatever the solutions
for representation of corrupt data bytes or uninterpreted data
bytes on conversion to Unicode may be, that is irrelevant to the
concerns on whether an
Lars Kristan [EMAIL PROTECTED] writes:
It's essential that any UTF-n can be translated to any other without
loss of data, because that makes it possible to use an implementation of
the given functionality which represents data in any form, not necessarily
the form we have at hand, as long as correctness
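The lossless translation between UTF-n forms described above can be sketched as follows. This is a minimal illustration that assumes well-formed input and uses Python's built-in codecs; the helper name `transcode` and the sample text are made up for the example and do not come from the thread.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Decode bytes from one encoding form and re-encode them in another."""
    return data.decode(src).encode(dst)

# Sample text (an assumption for illustration): ASCII, one other BMP
# character, and one supplementary-plane character.
text = "Keks \u00e9 \U0001F600"

utf8 = text.encode("utf-8")
utf16 = transcode(utf8, "utf-8", "utf-16-le")
utf32 = transcode(utf16, "utf-16-le", "utf-32-le")
back = transcode(utf32, "utf-32-le", "utf-8")

# For well-formed data the chain ends with the bytes it started from.
assert back == utf8
```

The guarantee only covers well-formed input; what should happen to malformed input is exactly what the rest of the thread argues about.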
Philippe Verdy wrote:
The repertoire of all possible combining characters sequences is
already infinite in Unicode, as well as the number of default
grapheme clusters they can represent.
For a fixed length of combining character sequence (base + 3 combining
marks is the most I have seen
Title: RE: Roundtripping in Unicode
Marcin 'Qrczak' Kowalczyk wrote:
Lars Kristan [EMAIL PROTECTED] writes:
The other name for this is roundtripping. Currently, Unicode allows
a roundtrip UTF-16=UTF-8=UTF-16. For any data. But there are
several reasons why a UTF-8=UTF-16(32)=UTF-8
Philippe Verdy scripsit:
Didn't know that. Is this a very recent use?
It's been used as an English verb, adjective, and noun for 30-40 years
and perhaps much longer: see below.
In France, I think that RSVP was introduced and widely used at end of
telegraphic messages (that contained lots of
On 11/12/2004 02:29, Mark Davis wrote:
This is just a confusion among the hoi polloi.
Mark
But such things happen not just among the German and Swedish polloi, but
even in the crowning heights of the English language. The word
cherubims is used many times in the King James Bible and at least
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Kenneth Whistler wrote:
Further, as it turns out that Lars is actually asking for
standardizing corrupt UTF-8, a notion that isn't going to
fly even two feet, I think the whole idea is going to be
a complete non-starter.
Michael Everson everson at evertype dot com wrote:
In Ireland sometime in the early nineties, the Allied Irish Bank
became AIB Bank, the Allied Irish Bank Bank.
Israel Discount Bank of New York regularly refers to itself as IDB
Bank.
-Doug Ewell
Fullerton, California
Philippe,
However, within the program itself UTF-8 presents a
problem when looking for specific data in memory buffers.
It is nasty, time consuming and error prone. Mapping
UTF-16 to code points is a snap as long as you
do not have a lot of surrogates. If you do then probably
UTF-32 should be
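The UTF-16 to code point mapping mentioned above can be sketched like this. It is a rough illustration, not anyone's production code: the function name `utf16_to_code_points` is hypothetical, and passing unpaired surrogates through unchanged is an assumption made only to keep the example short.

```python
def utf16_to_code_points(units):
    """Map a list of 16-bit code units to code points, pairing surrogates."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Combine a high/low surrogate pair into one supplementary code point.
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # A BMP code unit maps to itself; an unpaired surrogate is
            # passed through unchanged here (an assumption of this sketch).
            code_points.append(unit)
            i += 1
    return code_points

# 'A' followed by the surrogate pair for U+1F600.
assert utf16_to_code_points([0x0041, 0xD83D, 0xDE00]) == [0x41, 0x1F600]
```

When there are no surrogates every code unit is its own code point, which is the "snap" case; the pairing branch is the only extra work.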
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Regarding A, I see three choices:
1. A string is a sequence of code points.
2. A string is a sequence of combining character sequences.
3. A string is a sequence of code points, but it's encouraged
to process it in groups of combining character
on 2004-12-11 09:21 John Cowan wrote:
It's been used as an English verb, adjective, and noun for 30-40 years
and perhaps much longer: see below.
Longer. I can attest from my youth in the 1950s that my parents
considered it ordinary English usage, and in fact knew of its origin.
--
Curtis Clark
Title: RE: Roundtripping in Unicode
Marcin 'Qrczak' Kowalczyk wrote:
Lars Kristan [EMAIL PROTECTED] writes:
All assigned codepoints do roundtrip even in my concept.
But unassigned codepoints are not valid data.
Please make up your mind: either they are valid and programs are
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Philippe Verdy wrote:
This is a known caveat even for Unix, when you look at the tricky details
of the support of Windows file sharing through Samba, when the client
requests a file with a short 8.3 name, that a partition used
RE: Roundtripping in Unicode
Lars Kristan wrote:
All assigned codepoints do roundtrip even in my concept.
But unassigned codepoints are not valid data.
Please make up your mind: either they are valid and programs are
required to accept them, or they are invalid and programs are
required to
John wrote:
As far as I know, they were first used in formal invitations (to weddings,
funerals, dances, etc.) in the corner of the card, as both shorter and
more fancy than the older phrase The favor of your reply is requested.
This is correct. The practice dates from the end of the nineteenth
Title: RE: Nicest UTF
Missed this one the other day, but cannot let it go...
Marcin 'Qrczak' Kowalczyk wrote:
filenames, what is one supposed to do? Convert all
filenames to UTF-8?
Yes.
Who will do that?
A system administrator (because he has access to all files).
My my,
From: Doug Ewell [EMAIL PROTECTED]
Lars Kristan wrote:
I am sure one of the standardizers will find a Unicodally
correct way of putting it.
I can't even understand that paragraph, let alone paraphrase it.
My understanding of his question and my response to his problem is that you
MUST not use
From: Séamas Ó Brógáin [EMAIL PROTECTED]
John wrote:
As far as I know, they were first used in formal invitations (to weddings,
funerals, dances, etc.) in the corner of the card, as both shorter and
more fancy than the older phrase The favor of your reply is requested.
This is correct. The
At 01:12 +0100 2004-12-12, Philippe Verdy wrote:
I would not be surprised if this acronym was defined in some
internationally accepted set of abbreviations used by telegraphists,
so that their clients became exposed to these acronyms when reading
telegrams received from their local post office
From: Peter R. Mueller-Roemer [EMAIL PROTECTED]
For a fixed length of combining character sequence (base + 3 combining
marks is the most I have seen graphically distinguishable) the repertoire
is still finite.
I do think that you are underestimating the repertoire. Also Unicode does
NOT define
From: Michael Everson [EMAIL PROTECTED]
Nonsense. You might as well try to explain SPQR on the same basis.
I won't. I know that SPQR was used on architectural constructions as a
symbol of the Roman Empire, and it was a well-known acronym of a Latin
expression.
It largely predates the invention
Marcin 'Qrczak' Kowalczyk writes:
But demanding that each program which searches strings checks for
combining classes is I'm afraid too much.
How is it any different from a case-insensitive search?
Does \n followed by a combining code point start a new line?
The Standard says no,
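What "checking for combining classes" during a search might look like can be sketched as below. This is a hedged illustration only: the function name `find_whole` is hypothetical, and testing `unicodedata.combining()` for a nonzero class is a simplification, since some combining marks have class 0 and real segmentation rules are more involved.

```python
import unicodedata

def find_whole(haystack: str, needle: str) -> int:
    """Find needle, skipping matches that are followed by a combining mark."""
    start = 0
    while True:
        pos = haystack.find(needle, start)
        if pos == -1:
            return -1
        end = pos + len(needle)
        # Accept the match only if the next code point does not attach
        # to the last character of the match.
        if end == len(haystack) or unicodedata.combining(haystack[end]) == 0:
            return pos
        start = pos + 1

sample = "cafe\u0301 cafe"   # first word ends in e + COMBINING ACUTE ACCENT
assert sample.find("cafe") == 0          # plain code point search
assert find_whole(sample, "cafe") == 6   # boundary-aware search
```

Under the same check, `find_whole("\n\u0301", "\n")` returns -1, which is one way of treating a newline followed by a combining code point as not being a plain line break.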
Lars Kristan writes:
A system administrator (because he has access to all files).
My my, you are assuming all files are in the same encoding. And what about
all the references to the files in scripts? In configuration files? Soft
links? If you want to break things, this is definitely the
RE: Roundtripping in Unicode
My view about this problem of roundtripping is
that if data, supposed to contain only valid UTF-8 sequences, contains some
invalid byte sequences that still need to be roundtripped to some code
point for internal management that can be roundtripped later to the
Title: RE: When to validate?
Andy Heninger wrote:
Some important things in designing a function API are
o Fully define what the behavior is. With a function like
tolower(), you could leave malformed sequences unaltered;
you could replace them with some substitution character;
you
D. Starner [EMAIL PROTECTED] writes:
This implies that every programmer needs an in-depth knowledge of
Unicode to handle simple strings.
There is no way to avoid that.
Then there's no way that we're ever going to get reliable Unicode
support.
This is probably true.
I wonder whether
Title: Roundtripping in Unicode (was RE: Invalid UTF-8 sequences)
Marcin 'Qrczak' Kowalczyk wrote:
Lars Kristan [EMAIL PROTECTED] writes:
Quite close. Except for the fact that:
* U+EE93 is represented in UTF-32 as 0xEE93
* U+EE93 is represented in UTF-16 as 0xEE93
* U+EE93 is
Lars Kristan [EMAIL PROTECTED] writes:
The other name for this is roundtripping. Currently, Unicode allows
a roundtrip UTF-16=UTF-8=UTF-16. For any data. But there are
several reasons why a UTF-8=UTF-16(32)=UTF-8 roundtrip is more
valuable, even if it means that the other roundtrip is no
Title: RE: Roundtripping in Unicode
Marcin 'Qrczak' Kowalczyk wrote:
Roundtrip for valid data is of course essential and needs to be
preserved.
Your proposal does not do this.
All assigned codepoints do roundtrip even in my concept. But unassigned codepoints are not valid data.
Lars Kristan [EMAIL PROTECTED] writes:
All assigned codepoints do roundtrip even in my concept.
But unassigned codepoints are not valid data.
Please make up your mind: either they are valid and programs are
required to accept them, or they are invalid and programs are required
to reject them.
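For readers who want to see a UTF-8=UTF-16(32)=UTF-8 style roundtrip of invalid bytes in practice, here is a minimal sketch using Python's `surrogateescape` error handler. This is not the scheme proposed in the thread: Python maps each undecodable byte to a lone surrogate in the range U+DC80..U+DCFF, and the sample bytes are invented for the example.

```python
# Bytes that are mostly valid UTF-8 but end in two invalid bytes.
raw = b"valid \xc3\xa9 then invalid \xff\xfe"

# Decoding maps each undecodable byte to a lone low surrogate.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")
assert restored == raw

# A strict encoder rejects the escaped string, so the internal form is
# not valid Unicode text as far as strict consumers are concerned.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
assert not strict_ok
```

The byte roundtrip comes at a price: the in-memory string holds code points that a strict consumer must refuse, which is the accept-or-reject tension the messages above keep returning to.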