Title: RE: Nicest UTF
D. Starner wrote:
Some won't convert any and will just start using UTF-8
for new ones. And this should be allowed.
Why should it be allowed? You can't mix items with
different unlabeled encodings willy-nilly. All you're going
to get, all you can expect to get
Title: RE: Nicest UTF
Marcin 'Qrczak' Kowalczyk wrote:
My my, you are assuming all files are in the same encoding.
Yes. Otherwise nothing shows filenames correctly to the user.
UNIX is a multi user system. One user can use one locale and might never see files from another user that uses
Title: RE: Nicest UTF
D. Starner wrote:
Lars Kristan writes:
A system administrator (because he has access to all files).
My my, you are assuming all files are in the same encoding.
And what about all the references to the files in scripts? In
configuration files? Soft links
Some won't convert any and will just start using UTF-8
for new ones. And this should be allowed.
Why should it be allowed? You can't mix items with
different unlabeled encodings willy-nilly. All you're going
to get, all you can expect to get is a mess.
--
Lars Kristan scripsit:
I'm using ISO-8859-2.
In fact you're lucky. Many ISO-8859-1 filenames display correctly in
ISO-8859-2. Not all users are so lucky.
It was a design point of ISO-8859-{1,2,3,4}, but not any other variants,
that every character appears either at the same codepoint or not
From: D. Starner [EMAIL PROTECTED]
Some won't convert any and will just start using UTF-8
for new ones. And this should be allowed.
Why should it be allowed? You can't mix items with
different unlabeled encodings willy-nilly. All you're going
to get, all you can expect to get is a mess.
When you
Lars Kristan [EMAIL PROTECTED] writes:
My my, you are assuming all files are in the same encoding.
Yes. Otherwise nothing shows filenames correctly to the user.
And what about all the references to the files in scripts?
In configuration files?
Such files rarely use non-ASCII characters.
D. Starner [EMAIL PROTECTED] writes:
But demanding that each program which searches strings checks for
combining classes is I'm afraid too much.
How is it any different from a case-insensitive search?
We started from string equality, which somehow changed into searching.
Default string
Philippe Verdy [EMAIL PROTECTED] writes:
It's hard to create a general model that will work for all scripts
encoded in Unicode. There are too many differences. So Unicode just
appears to standardize a higher level of processing with combining
sequences and normalization forms that are better
On 11/12/2004 16:53, Peter R. Mueller-Roemer wrote:
...
For a fixed length of combining character sequence (base + 3 combining
marks is the most I have seen graphically distinguishable) the
repertoire is still finite.
In Hebrew it is actually possible to have up to 9 combining marks with a
Philippe Verdy [EMAIL PROTECTED] writes:
[...]
This was later amended in an errata for XML 1.0 which now says that
the list of code points whose use is *discouraged* (but explicitly
*not* forbidden) for the Char production is now:
[...]
Ugh, it's a mess...
IMHO Unicode is partially to blame,
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Arcane Jill responded:
Windows filesystems do know what encoding they use.
Err, not really. MS-DOS *need to know* the encoding to use, a bit like a
*nix application that displays filenames need to know the encoding to use
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
John Cowan wrote:
However, although they are *technically* octet sequences, they
are *functionally* character strings. That's the issue.
Nicely put! But UTC does not seem to care.
The point I'm making is that *whatever* you do
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Kenneth Whistler wrote:
Lars responded:
... Whatever the solutions
for representation of corrupt data bytes or uninterpreted data
bytes on conversion to Unicode may be, that is irrelevant to the
concerns on whether
Philippe Verdy wrote:
The repertoire of all possible combining characters sequences is
already infinite in Unicode, as well as the number of default
grapheme clusters they can represent.
For a fixed length of combining character sequence (base + 3 combining
marks is the most I have seen
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Kenneth Whistler wrote:
Further, as it turns out that Lars is actually asking for
standardizing corrupt UTF-8, a notion that isn't going to
fly even two feet, I think the whole idea is going to be
a complete non-starter
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Regarding A, I see three choices:
1. A string is a sequence of code points.
2. A string is a sequence of combining character sequences.
3. A string is a sequence of code points, but it's encouraged
to process it in groups of combining character
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Philippe Verdy wrote:
This is a known caveat even for Unix, when you look at the tricky details of
the support of Windows file sharing through Samba, when the client requests
a file with a short 8.3 name, that a partition used
Title: RE: Nicest UTF
Missed this one the other day, but cannot let it go...
Marcin 'Qrczak' Kowalczyk wrote:
filenames, what is one supposed to do? Convert all
filenames to UTF-8?
Yes.
Who will do that?
A system administrator (because he has access to all files).
My my
From: Peter R. Mueller-Roemer [EMAIL PROTECTED]
For a fixed length of combining character sequence (base + 3 combining
marks is the most I have seen graphically distinguishable) the repertoire
is still finite.
I do think that you are underestimating the repertoire. Also Unicode does
NOT define
Marcin 'Qrczak' Kowalczyk writes:
But demanding that each program which searches strings checks for
combining classes is I'm afraid too much.
How is it any different from a case-insensitive search?
Does \n followed by a combining code point start a new line?
The Standard says no,
Lars Kristan writes:
A system administrator (because he has access to all files).
My my, you are assuming all files are in the same encoding. And what about
all the references to the files in scripts? In configuration files? Soft
links? If you want to break things, this is definitely the
D. Starner [EMAIL PROTECTED] writes:
This implies that every programmer needs an in-depth knowledge of
Unicode to handle simple strings.
There is no way to avoid that.
Then there's no way that we're ever going to get reliable Unicode
support.
This is probably true.
I wonder whether
Marcin 'Qrczak' Kowalczyk scripsit:
The XML/HTML core syntax is defined with fixed behavior of some
individual characters like '', '', quotation marks, and with special
behavior for spaces.
The point is: what characters mean in this sentence. Code points?
Combining character sequences?
Marcin 'Qrczak' Kowalczyk writes:
D. Starner writes:
This implies that every programmer needs an in-depth knowledge of
Unicode to handle simple strings.
There is no way to avoid that.
Then there's no way that we're ever going to get reliable Unicode
support.
If the runtime
From: Philippe Verdy [EMAIL PROTECTED]
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Philippe Verdy [EMAIL PROTECTED] writes:
The XML/HTML core syntax is defined with fixed behavior of some
individual characters like '', '', quotation marks, and with special
behavior for spaces.
The point is:
Marcin 'Qrczak' Kowalczyk scripsit:
http://www.w3.org/TR/2000/REC-xml-20001006#charsets
implies that the appropriate level for parsing XML is code points.
You are reading the XML Recommendation incorrectly. It is not defined
in terms of codepoints (8-bit, 16-bit, or 32-bit) but in terms of
John Cowan writes:
You are reading the XML Recommendation incorrectly. It is not defined
in terms of codepoints (8-bit, 16-bit, or 32-bit) but in terms of
characters. XML processors are required to process UTF-8 and UTF-16,
and may process other character encodings or not. But the internal
Philippe Verdy scripsit:
If you look at the XML 1.0 Second Edition
The Second Edition has been superseded by the Third.
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]
That is normative.
But the comment following it specifies:
That comment is not
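As an illustration of the Char production quoted in this message, here is a minimal Python sketch that tests a single code point against it. The function name `is_xml_char` is made up for this example, and the final range is written out in full as [#x10000-#x10FFFF], as in the XML 1.0 Recommendation.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)           # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF         # up to just below the surrogates
        or 0xE000 <= cp <= 0xFFFD       # excludes U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF    # supplementary planes
    )
```

Note that the test works on code points, not on code units or combining sequences.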
Philippe Verdy scripsit:
Okay, I'm confused. Does &#8814; open a tag? Does it matter if it's
composed or decomposed?
It does not open a XML tag.
It does matter if it's composed (won't open a tag) or decomposed (will
open a tag, but with a combining character, invalid as an identifier
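The composed/decomposed distinction discussed here can be seen directly for code point 8814 (U+226E NOT LESS-THAN). The following is a small Python sketch using the standard `unicodedata` module; the variable names are only for illustration.

```python
import unicodedata

composed = "\u226e"  # NOT LESS-THAN as a single code point
decomposed = unicodedata.normalize("NFD", composed)

# The canonical decomposition is '<' followed by
# U+0338 COMBINING LONG SOLIDUS OVERLAY.
code_points = [f"U+{ord(c):04X}" for c in decomposed]
print(code_points)

# A parser working on code points sees a literal '<' only in the
# decomposed form.
starts_with_lt = decomposed.startswith("<")
```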
Philippe Verdy scripsit:
And I disagree with you about the fact the U+ can't be used in XML
documents. It can be used in URI through URI escaping mechanism, as
explicitly indicated in the XML specification...
You have a hold of the right stick but at the wrong end. U+ can be
D. Starner [EMAIL PROTECTED] writes:
String equality in a programming language should not treat composed
and decomposed forms as equal. Not this level of abstraction.
This implies that every programmer needs an in-depth knowledge of
Unicode to handle simple strings.
There is no way to avoid
Philippe Verdy [EMAIL PROTECTED] writes:
The XML/HTML core syntax is defined with fixed behavior of some
individual characters like '', '', quotation marks, and with special
behavior for spaces.
The point is: what characters mean in this sentence. Code points?
Combining character sequences?
John Cowan [EMAIL PROTECTED] writes:
The XML/HTML core syntax is defined with fixed behavior of some
individual characters like '', '', quotation marks, and with special
behavior for spaces.
The point is: what characters mean in this sentence. Code points?
Combining character sequences?
- Original Message -
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, December 10, 2004 8:35 PM
Subject: Re: Nicest UTF
Philippe Verdy [EMAIL PROTECTED] writes:
The XML/HTML core syntax is defined with fixed behavior of some
individual characters
John Cowan [EMAIL PROTECTED] writes:
The XML/HTML core syntax is defined with fixed behavior of some
individual characters like '', '', quotation marks, and with special
behavior for spaces.
The point is: what characters mean in this sentence. Code points?
Combining character sequences?
From: John Cowan [EMAIL PROTECTED]
Marcin 'Qrczak' Kowalczyk scripsit:
http://www.w3.org/TR/2000/REC-xml-20001006#charsets
implies that the appropriate level for parsing XML is code points.
You are reading the XML Recommendation incorrectly. It is not defined
in terms of codepoints (8-bit,
From: D. Starner [EMAIL PROTECTED]
Okay, I'm confused. Does &#8814; open a tag? Does it matter if it's
composed or decomposed?
It does not open a XML tag.
It does matter if it's composed (won't open a tag) or decomposed (will open
a tag, but with a combining character, invalid as an identifier
On Monday, December 6th, 2004 20:52Z John Cowan va escriure:
Doug Ewell scripsit:
Now suppose you have a UNIX filesystem, containing filenames in a
legacy encoding (possibly even more than one). If one wants to
switch to UTF-8 filenames, what is one supposed to do? Convert all
filenames to
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Behalf Of Antoine Leca
Sent: 09 December 2004 11:29
To: Unicode Mailing List
Subject: Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Windows filesystems do know what encoding they use.
Err, not really. MS-DOS *need
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Ok, so it's the conversion from raw text to escaped character
references which should treat combining characters specially.
What about with combining acute, which doesn't have a precomposed
form? A broken opening tag or a valid text character?
From: D. Starner [EMAIL PROTECTED]
Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] writes:
If it's a broken character reference, then what about A&#769; (769 is
the code for combining acute if I'm not mistaken)?
Please start adding spaces to your entity references or
something, because those of us
From: Antoine Leca [EMAIL PROTECTED]
Err, not really. MS-DOS *need to know* the encoding to use, a bit like a
*nix application that displays filenames need to know the encoding to use
the correct set of glyphs (but the constraints are much heavier.) Also
Windows NT Unicode applications know it,
Philippe Verdy verdy underscore p at wanadoo dot fr wrote:
Please start adding spaces to your entity references or
something, because those of us reading this through a web interface
are getting very confused.
No confusion possible if using any classic mail reader.
Blame your ISP (and
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Doug Ewell wrote:
How do file names work when the user changes from one SBCS to another
(let's ignore UTF-8 for now) where the interpretation is different? For
example, byte C3 is U+00C3, A with tilde (Ã) in ISO 8859-1, but U+0102
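The byte C3 example can be reproduced with Python's built-in codecs. This is only a sketch of the point being made: the same unlabeled byte decodes to different characters depending on which single-byte character set is assumed.

```python
raw = bytes([0xC3])  # one byte taken from a hypothetical legacy filename

as_latin1 = raw.decode("iso-8859-1")  # interpreted as ISO 8859-1
as_latin2 = raw.decode("iso-8859-2")  # interpreted as ISO 8859-2

print(f"U+{ord(as_latin1):04X}", f"U+{ord(as_latin2):04X}")
```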
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Needless to say, these systems were badly designed at their origin, and
newer filesystems (and OS APIs) offer much better alternative, by either
storing explicitly on volumes which encoding it uses, or by forcing all
user
D. Starner [EMAIL PROTECTED] writes:
You could hide combining characters, which would be extremely useful if
we were just using Latin and Cyrillic scripts.
It would need a separate API for examining the contents of a combining
character. You can't avoid the sequence of code points completely.
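A rough Python sketch of the two levels discussed here: one function that hides combining characters by grouping them with their base, and a separate one that exposes the code points inside a group. Both function names are invented for this example, and the grouping rule (attach any code point with a nonzero canonical combining class to the previous group) is a simplification that misses marks whose combining class is 0.

```python
import unicodedata

def combining_sequences(text: str) -> list[str]:
    """Split text into groups of a base code point plus following marks."""
    groups: list[str] = []
    for ch in text:
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch      # mark: attach to the current group
        else:
            groups.append(ch)     # base (or leading mark): start a new group
    return groups

def code_points(sequence: str) -> list[int]:
    """The separate API: look inside one group at its code points."""
    return [ord(c) for c in sequence]
```

For example, `combining_sequences("e\u0301a")` yields two groups, and `code_points` on the first shows the base letter and the combining acute.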
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Kenneth Whistler wrote:
I'm going to step in here, because this argument seems to
be generating more heat than light.
I agree, and I thank you for that.
First, I'm going to summarize what I think Lars Kristan is
suggesting
John Cowan responded:
Storage of UNIX filenames on Windows databases, for example,
^^
O.k., I just quoted this back from the original email, but
it really is a complete misconception of the issue for
databases. Windows databases is a
Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] writes:
D. Starner [EMAIL PROTECTED] writes:
You could hide combining characters, which would be extremely useful if
we were just using Latin and Cyrillic scripts.
It would need a separate API for examining the contents of a combining
D. Starner [EMAIL PROTECTED] writes:
The semantics there are surprising, but that's true no matter what you
do. An NFC string + an NFC string may not be NFC; the resulting text
doesn't have N+M graphemes.
Which implies that automatically NFC-ing strings as they are processed
would be a bad
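The claim that an NFC string plus an NFC string may not be NFC can be checked with a two-code-point case. A minimal Python sketch; the helper `is_nfc` is defined here for illustration.

```python
import unicodedata

def is_nfc(s: str) -> bool:
    """True if s is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", s) == s

left = "e"          # NFC on its own
right = "\u0301"    # a lone COMBINING ACUTE ACCENT, also NFC on its own
joined = left + right

# Concatenation puts the mark after a base it composes with, so the
# result is no longer NFC, and normalizing it changes the length.
renormalized = unicodedata.normalize("NFC", joined)
```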
Marcin 'Qrczak' Kowalczyk scripsit:
String equality in a programming language should not treat composed
and decomposed forms as equal. Not this level of abstraction.
Well, that assumes that there's a special string equality predicate, as
distinct from just having various predicates that DWIM.
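The two kinds of predicate under discussion can be put side by side. In this Python sketch the built-in `==` compares code points, while `canonically_equal` (a name made up for this example) compares after normalizing both sides to NFD.

```python
import unicodedata

composed = "\u00e9"     # e with acute as one code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings up to canonical equivalence."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

plain_equal = composed == decomposed               # code point comparison
canon_equal = canonically_equal(composed, decomposed)
```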
John Cowan [EMAIL PROTECTED] writes:
String equality in a programming language should not treat composed
and decomposed forms as equal. Not this level of abstraction.
Well, that assumes that there's a special string equality predicate,
as distinct from just having various predicates that
Marcin 'Qrczak' Kowalczyk writes:
String equality in a programming language should not treat composed
and decomposed forms as equal. Not this level of abstraction.
This implies that every programmer needs an in-depth knowledge of Unicode
to handle simple strings. The concept makes me want to
Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] writes:
If it's a broken character reference, then what about A&#769; (769 is
the code for combining acute if I'm not mistaken)?
Please start adding spaces to your entity references or
something, because those of us reading this through a web
Marcin asked:
The general trouble is that numeric character references can only
encode individual code points
By design.
rather than graphemes (is this a correct
term for a non-combining code point with a sequence of combining code
points?).
No. The correct term is combining character
Lars responded:
... Whatever the solutions
for representation of corrupt data bytes or uninterpreted data
bytes on conversion to Unicode may be, that is irrelevant to the
concerns on whether an application is using UTF-8 or UTF-16
or UTF-32.
The important fact is that if you have an
From: D. Starner [EMAIL PROTECTED]
If you're talking about a language that hides the structure of strings
and has no problem with variable length data, then it wouldn't matter
what the internal processing of the string looks like. You'd need to
use iterators and discourage the use of arbitrary
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Doug Ewell wrote:
John Cowan jcowan at reutershealth dot com wrote:
Windows filesystems do know what encoding they use. But a filename on
a Unix(oid) file system is a mere sequence of octets, of which only 00
and 2F
Philippe stated, and I need to correct:
UTF-24 already exists as an encoding form (it is identical to UTF-32), if
you just consider that encoding forms just need to be able to represent a
valid code range within a single code unit.
This is false.
Unicode encoding forms exist by virtue of
Yes, and pigs could fly, if they had big enough wings.
An 8-foot wingspan should do it. For picture of said flying pig see:
http://www.cincinnati.com/bigpiggig/profile_091700.html
http://www.cincinnati.com/bigpiggig/images/pig091700.jpg
Rick
From: Kenneth Whistler [EMAIL PROTECTED]
Yes, and pigs could fly, if they had big enough wings.
Once again, this is a creative comment. As if Unicode had to be bound on
architectural constraints such as the requirement of representing code units
(which are architectural for a system) only as
From: D. Starner [EMAIL PROTECTED]
(Sorry for sending this twice, Marcin.)
Marcin 'Qrczak' Kowalczyk writes:
UTF-8 is poorly suitable for internal processing of strings in a
modern programming language (i.e. one which doesn't already have a
pile of legacy functions working on bytes, but which can
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
I know what you mean here:
most Linux/Unix filesystems (as well as many legacy filesystems for Windows
and MacOS...) do not track the encoding with which filenames were encoded
and, depending on local user preferences when that user created
Philippe continued:
As if Unicode had to be bound on
architectural constraints such as the requirement of representing code units
(which are architectural for a system) only as 16-bit or 32-bit units,
Yes, it does. By definition. In the standard.
ignoring the fact that technologies do
Lars,
I'm going to step in here, because this argument seems to
be generating more heat than light.
I never said it doesn't violate any existing rules. Stating that it does,
doesn't help a bit. Rules can be changed.
I ask you to step back and try to see the big picture.
First, I'm going to
Kenneth Whistler scripsit:
Storage of UNIX filenames on Windows databases, for example,
can be done with BINARY fields, which correctly capture the
identity of them as what they are: an unconvertible array of
byte values, not a convertible string in some particular
code page.
This solution,
Philippe Verdy verdy underscore p at wanadoo dot fr wrote:
An alternative can then be a mixed encoding selection:
- choose a legacy encoding that will most often be able to represent
valid filenames without loss of information (for example ISO-8859-1,
or Cp1252).
- encode the filename with
Kenneth Whistler kenw at sybase dot com wrote:
I do not think this is a proposal to amend UTF-8 to allow
invalid sequences. So we should get that off the table.
I hope you are right.
Apparently Lars is currently using PUA U+E080..U+E0FF
(or U+EE80..U+EEFF ?) for this purpose, enabling the
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Lars Kristan wrote:
I never said it doesn't violate any existing rules. Stating that it
does, doesn't help a bit. Rules can be changed. Assuming we understand
the consequences. And that is what we should be discussing. By stating
what should
: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Behalf Of Marcin 'Qrczak' Kowalczyk
Sent: 02 December 2004 16:59
To: [EMAIL PROTECTED]
Subject: Re: Nicest UTF
Arcane Jill [EMAIL PROTECTED] writes:
Oh for a chip with 21-bit wide registers!
Not 21-bit but 20.087462841250343-bit :-)
--
__( Marcin Kowalczyk
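The fractional figure is the base-2 logarithm of the number of Unicode code points (U+0000 through U+10FFFF, that is 0x110000 values). A quick Python check, with variable names chosen for this example:

```python
import math

UNICODE_CODE_POINTS = 0x110000  # U+0000 .. U+10FFFF inclusive

bits_needed = math.log2(UNICODE_CODE_POINTS)  # about 20.087
whole_bits = math.ceil(bits_needed)           # smallest integer register width

print(bits_needed, whole_bits)
```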
Title: RE: Nicest UTF
Doug Ewell wrote:
Lars Kristan wrote:
I think UTF8 would be the nicest UTF.
I agree. But not for reasons you mentioned. There is one other
important advantage: UTF-8 is stored in a way that permits storing
invalid sequences. I will need
Arcane Jill arcanejill at ramonsky dot com wrote:
Probably a dumb question, but how come nobody's invented UTF-24 yet?
I just made that up, it's not an official standard, but one could
easily define UTF-24 as UTF-32 with the most-significant byte (which
is always zero) removed, hence all
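To make the made-up "UTF-24" concrete, here is a Python sketch of what such a scheme could look like: three bytes per code point. It is not a Unicode encoding form or any existing standard, and the big-endian byte order and the function names are assumptions of this example.

```python
def utf24_encode(text: str) -> bytes:
    """Hypothetical 'UTF-24': each code point as 3 big-endian bytes."""
    return b"".join(ord(ch).to_bytes(3, "big") for ch in text)

def utf24_decode(data: bytes) -> str:
    """Inverse of utf24_encode; rejects input that is not a multiple of 3."""
    if len(data) % 3 != 0:
        raise ValueError("length is not a multiple of 3 bytes")
    return "".join(
        chr(int.from_bytes(data[i:i + 3], "big"))
        for i in range(0, len(data), 3)
    )
```

Under these assumptions the output for a character is its UTF-32BE encoding with the leading zero byte dropped.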
Asmus Freytag wrote:
A simplistic model of the 'cost' for UTF-16 over UTF-32 would consider
snip
3) additional cost of accessing 16-bit registers (per character)
snip
For many processors, item 3 is not an issue.
I do not know, I only know of a few of them; for example, I do not know how
Alpha
Asmus Freytag wrote:
A simplistic model of the 'cost' for UTF-16 over UTF-32 would consider
1) 1 extra test per character (to see whether it's a surrogate)
In my experience with tuning a fair amount of utf-16 software, this test
takes pretty close to zero time. All modern processors have branch
Lars Kristan [EMAIL PROTECTED] writes:
This is simply what you have to do. You cannot convert the data
into Unicode in a way that says I don't know how to convert this
data into Unicode. You must either convert it properly, or leave
the data in its original encoding (properly marked,
RE: Nicest UTF
Lars Kristan wrote:
I could not disagree more with the basic premise of Lars' post. It
is a fundamental and critical mistake to try to extend Unicode with
non-standard code unit sequences to handle data that cannot be, or
has not been, converted to Unicode from a legacy
Doug Ewell scripsit:
Now suppose you have a UNIX filesystem, containing filenames in a
legacy encoding (possibly even more than one). If one wants to switch
to UTF-8 filenames, what is one supposed to do? Convert all filenames
to UTF-8?
Well, yes. Doesn't the file system dictate what
John Cowan jcowan at reutershealth dot com wrote:
Windows filesystems do know what encoding they use. But a filename on
a Unix(oid) file system is a mere sequence of octets, of which only 00
and 2F are interpreted. (Filenames containing 20, and especially 0A,
are annoying to handle with
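For comparison, and not as something anyone in this thread proposed, Python's `surrogateescape` error handler is one existing way to carry such octet sequences through a Unicode string type. The sketch below decodes a hypothetical filename containing a byte that is not valid UTF-8 and then recovers the original octets.

```python
raw_name = b"caf\xe9.txt"  # hypothetical filename; E9 is not valid UTF-8 here

# Undecodable bytes 0x80..0xFF are mapped to lone surrogates U+DC80..U+DCFF.
as_text = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original octets exactly.
round_trip = as_text.encode("utf-8", errors="surrogateescape")
```

The resulting string contains a lone surrogate, so it is not valid Unicode text and cannot be encoded as strict UTF-8.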
- Original Message -
From: Arcane Jill [EMAIL PROTECTED]
Probably a dumb question, but how come nobody's invented UTF-24 yet? I
just made that up, it's not an official standard, but one could easily
define UTF-24 as UTF-32 with the most-significant byte (which is always
zero) removed,
(Sorry for sending this twice, Marcin.)
Marcin 'Qrczak' Kowalczyk writes:
UTF-8 is poorly suitable for internal processing of strings in a
modern programming language (i.e. one which doesn't already have a
pile of legacy functions working on bytes, but which can be designed
to make
Asmus Freytag asmusf at ix dot netcom dot com wrote:
Given this little model and some additional assumptions about your
own project(s), you should be able to determine the 'nicest' UTF for
your own performance-critical case.
This is absolutely correct. Each situation may have different needs
RE: Nicest UTF
Lars Kristan wrote:
I think UTF8 would be the nicest UTF.
I agree. But not for reasons you mentioned. There is one other
important advantage: UTF-8 is stored in a way that permits storing
invalid sequences. I will need to elaborate that, of course.
I could not disagree more
Philippe Verdy verdy underscore p at wanadoo dot fr wrote:
I appreciate Philippe's support of SCSU, but I don't think *even I*
would recommend it as an internal storage format. The effort to
encode and decode it, while by no means Herculean as often perceived,
is not trivial once you step
- Original Message -
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Sunday, December 05, 2004 1:37 AM
Subject: Re: Nicest UTF
Philippe Verdy [EMAIL PROTECTED] writes:
There's nothing that requires the string storage to use the same
exposed array,
The point
Philippe Verdy [EMAIL PROTECTED] writes:
The point is that indexing should better be O(1).
SCSU is also O(1) in terms of indexing complexity...
It is not. You can't extract the nth code point without scanning the
previous n-1 code points.
But individual characters do not always have any
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Philippe Verdy [EMAIL PROTECTED] writes:
The point is that indexing should better be O(1).
SCSU is also O(1) in terms of indexing complexity...
It is not. You can't extract the nth code point without scanning the
previous n-1 code points.
The
Philippe Verdy [EMAIL PROTECTED] writes:
The question is why you would need to extract the nth codepoint so
blindly.
For example I'm scanning a string backwards (to remove '\n' at the
end, to find and display the last N lines of a buffer, to find the
last '/' or last '.' in a file name). SCSU
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Now consider scanning forwards. We want to strip a beginning of a
string. For example the string is an irc message prefixed with a
command and we want to take the message only for further processing.
We have found the end of the prefix and we want
Philippe Verdy verdy underscore p at wanadoo dot fr wrote:
The point is that indexing should better be O(1).
SCSU is also O(1) in terms of indexing complexity... simply because it
keeps the exact equivalence with codepoints, and requires a *fixed*
(and small) number of steps to decode it to
Philippe Verdy verdy underscore p at wanadoo dot fr wrote:
Here is a string, expressed as a sequence of bytes in SCSU:
05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E
“ M o s c o w ” SP i s SP М о с к в а .
Without looking at it, it's easy to see
Philippe Verdy verdy underscore p at wanadoo dot fr wrote:
Only the encoder may be a bit complex to write (if one wants to
generate the optimal smallest result size), but even a moderate
programmer could find a simple and working scheme with a still
excellent compression rate (around 1 to 1.2
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Philippe Verdy [EMAIL PROTECTED] writes:
Random access by code point index means that you don't use strings
as immutable objects,
No. Look at Python, Java and C#: their strings are immutable (don't
change in-place) and are indexed by integers (not
On Dec 3, 2004, at 2:54 AM, Andrew C. West wrote:
I strongly agree that all Unicode implementations should cover all of
Unicode, and not just the BMP, and it really annoys me when they don't;
but suggesting that you need to implement supra-BMP characters because
they are going to start popping
Philippe Verdy [EMAIL PROTECTED] writes:
There's nothing that requires the string storage to use the same
exposed array,
The point is that indexing should better be O(1).
Not having a constant size per code point requires one of three things:
1. Using opaque iterators instead of integer
Title: RE: Nicest UTF
Theodore H. Smith wrote:
What would be the nicest UTF to use?
I think UTF8 would be the nicest UTF.
I agree. But not for reasons you mentioned. There is one other important advantage: UTF-8 is stored in a way that permits storing invalid sequences. I will need
At 09:56 PM 12/2/2004, Doug Ewell wrote:
I use ... and UTF-32 for most internal processing that I write
myself. Let people say UTF-32 is wasteful if they want; I don't tend to
store huge amounts of text in memory at once, so the overhead is much
less important than one code unit per character.
From: Asmus Freytag [EMAIL PROTECTED]
I use ... and UTF-32 for most internal processing that I write
myself. Let people say UTF-32 is wasteful if they want; I don't tend to
store huge amounts of text in memory at once, so the overhead is much
less important than one code unit per character.
Mailing List
[EMAIL PROTECTED]
Sent: Friday, December 03, 2004 07:55
Subject: Re: Nicest UTF
At 09:56 PM 12/2/2004, Doug Ewell wrote:
I use ... and UTF-32 for most internal processing that I write
myself. Let people say UTF-32 is wasteful if they want; I don't tend to
store huge amounts of text
RE: Nicest UTF
- Original Message -
From: Lars Kristan
To: '[EMAIL PROTECTED]'
Sent: Friday, December 03, 2004 2:45 PM
Subject: RE: Nicest UTF
Theodore H. Smith wrote:
What would be the nicest UTF to use?
I think UTF8 would be the nicest UTF.
I agree. But not for reasons you