Arcane Jill [EMAIL PROTECTED] writes:
OBSERVATION - Requirement (4) is not met absolutely, however,
the probability of the UTF-8 encoding of this sequence occurring
accidentally at an arbitrary offset in an arbitrary octet stream
is approximately one in 2^384;
Assuming that the distribution of
Arcane Jill [EMAIL PROTECTED] writes:
Unix makes it possible for /you/ to change /your/ locale - but by
your reasoning, this is an error, unless all other users do so
simultaneously.
Not necessarily: you can change the locale as long as it uses the same
default encoding.
By error I mean a
Lars Kristan [EMAIL PROTECTED] writes:
OK, strcpy does not need to interpret UTF-8. But strchr probably should.
No. Its argument is a byte, even though it's passed as type int.
By byte here I mean C char value, which is an octet in virtually
all modern C implementations; the C standard doesn't
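The distinction can be sketched (here in Python rather than C): a strchr-style search takes a single byte, which is fine for ASCII, but finding a non-ASCII character in UTF-8 means searching for its whole byte sequence.

```python
# Sketch: byte search vs. character search in UTF-8.
text = "naïve".encode("utf-8")          # b'na\xc3\xafve'

# A strchr-style search for a single ASCII byte works:
assert text.find(0x76) == 4             # 'v'

# But no single byte equals 'ï' (U+00EF); its UTF-8 form is 0xC3 0xAF,
# so the search argument must be the full encoded sequence:
assert text.find(0xEF) == -1
assert text.find("ï".encode("utf-8")) == 2
```

Because UTF-8 lead and continuation bytes occupy disjoint ranges, searching for a whole encoded sequence can never match in the middle of another character.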
Peter Kirk [EMAIL PROTECTED] writes:
Jill, again your solution is ingenious. But would it not work just
as well, for Lars' purposes, to use, instead of your string of
random characters, just ONE reserved code point followed by U+0xx?
Instead of asking the UTC to allocate a specific code
Arcane Jill [EMAIL PROTECTED] writes:
OBSERVATION - Roundtripping is possible in the direction NOT-UTF-8 ->
NOT-UTF-16 -> NOT-UTF-8
But it's not possible in the direction NOT-UTF-16 -> NOT-UTF-8 ->
NOT-UTF-16, unless you define valid sequences of NOT-UTF-16 in an
awkward way which would happen to
Lars Kristan [EMAIL PROTECTED] writes:
Hm, here lies the catch. According to UTC, you need to keep
processing the UNIX filenames as BINARY data. And, also according
to UTC, any UTF-8 function is allowed to reject invalid sequences.
Basically, you are not supposed to use strcpy to process
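Python's later "surrogateescape" error handler (PEP 383) illustrates exactly this kind of binary-preserving roundtrip for filenames: invalid bytes survive a bytes -> str -> bytes trip unchanged.

```python
# A filename containing a byte that is invalid as UTF-8
raw = b"report-\xe9.txt"                # Latin-1 'é', not valid UTF-8

# Decode with surrogateescape: the bad byte 0xE9 becomes the
# lone surrogate U+DCE9 instead of raising an error.
name = raw.decode("utf-8", errors="surrogateescape")
assert name == "report-\udce9.txt"

# Encoding with the same handler restores the original bytes losslessly.
assert name.encode("utf-8", errors="surrogateescape") == raw
```

The cost is that the intermediate string contains lone surrogates, i.e. it is not valid Unicode and cannot be passed to a strict UTF-8 encoder.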
Arcane Jill [EMAIL PROTECTED] writes:
If so, Marcin, what exactly is the error, and whose fault is it?
It's an error to use locales with different encodings on the same
system.
--
__( Marcin Kowalczyk
\__/ [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/
I describe here languages which exclusively use Unicode strings.
Some languages have both byte strings and Unicode strings (e.g. Python)
and then byte strings are generally used for strings exchanged with
the OS; the programmer is responsible for the conversion if he wishes
to use Unicode.
I
Lars Kristan [EMAIL PROTECTED] writes:
But, as I once already said, you can do it with UTF-8, you simply
keep the invalid sequences as they are, and really handle them
differently only when you actually process them or display them.
UTF-8 is painful to process in the first place. You are
Lars Kristan [EMAIL PROTECTED] writes:
And once we understand that things are manageable and not as
frightening as it seems at first, then we can stop using this as an
argument against introducing 128 codepoints. People who will find
them useful should and will bother with the consequences.
Lars Kristan [EMAIL PROTECTED] writes:
My my, you are assuming all files are in the same encoding.
Yes. Otherwise nothing shows filenames correctly to the user.
And what about all the references to the files in scripts?
In configuration files?
Such files rarely use non-ASCII characters.
D. Starner [EMAIL PROTECTED] writes:
But demanding that each program which searches strings checks for
combining classes is, I'm afraid, too much.
How is it any different from a case-insensitive search?
We started from string equality, which somehow changed into searching.
Default string
Philippe Verdy [EMAIL PROTECTED] writes:
It's hard to create a general model that will work for all scripts
encoded in Unicode. There are too many differences. So Unicode just
appears to standardize a higher level of processing with combining
sequences and normalization forms that are better
Lars Kristan [EMAIL PROTECTED] writes:
Please make up your mind: either they are valid and programs are
required to accept them, or they are invalid and programs are required
to reject them.
I don't know what they should be called. The fact is there shouldn't be any.
And that current
Philippe Verdy [EMAIL PROTECTED] writes:
[...]
This was later amended in an errata for XML 1.0 which now says that
the list of code points whose use is *discouraged* (but explicitly
*not* forbidden) for the Char production is now:
[...]
Ugh, it's a mess...
IMHO Unicode is partially to blame,
Lars Kristan [EMAIL PROTECTED] writes:
It's essential that any UTF-n can be translated to any other without
loss of data. Because it allows using an implementation of the given
functionality which represents data in any form, not necessarily the
form we have at hand, as long as correctness
D. Starner [EMAIL PROTECTED] writes:
This implies that every programmer needs an in-depth knowledge of
Unicode to handle simple strings.
There is no way to avoid that.
Then there's no way that we're ever going to get reliable Unicode
support.
This is probably true.
I wonder whether
Lars Kristan [EMAIL PROTECTED] writes:
The other name for this is roundtripping. Currently, Unicode allows
a roundtrip UTF-16 -> UTF-8 -> UTF-16. For any data. But there are
several reasons why a UTF-8 -> UTF-16(32) -> UTF-8 roundtrip is more
valuable, even if it means that the other roundtrip is no
Lars Kristan [EMAIL PROTECTED] writes:
All assigned codepoints do roundtrip even in my concept.
But unassigned codepoints are not valid data.
Please make up your mind: either they are valid and programs are
required to accept them, or they are invalid and programs are required
to reject them.
Arcane Jill [EMAIL PROTECTED] writes:
Here's something that's been bothering me. Suppose I write a function
-
let's call it trim(), which removes leading and trailing spaces from a
string, represented as one of the UTFs. If I've understood this
correctly, I'm supposed to validate the input,
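For an ASCII-only trim() there is a middle ground, sketched here in Python over raw UTF-8 bytes: since bytes below 0x80 never occur inside a multibyte sequence, stripping b' ' cannot corrupt any character, and invalid input passes through untouched instead of forcing a validation decision.

```python
# Sketch: a trim() that strips ASCII spaces from raw UTF-8 bytes.
# It needs no validation: a 0x20 byte can only ever be a real space.
def trim(data: bytes) -> bytes:
    return data.strip(b" ")

assert trim("  café  ".encode("utf-8")) == "café".encode("utf-8")
# Invalid sequences are neither rejected nor altered:
assert trim(b"  \xff bad \xff  ") == b"\xff bad \xff"
```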
D. Starner [EMAIL PROTECTED] writes:
String equality in a programming language should not treat composed
and decomposed forms as equal. Not at this level of abstraction.
This implies that every programmer needs an in-depth knowledge of
Unicode to handle simple strings.
There is no way to avoid
Philippe Verdy [EMAIL PROTECTED] writes:
The XML/HTML core syntax is defined with fixed behavior of some
individual characters like '<', '>', quotation marks, and with special
behavior for spaces.
The point is: what characters mean in this sentence. Code points?
Combining character sequences?
John Cowan [EMAIL PROTECTED] writes:
The XML/HTML core syntax is defined with fixed behavior of some
individual characters like '<', '>', quotation marks, and with special
behavior for spaces.
The point is: what characters mean in this sentence. Code points?
Combining character sequences?
D. Starner [EMAIL PROTECTED] writes:
You could hide combining characters, which would be extremely useful if
we were just using Latin and Cyrillic scripts.
It would need a separate API for examining the contents of a combining
character. You can't avoid the sequence of code points completely.
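A minimal sketch (in Python, names mine) of such grouping: attach each combining mark to the preceding base character, which is roughly what an API that "hides" combining characters would expose.

```python
import unicodedata

def clusters(s):
    """Group a code point sequence into base + combining-mark runs."""
    out = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch          # attach the mark to its base
        else:
            out.append(ch)
    return out

# 'e' + COMBINING ACUTE forms one unit; 'a' is a separate one.
assert clusters("e\u0301a") == ["e\u0301", "a"]
```

Even with such an API, inspecting the contents of a cluster still means looking at the underlying code points.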
Theodore H. Smith [EMAIL PROTECTED] writes:
It's because code points have variable lengths in bytes, so
extracting individual characters is almost meaningless
Same with UTF-16 and UTF-32. A character is multiple code-points,
remember? (decomposed chars?)
Nope. I've done tons of UTF-8
Lars Kristan [EMAIL PROTECTED] writes:
Quite close. Except for the fact that:
* U+EE93 is represented in UTF-32 as 0xEE93
* U+EE93 is represented in UTF-16 as 0xEE93
* U+EE93 is represented in UTF-8 as 0x93 (_NOT_ 0xEE 0xBA 0x93)
Then it would be impossible to represent sequences like
D. Starner [EMAIL PROTECTED] writes:
The semantics there are surprising, but that's true no matter what you
do. An NFC string + an NFC string may not be NFC; the resulting text
doesn't have N+M graphemes.
Which implies that automatically NFC-ing strings as they are processed
would be a bad
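The concatenation pitfall is easy to demonstrate (Python sketch): two strings, each individually in NFC, whose concatenation is not.

```python
import unicodedata

def nfc(s):
    return unicodedata.normalize("NFC", s)

a, b = "e", "\u0301"               # 'e' and a lone COMBINING ACUTE
assert nfc(a) == a and nfc(b) == b  # each is already NFC on its own

# But NFC fuses the concatenation into the precomposed é (U+00E9):
assert nfc(a + b) == "\u00e9"
assert nfc(a + b) != a + b
```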
John Cowan [EMAIL PROTECTED] writes:
String equality in a programming language should not treat composed
and decomposed forms as equal. Not at this level of abstraction.
Well, that assumes that there's a special string equality predicate,
as distinct from just having various predicates that
Lars Kristan [EMAIL PROTECTED] writes:
This is simply what you have to do. You cannot convert the data
into Unicode in a way that says "I don't know how to convert this
data into Unicode". You must either convert it properly, or leave
the data in its original encoding (properly marked,
Philippe Verdy [EMAIL PROTECTED] writes:
The point is that indexing should better be O(1).
SCSU is also O(1) in terms of indexing complexity...
It is not. You can't extract the nth code point without scanning the
previous n-1 code points.
But individual characters do not always have any
Philippe Verdy [EMAIL PROTECTED] writes:
The question is why you would need to extract the nth codepoint so
blindly.
For example I'm scanning a string backwards (to remove '\n' at the
end, to find and display the last N lines of a buffer, to find the
last '/' or last '.' in a file name). SCSU
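UTF-8, by contrast, does support backward scanning, because continuation bytes are self-identifying (10xxxxxx). A Python sketch (function name mine) of finding where the last character begins:

```python
# Find the byte offset where the last character of a UTF-8 string starts,
# by skipping continuation bytes (pattern 10xxxxxx) from the end.
def last_char_start(data: bytes) -> int:
    i = len(data) - 1
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

text = "abcé".encode("utf-8")       # é = 0xC3 0xA9
assert last_char_start(text) == 3
assert text[last_char_start(text):].decode("utf-8") == "é"
```

SCSU has no such property: window state accumulated from the start of the stream determines what a byte means.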
Philippe Verdy [EMAIL PROTECTED] writes:
There's nothing that requires the string storage to use the same
exposed array,
The point is that indexing should better be O(1).
Not having a constant size per code point requires one of three things:
1. Using opaque iterators instead of integer
Philippe Verdy [EMAIL PROTECTED] writes:
Decoding SCSU is very straightforward,
But not for random access by code point index, which is needed by many
string APIs.
Arcane Jill [EMAIL PROTECTED] writes:
Oh for a chip with 21-bit wide registers!
Not 21-bit but 20.087462841250343-bit :-)
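The figure checks out: the Unicode code space has 0x110000 code points (U+0000..U+10FFFF), so the register width needed is log2(1114112).

```python
import math

# log2 of the number of Unicode code points
bits = math.log2(0x110000)
assert abs(bits - 20.087462841250343) < 1e-12
```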
Theodore H. Smith [EMAIL PROTECTED] writes:
Assuming you had no legacy code. And no handy libraries either,
[...]
What would be the nicest UTF to use?
For internals of my language Kogut I've chosen a mixture of ISO-8859-1
and UTF-32. Normalized, i.e. a string with characters which fit in
narrow
Donald Z. Osborn [EMAIL PROTECTED] writes:
Is anyone aware of URLs that use extended Latin characters as examples?
http://w.pl/
[EMAIL PROTECTED] (James Kass) writes:
[...]
If there are eight bits, why shouldn't they be bits one
through eight?
Because then the number of a bit doesn't correspond to the exponent
of its weight, so I even don't know in which order they are specified
(as many people order bits backwards,
In a message dated Sat, 14-08-2004, 12:35 +0200, Philippe Verdy writes:
Simply because, for both Unicode and ISO/IEC 10646, the character
model includes the fact that ANY base character forms a combining
character sequence with ANY following combining character or ZW(N)J
character.
Shouldn't
In a message dated Thu, 12-08-2004, 13:00 -0400, John Cowan writes:
Even better yet: Have the W3C rephrase their demand that no element
should start with a defective sequence (when considered separately)
as that no *block-level* element should etc., and leave things like
span, i and other
In a message dated Tue, 10-08-2004, 18:33 +0100, Jon Hanna writes:
By the rules of XML replacing &#x338; with U+226F would mean the document was
no longer well-formed.
Really? I don't have an XML spec handy, but character references like
&#x338; can't be processed before parsing tags, because &#60; is
In a message dated Thu, 05-08-2004, 15:52 -0500, John Tisdale writes:
Yet, if you are working with an application that must parse and
manipulate text at the byte-level, the costliness of variable length
encoding will probably outweigh the benefits of ASCII compatibility.
In such a case the fixed
In a message dated Tue, 03-08-2004, 13:47 +0200, Theo Veenker writes:
Don't know if this has been asked/reported before, but is the example code
for hangul composition in UAX 15 correct?
I reported it a month ago and got a response stating that "This has been
forwarded to the right people", and
In a message dated Fri, 23-07-2004, 18:01 +0200, Philipp Reichmuth writes:
However, to return to the original problem, I don't remember ever having
seen data where it would be necessary to distinguish between trema and
diaeresis in the data itself.
A similar issue: a Polish encyclopaedia I have
In a message dated Sat, 17-07-2004, 16:46 -0700, Asmus Freytag writes:
I wonder whether that's truly intended, or whether it could be replaced
by a combination of
AccentFolding
OtherDiacriticFolding
where AccentFolding removes *all* nonspacing marks following Latin, Greek
or Cyrillic
In a message dated Fri, 09-07-2004, 19:34 -0700, Asmus Freytag writes:
o-slash, can be analyzed as o and slash, even though that's not done
canonically in Unicode. Allowing users outside Scandinavia to perform
fuzzy searches for words with this character is useful.
In this view of folding,
In a message dated Tue, 06-07-2004, 10:50 +0100, Peter Kirk writes:
I guess another similar change would be Danzig - Gdansk, but
I don't know where the initial G came from so possibly the Polish form
is older than the German.
A name with initial Gd is older than with D:
http://www.unicode.org/reports/tr15/ says:
int SIndex = last - SBase;
if (0 <= SIndex && SIndex < SCount && (SIndex % TCount) == 0) {
    int TIndex = ch - TBase;
    if (0 < TIndex && TIndex < TCount) {
        // make syllable of form LVT
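For readers who want to run it, here is a Python transcription (a sketch; constants as given in UAX #15) of the quoted composition step:

```python
# Standard Hangul composition constants from UAX #15
SBase, LBase, VBase, TBase = 0xAC00, 0x1100, 0x1161, 0x11A7
LCount, VCount, TCount = 19, 21, 28
SCount = LCount * VCount * TCount  # 11172

def compose_trailing(last, ch):
    """Combine an LV syllable `last` with a trailing jamo `ch`,
    returning the LVT syllable, or None if they don't compose."""
    s_index = last - SBase
    if 0 <= s_index < SCount and s_index % TCount == 0:
        t_index = ch - TBase
        if 0 < t_index < TCount:
            return last + t_index   # syllable of form LVT
    return None

# U+AC00 (GA) + U+11A8 (trailing KIYEOK) -> U+AC01 (GAK)
assert compose_trailing(0xAC00, 0x11A8) == 0xAC01
```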
Fri, 28 Sep 2001 09:58:39 -0600, Jim Melton [EMAIL PROTECTED] writes:
I believe this is nothing but a font/glyph/presentation issue.
A font for text mode I once made had the dollar like this:
. . . . . . . . .
. . . # . # . . .
. . . # . # . . .
. . # # # # # . .
. # # . # . # # .
Thu, 20 Sep 2001 12:46:49 -0700 (PDT), Kenneth Whistler [EMAIL PROTECTED] writes:
If you are expecting better performance from a library that takes UTF-8
API's and then does all its internal processing in UTF-8 *without*
converting to UTF-16, then I think you are mistaken. UTF-8 is a bad
form
Wed, 19 Sep 2001 03:47:59 -0700 (PDT), MindTerm [EMAIL PROTECTED] writes:
I would like to ask for any tools to convert HTML
unicode ( e.g. &#nnnn; ) to JAVA unicode ( e.g. \unnnn ) ?
Here is a Perl program which does this:
perl -pe 'BEGIN {sub java ($) {sprintf "\\u%04x", $_[0]}}
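The quoted Perl one-liner is cut off by the archive; a Python sketch of the same conversion (function name mine) might look like this. It handles decimal and hexadecimal references but, like the \unnnn notation itself, only code points up to U+FFFF (above that, Java source needs a surrogate pair).

```python
import re

def html_to_java(text):
    """Turn HTML numeric references like &#233; or &#xE9; into \\unnnn."""
    def repl(m):
        g = m.group(1)
        n = int(g[1:], 16) if g.startswith("x") else int(g)
        return "\\u%04x" % n
    return re.sub(r"&#(x[0-9a-fA-F]+|[0-9]+);", repl, text)

assert html_to_java("caf&#233; &#x2260;") == "caf\\u00e9 \\u2260"
```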
Sun, 16 Sep 2001 01:14:06 -0700, Carl W. Brown [EMAIL PROTECTED] writes:
If it can be demonstrated that there is a real need for an encoding
like CESU-8, then it should be very different from UTF-8. How does
SCSU for example sort?
SCSU encoding is non-deterministic and its representations
Thu, 13 Sep 2001 12:52:04 -0700, Asmus Freytag [EMAIL PROTECTED] writes:
UTF-32 does have the same byte order issues as UTF-16, except that
byte order is recognizable without a BOM.
UTF-8 would be used for external communication almost exclusively.
Especially as it's compatible with ASCII and
Wed, 12 Sep 2001 11:08:41 -0700, Julie Doll Allen [EMAIL PROTECTED] writes:
Proposed Draft Unicode Technical Report #26: Compatibility Encoding
Scheme for UTF-16: 8-Bit (CESU-8) is now available at:
http://www.unicode.org/unicode/reports/tr26/
IMHO Unicode would have been a better standard if
Mon, 10 Sep 2001 10:47:48 +0200, Marco Cimarosti [EMAIL PROTECTED] writes:
It's as weird as some Italian names for German cities: Aquisgrana
for Aachen, Augusta for Augsburg, Magonza for Mainz, Monaco (di
Baviera) for München.
Interesting that Polish names of these cities are more like Italian
Wed, 22 Aug 2001 15:59:15 -0700, Michael (michka) Kaplan [EMAIL PROTECTED] writes:
Functions ConvertUCS4toUTF8 and ConvertUTF8toUCS4 use surrogates
in UCS4. In particular ConvertUTF8toUCS4 converts a character above
U+FFFF into two UCS4 words. Why is this absurd there?!
UCS-4 has no
Sat, 14 Jul 2001 11:51:29 +0100, Michael Everson [EMAIL PROTECTED] writes:
References to animals are the most common. Germans, Dutch, Finns,
Hungarians, Poles and South Africans see it as a monkey tail.
Indeed it's commonly called "monkey" in Polish (in parallel with "at"),
but some call it
Fri, 13 Jul 2001 03:01:10 EDT, [EMAIL PROTECTED] [EMAIL PROTECTED] writes:
Unfortunately, you don't hear much about SCSU, and in particular
the Unicode Consortium doesn't really seem to promote it much
(although they may be trying to avoid the "too many UTF's" syndrome).
SCSU doesn't look
7 Jul 2001 11:01:18 GMT, Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] writes:
I put a sample at http://qrczak.ids.net.pl/vi-001.gif
Now I put a prettier version there: with variable line width, serifs,
and a slightly improved sizing engine (enlargement of rounded parts
to make them look
In a message dated 2001-07-06 0:31:39 Pacific Daylight Time, [EMAIL PROTECTED]
writes:
I wonder: why aren't languages with simple syllabic structures
written in hiragana? It seems to be built for them.
For 10 years I have been using my own script, inspired by hiragana,
for writing Polish. It looks
Tue, 3 Jul 2001 11:19:05 +0100, Michael Everson [EMAIL PROTECTED] writes:
I would be glad if the resolution allowed UTF-8 and UTF-32 encoders and
decoders to not worry about surrogates at all. Please leave surrogate
issues to UTF-16.
But what if I want to put up a Web page in Etruscan?
UTF-8
27 Jun 2001 13:38:33 +0100, Gaute B Strokkenes [EMAIL PROTECTED] writes:
I would be indebted if any of the experts who hang out on the
unicode list could sort out this confusion.
I would be glad if the resolution allowed UTF-8 and UTF-32 encoders and
decoders to not worry about surrogates at
Tue, 3 Jul 2001 01:50:56 -0700, Michael (michka) Kaplan [EMAIL PROTECTED] writes:
It's a pity that UTF-16 doesn't encode characters up to U+F, such
that code points corresponding to lone surrogates can be encoded as
pairs of surrogates.
Unfortunately, we would then be stuck with what
Mon, 25 Jun 2001 07:24:28 -0700, Mark Davis [EMAIL PROTECTED] writes:
In most people's experience, it is best to leave the low level interfaces
with indices in terms of code units, then supply some utility routines that
tell you information about code points.
It's yet better to work on
Tue, 17 Apr 2001 07:33:16 +0100, William Overington [EMAIL PROTECTED]
writes:
In Java source code one may currently represent a 16 bit unicode character
by using \uhhhh where each h is any hexadecimal character.
How will Java, and maybe other languages, represent 21 bit unicode
characters?
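For what it's worth, Java kept the 16-bit \uhhhh escapes: a character above U+FFFF appears in source as its UTF-16 surrogate pair. A sketch of the arithmetic (function name mine):

```python
def surrogate_pair(cp):
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

# U+1D11E MUSICAL SYMBOL G CLEF appears as "\uD834\uDD1E" in Java source.
assert surrogate_pair(0x1D11E) == (0xD834, 0xDD1E)
```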
Wed, 28 Feb 2001 13:35:17 -0800 (GMT-0800), Pierpaolo BERNARDI
[EMAIL PROTECTED] writes:
The initial character of the name is transliterated as CH in English,
TCH in French, TSCH in German, C or CI in Italian, C WITH CARON in the
official Russian transliteration.
And CZ in Polish.
Mon, 5 Feb 2001 08:20:43 -0800 (GMT-0800), Mark Davis [EMAIL PROTECTED] writes:
The topic came up in a UTC meeting some time ago, a "UTF-8S". The
motivation was for performance (having a form that reproduces the
binary order of UTF-16).
This is unfair: it slows down the conversion UTF-8 -
Mon, 15 Jan 2001 13:09:47 -0800 (GMT-0800), G. Adam Stanislav [EMAIL PROTECTED]
writes:
I would not be surprised if speakers of certain Slavic languages even
changed the SPELLING to Unikod (with an acute over the [o]), as they
have done with other imported words (such as futbal for football).
Fri, 12 Jan 2001 07:28:18 -0800 (GMT-0800), Mark Davis [EMAIL PROTECTED] writes:
According to the references I have, the prefix "uni" is directly from
Latin while the word "code" is through French. The Indo-European would
have been *oi-no-kau-do ("give one strike"): *kau apparently being
Sun, 21 Jan 2001 09:29:56 -0800 (GMT-0800), Rob Hardy [EMAIL PROTECTED]
writes:
[Polish set] contains the line
0x5B 0x01B5 # LATIN CAPITAL LETTER Z WITH STROKE
should supposedly be
0x5B 0x017B # LATIN CAPITAL LETTER Z WITH DOT ABOVE
My teletext spec definitely has a Z with a stroke.
Mon, 23 Oct 2000 09:48:52 +0100, [EMAIL PROTECTED] [EMAIL PROTECTED]
writes:
isDigit: Nd
isHexDigit: '0'..'9', 'A'..'F', 'a'..'f'
isDecDigit: '0'..'9'
isOctDigit: '0'..'7'
The definition "Nd" is what I would have proposed for isDecDigit.
The name isDecDigit is confusing indeed...
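The Nd-based definition can be sketched in Python via unicodedata (function names mine, mirroring the proposed predicates):

```python
import unicodedata

def is_digit(ch):
    """Digit in the Unicode sense: general category Nd."""
    return unicodedata.category(ch) == "Nd"

def is_dec_digit(ch):
    """ASCII decimal digit only, as in '0'..'9'."""
    return "0" <= ch <= "9"

assert is_digit("٣")                 # ARABIC-INDIC DIGIT THREE, category Nd
assert not is_dec_digit("٣")         # but not an ASCII digit
assert is_digit("7") and is_dec_digit("7")
```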
Wed, 11 Oct 2000 07:15:05 -0800 (GMT-0800), Mark Davis [EMAIL PROTECTED] writes:
Here is my take on the way Unicode general categories should be
mapped to POSIX ones.
Reiterated, here is my compilation of mapping of properties proposed
for Haskell:
isAssigned: all except Cs, Cn
isControl:
Wed, 4 Oct 2000 18:48:17 -0700 (PDT), Kenneth Whistler [EMAIL PROTECTED] writes:
It is quite clear that many important character properties cannot
be deduced from the General Category values in UnicodeData.txt alone.
What a pity. Especially as it does work for some properties and I
would like
Fri, 22 Sep 2000 22:11:44 -0800 (GMT-0800), Roozbeh Pournader
[EMAIL PROTECTED] writes:
intToDigit should look at the locale to select the preferred digit
form, I think.
Sorry, it cannot apply to Haskell, because it's a functional language.
It must work the same way all the time, unless it
Thu, 21 Sep 2000 23:55:24 +0330 (IRT), Roozbeh Pournader [EMAIL PROTECTED]
writes:
isDigit intentionally recognizes ASCII digits only. IMHO it's more
often needed and this is what the Haskell 98 Report says. (But I
don't follow the report in some other cases.)
Would you please give me
I am trying to improve character properties handling in the language
Haskell. What should the following functions return, i.e. what is
most standard/natural/preferred mapping between Unicode character
categories and predicates like isalpha etc.? What else should be
provided? Here are definitions