Re: Roundtripping Solved

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
Arcane Jill [EMAIL PROTECTED] writes: OBSERVATION - Requirement (4) is not met absolutely, however, the probability of the UTF-8 encoding of this sequence occurring accidentally at an arbitrary offset in an arbitrary octet stream is approximately one in 2^384; Assuming that the distribution of

Re: Roundtripping in Unicode

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
Arcane Jill [EMAIL PROTECTED] writes: Unix makes it possible for /you/ to change /your/ locale - but by your reasoning, this is an error, unless all other users do so simultaneously. Not necessarily: you can change the locale as long as it uses the same default encoding. By error I mean a

Re: Roundtripping in Unicode

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes: OK, strcpy does not need to interpret UTF-8. But strchr probably should. No. Its argument is a byte, even though it's passed as type int. By byte here I mean C char value, which is an octet in virtually all modern C implementations; the C standard doesn't
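
The point about strchr can be illustrated with a minimal sketch. It assumes the text is valid UTF-8 and the character searched for is ASCII; the helper name find_ascii_byte and the sample file name are made up for illustration. Because every byte of a multi-byte UTF-8 sequence is 0x80 or above, a plain byte search for an ASCII character cannot match inside another character.

```python
def find_ascii_byte(data: bytes, ch: str) -> int:
    """Return the byte offset of the first ASCII character ch in data, or -1.

    Works like C strchr: it compares raw bytes and never decodes UTF-8.
    """
    needle = ch.encode("ascii")  # raises UnicodeEncodeError for non-ASCII input
    if len(needle) != 1:
        raise ValueError("expected exactly one ASCII character")
    return data.find(needle)


# Hypothetical file name with Polish letters, encoded as UTF-8.
name = "za\u017c\u00f3\u0142\u0107/plik.txt".encode("utf-8")
slash = find_ascii_byte(name, "/")
directory = name[:slash].decode("utf-8")
```

The slice at the returned offset falls on a character boundary, so the prefix decodes cleanly.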

Re: Roundtripping Solved

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
Peter Kirk [EMAIL PROTECTED] writes: Jill, again your solution is ingenious. But would it not work just as well for Lars' purposes to use, instead of your string of random characters, just ONE reserved code point followed by U+0xx? Instead of asking the UTC to allocate a specific code

Re: Roundtripping in Unicode

2004-12-14 Thread Marcin 'Qrczak' Kowalczyk
Arcane Jill [EMAIL PROTECTED] writes: OBSERVATION - Roundtripping is possible in the direction NOT-UTF-8 -> NOT-UTF-16 -> NOT-UTF-8 But it's not possible in the direction NOT-UTF-16 -> NOT-UTF-8 -> NOT-UTF-16, unless you define valid sequences of NOT-UTF-16 in an awkward way which would happen to

Re: Roundtripping in Unicode

2004-12-14 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes: Hm, here lies the catch. According to UTC, you need to keep processing the UNIX filenames as BINARY data. And, also according to UTC, any UTF-8 function is allowed to reject invalid sequences. Basically, you are not supposed to use strcpy to process

Re: Roundtripping in Unicode

2004-12-14 Thread Marcin 'Qrczak' Kowalczyk
Arcane Jill [EMAIL PROTECTED] writes: If so, Marcin, what exactly is the error, and whose fault is it? It's an error to use locales with different encodings on the same system. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/

Unicode filenames and other external strings on Unix - existing practice

2004-12-14 Thread Marcin 'Qrczak' Kowalczyk
I describe here languages which exclusively use Unicode strings. Some languages have both byte strings and Unicode strings (e.g. Python) and then byte strings are generally used for strings exchanged with the OS, the programmer is responsible for the conversion if he wishes to use Unicode. I

Re: Roundtripping in Unicode

2004-12-13 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes: But, as I once already said, you can do it with UTF-8, you simply keep the invalid sequences as they are, and really handle them differently only when you actually process them or display them. UTF-8 is painful to process in the first place. You are

Re: Roundtripping in Unicode

2004-12-13 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes: And once we understand that things are manageable and not as frightening as it seems at first, then we can stop using this as an argument against introducing 128 codepoints. People who will find them useful should and will bother with the consequences.

Re: Nicest UTF

2004-12-12 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes: My my, you are assuming all files are in the same encoding. Yes. Otherwise nothing shows filenames correctly to the user. And what about all the references to the files in scripts? In configuration files? Such files rarely use non-ASCII characters.

Re: Nicest UTF

2004-12-12 Thread Marcin 'Qrczak' Kowalczyk
D. Starner [EMAIL PROTECTED] writes: But demanding that each program which searches strings checks for combining classes is I'm afraid too much. How is it any different from a case-insensitive search? We started from string equality, which somehow changed into searching. Default string

Re: Nicest UTF

2004-12-12 Thread Marcin 'Qrczak' Kowalczyk
Philippe Verdy [EMAIL PROTECTED] writes: It's hard to create a general model that will work for all scripts encoded in Unicode. There are too many differences. So Unicode just appears to standardize a higher level of processing with combining sequences and normalization forms that are better

Re: Roundtripping in Unicode

2004-12-12 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes: Please make up your mind: either they are valid and programs are required to accept them, or they are invalid and programs are required to reject them. I don't know what they should be called. The fact is there shouldn't be any. And that current

Re: Nicest UTF

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Philippe Verdy [EMAIL PROTECTED] writes: [...] This was later amended in an errata for XML 1.0 which now says that the list of code points whose use is *discouraged* (but explicitly *not* forbidden) for the Char production is now: [...] Ugh, it's a mess... IMHO Unicode is partially to blame,

Re: Roundtripping in Unicode

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes: It's essential that any UTF-n can be translated to any other without loss of data. Because it allows to use an implementation of the given functionality which represents data in any form, not necessarily the form we have at hand, as long as correctness

Re: Nicest UTF

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
D. Starner [EMAIL PROTECTED] writes: This implies that every programmer needs an in-depth knowledge of Unicode to handle simple strings. There is no way to avoid that. Then there's no way that we're ever going to get reliable Unicode support. This is probably true. I wonder whether

Re: Roundtripping in Unicode

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes: The other name for this is roundtripping. Currently, Unicode allows a roundtrip UTF-16=>UTF-8=>UTF-16. For any data. But there are several reasons why a UTF-8=>UTF-16(32)=>UTF-8 roundtrip is more valuable, even if it means that the other roundtrip is no

Re: Roundtripping in Unicode

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes: All assigned codepoints do roundtrip even in my concept. But unassigned codepoints are not valid data. Please make up your mind: either they are valid and programs are required to accept them, or they are invalid and programs are required to reject them.

Re: When to validate?

2004-12-10 Thread Marcin 'Qrczak' Kowalczyk
Arcane Jill [EMAIL PROTECTED] writes: Here's something that's been bothering me. Suppose I write a function - let's call it trim(), which removes leading and trailing spaces from a string, represented as one of the UTFs. If I've understood this correctly, I'm supposed to validate the input,

Re: Nicest UTF

2004-12-10 Thread Marcin 'Qrczak' Kowalczyk
D. Starner [EMAIL PROTECTED] writes: String equality in a programming language should not treat composed and decomposed forms as equal. Not this level of abstraction. This implies that every programmer needs an in-depth knowledge of Unicode to handle simple strings. There is no way to avoid

Re: Nicest UTF

2004-12-10 Thread Marcin 'Qrczak' Kowalczyk
Philippe Verdy [EMAIL PROTECTED] writes: The XML/HTML core syntax is defined with fixed behavior of some individual characters like '', '', quotation marks, and with special behavior for spaces. The point is: what characters mean in this sentence. Code points? Combining character sequences?

Re: Nicest UTF

2004-12-10 Thread Marcin 'Qrczak' Kowalczyk
John Cowan [EMAIL PROTECTED] writes: The XML/HTML core syntax is defined with fixed behavior of some individual characters like '', '', quotation marks, and with special behavior for spaces. The point is: what characters mean in this sentence. Code points? Combining character sequences?

Re: Nicest UTF

2004-12-08 Thread Marcin 'Qrczak' Kowalczyk
D. Starner [EMAIL PROTECTED] writes: You could hide combining characters, which would be extremely useful if we were just using Latin and Cyrillic scripts. It would need a separate API for examining the contents of a combining character. You can't avoid the sequence of code points completely.

Re: If only MS Word was coded this well

2004-12-08 Thread Marcin 'Qrczak' Kowalczyk
Theodore H. Smith [EMAIL PROTECTED] writes: It's because code points have variable lengths in bytes, so extracting individual characters is almost meaningless Same with UTF-16 and UTF-32. A character is multiple code-points, remember? (decomposed chars?) Nope. I've done tons of UTF-8

Re: Invalid UTF-8 sequences

2004-12-08 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes: Quite close. Except for the fact that: * U+EE93 is represented in UTF-32 as 0xEE93 * U+EE93 is represented in UTF-16 as 0xEE93 * U+EE93 is represented in UTF-8 as 0x93 (_NOT_ 0xEE 0xBA 0x93) Then it would be impossible to represent sequences like
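
For reference, the standard encodings of U+EE93 can be checked with a short sketch using Python's built-in codecs (big-endian forms are chosen here only so that no BOM is emitted). It shows the regular three-byte UTF-8 form 0xEE 0xBA 0x93, which the quoted proposal would replace with the single byte 0x93.

```python
ch = "\uee93"  # a Private Use Area code point

utf8 = ch.encode("utf-8")       # three bytes: lead byte plus two continuation bytes
utf16 = ch.encode("utf-16-be")  # one 16-bit code unit
utf32 = ch.encode("utf-32-be")  # one 32-bit code unit

for label, raw in (("UTF-8", utf8), ("UTF-16", utf16), ("UTF-32", utf32)):
    print(label, raw.hex(" "))
```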

Re: Nicest UTF

2004-12-08 Thread Marcin 'Qrczak' Kowalczyk
D. Starner [EMAIL PROTECTED] writes: The semantics there are surprising, but that's true no matter what you do. An NFC string + an NFC string may not be NFC; the resulting text doesn't have N+M graphemes. Which implies that automatically NFC-ing strings as they are processed would be a bad
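
The claim that NFC is not closed under concatenation can be checked with a small sketch based on Python's unicodedata module. The helper is_nfc is introduced here only for illustration.

```python
import unicodedata


def is_nfc(s: str) -> bool:
    """True if s is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", s) == s


left = "e"        # LATIN SMALL LETTER E, already in NFC
right = "\u0301"  # COMBINING ACUTE ACCENT alone, also in NFC

joined = left + right                            # two code points, not NFC
composed = unicodedata.normalize("NFC", joined)  # one code point, U+00E9
```

Each operand is in NFC on its own, yet their concatenation is not, and normalizing it yields a single code point rather than two.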

Re: Nicest UTF

2004-12-08 Thread Marcin 'Qrczak' Kowalczyk
John Cowan [EMAIL PROTECTED] writes: String equality in a programming language should not treat composed and decomposed forms as equal. Not this level of abstraction. Well, that assumes that there's a special string equality predicate, as distinct from just having various predicates that

Re: Nicest UTF

2004-12-06 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes: This is simply what you have to do. You cannot convert the data into Unicode in a way that says I don't know how to convert this data into Unicode. You must either convert it properly, or leave the data in its original encoding (properly marked,

Re: Nicest UTF

2004-12-05 Thread Marcin 'Qrczak' Kowalczyk
Philippe Verdy [EMAIL PROTECTED] writes: The point is that indexing should better be O(1). SCSU is also O(1) in terms of indexing complexity... It is not. You can't extract the nth code point without scanning the previous n-1 code points. But individual characters do not always have any
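
The indexing argument can be sketched as follows. UTF-8 stands in here for any variable-width encoding (the thread itself is about SCSU, whose decoder is more involved), the input is assumed to be valid, and both function names are hypothetical. Finding the nth code point in the variable-width form needs a scan from the start, while the fixed-width form needs only a slice.

```python
def nth_code_point_utf8(data: bytes, n: int) -> str:
    """Return code point number n (0-based) by scanning from the start: O(n)."""
    count = -1
    start = None
    for i, b in enumerate(data):
        if b & 0xC0 != 0x80:           # not a continuation byte: a new code point begins
            count += 1
            if count == n:
                start = i
            elif count == n + 1:       # the next code point begins: the target ends here
                return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return data[start:].decode("utf-8")  # the target was the last code point


def nth_code_point_utf32(data: bytes, n: int) -> str:
    """Return code point number n (0-based) from UTF-32BE data by slicing: O(1)."""
    if not 0 <= 4 * n < len(data):
        raise IndexError("code point index out of range")
    return data[4 * n:4 * n + 4].decode("utf-32-be")


text = "a\u0142\U00010300z"  # 1-, 2-, 4- and 1-byte sequences in UTF-8
as_utf8 = text.encode("utf-8")
as_utf32 = text.encode("utf-32-be")
```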

Re: Nicest UTF

2004-12-05 Thread Marcin 'Qrczak' Kowalczyk
Philippe Verdy [EMAIL PROTECTED] writes: The question is why you would need to extract the nth codepoint so blindly. For example I'm scanning a string backwards (to remove '\n' at the end, to find and display the last N lines of a buffer, to find the last '/' or last '.' in a file name). SCSU

Re: Nicest UTF

2004-12-04 Thread Marcin 'Qrczak' Kowalczyk
Philippe Verdy [EMAIL PROTECTED] writes: There's nothing that requires the string storage to use the same exposed array, The point is that indexing should better be O(1). Not having a constant size per code point requires one of three things: 1. Using opaque iterators instead of integer

Re: Nicest UTF

2004-12-03 Thread Marcin 'Qrczak' Kowalczyk
Philippe Verdy [EMAIL PROTECTED] writes: Decoding SCSU is very straightforward, But not for random access by code point index, which is needed by many string APIs.

Re: Nicest UTF

2004-12-02 Thread Marcin 'Qrczak' Kowalczyk
Arcane Jill [EMAIL PROTECTED] writes: Oh for a chip with 21-bit wide registers! Not 21-bit but 20.087462841250343-bit :-)
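
The figure in the reply is the base-2 logarithm of the size of the Unicode code space, U+0000 through U+10FFFF. A quick check in Python:

```python
import math

code_points = 0x10FFFF + 1     # 1,114,112 code points = 17 planes of 65,536 each
bits = math.log2(code_points)  # equals 16 + log2(17)

print(bits)
```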

Re: Nicest UTF

2004-12-01 Thread Marcin 'Qrczak' Kowalczyk
Theodore H. Smith [EMAIL PROTECTED] writes: Assuming you had no legacy code. And no handy libraries either, [...] What would be the nicest UTF to use? For internals of my language Kogut I've chosen a mixture of ISO-8859-1 and UTF-32. Normalized, i.e. a string with characters which fit in narrow

Re: Unicode IDNs

2004-11-09 Thread Marcin 'Qrczak' Kowalczyk
Donald Z. Osborn [EMAIL PROTECTED] writes: Is anyone aware of URLs that use extended Latin characters as examples? http://w.pl/

Re: bit notation in ISO-8859-x is wrong

2004-10-10 Thread Marcin 'Qrczak' Kowalczyk
[EMAIL PROTECTED] (James Kass) writes: [...] If there are eight bits, why shouldn't they be bits one through eight? Because then the number of a bit doesn't correspond to the exponent of its weight, so I don't even know in which order they are specified (as many people order bits backwards,

Re: XML and Unicode interoperability comes before HTML or even SGML (was: Combining across markup?)

2004-08-14 Thread Marcin 'Qrczak' Kowalczyk
In a message of Sat, 14-08-2004, 12:35 +0200, Philippe Verdy wrote: Simply because, for both Unicode and ISO/IEC 10646, the character model includes the fact that ANY base character forms a combining character sequence with ANY following combining character or ZW(N)J character. Shouldn't

Re: Combining across markup?

2004-08-12 Thread Marcin 'Qrczak' Kowalczyk
In a message of Thu, 12-08-2004, 13:00 -0400, John Cowan wrote: Even better yet: Have the W3C rephrase their demand that no element should start with a defective sequence (when considered in separate) as that no *block-level* element should etc., and leave things like span, i and other

RE: Combining across markup? (Was: RE: sign for anti-neutrino - g ree k nu with diacritical line aboveworkaround ?)

2004-08-10 Thread Marcin 'Qrczak' Kowalczyk
In a message of Tue, 10-08-2004, 18:33 +0100, Jon Hanna wrote: By the rules of XML replacing &#x338; with U+226F would mean the document was no longer well-formed. Really? I don't have an XML spec handy, but character references like &#x338; can't be processed before parsing tags, because &#60; is

Re: Microsoft Unicode Article Review

2004-08-06 Thread Marcin 'Qrczak' Kowalczyk
In a message of Thu, 05-08-2004, 15:52 -0500, John Tisdale wrote: Yet, if you are working with an application that must parse and manipulate text at the byte-level, the costliness of variable length encoding will probably outweigh the benefits of ASCII compatibility. In such a case the fixed

Re: UAX 15 hangul composition

2004-08-03 Thread Marcin 'Qrczak' Kowalczyk
In a message of Tue, 03-08-2004, 13:47 +0200, Theo Veenker wrote: Don't know if this has been asked/reported before, but is the example code for hangul composition in UAX 15 correct? I reported it a month ago and got a response stating that This has been forwarded to the right people, and

Re: Umlaut and Tréma, was: Variation selectors and vowel marks

2004-07-23 Thread Marcin 'Qrczak' Kowalczyk
In a message of Fri, 23-07-2004, 18:01 +0200, Philipp Reichmuth wrote: However, to return to the original problem, I don't remember ever having seen data where it would be necessary to distinguish between trema and diaeresis in the data itself. A similar issue: a Polish encyclopaedia I have

Re: Folding algorithm and canonical equivalence

2004-07-18 Thread Marcin 'Qrczak' Kowalczyk
In a message of Sat, 17-07-2004, 16:46 -0700, Asmus Freytag wrote: I wonder whether that's truly intended, or whether it could be replaced by a combination of AccentFolding OtherDiacriticFolding where AccentFolding removes *all* nonspacing marks following Latin, Greek or Cyrillic

Re: Looking for transcription or transliteration standards latin- arabic

2004-07-10 Thread Marcin 'Qrczak' Kowalczyk
In a message of Fri, 09-07-2004, 19:34 -0700, Asmus Freytag wrote: o-slash, can be analyzed as o and slash, even though that's not done canonically in Unicode. Allowing users outside Scandinavia to perform fuzzy searches for words with this character is useful. In this view of folding,

Re: Looking for transcription or transliteration standards latin- arabic

2004-07-06 Thread Marcin 'Qrczak' Kowalczyk
In a message of Tue, 06-07-2004, 10:50 +0100, Peter Kirk wrote: I guess another similar change would be Danzig - Gdansk, but I don't know where the initial G came from so possibly the Polish form is older than the German. A name with initial Gd is older than with D:

Error in Hangul composition code

2004-07-05 Thread Marcin 'Qrczak' Kowalczyk
http://www.unicode.org/reports/tr15/ says: int SIndex = last - SBase; if (0 <= SIndex && SIndex < SCount && (SIndex % TCount) == 0) { int TIndex = ch - TBase; if (0 <= TIndex && TIndex <= TCount) { // make syllable of form LVT
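
The snippet above is cut off before it names the error, so the following is only a sketch of what the problem appears to be: the trailing-consonant test as quoted also accepts TIndex == 0 (U+11A7, which is not a trailing consonant) and TIndex == TCount (U+11C3, which would overflow into the next LV syllable). The Python version below uses the strict bounds 0 < TIndex < TCount; the constants are the standard Hangul composition constants, and the function name compose_hangul is chosen for illustration.

```python
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT  # 11172 precomposed syllables


def compose_hangul(source: str) -> str:
    """Compose conjoining jamo into precomposed Hangul syllables."""
    if not source:
        return source
    result = [source[0]]
    for ch in source[1:]:
        last, cur = ord(result[-1]), ord(ch)

        # 1. Leading consonant + vowel -> LV syllable.
        l_index = last - L_BASE
        if 0 <= l_index < L_COUNT:
            v_index = cur - V_BASE
            if 0 <= v_index < V_COUNT:
                result[-1] = chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT)
                continue

        # 2. LV syllable + trailing consonant -> LVT syllable.
        s_index = last - S_BASE
        if 0 <= s_index < S_COUNT and s_index % T_COUNT == 0:
            t_index = cur - T_BASE
            if 0 < t_index < T_COUNT:  # strict bounds: exclude U+11A7 and U+11C3
                result[-1] = chr(last + t_index)
                continue

        result.append(ch)
    return "".join(result)
```

With the bounds as quoted, U+AC00 followed by U+11C3 would compose to U+AC1C, a different LV syllable, instead of being left alone.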

Re: Shape of the US Dollar Sign

2001-09-28 Thread Marcin 'Qrczak' Kowalczyk
Fri, 28 Sep 2001 09:58:39 -0600, Jim Melton [EMAIL PROTECTED] writes: I believe this is nothing but a font/glyph/presentation issue. A font for text mode I once made had the dollar like this:
. . . . . . . . .
. . . # . # . . .
. . . # . # . . .
. . # # # # # . .
. # # . # . # # .

Re: 3rd-party cross-platform UTF-8 support

2001-09-22 Thread Marcin 'Qrczak' Kowalczyk
Thu, 20 Sep 2001 12:46:49 -0700 (PDT), Kenneth Whistler [EMAIL PROTECTED] writes: If you are expecting better performance from a library that takes UTF-8 API's and then does all its internal processing in UTF-8 *without* converting to UTF-16, then I think you are mistaken. UTF-8 is a bad form

Re: Any tools to convert HTML unicode to JAVA unicode

2001-09-22 Thread Marcin 'Qrczak' Kowalczyk
Wed, 19 Sep 2001 03:47:59 -0700 (PDT), MindTerm [EMAIL PROTECTED] writes: I would like to ask any tools to convert HTML unicode ( e.g. &# n n n n ) to JAVA unicode ( e.g. \u n n n n ) ? Here is a Perl program which does this: perl -pe 'BEGIN {sub java ($) {sprintf "\\u%04x", $_[0]}}
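
The Perl one-liner above is cut off, so the following is an independent sketch of the same conversion in Python, not a reconstruction of the original program. It assumes decimal numeric character references of the form &#NNNN; and writes code points above U+FFFF as a surrogate pair, since a Java \u escape holds exactly four hex digits. The names java_escape and html_refs_to_java are made up for this example.

```python
import re


def java_escape(cp: int) -> str:
    """Format one code point as Java \\uXXXX escape(s)."""
    if cp > 0xFFFF:
        cp -= 0x10000
        high = 0xD800 + (cp >> 10)    # high (leading) surrogate
        low = 0xDC00 + (cp & 0x3FF)   # low (trailing) surrogate
        return "\\u%04x\\u%04x" % (high, low)
    return "\\u%04x" % cp


def html_refs_to_java(text: str) -> str:
    """Replace decimal references such as &#321; with Java escapes."""
    return re.sub(r"&#(\d+);", lambda m: java_escape(int(m.group(1))), text)


print(html_refs_to_java("&#321;&#243;d&#378;"))
```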

Re: CESU-8 vs UTF-8

2001-09-16 Thread Marcin 'Qrczak' Kowalczyk
Sun, 16 Sep 2001 01:14:06 -0700, Carl W. Brown [EMAIL PROTECTED] writes: If it can be demonstrated that there is a real need for an encoding like CESU-8 then it should be very different from UTF-8. How does SCSU for example sort? SCSU encoding is non-deterministic and its representations

Re: PDUTR #26 posted

2001-09-14 Thread Marcin 'Qrczak' Kowalczyk
Thu, 13 Sep 2001 12:52:04 -0700, Asmus Freytag [EMAIL PROTECTED] writes: UTF-32 does have the same byte order issues as UTF-16, except that byte order is recognizable without a BOM. UTF-8 would be used for external communication almost exclusively. Especially as it's compatible with ASCII and

Re: PDUTR #26 posted

2001-09-13 Thread Marcin 'Qrczak' Kowalczyk
Wed, 12 Sep 2001 11:08:41 -0700, Julie Doll Allen [EMAIL PROTECTED] writes: Proposed Draft Unicode Technical Report #26: Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) is now available at: http://www.unicode.org/unicode/reports/tr26/ IMHO Unicode would have been a better standard if

Re: [OT] o-circumflex

2001-09-10 Thread Marcin 'Qrczak' Kowalczyk
Mon, 10 Sep 2001 10:47:48 +0200, Marco Cimarosti [EMAIL PROTECTED] writes: It's as weird as some Italian names for German cities: Aquisgrana for Aachen, Augusta for Augsburg, Magonza for Mainz, Monaco (di Baviera) for München. Interesting that Polish names of these cities are more like Italian

Re: Nonsense in http://www.unicode.org/Public/PROGRAMS/CVTUTF/CVTUTF.C?

2001-08-25 Thread Marcin 'Qrczak' Kowalczyk
Wed, 22 Aug 2001 15:59:15 -0700, Michael (michka) Kaplan [EMAIL PROTECTED] writes: Functions ConvertUCS4toUTF8 and ConvertUTF8toUCS4 use surrogates in UCS4. In particular ConvertUTF8toUCS4 converts a character above U+ into two UCS4 words. Why is this absurd there?! UCS-4 has no

Re: COMMERCIAL AT

2001-07-15 Thread Marcin 'Qrczak' Kowalczyk
Sat, 14 Jul 2001 11:51:29 +0100, Michael Everson [EMAIL PROTECTED] writes: References to animals are the most common. Germans, Dutch, Finns, Hungarians, Poles and South Africans see it as a monkey tail. Indeed it's commonly called "monkey" in Polish (in parallel with "at"), but some call it

Re: More about SCSU (was: Re: A UTF-8 based News Service)

2001-07-13 Thread Marcin 'Qrczak' Kowalczyk
Fri, 13 Jul 2001 03:01:10 EDT, [EMAIL PROTECTED] [EMAIL PROTECTED] writes: Unfortunately, you don't hear much about SCSU, and in particular the Unicode Consortium doesn't really seem to promote it much (although they may be trying to avoid the "too many UTF's" syndrome). SCSU doesn't look

Re: Terms constructed script, invented script (was: FW: Re: Shavian)

2001-07-11 Thread Marcin 'Qrczak' Kowalczyk
7 Jul 2001 11:01:18 GMT, Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] writes: I put a sample at http://qrczak.ids.net.pl/vi-001.gif Now I put a prettier version there: with variable line width, serifs, and by a slightly improved sizing engine (enlargement of rounded parts to make them look

Re: Terms constructed script, invented script (was: FW: Re: Shavian)

2001-07-07 Thread Marcin 'Qrczak' Kowalczyk
In a message dated 2001-07-06 0:31:39 Pacific Daylight Time, [EMAIL PROTECTED] writes: I wonder: why aren't languages with simple syllabic structures written in hiragana? It seems to be built for them. I am using my own script inspired by hiragana 10 years ago for writing Polish. It looks

Re: validity of lone surrogates

2001-07-04 Thread Marcin 'Qrczak' Kowalczyk
Tue, 3 Jul 2001 11:19:05 +0100, Michael Everson [EMAIL PROTECTED] writes: I would be glad if the resolution allowed UTF-8 and UTF-32 encoders and decoders to not worry about surrogates at all. Please leave surrogate issues to UTF-16. But what if I want to put up a Web page in Etruscan? UTF-8

Re: validity of lone surrogates (was Re: Unicode surroga tes: just say no!)

2001-07-03 Thread Marcin 'Qrczak' Kowalczyk
27 Jun 2001 13:38:33 +0100, Gaute B Strokkenes [EMAIL PROTECTED] writes: I would be indebted if any of the experts who hang out on the unicode list could sort out this confusion. I would be glad if the resolution allowed UTF-8 and UTF-32 encoders and decoders to not worry about surrogates at

Re: validity of lone surrogates (was Re: Unicode surrogates: just say no!)

2001-07-03 Thread Marcin 'Qrczak' Kowalczyk
Tue, 3 Jul 2001 01:50:56 -0700, Michael (michka) Kaplan [EMAIL PROTECTED] writes: It's a pity that UTF-16 doesn't encode characters up to U+F, such that code points corresponding to lone surrogates can be encoded as pairs of surrogates. Unfortunately, we would then be stuck with what

Re: How does Python Unicode treat surrogates?

2001-06-25 Thread Marcin 'Qrczak' Kowalczyk
Mon, 25 Jun 2001 07:24:28 -0700, Mark Davis [EMAIL PROTECTED] writes: In most people's experience, it is best to leave the low level interfaces with indices in terms of code units, then supply some utility routines that tell you information about code points. It's yet better to work on

Re: How will software source code represent 21 bit unicode characters?

2001-04-17 Thread Marcin 'Qrczak' Kowalczyk
Tue, 17 Apr 2001 07:33:16 +0100, William Overington [EMAIL PROTECTED] writes: In Java source code one may currently represent a 16 bit unicode character by using \uhhhh where each h is any hexadecimal character. How will Java, and maybe other languages, represent 21 bit unicode characters?

Re: Latin digraph characters

2001-03-03 Thread Marcin 'Qrczak' Kowalczyk
Wed, 28 Feb 2001 13:35:17 -0800 (GMT-0800), Pierpaolo BERNARDI [EMAIL PROTECTED] writes: The initial character of the name is transliterated as CH in English, TCH in French, TSCH in German, C or CI in Italian, C WITH CARON in the official Russian transliteration. And CZ in Polish.

Re: [OT] Unicode-compatible SQL?

2001-02-05 Thread Marcin 'Qrczak' Kowalczyk
Mon, 5 Feb 2001 08:20:43 -0800 (GMT-0800), Mark Davis [EMAIL PROTECTED] writes: The topic came up in a UTC meeting some time ago, a "UTF-8S". The motivation was for performance (having a form that reproduces the binary order of UTF-16). This is unfair: it slows down the conversion UTF-8 -

Re: Transcriptions of Unicode

2001-01-29 Thread Marcin 'Qrczak' Kowalczyk
Mon, 15 Jan 2001 13:09:47 -0800 (GMT-0800), G. Adam Stanislav [EMAIL PROTECTED] writes: I would not be surprised if speakers of certain Slavic languages even changed the SPELLING to Unikod (with an acute over the [o]), as they have done with other imported words (such as futbal for football).

Re: Transcriptions of Unicode

2001-01-29 Thread Marcin 'Qrczak' Kowalczyk
Fri, 12 Jan 2001 07:28:18 -0800 (GMT-0800), Mark Davis [EMAIL PROTECTED] writes: According to the references I have, the prefix "uni" is directly from Latin while the word "code" is through French. The Indo-European would have been *oi-no-kau-do ("give one strike"): *kau apparently being

Re: Teletext mappings

2001-01-27 Thread Marcin 'Qrczak' Kowalczyk
Sun, 21 Jan 2001 09:29:56 -0800 (GMT-0800), Rob Hardy [EMAIL PROTECTED] writes: [Polish set] contains the line 0x5B 0x01B5 # LATIN CAPITAL LETTER Z WITH STROKE should supposedly be 0x5B 0x017B # LATIN CAPITAL LETTER Z WITH DOT ABOVE My teletext spec definitely has a Z with a stroke.

Re: Character properties

2000-10-25 Thread Marcin 'Qrczak' Kowalczyk
Mon, 23 Oct 2000 09:48:52 +0100, [EMAIL PROTECTED] [EMAIL PROTECTED] writes: isDigit:Nd isHexDigit: '0'..'9', 'A'..'F', 'a'..'f' isDecDigit: '0'..'9' isOctDigit: '0'..'7' The definition "Nd" is what I would have proposed for isDecDigit. The name isDecDigit is confusing indeed...
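
The four predicates in the quoted list can be sketched in Python, with unicodedata.category supplying the General Category. This follows the quoted mapping only, not the alternative suggested in the reply, and the snake_case names are adaptations for illustration rather than the Haskell library's functions.

```python
import string
import unicodedata


def is_digit(c: str) -> bool:
    """Any decimal digit in any script: General Category Nd."""
    return unicodedata.category(c) == "Nd"


def is_dec_digit(c: str) -> bool:
    """ASCII '0'..'9' only."""
    return "0" <= c <= "9"


def is_oct_digit(c: str) -> bool:
    """ASCII '0'..'7' only."""
    return "0" <= c <= "7"


def is_hex_digit(c: str) -> bool:
    """ASCII '0'..'9', 'A'..'F', 'a'..'f' only."""
    return len(c) == 1 and c in string.hexdigits


arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, category Nd
```

The difference shows up for non-ASCII digits: is_digit accepts arabic_three, while is_dec_digit does not.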

Re: Character properties

2000-10-21 Thread Marcin 'Qrczak' Kowalczyk
Wed, 11 Oct 2000 07:15:05 -0800 (GMT-0800), Mark Davis [EMAIL PROTECTED] writes: Here is my take on the way Unicode general categories should be mapped to POSIX ones. Reiterated, here is my compilation of mapping of properties proposed for Haskell: isAssigned: all except Cs, Cn isControl:

Re: Character properties

2000-10-08 Thread Marcin 'Qrczak' Kowalczyk
Wed, 4 Oct 2000 18:48:17 -0700 (PDT), Kenneth Whistler [EMAIL PROTECTED] writes: It is quite clear that many important character properties cannot be deduced from the General Category values in UnicodeData.txt alone. What a pity. Especially as it does work for some properties and I would like

Re: Character properties

2000-09-23 Thread Marcin 'Qrczak' Kowalczyk
Fri, 22 Sep 2000 22:11:44 -0800 (GMT-0800), Roozbeh Pournader [EMAIL PROTECTED] writes: intToDigit should look at the locale to select the preferred digit form, I think. Sorry, it cannot apply to Haskell, because it's a functional language. It must work the same way all the time, unless it

Re: Character properties

2000-09-22 Thread Marcin 'Qrczak' Kowalczyk
Thu, 21 Sep 2000 23:55:24 +0330 (IRT), Roozbeh Pournader [EMAIL PROTECTED] writes: isDigit intentionally recognizes ASCII digits only. IMHO it's more often needed and this is what the Haskell 98 Report says. (But I don't follow the report in some other cases.) Would you please give me

Character properties

2000-09-21 Thread Marcin 'Qrczak' Kowalczyk
I am trying to improve character properties handling in the language Haskell. What should the following functions return, i.e. what is most standard/natural/preferred mapping between Unicode character categories and predicates like isalpha etc.? What else should be provided? Here are definitions