Re: [A12n-Collab] Latin alpha (Re: Public Review Issues Update)

2004-08-31 Thread Philippe Verdy
From: John Hudson [EMAIL PROTECTED] Donald Z. Osborn wrote: According to data from R. Hartell (1993), the latin alpha is used in Fe'efe'e (a dialect of Bamileke) in Cameroon. See http://www.bisharat.net/A12N/CAM-table.htm (full ref. there; Hartell names her sources in her book). Not sure

Re: Deseret in use (?) by micronation Molossia

2004-09-07 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED] António Martins-Tuválkin antonio at tuvalkin dot web dot pt wrote: Deseret in use (?) by micronation Molossia: It is explained at http://www.molossia.org/alphabet.html , but they put GIFs on-line, making no use of the U+10400 block... I visited their site,

markup on combining characters (was: Compatibility mappings for new Hebrew points)

2004-09-07 Thread Philippe Verdy
From: Peter Kirk [EMAIL PROTECTED] By the way, any suggestion of making the QQ distinction with markup is ruled out by the principle recently expounded on the main Unicode list that separate markup cannot be applied to combining characters. Isn't this need of allowing separate markup on

Re: markup on combining characters

2004-09-08 Thread Philippe Verdy
From: Jony Rosenne [EMAIL PROTECTED] Peter Kirk You mean, you would represent a black e with a red acute accent as something like e, ZWJ, red, IBC, acute, /red? That looks like a nightmare for all kinds of processing and a nightmare for rendering. No, it is more like forecolor:black,

Re: markup on combining characters

2004-09-08 Thread Philippe Verdy
From: Asmus Freytag [EMAIL PROTECTED] At 12:49 AM 9/8/2004, Philippe Verdy wrote: And still no decision if this invisible base character will be added or not. It's just a public review for now, Well, hold your horses for a bit here. If something's out of review, there won't be a decision until

Re: [BULK] - Re: markup on combining characters

2004-09-10 Thread Philippe Verdy
From: Asmus Freytag [EMAIL PROTECTED] On the other hand, all aspects to *coloring* of characters do not belong in the plain text stream - but that was not the question. I think suggested solutions that define markup that apply to combining characters but place that markup outside of the combining

Re: Questions about diacritics

2004-09-13 Thread Philippe Verdy
From: Gerd Schumacher [EMAIL PROTECTED] 2. Another invisible diacritics carrier I also found an acute on diphthongs, placed on the boundary of both letters (au, ei, eu, oe, and ui). Wouldn't such a diacritic be held by the currently proposed invisible base character (in the Public Review section of

Re: Questions about diacritics

2004-09-13 Thread Philippe Verdy
From: Peter Kirk [EMAIL PROTECTED] Surely the intention is for INVISIBLE LETTER, combining acute to be equivalent (although it cannot be canonically equivalent) to spacing acute, U+00B4? But then would this kind of ligature mechanism with ZWNJ and U+00B4 be appropriate? I would think not.

Re: Questions about diacritics

2004-09-13 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED] Philippe Verdy verdy underscore p at wanadoo dot fr wrote: I also found an acute on diphthongs, placed on the boundary of both letters (au, ei, eu, oe, and ui). Wouldn't such a diacritic be held by the currently proposed invisible base character (in the Public

Re: Questions about diacritics

2004-09-14 Thread Philippe Verdy
To: Philippe Verdy [EMAIL PROTECTED] Cc: Doug Ewell [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Tuesday, September 14, 2004 6:06 PM Subject: Re: Questions about diacritics In LaTeX2e with the Cork coding (for TeXnicians: \usepackage[T1]{fontenc}) there is a so-called compound word mark. It has

Re: Questions about diacritics

2004-09-14 Thread Philippe Verdy
Since INVISIBLE LETTER is spacing, wouldn't it make more sense to define Isn't INVISIBLE LETTER rather *non-spacing* (zero-width minimum), even though it is *not combining*? I mean here that its width would be zero unless a visible diacritic expands it. It is then distinct from other

Historic scripts for Albanian: Elsaban and Beitha Kukju

2004-09-16 Thread Philippe Verdy
This page: http://www.omniglot.com/writing/albanian.htm shows two historic scripts that have been used to write Albanian (Shqip): - the Elsaban script in the 18th century, which looks like Old Greek, for the Tosk variant of the language. However there are lots of unique letter forms, and mapping to Old

Re: Questions about diacritics

2004-09-17 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED] In the case of INVISIBLE LETTER, it seems likely -- based on the comments of experts -- that the benefits outweigh the disadvantages. But new control characters (and quasi-controls like IL) have tended to cause more problems and confusion for Unicode in the past

Re: Unibook 4.0.1 available

2004-09-17 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED] Marion Gunn mgunn at egt dot ie wrote: Is it really so hard to make multi-platform, open-office-type utilities? Actually, yes, it is. Mac users don't want an application to be too Windows-like, Windows users don't want an application to be too Mac-like (we'll

Re: Unicode Shorthand?

2004-09-18 Thread Philippe Verdy
From: Chris Jacobs [EMAIL PROTECTED] - Original Message - From: Christopher Fynn [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Sunday, September 19, 2004 12:08 AM Subject: Unicode Shorthand? Is there any plan to include sets of shorthand (Pitman, Gregg etc.) symbols in Unicode? Or are

Re: Unicode Shorthand?

2004-09-18 Thread Philippe Verdy
From: D. Starner [EMAIL PROTECTED] Christopher Fynn wrote: Is there any plan to include sets of shorthand (Pitman, Gregg etc.) symbols in Unicode? Or are they something which is specifically excluded? They're a form of handwriting, which is generally excluded. Why do they need to be encoded in a

Re: Unicode Shorthand?

2004-09-18 Thread Philippe Verdy
From: Christopher Fynn [EMAIL PROTECTED] Philippe Verdy wrote: It's not impossible to create a rendering system for such a stenographic system; however, the general layout is more complex than with traditional alphabets, because the layout of characters is highly dependent on the context

Re: Unicode Shorthand?

2004-09-19 Thread Philippe Verdy
From: Christopher Fynn [EMAIL PROTECTED] Philippe Verdy wrote: Not really, because the actual rendering is two-dimensional, not linear. It's difficult to predict the line height, as the baseline changes according to the context of previous characters in the word, and its writing direction

Re: [OT] Decode Unicode!

2004-09-25 Thread Philippe Verdy
From: Curtis Clark [EMAIL PROTECTED] on 2004-09-24 10:05 Peter Constable did quote: After the DNA, the ASCII-Code is the most successful code on this planet. Things get more and more complex. DNA is a 2-bit code. Not completely true. It is a bit less than 2 bits, due to its replication chains,

Re: UTF-8 stress test file?

2004-10-11 Thread Philippe Verdy
From: Terje Bless [EMAIL PROTECTED] Theodore H. Smith [EMAIL PROTECTED] wrote: I'd like to see a UTF-8 stress test file. The top result on Google for the query UTF-8 Stress Test is http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt. This test

Re: UTF-8 stress test file?

2004-10-12 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED] Theodore H. Smith delete at elfdata dot com wrote: - the file mixes UTF-8 and UTF-16 Does this file mix UTF-8 and UTF-16? I thought it just had surrogates encoded into UTF-8? Of course a surrogate should never exist in UTF-8. You are right. Philippe's statement
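The claim in this exchange, that a surrogate must never appear in UTF-8, can be checked against a strict decoder. The following is a minimal Python sketch, assuming CPython's built-in `utf-8` codec as the strict decoder; the byte sequences and the helper name `is_well_formed_utf8` are illustrative and are not taken from the stress test file itself.

```python
# U+10400 (the first code point of the Deseret block) as well-formed UTF-8:
# a single 4-byte sequence.
valid = "\U00010400".encode("utf-8")

# The same character written as two 3-byte sequences, one per UTF-16
# surrogate (D801 DC00). This is the CESU-8 style of encoding and is
# not well-formed UTF-8.
cesu_style = bytes([0xED, 0xA0, 0x81, 0xED, 0xB0, 0x80])


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the byte string."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

A strict decoder accepts `valid` and rejects `cesu_style`, which is the behaviour a UTF-8 stress test would be expected to probe.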

Re: UTF-8 stress test file?

2004-10-12 Thread Philippe Verdy
From: Clark Cox [EMAIL PROTECTED] unless the file was used as a test for CESU-8 The whole point of the CESU-8-like section is that it is not legal UTF-8. Except that the document does not even cite CESU-8 but only UTF-16! The text itself is puzzling as well as nearly all its suggestions about

Re: UTF-8 stress test file?

2004-10-12 Thread Philippe Verdy
From: Philipp Reichmuth [EMAIL PROTECTED] Don't you think you are stretching things a bit? This is a UTF-8 parser stress test file. If an application opens it in a different encoding, well, of course the results will be different, and things will not look UTF-8-ish. Again, this is a

Re: internationalization assumption

2004-09-30 Thread Philippe Verdy
From: Antoine Leca [EMAIL PROTECTED] On Tuesday, September 28th, 2004 03:22 Tom wrote: Let's say. The test engineer ensures the functionality and validates the input and output on major Latin 1 languages, such as German, French, Spanish, Italian, Just a side point: French cannot be fully

Re: internationalization assumption

2004-09-30 Thread Philippe Verdy
About the French ligatures 'oe' (and 'ae'), I should have noted this excellent summary page (in French) on its usage and history: http://fr.wikipedia.org/wiki/Ligature_(typographie) Note that Latin- or Greek-inherited words use the ligature when the vowels are not to be pronounced separately,

Re: internationalisation assumption

2004-10-01 Thread Philippe Verdy
://www.rodage.org/pub/French-Sahel.pdf - Original Message - From: Stefan Persson [EMAIL PROTECTED] To: Unicode Mailing List [EMAIL PROTECTED] Sent: Thursday, September 30, 2004 5:05 PM Subject: Re: internationalisation assumption Philippe Verdy wrote: in addition, French keyboards typically never

Re: Grapheme clusters

2004-10-06 Thread Philippe Verdy
From: Chris Harvey [EMAIL PROTECTED] The users seem determined to put the entire alphabet into the PUA, thus making a single character for ng, kw, ii etc. I would like to be able to present them with something that works and avoid this kind of catastrophe. A better alternative to PUAs, which

Re: internationalization assumption

2004-10-07 Thread Philippe Verdy
Well, the main issue for internationalization of software is not the character sets with which it was tested. It is in fact trivial today to make an application compliant with Unicode text encoding. What is more complicated is to make sure that the text will be

Polytonic Greek pneuma letters (spirits) and half-eta glyphs

2004-10-07 Thread Philippe Verdy
This page on the French version of wikipedia notes that Polytonic Greek used in the 3rd century B.C. alternate letters to denote the initial spirits (pneuma dasú for the hard spirit, and pneuma psílon for the soft spirit), rather than the modern 9-shaped combining accents.

Re: text-transform

2004-10-23 Thread Philippe Verdy
From: fantasai [EMAIL PROTECTED] Comments on CSS (but not how-to questions) should be directed to the www-style mailing list at w3.org, not unicode: http://lists.w3.org/Archives/Public/www-style/ OK for the numeric versus capitalize|uppercase|lowercase remark, which is related to form

Re: basic-hebrew RtL-space ?

2004-11-01 Thread Philippe Verdy
From: kefas [EMAIL PROTECTED] Inserting unicode/basic-hebrew results in a convenient RtL, right-to-left, advance of the cursor, but the space-character jumps to the far right. Is there a RtL-space? In MS-Word and OpenOffice I can only change whole paragraphs to RtL-entry. But quoting just a few

Re: Opinions on this Java URL?

2004-11-13 Thread Philippe Verdy
From: A. Vine [EMAIL PROTECTED] I'm just curious about the \0 thing. What problems would having a \0 in UTF-8 present, that are not presented by having \0 in ASCII? I can't see any advantage there. Beats me, I wasn't there. None of the Java folks I know were there either. The problem is in the
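For context on the `\0` question: the `DataInput` documentation discussed in this thread describes a modified UTF-8 in which U+0000 is written as the two bytes C0 80, so that the encoded data never contains a zero byte. The sketch below illustrates only that one rule in Python; the function names are hypothetical, and the other difference in Java's format (supplementary characters written as surrogate pairs) is deliberately not modelled.

```python
def encode_nul_safe(text: str) -> bytes:
    """Encode as UTF-8, but write U+0000 as the overlong pair C0 80.

    In well-formed UTF-8 the byte 0x00 only ever encodes U+0000, so a
    plain byte replacement is sufficient here.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_nul_safe(data: bytes) -> str:
    """Reverse encode_nul_safe.

    The byte 0xC0 never occurs in well-formed UTF-8, so every C0 80 pair
    in the input must stand for U+0000.
    """
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")
```

The result can be passed through C-style APIs that treat a zero byte as a terminator, at the price of no longer being well-formed UTF-8.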

Re: U+0000 in C strings (was: Re: Opinions on this Java URL?)

2004-11-15 Thread Philippe Verdy
- Original Message - From: John Cowan [EMAIL PROTECTED] To: Doug Ewell [EMAIL PROTECTED] Cc: Unicode Mailing List [EMAIL PROTECTED]; Philippe Verdy [EMAIL PROTECTED]; Peter Kirk [EMAIL PROTECTED] Sent: Monday, November 15, 2004 7:05 AM Subject: Re: U+0000 in C strings (was: Re: Opinions

Re: Opinions on this Java URL?

2004-11-15 Thread Philippe Verdy
From: Christopher Fynn [EMAIL PROTECTED] Isn't it already deprecated? The URL that started this thread http://java.sun.com/j2se/1.5.0/docs/api/java/io/DataInput.html is marked as part of the Deprecated API Deprecated does not mean that it is not used. This interface remains accessible when

Re: Eudora 6.2 has been released

2004-11-19 Thread Philippe Verdy
From: Peter Kirk [EMAIL PROTECTED] On the contrary, it is your mobile sync software which is of no use if communication with the outside world is required, if it doesn't support standards-conformant mail clients like Thunderbird, but only communicates in non-standardised ways with the products

Re: Unicode HTML, download

2004-11-20 Thread Philippe Verdy
From: Edward H. Trager [EMAIL PROTECTED] Hi, Elaine, There is of course no limit to how many writing systems one can have on a Unicode-encoded HTML page. My recommendations would be to: (3) Use Cascading Style Sheet (CSS) classes to control display of fonts ... A better CSS class would

Re: Unicode HTML, download

2004-11-20 Thread Philippe Verdy
From: E. Keown [EMAIL PROTECTED] Great idea! I code in the seldom-seen AHTML ('Archaic HTML'), as you all suspected. A friend tested a page I wrote last month and found it wouldn't work on any of his 5 browsers... oh well. Well, Elaine, if you want maximum compatibility, you had better use

Re: [even more increasingly OT-- into Sunday morning] Re: Unicode HTML, download

2004-11-21 Thread Philippe Verdy
From: Christopher Fynn [EMAIL PROTECTED] I'd also like to figure out a way to trigger this kind of behavior in other browsers as well as in IE (using Java Script or Java rather than VB) as not quite everyone uses IE - (but I guess you are not going to give me any more clues on how to do that

Re: Unicode HTML, download

2004-11-21 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED] The best advice for Elaine's situation becomes simpler. To maximize the likelihood that readers will see the right glyphs, add a font-family style line that lists a variety of available fonts, in decreasing order of coverage and attractiveness. My bad advice

Re: Unicode HTML, download

2004-11-21 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED] Cryptically naming these two CSS classes .he and .heb, which provides no indication of which is the Unicode encoding and which is the Latin-1 hack, merely makes a bad suggestion worse. It was not cryptographic: he was meant for Hebrew (generic, properly Unicode

Re: [increasingly OT--but it's Saturday night] Re: Unicode HTML, download

2004-11-21 Thread Philippe Verdy
From: E. Keown [EMAIL PROTECTED] Dear Doug Ewell, fantasai and List: I will try to sort out these diverse pieces of advice. What's the point, really, of going far beyond, even beyond CSS, into XHTML, where few computational Hebraists have gone before? You're right Helen, the web is full of non

Re: Ezra

2004-11-21 Thread Philippe Verdy
From: Edward H. Trager [EMAIL PROTECTED] Are you saying the difference in names is SIL Ezra vs. Ezra SIL ? That's too confusing! You're not alone in being confused. I had completely forgotten the existence of two versions of the same font design. I may have just seen that it used PUAs, so I did not

Re: My Querry

2004-11-23 Thread Philippe Verdy
From: Antoine Leca [EMAIL PROTECTED] I do not know what fully compatible means in such a context. For example, ASCII as designed allowed (please note I did not write was designed to allow) the use of the 8th bit as parity bit when transmitted as octet on a telecommunication line; I doubt such

Re: Shift-JIS conversion.

2004-11-25 Thread Philippe Verdy
You just need a mapping table from Unicode codepoints to Shift-JIS code positions, and a very simple code point parser to translate UTF-8 into Unicode code points. You'll find a mapping table in the Unicode UCD, on its FTP server. The UTF-8 form is fully documented in the Conformance section
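The two pieces described here, a small UTF-8 parser and a code point to Shift-JIS mapping, can be sketched as follows in Python. This is only an illustration under stated assumptions: the parser does minimal validation (it does not reject overlong forms or encoded surrogates), and Python's built-in `shift_jis` codec stands in for the published mapping table, which may differ in a few vendor-specific entries. The function names are hypothetical.

```python
def utf8_to_code_points(data: bytes) -> list:
    """Parse UTF-8 bytes into a list of integer code points.

    Minimal validation only: lead and trail byte patterns are checked,
    but overlong forms and encoded surrogates are not rejected.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            cp, extra = b, 0
        elif 0xC2 <= b <= 0xDF:
            cp, extra = b & 0x1F, 1
        elif 0xE0 <= b <= 0xEF:
            cp, extra = b & 0x0F, 2
        elif 0xF0 <= b <= 0xF4:
            cp, extra = b & 0x07, 3
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + extra >= len(data) and extra:
            raise ValueError(f"truncated sequence at offset {i}")
        for k in range(1, extra + 1):
            trail = data[i + k]
            if (trail & 0xC0) != 0x80:
                raise ValueError(f"invalid trail byte at offset {i + k}")
            cp = (cp << 6) | (trail & 0x3F)
        out.append(cp)
        i += extra + 1
    return out


def code_points_to_sjis(code_points) -> bytes:
    """Map code points to Shift-JIS bytes.

    Assumption: Python's codec is used in place of a hand-loaded mapping
    table. A code point with no mapping raises UnicodeEncodeError.
    """
    return b"".join(chr(cp).encode("shift_jis") for cp in code_points)
```

In a real converter the second function would be a dictionary lookup built from the mapping file, with an explicit policy for unmappable characters.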

Re: Misuse of 8th bit [Was: My Querry]

2004-11-25 Thread Philippe Verdy
From: Antoine Leca [EMAIL PROTECTED] On Wednesday, November 24th, 2004 22:16Z Asmus Freytag wrote: I'm not seeing a lot in this thread that adds to the store of knowledge on this issue, but I see a number of statements that are easily misconstrued or misapplied, including the thoroughly

Re: Shift-JIS conversion.

2004-11-25 Thread Philippe Verdy
- Original Message - From: Addison Phillips [wM] To: pragati ; [EMAIL PROTECTED] Sent: Thursday, November 25, 2004 6:21 PM Subject: RE: Shift-JIS conversion. Dear Pragati, You can write your own conversion, of course. The mapping tables of Unicode-SJIS are readily available. You should

Re: Misuse of 8th bit [Was: My Querry]

2004-11-26 Thread Philippe Verdy
From: Antoine Leca [EMAIL PROTECTED] On Thursday, November 25th, 2004 08:05Z Philippe Verdy wrote: In ASCII, or in all other ISO 646 charsets, code positions are ALL in the range 0 to 127. Nothing is defined outside of this range, exactly like Unicode does not define or mandate anything

Re: Relationship between Unicode and 10646 (was: Re: Shift-JIS conversion.)

2004-11-26 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED] My impression is that Unicode and ISO/IEC 10646 are two distinct standards, administered respectively by UTC and ISO/IEC JTC1/SC2/WG2, which have pledged to work together to keep the standards perfectly aligned and interoperable, because it would be destructive

Re: CGJ , RLM

2004-11-26 Thread Philippe Verdy
From: Mark Davis [EMAIL PROTECTED] I want to correct some misperceptions about CGJ; it should not be used for ligatures. True. CGJ is a combining character that extends the grapheme cluster started before it, but it does not imply any linking with the next grapheme cluster starting at a base

Re: CGJ , RLM

2004-11-26 Thread Philippe Verdy
Message - From: Mark Davis [EMAIL PROTECTED] To: Philippe Verdy [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Friday, November 26, 2004 9:09 PM Subject: Re: CGJ , RLM The statements below are incorrect, but I don't have the time to correct them all.

Re: CGJ , RLM

2004-11-26 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED] Perhaps a better question to ask would be why you need to indicate both hyphenation points and ligation points in text that is going to be collated. Because one would want to: - prepare documents for correct rendering (including both ligatures and hyphenation

Re: CGJ , RLM

2004-11-26 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED] Philippe Verdy verdy underscore p at wanadoo dot fr wrote: If I want to encode explicit ligatures for the ffi cluster, if it is not hyphenated, I need to add ZWJ: ef+ZWJ+SHY+f+ZWJ+i+SHY+ca+SHY+ce(1) Great Scott! You can use ZWJ to suggest a ligation

Re: No Invisible Character - NBSP at the start of a word

2004-11-27 Thread Philippe Verdy
From: Jony Rosenne [EMAIL PROTECTED] One of the problems in this context is the phrase original meaning. What we have is a juxtaposition of two words, which is indicated by writing the letters of one with the vowels of the other. In many cases this does not cause much of a problem, because the

Re: (base as a combing char)

2004-11-27 Thread Philippe Verdy
From: Addison Phillips [wM] [EMAIL PROTECTED] For example, Dutch sometimes treats the sequence ij as a single letter (it turns out that there are characters for the letter 'ij' in Unicode too, but they are for compatibility with an ancient non-Unicode character set). Software must be modified

Re: Relationship between Unicode and 10646

2004-11-27 Thread Philippe Verdy
From: Peter Kirk [EMAIL PROTECTED] I don't want to go along with Philippe entirely on this, but surely he must be right on this last point. Formally, Unicode is effectively the agent of just one national body in this decision-making process. To be honest, Peter, I never said that Unicode was a

Re: CGJ , RLM

2004-11-27 Thread Philippe Verdy
I'm not the one who proposed encoding an AE ligature with A+ZWJ+E. I just spoke about true typographical ligatures like ffi. I do know that AE or ae in French is better encoded with their distinct unique code, even if French considers this letter as two letters (which may justify the

Re: (base as a combing char)

2004-11-27 Thread Philippe Verdy
From: John Cowan [EMAIL PROTECTED] the need to encode Dutch ij as a single character, which is neither necessary nor practical. (U+0132 and U+0133 are encoded for compatibility only.) In cases where ij is a digraph in Dutch text, i+ZWNJ+j will be effective. I suppose you wanted to speak about the

Re: Re: Relationship between Unicode and 10646]

2004-11-29 Thread Philippe Verdy
From: Patrick Andries [EMAIL PROTECTED] Finally, I am no longer so sure that American companies still regard Unicode as something strategic; it is mostly a matter of individual efforts by passionate technicians within those companies, enthusiasts who are still left

Re: CGJ , RLM

2004-11-29 Thread Philippe Verdy
From: Otto Stolz [EMAIL PROTECTED] Note that there is no algorithm to reliably derive the position of the syllable break from the spelling of a Word. You could even concoct pairs of homographs that differ only in the position of the syllable break (and, consequently, in their respective meaning).

fl/fi ligature examples

2004-11-29 Thread Philippe Verdy
From: Otto Stolz [EMAIL PROTECTED] Just because the st ligature is so uncommon (and the long ſ with its ſt ligature is almost extinct), I was looking for an example involving fl, or fi). with ff : affable, baffe, biffer, Buffy, affriolant, effaroucher, effacer, ... with ffl : effleurer,

Re: Ideograph?!?

2004-11-29 Thread Philippe Verdy
From: Michael Norton (a.k.a. Flarn) [EMAIL PROTECTED] What's an ideograph? Also, what's a radical? Are they the same thing? Some radicals (in the Han script) may be ideographs, but most ideographs are not radicals: they often (not always) combine 1 or more radicals, with 1 or more strokes that

Re: Keyboard Cursor Keys

2004-11-30 Thread Philippe Verdy
From: Peter R. Mueller-Roemer [EMAIL PROTECTED] Doug Ewell wrote: Robert Finch wrote: I'm trying to implement a Unicode keyboard device, and I'd rather have keyboard processing dealing with genuine Unicode characters for the cursor keys, rather than having to use a mix of keyboard scan codes and

Re: Relationship between Unicode and 10646

2004-11-30 Thread Philippe Verdy
From: Peter Kirk [EMAIL PROTECTED] On 30/11/2004 19:53, John Cowan wrote: Your main misunderstanding seems to be your belief that WG2 is a democratic body; that is, that it makes decisions by majority vote. ... Thank you, John. This was in fact my question: will the amendment be passed

Re: Nicest UTF

2004-12-02 Thread Philippe Verdy
There's no *universal* best encoding. UTF-8 however is certainly today the best encoding for portable communications and data storage (but it competes now with SCSU which uses a compressed form where, on average, each Unicode character is represented by one byte, in most documents; but other

Re: Nicest UTF

2004-12-02 Thread Philippe Verdy
If you need immutable strings that take as little space as possible in memory for your running app, then consider using SCSU for the internal storage of the string object, then have a method return an indexed array of code points, or a UTF-32 string when you need to mutate the string

Re: Nicest UTF

2004-12-03 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED] I appreciate Philippe's support of SCSU, but I don't think *even I* would recommend it as an internal storage format. The effort to encode and decode it, while by no means Herculean as often perceived, is not trivial once you step outside Latin-1. I said: for

Re: Nicest UTF

2004-12-03 Thread Philippe Verdy
From: Lars Kristan I agree. But not for reasons you mentioned. There is one other important advantage: UTF-8 is stored in a way that permits storing invalid sequences. I will need to elaborate that, of course. Not true for UTF-8. UTF-8 can only store valid sequences of code points,

Re: OpenType vs TrueType (was current version of unicode-font)

2004-12-03 Thread Philippe Verdy
From: Gary P. Grosso [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Friday, December 03, 2004 5:10 PM Subject: RE: OpenType vs TrueType (was current version of unicode-font) Hi Antoine, others, Questions about OpenType vs TrueType come up often in my work, so perhaps the list will suffer a couple

Re: Nicest UTF

2004-12-03 Thread Philippe Verdy
From: Asmus Freytag [EMAIL PROTECTED] A simplistic model of the 'cost' for UTF-16 over UTF-32 would consider 1) 1 extra test per character (to see whether it's a surrogate) 2) special handling every 100 to 1000 characters (say 10 instructions) 3) additional cost of accessing 16-bit registers (per
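The first two items of this cost model (one surrogate test per code unit, plus rare special handling) correspond to the branch structure of a UTF-16 decoder loop. The Python sketch below is only meant to show that structure, not to measure anything; the function name is hypothetical and the input is assumed to be a sequence of 16-bit integers.

```python
def utf16_code_points(units):
    """Yield code points from an iterable of 16-bit UTF-16 code units."""
    it = iter(units)
    for unit in it:
        if 0xD800 <= unit <= 0xDBFF:
            # Rare path: a high surrogate must be followed by a low one.
            low = next(it, None)
            if low is None or not (0xDC00 <= low <= 0xDFFF):
                raise ValueError("unpaired high surrogate")
            yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            # Common path: the code unit is the code point.
            yield unit
```

With UTF-32 input the loop body would reduce to the common path alone, which is the difference the model tries to price.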

Re: Nicest UTF

2004-12-03 Thread Philippe Verdy
From: Theo [EMAIL PROTECTED] From: Asmus Freytag [EMAIL PROTECTED] So, despite it being UTF-8 case insensitive, it was totally blastingly fast. (One person reported counting words at 1MB/second of pure text, from within a mixed Basic / C environment). You'll need to keep in mind, that the

Re: OpenType vs TrueType (was current version of unicode-font)

2004-12-03 Thread Philippe Verdy
From: Peter Constable [EMAIL PROTECTED] Why would you think the creation of this site might suggest that Microsoft is selling off its IP in relation to OpenType to Monotype? If Motorola created a site www.pentium4.org, would you jump to the conclusion that they were selling off that IP? What

Re: Nicest UTF

2004-12-04 Thread Philippe Verdy
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] Philippe Verdy [EMAIL PROTECTED] writes: Random access by code point index means that you don't use strings as immutable objects, No. Look at Python, Java and C#: their strings are immutable (don't change in-place) and are indexed by integers

Re: Nicest UTF

2004-12-05 Thread Philippe Verdy
- Original Message - From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Sunday, December 05, 2004 1:37 AM Subject: Re: Nicest UTF Philippe Verdy [EMAIL PROTECTED] writes: There's nothing that requires the string storage to use the same exposed array, The point

Re: script complexity, was Re: OpenType vs TrueType

2004-12-05 Thread Philippe Verdy
Richard Cook rscook at socrates dot berkeley dot edu wrote: Script complexity is not so easily quantified. Has anyone tried to sort scripts by complexity? In terms of the present discussion, Han would be viewed as a simple script, and yet it is simple only in terms of the script model in which

Re: Unicode for words?

2004-12-05 Thread Philippe Verdy
From: Ray Mullan [EMAIL PROTECTED] I don't see how the one million available codepoints in the Unicode Standard could possibly accommodate a grammatically accurate vocabulary of all the world's languages. You have misread the message from Tim: he wanted to use code points above U+10FFFF within

Re: Nicest UTF

2004-12-05 Thread Philippe Verdy
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] Philippe Verdy [EMAIL PROTECTED] writes: The point is that indexing should better be O(1). SCSU is also O(1) in terms of indexing complexity... It is not. You can't extract the nth code point without scanning the previous n-1 code points
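The objection raised here applies to any variable-length encoding stored without a side index: reaching the n-th code point means walking over everything before it. The sketch below shows this with UTF-8 rather than SCSU, purely because UTF-8 is shorter to demonstrate; it is a swapped-in illustration with a hypothetical function name, and it assumes well-formed input.

```python
def nth_code_point_utf8(data: bytes, n: int) -> int:
    """Return the n-th (0-based) code point of well-formed UTF-8.

    The scan is linear: every byte before the target is inspected, so the
    cost grows with n instead of being constant.
    """
    seen = -1
    for start, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:          # ASCII or lead byte starts a character
            seen += 1
            if seen == n:
                end = start + 1
                while end < len(data) and (data[end] & 0xC0) == 0x80:
                    end += 1
                return ord(data[start:end].decode("utf-8"))
    raise IndexError(n)
```

Constant-time indexing over such storage needs an auxiliary structure, for example a table of byte offsets, which is the design choice being argued over in this thread.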

Re: Unicode for words?

2004-12-05 Thread Philippe Verdy
of channels (networking links, file storage, database table) with lower throughput than fast but expensive or restricted internal processing memory (including memory caches if we consider data locality). From: D. Starner [EMAIL PROTECTED] Philippe Verdy writes: Suppose that Unicode encodes

Re: Nicest UTF

2004-12-05 Thread Philippe Verdy
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] Now consider scanning forwards. We want to strip a beginning of a string. For example the string is an irc message prefixed with a command and we want to take the message only for further processing. We have found the end of the prefix and we want

Fw: Nicest UTF

2004-12-05 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED] Here is a string, expressed as a sequence of bytes in SCSU: 05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E See how long it takes you to decode this to Unicode code points. (Do not refer to UTN #14; that would be cheating. :-) Without looking
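The byte string in this challenge uses only a small part of SCSU: literal bytes, the quote-window tags SQn (01 to 08) and the change-window tags SCn (10 to 17). A Python sketch of a decoder for just that subset follows. It is not a conforming SCSU decoder, and the window offset tables are written from memory of the SCSU specification's default values, so they are an assumption that should be checked against the specification before reuse. The function name is hypothetical.

```python
# Assumed default window offsets (to be verified against the SCSU spec).
STATIC_WINDOWS = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]
DYNAMIC_WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]


def decode_scsu_subset(data: bytes) -> str:
    """Decode single-byte-mode SCSU that uses only SQn and SCn tags."""
    active = 0                      # index of the active dynamic window
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if 0x01 <= b <= 0x08:       # SQn: quote one character from window n
            n = b - 0x01
            i += 1
            if i >= len(data):
                raise ValueError("truncated SQn sequence")
            q = data[i]
            base = DYNAMIC_WINDOWS[n] if q >= 0x80 else STATIC_WINDOWS[n]
            out.append(chr(base + (q & 0x7F)))
        elif 0x10 <= b <= 0x17:     # SCn: make dynamic window n active
            active = b - 0x10
        elif b >= 0x80:             # character in the active dynamic window
            out.append(chr(DYNAMIC_WINDOWS[active] + (b - 0x80)))
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
            out.append(chr(b))      # ASCII passes through unchanged
        else:
            raise ValueError(f"tag 0x{b:02X} is outside this subset")
        i += 1
    return "".join(out)


# The byte sequence quoted in the message above.
CHALLENGE = bytes.fromhex("051C4D6F73636F77051D2069732012 9CBEC1BAB2B02E".replace(" ", ""))
```

Calling `decode_scsu_subset(CHALLENGE)` walks through two quoted punctuation characters, a run of ASCII, one window switch and six bytes in the Cyrillic window.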

Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

2004-12-06 Thread Philippe Verdy
- Original Message - From: Arcane Jill [EMAIL PROTECTED] Probably a dumb question, but how come nobody's invented UTF-24 yet? I just made that up, it's not an official standard, but one could easily define UTF-24 as UTF-32 with the most-significant byte (which is always zero) removed,
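The made-up "UTF-24" described here is simple enough to write down directly. The Python sketch below is hypothetical by construction, since no such encoding form is standardized: it stores each code point as three big-endian bytes, which is the same as UTF-32BE with the always-zero leading byte dropped. The function names and the big-endian choice are assumptions.

```python
def encode_utf24_be(text: str) -> bytes:
    """Encode each code point as exactly three big-endian bytes."""
    return b"".join(ord(ch).to_bytes(3, "big") for ch in text)


def decode_utf24_be(data: bytes) -> str:
    """Decode three-byte big-endian units back into a string."""
    if len(data) % 3 != 0:
        raise ValueError("length is not a multiple of 3")
    return "".join(
        chr(int.from_bytes(data[i:i + 3], "big"))
        for i in range(0, len(data), 3)
    )
```

Three bytes hold values up to 0xFFFFFF, comfortably above the highest code point, so the form is fixed-width; the practical drawback is that 3-byte units do not align with common machine word sizes.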

Re: proposals I wrote (and also, didn't write)

2004-12-06 Thread Philippe Verdy
From: E. Keown [EMAIL PROTECTED] I wrote 3 Hebrew diacritics proposals between May-July. (...) 1. Proposal to add Samaritan Pointing to the UCS http://www.lashonkodesh.org/samarpro.pdf WG2 number: N2748 2. Proposal to add Palestinian Pointing to ISO/IEC 10646

Re: Nicest UTF

2004-12-07 Thread Philippe Verdy
From: D. Starner [EMAIL PROTECTED] If you're talking about a language that hides the structure of strings and has no problem with variable length data, then it wouldn't matter what the internal processing of the string looks like. You'd need to use iterators and discourage the use of arbitrary

Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

2004-12-07 Thread Philippe Verdy
From: Kenneth Whistler [EMAIL PROTECTED] Yes, and pigs could fly, if they had big enough wings. Once again, this is a creative comment. As if Unicode had to be bound on architectural constraints such as the requirement of representing code units (which are architectural for a system) only as

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Philippe Verdy
I know what you mean here: most Linux/Unix filesystems (as well as many legacy filesystems for Windows and MacOS...) do not track the encoding with which filenames were encoded and, depending on local user preferences when that user created that

Re: Re: Word dividers, was: proposals I wrote (and also, didn't write)

2004-12-08 Thread Philippe VERDY
From: Michael Everson But there is already in the pipeline a PHOENICIAN WORD SEPARATOR [...] The glyphs for all of these seem indistinguishable, and so are the functions. The only difference seems to be the scripts they are associated with, but punctuation marks are supposed to be

Re: IUC27 Unicode, Cultural Diversity, and Multilingual Computing / Africa is forgotten once again.

2004-12-08 Thread Philippe Verdy
Probably the first thing to do for Africa is to extend the support of software with localized content that can ALREADY be produced with existing encoded scripts. But even there, software companies are not progressing much, even if this causes no technical problems with the existing

Re: Nicest UTF

2004-12-09 Thread Philippe Verdy
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] Ok, so it's the conversion from raw text to escaped character references which should treat combining characters specially. What about < with combining acute, which doesn't have a precomposed form? A broken opening tag or a valid text character?

Re: Nicest UTF

2004-12-09 Thread Philippe Verdy
From: D. Starner [EMAIL PROTECTED] Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] writes: If it's a broken character reference, then what about A&#769; (769 is the code for combining acute if I'm not mistaken)? Please start adding spaces to your entity references or something, because those of us

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-09 Thread Philippe Verdy
From: Antoine Leca [EMAIL PROTECTED] Err, not really. MS-DOS *needs to know* the encoding to use, a bit like a *nix application that displays filenames needs to know the encoding to use the correct set of glyphs (but constraints are much heavier.) Also Windows NT Unicode applications know it,

Re: Software support costs (was: Nicest UTF)

2004-12-10 Thread Philippe Verdy
From: Carl W. Brown [EMAIL PROTECTED] Philippe, Also a broken opening tag for HTML/XML documents In addition to not having endian problems UTF-8 is also useful when tracing intersystem communications data because XML and other tags are usually in the ASCII subset of UTF-8 and stand out making it

Re: Nicest UTF

2004-12-10 Thread Philippe Verdy
From: Philippe Verdy [EMAIL PROTECTED] From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] Philippe Verdy [EMAIL PROTECTED] writes: The XML/HTML core syntax is defined with fixed behavior of some individual characters like '', '', quotation marks, and with special behavior for spaces. The point

Re: Please RSVP... (was: US-ASCII)

2004-12-10 Thread Philippe Verdy
From: Kenneth Whistler [EMAIL PROTECTED] That it has been morphologically reanalyzed is demonstrated by the fact that it takes regular English verb endings, as in: I RSVPed yesterday, right after I got the email. As I said, it is now a bona fide English verb, and most English speakers will treat it

Re: Nicest UTF

2004-12-11 Thread Philippe Verdy
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] Regarding A, I see three choices: 1. A string is a sequence of code points. 2. A string is a sequence of combining character sequences. 3. A string is a sequence of code points, but it's encouraged to process it in groups of combining character

Re: Roundtripping in Unicode

2004-12-11 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED] Lars Kristan wrote: I am sure one of the standardizers will find a Unicodally correct way of putting it. I can't even understand that paragraph, let alone paraphrase it. My understanding of his question and my response to his problem is that you MUST not use

Re: Please RSVP... (was: US-ASCII)

2004-12-11 Thread Philippe Verdy
From: Séamas Ó Brógáin [EMAIL PROTECTED] John wrote: As far as I know, they were first used in formal invitations (to weddings, funerals, dances, etc.) in the corner of the card, as both shorter and more fancy than the older phrase The favor of your reply is requested. This is correct. The

Re: infinite combinations, was Re: Nicest UTF

2004-12-11 Thread Philippe Verdy
From: Peter R. Mueller-Roemer [EMAIL PROTECTED] For a fixed length of combining character sequence (base + 3 combining marks is the most I have seen graphically distinguishable) the repertoire is still finite. I do think that you are underestimating the repertoire. Also Unicode does NOT define

Re: Please RSVP... (was: US-ASCII)

2004-12-11 Thread Philippe Verdy
From: Michael Everson [EMAIL PROTECTED] Nonsense. You might as well try to explain SPQR on the same basis. I won't. I know that SPQR was used on architectural constructions as a symbol of the Roman Empire, and it was a well-known acronym of a Latin expression. It largely predates the invention

Re: Roundtripping in Unicode

2004-12-11 Thread Philippe Verdy
My view about this problem of roundtripping is that if data, supposed to contain only valid UTF-8 sequences, contains some invalid byte sequences that still need to be roundtripped to some code point for internal management that can be roundtripped later to the

Re: RE: Roundtripping in Unicode

2004-12-13 Thread Philippe VERDY
Lars Kristan wrote: What I was talking about in the paragraph in question is what happens if you want to take unassigned codepoints and give them a new status. You don't need to do that. No Unicode application must assign semantics to unassigned codepoints. If a source sequence is invalid, and you

Re: Roundtripping in Unicode

2004-12-14 Thread Philippe Verdy
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] Lars Kristan [EMAIL PROTECTED] writes: Hm, here lies the catch. According to UTC, you need to keep processing the UNIX filenames as BINARY data. And, also according to UTC, any UTF-8 function is allowed to reject invalid sequences. Basically,
