Re: [A12n-Collab] Latin alpha (Re: Public Review Issues Update)
From: John Hudson [EMAIL PROTECTED]

Donald Z. Osborn wrote: According to data from R. Hartell (1993), the Latin alpha is used in Fe'efe'e (a dialect of Bamileke) in Cameroon. See http://www.bisharat.net/A12N/CAM-table.htm (full ref. there; Hartell names her sources in her book). Not sure offhand of other uses, but I thought it was proposed for Latin transcription of Tamashek in Mali at one point (I'll try to check later). In any event it would seem easy to confuse the Latin alpha with the standard a, which would seem either to require exaggerated forms (of the alpha, to clarify the difference) or to limit its usefulness in practice.

The Latin alpha is usually distinguished from the regular Latin lowercase a by making the latter a 'double-storey' form, whereas the alpha is a single-storey form. Of course, this means that the distinction cannot be adequately made in typefaces with a single-storey lowercase a, such as Futura.

I agree with you, but almost all font designs make a clear distinction between lowercase alpha (Latin or Greek) and lowercase a: the alpha is a single continuous stroke, whereas the Latin letter a is almost always (in either single-eye or double-eye forms) a closed circle/ellipse with a tangent vertical stroke on the right. I was speaking about the distinction between single-eye and double-eye forms of the Latin letter a (excluding Latin alpha), where:
- the single-eye form is generally an x-height circle or vertical ellipse and an x-height tangent vertical stroke (possibly curved at the lower end to become tangent to the baseline and form the start of a connecting edge),
- and the double-eye form is generally a half-x-height flat ellipse and an x-height vertical stroke curved above the ellipse to become tangent to the x-height line; so it has two eyes (one closed below, one open eye above it).
The Latin small letter alpha is always a single-eye form, but sometimes there's a second open eye to the right of the closed eye (which should not be an ellipse, but should present some angle on its right edge). My question was about the distinction of the letter a only, even if there are some fonts where it will be difficult to see the difference between the single-eye letter a and the small letter alpha.
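For readers following along with code, the two look-alike letters contrasted here are separately encoded. A minimal Python check of their identities (my illustration, not part of the original thread):

```python
import unicodedata

# The Latin alpha and the Greek alpha are distinct code points,
# even though many fonts draw them with the same basic shape.
latin_alpha = "\u0251"
greek_alpha = "\u03b1"

print(unicodedata.name(latin_alpha))   # LATIN SMALL LETTER ALPHA
print(unicodedata.name(greek_alpha))   # GREEK SMALL LETTER ALPHA
print(latin_alpha == greek_alpha)      # False: no equivalence between them

# Both are ordinary lowercase letters (General_Category Ll),
# unlike the spacing accents discussed later in this digest.
print(unicodedata.category(latin_alpha))   # Ll
```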
Re: Deseret in use (?) by micronation Molossia
From: Doug Ewell [EMAIL PROTECTED]

António Martins-Tuválkin (antonio at tuvalkin dot web dot pt) wrote: Deseret in use (?) by micronation Molossia: It is explained at http://www.molossia.org/alphabet.html , but they put GIFs on-line, making no use of the U+10400 block... I visited their site, wondering if they could use some assistance with transcription and Unicode from an American foreign national who reads and writes Deseret (and has been to the area recently). But they seem more interested in relations with residents of other micronations than with Americans.

Seriously, can these kings and emperors, who reign over these lands, claim various disputed places around the world (or even on the Moon or Mars), and can issue currency by buying Monopoly(tm) game bills, be taken seriously? These micronations go through so many constitutional changes, for so few people (most often no more than a handful), that I doubt they can claim to create a standard. What is real is that they have found a legal way to escape their host country, placing themselves outside its laws, but also beyond its assistance. None of them is recognized internationally, except among themselves in a virtual forum (if they can pay for their Internet access used abroad...). We can accept their desire for independence, but they also have to accept what it implies. Most of these self-proclaimed lands have disappeared after less than a dozen years (divorce, family conflicts, or simply poverty caused by the lack of local jobs and resources; none of these lands could survive without tourism, and only if the people visiting them agree to pay their taxes in full-fledged US dollars or euros...). I remember one such self-proclaimed country, created by someone who bought an old oil platform and anchored it in international waters north of Europe. This platform is under international law, but it is not a homeland (nobody lives there), even though it is outside the laws of the neighboring country.
This allowed its owner to escape taxes and to found a financial company with undeclared money, shielded from fiscal inspection... until the status of those waters was resolved at the UN by an agreement between the neighboring countries.
markup on combining characters (was: Compatibility mappings for new Hebrew points)
From: Peter Kirk [EMAIL PROTECTED]

By the way, any suggestion of making the QQ distinction with markup is ruled out by the principle recently expounded on the main Unicode list that separate markup cannot be applied to combining characters.

Isn't this need for allowing separate markup on combining characters addressed by the current proposal to encode an invisible base character (IBC), so that markup can be applied to a non-defective combining sequence? I understand that this proposed new character would most likely be used to allow rendering isolated combining marks without needing to encode their spacing variants, but the sequence IBC, combining mark (now possibly enclosed in markup) could become a candidate for possible ligaturing by preceding it with a ZWJ, or for word-wrap exclusion with a leading WJ...
Re: markup on combining characters
From: Jony Rosenne [EMAIL PROTECTED]

Peter Kirk: You mean, you would represent a black e with a red acute accent as something like e, ZWJ, <red>, IBC, acute, </red>? That looks like a nightmare for all kinds of processing and a nightmare for rendering. No, it is more like <forecolor:black, combiningcolor:red> e acute. And there is no Unicode decision against it.

And there is still no decision on whether this invisible base character will be added or not. It's just a public review for now, to address the first issue of rendering isolated non-spacing combining marks that currently don't have a spacing variant (I think it's a good idea, as it would avoid adding most of the missing ones, notably for the non-generic L/G/C combining marks). Note that your suggestion of: <forecolor:black, combiningcolor:red> e acute should also work with any normalized form of the same text, i.e. with: <forecolor:black, combiningcolor:red> e-with-acute, where the combining mark is composed. The issue here is that this becomes tricky for renderers, which will need to redecompose strings into normalized forms before applying style. Basically I prefer Peter's solution with: e, ZWJ?, <red>, IBC, acute, </red>, which is more independent of the normalization form. Then the question is whether the text within the <red>...</red> markup should combine visually when rendered. For now I see the proposed IBC (it has no name yet) only as a way to transform non-spacing combining marks into spacing non-combining variants when they don't exist separately in Unicode (so this would not be recommended for the non-spacing acute accent, which already has a spacing version that does not require a leading IBC). Technically, if an IBC character is added, a renderer will not necessarily render IBC, non-spacing combining acute the same way as the spacing non-combining acute accent, even though it ought to.
In the last sentence, the "should" means that the existing spacing non-combining marks are left as the standard legacy way to encode them, and they normally don't combine when rendered after a base letter, even if there's markup around them (except if this markup explicitly says that they should combine). If I take the above example, e, ZWJ?, <red>, IBC, acute, </red>, the same rich text should also be renderable without the markup, in plain text, as if it were: e, ZWJ?, IBC, acute — i.e. (with the "should" above) as if it were also: e, ZWJ?, spacing acute. I have placed the ? symbol after ZWJ to show that something would be necessary to allow this last text to remove the non-combining, non-spacing behavior of the spacing acute character. Without it, the text e, spacing acute — or equivalently (with the "should" above) e, IBC, combining acute — would not be allowed to render a combined e with an acute; two separate glyphs would be rendered and two separate character entities interpreted (as they are today in legacy plain text). So the question remains about how to add markup on combining marks: the proposed IBC alone cannot solve such problems, unless there's an agreement that ZWJ immediately followed by IBC should be rendered as if they were not present (but in that case, a spacing acute becomes semantically and graphically distinct from IBC, combining acute: this is what will happen in any case with normalization forms, due to the Unicode stability policy, as existing spacing marks must remain undecomposable in NFD or NFKD forms). I also note that IBC is intended to replace the need to use a standard SPACE as the base character for building a spacing variant of combining marks when there's no standard spacing variant encoded in Unicode (this is a legacy hack, which causes various problems because of whitespace normalization in many plain-text formats and applications, and in XML and HTML, and because of the special word-breaking behavior of spaces).
I don't see it as a way to deprecate the existing block of spacing marks.
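The fragility that motivates keeping markup outside whole combining sequences can be seen directly: canonical normalization merges or splits the very code-point boundaries that character-level markup would point at. A small Python sketch of this (my illustration, not from the thread):

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT: two code points, one grapheme.
decomposed = "e\u0301"

# NFC composes the pair into the single precomposed code point U+00E9,
# so any markup boundary that sat between the two code points is lost.
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed))        # 2
print(len(composed))          # 1
print(composed == "\u00e9")   # True

# NFD goes the other way, re-exposing the combining mark.
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```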
Re: markup on combining characters
From: Asmus Freytag [EMAIL PROTECTED]

At 12:49 AM 9/8/2004, Philippe Verdy wrote: And still no decision if this invisible base character will be added or not. It's just a public review for now. Well, hold your horses for a bit here. If something's out for review, there won't be a decision until the review is over. Anything that has this much potential exposure is something we should move very slowly on, to make sure we get it right.

Isn't the public review there specifically to think about such things? It's not too soon to discuss it now, because the most serious issues will happen when the new character is encoded, possibly with missing or incorrect properties. I don't know whether a formal proposal has been sent to the ISO/IEC WG too. Maybe this review exists to allow creating such a formal proposal, to be encoded later by ISO if it agrees to give it a code point, and then accepted by the UTC once its properties are fixed and its usage is properly documented.
Re: [BULK] - Re: markup on combining characters
From: Asmus Freytag [EMAIL PROTECTED]

On the other hand, all aspects of *coloring* of characters do not belong in the plain text stream - but that was not the question. I think suggested solutions that define markup applying to combining characters, but place that markup outside of the combining sequence, would be a better answer than protocols trying to put markup inside the combining character sequence. My personal take is that the UTC might make a recommendation to that effect, but it's not part of the standard proper. It's not clear that the issue has practical urgency - if I should be mistaken on that, I'd like to find out how and why.

Placing markup outside the combining sequence seems attractive, but raises other difficulties about how to refer to parts of combining sequences (I did not say parts of characters, because I agree that combining characters are not parts of characters, but effectively true abstract characters per the Unicode definition), when combining sequences are themselves subject to transformations like normalization. A solution would be to specify in the markup which normalization to apply to the combining sequence before referring to its component characters, with some syntax like: <font style="color:red nfd(2,1);">e&combining-acute;</font> which would survive normalization of the document, such as NFC, in: <font style="color:red nfd(2,1);">&e-with-acute;</font>. Here some syntax in the markup style indicates an explicit NFD normalization to apply to the plain-text fragment encoded in the text element, before specifying a range of characters to which the style applies (here it says that color:red applies to only 1 character, starting at the second one in the surrounded text fragment, after it has been forced to NFD normalization).
Maybe this seems tricky, but other simplified solutions could be implemented in a style language, such as providing more basic restrictions using new markup attributes: <font style="combining-color:red">&e-with-acute;</font> where the new combining-color attribute implies such prenormalization and automatic selection of the character ranges to which to apply coloring. Maybe there are better solutions that do not imply augmenting the style language schema with lots of new attribute names, such as: <font style="color:combining(red)">&e-with-acute;</font>. Here also, Unicode itself is not affected. But markup languages and renderers would be seriously modified to take the new markup property names or values into account.
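A resolver for the nfd(start,count) addressing sketched in this message could work along these lines. This is a hypothetical sketch of my own; the function name and the 1-based start convention (matching the nfd(2,1) example above) are assumptions:

```python
import unicodedata

def style_range_nfd(text, start, count, style):
    """Hypothetical resolver for an 'nfd(start,count)' style address:
    normalize the fragment to NFD first, then select the code points
    the style applies to. 'start' is 1-based, as in the message."""
    nfd = unicodedata.normalize("NFD", text)
    styled = nfd[start - 1:start - 1 + count]
    return nfd, styled, style

# Whether the document stores the precomposed (NFC) or decomposed (NFD)
# spelling, the same address resolves to the same combining acute accent.
for fragment in ("\u00e9", "e\u0301"):
    nfd, styled, style = style_range_nfd(fragment, 2, 1, "color:red")
    print(hex(ord(styled)), style)   # 0x301 color:red
```

This is exactly the property the message wants: the style address survives renormalization of the document.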
Re: Questions about diacritics
From: Gerd Schumacher [EMAIL PROTECTED]

2. Another invisible diacritics carrier: I also found an acute on diphthongs, placed on the boundary of both letters (au, ei, eu, oe, and ui).

Wouldn't such a diacritic be carried by the currently proposed invisible base character (in the Public Review section of the Unicode website), by encoding for example: a, INVISIBLE LETTER, combining acute, u? If you think there's a grapheme cluster here, I suggest using ZWJ to attach the three default grapheme clusters: a, ZWJ, INVISIBLE LETTER, combining acute, ZWJ, u to create a kerning ligature between the two vowels. The invisible letter in PR-41 is also intended to support the INV character found in ISCII for the standard Brahmic scripts of India, but with probable interoperability problems. But I currently see no indication for its correct usage in the Latin script, except as a way to transform a combining diacritic into a non-combining one in isolation, when the legacy use of SPACE causes interoperability problems such as in XML and HTML or with word-breaking algorithms. As the intent is to create a spacing diacritic, not using a joining/ligaturing control before and after it would not create the desired effect, as the acute would be shown above a blank space between 'a' and 'u', as wide as the acute accent itself. The PR-41 proposal document suggests that the typical use of the Invisible Letter would be to display an isolated spacing diacritic between two spaces (or punctuation marks), a case where the XML/HTML treatment of whitespace sequences is to collapse them before rendering or interpreting them. Your request is quite similar to the case of the double diacritics already encoded in Unicode, except that double diacritics are displayed across the whole width above the two letters, whereas your usage would just put a standard-width diacritic centered on the kerning space between them.
For an acute accent, it's unlikely that doubling its width would be very readable, since it could be confused with a macron. Maybe your centered diacritic should be encoded like the other double diacritics.
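For comparison, the double diacritics already in Unicode are encoded between the two base letters and render across both. A quick Python look at one of them (my illustration, not from the thread):

```python
import unicodedata

# An existing double diacritic is placed *between* its two bases;
# the mark then renders across the full width of both letters.
tie = "a\u0361u"   # a + COMBINING DOUBLE INVERTED BREVE + u

print(unicodedata.name("\u0361"))       # COMBINING DOUBLE INVERTED BREVE
print(unicodedata.combining("\u0361"))  # 234, the Double_Above combining class

# Canonical normalization leaves the sequence intact: there is no
# precomposed form for double diacritics.
print(unicodedata.normalize("NFC", tie) == tie)   # True
```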
Re: Questions about diacritics
From: Peter Kirk [EMAIL PROTECTED]

Surely the intention is for INVISIBLE LETTER, combining acute to be equivalent (although it cannot be canonically equivalent) to the spacing acute, U+00B4? But then would this kind of ligature mechanism with ZWNJ and U+00B4 be appropriate? I would think not.

INVISIBLE LETTER, combining acute will indeed not be canonically equivalent, despite the fact that it should render and behave like the spacing acute. As ZWJ is intended to indicate that there's effectively a ligature opportunity between two grapheme clusters, I don't see why one would not support a, ZWJ, SPACING ACUTE to kern the spacing acute onto the right side of a. It won't create an accent *centered* above the letter, but it does allow the accent to move within the spacing area of the preceding letter. I accept the fact that this is just a ligature opportunity for renderers, with no different semantics than in the absence of the joiner. But I wonder if the digraph with the centered accent above is not simply that: the accent is a notation that does not change the semantics of the surrounding two vowels, with no orthographic consideration. In that case, this is really a rendering feature, and using ZWJ could be appropriate here, notably because IL, combining acute will remain canonically distinct from U+00B4, which also has the wrong character properties (it is not a letter; it is a symbol and a word-breaker by itself...). Most uses of isolated diacritics, however, are mainly symbolic rather than orthographic. The IL changes this, and becomes appropriate within the middle of words.
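The claim about U+00B4's properties and equivalence can be checked directly: it has only a *compatibility* decomposition to SPACE + combining acute, never a canonical one, so no IBC-based sequence could be made canonically equivalent to it either. A Python check (my illustration):

```python
import unicodedata

# U+00B4 ACUTE ACCENT decomposes only under compatibility mappings,
# to SPACE + COMBINING ACUTE ACCENT.
print(unicodedata.decomposition("\u00b4"))   # <compat> 0020 0301

# Canonical normalization leaves it alone; only NFKD/NFKC rewrite it.
print(unicodedata.normalize("NFD", "\u00b4") == "\u00b4")     # True
print(unicodedata.normalize("NFKD", "\u00b4") == " \u0301")   # True

# And its General_Category is Sk (modifier symbol), not a letter,
# as the message notes.
print(unicodedata.category("\u00b4"))   # Sk
```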
Re: Questions about diacritics
From: Doug Ewell [EMAIL PROTECTED]

Philippe Verdy (verdy underscore p at wanadoo dot fr) wrote: I also found an acute on diphthongs, placed on the boundary of both letters (au, ei, eu, oe, and ui). Wouldn't such a diacritic be carried by the currently proposed invisible base character (in the Public Review section of the Unicode website), by encoding for example: a, INVISIBLE LETTER, combining acute, u? If you think there's a grapheme cluster here, I suggest using ZWJ to attach the three default grapheme clusters: a, ZWJ, INVISIBLE LETTER, combining acute, ZWJ, u to create a kerning ligature between the two vowels.

I thought one of the unstated, beneficial side effects of INVISIBLE LETTER was that it might reduce the need for non-intuitive ZWJ and ZWNJ sequences. I may be wrong, though; I haven't followed the INVISIBLE LETTER debate very closely.

In the (short) PR-41 document, the intent is really to substitute another character for SPACE as the base character for isolated diacritics. (SPACE is known to cause problems in HTML/XML, due to whitespace compression, and in text parsers such as word-breakers.) It won't deprecate the existing spacing diacritics block, but it will avoid adding new spacing variants for the existing or future diacritics that may need them. The current reuse of SPACE as a base for diacritics requires changing the character properties of SPACE when a diacritic follows it, and this is really a bad exception to the general framework in which a combining sequence should inherit almost all of its properties from its base character. I don't see the proposal, in its current form, as a way to avoid any use of joiners/non-joiners.
Re: Questions about diacritics
Good point, but is the ZWNJ control supposed to be used as a base character with a defined height? I thought it was just a control for indicating where ligatures are preferably avoided when rendering, leaving it fully ignorable if the renderer has no option other than rendering the ligature. For this application, the following character was a base character. Other uses of ZWNJ before diacritics are in Indic scripts, or in the Hebrew proposals (in Public Review for Meteg), to control the meaning of the following character. So I do think that the LaTeX2e compound word mark should map to ZWNJ, INVISIBLE LETTER rather than just ZWNJ... The (-)burg abbreviation as (-)bg (with a non-spacing but non-combining breve) should then be encoded with the invisible letter, in combination with ZWNJ to make it non-spacing.

- Original Message - From: Jörg Knappen [EMAIL PROTECTED] To: Philippe Verdy [EMAIL PROTECTED] Cc: Doug Ewell [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Tuesday, September 14, 2004 6:06 PM Subject: Re: Questions about diacritics

In LaTeX2e with the Cork encoding (for TeXnicians: \usepackage[T1]{fontenc}) there is a so-called compound word mark. It has the functions of the ZERO WIDTH NON JOINER in the UCS: it breaks ligatures, and it can be used to produce a final s in the middle of a word. By design, it has zero width but x-height. So it can be used to carry accents to be placed in the middle between two characters. My classic example for this situation is the German -burg abbreviation often seen in cartography: it is -bg. with a breve between b and g. The abbreviation -bg. without the accent means -berg. --Jörg Knappen
Re: Questions about diacritics
Since INVISIBLE LETTER is spacing, wouldn't it make more sense to define

Isn't INVISIBLE LETTER rather *non-spacing* (zero-width minimum), even though it is *not combining*? I mean here that its width would be zero unless a visible diacritic expands it. It is then distinct from other whitespace characters, which have a non-zero minimum width but still expand with a diacritic above them (width expansion is normally part of the job of the renderer, or of the positioning/ligating tables for characters in fonts). I would expect that an INVISIBLE LETTER not followed by any diacritic will *really* be invisible, and will not alter the positioning of subsequent base characters (and would not even prevent their kerning into the previous base letter, such as in CAPITAL LETTER V, INVISIBLE LETTER, CAPITAL LETTER A, where A can still kern onto the baseline below V).
Historic scripts for Albanian: Elbasan and Beitha Kukju
This page: http://www.omniglot.com/writing/albanian.htm shows two historic scripts that have been used to write Albanian (Shqip):
- the Elbasan script, from the 18th century, which looks like Old Greek, used for the Tosk variant of the language. However, there are lots of unique letterforms, and the mapping to Old Greek is not straightforward;
- the Beitha Kukju script, invented in 1840 and named after its inventor. This second one looks very much like a modified version of the Latin script (the scans reproduce handwriting), but with major changes in the letterforms and some unique letters for j, d-with-stroke, th, kj, ng, ks, tsj, and ts. It is quite hard to read for Latin readers, and some forms may cause confusion (notably the letters for e, d, d-with-stroke, h, y and ü), so I think it's a distinct script rather than a variant of the Latin script.
Are these alphabets represented in Unicode? The page also gives the modern Latin alphabet (including Latin digraphs), based on Western European Latin letters.
Re: Questions about diacritics
From: Doug Ewell [EMAIL PROTECTED]

In the case of INVISIBLE LETTER, it seems likely -- based on the comments of experts -- that the benefits outweigh the disadvantages. But new control characters (and quasi-controls like IL) have tended to cause more problems and confusion for Unicode in the past than new graphically visible characters. The possibility of misuse has to be evaluated, and the rules do have to be stated clearly. Combinations involving IL plus SPACING ACCENT, or IL plus ZW(N)J, or whatever, should be part of the rules; what effect should such combinations have, and are they discouraged?

For IL, that is probably good enough. The most important misuse of IL could be avoided by saying in the standard that a renderer should make this character visible if it is not followed by the combining character it expects. This would avoid possible spoofing through its inclusion in critical texts such as personal and company names in signatures. A candidate rendering would be the dotted circle or square seen in the proposal, or a dotted square with the letters "IL" inside. This glyph would appear even if a visible-controls editing mode is not enabled.
Re: Unibook 4.0.1 available
From: Doug Ewell [EMAIL PROTECTED]

Marion Gunn (mgunn at egt dot ie) wrote: Is it really so hard to make multi-platform, open-office-type utilities? Actually, yes, it is. Mac users don't want an application to be too Windows-like, Windows users don't want an application to be too Mac-like (we'll see how the latest version of Photoshop goes over), and isolating all the differences in platform-specific modules while leaving the core functionality in common modules is a lot of work. If it were easy, it would be done more often.

Doesn't Java hide most of these platform details, by providing unified support for platform-specific look and feel? Aren't there now many pluggable look-and-feel (PLAF) and theme managers available, with automatic default selection of the look and feel of each platform? Aren't there enough system properties in these development tools that the application can simply consult them to adapt automatically to platform differences? Some known issues were related to filesystem differences, but even on MacOS X, Linux, or Windows, these systems have to manage multiple filesystems simultaneously, so even a good application made for only one platform needs to consult filesystem properties to get naming conventions, etc. On Linux, and now also on Solaris and AIX, the need to support multiple window managers also influences any single-platform development. Software also has to adapt to various versions and localizations of the OS kernel and core libraries to reach a wider compatible audience. Whatever we do today, we nearly always need to separate the core modules of the application from its system-integration layer, using various wrappers. Not doing so will greatly limit the compatibility of the application, and customers often don't know the exact details of how to set up the application to work in their environment.
It's certainly not easy, and there are tons of options, but writing a system wrapper once avoids many customer-support costs later, when a customer is furious at having paid for a product that does not work on their system. We are speaking here about software development, not about ad-hoc services for deployment on a unified platform (but even today, the cost of licenses and upgrades means that almost nobody has a standard platform on which to deploy an application).
Re: Unicode Shorthand?
From: Chris Jacobs [EMAIL PROTECTED]

- Original Message - From: Christopher Fynn [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Sunday, September 19, 2004 12:08 AM Subject: Unicode Shorthand? Is there any plan to include sets of shorthand (Pitman, Gregg etc.) symbols in Unicode? Or are they something which is specifically excluded?

Pitman and Gregg are common in English-speaking countries, but most of these shorthand methods work well only with a particular language and are specific to it. Note that shorthand transcription is still not dead today, because of the natural speed of writing with it (more than 120 words per minute, instead of roughly 60 words per minute with stenotype or typing), and also because the quality of transcriptions from magnetic tapes or audio recordings is still highly questionable, notably when the audio environment is noisy (in legal settings, it can be a big problem when one answer by a witness can't be understood clearly from the tape recording). One solution used today (because trained stenographers are becoming rare) is that the stenotypist or typist who transcribes a conversation must be present when the tape is created.

I don't know if it is excluded. A reason to exclude it would be if it were a cipher of something already encoded. The only shorthand I know something of, Dutch Groote, follows the pronunciation of the words rather than the spelling. Can shorthand be seen as a cipher of IPA?

Not at all. Most shorthands do not reflect the level of precision found in IPA, and the same sign represents several phonemes. See for example the well-known French Prévost-Delaunay stenography method, with a small online presentation and initiation at http://perso.wanadoo.fr/lepetitstenographe/index.html . In this method, most signs have multiple meanings, and there are abbreviations for phonemic elements commonly found at the end of words, plus specialized signs for common meanings or words that are specific to the French language.
It's not impossible to create a rendering system for such a stenographic system; however, the general layout is more complex than with traditional alphabets, because the layout of characters is highly dependent on the context of the previous letters, and the system includes glyphic differences for initial, medial and final forms, and special joining rules that alter the glyph form, all intended to ease fast transcription without lifting the pen.
Re: Unicode Shorthand?
From: D. Starner [EMAIL PROTECTED]

Christopher Fynn wrote: Is there any plan to include sets of shorthand (Pitman, Gregg etc.) symbols in Unicode? Or are they something which is specifically excluded? They're a form of handwriting, which is generally excluded. Why do they need to be encoded in a computer? General practice, at least, is to transcribe them into standard writing first.

Don't forget that shorthand methods are still taught today, with methods published in books. Such books are published using special encodings or image scans. Scanned images are often hard to create cleanly, and this is often a problem for first-time readers of such publications, since the system requires carefully drawn signs that would benefit from digital composition. There are good reasons why a shorthand-written text should be encoded as such, without going through transcription into the normal alphabetic system.
Re: Unicode Shorthand?
From: Christopher Fynn [EMAIL PROTECTED]

Philippe Verdy wrote: It's not impossible to create a rendering system for such a stenographic system; however, the general layout is more complex than with traditional alphabets, because the layout of characters is highly dependent on the context of the previous letters, and the system includes glyphic differences for initial, medial and final forms, and special joining rules that alter the glyph form. Sounds a bit like Arabic...

Not really, because the actual rendering is two-dimensional, not linear. It's difficult to predict the line height, as the baseline changes according to the context of the previous characters in the word and its writing direction (forward or backward).
Re: Unicode Shorthand?
From: Christopher Fynn [EMAIL PROTECTED]

Philippe Verdy wrote: Not really, because the actual rendering is two-dimensional, not linear. It's difficult to predict the line height, as the baseline changes according to the context of the previous characters in the word and its writing direction (forward or backward). Philippe, as Werner mentioned, this is like Nastaleeq. All the things you mention are rendering issues - not character encoding issues - and not very different from what is necessary to render some other complex scripts. As long as all these changes are based on contextual rules, they can be handled with a fairly simple encoding once the essential characters that make up the script are determined. regards

I do agree that this shorthand method looks very much like Arabic, but my answer was really about making a distinction from IPA. This is clearly not a pure phonetic notation; it has its own orthographic conventions, as well as very unusual rendering rules, which make systems capable of rendering Arabic insufficient to render shorthands. Your point about Nastaleeq is correct, as this is the script that standard French shorthand most resembles. But even Nastaleeq has a clear concept of a baseline that helps render it in an acceptable way. Rendering French shorthand on a constant baseline would be unacceptable: there's a baseline defined only for the beginning of words, not for each individual character, and this left-to-right baseline is visible for the whole text, but only because words are very often abbreviated (there are also specific symbols for common abbreviations, and often articles or particles are not written, though some functional suffixes are added if needed). My mother learned that script in the early 60's and used it throughout her career in her work as a secretary in the legal field. In many cases, most words are abbreviated by noting only the first one or two lexemes and possibly adding a functional suffix.
Not all words need to be noted, and she also used some personal abbreviated symbols for her most recurrent terms. The script is really compact: a single A5 sheet of hand-drawn shorthand was enough to note more than two A4 pages of typeset text (in 12-point Times or Courier). She was able to note more than 120 words per minute, and to note conversations with several participants such as public meetings, discussions about legal problems, negotiations... She still used a magnetic tape for cases where she might have forgotten to note some terms, or where she could not remember the exact meaning of some items, but she rarely needed it to transcribe the noted text back into typeset form (using the shorthand notes was even more practical than using dictaphones when typing the text into a word processor later, as she had an immediate global view of the sentences to type). I have always been impressed by her ability to note so many things, so fast, in such a compact form.
Re: [OT] Decode Unicode!
From: Curtis Clark [EMAIL PROTECTED] on 2004-09-24 10:05 Peter Constable did quote: After DNA, the ASCII code is the most successful code on this planet. Things get more and more complex. DNA is a 2-bit code. Not completely true. It is a bit less than 2 bits, due to its replication chains and the presence of insertion points where cross-overs are possible. But the effective code is a bit more complex than just the ATCG system: some studies have demonstrated that DNA alone has no function outside its substrate, whose nature influences its decoding. Some extra information is not coded directly in the DNA, and the DNA itself has a 3D structure which cannot be modeled completely with just this alphabet (try computing the positions of sulfurs and oxidations from this chain alone!). Research on DNA avoids this problem by isolating active subchains of the DNA whose behavior does not depend significantly on the substrate. The DNA is split at locus points where variation can occur. And not all of the DNA actively codes useful information; large fragments are simply there to consolidate its structure, or to recover from replication damage. In fact you can determine many more things from RNA fragments than from DNA itself, simply because RNA is not only the replication of DNA but also the result of its structuring in the substrate, with which it helps synthesize protein chains. Other information is contained in the mediators that help transform the RNA information into proteins. Some of these mediators are external to the cell, or may come from parasitic agents (bacteria, viruses), or live in symbiotic condition with the cell, which needs this "pollution" to live. Suppress those parasitic or symbiotic agents and the DNA alone will not allow the cell to survive...
Re: UTF-8 stress test file?
From: Terje Bless [EMAIL PROTECTED] -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Theodore H. Smith [EMAIL PROTECTED] wrote: I'd like to see a UTF-8 stress test file. The top result on Google for the query UTF-8 Stress Test is http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt. This test file is out of date and partly incorrect: it relates to the old RFC definition of UTF-8 referenced by previous versions of ISO/IEC 10646, not to Unicode's definition. Under the current definition, all UTF-8 sequences of 5 bytes or more are invalid (they are not boundary cases), so the list of impossible bytes is longer than documented there. The more exact definition of UTF-8, now shared by Unicode and by the current version of ISO/IEC 10646, is documented in the conformance section of the Unicode standard. Still, this file will be useful to determine whether your browser or editor effectively shows substitutes (like ?) where it should for all invalid sequences. But if your browser just says that this is not a UTF-8 encoded file, it will be right not to display it at all: - the file mixes UTF-8 and UTF-16 samples; - invalid sequences may raise an exception that informs the user that the file can't be decoded; - a browser or text editor may as well trigger its charset auto-detection mechanism to try to find another charset. If the file is then displayed assuming ISO-8859-1, showing each byte of the UTF-8 or UTF-16 sequences as if it were an ISO-8859-1 character, this is not a conformance problem for the browser or text editor.
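The "impossible bytes" point can be checked mechanically. Here is a small sketch (Python's built-in codec implements the modern, restricted definition of UTF-8, so it serves as a reference decoder here):

```python
# Bytes that can never appear anywhere in well-formed modern UTF-8:
# 0xC0/0xC1 would only start overlong 2-byte forms, and 0xF5-0xFF would
# start sequences beyond U+10FFFF, including the old 5- and 6-byte
# sequence leads 0xF8-0xFD that the stress test file treats as boundary
# cases.
never_valid = [0xC0, 0xC1] + list(range(0xF5, 0x100))

for lead in never_valid:
    try:
        bytes([lead, 0x80, 0x80, 0x80, 0x80]).decode("utf-8")
        raise AssertionError("sequence unexpectedly decoded")
    except UnicodeDecodeError:
        pass  # rejected, as the modern definition requires

# The real upper boundary case is the 4-byte sequence for U+10FFFF:
assert b"\xf4\x8f\xbf\xbf".decode("utf-8") == "\U0010ffff"
```

So under the current definition the longest legal sequence is 4 bytes, and the file's 5- and 6-byte "boundary" tests are simply invalid input.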
Re: UTF-8 stress test file?
From: Doug Ewell [EMAIL PROTECTED] Theodore H. Smith delete at elfdata dot com wrote: - the file mixes UTF-8 and UTF-16 Does this file mix UTF-8 and UTF-16? I thought it just had surrogates encoded into UTF-8? Of course a surrogate should never exist in UTF-8. You are right. Philippe's statement was incorrect, and also puzzling. Have you read the file's content? It clearly and explicitly speaks about UTF-16, which has no place in a UTF-8 test file, unless the file was meant as a test for CESU-8 (which is not UTF-16 either, and not even UTF-8). My statement was correct: it is based on the fact that the test file was created for the older (RFC) version of UTF-8 used in old versions of ISO 10646, which was never referenced (at least explicitly, until the 4.0.1 clarification) by any version of Unicode.
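On the "surrogates encoded into UTF-8" point: a strict modern decoder must reject them, whichever side they come from. A quick sketch:

```python
# ED A0 80 is what a CESU-8-style encoder would emit for the high
# surrogate U+D800; strict UTF-8 must reject it on decode...
try:
    b"\xed\xa0\x80".decode("utf-8")
    raise AssertionError("surrogate bytes accepted")
except UnicodeDecodeError:
    pass

# ...and a lone surrogate code point cannot be encoded as UTF-8 either:
try:
    "\ud800".encode("utf-8")
    raise AssertionError("lone surrogate encoded")
except UnicodeEncodeError:
    pass
```

So the file's surrogate section is, by design, not legal UTF-8 at all.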
Re: UTF-8 stress test file?
From: Clark Cox [EMAIL PROTECTED] unless the file was used as a test for CESU-8 The whole point of the CESU-8-like section is that it is not legal UTF-8. Except that the document does not even cite CESU-8, only UTF-16! The text itself is puzzling, as are nearly all its suggestions about conformance levels, about the way the text should be rendered, and about the way a parser should recover after encoding violations...
Re: UTF-8 stress test file?
From: Philipp Reichmuth [EMAIL PROTECTED] Don't you think you are stretching things a bit? This is a UTF-8 parser stress test file. If an application opens it in a different encoding, well, of course the results will be different, and things will not look UTF-8-ish. Again, this is a non-issue. It's like distributing a Linux binary for testing something and then getting complaints that it doesn't work under DOS and that it shouldn't make assumptions about operating systems. That's not the point I wanted to make. Things CANNOT look UTF-8-ish in a conforming UTF-8 editor or browser, which will correctly detect all the encoding errors in that file and thus never present the text properly aligned. What a conforming editor or browser *may* do is recover while signalling to the user the positions of the errors (possibly by using a replacement glyph, as if each error encoded a U+FFFD substitute). But how many errors should be signalled, given that the error recovery policy is not defined in the Unicode or ISO/IEC UTF-8 standard? Even in the old ISO/IEC 10646 standard, recovery after errors is only possible if uninterpretable byte sequences are still parsed into sub-sequences (of unspecified length) for which a substitute can be used. The problem is in the length of each invalid byte sequence. For example, with an old-style 4-byte (or longer) UTF-8 sequence, the error will be detected at the first byte; recovery may take place at the second byte, after the first has been interpreted as an invalid sequence represented by a substitute glyph, but then each of the immediately following trailing bytes will signal a further error. Suppose instead that the parser recovers by scanning forward until it finds a new starter byte: it still needs to parse that byte to see whether it is the leading byte of a longer sequence, so recovery is not necessarily possible immediately after the first invalid byte, or after the supposed end of the byte sequence.
Now if the parser recovers by skipping all bytes until a valid sequence is found, there will be only one encoding error reported for the leading byte, and only one substitution glyph. We are navigating in unspecified territory here: error recovery after decoding errors is not defined in the current UTF-8 standard itself (nor in the old RFC version used with ISO/IEC 10646-1:2000). And as I said, the document itself is not complete enough, because it forgets other invalid sequences, such as those for non-characters.
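The "how many substitutes?" ambiguity is easy to observe in practice. As a sketch, the counts below are what CPython's decoder produces (it follows the "maximal subpart" substitution practice that Unicode later recommended); other decoders could legitimately report different counts:

```python
# An invalid lead byte followed by stray trail bytes: no maximal
# subpart spans more than one byte, so each byte gets its own U+FFFD.
bad1 = b"\xf8\x80\x80\x80\x80"
assert bad1.decode("utf-8", errors="replace").count("\ufffd") == 5

# A truncated but (so far) well-formed 4-byte sequence: the whole
# 3-byte prefix is one maximal subpart, hence a single U+FFFD.
bad2 = b"\xf0\x9f\x98"
assert bad2.decode("utf-8", errors="replace").count("\ufffd") == 1
```

A decoder that instead skipped to the next starter byte would report one substitute for bad1 as well, which is exactly the unspecified behavior discussed above.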
Re: internationalization assumption
From: Antoine Leca [EMAIL PROTECTED] On Tuesday, September 28th, 2004 03:22 Tom wrote: Let's say the test engineer ensures the functionality and validates the input and output on major Latin 1 languages, such as German, French, Spanish, Italian. Just a side point: French cannot be fully addressed with Latin 1. True, due to the missing (but rare) oe/OE ligature (which is present in the newer Latin 9, as well as in the Windows ANSI codepage 1252 for Western European languages). Anyway, few French users actually complain about this omission: either they use ISO-8859-1 and the ligatures are simply replaced by the separate vowels (which is still correct for French collation, even though strict French orthography requires the ligature when *rendering*; in addition, French keyboards typically include no key to enter these ligatures, which are only entered by assisted word processors with on-the-fly autocorrection), or they use the Windows 1252 codepage without noticing that these characters were added to Latin 1 by Microsoft in its Windows codepage. A few common words that use the ligature are oeil (English: eye), oeuf (English: egg), boeuf (English: beef), and coeur (English: heart). (Note that this message does not use the mandatory ligature.) There are some other words, but they are really uncommon in French conversation (most of them belong to the medical and botanical vocabulary). This ligature cannot be automated so simply in renderers, because there are exceptions: see coexister, where the two vowels are clearly voiced separately and must never be ligated. But one way to determine whether oe must be ligated in French is when it is followed by another vowel (normally an 'i' or 'u'), and the e carries no accent.
The ae ligature is also used in French, but not in the common language (I think it is used only in some technical legal or religious terms inherited from Latin, or in some medical and botanical jargon): I can't even think of one everyday French word that uses it; that's why some fonts designed for French replaced the ae and AE ligatures with the oe and OE ligatures. (Note that I say ligature and not vowel, because that is their actual usage in French, which also matches the French collation rules.) With those considerations, would software that only supports the ISO-8859-1 character set be considered not ready for French usage? I think not, and even today most French texts are coded with this limited subset, without worrying about the absence of a rare ligature, since readers easily infer the ligature where it is missing.
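The charset situation described above is easy to demonstrate. A sketch, using coeur (heart) written with the mandatory ligature:

```python
text = "c\u0153ur"  # "cœur", with U+0153 LATIN SMALL LIGATURE OE

# ISO 8859-1 (Latin-1) has no œ/Œ, so strict encoding fails:
try:
    text.encode("iso-8859-1")
    raise AssertionError("Latin-1 unexpectedly has the ligature")
except UnicodeEncodeError:
    pass

# Windows codepage 1252 added Œ at 0x8C and œ at 0x9C in the
# 0x80-0x9F range that ISO 8859-1 reserves for control codes:
assert text.encode("cp1252") == b"c\x9cur"

# The common Latin-1 degradation keeps the text readable (and, as noted
# above, collation-correct for French):
assert text.replace("\u0153", "oe") == "coeur"
```

This is exactly why cp1252 data is so often mislabeled as ISO-8859-1: for most French text the two differ only in these rarely-used positions.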
Re: internationalization assumption
About the French ligatures 'oe' (and 'ae'), I should have cited this excellent summary page (in French) on their usage and history: http://fr.wikipedia.org/wiki/Ligature_(typographie) Note that Latin- or Greek-derived words use the ligature when the vowels are not pronounced separately, the etymological 'o' not being vocalized. Only the final 'e' vowel remains, sometimes pronounced like 'é', or more recently and very commonly like the digraph 'eu'. The French Wikipedia page is more complete than the corresponding English page, but the German page contains interesting information about ligatures in German and other Central European languages.
Re: internationalisation assumption
I use my own keyboard with the standard AZERTY French layout plus some extensions. I would not use the QWERTY-based Swedish layout. It can be downloaded for Windows from http://www.rodage.org/pub/French-Sahel.html (built with MSKLC, available for free under the LGPL). Its layout is shown at http://www.rodage.org/pub/French-Sahel.pdf - Original Message - From: Stefan Persson [EMAIL PROTECTED] To: Unicode Mailing List [EMAIL PROTECTED] Sent: Thursday, September 30, 2004 5:05 PM Subject: Re: internationalisation assumption Philippe Verdy wrote: in addition, French keyboards typically never include a key to enter these ligatures, which are only entered with assisted word processors with on-the-fly autocorrection In that case, I'd recommend that French people use my Swedish Linux keyboard, which *does* contain the ligature. For Swedish people the ligature serves a much smaller purpose; it's only used by Swedes who need to write documents in French, or who use English transcriptions of ancient Greek names such as Œdipus. Stefan
Re: Grapheme clusters
From: Chris Harvey [EMAIL PROTECTED] The users seem determined to put the entire alphabet into the PUA, thus making a single character for ng, kw, ii etc. I would like to be able to present them with something that works and avoid this kind of catastrophe. A better alternative to the PUA, which would require specific fonts and offer no interoperable solution, would be to use controls that make grapheme clusters explicit: notably ZWJ, while making sure that the editor effectively handles the cluster as a single unit, including for backspace. Or maybe use existing combining modifier letters, even if they look like superscripts in existing fonts (if you are ready to go to the PUA, you would need to develop a font anyway); but as we don't know the full extent of the alphabet, it's hard to determine which solution is best. I am assuming (possibly wrongly) that you need to support some African languages, and if so, there are existing proposals to improve their support in Unicode with pending new Latin letters. Using the PUA could be an interim solution before the new characters are introduced, notably if you need combining modifier letters to act with the base letter as a single cluster. If you need this to support the Latin transliteration of the Native North American languages on your web site, as a convenient tool allowing reverse transliteration to the native script (which has constraints on its syllabic structure), and a convenient way to fix the Latin orthography in order to create richer content transliterated appropriately and automatically into the native script, then maybe you really need a specific editor that can check and enforce the Latin orthography. For example, you cite the case of Pacific coast schwas, raised consonants and ejectives (like kw q), or Hawaiian long vowels (with macrons, rarely supported in fonts), which are difficult to enter with existing keyboards and fonts.
Using a more basic ASCII-based orthography seems like an input method for such languages, and an intermediate step before the production of actual existing Unicode characters using the proper combining or modifier letters. (In that case, Unicode itself is not the issue, and you may instead ask how to create an input method editor which can show a simplified ASCII-only transliteration that can reliably be converted to the more exact orthography.)
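The cluster-handling requirement above (digraphs and base+diacritic sequences treated as single editing units, including for backspace) can be sketched roughly. This is only an illustration of the idea, not a real segmenter: a real editor should implement UAX #29 grapheme cluster boundaries.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

def clusters(text):
    """Very rough segmentation: attach combining marks to the preceding
    base character, and keep characters joined across ZWJ together."""
    out = []
    for ch in text:
        joins = out and (unicodedata.combining(ch) != 0
                         or ch == ZWJ
                         or out[-1][-1] == ZWJ)
        if joins:
            out[-1] += ch
        else:
            out.append(ch)
    return out

# A Hawaiian long vowel entered as base + combining macron is one
# user-perceived letter, two code points; backspace should remove both:
assert clusters("ku\u0304") == ["k", "u\u0304"]

# Two letters joined by ZWJ stay one editing unit, the alternative
# suggested above to a single PUA character for "ng":
assert clusters("n" + ZWJ + "g") == ["n" + ZWJ + "g"]
```

The point is that the cluster behavior lives in the editor, not in the encoding, so no PUA code points are needed.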
Re: internationalization assumption
RE: internationalization assumption Well, the main issue for internationalization of software is not the character sets with which it was tested. It is in fact trivial today to make an application compliant with Unicode text encoding. What is more complicated is to make sure that the text will be properly displayed. The main issues that cause most of the problems are in the following areas: - dialogs and GUI interfaces need to be resized according to text lengths; - a GUI may have been built with a limited set of fonts, all of them with the same line height for the same point size; if you have to display Thai characters, you'll need a larger line height for the same point size; - some scripts are not readable at small point sizes, notably Han sinograms or Arabic; - the GUI layout should preferably be mirrored for RTL languages; - you need to be aware of the BiDi algorithm, and you'll have to manage the case of mixed directions each time you include portions of text from a generally LTR script within an RTL interface (for Hebrew or Arabic notably): ignoring that, your application will not insert the appropriate BiDi controls needed to properly order the rendered text, notably for mirrored characters such as parentheses. For some variable inclusions in an RTL resource string, you may need to insert a surrounding RLE/PDF pair so that the embedded Latin items display correctly; - GUI controls such as input boxes should be properly aligned so that input is performed from the correct side.
- Tabular data may have to be presented with distinct alignments, notably if items are truncated in narrow but extensible columns (traditionally, tabular text items are aligned on the left and truncated on the right, but for Hebrew or Arabic they should be aligned and truncated in the opposite direction). - You have to be aware of the variety of scripts that may be used even in a pure RTL interface: a user may need to enter sections of text in another script, most often Latin. You have to consider how these foreign text items will be handled. - In editable parts of the GUI, mouse selection will be more complex than you think, notably with mixed RTL/LTR scripts. - You can't assume that all text will be readable with a fixed-width font. Some scripts require variable-width letters. - You have to worry about grapheme clusters, notably in Hebrew, Arabic, and nearly all Indic scripts. This is more complex than for Latin, Greek, Cyrillic, Han, Hiragana or Katakana text. Even with the Latin script, you can't assume that all grapheme clusters consist of only one character. For various reasons, common texts will be entered using combining characters, without the possibility of making precomposed clusters (this is especially true for modern Vietnamese, which uses multiple diacritics on the same letter). - Text handling routines that change the presentation of text (such as capitalisation) will not work properly or will not be reversible: even in the Latin script, there are some characters that exist in only one case. Titlecasing is another issue. Such automated presentation effects should be avoided unless you are aware of the problem. - Plain-text searches often need to be case-insensitive. This issue is closely related to collation order, which is sensitive to local linguistic conventions, not only to the script used.
For example, plain-text search in Hebrew will often need to support searches with or without the vowel marks, which are combining characters, simply because they are optional in the language. When search is used to match identifiers such as usernames or filenames, various options are exposed to you. In addition, there is a lot of legacy text that is not coded with the most accurate Unicode characters, simply because it was entered with more restricted input methods or keyboards, or was coded with more restricted legacy charsets (the 'oe' ligature in French is typical: it is absent from ISO-8859-1 and from standard French keyboards, although it is a mandatory character for the language; however, it is present in Windows codepage 1252, and may be present in texts coded with it, because it will have been entered through assisted editors or word processors that perform autocorrection of ligatures on the fly). - GUI keyboard accelerators may not work with some scripts: you can't assume that the displayed menu items will contain a matching ASCII letter, so you'll need some other way to allow keyboard navigation of the interface. This issue is related to accessibility guidelines: you need to offer a way for users to see which keyboard accelerators they can use to navigate your interface easily. Don't assume that accelerators for one language will work as easily for another language. - Toolbar buttons should avoid graphic icons with text elements, unless these items are also
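The caseless-search point in the checklist above has sharp corners even in the Latin script. A sketch with German ß, a character that (in legacy usage) exists in only one case:

```python
# Simple lower() is not enough for caseless matching: 'ß' has no
# single-character uppercase in legacy usage, and uppercasing spells
# it "SS", so the round trip is not reversible.
assert "stra\u00dfe".lower() == "stra\u00dfe"   # lower() leaves ß alone
assert "STRASSE".lower() == "strasse"           # ...so these never match

# Full case folding expands ß to "ss", so both sides compare equal:
assert "stra\u00dfe".casefold() == "strasse"
assert "STRASSE".casefold() == "stra\u00dfe".casefold()
```

This is why caseless matching must be a dedicated folding operation, not a pair of lowercase conversions.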
Polytonic Greek pneuma letters (spirits) and half-eta glyphs
This page on the French version of Wikipedia notes that polytonic Greek, in the 3rd century B.C., used alternate letters to denote the initial breathings (pneuma dasú for the rough breathing, pneuma psilón for the smooth breathing), rather than the modern 9-shaped combining accents. http://fr.wikipedia.org/wiki/Diacritiques_de_l%27alphabet_grec (Note: to see all the letters in Internet Explorer, you have to configure it to use the Arial Unicode MS font from Office or the free Code2000 font, and to indicate to Internet Explorer, in the Accessibility options, to ignore the font styles selected on web pages: the default font selected in the Wikipedia CSS stylesheet for Internet Explorer forces Arial, which does not contain glyphs for all these characters; apparently Wikipedia has trouble finding a reliable way to configure its stylesheets to work with the various versions of Windows or IE.) These breathings were noted initially by Aristophanes with a variant of the historic letter H, which had noted the /h/ sound (and was later reused, once that sound fell out of use, to note the sound /è/ as eta), by cutting the H (eta) glyph into two halves (sometimes found with L-shaped glyphs lacking the lower part of the vertical). These historic phonemes survive today only as diacritics in modern polytonic Greek, but this is not the case in historic texts, where they may still be pronounced /h/ on initial vowels, diphthongs, or rho. The same page gives an encoding for the latest non-combining form, where these breathings are represented by upper tacks (before they became diacritics). My question is: can these historic half-eta letters be unified with these tacks, or are they distinct letters? Are there variants encoded for these historic half-eta letters, to indicate that they should be shown not with the upper-tack glyphs but with the historic half-eta glyphs?
Re: text-transform
From: fantasai [EMAIL PROTECTED] Comments on CSS (but not how-to questions) should be directed to the www-style mailing list at w3.org, not unicode: http://lists.w3.org/Archives/Public/www-style/ OK for the numeric versus capitalize|uppercase|lowercase remark, which is related to form validation and probably has no place on the Unicode list. But the general discussion of the behavior of BiDi with vertical scripts, or horizontal scripts rendered vertically (or even in boustrophedon), is still something the BiDi algorithm in Unicode does not solve completely. There are directional properties that are inherent to scripts and their characters, and that are in the direct focus of Unicode standardization. Although this was discussed in relation to CSS3, it is still a big issue for Unicode, because the problem is not specific to CSS: it directly affects any rendering of plain text. The CSS3 article was very interesting to read because it really speaks about problems that exist today with scripts already in Unicode, and for which the BiDi properties do not seem sufficient to write a generic renderer for all of them (including the interaction of Latin/Greek/Cyrillic with Han/Hiragana/Katakana, or the special interaction of Hiragana/Katakana within Han text). I bet that if the proposed CSS3 model works, it will demonstrate which properties need to be added to the Unicode standard, for use in other, non-CSS-based applications. Maybe this will require new BiDi controls and a more complex algorithm to handle them. For now, the only safe way to do this is to base the augmented properties on the script property of characters (still with ambiguity problems for general-purpose characters like punctuation and spaces).
Re: basic-hebrew RtL-space ?
From: kefas [EMAIL PROTECTED] Inserting unicode/basic-hebrew results in a convenient RtL, right-to-left, advance of the cursor, but the space character jumps to the far right. Is there an RtL space? In MS Word and OpenOffice I can only change whole paragraphs to RtL entry. But quoting just a few words in Hebrew WITHIN a paragraph would be helpful to many. And this is what the embedding controls are made for: - surround an RTL subtext (Hebrew, Arabic...) within an LTR paragraph (Latin...) with an RLE/PDF pair; - surround an LTR subtext (Latin...) within an RTL paragraph (Hebrew...) with an LRE/PDF pair. There's no need for a separate RTL space, given that the regular ASCII SPACE (U+0020) is used within all RTL texts as the standard default word separator, and that it has a weak directionality: one that does not force a direction break, but is inherited from the surrounding text. A good question, however, is whether the space should inherit its direction from the previous text or from the next. - If the previous text has a strong directionality, then the space should inherit its direction. This should be the case whenever you are entering text that ends with a space: it's very disturbing to see this new space shift to the opposite side when entering some space-separated Hebrew words within a Latin text, because the editor assumes that no more Hebrew will be added on the same line (this causes surprising editing errors, for example when creating a translation resource file where translated resources are prefixed by an ASCII key, such as when editing a .po file for GNU programs using gettext()).
- If the previous text in the same paragraph has no directionality, then the space inherits its direction from the text after it (if that has a strong directionality); - if this does not work, then a global context for the whole text should be used, or alternatively the directionality at the end of the previous paragraph (this influences where the cursor goes when aligning such a weakly-directed paragraph with the previous paragraph, including the default start margin position). The regular BiDi algorithm should be used to render a complete text, but strict BiDi rules should not be obeyed at every moment while composing a text, where the current cursor position should act as a sentence break with a strong inherited directionality: the text can then be redirected at this position when the cursor moves to other parts of the text. I don't think this is an issue for renderers but for editors (notably Notepad, where you won't know exactly where a space will go during editing, unless you use the contextual menu that switches the global default directionality and swaps the alignment to the side margins; sometimes, when you want to know where the LRE/RLE and PDF BiDi controls are, it's nearly impossible to determine visually in Notepad, unless you use an external tool such as native2ascii, from the Java SDK, to change the encoding into clearly visible escapes). It's unfortunate, given that Notepad (since Windows XP) offers a directly accessible contextual menu to enter BiDi controls and change the global direction and alignment to the side margins. (But Notepad has no visible-controls editing mode to resolve such ambiguities.)
It was used for some legacy encodings to render text on devices that don't implement the BiDi algorithm and can only render text LTR. Nobody enters RTL text in pseudo-visual LTR order; only the logical input order is needed. But don't confuse the input order and the encoding order, as they can be different (they should not be if the text is converted and stored in Unicode, where only logical order is legal for any mix of Latin, Greek, Cyrillic, Hebrew and Arabic). The case of Thai is different because its input order is (historically) visual rather than logical, and the text is then encoded in that same (visual) order. This was not changed for Thai in Unicode, to keep compatibility with the national Thai standard TIS-620 (and later revisions). So even though Thai uses a non-logical order, its input order and encoding order are the same. The difference between encoding orders is known mainly for historic texts created for modern Hebrew, and more rarely Arabic, or for texts encoded in a private pre-press encoding used to prepare the global layout of pages (such texts are more easily and quickly processed in complex page layouts if they are prepared in visual order before being flowed into the page layout template; such applications use specific encodings in a richer rendering context than just plain text, so this is out of scope for the Unicode standard itself).
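The RLE/PDF advice from the start of this message can be sketched concretely. The helper name below is of course only illustrative:

```python
# The embedding controls from the Unicode BiDi algorithm. (Later
# Unicode versions, from 6.3 on, added the isolate controls
# LRI/RLI/PDI, which are now the recommended alternative.)
LRE, RLE, PDF = "\u202a", "\u202b", "\u202c"

def embed(fragment, rtl):
    """Wrap a quoted fragment so its direction is explicit inside a
    paragraph of the opposite base direction."""
    return (RLE if rtl else LRE) + fragment + PDF

# Quoting a Hebrew word ("shalom") inside a Latin, LTR paragraph;
# the ordinary U+0020 spaces around it need no special RTL variant:
shalom = "\u05e9\u05dc\u05d5\u05dd"
para = "The word " + embed(shalom, rtl=True) + " means peace."
assert para.count(RLE) == 1 and para.count(PDF) == 1
```

Note that the controls carry no visible glyph; they only steer the reordering, which is exactly why they are so hard to spot in editors like Notepad.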
Re: Opinions on this Java URL?
From: A. Vine [EMAIL PROTECTED] I'm just curious about the \0 thing. What problems would having a \0 in UTF-8 present that are not presented by having \0 in ASCII? I can't see any advantage there. Beats me, I wasn't there. None of the Java folks I know were there either. The problem is in the way strings passed to JNI via the legacy *UTF() APIs are accessed: there's no indicator of the string length, so it would be impossible to know whether a \0 terminates the string if \0 were allowed in the string data itself. The C0 80 encoding is a way to escape this character, so that it can be passed to JNI using the legacy *UTF() APIs that have existed since Java 1.0. This encoding is also part of the Java class file format, where string constants are encoded the same way. Note that the Java String object allows storing ANY UTF-16 code unit, including invalid ones (0xFFFE and 0xFFFF), as well as isolated or unpaired surrogates. So internally, Java does not use UTF-16 strictly. Using a plain UTF-8 representation would have prevented the class format from supporting such String instances, which are invalid for Unicode but not in Java. Using CESU-8 would not work either.
There are legacy Java applications that use the String object to store unrestricted arrays of unsigned 16-bit integers (the Java native type char), without any connection to whether they represent valid characters; this representation has the advantage of allowing fast loading of classes containing large constant pools (such classes avoid the long class initialization code run when initializing an array of an integer type, because the String constant pool is decoded and loaded into chars directly by native CPU code in the JVM rather than by interpreted bytecode that will never be compiled; this may seem a bad programming practice, but the Java language spec allows it, and Sun will not remove the possibility without breaking compatibility with those programs). This modified UTF should then be regarded as a specific encoding scheme that supports the unrestricted encoding form used by Java String instances (extended UTF-16, or more exactly UCS-2), which, by initial design, can represent and store *more* than just valid Unicode strings. The newer JNI interface allows reading/returning String instance data directly in the UCS-2 encoding form, without using the specific modified-UTF encoding scheme: there's an API parameter to pass the actual string length, so the interface is binary-safe. Applications can then use it to pass any valid Unicode string, or even invalid ones (with invalid code units or unpaired surrogates) if they wish. There's no requirement that this data represent only true characters. Note that even Windows uses an unrestricted UCS-2 representation in its Unicode-enabled Win32 APIs. The newer UCS-2 interface is enough for JNI extensions to generate true UTF-8 if they wish. I don't see the point of adding extra support for true UTF-8 in JNI, given that this support is trivial to implement using either the null-terminated *UTF() JNI APIs or the UCS-2-based JNI APIs...
In addition, this support is not really needed for performance (the UCS-2 interface is the fastest one for JNI, as it avoids requiring the JNI extension to allocate internal work buffers when working with native OS APIs that can also use UCS-2 directly, without extra code converters).
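The C0 80 escape for NUL is simple enough to sketch. Here is an illustrative encoder for the NUL-escaping part of Java's "modified UTF-8" (the function name is mine; a complete implementation would also have to emit CESU-8 surrogate pairs for supplementary characters, so this sketch assumes BMP-only input):

```python
def encode_modified_utf8_bmp(s):
    """Encode a BMP-only string the way Java's modified UTF-8 handles
    NUL: U+0000 becomes the overlong two-byte form C0 80, so that no
    0x00 byte ever appears and C-style strlen() stays usable."""
    out = bytearray()
    for ch in s:
        if ch == "\x00":
            out += b"\xc0\x80"      # overlong escape, invalid in strict UTF-8
        else:
            out += ch.encode("utf-8")
    return bytes(out)

encoded = encode_modified_utf8_bmp("a\x00b")
assert encoded == b"a\xc0\x80b"
assert b"\x00" not in encoded       # safe for NUL-terminated JNI APIs

# A strict UTF-8 decoder must reject the escape, which is exactly why
# modified UTF-8 is its own encoding scheme, not UTF-8:
try:
    encoded.decode("utf-8")
    raise AssertionError("overlong form accepted")
except UnicodeDecodeError:
    pass
```

The trade-off is visible here: the serialized form preserves embedded U+0000 while remaining a valid C string, at the cost of no longer being strict UTF-8.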
Re: U+0000 in C strings (was: Re: Opinions on this Java URL?)
- Original Message - From: John Cowan [EMAIL PROTECTED] To: Doug Ewell [EMAIL PROTECTED] Cc: Unicode Mailing List [EMAIL PROTECTED]; Philippe Verdy [EMAIL PROTECTED]; Peter Kirk [EMAIL PROTECTED] Sent: Monday, November 15, 2004 7:05 AM Subject: Re: U+0000 in C strings (was: Re: Opinions on this Java URL?) Doug Ewell scripsit: As soon as you can think of one, let me know. I can think of plenty of *binary* protocols that require zero bytes, but no *text* protocols. Most languages other than C define a string as a sequence of characters rather than a sequence of non-null characters. The repertoire of characters that can exist in strings usually has a lower bound, but its full magnitude is implementation-specific. In Java, exceptionally, the repertoire is defined by the standard rather than the implementation, and it includes U+0000. In any case, I can think of no language other than C which does not support strings containing U+0000 in most implementations. It is exactly the inclusion of U+0000 as a valid character in Java strings that requires this character to be preserved in the JNI interface and in String serializations. Some here think this is broken behavior, but there is no other simple way to represent this character when passing a Java String instance to and from a JNI interface, or through serialization such as in class files. My opinion is that the Java behavior does not define a new encoding; it is rather a transfer encoding syntax (TES), so that it can effectively serialize String instances (which are UCS-2 encoded using the 16-bit char Java datatype, and not only the UTF-16 restriction of UCS-2, which also requires paired surrogates; this does not make the '\u0000' and '\uFFFE' char or code unit illegal, as they are simply mapped to the U+0000 and U+FFFE code points, even though U+FFFE is permanently assigned as a non-character in Unicode and ISO/IEC 10646).
The internal working storage of Java Strings is not a character set (CCS or CES), and these strings are not necessarily bound to Unicode (even though Java provides lots of Unicode-based character properties and character-set conversion libraries), as they can store other charsets as well, using charset encoding/decoding libraries other than those found in the java.io.* and java.text.* packages. Once you admit that, Java String instances are just arrays of code units, not arrays of code points; their interpretation as encoded characters is left to other layers. Should there ever be a successor to Unicode (or a preference in a Chinese implementation for handling String instances internally with GB18030), with different mappings from code units to code points and characters, the working model of Java String instances and the char datatype would not be affected. This would still conform to the Java specification, provided the standard java.text.*, java.io.* and java.nio.* packages that perform the various mappings between code units and code points, characters and byte streams are not modified: new alternate packages could be used, without changing the String class or the unsigned 16-bit integer char datatype. In Java 1.5, Sun chose to support supplementary characters without changing the char and String representations, but the Character class was extended with a static API for code points represented as 32-bit int, including the mapping between any Unicode code point in the 17 planes and char code units. The String class was then extended to allow parsing char-encoded strings by int code points (with automatic support and detection of surrogate pairs), but the legacy interface was preserved.
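The code-unit/code-point distinction can be observed directly with the String methods added in Java 1.5 (a minimal sketch; U+1D11E MUSICAL SYMBOL G CLEF is used here only as a sample supplementary character):

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // U+1D11E needs a surrogate pair in UTF-16, so the String stores
        // two 16-bit char code units that together denote one code point.
        String s = "\uD834\uDD1E";
        System.out.println(s.length());                      // 2 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1d11e
    }
}
```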
In ICU4J, the UCharacter class does not use a static-only representation but stores code points directly as int, unlike Character, whose instances still store only a single 16-bit char and which offers only static support for code points: there is still no Character(int codepoint) constructor, only Character(char codeunit), because Character keeps its past serialization for compatibility, and Character is also bound to the 16-bit char datatype for object boxing (automatic boxing only exists in Java 1.5; explicit boxing in previous and current versions is still supported). If Java needs further extension, it would be to include the ICU4J UCharacter class, which allows storing 32-bit int code points, or building a UCharacter from a char-coded surrogate pair of code units, or from a Character instance; and also to add a UString class using arrays of int-coded code units internally, with converters between String and UString. Such an extension would not need any change in the JVM, just new supported packages. But even with all these extensions, the U+0000 Unicode character would remain valid and supported, and there would still remain the need to support it in JNI and in internal JVM serializations of String instances. I really don't like the idea of some people
Re: Opinions on this Java URL?
From: Christopher Fynn [EMAIL PROTECTED] Isn't it already deprecated? The URL that started this thread http://java.sun.com/j2se/1.5.0/docs/api/java/io/DataInput.html is marked as part of the Deprecated API Deprecated does not mean that it is not used. This interface remains accessible when working with the internal class file format. I don't understand, however, why the storage format of the string constants pool was not changed when the class format was updated in Java 1.5. Classes compiled for Java 1.5 won't run on previous versions of Java, due to the addition of new class-file elements like annotations and generics; however, classes that don't use these new features can still be compiled in Java 1.5 for compatibility with Java 1.4 and lower, and they will still run in Java 1.5. This means that Java 1.5 still needs to recognize the legacy class format that uses the modified UTF serialization of the String constants pool. As Java 1.4.1 also introduced support for supplementary characters, it would have been useful for Sun to change its modified UTF encoding in class files at the same time, so as to encode supplementary characters in 4 bytes where possible (when they are represented in the String instance as a valid surrogate pair), instead of the 6 bytes used today for separately encoded surrogates, to optimize the size of the String constants pools containing them. (I don't know if this has been done in the new compact distribution format that replaces the legacy zipped JAR format.)
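The 6-byte versus 4-byte cost can be checked with a small sketch (assuming `writeUTF`, which uses the same modified UTF encoding as the class-file constants pool; the class and method names are illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;

public class SupplementaryUtfDemo {
    // Length of a string's modified UTF-8 serialization, excluding the
    // 2-byte length prefix that writeUTF emits first.
    public static int modifiedUtfLength(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            return buf.size() - 2;
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen with an in-memory stream
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "\uD834\uDD1E"; // U+1D11E as a surrogate pair
        // Modified UTF-8 encodes each surrogate separately, 3 bytes each: 6 total.
        System.out.println(modifiedUtfLength(s)); // 6
        // Standard UTF-8 encodes the code point directly: 4 bytes.
        System.out.println(s.getBytes("UTF-8").length); // 4
    }
}
```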
Re: Eudora 6.2 has been released
From: Peter Kirk [EMAIL PROTECTED] On the contrary, it is your mobile sync software which is of no use if communication with the outside world is required, if it doesn't support standards-conformant mail clients like Thunderbird, but only communicates in non-standardised ways with the products of a single company. Note that some PDAs come bundled with synchronization software for the PC that supports only Outlook (not even Outlook Express), along with a CDROM and licence to install Outlook (Toshiba PDAs, for example, running on Windows CE). The synchronization uses Microsoft ActiveSync, which supports Outlook local folders only (it does not work if you have anything other than a standard POP3 account or a private Exchange account configured, because it won't synchronize other types of Outlook folders, like HTTP folders on MSN or Hotmail... even though those are also Microsoft products). And there's nothing for Mac users. Well, you're free to buy other PDAs, or to buy and install other synchronization software for your PDA. Synchronization still lacks good standards, as do instant messaging and chat, or the management of personal calendars and contact lists... Maybe there's something in Sun's OpenOffice that can connect you to Windows CE PDAs or other types of PDAs?
Re: Unicode HTML, download
From: Edward H. Trager [EMAIL PROTECTED] Hi, Elaine, There is of course no limit to how many writing systems one can have on a Unicode-encoded HTML page. My recommendations would be to: (3) Use Cascading Style Sheet (CSS) classes to control display of fonts ... A better CSS class would additionally specify the font-family, for example, something like the SIL Ezra font (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsiid=EzraSIL_Home) (4) Since your readers may not have certain fonts, In the case of legally downloadable fonts like SIL Ezra, I would definitely put a link to the download site so readers can download the (Hebrew) fonts if they need them to view your page. Probably bad advice here: Elaine is speaking about a technical glossary, which would probably be written in modern Hebrew, for which there's not much complication with traditional accents. So any font suitable for modern Hebrew could be preferred by users and configured in their browser (on Windows XP, the default fonts provided are suitable: Arial, Tahoma, Times New Roman, David, David Transparent, Miriam, Miriam Transparent; with Office installed: Arial Unicode MS). Why force them to use SIL Ezra in the CSS stylesheet? At the very least you should tell Elaine to use a font-family with multiple font names, in order of preference, separated by commas, and surrounded by quotes if the font names are not single identifiers: <!DOCTYPE ...>
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>some title</title>
<style type="text/css"><!--
.he { font-family: "SIL Ezra", "Arial Unicode MS", David, Miriam, Tahoma, Arial, sans-serif; direction: rtl; }
.r { text-align: right; margin-right: 2em; }
//--></style>
</head><body>
<p class="he r">(some hebrew text goes here)</p>
</body></html>
(Note that, as in the above example, you can specify multiple class names separated by spaces in the class="" attribute, so it's possible to create style rules for localized font families that can be reused independently with other style classes. This is useful notably if a document displays multiple languages in a tabular format, where many attributes in one column should be set nearly identically to those in another column, differing only in the font families to use for each language/script.)
Re: Unicode HTML, download
From: E. Keown [EMAIL PROTECTED] Great idea! I code in the seldom-seen AHTML ('Archaic HTML'), as you all suspected. A friend tested a page I wrote last month and found it wouldn't work on any of his 5 browsers... oh well. Well, Elaine, if you want maximum compatibility, you would do better to use XHTML, which adds more restrictions than it adds features. It's old HTML which causes most troubles across distinct browsers, due to its ambiguities and differences of implementation (frames, table formats with non-zero cell spacing and cell padding, backgrounds, column widths in percentages specified by HTML width="x%" attributes instead of by CSS style="width: x%" attributes or stylesheet rules...). So: (1) enforce the XML rules: close all tags (notably <p>...</p> and <li>...</li> paragraphs, and the empty elements <br />, <img ... /> and <meta ... />), and make them all properly nested. (2) use only the standard subset of HTML elements and attributes. And make sure you don't include HTML block elements within HTML inline elements (for example, <font> elements surrounding <p> paragraphs...) (3) use simple CSS stylesheets, with only one rule per element or class. And don't overuse advanced CSS2 or CSS3 style features. Keep some tolerance for table column widths (make sure that font sizes can be reduced or increased for accessibility). (4) test your pages in IE 6, Firefox 1.0 (excellent!), Netscape 4 (old...), and on Mac Safari if you can: that should be enough to work well with most other browsers (Netscape 6+ should behave mostly like IE 6 and Firefox on Windows, as long as you don't need JavaScript).
Re: [even more increasingly OT-- into Sunday morning] Re: Unicode HTML, download
From: Christopher Fynn [EMAIL PROTECTED] I'd also like to figure out a way to trigger this kind of behavior in other browsers as well as in IE (using JavaScript or Java rather than VB) as not quite everyone uses IE - (but I guess you are not going to give me any more clues on how to do that :-) ) If only there were a portable way to determine in JavaScript that a string can be rendered with the existing fonts, or to enumerate the installed fonts and get some of their properties... we could prompt the user to install some fonts or change their browser settings, or we could auto-adapt the CSS style rules, notably the list of fonts inserted in the font-family: or abbreviated font: CSS properties... There are limited controls with the CSS @-rules that allow building virtual font names, but not enough to tune font selection by script or by code point range, and JavaScript is of little help to compensate. Certainly there's a need to include, in a refined standard DOM for styles, the properties needed to manage preferred font stacks associated with a virtual font name (for example, in a way similar to what Java2D in 1.5 allows), which could then be referenced directly within a legacy HTML <font face="virtualname"> element or in CSS font-family: virtualname properties (some virtual font names are standardized in HTML: serif, sans-serif, monospace; Java2D and AWT add dialog and dialoginput; but other virtual names could be defined as well, like decorated or handscript or ocr). The key issue here is to create documents that refer to font families according to their usage rather than their exact appearance and the limited set of languages and scripts they support. Another possibility would be to create a portable but easily tunable font format (XML-based, so that such fonts could be created or tuned by scripting through the DOM?) which would be a list of references to various external but actual fonts or glyph collections, plus parameters to allow selecting among them with various priorities.
For now this is not implemented in font technologies (OpenType, Graphite, ...) but within vendor-specific renderer APIs (which contain some rules to create such font mappings).
Re: Unicode HTML, download
From: Doug Ewell [EMAIL PROTECTED] The best advice for Elaine's situation becomes simpler. To maximize the likelihood that readers will see the right glyphs, add a font-family style line that lists a variety of available fonts, in decreasing order of coverage and attractiveness. My bad advice came from confusion between two SIL-related fonts: one with a legacy encoding (handled in browsers as if it were ISO-8859-1 encoded, so that you need to insert text in the HTML page using only code points in the Latin-1 page starting at U+0000, even though they do not represent the correct Unicode characters), and the other coded with Unicode (for which you need to encode your text with Hebrew code points...). But your advice, Doug, still won't work when multiple fonts in the font-family style use distinct encodings: mixing SIL Ezra with Arial or similar Unicode-encoded fonts will never produce the intended fallbacks if users don't have SIL Ezra effectively installed and selectable in their browser environment. Legacy-encoded fonts only contain a codepage/charset identifier (most often ISO-8859-1) and no character-to-glyph translation table; they also don't work properly with browsers configured for accessibility, where only the user-defined preferred fonts are allowed and fonts specified in HTML pages must be ignored by the browser, user styles having been set to higher priority (even if one uses the !important CSS style rule marker), unless the default font mapping associated with the codepage/charset identifier effectively corresponds to what would be found in a regular char-to-glyph mapping table in that font.
Re: Unicode HTML, download
From: Doug Ewell [EMAIL PROTECTED] Cryptically naming these two CSS classes .he and .heb, which provides no indication of which is the Unicode encoding and which is the Latin-1 hack, merely makes a bad suggestion worse. It was not cryptic: he was meant for Hebrew (generic, properly Unicode-encoded, suitable for any modern Hebrew), and heb for Biblical Hebrew, where a legacy encoding may still be needed in the absence of workable Unicode support for now; this won't be the same language, however, so a change of encoding may be justified. I was not advocating mixing encodings within the same text for the same language... But I was nearly sure that a technical jargon in Hebrew would probably not need Biblical Hebrew, except for illustration purposes within small delimited block quotes or spans, where there will be simultaneous changes of: - language level - needed character set, some characters not being encodable with Unicode - a needed change of encoding (from Unicode to the Latin-1 override hack) - specific font to render the legacy encoding. In that case, it is acceptable to have the general text in modern Hebrew properly coded with Unicode, even if the small illustrative quotes remain fully in a non-standard mapping and won't appear correctly without the necessary font. Note that PDF files DO mix encodings within the embedded fonts that PDF writers dynamically create for only the necessary glyphs. These encodings are specific to the document, for each embedded font... This is why PDF files can encode text that still has no Unicode character mappings. You can see this when you attempt to copy/paste text fragments from PDF files in sections using embedded fonts: the pasted text will not reproduce the same characters as what you see in the PDF reader; copy/pasting works, however, for PDF files using external fonts with standard mappings.
Re: [increasingly OT--but it's Saturday night] Re: Unicode HTML, download
From: E. Keown [EMAIL PROTECTED] Dear Doug Ewell, fantasai and List: I will try to sort out these diverse pieces of advice. What's the point, really, of going far beyond, even beyond CSS, into XHTML, where few computational Hebraists have gone before? You're right, Elaine, the web is full of non-XHTML-conforming documents. You probably don't need full XHTML conformance either, but having your document respect the XML nesting and closure of elements is certainly a must today, because it avoids most interoperability problems in browsers. So: make sure all your HTML elements and attributes are lowercase, and close ALL elements (even empty elements, which should be closed with " />" instead of just ">", for example <br /> instead of <br>; and even <li>...</li> or <p>...</p>). And then don't embed structural block elements (like <p>...</p> or <div>...</div> or <blockquote>...</blockquote> or <li>...</li> or <table>...</table>) within inline elements (like <b>...</b> or <font>...</font> or <a href=...>...</a> or <span>...</span>). Note that most inline elements are related to style, and their job is better done outside the body by assigning style classes to the structural elements (most of which are block elements). XHTML has deprecated most inline style elements in favor of external specification of style through the class attribute added to structural block elements. XHTML has excellent interoperability with a wide range of browsers, including old ones, except for the effective rendering of some CSS styles. The cost of converting an HTML file to full XML well-formedness is minor for you, and it allows you to use XML editors to make sure the document is properly nested, a pre-condition that will greatly help its interoperable interpretation. If you have FrontPage XP or 2003, you can use its apply XML formatting rules option to do this job nearly automatically and make sure that all elements are properly nested and closed.
Re: Ezra
From: Edward H. Trager [EMAIL PROTECTED] Are you saying the difference in names is SIL Ezra vs. Ezra SIL ? That's too confusing! You're not alone in being confused. I had completely forgotten the existence of two versions of the same font design. I may have just seen that it used PUAs, so I did not install it. (I did not remember that it used PUAs, and the wording of the sentence that introduced it in this discussion made me think that it was NOT using Unicode, and thus not PUAs, which are a Unicode thing; that's why I supposed it was using some legacy Latin-1 override or similar hacks found in some special-purpose fonts, or in legacy non-TrueType font formats, like PostScript mappings within a 0-based indexed vector or hashed dictionary of glyph names...)
Re: My Querry
From: Antoine Leca [EMAIL PROTECTED] I do not know what fully compatible means in such a context. For example, ASCII as designed allowed (please note I did not write was designed to allow) the use of the 8th bit as a parity bit when transmitted as an octet on a telecommunication line; I doubt such use is compatible with UTF-8. The parity bit is not data; it's a framing bit used for transport/link purposes only. ASCII is 7-bit only, so even if a parity bit is added (a parity bit can be added to 8-bit quantities as well...), it is not part of the effective data, because once the transport unit is received and checked, the bit has to be cleared (so an '@' character will effectively be equal to 64 in ASCII, not to 192, even if an even parity bit was added). By saying UTF-8 is fully compatible with ASCII, one means that any ASCII-only encoded file needs no re-encoding of its bytes to make it UTF-8. Note that this is only true for the US version of ASCII (well, ASCII normally designates only the last standard US variant of ISO 646; other standard national variants or proprietary variants of ISO 646 should not be named ASCII but, more accurately, for example, ISO 646-FR:1989, or without the ISO prefix if it is a proprietary charset and not an approved charset published in the ISO 646 standard).
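The framing arithmetic for the '@' example above can be sketched as follows (hypothetical helper names; the 7 data bits stay intact, the 8th bit carries even parity during transport, and the receiver must strip it after the check):

```java
public class ParityDemo {
    // Even-parity framing of a 7-bit ASCII code: the 8th bit is set so that
    // the total number of 1 bits in the octet is even.
    public static int frame(int asciiCode) {
        int parity = Integer.bitCount(asciiCode) & 1;
        return asciiCode | (parity << 7);
    }

    // The receiver checks parity, then clears the framing bit to recover
    // the character code itself.
    public static int unframe(int octet) {
        return octet & 0x7F;
    }

    public static void main(String[] args) {
        int framed = frame('@');             // '@' = 64 has one 1 bit (odd count)
        System.out.println(framed);          // 192: parity bit set on the wire
        System.out.println(unframe(framed)); // 64: back to the ASCII code
    }
}
```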
Re: Shift-JIS conversion.
You just need a mapping table from Unicode code points to Shift-JIS code positions, and a very simple parser to translate UTF-8 into Unicode code points. You'll find a mapping table in the Unicode UCD, on its FTP server. The UTF-8 form is fully documented in the Conformance section of the Unicode standard and requires no table to convert UTF-8 to 21-bit Unicode code points. There are existing tools that perform this for you, because they integrate both steps: - Java (international edition) has a Shift-JIS mapping to Unicode which is reversible. It is used with the Charset support in the java.io.* and java.nio.* packages and classes. You can even use the prebuilt tool native2ascii (from the Java SDK) to do it: native2ascii -encoding UTF-8 filename.UTF-8.txt | native2ascii -reverse -encoding Shift_JIS > filename.SHIFT-JIS.txt - GNU recode on Linux/Unix may do this for you too. - the open-sourced ICU offered by IBM has an API and supports mappings for lots of charsets. - Original Message - From: pragati To: [EMAIL PROTECTED] Sent: Thursday, November 25, 2004 6:00 AM Subject: Shift-JIS conversion. Hello, Can anyone please tell me how to convert from UTF-8 to Shift-JIS? Please let me know if there is any formula to do it other than using ready-made functions as provided by Perl, because these functions do not provide mappings for all characters. Warm Regards, Pragati Desai. Cybage Software Private Ltd. ph(0)-020-4044700 Extn: 302 mailto: [EMAIL PROTECTED]
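In Java the two steps described at the top of this post collapse into one decode/encode pair (a sketch relying on the JDK's built-in UTF-8 and Shift_JIS charset tables; the class name is hypothetical, and the sample text is arbitrary):

```java
import java.io.UnsupportedEncodingException;

public class Utf8ToShiftJis {
    // Decode the UTF-8 bytes into Java's internal UTF-16 String, then
    // re-encode with the JDK's Shift_JIS mapping table.
    public static byte[] convert(byte[] utf8) {
        try {
            return new String(utf8, "UTF-8").getBytes("Shift_JIS");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e); // both charsets ship with the JDK
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = "\u65E5\u672C\u8A9E";  // 日本語
        byte[] utf8 = text.getBytes("UTF-8"); // 9 bytes: 3 per kanji in UTF-8
        byte[] sjis = convert(utf8);          // 6 bytes: 2 per kanji in Shift_JIS
        System.out.println(utf8.length + " -> " + sjis.length); // 9 -> 6
        // Round-trip check: the JDK mapping is reversible for these characters.
        System.out.println(new String(sjis, "Shift_JIS").equals(text)); // true
    }
}
```

As Addison notes further down, characters outside the JIS X0208 repertoire have no Shift-JIS mapping at all; the JDK encoder replaces them with a substitution byte rather than failing, which is a behavior worth checking for in production code.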
Re: Misuse of 8th bit [Was: My Querry]
From: Antoine Leca [EMAIL PROTECTED] On Wednesday, November 24th, 2004 22:16Z Asmus Freytag wrote: I'm not seeing a lot in this thread that adds to the store of knowledge on this issue, but I see a number of statements that are easily misconstrued or misapplied, including the thoroughly discredited practice of storing information in the high bit when piping seven-bit data through eight-bit pathways. The problem with that approach, of course, is that the assumption that there were never going to be 8-bit data in these same pipes proved fatally wrong. Since I was the person who introduced this theme into the thread, I feel there is an important point that should be highlighted here. The widely discredited practice of storing information in the high bit is, like the Y2K problem, a bad consequence of past practices. The only difference is that we do not have a hard time limit for solving it. Whether an application chooses to use the 8th (or even 9th...) bit of a storage or memory or networking byte that also stores an ASCII-coded character as a zero, as an even or odd parity bit, or for some other purpose is the choice of the application. It does not change the fact that this extra bit (or bits) is not used to code the character itself. I see this usage as a data structure that *contains* (I don't say *is*) a character code. This is completely outside the scope of the ASCII encoding itself, which is concerned only with the codes assigned to characters, and only characters. In ASCII, as in all other ISO 646 charsets, code positions are ALL in the range 0 to 127. Nothing is defined outside of this range, exactly as Unicode does not define or mandate anything for code points larger than 0x10FFFF, whether they are stored or handled in memory with 21-, 24-, 32-, or 64-bit code units, more or less packed according to architecture or network framing constraints.
So the question of whether an application can or cannot use the extra bits is left to the application, and this has no influence on the standard charset encoding or on the encoding of Unicode itself. A good question to ask, then, is how to handle values of variables or instances that are supposed to contain a character code, but whose internal storage can fit values outside the designed range in the storage code unit. For me this is left to the application, but many applications will simply assume that such a datatype is made to accept a unique code per designated character. Using the extra storage bits for something else will break this legitimate assumption, so applications must be specially prepared to handle this case by filtering values before checking for character identity. Neither Unicode nor US-ASCII nor ISO 646 defines what an application can do there. The code positions or code points they define are *unique* only within their *definition domain*. If you use larger domains for values, nothing in Unicode or ISO 646 or ASCII defines how to interpret the value: these standards will NOT assume that the low-order bits can safely be used to index equivalence classes, because such equivalence classes cannot be defined strictly within the definition domain of these standards. So I see no valid rationale for requiring applications to clear the extra bits, or to leave the extra bits unaffected, or for forcing them to necessarily interpret the low-order bits as valid code points. We are outside the definition domain, so any larger domain is application-specific, and applications may as well use ASCII or Unicode within storage code units that add some offset, or multiply the standard codes by a constant, or apply a reordering transformation (permutation) on them and other possible non-character values.
When ASCII, and ISO 646 in general, define a charset with 128 unique code positions, they don't say how this information will be stored (an application may as well need to use 7 distinct bytes (or other structures...), not necessarily consecutive, to *represent* the unique codes that represent ASCII or ISO 646 characters), and they don't restrict the usage of these codes separately from any other independent information (such as parity bits, or anything else). Any storage structure that preserves the identity and equivalences of the original standard code in its definition domain is equally valid as a representation of the standard, but that structure is out of scope of the charset definition.
Re: Shift-JIS conversion.
- Original Message - From: Addison Phillips [wM] To: pragati ; [EMAIL PROTECTED] Sent: Thursday, November 25, 2004 6:21 PM Subject: RE: Shift-JIS conversion. Dear Pragati, You can write your own conversion, of course. The mapping tables of Unicode-to-SJIS are readily available. You should note that there are several vendor-specific variations in the mapping tables. Notably, Microsoft code page 932, which is often called Shift-JIS, has more characters in its character set than standard Shift-JIS (and it maps a few characters differently too...) The important fact that you should be aware of: Shift-JIS is an encoding of the JIS X0208 character set. UTF-8 is an encoding of the Unicode character set. More exactly, UTF-8 is an encoding of the ISO/IEC 10646 character set (the character set here designates the set of characters, i.e. the repertoire that describes characters with a name, a representative glyph and some annotations, to which a numeric code, the code point, is then assigned). Unicode by itself is not a character set, only an implementation of the ISO/IEC 10646 character set, in which the Unicode standard assigns additional properties and behavior to the characters allocated in ISO/IEC 10646. The link between Unicode and ISO/IEC 10646 is the assigned code point and character name, which are now common between the two standards. Of course the Unicode technical committee may propose new assignments to ISO/IEC, but it is still ISO/IEC 10646 which maintains the repertoire and approves or rejects the proposals. A new character proposal may be rejected by Unicode but accepted by ISO/IEC 10646, and it is the ISO/IEC 10646 vote that prevails (so Unicode will have to accept this ISO/IEC decision, even if it voted against it in a prior decision). Conversely, ISO/IEC 10646 says nothing about character properties or behaviors.
It can suggest, but the Unicode committee will make its own decisions about the character properties and behavior that it chooses to standardize. If Unicode wants its decisions widely accepted by all users of the ISO/IEC 10646 repertoire, it is in Unicode's interest to try to make these decisions in conformance with other existing national or international standards, to maximize the interoperability of national or international applications based on the ISO/IEC 10646 character set.
Re: Misuse of 8th bit [Was: My Querry]
From: Antoine Leca [EMAIL PROTECTED] On Thursday, November 25th, 2004 08:05Z Philippe Verdy wrote: In ASCII, or in all other ISO 646 charsets, code positions are ALL in the range 0 to 127. Nothing is defined outside of this range, exactly like Unicode does not define or mandate anything for code points larger than 0x10FFFF, should they be stored or handled in memory with 21-, 24-, 32-, or 64-bit code units, more or less packed according to architecture or network framing constraints. So the question of whether an application can or cannot use the extra bits is left to the application, and this has no influence on the standard charset encoding or on the encoding of Unicode itself. What you seem to miss here is that, given that computers are nowadays based on 8-bit units, there was a strong move in the '80s and '90s to _reserve_ ALL 8 bits of the octet for characters. And what A. Freytag was asking was precisely to avoid bringing up different ideas about possibilities of encoding other classes of information inside the 8th bit of an ASCII-based storage of a character. This is true, for example, in an API that just says that a char (or whatever datatype is used in some convenient language) contains an ASCII code or Unicode code point, and expects that the datatype instance will be equal to the ASCII code or Unicode code point. In that case, the assumption of such an API is that you can compare char instances for equality instead of comparing only the effective code points, and this greatly simplifies programming. So an API that says that a char will contain ASCII code positions should always assume that only the instance values 0 to 127 will be used; the same thing applies if an API says that an int contains a Unicode code point. The problem lies only in the usage of the same datatype to store something else as well (even if it's just a parity bit, or a bit forced to 1).
As long as this is not documented with the API itself, it should not be used, in order to preserve the rational assumption about the identity of chars and the identity of codes. So for me, a protocol that adds a parity bit to the ASCII code of a character does so on purpose, and this should be isolated in a documented part of its API. If the protocol wants to send this data to an API or interface that does not document this use, it should remove/clear the extra bit, to make sure that the character identity is preserved and interpreted correctly (I can't see how such a protocol implementation could expect that an '@' character coded as 192 will be correctly interpreted by the other, simpler interface, which expects that all '@' instances will be equal to 64...) In safe programming, any unused field in a storage unit should be given a mandatory default. As the simplest form that preserves code identity in ASCII, or code point identity in Unicode, is the one that uses 0 as this default, extra bits should be cleared. If not, anything can happen within the recipient of the character: - the recipient may interpret the value as something other than a character, behaving as if the character data were absent (so there will be data loss, in addition to unexpected behavior); bad practice, given that it is not documented in the recipient API or interface. - the recipient may interpret the value as another character, or may not recognize the expected character. This is not clearly a bad programming practice for recipients, because it is the simplest form of handling for them. However, the recipient will not behave the way expected by the sender, and it is the sender's fault, not the recipient's. - the recipient may take additional unexpected actions in addition to the normal handling of the character without the extra bits. This would be a bad programming practice of recipients, if this specific behavior is not documented, so senders should not need to care about it.
- the recipient may filter/ignore the value completely, resulting in data loss; this may sometimes be good practice, but only if this recipient behavior is documented. - the recipient may filter/ignore the extra bits (for example by masking); for me this is a bad programming practice for recipients... - the recipient may substitute another value for the incorrect one (such as a SUB ASCII control, or a U+FFFD Unicode replacement character to mark the presence of an error without changing the string length). - an exception may be raised (so the interface will fail) because the given value does not belong to the expected ASCII code range or Unicode code point range (the safest practice for recipients working under the design-by-contract model is to check the domain value range of all incoming data or parameters, to force senders to obey the contract). Don't expect blindly that any interface capable of accepting ASCII codes in 8-bit code units will also accept transparently all values outside of the restricted ASCII code range, unless this behavior
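The last, design-by-contract option can be sketched as a recipient-side guard (the class and method names are hypothetical; the point is to reject out-of-domain values rather than silently mask or reinterpret them):

```java
public class AsciiContract {
    // Design-by-contract recipient: values outside the 7-bit ASCII
    // definition domain are rejected, not masked or reinterpreted.
    public static char requireAscii(int value) {
        if (value < 0 || value > 127) {
            throw new IllegalArgumentException("not an ASCII code: " + value);
        }
        return (char) value;
    }

    public static void main(String[] args) {
        System.out.println(requireAscii(64)); // '@'
        try {
            requireAscii(192); // '@' with a stray parity bit left set
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```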
Re: Relationship between Unicode and 10646 (was: Re: Shift-JIS conversion.)
From: Doug Ewell [EMAIL PROTECTED] My impression is that Unicode and ISO/IEC 10646 are two distinct standards, administered respectively by UTC and ISO/IEC JTC1/SC2/WG2, which have pledged to work together to keep the standards perfectly aligned and interoperable, because it would be destructive to both standards to do otherwise. I don't think of it at all as the slave and master relationship Philippe describes. Probably not with the assumptions one usually attaches to master and slave, but it is still true that there can be only one standard body for the character repertoire, and one formal process for additions of new characters, even if two standard bodies are *working* (I don't say *deciding*) in cooperation. The alternative would have been for UTC and WG2 to each be allocated some code space for making the allocations they want, but with the risk of duplicate assignments. I really prefer to see the system as a master and slave relationship, because it gives a simpler view of how characters can be assigned in the common repertoire. For example, Unicode has no more rights than the national standardization bodies involved at ISO/IEC WG2. All of them make proposals, amend proposals, suggest modifications, or negotiate to create a final specification from the informal drafts. All I see in the Unicode standardization process is that it will finally approve a proposal, but Unicode cannot declare it standard until there has been a formal agreement at ISO/IEC WG2, which really rules the effective allocations in the common repertoire, even if most of the preparation work will have been heavily discussed within UTC, creating the finalized proposal with Unicode partners or with ISO/IEC members. At the same time, ISO/IEC WG2 will also study the proposals made by other standardization bodies, including the specifications prepared by other ISO working groups or by national standardization bodies.
Unicode is not the only approved source of proposals and specifications for ISO/IEC WG2 (and I tend to think that Unicode best represents the interests of private companies, whilst national bodies are most often better represented by their permanent membership at ISO, where they have full rights to vote on or veto proposals according to their national interests...). The Unicode standard itself agrees to obey ISO/IEC 10646 allocations in the repertoire (character names, representative glyphs, code points, and code blocks), but in exchange, ISO/IEC has agreed with Unicode not to decide about character properties or behavior (which are defined either by Unicode, or by national standards based on the ISO/IEC 10646 coded repertoire, for example the P.R. Chinese GB18030 standard, or by other ISO standards like ISO 646 and ISO 8859). So, even if the UTC decides to veto a proposal submitted by Unicode members, nothing prevents the same members from finding allies within national standard bodies, so that they submit the (modified) proposal to ISO/IEC 10646 instead of Unicode, which refused to transmit that proposal. I want to cite a recent example: the UTC decided to vote against the allocation of a new invisible character with the properties of a letter, zero width, and the same allowances for break opportunities as letters, considering that the existing NBSP was enough, even though it causes various complexities related to the normative properties of NBSP used as a base character for combining diacritics. This proposal (previously in informal discussion) has been rejected by UTC, but this leaves Indian and Israeli standards with complex problems for which Unicode proposes no easy solution. So nothing prevents India and Israel from reformulating the proposal at ISO/IEC WG2, which may then accept it, even if Unicode previously voted against it.
If ISO/IEC WG2 accepts the proposal, Unicode will have no choice but to accept it in the repertoire, and so to give the new character some correct properties. Such a proposal will be easily accepted by ISO/IEC WG2 if India and Israel demonstrate that the allocation allows making distinctions which are tricky, computationally difficult, or ambiguous to resolve when using NBSP. With a new distinct character, on the other hand, ISO/IEC 10646 members can demonstrate to Unicode that defining its Unicode properties is not difficult, and that it simplifies the problem of correctly representing complex cases found in large text corpora. Unicode may think that this is a duplicate allocation, because there will exist cases where two encodings are possible, but without the same difficulties for implementations of applications like full-text search, collation, or determination of break opportunities, notably in the many cases where the current Unicode rules already contradict the normative behavior of existing national standards (like ISCII in India). My opinion is that the
Re: CGJ , RLM
From: Mark Davis [EMAIL PROTECTED] I want to correct some misperceptions about CGJ; it should not be used for ligatures. True. CGJ is a combining character that extends the grapheme cluster started before it, but it does not imply any linking with the next grapheme cluster starting at a base character. So, even if one encodes A+CGJ+E, there will still be two distinct grapheme clusters, A+CGJ and E, and the exact role of the trailing CGJ in A+CGJ is probably just pollution, given that this CGJ has no influence on the collation order (so the sequence A+CGJ+E will collate like A+E), and it does not influence the rendering either. A correct ligature would be A+ZWJ+E, with the effect of creating three default grapheme clusters, which can be rendered as a single ligature, or as separate A and E glyphs if the ZWJ is ignored. For example, a ligature opportunity can be encoded explicitly in the French word efficace: ef+ZWJ+f+ZWJ+icace. Note however that ZWJ prohibits breaking, even though in French there is a possible hyphenation at the first occurrence, where there is also a syllable break, but not at the second occurrence, which falls in the middle of the second syllable. I don't know how one can encode an explicit ligature opportunity while also encoding the possibility of a hyphenation (where the sequence above would be rendered as if the first ZWJ had been replaced by a hyphen followed by a newline). To encode the hyphenation opportunity, normally I would use the SHY format control (soft hyphen): ef+SHY+fi+SHY+ca+SHY+ce. If I want to encode explicit ligatures for the ffi cluster when it is not hyphenated, I need to add ZWJ: ef+ZWJ+SHY+f+ZWJ+i+SHY+ca+SHY+ce (1). The problem is whether ZWJ will have the expected role of enabling a ligature if it is inserted between a letter and a SHY, instead of between the two ligated glyphs. In any case, the ligature should not be rendered if hyphenation does occur; otherwise the SHY should be ignored.
So two renderings are to be generated, depending on the presence or absence of the conditional syllable break:
- if the syllable break occurs, render as ef-+NL+f+ZWJ+icace, i.e. with a ligature only for the fi pair, but not for the ff pair and not even for the generated f+hyphen...
- if the syllable break does not occur, render as ef+ZWJ+f+ZWJ+icace, i.e. with the 3-letter ffi ligature...
I am not sure that the string coded as (1) above has the expected behavior, including for collation, where it should still collate like the unmarked word efficace...
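As a rough illustration of these two renderings, here is a Python sketch (my own, hypothetical) that expands one chosen SHY into a hyphen plus line break, drops the others, and suppresses a ZWJ adjacent to the break so no ligature request survives across it:

```python
ZWJ, SHY = "\u200D", "\u00AD"

def render(text, break_index=None):
    """Expand conditional break number `break_index` (0-based count of
    SHYs) into a hyphen plus newline, removing any ZWJ immediately
    before it (no ligature across a break); other SHYs are dropped."""
    out, seen = [], 0
    for ch in text:
        if ch == SHY:
            if seen == break_index:
                if out and out[-1] == ZWJ:
                    out.pop()          # suppress the ligature request
                out.append("-\n")      # visible hyphen + line break
            seen += 1
        else:
            out.append(ch)
    return "".join(out)

word = "ef" + ZWJ + SHY + "f" + ZWJ + "i" + SHY + "ca" + SHY + "ce"  # encoding (1)
assert render(word) == "ef" + ZWJ + "f" + ZWJ + "icace"  # no break: ligature controls kept
assert render(word, 0) == "ef-\nf" + ZWJ + "icace"       # break: only the fi ligature remains
```

This is only a model of the intended behavior; a real renderer would make the break decision during line layout, not from an explicit index.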
Re: CGJ , RLM
Which statements? My message is mostly to be read as a question, not as an affirmation... I also took the precaution of using terms like "not sure if..." or "I don't know if...", which mean that it's a problem for which I can't find easy solutions, i.e. the interaction of ligature opportunities and hyphenation (syllable break opportunities), and how a document can be prepared to allow both in renderers without breaking the semantics and collation of words in the document (notably if one wants to preserve the full-text search capabilities for such prepared documents)... - Original Message - From: Mark Davis [EMAIL PROTECTED] To: Philippe Verdy [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Friday, November 26, 2004 9:09 PM Subject: Re: CGJ , RLM The statements below are incorrect, but I don't have the time to correct them all.
Re: CGJ , RLM
From: Doug Ewell [EMAIL PROTECTED] Perhaps a better question to ask would be why you need to indicate both hyphenation points and ligation points in text that is going to be collated. Because one would want to: - prepare documents for correct rendering (including both ligatures and hyphenation capabilities, easily rendered in simple text browsers without any lexical analysis) and use such prepared documents as the preferred form for archiving, and then - have such a prepared corpus still usable for full-text searches...
Re: CGJ , RLM
From: Doug Ewell [EMAIL PROTECTED] Philippe Verdy verdy underscore p at wanadoo dot fr wrote: If I want to encode explicit ligatures for the ffi cluster, if it is not hyphenated, I need to add ZWJ: ef+ZWJ+SHY+f+ZWJ+i+SHY+ca+SHY+ce (1) Great Scott! You can use ZWJ to suggest a ligation opportunity, and SHY to suggest a hyphenation opportunity, but if you need to suggest both within the same word, let alone *between the same pair of letters*, you have probably stepped over the plain-text line. If encoding a ligation opportunity is not plain text, why then have it in Unicode? If a hyphenation opportunity is not plain text, why then have it in Unicode? Both exist in Unicode, and I don't think that they are considered not plain text. So why would you want to restrict their usage so that they may only be used separately? The ZWJ and SHY format controls for these two purposes are added deliberately when preparing documents for later rendering. They shouldn't affect the collation of text and will not change its semantics, and this transformation of text cannot be fully automated without complex lexical and linguistic knowledge. That's why they should be allowed in texts kept for archiving. If you later want to use those prepared texts in simpler renderers and parsers, you can still ignore and filter out the ZWJ and SHY very easily, so this preparation work, performed most often by typists, is normally reversible. Nobody is required to use them, but if one wants to do so for better rendering of prepared documents, why would Unicode forbid it? Was my question really so stupid?
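The "ignore and filter out" step mentioned above is indeed trivial; a minimal Python sketch, assuming only that ZWJ is U+200D and SHY is U+00AD:

```python
ZWJ, SHY = "\u200D", "\u00AD"

def searchable(text):
    """Strip the presentation-only format controls so the prepared
    text collates and searches like the unmarked spelling."""
    return text.translate({ord(ZWJ): None, ord(SHY): None})

prepared = "ef" + ZWJ + SHY + "f" + ZWJ + "i" + SHY + "ca" + SHY + "ce"
assert searchable(prepared) == "efficace"
```

This is also why the preparation is reversible in practice: the filtered string is exactly the original unmarked word.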
Re: No Invisible Character - NBSP at the start of a word
From: Jony Rosenne [EMAIL PROTECTED] One of the problems in this context is the phrase original meaning. What we have is a juxtaposition of two words, which is indicated by writing the letters of one with the vowels of the other. In many cases this does not cause much of a problem, because the vowels fit the letters, but sometimes they do not. Except for the most frequent cases, there is normally a note in the margin with the alternate letters - I hope everyone agrees that notes in the margin are not plain text. Are you drawing a parallel here with the annotations added on top of or below ideographs in Asian texts, using the ruby notation (for example in HTML), which may also be represented in plain-text Unicode with the interlinear annotation characters? Are you arguing that interlinear annotations are not plain text? If so, why were they introduced in Unicode? The notations in question are not merely presentation features; they have their own semantics, which merit being treated as plain text, because their structure also resembles a linguistic grammar, not far from the other common annotations also found in Latin text as phrases between parentheses or em dashes. Plain text has always been used to embed several linguistic levels, which are also often represented in the spoken language by variation of tonality. The content of these annotations is also plain text. The graphic representation itself is not that important; it is just there to easily demonstrate the relations that exist between one level of the written language and the annotation language level. If a text appears to mix these levels, there's no reason not to represent it. These annotations are present in the text; there must be a way to represent them in its encoding, even if it implies encoding mixed words belonging to different interpretation levels (such as Qere and Ketiv texts in Biblical Hebrew).
You are arguing against millennia of written language practices, too focused on common Latin usage, where many concessions to your intuitive model have already been integrated into Unicode (think about the various characters that have been added as symbols or special punctuation, or about other annotations added on top of Latin letters, such as mathematical arrows...). I see fewer problems with the correct representation of Ketiv and Qere annotations mixed within plain text, and rendered as supplementary letters on top of or around the core Hebrew letters, than with the representation conceded to the Latin script for various usages (including technical annotations or punctuation, or formatting controls...)
Re: (base as a combining char)
From: Addison Phillips [wM] [EMAIL PROTECTED] For example, Dutch sometimes treats the sequence ij as a single letter (it turns out that there are characters for the letter 'ij' in Unicode too, but they are for compatibility with an ancient non-Unicode character set). Software must be modified or tailored to provide behavior consistent with the specific language and context. Not sure about that: not all Dutch ij letter pairs are a single grapheme, so there are cases where the two letters must be treated as distinct and not as a single letter. For this reason, Dutch needs a distinct ij letter, coded as a single character and with its own capitalization rules (the uppercase or titlecase form of ij is the single letter IJ, not two letters and not Ij; also, there exist cases where diacritics can be added on top of the ij letter, which is then more tied together as a single letter than a simple digraph). This distinction is also often made visible in the typography, where the single-letter ij digraph is shown with the leg of the j kerned deeply below (and sometimes to the left of) the leading i, unlike cases where they are treated as two letters, where no kerning occurs (the 'i' is shown completely to the left of the bottom-left leg of the 'j'); it is even more evident in the uppercase style, where there will be the standard small distance between the I and J glyphs when they are two distinct letters, but the uppercase I may be drawn in the middle of the left leg of the J when they form one letter. Note the very near resemblance of the ij single letter to a y with diaeresis (so you'll also find Dutch texts that use y with diaeresis instead of the correct ij letter, notably in texts coded with legacy charsets). This distinction is also preserved in uppercase, where the missing IJ single letter appears encoded with Y with diaeresis...
These cases in Dutch where there's a distinction between the single-letter digraph and two letters are rare, so it is often acceptable to encode the digraph with two letters, without creating linguistic ambiguities (in most cases...), or with y with diaeresis/umlaut (which otherwise is not a letter used in Dutch). For me, your allusion to legacy charsets is about the deprecated use of y with diaeresis, not about the use of a distinct IJ letter, which is needed for Dutch and should be treated as distinct from the letter pair I then J.
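The capitalization argument can be checked directly against the Unicode case mappings of U+0132/U+0133, for instance with Python's standard string case methods:

```python
# The one-letter behavior of Dutch ij is carried by the Unicode case
# mappings of U+0132/U+0133, which a plain i+j pair cannot express.
assert "\u0133".upper() == "\u0132"                 # ij ligature -> IJ ligature
assert "\u0133sland".capitalize() == "\u0132sland"  # IJsland: digraph titlecased as one letter
assert "ijsland".capitalize() == "Ijsland"          # two letters: only the i is capitalized
```

(Note that `str.capitalize` titlecases the first character only since Python 3.8; earlier versions uppercased it, with the same result for this pair.)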
Re: Relationship between Unicode and 10646
From: Peter Kirk [EMAIL PROTECTED] I don't want to go along with Philippe entirely on this, but surely he must be right on this last point. Formally, Unicode is effectively the agent of just one national body in this decision-making process. To be honest, Peter, I never said that Unicode was a national body, because I know that there are several non-US governments that are full members of Unicode and vote at the UTC, and because I know that the official representative of the US in ISO is ANSI, not the Unicode Consortium. But it's true that the United States has several times delegated its official international representation to the Unicode Consortium, acting on behalf of the US government for some decisions or some limited domains (this is valid because Unicode is incorporated in the US, a necessary condition for representing the US government in international organizations); this is a private contractual arrangement between Unicode and the official US representative, but it does not change the rights of Unicode at ISO. So the true representative of the US in ISO (and also ITU) is certainly not Unicode, but ANSI, or any other US-incorporated organization that the US government chooses to represent it (other US private organizations have been given a US mandate for the management of some public resources or standards, like IANA, ARIN, ICANN, and IEEE, even though these organizations have also integrated some international voting members).
Re: CGJ , RLM
I'm not the one who proposed encoding an AE ligature with A+ZWJ+E. I only spoke about true typographical ligatures like ffi. I do know that AE or ae in French is better encoded with its distinct unique code, even if French considers this letter as two letters (which might otherwise justify the encoding A+ZWJ+E, for which no collation tailoring is needed). The current practice in most French texts is to use either the two separate vowels A+E or a+e, or to use the separate code point for the ae letter or the AE letter (and then use tailored collation to sort the ae single letter and AE single letter with a+e and A+E, i.e. between a+e and a+f). I've never seen any French text coded with A+ZWJ+E or a+ZWJ+e... The same remark applies to the French oe and OE ligatures (which, like the ae and AE ligatures, are orthographic, not typographical like ffi), which French also considers (i.e. collates, sorts) as two letters, o+e or O+E. - Original Message - From: Asmus Freytag [EMAIL PROTECTED] PS: since we have a perfectly fine AE as a character, there seems little gained in attempting a ligature. My suspicion would be that fonts would not provide the necessary mappings since the character code is available.
Re: (base as a combining char)
From: John Cowan [EMAIL PROTECTED] the need to encode Dutch ij as a single character, which is neither necessary nor practical. (U+0132 and U+0133 are encoded for compatibility only.) In cases where ij is a digraph in Dutch text, i+ZWNJ+j will be effective. I suppose you meant the rare cases in Dutch where ij is NOT a digraph for a single letter, for which i+ZWNJ+j could be effective... if only it were not opposed to the tradition (and many legacy encodings and keyboards) of generating U+0132 and U+0133, or a y/Y with diaeresis, when this is a digraph, considering that i+j in the other case is not a digraph but two distinct letters. An ambiguity will remain in Dutch for a long time, simply because ISO-8859-1 (U+0000 to U+00FF) is too often the only subset offered to Dutch typists, in which neither U+0132 and U+0133 nor ZWNJ are present (in that case, those who want the distinction often use a y with diaeresis for lowercase, and don't mark the difference for uppercase, which occurs much more rarely, as there's no uppercase Y with diaeresis in ISO-8859-1; Windows users can however use the uppercase Y with diaeresis, U+0178, to mark the single-letter digraph, because it is present in Windows codepage 1252 at code position 0x9F). I doubt we will one day see a ZWNJ key mapped on standard Dutch keyboards, given that most occurrences of the non-digraph two-letter i+j come from some imported (originally non-Dutch) rare words. (But Windows Notepad and some Windows text input components include a contextual menu to insert this formatting control...) The problem with ZWNJ is that it encodes just a typographic distinction, not the semantic one that Dutch users would expect: this means that it has no semantics itself, and its rendering is also optional.
Those who want a strong distinction will more likely use U+0132 and U+0133 in their word processors, assisted by Dutch lexical correctors, so that they will just need to enter i then j and let the word processor substitute the ij ligated letter for the two letters when appropriate, leaving other instances unchanged. As the ij ligated letter is almost certainly the most frequent case in Dutch text entry, it may be the default behavior of a Dutch input method, and the assisting dictionary will just need to list the rare cases where the substitution must not occur (the substitution will not occur within text sections marked as belonging to another language, and users can also cancel this automatic substitution with backspace in their word processor). Other, less capable word processors, without assisting dictionaries, may instead substitute U+0132/U+0133 for the occurrences of y/Y with diaeresis input by users (a solution which may be quite easy for Belgian and French users, who can easily make use of the diaeresis dead key, also useful for entering French text)... This means that modern word-processed documents will contain lots of U+0132/U+0133, clearly distinct from the other cases where i and j are left isolated; and ZWNJ will not be needed!
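A minimal sketch of such a dictionary-assisted substitution, with a hypothetical exception list (the listed words and the function name are illustrative only, not from any real Dutch corrector):

```python
# Hypothetical exceptions where i and j are two separate letters
# (e.g. a French loanword, or a compound whose parts meet at i|j).
IJ_EXCEPTIONS = {"bijoux", "minijurk"}

def substitute_ij(word):
    """Default to the single-letter digraph U+0133/U+0132; leave
    listed exceptions (and other-language text) unchanged."""
    if word.lower() in IJ_EXCEPTIONS:
        return word
    return word.replace("ij", "\u0133").replace("IJ", "\u0132")

assert substitute_ij("ijsvrij") == "\u0133svr\u0133"
assert substitute_ij("bijoux") == "bijoux"  # loanword: i and j stay separate
```

A real implementation would work the other way around, consulting a full lexicon rather than a small exception set, but the control flow is the one described above: substitute by default, block on dictionary hits.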
Re: Relationship between Unicode and 10646
From: Patrick Andries [EMAIL PROTECTED] Finally, I am no longer so sure that American companies still consider Unicode to be something strategic; it is mostly a matter of individual efforts by passionate technicians within those companies, enthusiasts who are still allowed to carry on, no doubt because it builds a good stock of multicultural goodwill. That is in fact what makes me doubt more and more the value of continuing to support Unicode, if it no longer even serves economic objectives judged useful by the only American members capable of supporting its development solely from the United States, while Unicode is not yet adequate for many other countries which, for their part, have economic imperatives to support their own languages. If there is not much left to do concerning the Latin or Cyrillic scripts, and if Chinese ideographs are now left to the management of the Ideographic Rapporteur working in the Far East, it might be worth considering that Unicode's development concerning African or Middle Eastern scripts take place in venues more appropriate than the United States, notably where decisions are concerned. Europe seems to offer more suitable meeting places for these scripts poorly supported by Unicode, whose decisions are based on distant reports, without serious economic involvement from the companies still participating (if they continue to support and pay their colleagues still engaged in this labor of love). It seems that many European, Middle Eastern, or African companies or organizations could participate more easily on the subject of the languages dear to them if these decision meetings were held in a more central place.
It is also a pity, in this age of virtual communications, that Unicode still insists on holding the final vote only in committees confined to the United States, as if electronic voting did not exist! This would not prevent holding discussion or arbitration meetings in various places, but Unicode and those who support it would save quite a bit of money by working in a less centralized way and by agreeing to delegate part of the work. It is symptomatic, for example, that half of Unicode's potential voters never use the online electronic resources (which nothing prevents from being organized according to Unicode's own administrative procedures), taking their decisions only on the basis of printed documents (expensive to produce and distribute) at conventions (also expensive to attend, because of travel and lodging costs and the extra working hours paid for this subject alone!), and that important documents can thereby escape their analysis...
Re: CGJ , RLM
From: Otto Stolz [EMAIL PROTECTED] Note that there is no algorithm to reliably derive the position of the syllable break from the spelling of a word. You could even concoct pairs of homographs that differ only in the position of the syllable break (and, consequently, in their respective meaning). So far, I have only found the somewhat silly example - Brief+SHY+lasche (letter flap) vs. - Brie+SHY+flasche (bottle to keep Brie cheese in), but I am sure I could find better examples if I tried in earnest. French hyphenation does not work reliably based only on orthographic rules. It works quite well, but with many exceptions that require using a hyphenation dictionary. I think this is true of almost all alphabet-based languages, and even of some languages written with so-called syllabic scripts, probably as a matter of style, where separate vocal syllables must not be broken when those breaks are not the best according to meaning (notably for compound words). The case of German is that there are many possible compound words, and breaks preferably occur between radical words rather than between syllables, with exceptions:
- due to other stylistic constraints, or
- on short particles that should better not be detached from their respective radical (but where do you best break the verbs hereinzugehen or simply zugehen?),
- also because not all verb particles are detachable, as some belong to the radical (many examples with the be particle or radical prefix).
Even if you allow hyphenation only between lexical units, there will exist some exceptions that can't be resolved without understanding the semantics. Such compound words with no separator are extremely rare in English, and very rare in French.
(French examples: there is a clear vocal syllable break in lionceau after -li- and before -on-, pronounced with separate vowels, but in million, no break occurs within -lion, which is a single syllable pronounced with a diphthong; neither of these examples is a compound word.) But hyphenation is still more desirable in German than word breaks (on spaces) alone, due to the average length of compound words, whose margin alignment may look ugly and be hard to read in narrow columns, as in newspapers or dictionaries. In Dutch, there is more freedom in the creation of compounds, which can often be written with or without a separator (a modern Dutch style prefers using separators, or not creating a compound at all, by separating words with spaces, but historically Dutch used the German style still in use today despite its possible semantic ambiguities). I think that a German writer who sees a possible ambiguity will often tolerate using an unconditional hyphen to create compound words (in your example, he would write Brief-Lasche or Brie-Flasche but not Brieflasche, whose interpretation is problematic because there's no easy way to determine it even with the funny semantics of the two alternatives; unless the author is sure that ligatures are correctly handled, with a ligature on fl for the interpretation as Brie-Flasche, and no ligature, and a narrow spacing, between f and l for the interpretation as Brief-Lasche).
(Historically, German texts were full of ligatures -- much more often than in other Latin-based written languages -- those ligatures now tending to disappear from most modern publications; with the German rule that a ligature should not occur between two syllables, and should be present within the same radical, it's easy to see how ligatures are part of the orthographic system and have a semantic value which helps the correct understanding of text, so it would be even more important to use ZWNJ or ZWJ in German words, rather than letting a renderer do this job automatically but inaccurately; for simplicity, I think that ZWNJ inserted between radicals to avoid their ligature would be easier to manage than ZWJ between two ligaturable letters that must be kept in the same syllable.)
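The point that break positions cannot be derived from spelling alone is exactly why any such scheme needs per-word data. A Python sketch with hypothetical dictionary entries shows both the mechanism and its limit (the Brief+lasche/Brie+flasche homographs share one spelling, so a spelling-keyed table cannot hold both):

```python
SHY = "\u00AD"

# Hypothetical per-word break table: positions cannot be derived from
# spelling, so each entry must be listed explicitly.
BREAKS = {"Blumentopf": [6], "zugehen": [2]}

def soft_hyphenate(word):
    """Insert a SHY at each listed break position, working right to
    left so earlier offsets stay valid."""
    for pos in sorted(BREAKS.get(word, []), reverse=True):
        word = word[:pos] + SHY + word[pos:]
    return word

assert soft_hyphenate("Blumentopf") == "Blumen\u00ADtopf"
# The ambiguous homograph is absent from the table, so it passes
# through unmarked; only the author can mark the intended break.
assert soft_hyphenate("Brieflasche") == "Brieflasche"
```

This mirrors the argument above: for the ambiguous cases, the SHY (or ZWNJ/ZWJ) must be placed by the author in the prepared text, not by an automatic tool.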
fl/fi ligature examples
From: Otto Stolz [EMAIL PROTECTED] Just because the st ligature is so uncommon (and the long s with its t ligature is almost extinct), I was looking for examples involving fl or fi. With ff: affable, baffe, biffer, Buffy, affriolant, effaroucher, effacer, ... With ffl: effleurer, baffle, affligeant, ... With fl: affleurer, flower, fleur, floral, floraison, inflation, déflation, flic, infliger... With ffi: traffic, efficace, effilocher, officier, affiche, affine, ... With fi: fi, fin, final, fil, fils, filature, filin, firme, firmament, aficionados, défi, figure... Many more examples of modern and widely used words (at least in English and French, but probably also in most Romance languages and other European languages including Roman Latin radicals)... Other widely used ligatures include st and ct: est, test, acte, octet...
Re: Ideograph?!?
From: Michael Norton (a.k.a. Flarn) [EMAIL PROTECTED] What's an ideograph? Also, what's a radical? Are they the same thing? Some radicals (in the Han script) may be ideographs, but most ideographs are not radicals: they often (not always) combine one or more radicals with one or more strokes that are not radicals themselves. Radicals in the Han script serve their classification and help users locate ideographs in dictionaries, which also consider the additional strokes (radicals are themselves made of a well-known number of strokes). Ideographs rarely represent a concept or word alone; most often they represent a single syllable. In Chinese, many words are short and consist of two syllables, and so are written with two ideographs. We should call these characters syllabographs instead of ideographs, but this may conflict with the concept of syllabaries, which are much simpler, unlike Han ideographs, each of which can represent a very complex syllable (with diphthongs, multiple consonants, and distinctive tones), and sometimes (in fact rarely) a concept or word (which may be spelled with more than one syllable, depending on local dialects). Many words are created from two ideographs where the concept behind each ideograph is unrelated, or sometimes very distant from, the meaning of the whole word. In that case, the pair of ideographs is chosen mostly because the concepts are pronounced similarly in some dialect of Chinese (sometimes old dialects), and so they can be read phonetically (for example, Beijing is written with the two ideographs for bei and jing, but you may wonder why bei and jing were used, which concepts they represent, and their relation to the name of the city...). For these reasons, some linguists prefer to speak of sinographs (a reference to Chinese), or sometimes pictographs (because of their visual form, rather than their meaning)...
Re: Keyboard Cursor Keys
From: Peter R. Mueller-Roemer [EMAIL PROTECTED] Doug Ewell wrote: Robert Finch wrote: I'm trying to implement a Unicode keyboard device, and I'd rather have keyboard processing dealing with genuine Unicode characters for the cursor keys, rather than having to use a mix of keyboard scan codes and Unicode characters. This will quickly spiral out of control as you move past the easy cases like adding character codes for cursor control functions. The easy cases like adding character codes for cursor control functions are not so easy when you have a short phrase of R-text (right-to-left) embedded in a line of English (L-text). (...) This is not related to Robert's concern about why he has to use a mix of scan codes and Unicode characters in a keyboard driver. Effectively, scan codes are not characters, but this is how a keyboard communicates with the OS, before the OS translates these scan codes into characters according to a keyboard map. When there's no plain-text character associated with a key, the keyboard map will not map characters, but will leave these scan codes mostly intact (in fact Windows drivers translate some physical scan codes to logical scan codes, to unify some functions which have various positions depending on the keyboard but should be treated as equivalent). With MSKLC, you won't notice these changes in the generated customized table, because this translation is performed either in the BIOS, or in the keyboard hardware, or in a default scan-code map within the generic keyboard driver, so that the effective keyboard mapping is from the pair of a virtual (translated) scan code and a keyboard mode, to characters. The virtual scan codes are simpler, and also hide some details, such as the special byte 0x00 that can prefix some extension keys before their actual scan code.
Also, the physical scancodes are defined on 7 bits only, the 8th bit corresponding to keypress or keyrelease status; keyboards may generate multiple keypress scancodes at regular intervals as long as the key is held down (the rate of this auto-repetition is not specified in the keyboard mapping, but by an external setting, depending on user preferences, sent to the running keyboard driver, which may pass this rate to the hardware as a configuration command). Several scancode translations are performed in the generic keyboard driver, such as recognizing the AltGr key of European keyboards as equivalent to Ctrl+Alt (if this is enabled by a flag set in the custom keyboard map), or the translation of Alt+keys on the numeric pad to compose either local OEM characters or local ANSI characters. Note however that the generic keyboard driver included with MSKLC has no support for composing characters by their Unicode hexadecimal code point; for such a thing, you need custom driver code, not just a simple mapping table. The same applies if you want to support more complex input modes (for example with Asian character sets) that can't be represented easily with a simple table of pairs combining a current state mode and a logical scancode. Keyboard drivers also contain several other hardware-specific commands to set advanced features in keyboards; MSKLC will not let you program them, but the generic driver it contains supports several standard features, enabled with a physical keyboard interface driver. What remains for the application is a set of keypress/keyrelease events with a virtual (translated) scancode, which an application may trap if it does not want these events to be translated through the keyboard mapping in the generic driver. The generic driver then intercepts the untrapped events and translates them to characters according to the key mapping table you have created in MSKLC.
Most applications will then be interested only in trapping some function keys that never generate a character in the mapping table, and will leave the other virtual scancodes (the VK_* key codes) to be translated by the installed local keymap, which will generate character events. The form of the generated character events depends on the intercepting application: if the application with the keyboard focus is Unicode-aware, it will wait for Unicode char events, and the keyboard mapping will then send these characters; if the application with the keyboard focus is waiting for characters using some legacy interface emulation (BIOS, DOS, or Windows ANSI), the mapping will still be used, but the Unicode characters in the keymap will first be transcoded into the appropriate charset. If a Unicode character can't be converted into the appropriate charset, the application will receive an error event, which by default takes the form of a sound event, or the generation of a default character, or sometimes a combination of both. Such events won't occur for virtual keys that have no mapping to a Unicode character in the localized keymap. Note: virtual keys are what allow many models of keyboard to be unified.
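The mapping stage described above, from a (virtual scancode, keyboard mode) pair to a character, can be sketched as a plain lookup table. This is an illustrative Python sketch, not Windows' actual data structures: the VK codes and the AltGr assignment below are hypothetical examples.

```python
# Hypothetical sketch of a keyboard layout table: maps a
# (virtual key code, modifier state) pair to a Unicode character.
# The VK codes and the AltGr+E -> euro mapping are illustrative only
# (though AltGr+E does produce the euro sign on many European layouts).
VK_A, VK_E = 0x41, 0x45

LAYOUT = {
    (VK_A, frozenset()):          'a',
    (VK_A, frozenset({'shift'})): 'A',
    (VK_E, frozenset()):          'e',
    (VK_E, frozenset({'altgr'})): '\u20ac',  # euro sign
}

def translate(vk, modifiers):
    """Return the character for a key event, or None if the key has no
    character mapping (the application may then trap the raw event)."""
    return LAYOUT.get((vk, frozenset(modifiers)))

print(translate(VK_E, {'altgr'}))  # €
```

A real driver table also carries dead-key states and ligature entries, but the core lookup has this shape: keys with no entry fall through as untranslated virtual-key events.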
Re: Relationship between Unicode and 10646
From: Peter Kirk [EMAIL PROTECTED] On 30/11/2004 19:53, John Cowan wrote: Your main misunderstanding seems to be your belief that WG2 is a democratic body; that is, that it makes decisions by majority vote. ... Thank you, John. This was in fact my question: will the amendment be passed automatically if there is a majority in favour, or does it go back for further discussion until a consensus is reached? You have clarified that the latter is true. And I am glad to hear it. WG2 will probably now consider alternatives to examine how Phoenician can be represented. The current proposal may be voted down for reasons other than a formal opposition to the idea of encoding it as a separate script: possibly because the proposal is still incomplete, or does not resolve significant issues, or does not make Phoenician texts easier to work with on computers... There may be arguments caused by the difficulties of treating several variations of Phoenician, or possibly a misrepresentation of what the new script is supposed to cover (given that Phoenician is itself at the connecting node of separate scripts, and may cause specific difficulties when some variations occur in the direction of the later Greek, Hebrew, or Arabic scripts). If the script itself is not well delimited, there's no reason to encode it; it is preferable to approach it from one of the existing branches. How the various branches converge to the original script may raise lots of unresolved questions, and other more complex problems if Phoenician is not the root of the tree and has other predecessors. So maybe it's too soon to encode Phoenician now, given that its immediate predecessors are still not encoded, and a formal model for them is still missing.
In addition, there may already be several alternatives for its representation, with strong and antagonistic arguments from either Hellenists or Semitists, who have adopted distinct models for the same source texts, based on the models they have established for its successors. So there's possibly a need to reconcile (unify) these models, even if this requires encoding some well-identified letters with distinct codes, depending on their later semantic evolutions, or on the set of variants they should cover. My opinion is that Semitists are satisfied today when handling Phoenician text as if it were a historic variant of Hebrew, and Hellenists are satisfied handling it as if it were a historic variant of Greek (which itself could be written alternatively RTL, LTR, or boustrophedon). A way to reconcile those approaches could be a transliteration scheme. So until such a working transliteration scheme is created, specifying the matching rules, it may be hard to define prematurely the set of letters needed for representing Phoenician texts. My view does not exclude a future encoding of Phoenician, to avoid constant transliterations of the same texts, but for now that need is not justified, and not urgent. In the interim, fonts can be built for Phoenician according to the encoding of Hebrew, or according to the encoding of Greek, and this can fit with the respective work performed by the two categories of researchers. If both later agree on the same set of base letters and variants, they could create a more definitive set of representative letters and variants, and formulate a future proposal for a separate script encoding, from which an easy transliteration from legacy Hebrew or Greek encodings would be possible. What do you think of this answer?
Re: Nicest UTF
There's no *universal* best encoding. UTF-8, however, is certainly today the best encoding for portable communications and data storage (but it competes now with SCSU, which uses a compressed form where, on average, each Unicode character is represented by one byte in most documents; other schemes also exist that apply deflate compression on top of UTF-8). The problem with UTF-16 and UTF-32 is byte ordering, where byte is meant in terms of portable networking and file storage, i.e. 8 bits in almost all current technologies. With UTF-16 and UTF-32, you need a way to determine how bytes are ordered in each code unit as read from a byte-oriented stream. You don't with UTF-8. The problem with UTF-8 is that it will most often be inefficient or awkward to work with inside applications and libraries, which find it easier to access strings and count characters using fixed-width code units. Although UTF-16 is not strictly fixed-width, it is quite easy to work with, and is often more efficient than UTF-32 due to memory allocations. UTF-32, however, is the easiest solution when applications really want to handle each possible character, encoded as one Unicode code point, with a single code unit. All UTF encodings (including the SCSU compressed encoding, or BOCU-1, another compact Unicode encoding, or also the Chinese GB18030 standard, which is now a valid representation of Unicode) have their pros and cons. Choose among them because they are widely documented, and offer good interoperability across the many libraries handling them with similar semantics. If you are not satisfied in your application by these encodings, you may even create your own (like Sun did when modifying UTF-8 to allow representing any Unicode string within a null-terminated C string, and also to allow any sequence of 16-bit code units, even invalid ones with unpaired surrogates, to be represented on 8-bit streams).
If you do that, don't expect this encoding to be easily portable and recognized by other systems, unless you document it with a complete specification and make it available for free alternate implementations by others. - Original Message - From: Arcane Jill [EMAIL PROTECTED] To: Unicode [EMAIL PROTECTED] Sent: Thursday, December 02, 2004 2:19 PM Subject: RE: Nicest UTF Oh for a chip with 21-bit wide registers! :-) Jill -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Antoine Leca Sent: 02 December 2004 12:12 To: Unicode Mailing List Subject: Re: Nicest UTF There are other factors that might influence your choice. For example, the relative cost of using 16-bit entities: on a Pentium it is cheap, on more modern x86 processors the price is a bit higher, and on some RISC chips it is prohibitive (that is, short may become 32 bits; obviously, in such a case, UTF-16 is not really a good choice). On the other extreme, you have processors where bytes are 16 bits; obviously again, UTF-8 is not optimum there. ;-)
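The byte-ordering point above is easy to see concretely (a small Python sketch):

```python
# UTF-16/UTF-32 serializations depend on byte order; UTF-8 does not.
text = 'A'  # U+0041

print(text.encode('utf-16-le'))  # b'A\x00'  little-endian code unit
print(text.encode('utf-16-be'))  # b'\x00A'  big-endian code unit
# The plain 'utf-16' codec prepends a BOM (U+FEFF) so a reader can
# detect the order; which order is used depends on the platform.
print(text.encode('utf-16'))
print(text.encode('utf-8'))      # b'A'      no byte-order ambiguity
```

The same two bytes 0x00 0x41 mean U+0041 in big-endian order but U+4100 in little-endian order, which is exactly why a UTF-16 byte stream needs a BOM or an out-of-band declaration, while a UTF-8 stream needs neither.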
Re: Nicest UTF
If you need immutable strings that take as little space as possible in memory for your running app, then consider using SCSU for the internal storage of the string object, then have a method return an indexed array of code points, or a UTF-32 string when you need to mutate the string object into another. SCSU is excellent for immutable strings, and is a *very* tiny overhead above ISO-8859-1 (note that the conversion from ISO-8859-1 to SCSU is extremely trivial, maybe even simpler than to UTF-8!) From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] For internals of my language Kogut I've chosen a mixture of ISO-8859-1 and UTF-32. Normalized, i.e. a string whose characters all fit in narrow characters is always stored in the narrow form. I've chosen representations with fixed-size code points because nothing beats the simplicity of accessing characters by index, and the most natural thing to index by is a code point. Strings are immutable, so there is no need to upgrade or downgrade a string in place, so having two representations doesn't hurt that much. Since the majority of strings are ASCII, using UTF-32 for everything would be wasteful. Mutable and resizable character arrays use UTF-32 only.
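Marcin's two-representation scheme can be sketched as follows. This is an illustrative Python sketch of the idea (Kogut's actual implementation lives in its own runtime); the `make_string`/`char_at` names are invented for the example:

```python
def make_string(text):
    """Store a string narrow (1 byte/char) when every code point fits
    in ISO-8859-1, otherwise wide (UTF-32, 4 bytes/code point)."""
    if all(ord(c) < 0x100 for c in text):
        return ('narrow', text.encode('latin-1'))
    return ('wide', text.encode('utf-32-le'))

def char_at(s, i):
    """O(1) indexing by code point in either representation."""
    kind, buf = s
    if kind == 'narrow':
        return chr(buf[i])
    return chr(int.from_bytes(buf[4*i:4*i+4], 'little'))

s1 = make_string('hello')          # narrow: 5 bytes of storage
s2 = make_string('h\U00010400')    # wide: contains a non-Latin-1 code point
```

Because strings are immutable, the representation is fixed at construction time and never needs to be upgraded in place, which is what makes the dual scheme cheap.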
Re: Nicest UTF
From: Doug Ewell [EMAIL PROTECTED] I appreciate Philippe's support of SCSU, but I don't think *even I* would recommend it as an internal storage format. The effort to encode and decode it, while by no means Herculean as often perceived, is not trivial once you step outside Latin-1. I said: for immutable strings, which means that these strings are instantiated for the long term, and for multiple reuses. In that sense, what is really significant is the decoding, not the effort to encode (which is minimal for ISO-8859-1 encoded source texts, or Unicode UTF-encoded texts that only use characters from the first page). Decoding SCSU is very straightforward, even if it is stateful (at the internal character level). But for immutable strings, there's no need to handle various initial states, and the states associated with each component character of the string have no importance (strings being immutable, only the decoding of the string as a whole makes sense). The stateful decoding of SCSU can be part of an accessor in a storage class, which can also be optimized easily to avoid multiple reallocations of the decoded buffer. SCSU can only be a complication if you want mutable strings; however mutable strings are needed only if you intend to transform a source text and work on its content. If this is a temporary need to create other immutable strings, you can still use SCSU for encoding the final results, and work with UTFs for intermediate results. In a text editor, where you'll constantly need to work at the character level, the text is not immutable, and SCSU is effectively not a good encoding for working on it (but all UTFs, including UTF-8 or GB18030, are easy to work with at this level). In practice, a text editor often needs to split the edited text into manageable fragments encoded separately, for performance reasons (as text insertion and deletion in a large buffer is a lengthy and costly operation).
Given that UTFs can increase the memory needed, it is not completely stupid to think about using a compression scheme for individual fragments of a large text file; the cost of encoding/decoding SCSU, if it limits the number of VM swaps to disk needed to access more fragments, can be an interesting optimization, as the total size on disk will be smaller, reducing the number of I/O operations, and so enhancing the program's responsiveness to user commands. (Note that there already exist applications of such compression schemes even within filesystems that support editable but still compressed files... SCSU is not the option used in that case, because it is too specific to Unicode texts; they use a much more complex compression scheme, most often derived from Lempel-Ziv-Welch compression algorithms, and this does not significantly increase the total load time, given that it also significantly reduces the frequency of disk I/O, which is a much longer and more costly operation...) The bad thing about SCSU is that the compression scheme is not deterministic: you can't easily compare two instances of strings encoded with SCSU (because several alternative encodings are possible) without actually decoding them prior to performing their collation (with standard UTFs, including the Chinese GB18030 standard, the encoding is deterministic and allows comparing encoded strings without first decoding them). But this argument is also true for almost all compression schemes, even for the well-known deflate algorithm, for very basic compressors like RLE, or for the newer bzip2 compression (depending on the compressor implementation used, some tunable parameters, and the number of alternatives and size of internal dictionaries considered during the compression).
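The same non-determinism is easy to observe with a generic compressor (a Python sketch using zlib, since the standard library has no SCSU codec; deflate stands in here for any encoder with multiple valid outputs):

```python
import zlib

data = ('Unicode text, Unicode text, Unicode text! ' * 50).encode('utf-8')

stored = zlib.compress(data, level=0)  # level 0: stored (uncompressed) blocks
packed = zlib.compress(data, level=9)  # level 9: maximum compression

# Two perfectly valid encodings of the same text differ byte-for-byte...
assert stored != packed
# ...so encoded forms cannot be compared directly: both must be decoded
# before collation, exactly the drawback described for SCSU above.
assert zlib.decompress(stored) == zlib.decompress(packed) == data
```

With a deterministic UTF, by contrast, equal strings always have equal byte sequences, so binary comparison of the encoded forms is meaningful.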
The advantage of SCSU over generic data compressors like deflate is that it does not require a large and complex state (all the SCSU decoding states are managed with a very limited number of fixed-size variables), so its decompression can be easily hardcoded and heavily optimized, up to a point where the cost of decompression will be nearly invisible to almost all applications: the most significant costs will most often be within collators or text parsers; a compliant UCA collation algorithm is much more complex to implement and optimize than a SCSU decompressor, and it is more CPU- and resource-intensive.
Re: Nicest UTF
RE: Nicest UTF. From: Lars Kristan I agree. But not for reasons you mentioned. There is one other important advantage: UTF-8 is stored in a way that permits storing invalid sequences. I will need to elaborate that, of course. Not true for UTF-8. UTF-8 can only store valid sequences of code points, in the valid range from U+0000 to U+D7FF and U+E000 to U+10FFFF (so excluding surrogate code points). But it's true that there are non-standard extensions of UTF-8 (such as Sun's one for Java) that allow escaping some byte values normally generated by standard UTF-8 (notably the single byte 0x00 representing U+0000), or that allow representing isolated or incorrectly paired surrogate code points which may be present in a normally invalid Unicode string, or that allow representing non-BMP characters with 6 bytes, where each group of 3 bytes represents a surrogate code unit (not a code point!). Only the CESU-8 variant of UTF-8 is documented and standardized (where non-BMP characters are represented by encoding, as two groups of 3 bytes, the two surrogate code units that would be used in UTF-16 to represent the same character). CESU-8 is less efficient than UTF-8, but even in that case it does not allow representing invalid Unicode strings containing surrogate *code points* which are not characters (I did not say *code units*), even if they are apparently correctly paired (the concept of paired surrogates only exists within the UTF-16 encoding scheme, which represents strings not as streams of characters coded with code points, but as streams of 16-bit code units). If you need extensions like this, you do because you need to represent data which is not valid Unicode text. Such an extended scheme is not a UTF, but a serialization format for this type of data (even if this type can represent all instances of valid Unicode text).
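The difference between UTF-8 and CESU-8 for a non-BMP character can be shown directly. This Python sketch builds the CESU-8 form by hand from the UTF-16 code units (there is no CESU-8 codec in the standard library; the `surrogatepass` error handler lets us encode each surrogate as a 3-byte sequence):

```python
# U+10400 (DESERET CAPITAL LETTER LONG I): one code point outside the BMP.
ch = '\U00010400'

# Standard UTF-8: a single 4-byte sequence for the code point.
utf8 = ch.encode('utf-8')            # b'\xf0\x90\x90\x80'

# CESU-8: take the UTF-16 code units (the surrogate pair D801 DC00)
# and encode each 16-bit unit as a 3-byte UTF-8-style sequence.
units = ch.encode('utf-16-be')
cesu8 = b''.join(
    chr(int.from_bytes(units[i:i+2], 'big')).encode('utf-8', 'surrogatepass')
    for i in range(0, len(units), 2)
)
# cesu8 == b'\xed\xa0\x81\xed\xb0\x80': 6 bytes instead of 4.
```

This makes the inefficiency concrete: every character outside the BMP costs 6 bytes in CESU-8 against 4 in UTF-8, and the 0xED lead bytes are exactly the sequences that a strict UTF-8 decoder must reject.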
Re: OpenType vs TrueType (was current version of unicode-font)
From: Gary P. Grosso [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Friday, December 03, 2004 5:10 PM Subject: RE: OpenType vs TrueType (was current version of unicode-font) Hi Antoine, others, Questions about OpenType vs TrueType come up often in my work, so perhaps the list will suffer a couple of questions in that regard. First, I see an O icon, not an OT icon in Windows' Fonts folder for some fonts and a TT icon for others. Nothing looks like OT to me, so are we talking about the same thing? See www.opentype.org: OpenType is a trademark of Microsoft Corporation (bottom of page). The hand-drawn O is a logo used by Microsoft as the icon representing OpenType fonts. However, the OpenType web site is apparently reduced to this single presentation page, with one link to the MonoType Corporation, and none to the previous documentation hosted by Microsoft. Has Microsoft stopped supporting OpenType, and is it about to sell the technology to the MonoType font foundry?
Re: Nicest UTF
From: Asmus Freytag [EMAIL PROTECTED] A simplistic model of the 'cost' for UTF-16 over UTF-32 would consider 1) 1 extra test per character (to see whether it's a surrogate) 2) special handling every 100 to 1000 characters (say 10 instructions) 3) additional cost of accessing 16-bit registers (per character) 4) reduction in cache misses (each the equivalent of many instructions) 5) reduction in disk access (each the equivalent of many many instructions) (...) For 4 and 5, the multiplier is somewhere in the 100s or 1000s for each occurrence, depending on the architecture. Their relative weight depends not only on cache sizes, but also on how many other instructions per character are performed. For text scanning operations, their cost does predominate with large data sets. I tend to disagree with you on points 4 and 5: cache misses and disk accesses (more commonly referred to as data locality in computing performance) really favor UTF-16 over UTF-32, simply because UTF-16 will be more compact for almost every text you need to process, unless you are working on texts that only contain characters from a script *not present at all* in the BMP (this excludes Han, even though there are tons of ideographs out of the BMP, because these ideographs are almost never used alone, but appear sparsely among tons of other conventional Han characters in the BMP). Given that these scripts are all historic ones, or were encoded for technical purposes with very specific usage, a very large majority of texts will not use significant numbers of characters out of the BMP, so the use of surrogates in UTF-16 will remain a minority. In all cases, even for texts made only of characters out of the BMP, UTF-16 can't be larger than UTF-32.
The only case where it would be worse than UTF-32 is for the internal representation of strings in memory, where 16-bit code units can't be represented with 16 bits only, for example if memory cells are not individually addressable below units of at least 32 bits, and the CPU architecture is very inefficient when working with 16-bit bitfields within 32-bit memory units or registers, due to the extra shift and mask operations needed to pack and unpack 16-bit bitfields in a single 32-bit memory cell. I doubt that such an architecture would be very successful, given that too many standard protocols depend on being able to work with datastreams made of 8-bit bytes: with such an architecture, all data I/O would need to store 8-bit bytes in separate but addressable 32-bit memory cells, which would really be a poor use of available central memory (such an architecture would require much more RAM to reach equivalent performance for data I/O, and even the very costly fast RAM caches would need to be increased a lot, meaning higher hardware construction costs). So even on such 32-bit-only (or 64-bit-only...) architectures (where, for example, the C datatype char would be 32-bit or 64-bit), there would be efficient instructions in the CPU for packing/unpacking bytes in 32-bit (or 64-bit) memory cells (or at least at the register level, with instructions for working efficiently with such bitfields).
Re: Nicest UTF
From: Theo [EMAIL PROTECTED] From: Asmus Freytag [EMAIL PROTECTED] So, despite it being UTF-8 case insensitive, it was totally blastingly fast. (One person reported counting words at 1MB/second of pure text, from within a mixed Basic / C environment). You'll need to keep in mind that the counter must look up thousands of words (every single word it has come across in the text), on every single word lookup. Anyhow, from my experience, UTF-8 is great for speed and RAM. Probably true for English or most Western European Latin-based languages (plus Greek and Coptic). For other languages that still use lots of characters in the range U+0000 to U+03FF (C0 and C1 controls, Basic Latin, Latin-1 Supplement, Latin Extended-A and -B, IPA Extensions, Spacing Modifier Letters, Combining Diacritical Marks, Greek and Coptic), UTF-8 and UTF-16 may be nearly as efficient. For all others, which need lots of characters outside the range U+0000 to U+03FF (Cyrillic, Armenian, Hebrew, Arabic, and all Asian or Native-American or African scripts, or even PUAs), UTF-16 is better (more compact in memory, so faster). UTF-32 will be better only for historic texts written nearly completely with characters out of the BMP (for now, only Old Italic, Gothic, Ugaritic, Deseret, Shavian, Osmanya, Cypriot Syllabary), if C0 controls (such as TAB, CR and LF), or ASCII SPACE, or NBSP are a minority.
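These size trade-offs are easy to measure directly (a Python sketch over a few sample strings):

```python
samples = {
    'ASCII':    'hello world',
    'Cyrillic': 'пример текста',
    'Han':      '中文文本',
    'Deseret':  '\U00010400\U00010401',  # entirely outside the BMP
}

for name, text in samples.items():
    sizes = {enc: len(text.encode(enc))
             for enc in ('utf-8', 'utf-16-le', 'utf-32-le')}
    print(name, sizes)

# ASCII strongly favors UTF-8 (1 byte vs 2 vs 4 per character);
# Han favors UTF-16 (2 bytes vs 3 in UTF-8); and even for pure
# non-BMP text, UTF-16 (two surrogates, 4 bytes) matches UTF-32,
# so UTF-16 is never larger than UTF-32.
```

This matches the claims in the thread: which UTF is "more compact" depends almost entirely on which blocks the text draws from, with the BMP boundary as the decisive line.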
Re: OpenType vs TrueType (was current version of unicode-font)
From: Peter Constable [EMAIL PROTECTED] Why would you think the creation of this site might suggest that Microsoft is selling off its IP in relation to OpenType to Monotype? If Motorola created a site www.pentium4.org, would you jump to the conclusion that they were selling off that IP? What alarmed me is that this domain previously referenced Microsoft's documentation. Also the fact that MonoType was sold by Agfa, with its name changed. Also the fact that Microsoft's presentation of OpenType (previously TrueType Open, previously TrueType) has removed the reference to Apple's contributions to TrueType, leaving only Microsoft as the owner of the trademark and technology (also partly attributed to Adobe). With Apple now supporting other layout tables, which are not referenced in the Microsoft documentation for OpenType, this really suggested to me a branch split after disagreement (also reinforced by the new status of Monotype). What is strange also is that the www.opentype.org web site is a page whose title refers to Arial Unicode MS. Isn't that a Microsoft font? These things all combined are very intriguing. Is there a way outside OpenType for system vendors other than Microsoft and Apple? This standard looks more and more proprietary...
Re: Nicest UTF
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] Philippe Verdy [EMAIL PROTECTED] writes: Random access by code point index means that you don't use strings as immutable objects, No. Look at Python, Java and C#: their strings are immutable (don't change in-place) and are indexed by integers (not necessarily by code points, but it doesn't change the point). Those strings are not indexed. They are just accessible through methods or accessors that act *as if* they were arrays. Nothing requires the string storage to use the same exposed array, and in fact you can as well work on immutable strings as if they were vectors of code points, or vectors of code units, and sometimes vectors of bytes. Note for example the difference between the .length property of Java arrays and the .length() method of Java String instances... Note also that the conversion of an array of bytes or code units or code points to a String requires distinct constructors, and that the storage is copied rather than simply referenced (the main reason being that indexed vectors or arrays are mutable in their indexed content, but not String instances, which become sharable). Anyway, each time you use an index to access some component of a String, the returned value is not an immutable String, but a mutable character or code unit or code point, from which you can build *other* immutable Strings (using for example mutable StringBuffer or StringBuilder or similar objects in other languages). When you do that, the returned character or code unit or code point does not guarantee that you'll build valid Unicode strings. In fact, such a character-level interface is not enough to work with and transform Strings (for example it does not work to perform correct transformations of lettercase, or to manage grapheme clusters). The most powerful (and universal) transformations are those that don't use these interfaces directly, but work on complete Strings and return complete Strings.
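The point that case mapping is a string-to-string operation, not a character-to-character one, can be shown in a few lines (a Python sketch; the same behavior holds for Java's String.toUpperCase):

```python
# German sharp s: uppercasing one code point yields two, so no
# fixed-width, character-by-character API can implement it correctly.
s = 'stra\u00dfe'               # 'straße', 6 code points
assert len(s) == 6
assert s.upper() == 'STRASSE'   # 7 code points: the string grew

# Combining sequences make the same point for grapheme clusters:
# 'e' + COMBINING ACUTE ACCENT is 2 code points but is perceived,
# and must be handled, as a single unit of text.
e_acute = 'e\u0301'
assert len(e_acute) == 2
```

A charAt-style loop that maps each unit independently can never produce STRASSE from straße, which is exactly why the universal transformations take a whole String and return a whole String.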
The character-level APIs are conveniences for very basic legacy transformations, but they do not alone solve most internationalization problems; or they are used as a protected interface on which more powerful String-to-String transformations are built. Once you realize that, which UTF you use to handle immutable String objects is not important, because it becomes part of the blackbox implementation of String instances. If you consider the UTF as a blackbox, then the real argument for one UTF or another depends on the set of String-to-String transformations you want to use (because it conditions the implementation of these transformations), but more importantly it affects the efficiency of String storage allocation. For this reason, the blackbox can determine itself which UTF or internal encoding is best for performing those transformations: the total volume of immutable string instances held in memory and the frequency of their instantiation determine which representation to use (because large String volumes will put pressure on the memory manager, and will seriously impact the overall application performance). Using SCSU for such a String blackbox can be a good option if this effectively helps store many strings in a compact (for global performance) but still very fast (for transformations) representation. Unfortunately, the immutable String implementations in Java or C# or Python do not allow the application designer to decide which representation will be best (they are implemented as concrete classes instead of virtual interfaces with possible multiple implementations, as they should be; the alternative to interfaces would have been class-level methods allowing the application to negotiate tuning parameters with the blackbox class implementation). There are other classes or libraries within which such multiple representations are possible, and easily and transparently convertible from one to the other.
(Note that this discussion is related to the UTF used to represent code points, but today, there are also needs to work on strings within grapheme cluster boundaries, including the various normalization forms, and a few libraries do exist for which the various normalizations can be changed without changing the immutable aspect of Strings, the complexity being that Strings do not always represent plain-text...)
Re: Nicest UTF
- Original Message - From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Sunday, December 05, 2004 1:37 AM Subject: Re: Nicest UTF Philippe Verdy [EMAIL PROTECTED] writes: There's nothing that requires the string storage to use the same exposed array, The point is that indexing should better be O(1). SCSU is also O(1) in terms of indexing complexity... simply because it keeps the exact equivalence with code points, and requires a *fixed* (and small) number of steps to decode to code points, but also because the decoder state uses a *fixed* (and small) number of variables for the internal context (unlike more powerful dictionary-based, Lempel-Ziv-Welch-like compression algorithms such as deflate). Not having a constant size per code point requires one of three things: 1. Using opaque iterators instead of integer indices. 2. Exposing a different unit in the API. 3. Living with the fact that indexing is not O(1) in general; perhaps with clever caching it's good enough in common cases. Although all three choices can work, I would prefer to avoid them. If I had to, I would probably choose 1. But for now I've chosen a representation based on code points. Anyway, each time you use an index to access some component of a String, the returned value is not an immutable String, but a mutable character or code unit or code point, from which you can build *other* immutable Strings No, individual characters are immutable in almost every language. But individual characters do not always have any semantics. For languages, the relevant unit is almost always the grapheme cluster, not the character (so not its code point...). As grapheme clusters need representations of variable length, an algorithm that could only work with fixed-width units would not work internationally, or would cause serious problems for correct analysis or transformation of real languages.
Assignment to a character variable can be thought of as changing the reference to point to a different character object, even if it's physically implemented by overwriting the raw character code. When you do that, the returned character or code unit or code point does not guarantee that you'll build valid Unicode strings. In fact, such a character-level interface is not enough to work with and transform Strings (for example it does not work to perform correct transformations of lettercase, or to manage grapheme clusters). This is a different issue. Indeed transformations like case mapping work in terms of strings, but in order to implement them you must split a string into some units of bounded size (code points, bytes, etc.). Yes, but why must that intermediate unit be the code point? Such algorithms can be developed with any UTF, or even with compressed encoding schemes, through accessor or enumerator methods... All non-trivial string algorithms boil down to working on individual units, because conditionals and dispatch tables must be driven by finite sets. Any unit of a bounded size is technically workable, but they are not equally convenient. Most algorithms are specified in terms of code points, so I chose code points for the basic unit in the API. Most is the right term here: this is not a requirement, and the simplest way to implement such an algorithm is not necessarily the most efficient in terms of performance or resource allocation. Experience shows that the most efficient algorithms are often also complex to implement.
Code points are probably the easiest terms in which to describe what a text algorithm is supposed to do, but this is not a requirement for applications (in fact many libraries have been written that correctly implement the Unicode algorithms without ever dealing with code points, but only with in-memory code units of UTF-16, or even of UTF-8 or GB18030, or directly with serialization bytes of UTF-16LE or UTF-8 or SCSU or other encoding schemes). Which representation will be best is left to implementers, but I really think that compressed schemes are often introduced to increase application performance and reduce the needed resources, both in memory and for I/O, but also in networking, where interoperability across systems and bandwidth optimization are also important design goals...
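Marcin's choice 3 above, indexing that is not O(1), is what falls out when code points are addressed inside variable-width code units. A Python sketch of the linear scan over UTF-8 bytes (the helper name `codepoint_at` is invented for the example):

```python
def codepoint_at(utf8, n):
    """Return the n-th code point of a UTF-8 byte string by scanning
    lead bytes: O(n), unlike O(1) indexing into fixed-width units."""
    count = -1
    for i, b in enumerate(utf8):
        if (b & 0xC0) != 0x80:        # not a continuation byte: new code point
            count += 1
            if count == n:
                j = i + 1             # extend to the end of this sequence
                while j < len(utf8) and (utf8[j] & 0xC0) == 0x80:
                    j += 1
                return utf8[i:j].decode('utf-8')
    raise IndexError(n)

# 1-, 2-, 3- and 4-byte sequences in one string:
data = 'a\u00e9\u4e2d\U00010400'.encode('utf-8')
assert codepoint_at(data, 2) == '\u4e2d'
```

This is the trade-off the thread keeps circling: UTF-8 and SCSU save space, but random access by code point index degrades to a scan (or needs an auxiliary index or iterator-based API).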
Re: script complexity, was Re: OpenType vs TrueType
Richard Cook rscook at socrates dot berkeley dot edu wrote: Script complexity is not so easily quantified. Has anyone tried to sort scripts by complexity? In terms of the present discussion, Han would be viewed as a simple script, and yet it is simple only in terms of the script model in which ideographs are the smallest unit. In a stroke-based Han script model, Han is at least as complex as any. If Han had not been encoded with an ideograph-based model, maybe we would have needed far fewer code points. However, the main immediate problem would have been that the layout of composite radicals and strokes within the ideographic square is very complex, highly contextual, and in fact too variable across dialects and script forms to allow a layout algorithm to be designed and standardized. At best one could have standardized a Han stroke-to-square layout system, but it would have required a huge dictionary, with many dialect-specific sections to handle the variant forms and placement of the composing strokes. In addition, the square model is not imperative in Han, because there are various styles for writing it, in which the usual square model is much relaxed, or simply not observed in actual documents. To model such variations in a stroke-based model, it would have been necessary to encode: - the strokes themselves (all of them, not just the radicals!) - stroke variants - descriptive composition pseudo-characters (like the existing IDC in Unicode) - dialectal composition rules. And then to create a very complex specification describing each ideograph according to this model, allowing a renderer to redraw the ideographs from such composition grapheme clusters. The second problem is that the GB* and Big5 encodings already existed as widely used standards, and there was no concrete and interoperable solution for representing Han characters with such composed sequences.
This modeling was possible for Hangul, but with a simplification: the encoded jamos sometimes represent several strokes (considered as letters, also because they have a clear phonetic value, but sometimes grouped within the same jamo to simplify the design of the Hangul layout system, notably for the double-consonant SANG* jamos). But a simpler system of jamos would still have been possible (for example, it would have been easy to model the double-consonant jamos as two successive simpler jamos, and then update the Hangul syllable model accordingly).
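For reference, the jamo-to-syllable model that Unicode did standardize for Hangul is purely arithmetic; this is the standard composition formula from the Unicode conformance chapter, and the helper below is just an illustration of it:

```python
# Standard Hangul syllable composition: a precomposed syllable is
# computed from leading (L), vowel (V) and trailing (T) jamo indices.
S_BASE, V_COUNT, T_COUNT = 0xAC00, 21, 28

def compose_syllable(l_index: int, v_index: int, t_index: int = 0) -> str:
    """Map jamo indices to the precomposed Hangul syllable."""
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

# HIEUH (L=18) + A (V=0) + trailing NIEUN (T=4) -> U+D55C HAN
print(compose_syllable(18, 0, 4))   # 한
```

It is exactly this kind of closed-form layout model that was never achievable for Han stroke composition.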
Re: Unicode for words?
From: Ray Mullan [EMAIL PROTECTED] I don't see how the one million available codepoints in the Unicode Standard could possibly accommodate a grammatically accurate vocabulary of all the world's languages. You have misread the message from Tim: he wanted to use code points above U+10FFFF, within the full 32-bit space (meaning more than 4 billion code points, when Unicode and ISO 10646 allow just over 1.1 million...). He wanted to use that to encode words as single code points, as a possible compression scheme. But he forgets that a word's component letters can be affected by styling or rendering. Also, a font or renderer would be unable to draw the text without having the equivalent of an indexed dictionary of all words on the planet! If compression is the goal, he forgets that the space gain offered by such compression would be very modest compared to more generic data compressors like deflate or bzip2, which can compress the represented texts more efficiently without even needing such a large dictionary (one in perpetual evolution by every speaker of every language, without any prior standard agreement anywhere!). Forget his idea; it is technically impossible to do. At best you could create protocols that compact some widely used words (this is what WAP does for widely used markup elements and attributes), but this is still not a standard outside of that limited context. Suppose that Unicode encoded the common English words the, an, is, etc.; then a protocol could decide that these words are not important and filter them out. What would happen when these words appear in non-English languages where they are semantically significant? They would go missing. To mitigate this inconvenience, the code points would have to designate only the words used in one language and not another, so an would have different codes depending on whether it is used in English or in another language.
The last problem is that too many languages lack well-established, computerized lexical dictionaries, and the grammatical rules for composing words are not always known. Nor can the number of words in a single language be bounded by a known maximum (a good example is German, where compound words are virtually unlimited!). So forget this idea: Unicode will not create a standard to encode words. Words will be represented by modeling them onto a script system made of simpler sets of letters or ideographs, punctuation and diacritics. The representation of words with those letters is an orthographic system, specific to each language, that Unicode will not standardize.
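The deflate/bzip2 point is easy to check in a few lines (a rough illustration, using zlib and bz2 as stand-ins for "generic data compressors"; the sample text is arbitrary):

```python
import bz2
import zlib

# Any repetitive natural-language text compresses to far less than one
# fixed-size code per word, with no shared word dictionary needed on
# either side of the channel.
text = "the quick brown fox jumps over the lazy dog. " * 50
utf8 = text.encode("utf-8")

print(len(utf8), len(zlib.compress(utf8)), len(bz2.compress(utf8)))
```

Both compressed sizes come out a small fraction of the raw UTF-8 size, which is the gain a word-encoding scheme would have to beat.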
Re: Nicest UTF
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] Philippe Verdy [EMAIL PROTECTED] writes: The point is that indexing should better be O(1). SCSU is also O(1) in terms of indexing complexity... It is not. You can't extract the nth code point without scanning the previous n-1 code points. The question is why you would need to extract the nth code point so blindly. If you have such a reason, because you know the context in which this index is valid and usable, then you can just as well extract a sequence using an index into the SCSU encoding itself, using the same knowledge. Linguistically, extracting a substring or characters at a random index in a sequence of code points will only cause you problems. In general, you will more likely use an index as a way to mark a known position that you have already reached by parsing sequentially in the past. However, it is true that if you have determined a good index position to allow future extraction of substrings, SCSU is more complex, because you need to remember not only the index but also the current state of the SCSU decoder, in order to decode characters starting at that index. This is not needed for UTFs and most legacy character encodings, or national standards, or GB18030, which looks like a valid UTF even though it is not part of the Unicode standard itself. But remember the context in which this discussion was introduced: which UTF would be best to represent (and store) large sets of immutable strings. The discussion about indexing into substrings is not relevant in that context.
Re: Unicode for words?
Don't misinterpret my words or arguments here: the purpose of the question was strictly about which UTF or other transformation would be good for interoperability and storage, and whether it would be a good idea to encode words with standard codes. So in my view, it is completely unnecessary to create such standard codes for common words, if these words are in a natural human language (it may make sense for computer languages, but that is specific to the implementation of such a language, and should be part of its specification rather than standardized in a general-purpose encoding like Unicode code points, which must also fit all the needs of representing human languages, which are NOT standardized and are constantly evolving). Creating such standard codes for human words would not only be an endless task, but also work that would rapidly become obsolete, and not grounded in the very variable uses of human languages. Let's keep Unicode simple, without attempting to encode words (even for Chinese, we encode ideographic characters, not words, which are often made of two characters each representing a single syllable). If you want to encode words, you create an encoding based on a pictographic representation of human languages, and you are going a different way from the long history of evolution followed by the inventors of script systems. You would be returning to the first ages of humanity... when people had great difficulty understanding each other, and transmitting their acquired knowledge. This does not exclude other UTF representations for implementing algorithms, merely as an intermediate form which eases the processing. However, you are not required to create an actual instance of the other UTF to work with it, and there are many examples where you can perfectly well work with a compact representation that fits marvelously in memory with excellent performance, where the decompressed form is only used locally.
In *many* cases, notably if the text data managed this way is large, adding an object representation with just an API to access a temporary decompressed form will improve the global performance of the system, due to reduced internal processing resource needs. Code that decompresses SCSU to UTF-32 can fit in less than 1KB of memory, but it allows saving as many megabytes of memory as you wish for your large database, given that SCSU takes an average of nearly one byte per character (or code point) instead of 4 with UTF-32. Such examples exist in real-world applications, notably in spelling and grammar checkers, whose performance depends completely on the total size of the information they have in their database, and on the level to which this information is compressed (to minimize the impact on system resources, which is mostly determined by the quantity of information you can fit into fast memory without swapping between fast memory and slow disk storage). The most efficient checkers use very compact forms with very specific compression and indexing schemes, behind a transparent class managing the conversion between this compact form and the usual representation of text as a linear stream of characters. Other examples exist in some RDBMSs, to improve the speed of query processing for large databases, or the speed of full-text searches, or in their networking connectors to reduce the bandwidth taken by result sets. The interest of data compression becomes immediate as soon as the data to process must go through any kind of channel (networking links, file storage, database tables) with lower throughput than fast but expensive or restricted internal processing memory (including memory caches, if we consider data locality). From: D. Starner [EMAIL PROTECTED] Philippe Verdy writes: Suppose that Unicode encodes the common English words the, an, is, etc... then a protocol could decide that these words are not important and will filter them.
Drop the part of the sentence before then. A protocol could delete the, an, etc. right now. In fact, I suspect several library systems do drop the, etc. right now. Not that this makes it a good idea, but that's a lousy argument. If such a library does this based only on the presence of the encoded words, without asking in which language the text is written, that kind of text processing will be seriously inefficient or inaccurate when processing languages other than the English for which the library was built. For plain text (which is what Unicode deals with), even the words an, the, is (and so on) are as important as the other parts of the text. Encoding frequent words with a single compact code may be effective for a limited set of applications, but it will not be as effective as a more general compression scheme (deflate, bzip2, and so on), which works best independently of the language, and without needing (when
Re: Nicest UTF
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] Now consider scanning forwards. We want to strip a beginning of a string. For example the string is an irc message prefixed with a command and we want to take the message only for further processing. We have found the end of the prefix and we want to produce a string from this position to the end (a copy, since strings are immutable). None of this demonstrates the point: decoding IRC commands and similar things does not constitute a need to encode large sets of texts. In your examples, you show applications that locally handle strings from computer languages. Texts in human languages, or even collections of person or place names, are not like this; they have a much wider variety, but with huge possibilities for data compression (inherent in the phonology of human languages and their overall structure, but also due to repetitive conventions spread throughout a text to allow easier reading and understanding). Scanning a person name or human text backward may be needed locally, but such text has a strong forward directionality without which it does not make sense. The same applies if you scan such text starting at random positions: you could make many false interpretations of the text by extracting random fragments like this. Anyway, if you have a large database of texts to process or even to index, you will in the end need to scan the text linearly, from beginning to end, if only to create an index for accessing it randomly later. You will still need to store the indexed text somewhere, and in order to maximize the performance or responsiveness of your application, you'll need to minimize its storage: that's where compression takes place.
This does not change or remove the semantics of the text; it is simply an optimization, one which does not prevent further access through a more easily parsable representation as stateless streams of characters, via surjective (sometimes bijective) converters between the compressed and uncompressed forms. My conclusion: there is no best representation to fit all needs. Each representation has its merits in its domain. The Unicode UTFs are excellent only for local processing of limited texts, but they are not necessarily the best for long-term storage or for large text sets. And even for texts that will be accessed frequently, compressed schemes can still constitute optimizations, even if these texts need to be decompressed each time they are needed. I am clearly against "one scheme fits all needs" arguments, even if you think that UTF-32 is the only viable long-term solution.
Fw: Nicest UTF
From: Doug Ewell [EMAIL PROTECTED] Here is a string, expressed as a sequence of bytes in SCSU: 05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E See how long it takes you to decode this to Unicode code points. (Do not refer to UTN #14; that would be cheating. :-) Without looking it up, it's easy to see that this stream is separated into three sections, initiated by 05 1C, then 05 1D, then 12. I can't remember without looking at the UTN what these perform (i.e. which Unicode code point ranges they select), but the other bytes are simple offsets relative to the start of the selected ranges. Also, the third section is ended by a regular dot (2E) in the ASCII range selected for the low half-page, and the other bytes are offsets within the script block initiated by 12. Immediately I can identify this string, without looking at any table: ?Moscow? is ??????. where the ? around Moscow are some opening and closing quotation marks, and each later ? replaces a character that I can't decipher from my faulty memory alone. (I don't need to remember the details of the standard table of ranges, because I know that this table is complete in a small and easily available document.) A computer can do this much better than I can (it can even know, much better than I can, what corresponds to a given code point like U+6327, if it is effectively assigned; I would have to look into a specification or use a charmap tool, if I'm not used to entering this character in my texts). The decoder part of SCSU remains extremely trivial to implement, given the small but complete list of codes that can alter the state of the decoder, because there is no choice in its interpretation, and because the set of variables storing the decoder state is very limited, as is the number of decision tests at each step. It is a basic finite state automaton.
Only the encoder may be a bit complex to write (if one wants to generate the optimal smallest result size), but even a moderately skilled programmer could find a simple, working scheme with an excellent compression rate (around 1 to 1.2 bytes per character on average for any Latin text, and around 1.2 to 1.5 bytes per character for Asian texts, which would still make SCSU a good choice compared to UTF-32 or even UTF-8).
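A minimal sketch of such a decoder, handling only the features Doug's stream actually uses (SQn single-quoting and SCn window selection, with the default window offsets from the SCSU specification); a full decoder would also need the window-definition tags and Unicode mode:

```python
# Partial SCSU decoder: SQn, SCn, and single-byte mode only.
# Default offsets of the eight static and dynamic windows per the spec.
STATIC  = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]
DYNAMIC = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]

def decode_scsu(data: bytes) -> str:
    out = []
    window = 0                              # active dynamic window
    i = 0
    while i < len(data):
        b = data[i]; i += 1
        if 0x01 <= b <= 0x08:               # SQn: quote one char from window n
            n = b - 0x01
            c = data[i]; i += 1
            base = STATIC[n] if c < 0x80 else DYNAMIC[n] - 0x80
            out.append(chr(base + c))
        elif 0x10 <= b <= 0x17:             # SCn: select dynamic window n
            window = b - 0x10
        elif b < 0x80:                      # plain ASCII passthrough
            out.append(chr(b))
        else:                               # char in the active dynamic window
            out.append(chr(DYNAMIC[window] + b - 0x80))
    return ''.join(out)

stream = bytes.fromhex('051C4D6F73636F77051D20697320129CBEC1BAB2B02E')
print(decode_scsu(stream))   # “Moscow” is Москва.
```

Run over Doug's bytes, it confirms the reading above: curly quotes from the punctuation window, ASCII "Moscow is", then Cyrillic from dynamic window 2, closed by an ASCII dot.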
Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...
- Original Message - From: Arcane Jill [EMAIL PROTECTED] Probably a dumb question, but how come nobody's invented UTF-24 yet? I just made that up, it's not an official standard, but one could easily define UTF-24 as UTF-32 with the most-significant byte (which is always zero) removed, hence all characters are stored in exactly three bytes and all are treated equally. You could have UTF-24LE and UTF-24BE variants, and even UTF-24 BOMs. Of course, I'm not suggesting this is a particularly brilliant idea, but I just wonder why no-one's suggested it before. UTF-24 already exists as an encoding form (it is identical to UTF-32), if you just consider that encoding forms only need to be able to represent the valid code range within a single code unit. UTF-32 is not meant to be restricted to 32-bit representations. However, it's true that UTF-24BE and UTF-24LE could be useful as encoding schemes for serialization to byte-oriented streams, suppressing one unnecessary byte per code point. (And then of course, there's UTF-21, in which blocks of 21 bits are concatenated, so that eight Unicode characters will be stored in every 21 bytes - and not to mention UTF-20.087462841250343, in which a plain text document is simply regarded as one very large integer expressed in radix 1114112, and whose UTF-20.087462841250343 representation is simply that number expressed in binary. But now I'm getting /very/ silly - please don't take any of this seriously.) :-) I don't think that UTF-21 would be useful as an encoding form, but possibly as an encoding scheme where 3 always-zero bits would be stripped, providing a tiny level of compression which would only be justified for transmission over serial or network links. However, I do think that such an optimization would have the effect of removing the byte alignment on which more powerful compressors rely. If you really need more effective compression, use SCSU, or apply some deflate or bzip2 compression to UTF-8, UTF-16, or UTF-24/32...
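A hypothetical "UTF-24BE" serializer along the lines Jill describes takes only a few lines (this is not a standard scheme, just the "drop the always-zero byte" idea made concrete):

```python
# Hypothetical UTF-24BE: each code point stored big-endian in exactly
# three bytes. Not a standardized UTF; illustration only.

def encode_utf24be(text: str) -> bytes:
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        out += bytes([(cp >> 16) & 0xFF, (cp >> 8) & 0xFF, cp & 0xFF])
    return bytes(out)

def decode_utf24be(data: bytes) -> str:
    return ''.join(chr(int.from_bytes(data[i:i + 3], 'big'))
                   for i in range(0, len(data), 3))

s = "A€𐍈"   # ASCII, BMP, and a supplementary-plane character
print(encode_utf24be(s).hex())   # 0000410020ac010348
```

Every code point, BMP or supplementary, costs exactly three bytes, which is the "all characters treated equally" property Jill was after.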
(there's not much difference between compressing UTF-24 or UTF-32 with generic compression algorithms like deflate or bzip2). The UTF-24 thing seems a reasonably sensible question though. Is it just that we don't like it because some processors have alignment restrictions or something? There do exist, even today, 4-bit processors, and 1-bit processors, where the smallest addressable memory unit is smaller than 8 bits. They are used in low-cost micro-devices, notably to build automated robots for industry, or even in many home and kitchen appliances. I don't know whether they need Unicode to represent international text, given that they often have a very limited user interface, incapable of inputting or outputting text, but who knows? Maybe they are used in some mobile phones, or within smart keyboards or tablets or other input devices connected to PCs... There also exist systems where the smallest addressable memory cell is a 9-bit byte. This is more of an issue here, because the Unicode standard does not specify whether encoding schemes (which serialize code points to bytes) should set the 9th bit of each byte to 0, or should fill every bit of memory, even if this means that the 8-bit bytes of UTF-8 will not be synchronized with the memory's 9-bit bytes. Somebody already introduced UTF-9 in the past for 9-bit systems. A 36-bit processor could likewise address memory in cells of 36 bits, where the 4 highest bits would either be used as CRC control bits (generated and checked automatically by the processor or a memory bus interface within memory regions where this behavior is enabled), or used to store supplementary bits of actual data (in unchecked regions that fit in reliable and fast memory, such as the internal memory cache of the CPU, or static CPU registers).
For such things, the impact of transforming addressable memory widths through interfaces is for now not discussed in Unicode, which assumes that internal memory is necessarily addressed in units that are a power of 2 and a multiple of 8 bits, and then interchanged or stored using this byte unit. Today, we are witnessing the constant expansion of bus widths to allow parallel processing instead of multiplying the working frequency (and the energy spent and heat produced, which create other environmental problems), so why should the 8-bit byte remain the most efficient universal unit? If you look at IEEE floating-point formats, they are often implemented in FPUs working on 80-bit units, and an 80-bit memory cell could well become a standard tomorrow (compatible with the increasingly used 64-bit architectures of today) which would no longer be a power of 2 (even if it stays a multiple of 8 bits). On an 80-bit system, the easiest solution for handling UTF-32 without using too much space would be a unit of 40 bits (i.e. two code points per 80-bit memory cell). But if you consider that 21 bits only are used in Unicode,
Re: proposals I wrote (and also, didn't write)
From: E. Keown [EMAIL PROTECTED] I wrote 3 Hebrew diacritics proposals between May-July. (...) 1. Proposal to add Samaritan Pointing to the UCS http://www.lashonkodesh.org/samarpro.pdf WG2 number: N2748 2. Proposal to add Palestinian Pointing to ISO/IEC 10646 http://www.lashonkodesh.org/palpro.pdf 3. Proposal to add Babylonian Pointing to ISO/IEC 10646 http://www.lashonkodesh.org/bavelpro.pdf (...) Other Items Supporting the Pointing Proposals Above: Letter Requesting 'Hebrew Extended' Block (7/2004) http://www.lashonkodesh.org/roadm08.pdf The Aramaic and Hebrew Character Sets (June 2004) http://www.lashonkodesh.org/hprelist.doc Hello Elaine, In all your searches and in your proposals, did you try to segregate the proposed additional characters into two separate categories: those needed for inclusion in many modern studies, and those only used in very old scripts with many unknown or ambiguous properties? I ask because not all the Hebrew Extended characters may need an allocation in the BMP (in row U+08xx as suggested); some could be placed in the SMP, in a separate Hebrew-Aramaic-Mandaic Extended block (including notably some punctuation signs or old numerals, or other diacritics needed for Phoenician and other extinct branches or variants). Philippe.
Re: Nicest UTF
From: D. Starner [EMAIL PROTECTED] If you're talking about a language that hides the structure of strings and has no problem with variable length data, then it wouldn't matter what the internal processing of the string looks like. You'd need to use iterators and discourage the use of arbitrary indexing, but arbitrary indexing is rarely important. I fully concur with this point of view. Almost all (if not all) string processing can be performed in terms of sequential enumerators instead of through random indexing (which also has the big disadvantage of not coping with rich context-dependent processing behaviors, something you can't ignore when handling international text). So the internal storage of a string does not matter for the programming interface of parsable string objects. In terms of efficiency and global application performance, using compressed encoding schemes is highly recommended for large databases of text, because the negative impact of the decompression overhead is extremely small compared to the huge benefits you get from reducing the load on system resources, on data locality and memory caches, on the system memory allocator, on the memory fragmentation level, on VM swaps, and on file or database I/O (which will be the only effective limitation for large databases).
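To illustrate, here is a typical string operation written purely against a forward enumerator, never indexing into the underlying storage (the storage happens to be UTF-8 bytes here, and the function names are made up for the sketch):

```python
import codecs
from itertools import islice

def iter_code_points(utf8: bytes):
    """Walk the UTF-8 storage strictly forward, yielding one character
    at a time via Python's incremental decoder."""
    dec = codecs.getincrementaldecoder('utf-8')()
    for b in utf8:
        yield from dec.decode(bytes([b]))

def starts_with(utf8: bytes, prefix: str) -> bool:
    """Prefix test using only sequential enumeration, never utf8[n]."""
    return list(islice(iter_code_points(utf8), len(prefix))) == list(prefix)

print(starts_with("héllo".encode('utf-8'), "hé"))   # True
```

Because the caller only sees the enumerator, the storage could just as well be SCSU or any other compressed scheme without changing this code.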
Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...
From: Kenneth Whistler [EMAIL PROTECTED] Yes, and pigs could fly, if they had big enough wings. Once again, this is a creative comment. As if Unicode had to be bound by architectural constraints such as the requirement of representing code units (which are architectural for a system) only as 16-bit or 32-bit units, ignoring the fact that technologies evolve and will not necessarily keep this constraint. 64-bit systems already exist today, and even if they have, for now, the architectural capability of handling 16-bit and 32-bit code units efficiently so that they can be addressed individually, this will possibly not be the case in the future. When I look at encoding forms such as UTF-16 and UTF-32, they just define the value ranges in which code units will be valid, but not necessarily their size. You are confusing this with encoding schemes, which are what is needed for interoperability, and where other factors such as bit and byte ordering are also important in addition to the value range. I see nothing wrong if a system stores UTF-32 code units in 24-bit or even 64-bit memory cells, as long as it respects and fully represents the value range defined in the encoding forms, and as long as the system also provides an interface to convert them, via encoding schemes, to interoperable streams of 8-bit bytes. Are you saying that UTF-32 code units need to be able to represent any 32-bit value, even though the valid range is limited, for now, to the first 17 planes? An API on a 64-bit system that said it requires strings to be stored in UTF-32 would also define how UTF-32 code units are represented. As long as the valid range 0 to 0x10FFFF can be represented, this interface will be fine. If this system is designed so that two or three code units are stored in a single 64-bit memory cell, no violation of the valid range will occur.
More interestingly, there already exist systems where memory is addressable in units of 1 bit, and on these systems a UTF-32 code unit will work perfectly if code units are stored at steps of 21 bits of memory. On 64-bit systems, the possibility of addressing any group of individual bits will become an interesting option, notably when handling complex data structures such as bitfields, data compressors, bitmaps... No more need for costly shifts and masking. Nothing would prevent such a system from interoperating with 8-bit-byte-based systems (note also that recent memory technologies use fast serial interfaces instead of parallel buses, so that memory granularity matters less). The only cost of bit-addressing is that it requires 3 extra bits of address, but in a 64-bit address this cost seems very low, because the globally addressable space would still be... more than 2.3*10^18 bytes, much more than any computer will manage in a single process for the next century (according to Moore's law, which doubles computing capabilities every 3 years). Even such a scheme would not limit performance, given that memory caches are paged, and these caches keep growing, eliminating most of the costs and problems related to data alignment experienced today on bus-based systems. Other territories also remain unexplored in microprocessors, notably the possibility of using non-binary numeric systems (think about optical or magnetic systems, which could outperform current electrical systems due to the reduced power and heat of currents of electrons through molecular substrates, replacing them by shifts of atomic states caused by light rays, and the computing possibilities offered by light diffraction through crystals). The lowest granularity of information at some point in the future may be larger than a dual-state bit, meaning that today's 8-bit systems would need to be emulated using other numeric systems...
(Note for example that to store the range 0..0x10FFFF, you would need 13 digits in a ternary system, and to store the range of 32-bit integers, you would need 21 ternary digits; memory technologies for such systems might use byte units made of 6 ternary digits, so programmers would have the choice between 3 ternary bytes, i.e. 18 ternary digits, to store our 21-bit code units, or 4 ternary bytes, i.e. 24 ternary digits or more than 34 binary bits, to be able to store the whole 32-bit range.) Nothing there is impossible for the future (when it becomes more and more difficult to increase the density of transistors, or to reduce the voltage further, or to increase the working frequency, or to avoid the inevitable random presence of natural defects in substrates, escaping from the historically binary-only systems may offer interesting opportunities for further performance increases).
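The ternary-digit counts can be verified in a couple of lines (a trivial check, nothing more):

```python
def ternary_digits(max_value: int) -> int:
    """Smallest number of base-3 digits able to store 0..max_value."""
    d = 0
    while 3 ** d <= max_value:
        d += 1
    return d

print(ternary_digits(0x10FFFF))      # 13 (3**13 = 1594323 > 1114112)
print(ternary_digits(0xFFFFFFFF))    # 21 (3**21 > 2**32, 3**20 < 2**32)
```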
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
RE: Invalid UTF-8 sequences (was: Re: Nicest UTF) I know what you mean here: most Linux/Unix filesystems (as well as many legacy filesystems for Windows and MacOS...) do not track the encoding with which filenames were encoded, so, depending on the local user preferences at the time each file was created, filenames on such systems can have unpredictable encodings. The problem most often appears when interchanging data from one system to another, through removable or shared volumes. Needless to say, these systems were badly designed at their origin, and newer filesystems (and OS APIs) offer much better alternatives, either by storing explicitly on the volume which encoding it uses, or by converting all user-selected encodings to a common kernel encoding such as a Unicode encoding scheme (this is what FAT32 and NTFS do for filenames created under Windows, since Windows 98 or NT). I understand that there may be situations, such as Linux/Unix UFS-like filesystems, where it will be hard to decide which encoding was used for filenames (or simply for the content of plain-text files). For plain-text files, which contain enough data, automatic identification of the encoding is possible and is used successfully in many applications (notably in web browsers). But for filenames, which are generally short, automatic identification is often difficult. However, UTF-16 remains easy to identify, most often, due to the very unusual frequency of low byte values at every even or odd position. UTF-8 is also easy to identify thanks to its strict rules (without these strict rules, which forbid some sequences, automatic identification of the encoding becomes very risky). If the encoding cannot be identified precisely and explicitly, I think that UTF-16 is much better than UTF-8 (it also offers a better size compromise for names in any modern language). However, it's true that UTF-16 cannot be used on Linux/Unix due to the presence of null bytes.
The alternative is then UTF-8, but it is often larger than legacy encodings. An alternative can then be a mixed encoding selection: - choose a legacy encoding that will most often be able to represent valid filenames without loss of information (for example ISO-8859-1, or Cp1252); - encode the filename with it; - try to decode the result with a *strict* UTF-8 decoder, as if it were UTF-8 encoded; - if there's no failure, then you must re-encode the filename with UTF-8 instead, even if the result is longer; - if the strict UTF-8 decoding fails, you can keep the filename in the first 8-bit encoding. When parsing files: - try decoding filenames with *strict* UTF-8 rules; if this does not fail, then the filename was effectively encoded in UTF-8; - if the decoding fails, decode the filename with the legacy 8-bit encoding. But even with this scheme, you will find interoperability problems, because some applications will expect only the legacy encoding, or only the UTF-8 encoding, without deciding...
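The scheme above can be sketched directly (assuming ISO-8859-1 as the chosen legacy encoding; this is an illustration of the heuristic, not a hardened implementation):

```python
# Mixed-encoding filename heuristic: prefer the legacy encoding unless
# its bytes would also parse as strict UTF-8, in which case force UTF-8
# so that readers applying the rules below cannot misinterpret them.

def encode_filename(name: str) -> bytes:
    try:
        raw = name.encode('iso-8859-1')
    except UnicodeEncodeError:
        return name.encode('utf-8')   # not representable in the legacy set
    try:
        raw.decode('utf-8', errors='strict')
    except UnicodeDecodeError:
        return raw                    # safely distinguishable from UTF-8
    return name.encode('utf-8')       # ambiguous bytes: store UTF-8 instead

def decode_filename(raw: bytes) -> str:
    try:
        return raw.decode('utf-8', errors='strict')
    except UnicodeDecodeError:
        return raw.decode('iso-8859-1')

print(decode_filename(encode_filename('café')))   # café
```

The round trip is lossless, but, as noted, nothing forces other applications sharing the volume to apply the same rules.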
Re: Re: Word dividers, was: proposals I wrote (and also, didn't write)
From: Michael Everson

But there is already in the pipeline a PHOENICIAN WORD SEPARATOR [...] The glyphs for all of these seem indistinguishable, and so are the functions. The only difference seems to be the scripts they are associated with, but punctuation marks are supposed to be not tied to individual scripts.

Read the proposal. It is not always a dot.

John said: We already have gobs of dots. It's one of those things: on the other hand, Unicode unifies all the Indic dandas, for example.

Not for long, one hopes. And other Brahmic dandas are not unified.

Why would there be too many dots in Unicode? Unicode does not encode glyphs, but abstract characters, nearly independently of their glyphs. The need to encode them is justified by distinct semantics, distinct layout rules, and the need to make each encoded script coherent with itself, with appropriate character properties not wildly and abusively borrowed from other scripts that have their own rules... This is true with the exception of Latin/Greek/Cyrillic or Hiragana/Katakana, which have so many interactions that they share the same set of diacritics. (For now these are in a block considered generic, but in fact I really think that this genericity should not be abused, and that Unicode could possibly define more precisely to which script family they apply. For example, I see little interest in considering the COMBINING DOT ABOVE useful for anything other than Greek/Cyrillic/Latin (and possibly a few other historic scripts); if another script needs a combining dot above, it should be encoded separately for that script, with its own name and its own properties.) There are probably lots of missing properties for combining characters, notably layout-interaction properties that are not accurately represented by combining classes (which accurately define only the canonical equivalences, not the significant equivalences). For me it's part of the Unicode job to document and standardize them.
Same thing for Hangul jamos (notably the historic ones, but also SSANG-letters) which should have additional normative properties related to their actual composition and layout.
Re: IUC27 Unicode, Cultural Diversity, and Multilingual Computing / Africa is forgotten once again.
Probably the first thing to do for Africa is to extend the support of software with localized content, which can ALREADY be done with existing encoded scripts. But even there, software companies are not progressing much, even though this causes no technical problems with the existing Unicode repertoire (for example: Wolof, Yoruba, Kinyarwanda, ... and even Arabic, or the Latin-based transliterations of these languages already in use). If only such localization efforts were made, there would be business opportunities in Africa to support other native scripts as well. When you see that even the famous libraries in rich countries cannot support the cost of maintaining their databases or of conserving so many books and works of art, imagine what African countries can do when there's not even a version of Windows or Linux supporting these languages for the common user interface needed by everyone at the first basic stages of literacy and computer knowledge. Thankfully, Microsoft has now opened its system to African languages (it was long awaited). I won't blame the richest man on earth for giving money to support literacy and the development of culture in Africa, as a fundamental step in the economic development of these areas, but also as a way to fight the ignorance which has caused so much damage in Africa (in terms of security, with wars and abuses against children; in terms of freedom, with the condition of women; or in terms of health, with the tragic pandemics of AIDS, tuberculosis...). I really think that the conditions for the development of Africa will come from the education of Africa with tools and methods made for and by African users. But instead of only selling arms, giving military assistance, or giving food, we in rich countries should promote and support education with now very cheap technologies, and donate to inexpensive cultural programs such as the localization of software.
There's no gain, for now, in trying to sell costly solutions and overprotecting them in Africa (even if this means that we should tolerate software piracy in Africa, in order to let its population get its basic right to knowledge). Whether these countries choose Windows or Linux does not matter (I think that even promoting Linux usage in Africa would expand the market for proprietary software like Windows or Unix distributions; Africa is not Asia, and the conditions for a parallel development are still not there). So let's think about really getting out of our rich-country ghettos, and make some effort to organize technological events and meetings in places which are less costly for African communities. Some places are favorable, without major conflicts or security risks, with reasonable facilities and comfortable accessibility by airlines (Morocco, Tunisia, Egypt, South Africa), but also in the Middle East (Arab Emirates, Oman?); it's probably too difficult to organize something for now in the currently insecure Western Africa, despite its cultural interest (however, West African communities are extremely present in Europe). But more than temporary events, there's a need for a more permanent working group in this area. Why not seek collaboration with the newly created AfriNIC, with its permanent offices in South Africa, Egypt and Mauritius?

- Original Message -
From: Azzedine Ait Khelifa
To: [EMAIL PROTECTED]
Sent: Wednesday, December 08, 2004 11:08 PM
Subject: IUC27 Unicode, Cultural Diversity, and Multilingual Computing / Africa is forgotten once again.

Hello All, The subject of this conference is really interesting and very useful. But once again Africa is forgotten. I want to know if we can have the same conference, "Africa Oriented", scheduled. If not, what should we do to have this conference scheduled in a city accessible for the African community (like Paris)? Thank you all. AAK
Re: Nicest UTF
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]

Ok, so it's the conversion from raw text to escaped character references which should treat combining characters specially. What about '<' with combining acute, which doesn't have a precomposed form? A broken opening tag or a valid text character?

Also a broken opening tag, for HTML/XML documents, which are NOT plain-text documents: they must first be parsed as HTML/XML, before parsing the many text sections contained in text elements, element names, attribute names, attribute values, etc., as plain text, under the restrictions specified in the HTML or XML specifications (which contain restrictions, for example, on which characters are allowed in names). The XML/HTML core syntax is defined with fixed behavior for some individual characters like '<', '>', quotation marks, and with special behavior for spaces. This core structure is not plain text, and cannot be overridden, even by Unicode grapheme clusters. Note that HTML/XML do NOT mandate the use, or even the support, of Unicode: just the support of a character repertoire that contains some required characters, and the acceptance of at least the ISO/IEC 10646 repertoire under some conditions. However, the encoding to code points itself is not required for anything other than numeric character references, which are symbolic, in a way similar to other named character entities in SGML, rather than absolute (as implying required support of the repertoire with a single code)! So you can as well create fully conforming HTML or XML documents using a character set which includes characters not even defined in Unicode/ISO/IEC 10646, or characters defined only symbolically, with just a name. Whether this name maps or not to one or more Unicode characters does not change the validity of the document itself.
And all the XML/HTML behavior ignores almost all Unicode properties (including normalization properties, because XML and HTML treat different strings which are canonically equivalent as completely distinct; an important feature for cases like XML Signatures, where normalization of documents should not be applied blindly, as it would break the data signature). If you want to normalize XML documents, you should not do it with a normalizer working on the whole document as if it were plain text. Instead you must normalize the individual strings that are in the XML InfoSet, as accessible when browsing the nodes of its DOM tree, and then you can serialize the normalized tree to create a new document (using CDATA sections and/or character references, if needed, to escape syntactic characters reserved by XML that are present in the string data of DOM tree nodes). Note also that an XML document containing references to Unicode non-characters would still be well-formed, because these characters may be part of a non-Unicode charset. XML document validation is a problem separate from, and optional relative to, XML parsing, which checks well-formedness and builds a DOM tree: validation is only performed when matching the DOM tree against a schema definition, DTD or XSD, in which additional restrictions on allowed characters may be checked, or in which additional symbolic-only characters may be defined and used in the XML document with parsable named entities similar to: &gt;.
(An example: the schema may contain a definition for a character representing a private company logo, mapped to a symbolic name; the XML document can contain such references, but the DTD may also define an encoding for it in a private charset, so that the XML document will directly use that code. The Apple logo in Macintosh charsets is an example, for which an internal mapping to Unicode PUAs is not sufficient to allow correct processing of multiple XML documents, where the PUAs used in each XML document have no common equivalence; the conversion of such documents to Unicode with these PUAs is a lossy conversion, not suitable for XML data processing.)
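The per-node normalization described above (normalize the strings in the DOM tree, never the serialized document) can be sketched with Python's standard DOM. This is a minimal illustration of the approach, not a complete XML Signature canonicalizer; the function name is mine.

```python
import unicodedata
import xml.dom.minidom as minidom

def normalize_dom(doc):
    """Apply NFC to the string data held in the DOM tree (text, CDATA,
    attribute values) while leaving the markup structure untouched."""
    def walk(node):
        if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
            node.data = unicodedata.normalize("NFC", node.data)
        if node.attributes:
            for i in range(node.attributes.length):
                attr = node.attributes.item(i)
                attr.value = unicodedata.normalize("NFC", attr.value)
        for child in node.childNodes:
            walk(child)
    walk(doc.documentElement)
    return doc
```

Serializing the result afterwards (with any needed character references) then yields a normalized document without ever running a normalizer over raw markup.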
Re: Nicest UTF
From: D. Starner [EMAIL PROTECTED]

Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] writes: If it's a broken character reference, then what about A&#769;? (769 is the code for combining acute, if I'm not mistaken.)

Please start adding spaces to your entity references or something, because those of us reading this through a web interface are getting very confused.

No confusion is possible when using any classic mail reader. Blame your ISP (and other ISPs as well, like AOL, that don't respect the interoperable standards for plain-text emails) for its poor webmail interface, which does not properly escape the characters used in the plain-text emails you receive (which do NOT contain any HTML entities), but inserts them blindly within the HTML page created for the webmail interface. Not only is such a webmail interface bogus, it is also dangerous, as it allows arbitrary HTML code to run from plain-text emails. Ask for support and press your ISP to correct its server-side scripts so that it will correctly support plain-text emails!
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
From: Antoine Leca [EMAIL PROTECTED]

Err, not really. MS-DOS *needs to know* the encoding to use, a bit like a *nix application that displays filenames needs to know the encoding in order to use the correct set of glyphs (but the constraints are much heavier). Also, Windows NT Unicode applications know it, because it can't be changed :-). But when it comes to other Windows applications (still the more common) that happen to operate in 'Ansi' mode, they are subject to the hazards of codepage translations. Even if Windows 'knows' the encoding used for the filesystem (as when it uses NTFS or Joliet, or VFAT on NT kernels; in the other cases it does not even know it, much like with *nix kernels), the only usable set is the _intersection_ of the set used to write and the set used to read; that is, usually, it is restricted to US-ASCII, very much like the usable set in the *nix case...

True, but this applies to FAT-only filesystems, which happen to store filenames in an OEM charset which is not stored explicitly on the volume. This is a known caveat even for Unix, when you look at the tricky details of the support of Windows file sharing through Samba, when the client requests a file with a short 8.3 name, which a partition used by Windows is supposed to support. In fact, this nightmare comes from Windows' support of compatibility with legacy DOS applications, which don't know the details and don't use the Win32 APIs with Unicode support. Note that DOS applications use an OEM charset which is part of the user settings, not part of the system settings (see the effects of the CHCP command in a DOS command prompt). FAT32 and NTFS help reconcile these incompatible charsets because these filesystems also store an LFN (Long File Name) for the same files (in that case the short name, encoded in some ambiguous OEM charset, is just an alias, acting exactly like a Unix hard link created in the same directory and referencing the same file).
LFN names are UTF-16 encoded and support mostly the same names as NTFS volumes. However, on FAT32 volumes the short names are mandatory, unlike on NTFS volumes where they can be created on the fly by the filesystem driver, according to the current user setting for the selected OEM charset, without being stored explicitly on the volume. Windows provides, in CHKDSK, a way to verify that the short names of FAT32 filesystems are properly encoded with a coherent OEM charset, using the UTF-16 encoded LFN names as a reference. If needed, corrections to the OEM charset can be applied... This nightmare of incompatible OEM charsets does happen on Windows 98/98SE/ME, when the autoexec.bat file that defines the current user profile does not execute the proper CHCP command as it should, or when this autoexec.bat file has been modified or erased: in that case the default OEM charset (codepage 437) is used, and short filenames are incorrectly encoded. Another complexity is that Win32 applications that use a fixed (not user-settable) ANSI charset, and that don't use the Unicode API, depend on the conversion from the ANSI charset to the current OEM charset. But if a file is handled through directory shares via multiple hosts that have distinct ANSI charsets (i.e. Windows hosts running different localizations of Windows, such as a US installation and a French version on the same LAN), the charsets viewed by these hosts will create incompatible encodings on the same shared volume. So the only stable subset for short names, the one not affected by OS localization or user settings, is the intersection of all possible ANSI and OEM charsets that can be set in all versions of Windows! Needless to say, this designates only the printable ASCII charset for short 8.3 names. Long filenames are not affected by this problem.
Conclusion: to use international characters outside ASCII in filenames used by Windows, make sure that the name is not in 8.3 short format, so that a long filename, in UTF-16, will be created on FAT32 filesystems or on SMBFS shares (Samba on Unix/Linux, Windows servers)... Or use NTFS (but then resolve the interoperability problems with Linux/Unix client hosts, which for now cannot reliably access these filesystems, and which are not completely emulated by the Unix filesystems used by Samba, due to limitations of the LanMan sharing protocol, and limitations of Unix filesystems as well, which rarely use UTF-8 as their preferred encoding...)
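A rough way to apply this conclusion is to test whether a name could itself be a FAT 8.3 short name (in which case it may live only in the ambiguous OEM charset, with no separate UTF-16 LFN entry). The character set below is a simplified assumption on my part, not the exact FAT rule, which also admits some high OEM bytes:

```python
import string

# Approximation of the characters safe in a FAT short name (uppercased).
SHORT_SAFE = set(string.ascii_uppercase + string.digits + "_^$~!#%&-{}@'()")

def is_83_short_name(name: str) -> bool:
    """Sketch: True if `name` fits the 8.3 pattern with OEM-safe
    characters only; such names may be stored without a UTF-16 LFN."""
    base, _, ext = name.partition(".")
    if not base or len(base) > 8 or len(ext) > 3 or "." in ext:
        return False
    return all(c.upper() in SHORT_SAFE for c in base + ext)
```

A name containing accented letters, or longer than 8.3, fails the test and therefore forces the driver to create a UTF-16 long filename entry.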
Re: Software support costs (was: Nicest UTF)
From: Carl W. Brown [EMAIL PROTECTED]

Philippe, Also a broken opening tag for HTML/XML documents In addition to not having endian problems, UTF-8 is also useful when tracing intersystem communications data, because XML and other tags are usually in the ASCII subset of UTF-8 and stand out, making it easier to find the specific data you are looking for.

If you are working on XML documents without parsing them first, at least at the DOM level (I don't say after validation), then any generic string handling will likely fail, because you may break the well-formedness of the document. Note, however, that you are not required to split the document into many string objects: you could as well create a DOM tree whose nodes reference pairs of offsets in the source document, if you did not also have to convert the numeric character references. If not doing so, you'll need to create subnodes within text elements, i.e. work at a level below the normal leaf level in DOM. This is, anyway, what you need to do when there are references to named entities that break the text level; but for simplicity, you would still need to parse CDATA sections to recreate single nodes that may be split by CDATA end/start markers inserted in a text stream that contains the ']]>' sequence of three characters. Clearly, the normative syntax of XML comes first, before any other interpretation of the data in individual parsed nodes as plain text. So in this case, you'll need to create new string instances to store the parsed XML nodes in the DOM tree. Under this consideration, the encoding of the XML document itself plays a very small role, and as you'll need to create a separate copy for the parsed text, the encoding you choose for the parsed nodes with which you build the DOM tree can be independent of the encoding actually used in the source XML data, notably because XML allows many distinct encodings across multiple documents that have cross-references.
This means that implementing a conversion from the source encoding to the working encoding for DOM tree nodes cannot be avoided, unless you limit your parser to handling only some classes of XML documents (remember that XML uses UTF-8 as the default encoding, so you can't ignore it in any XML parser, even if you later decide to handle the parsed node data as UTF-16 or UTF-32). Then a good question is which preferred central encoding you'll use for the parsed nodes; this depends on the parser API you use:
- if the API is written for C with byte-oriented null-terminated strings, UTF-8 will be the best representation (or you may choose GB18030);
- if the API uses a wide-char C interface, UTF-16 or UTF-32 will most often be the only easy solution.
In both cases, because the XML document may contain nodes with null bytes (represented by numeric character references like &#0;), your API will need to return an actual string length. Then, what your application does with the parsed nodes (i.e. whether it builds a DOM tree, or uses nodes on the fly to create another document) is the application's choice. If a DOM tree is built, an important factor will be the size of the XML documents that you can represent and work with in memory as a global DOM tree. Whether these nodes, built by the application, are left in UTF-8 or UTF-16 or UTF-32, or stored in a more compact representation like SCSU, is an application design decision. If XML documents are very large, the size of the DOM tree will also become very large, and if your application then needs to perform complex transformations on the DOM tree, the constant need to navigate the tree means frequent random accesses to tree nodes. If the whole tree does not fit well in memory, this may heavily load the system memory manager, meaning many swaps to disk.
Compressing nodes will help reduce the I/O overhead and will improve data locality, meaning that the cost of decompression will be much lower than the performance gained through reduced system resource usage.

However, within the program itself UTF-8 presents a problem when looking for specific data in memory buffers. It is nasty, time consuming and error prone. Mapping UTF-16 to code points is a snap as long as you do not have a lot of surrogates. If you do, then probably UTF-32 should be considered.

This is not demonstrated by experience. Parsing UTF-8 or UTF-16 is not complex, even in the case of random accesses into the text data, because you always have a small, bounded limit on the number of steps needed to find the starting offset of a fully encoded code point: for UTF-16, this means at most 1 range test and 1 possible backward step. For UTF-8, the limit for random accesses is at most 3 range tests and 3 possible backward steps. UTF-8 and UTF-16 very easily support backward and forward enumerators; so what else do you need to perform any string
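The bounded backward scan described above can be shown in a few lines: from any byte offset in a UTF-8 buffer, at most three steps back over continuation bytes (of the form 0b10xxxxxx) reach the start of the enclosing code point. A minimal sketch:

```python
def utf8_codepoint_start(buf: bytes, i: int) -> int:
    """Return the offset of the first byte of the code point that
    contains byte offset i, stepping back over at most 3 continuation
    bytes (those matching the bit pattern 10xxxxxx)."""
    steps = 0
    while i > 0 and (buf[i] & 0xC0) == 0x80 and steps < 3:
        i -= 1
        steps += 1
    return i
```

For example, in the UTF-8 encoding of "a€b" the euro sign occupies bytes 1-3, so any offset inside it resolves back to offset 1.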
Re: Nicest UTF
From: Philippe Verdy [EMAIL PROTECTED]

From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] Philippe Verdy [EMAIL PROTECTED] writes: The XML/HTML core syntax is defined with fixed behavior of some individual characters like '<', '>', quotation marks, and with special behavior for spaces. The point is: what characters mean in this sentence. Code points? Combining character sequences? Something else?

See the XML character model document... XML ignores combining sequences. For Unicode and for XML, a character is an abstract character with a single code allocated in a *finite* repertoire. The repertoire of all possible combining character sequences is already infinite in Unicode, as is the number of default grapheme clusters they can represent. Note that there are some differently relaxed definitions of what constitutes a character for XML. If you look at XML 1.0 Second Edition, it specifies that the document is a text (defined only as a sequence of characters, which may represent markup or character data) that will only contain characters in this set:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

But the comment following it specifies: any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. This is considerably weaker (because it would include ALL basic controls in the range #x0 to #x1F, and not only TAB, LF, CR); the restrictive definition of Char above also includes the whole range of C1 controls (#x80..#x9F), so I can't understand why the Char definition is so restrictive on controls. In addition, the definition of Char also *includes* many non-characters (it only excludes the surrogates, U+FFFE and U+FFFF, but forgets to exclude U+1FFFE and U+1FFFF, U+2FFFE and U+2FFFF, ..., U+10FFFE and U+10FFFF). So XML does allow Unicode/ISO 10646 non-characters... but not all of them.
Apparently many XML parsers seem to ignore the restriction of Char above, notably in CDATA sections. The alternative is then to use numeric character references, as defined by this even weaker production (in 4.1 Character and Entity References):

CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'

but with this definition: A character reference refers to a specific character in the ISO/IEC 10646 character set, for example one not directly accessible from available input devices. Which is exactly the purpose of encoding something like &#1; to encode a SOH character U+0001 (which after all is a valid Unicode/ISO/IEC 10646 character), or even a NUL character. The CharRef production, however, is annotated by a Well-Formedness Constraint, Legal Character: Characters referred to using character references must match the production for Char. Note, however, that nearly all XML parsers don't seem to honor this constraint (like SGML parsers...)! This was later amended in an erratum for XML 1.0, which now says that the list of code points whose use is *discouraged* (but explicitly *not* forbidden) for the Char production is:

[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF], [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], [#x10FFFE-#x10FFFF].
This clause is not really normative, but just adds to the confusion... Then comes XML 1.1, which extends the restrictive Char production:

Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

with the same comment: any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. So in XML 1.0 the comment was accurate, not the formal production... In XML 1.1, all C0 and C1 controls (except NUL) are now allowed, but the use of some of them is restricted in some cases:

RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]

What is even worse is that XML 1.1 now re-allows NUL for system identifiers and URIs, through escaping mechanisms. Clearly, the XML specification is inconsistent here, and this would explain why most XML parsers are more permissive than the Char production of the XML specification: they simply refer to the definition of valid code points for Unicode and ISO/IEC 10646, excluding only surrogate code points (a valid code point can be a non-character, and can also be NUL...); the XML parser will accept those code points, but will leave validity control to the application using the parsed XML data, or will offer tuning options to enable this Char filter (depending on the XML version...). See also the various errata for XML 1.1, related to RestrictedChar... or to the list of characters whose use is discouraged (meaning explicitly not forbidden, so allowed...): [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF], [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF]
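The XML 1.0 Char production discussed above translates directly into a small predicate; a sketch of the check that a strict parser would apply to each code point (including those arriving through character references):

```python
def is_xml10_char(cp: int) -> bool:
    """XML 1.0 'Char' production: TAB, LF, CR, then three ranges that
    exclude the other C0 controls, the surrogates, and U+FFFE/U+FFFF
    (but, as noted above, not the supplementary-plane noncharacters)."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)
```

Note that this predicate accepts U+1FFFE, U+FDD0, etc., which is exactly the inconsistency with the noncharacter list that the message points out.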
Re: Please RSVP... (was: US-ASCII)
From: Kenneth Whistler [EMAIL PROTECTED]

That it has been morphologically reanalyzed is demonstrated by the fact that it takes regular English verb endings, as in: I RSVPed yesterday, right after I got the email. As I said, it is now a bona fide English verb, and most English speakers will treat it as such.

I didn't know that. Is this a very recent use? In France, I think that RSVP was introduced and widely used at the end of telegraphic messages (which contained lots of conventional acronyms); it survived into the time of the telex, and is now revived in SMS messages on cellular phones, but it is rarely used in emails. Maybe it was introduced into English in the old days of the telegraph as a useful abbreviation, but with a different meaning when used as a verb, for saying reply as requested?
Re: Nicest UTF
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]

Regarding A, I see three choices: 1. A string is a sequence of code points. 2. A string is a sequence of combining character sequences. 3. A string is a sequence of code points, but it's encouraged to process it in groups of combining character sequences. I'm afraid that anything other than a mixture of 1 and 3 is too complicated to be widely used. Almost everybody represents strings either as code points, or as even lower-level units like UTF-16 code units. And while 2 is nice from the user's point of view, it's a nightmare from the programmer's point of view.

Consider that the normalized forms try to approach choice number 2, creating more predictable combining character sequences which can still be processed by algorithms as just streams of code points. Remember that the total number of possible code points is finite, but not the total number of possible combining sequences, meaning that text handling will necessarily have to make decisions based on a limited set of properties. Note, however, that for most Unicode strings the composite character properties are those of the base character of the sequence. Note also that for some languages/scripts, the linguistically correct unit of work is the grapheme cluster; Unicode just defines default grapheme clusters, which can span several combining sequences (see for example the Hangul script, written with clusters made of multiple combining sequences, where the base character is a Unicode jamo, itself sometimes made of multiple simpler jamos that Unicode does not allow to be decomposed as canonically equivalent strings, even though this decomposition is inherent in the structure of the script itself, and not bound to the language, which Unicode will not standardize). It's hard to create a general model that will work for all scripts encoded in Unicode. There are too many differences.
So Unicode just appears to standardize a higher level of processing, with combining sequences and normalization forms that better approximate the linguistics and semantics of the scripts. Consider this level as an intermediate tool that helps simplify the identification of processing units. The reality is that a written language is more complex than anything that can be captured in a single definition of processing units. For many similar reasons, the ideal working model is one of simple and enumerable abstract characters with a finite number of code points, from which actual, non-enumerable characters can be composed. But the situation is not ideal for some scripts, notably ideographic ones, due to their very complex and often inconsistent composition rules and layout, which require allocating many code points, one for each combination. Working with ideographic scripts requires many more character properties than other scripts (see for example the huge and varied properties defined in UniHan, which are still not standardized due to the difficulty of representing them and the slow discovery of errors, omissions, and contradictions found in the various sources for this data...)
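Choice 2 above (strings as sequences of combining character sequences) can be approximated in a few lines using Unicode general categories. This is a deliberate simplification: it groups each base character with the combining marks (Mn/Mc/Me) that follow it, and does not implement full default grapheme clusters (Hangul jamo runs, for instance, are not handled).

```python
import unicodedata

def combining_sequences(s: str) -> list[str]:
    """Split a string into approximate combining character sequences:
    each base character absorbs the combining marks that follow it."""
    out: list[str] = []
    for ch in s:
        if out and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            out[-1] += ch          # attach the mark to the current base
        else:
            out.append(ch)         # start a new sequence
    return out
```

Even this toy segmentation shows why choice 2 is costly: the units are unbounded in length and cannot be indexed in constant time, which is why most libraries keep code points (choice 1/3) as the storage model.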
Re: Roundtripping in Unicode
From: Doug Ewell [EMAIL PROTECTED]

Lars Kristan wrote: I am sure one of the standardizers will find a Unicodally correct way of putting it. I can't even understand that paragraph, let alone paraphrase it.

My understanding of his question, and my response to his problem, is that you MUST NOT use VALID Unicode code points to represent INVALID byte sequences found in some text with an alleged UTF encoding. The only way is to use INVALID code points, outside the Unicode space, and then design an encoding scheme that contains and extends the Unicode UTF, making sure that there will be no possible interaction between such encoded binary data and encoded plain text (so the conversion between the encoding scheme of the byte stream and the encoding form with code units or code points in memory must be fully bijective; this is hard to design if you also have to support multiple UTF encoding schemes, because the invalid byte sequences of these UTF schemes are not the same, and must then be represented with distinct invalid code points or code units for each external UTF!) I won't support the idea of reserving some valid code point in the Unicode space for storing something which is already considered invalid character data, notably because the Unicode standard is evolving, and such a private encoding form, which would work now, could become incompatible with a later version of the Unicode standard, or a later standardized Unicode encoding scheme, meaning that interoperability would be lost... The only thing for which you have a guarantee that Unicode will not assign a mandatory behavior is the code point space after U+10FFFF (I'm not sure about the permanent invalidity of some code unit ranges in the UTF-8 and UTF-16 encoding forms; also I'm not sure that there will be enough free space in later standard encoding forms or schemes; see for example SCSU or BOCU-1, or other already-used private encoding forms like the modified UTF-8 extended encoding scheme defined by Sun in Java).
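For comparison (not the scheme argued for above, which wants code points entirely outside the Unicode space): Python later standardized a related round-tripping compromise in PEP 383, the "surrogateescape" error handler, which maps each undecodable byte 0xNN to the lone surrogate U+DCNN. Lone surrogates are themselves invalid in any conformant UTF interchange, so the escaped data cannot collide with validly encoded text, and the conversion is bijective per byte:

```python
# PEP 383 "surrogateescape": invalid bytes round-trip through str losslessly.
raw = b"caf\xe9"                                   # not valid UTF-8 (0xE9 alone)
s = raw.decode("utf-8", errors="surrogateescape")  # 0xE9 -> U+DCE9
assert s == "caf\udce9"
assert s.encode("utf-8", errors="surrogateescape") == raw  # exact bytes back
```

The message's objection still applies to such schemes: the escaped string is not valid Unicode text and must never leak into interchange without being re-encoded back to the original bytes.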
Re: Please RSVP... (was: US-ASCII)
From: Séamas Ó Brógáin [EMAIL PROTECTED]

John wrote: As far as I know, they were first used in formal invitations (to weddings, funerals, dances, etc.) in the corner of the card, as both shorter and more fancy than the older phrase The favor of your reply is requested. This is correct. The practice dates from the end of the nineteenth century.

At that time, transmission of text over long distances used telegraphic systems, where texts had to be short because they were expensive, and because the available bandwidths were very limited in order to support many customers, notably for long-distance and international communications. I would not be surprised if this acronym was defined in some internationally accepted set of abbreviations used by telegraphists, so that their clients became exposed to these acronyms when reading telegrams received from their local post office, which did not take the time to convert the acronyms back into full words... I have read some articles about the existence in telegraphic standards of such lists of abbreviations. Isn't there a remaining, possibly deprecated, ISO standard about them? (For example, there was the 5-bit system, because it was important to limit the available charset, and to limit the bandwidth required to transmit messages, at a time when research on data compression was not as advanced and successful as today, and when the computing resources or human capabilities needed to decode complex compression schemes would have been too expensive, or impossible to satisfy on a large scale.)
Re: infinite combinations, was Re: Nicest UTF
From: Peter R. Mueller-Roemer [EMAIL PROTECTED] For a fixed length of combining character sequence (base + 3 combining marks is the most I have seen graphically distinguishable) the repertoire is still finite.

I do think that you are underestimating the repertoire. Also, Unicode does NOT define an upper bound on the length of combining sequences, nor on the length of default grapheme clusters (which can be composed of multiple combining sequences, for example in the Hangul or Tibetan scripts). Your estimate also ignores various layouts found in Asian texts, and the particular structures of historic texts, which can stack many diacritics on top of a single base letter starting a combining sequence. The model of some scripts (for example Hebrew) implies the juxtaposition of up to 13 or 15 levels of diacritics on the same base letter! In practice, it's impossible to enumerate all existing combinations (and to ensure that each will be assigned a unique code within a reasonably limited code point space), and that's why Unicode uses a simpler model based on more basic but combinable code points: it frees Unicode from having to encode all of them (this is already a difficult task for the Han script, which could have been encoded with combining sequences if the algorithms needed to create the necessary layout had not required so many complex rules and so many exceptions...)
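The point about unbounded combining sequences is easy to verify in code. The sketch below stacks five combining marks (chosen arbitrarily for illustration) on a single base letter; nothing in Unicode or in the library caps that count.

```python
import unicodedata

# One base letter 'a' followed by five combining marks:
# acute, diaeresis, dot below, macron below, bridge above.
seq = 'a' + '\u0301\u0308\u0323\u0331\u0346'

# Every trailing code point really is a combining mark (nonzero combining class).
assert all(unicodedata.combining(c) > 0 for c in seq[1:])

# Six code points, yet typically rendered as a single grapheme cluster.
assert len(seq) == 6

# Canonical equivalence survives normalization regardless of sequence length.
assert (unicodedata.normalize('NFD', unicodedata.normalize('NFC', seq))
        == unicodedata.normalize('NFD', seq))
```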
Re: Please RSVP... (was: US-ASCII)
From: Michael Everson [EMAIL PROTECTED] Nonsense. You might as well try to explain SPQR on the same basis. I won't. I know that SPQR was used on architectural constructions as a symbol of the Roman Empire, and that it was a well-known acronym of a Latin expression. It largely predates the invention of the telegraph. My only comment was related to the date of origin of the acronym; the coincidence may well be accidental. And it ignores the fact that RSVP was printed on posted invitation cards; such invitations were not, as a rule, sent by telegraph.

Another site gives a different historical context for this expression: the etiquette of the French court of King Louis XIV in the 17th century, and the use of French etiquette throughout Europe and in the United States up to the 19th century: http://people.howstuffworks.com/question450.htm So the etiquette phrase would simply have continued in use, as a well-known acronym and a convenience, when telegrams were invented. After some searching, I just discovered an old notice of the French Poste, listing acronyms and abbreviations to be used preferentially by telegraphists... RSVP is present in that list, among other abbreviations used to encode the routing and delivery options of the telegram itself. Probably an interesting example of one of the first communication protocol standards, designed to limit misinterpretation.
Re: Roundtripping in Unicode
My view on this round-tripping problem is that if data supposed to contain only valid UTF-8 sequences contains some invalid byte sequences, and those bytes still need to be mapped to some code point for internal handling and round-tripped later to the original invalid byte sequence, then these invalid bytes MUST NOT be converted to valid code points.

An implementation based on an internal UTF-32 representation could use, privately only, the range which is NOT assigned to valid Unicode code points; such an application would need to convert these bytes into code points higher than 0x10FFFF, but it would then no longer conform to the strict UTF-32 requirements: the application would be representing binary data which is NOT bound by Unicode rules and which cannot be valid plain text. For example, {0x110000 + n}, where n is the byte value to encapsulate. Don't call it UTF-32, because it MUST remain for private use only!

This will be more complex if the application uses UTF-16 code units, because there are only TWO code units (the noncharacters 0xFFFE and 0xFFFF) that can be used to mark such non-text data within a text stream. It is possible to do it, but with MUCH care: for example, by encoding 0xFFFE before each byte value converted to a 16-bit code unit. The problem is that backward parsing of strings just checks whether a code unit is a low surrogate, to see whether a second backward step is needed to reach the leading high surrogate; so U+FFFE would need to act (privately only) as another lead code unit, playing the role of a high surrogate with a special internal meaning for round-trip compatibility, and the best choice for the code unit carrying the invalid byte value is therefore a standard low surrogate. So a qualifying internal representation would be {0xFFFE, 0xDC00 + n}, where n is the byte value to encapsulate. Don't call this UTF-16, because it is not UTF-16.
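The {0xFFFE, 0xDC00 + n} pair scheme described above can be sketched as follows, with plain Python integers standing in for 16-bit code units (the function names are illustrative assumptions, and the result is deliberately not UTF-16):

```python
def escape_byte(n: int) -> list[int]:
    """Encapsulate one raw byte as the private pair {0xFFFE, 0xDC00 + n}."""
    assert 0 <= n <= 0xFF
    return [0xFFFE, 0xDC00 + n]

def scan_units(units: list[int]) -> list[tuple[str, int]]:
    """Classify a unit stream into ('byte', n) escapes and ('unit', u) text."""
    out, i = [], 0
    while i < len(units):
        if (units[i] == 0xFFFE and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDCFF):
            out.append(('byte', units[i + 1] - 0xDC00))  # recover the raw byte
            i += 2
        else:
            out.append(('unit', units[i]))
            i += 1
    return out
```

Because the second unit of the pair is a normal low surrogate, a backward parser that steps over "low surrogate, then its lead unit" still lands on a unit boundary, which is the design constraint the paragraph above describes.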
An implementation that uses UTF-8 for valid strings could use the invalid ranges of lead bytes to encapsulate invalid byte values. Note however that the invalid bytes you would need to represent have 256 possible values, but UTF-8 has only 2 invalid lead byte values (0xC0 and 0xC1), covering 64 codes each, if you want a two-byte encoding. The alternative is to use the UTF-8 lead byte values which were initially assigned to byte sequences longer than 4 bytes, and which are now unassigned/invalid in standard UTF-8. For example: {0xF8 + (n / 64), 0x80 + (n % 64)}. Here also it will be a private encoding, which should NOT be named UTF-8, and the application should clearly document that it will accept not only any valid Unicode string, but also some invalid data with round-trip compatibility.

So what is the problem? Suppose that the application, internally, starts to generate strings containing occurrences of such private sequences: it then becomes possible for the application to emit on its output a byte stream that does NOT round-trip back to the private representation. Round-tripping is only guaranteed for streams converted FROM a UTF-8 source in which some invalid sequences are present and must be preserved by the internal representation. So the transformation is not as bijective as you might think, and this potentially creates many security issues. For such an application, it would be much more appropriate to use different datatypes and structures to represent streams of binary bytes and streams of characters, and to recognize them independently. The need for a bijective representation means that the input stream must carry an encapsulation indicating *exactly* whether the stream is text or binary. If the application is a filesystem storing filenames, and there's no place in the filesystem to record whether a filename is binary or text, then you are left without any secure solution!
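The two-byte escape formula above can be sketched directly. The names are illustrative assumptions; the output reuses lead bytes 0xF8..0xFB, which standard UTF-8 no longer assigns, so it must not be called UTF-8:

```python
def escape(n: int) -> bytes:
    """Encapsulate raw byte n (0..255) as the pair {0xF8 + n//64, 0x80 + n%64}."""
    assert 0 <= n <= 0xFF
    return bytes([0xF8 + n // 64, 0x80 + n % 64])

def unescape(pair: bytes) -> int:
    """Inverse of escape(): recover the original raw byte from the pair."""
    lead, trail = pair
    assert 0xF8 <= lead <= 0xFB and 0x80 <= trail <= 0xBF
    return (lead - 0xF8) * 64 + (trail - 0x80)
```

Since n // 64 is at most 3, the lead byte stays within 0xF8..0xFB, and the trail byte looks like an ordinary UTF-8 continuation byte, so the pair can never collide with a well-formed UTF-8 sequence.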
So the best thing you can do to secure your application is to REJECT/IGNORE all files whose names do not match the strict UTF-8 encoding rules that your application expects (everything will happen as if those files were not present, but this may still create security problems: an application that does not see any file in a directory may want to delete that directory, assuming it is empty... In that case the application must be ready to accept the presence of directories without any visible content, and must not depend on the presence of a directory to conclude that it has some contents; anyway, on secured filesystems, such things can happen due to access restrictions completely unrelated to the encoding of filenames, and it is not unreasonable to prepare the application so that it behaves correctly when faced with inaccessible files or directories; it will then also correctly handle the fact that the same filesystem contains non-plain-text, inaccessible filenames). Anyway, the solutions exposed above demonstrate that such round-tripping is only possible through private, clearly documented conventions that must never be confused with the standard UTFs.
Re: RE: Roundtripping in Unicode
Lars Kristan wrote: What I was talking about in the paragraph in question is what happens if you want to take unassigned codepoints and give them a new status.

You don't need to do that. No Unicode application may assign semantics to unassigned code points. If a source sequence is invalid, and you want to preserve it, then this sequence must remain invalid if you change its encoding. So there's no need for Unicode to assign valid code points to invalid source data. There is enough space *assigned* as invalid (or assigned to noncharacters) in all UTFs to allow an application to create a local conversion scheme which performs a bijective conversion of invalid sequences:
- for example in UTF-8: trailing bytes 0x80 to 0xBF, isolated or in excess, or even the invalid lead bytes 0xF8 to 0xFF;
- for example in UTF-16: 0xFFFE and 0xFFFF;
- for example in UTF-32: the same as UTF-16, plus all code units above 0x10FFFF.

Using the PUA space, or some unassigned space in Unicode, to represent invalid sequences present in a source text would be a severe design error in all cases, because that conversion would not be bijective and could map invalid sequences to valid ones without further notice, changing the status of the original text, which should be kept as incorrectly encoded until explicitly corrected, or until the source text is reparsed with another, more appropriate encoding. (In fact I also think that mapping invalid sequences to U+FFFD is an error, because U+FFFD is valid: the presence of the encoding error in the source is lost, and will not throw exceptions in further processing of the remapped text, unless the application constantly checks for the presence of U+FFFD in the text stream and all modules in the application explicitly forbid U+FFFD within their interfaces...)
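An existing implementation of exactly this principle is Python's `surrogateescape` error handler (PEP 383): each undecodable byte b is carried through as the lone surrogate U+DC00+b, which remains invalid as well-formed text but round-trips bijectively back to the raw byte.

```python
raw = b'caf\xe9'                       # Latin-1 bytes posing as UTF-8
text = raw.decode('utf-8', errors='surrogateescape')
assert text == 'caf\udce9'             # 0xE9 escaped as lone surrogate U+DCE9

# The escape is bijective: re-encoding restores the exact original bytes.
assert text.encode('utf-8', errors='surrogateescape') == raw

# And the data stays marked as invalid: strict UTF-8 still rejects it.
try:
    text.encode('utf-8')
except UnicodeEncodeError:
    pass                               # expected: lone surrogates are not UTF-8
```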
Re: Roundtripping in Unicode
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] Lars Kristan [EMAIL PROTECTED] writes: Hm, here lies the catch. According to UTC, you need to keep processing the UNIX filenames as BINARY data. And, also according to UTC, any UTF-8 function is allowed to reject invalid sequences. Basically, you are not supposed to use strcpy to process filenames. No: strcpy passes raw bytes, it does not interpret them according to the locale. It's not a UTF-8 function.

Correct: [wc]strcpy() handles string instances, but not all string instances are plain text, so they don't need to obey the UTF encoding rules (they just obey the convention of null termination, with no restriction on the string length, measured as a size in [w]char[_t] units rather than as a number of Unicode characters). This is true for the whole standard C/C++ string library, as well as in Java (the String and Character objects or the native char datatype), and in almost all string-handling libraries of common programming languages. A locale defined as UTF-8 will run into lots of problems because of the various ways applications behave when they encounter encoding errors in filenames: exceptions thrown that abort the program, substitution by ? or U+FFFD causing the wrong files to be accessed, some files left unprocessed because their name was considered invalid although they were effectively created by some user of another locale... Filenames are identifiers coded as strings, not as plain text (even if most of these filename strings are plain text).
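POSIX filenames being byte strings rather than plain text, as argued above, is exactly what Python's `os.fsencode()` and `os.fsdecode()` address: they convert between the byte and string views using the `surrogateescape` handler, so a filename containing invalid UTF-8 survives the round trip instead of raising an exception or being replaced by U+FFFD.

```python
import os

name_bytes = b'report\xff.txt'         # 0xFF makes this invalid UTF-8
name_str = os.fsdecode(name_bytes)     # lossless str view (uses a lone surrogate)

# The conversion is bijective: the original bytes are recovered exactly.
assert os.fsencode(name_str) == name_bytes
```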
The solution is then to use a locale based on a relaxed version of UTF-8 (some have spoken about defining NOT-UTF-8 and NOT-UTF-16 encodings to allow any sequence of code units, but nobody has thought about how to make NOT-UTF-8 and NOT-UTF-16 mutually and fully reversible; now add NOT-UTF-32 to this nightmare and you will see that NOT-UTF-32 would need to encode 2^32 distinct NOT-Unicode code points, and that they must map bijectively onto exactly all the 2^32 sequences possible in NOT-UTF-16 and NOT-UTF-8; I have not found a solution to this problem, and I don't know whether such a solution even exists; if it does, it must be quite complex...).