Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)
On Thu, Jul 4, 2013 at 12:07 PM, Michael Everson ever...@evertype.com wrote:

On 4 Jul 2013, at 03:56, Phillips, Addison addi...@lab126.com wrote: I don't disagree with the potential need for changing the decomposition. That discussion seems clear and is only muddled by talking about variant, language-sensitive rendering. That isn't the only consideration, right?

No, Addison, we can't change the decomposition. That would invalidate all the data everywhere in Latvia.

I disagree that language tagging is not a valid means of getting language-specific shaping (which could solve a specific problem). This is hardly confined to CJK or Latvian. Minority languages can, in fact, take advantage of it, within reason (documentation is a problem, and it presupposes that glyph support is available). In fact, in some ways, language-based glyph selection is possibly easier to achieve because the number of implementations is relatively small.

The problem is in pretending that a cedilla and a comma below are equivalent because some script fonts in France or Turkey routinely write some sort of undifferentiated tick for ç. :-)

Sure they are not equivalent, but stop pretending it is only in some script fonts; the page http://typophile.com/node/49347 has plenty of examples where it is not in script fonts. In some languages the cedilla can have a shape similar to that of a comma; it's a fact. Any native speaker will tell you the comma-like form and others are acceptable. Just look at lemonde.fr or zaman.com.tr: both very popular newspapers use webfonts with non-classic cedillas (Le Monde uses TheMix, and even in print it uses TheAntiqua with its comma-like cedilla, and Zaman uses a custom font with an attached tick-like cedilla). This is not the majority, but it is frequent enough.

As far as I can see the only solution is: Mandate that only the comma-below shape is appropriate for Ḑḑ Ģģ Ķķ Ļļ Ņņ Ŗŗ despite their decomposition to cedilla.
Encode a set of undecomposable Dd Gg Kk Ll Nn Rr with invariant cedilla for display of that glyph with those base letters. The only strangeness here is that D̦d̦ G̦g̦ K̦k̦ L̦l̦ N̦n̦ R̦r̦ with genuine combining comma below are confusable with the Latvian/Livonian letters, but that is already the case.

None of this addresses the problem of plain-text representation or the potential need to represent what are apparently different characters with a single encoding. But if it is just presentation we're talking about... how does this differ from, for example, Serbian vs Russian?

What, the italic lowercase т? That is really not comparable to this issue.

Michael Everson * http://www.evertype.com/

--
Denis Moyogo Jacquerye
African Network for Localisation http://www.africanlocalisation.net/
Nkótá ya Kongó míbalé --- http://info-langues-congo.1sd.org/
DejaVu fonts --- http://www.dejavu-fonts.org/
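The decomposition point at the heart of this exchange can be checked directly with Python's standard `unicodedata` module. A minimal sketch: Latvian ģ (U+0123) canonically decomposes to g + U+0327 COMBINING CEDILLA, while g + U+0326 COMBINING COMMA BELOW has no precomposed form and is merely confusable with it, not equivalent.

```python
import unicodedata

# Latvian ģ (U+0123) canonically decomposes to g + COMBINING CEDILLA (U+0327),
# even though the expected Latvian glyph looks like a comma below.
g_cedilla = "\u0123"
assert unicodedata.normalize("NFD", g_cedilla) == "g\u0327"

# g + COMBINING COMMA BELOW (U+0326) has no precomposed form: NFC leaves it
# decomposed, and it is NOT canonically equivalent to U+0123.
g_comma = "g\u0326"
assert unicodedata.normalize("NFC", g_comma) == g_comma
assert unicodedata.normalize("NFC", g_comma) != g_cedilla
print("confusable but not equivalent")
```

This is why "just change the decomposition" is off the table: normalization behavior is frozen by the Unicode stability policy, so existing Latvian data depends on U+0123 ↔ g + U+0327 staying canonical.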
Re: writing in an alphabet with fewer letters: letter replacements
On Thu, 04 Jul 2013 22:19:11 -0700 Stephan Stiller stephan.stil...@gmail.com wrote: I know of standards for transcribing foreign alphabets (by /target/ locale – Are they relevant here? If so, which?), but This may well depend on both source and target locale! How often will locale have to be broken down on a non-local basis? Different newspapers in the same city may have different conventions! ...there must also be popular practices (by /source/ locale) that have developed among each locale's residents for traveling. Also old conventions for limited telegraphic equipment and computer systems. For example, there is a Romanian tradition of converting combining squiggle below to a following 'z'. I've not seen that tradition used by British newspapers - the squiggle is just dropped. I've seen French comments in Fortran that just drop all the accents - most disconcerting to read! Richard.
Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)
All this discussion is going nowhere. What would be more decisive would be the fact that these shapes for cedillas had contrasting uses in any language. As far as I can tell, this has not been demonstrated (not even in Romanian). So the proposal is to disunify characters that are already encoded, with the addition of new confusables. Do we need that for any distinction in any language? Does Marshallese really care about the shape of its cedilla or comma below when they are perceived as equivalent, or used interchangeably? For now we've not seen any assertion stating that the use of the wrong preferred shape was an orthographic error. The orthography is flexible enough not to care about such visual differences, and that's why font styles can also be changed without changing the actual meaning of text. That's why nobody ever complained that LeMonde.fr used a comma-like cedilla for French. People don't care about this visual detail; it is just a stylistic choice.

But people will care about stable orthographies if plain-text searches or matches don't find the text they are supposed to find. That's why for French we now need a collation rule that will equate all these shape variants (when they are not canonically equivalent). But adding more visual confusables will just impact French now, by forcing us to add an additional collation rule to equate these non-canonical equivalents. This done, we will continue to ignore these differences in French (with a very minor binary difference for searches). But for exact matches (when the encoded text is used as an identifier, such as filenames), we will still want to make sure that encoded strings are canonically equivalent. This won't be possible with the new proposed characters (meaning that they won't be used in French). And for Marshallese, these new recommended alternatives will create new difficulties, without really solving the problem.
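The search problem described above is that comma below (U+0326) and cedilla (U+0327) are not canonically equivalent, so plain normalization never equates them. A hypothetical search-time fold (an illustration only, not an actual CLDR or ISO collation rule) would have to equate them explicitly:

```python
import unicodedata

def fold_comma_to_cedilla(s: str) -> str:
    """Hypothetical search fold: treat COMBINING COMMA BELOW (U+0326)
    as COMBINING CEDILLA (U+0327), since normalization alone keeps
    them distinct."""
    decomposed = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFC", decomposed.replace("\u0326", "\u0327"))

# Romanian s-comma (U+0219) vs Turkish s-cedilla (U+015F):
# distinct under plain NFC comparison ...
assert unicodedata.normalize("NFC", "\u0219") != unicodedata.normalize("NFC", "\u015f")
# ... but equal under the fold:
assert fold_comma_to_cedilla("\u0219") == fold_comma_to_cedilla("\u015f")
```

Any newly encoded undecomposable cedilla letters would need one more such rule, which is exactly the extra matching burden the message complains about.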
I don't think that language tagging is even necessary: a Marshallese user who wants to see commas or cedillas as he prefers for these characters can just set his own personal preferences in his user profile, stating that he prefers reading text as in Marshallese; these encoded cedillas/commas below will then be rendered as expected for Marshallese, even if the encoded text is not written in Marshallese. And I don't think that anyone will complain (except if that user cannot use a Marshallese locale because it is not supported by his software environment, something that can change, in which case he will just see documents, web sites, and OS interfaces localized in another language, in which the visual rendering of Marshallese will still be coherent with the context of another language that has different visual preferences). I think that this situation is similar to the visual representation of sinograms, which depends on the user's locale preferences: whether he is Japanese, Korean, continental Chinese, southern Chinese (in Hong Kong or Macau), Singaporean, or lives in another country anywhere else in the world. The situation can be solved by adding user preferences in his local environment to select the preferred set of shapes when explicit language tagging of documents is not applicable.

2013/7/5 Denis Jacquerye moy...@gmail.com
Re: writing in an alphabet with fewer letters: letter replacements
Hi Richard, I know of standards for transcribing foreign alphabets (by /target/ locale – Are they relevant here? If so, which?) [...] This may well depend on both source and target locale! How often will locale have to be broken down on a non-local basis? Different newspapers in the same city may have different conventions! It also depends on the time/era, and sometimes there's just a mess. I recall the chaotic variation in the rendering of names of Eastern European composers in German (and that of foreign names in general). And I think it's futile to try to precisely describe journalistic practice in this domain. What I had in mind was more specific: Germans are supposed to convert [ä,ö,ü,ß] to [ae,oe,ue,ss], though I don't know what's considered best/legal wrt documents required for entering the US, for example. I was thinking that there might be a similar Icelandic tradition of mapping non-{A, ..., Z}-letters into the {A, ..., Z}∪punctuation space, for the purpose of filling out forms in another country and such. In that regard, I was wondering whether any of the numerous transcription schemes that are floating around (and are sometimes backed by standardization bodies) play a role here and are prescriptive or (if they are not prescriptive) followed to some extent. For example, there is a Romanian tradition of converting combining squiggle below to a following 'z'. squiggle – you're reminding me of /that other thread/ going on right now ;-) drop all the accents That (more generally: finding a root / approximation / approximating digraph) might be the most common method (my wild guess based on casual observation, and it's not exactly a particularly difficult guess to make), but for ð/þ it's not clear what people will do. I'll save the list the obvious speculation and let someone who knows answer directly. Stephan
Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)
On 2013/07/05 16:04, Denis Jacquerye wrote: On Thu, Jul 4, 2013 at 12:07 PM, Michael Everson ever...@evertype.com wrote: The problem is in pretending that a cedilla and a comma below are equivalent because in some script fonts in France or Turkey routinely write some sort of undifferentiated tick for ç. :-) Can we make sure we have covered this from the other side? Are there any languages where there is a letter where both the form with a cedilla and the form with a comma below are used, and are distinguished? In other words, are there any languages where a user seeing a wrong form would be confused as to what the letter is, rather than being potentially surprised or annoyed at the details of the shape? Regards, Martin.
Re: writing in an alphabet with fewer letters: letter replacements
2013/7/5 Richard Wordingham richard.wording...@ntlworld.com: I've seen French comments in Fortran that just drop all the accents - most disconcerting to read!

This is an old problem. It first appeared because of the lack of Unicode support in famous historic programming languages (and it persists today in common languages like C, C++, Basic, Fortran, Cobol, Ada, Pascal...), but now it should be noted that many French programmers have a very poor level of orthography and don't know how to select the proper accents (they also often don't care about correct capitalization, using English rules in their programs, or copying the capitalization of English in program UIs, such as capitals on every word, even where it would not normally be used in English; correct punctuation is also frequently ignored. In program comments, nobody cares about that, because users of applications will normally not see them, and because i18n and l10n are handled elsewhere in code and data).

If you are trying to learn French, you won't understand the contents found in many popular French-speaking talk areas or forums (and it will be worse in many short Facebook statuses, tweets, SMS, and chats: if you're not comfortable with the most common deviations, you will hate these places, and will not use them very often for some time, then will learn how to use them only for selected users and communities...). They are generally horrible, and there's even a widely accepted policy of not complaining about the orthography used by others (because this raises flaming and pollutes more). Yes, we are horrified, but most will silently adapt to that fact. After all, this is a living language, and languages will evolve with such simplifications, or initial abuses that will be integrated sooner or later.
As long as nobody complains against those that use the standard orthography, and there's still an effort to make it understood, the orthography will accept some of these simplified forms, if they don't introduce too much confusion. We remember the battles about the 1990 French orthographic reform: it was highly criticized, but now people accept it indifferently... except some administrations that still want to use an official jargon with standardized words, expressions, and orthographies.
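The accent-dropping in Fortran comments described above amounts to decomposing text and discarding the combining marks. A rough Python sketch of that (lossy) operation:

```python
import unicodedata

def strip_accents(text: str) -> str:
    # Decompose, then drop all combining marks: the crude flattening
    # described above. It is lossy -- French é, è, and ê all become e.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

assert strip_accents("déjà çà et là") == "deja ca et la"
```

The disconcerting-to-read effect is precisely this loss: distinct words can collapse onto the same accentless spelling.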
Re: writing in an alphabet with fewer letters: letter replacements
On 2013/07/05 17:25, Stephan Stiller wrote: What I had in mind was more specific: Germans are supposed to convert [ä,ö,ü,ß] to [ae,oe,ue,ss], though I don't know what's considered best/legal wrt documents required for entering the US, for example. I have always used Duerst on plane tickets and the like. On the customs form that you have to fill in when entering the US (the green one), I always just write Dürst; paper is patient. I have added Durst as an additional alternate spelling on a long-term visa application form once, just in case. My impression is that US customs officials are either quite knowledgeable or quite tolerant on such issues (or a mixture of both). The same applies to customs officials in other countries I have traveled to, and other people at airports and such. I guess they get used to these cases quite quickly, seeing so many passports each day. Regards, Martin.
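The German fallback convention discussed in this exchange (ä→ae, ö→oe, ü→ue, ß→ss, hence Dürst→Duerst) is mechanical enough to sketch as a character map. Note it is lossy in the reverse direction: a genuine "ue" and a transliterated "ü" become indistinguishable.

```python
# The conventional German fallback mapping (Dürst -> Duerst).
GERMAN_FALLBACK = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})

assert "Dürst".translate(GERMAN_FALLBACK) == "Duerst"
assert "Straße".translate(GERMAN_FALLBACK) == "Strasse"
```

(An all-caps ß would conventionally become SS, a case this small table does not handle.)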
RE: writing in an alphabet with fewer letters: letter replacements
Google is your friend – some clues:
http://www.forbes.com/profile/johanna-sigurdardottir/
http://en.wikipedia.org/wiki/Althingi

Jony

From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Stephan Stiller
Sent: Friday, 5 July 2013 11:25
To: unicode@unicode.org
Subject: Re: writing in an alphabet with fewer letters: letter replacements
Re: writing in an alphabet with fewer letters: letter replacements
Hi Jonathan, I definitely appreciate the partial datapoints from your links, but Google is your friend by itself doesn't lead us closer to a real answer, and in this case I think that there are at least some good answers, and in any case some answers will be better than others. This reminds me of former South Korean president 이승만 (not exactly a sympathetic figure), whose most common English rendering (Syngman Rhee) doesn't follow any system of transcription I'm aware of. (For Chinese, historical figures seem to be predominantly rendered in pinyin now, though I haven't tried to do a thorough check including TW etc, and Sun Yat-sen is a famous exception. I think Korean figures mostly follow the Revised Romanization now, but Rhee persists and stands out.) Another interesting case I know is that of a Bhutanese gentleman I met in an airport: the name in his passport wasn't listed in the original Dzongkha (with Bhutanese Tibetan writing) at all (and nowhere in the passport, according to him) but only with Latin letters. Stephan
RE: writing in an alphabet with fewer letters: letter replacements
The fallback for ETH (ð,Ð) is normally d,D and the fallback for THORN (þ,Þ) is normally th,Th. I’m not aware of any authoritative source for all of the fallbacks. Several years ago there was a CEN project trying to define the European fallbacks, but the project team could not deliver something generally acceptable. Regards, Erkki

From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Stephan Stiller
Sent: 5 July 2013 8:19
To: Unicode Public
Subject: writing in an alphabet with fewer letters: letter replacements

Hi folks, For languages whose alphabets aren't too far apart (I'm thinking mostly of the set of Latin-derived alphabets), what is a good place for finding out how letter replacements for letters that are missing in a different country/locale are done? For example, how will an Icelander normally write his name on a form in a foreign country that is lacking ð and þ? I know of standards for transcribing foreign alphabets (by target locale – Are they relevant here? If so, which?), but there must also be popular practices (by source locale) that have developed among each locale's residents for traveling. Are there comprehensive resources? If not, are efforts underway? Stephan
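The ETH/THORN fallbacks Erkki states can be written as a simple table. Since, as he notes, there is no authoritative source, the mapping below is illustrative only:

```python
# Customary Icelandic fallbacks per the message above: ETH -> d/D,
# THORN -> th/Th. No authoritative standard exists; illustrative only.
ICELANDIC_FALLBACK = str.maketrans({
    "ð": "d", "Ð": "D",
    "þ": "th", "Þ": "Th",
})

assert "Þórður".translate(ICELANDIC_FALLBACK) == "Thórdur"
assert "Sigurðardóttir".translate(ICELANDIC_FALLBACK) == "Sigurdardóttir"
```

Accented vowels (ó, á, í) are untouched here; whether they are kept, accent-stripped, or digraphed is exactly the kind of unanswered question the thread is about.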
Re: writing in an alphabet with fewer letters: letter replacements
My impression is that US customs officials are either quite knowledgeable or quite tolerant on such issues (or a mixture of both). The same applies to customs officials in other countries I have traveled to, and other people at airports and such. Thanks. (And, I don't have the knowledge to agree or disagree.) I can't resist mentioning the case of Edward Snowden's middle name in Hong Kong :-) The issue there was a different one from what I am asking about, though, and you never know whether such things actually make a difference. But in general, trying to figure things out and then being consistent will be good advice. Stephan
RE: writing in an alphabet with fewer letters: letter replacements
Hi Stephan, Tell me about it. The official transliteration for Hebrew to the Latin script is obsolete, and the situation in this country is a mess. Best regards, Jonathan (Jony) Rosenne

From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Stephan Stiller
Sent: Friday, 5 July 2013 12:21
To: unicode@unicode.org
Subject: Re: writing in an alphabet with fewer letters: letter replacements
Re: writing in an alphabet with fewer letters: letter replacements
Hey Jonathan, The official transliteration for Hebrew to the Latin script is obsolete What is the latest recommended scheme? and the situation in this country is a mess Let me guess: it has to do with the number of spelling variants in names of /aliyah/ immigrants? I've always been wondering whether someone named Kahn will be spelled כהן as a new citizen of Israel. Stephan
RE: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)
And I'm sorry for having supported you then, since the Romanians claimed at the time that they could not live with a font variation, since they needed to be able to have a distinction between s and t with cedilla and comma below in the same text, only to come up with a national standard with only the comma below. Furthermore, Romanian texts are not at all consistent in this area. Sincerely, Erkki

-----Original Message-----
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Michael Everson
Sent: 5 July 2013 12:11
To: unicode Unicode Discussion
Subject: Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)

... I fought this battle back when I supported the Romanian disunification of their letters from the Turkish ones. We're just finishing the job now, as far as I can see. Michael Everson * http://www.evertype.com/
Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)
On 5 Jul 2013, at 11:27, Erkki I Kolehmainen e...@iki.fi wrote: And I'm sorry for having supported you then, since the Romanians claimed at the time that they could not live with a font variation, since they needed to be able to have a distinction between s and t with cedilla and comma below in the same text, They do, Erkki, when a text contains both Romanian and Turkish. Each language should be represented correctly. only to come up with a national standard with only the comma below. I don't think we have to worry about that old 8-bit code table at this point. Furthermore, the Romanian texts are not at all consistent in this area. No, and the fault lies with 8859. More and more new texts however conform to the expected encoding. By the way, what we did for the Romanians at the time was to add precomposed Ș and ș to the standard in addition to the existing precomposed Ş ş. Note that the former could at that time already be supported by the UCS, as S s with combining comma below. Michael Everson * http://www.evertype.com/
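The Romanian/Turkish disunification Michael describes is visible directly in the character data: the two precomposed pairs carry different canonical decompositions, and the S + combining comma below sequence he mentions NFC-composes to the Romanian letter added later.

```python
import unicodedata

s_cedilla = "\u015F"  # ş LATIN SMALL LETTER S WITH CEDILLA (Turkish)
s_comma   = "\u0219"  # ș LATIN SMALL LETTER S WITH COMMA BELOW (Romanian)

# Each has its own canonical decomposition:
assert unicodedata.normalize("NFD", s_cedilla) == "s\u0327"  # COMBINING CEDILLA
assert unicodedata.normalize("NFD", s_comma)   == "s\u0326"  # COMBINING COMMA BELOW

# As noted above, s + combining comma below was already representable in the
# UCS; NFC composes that sequence to the precomposed U+0219 added for Romanian.
assert unicodedata.normalize("NFC", "s\u0326") == s_comma
```

This is the opposite of the Latvian situation: for Romanian the comma-below letters got their own code points, so the two marks can contrast in one text, at the cost of the legacy-data inconsistency Erkki complains about.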
Re: writing in an alphabet with fewer letters: letter replacements
Philippe Verdy verd...@wanadoo.fr writes: 2013/7/5 Richard Wordingham richard.wording...@ntlworld.com I've seen French comments in Fortran that just drop all the accents - most disconcerting to read! This is an old problem. It first appeared because of the lack of Unicode support in famous historic programming languages (and it persists today in common languages like C, C++, Basic, Fortran, Cobol, Ada, Pascal...), but now it should be noted that many French programmers

This is no longer true of Ada, at least; both the standard and the popular GNAT implementation support Unicode:

   with Ada.Text_Io;
   procedure Testu is
      Grüß : constant String := "Grüezi mitenand";
   begin
      Ada.Text_Io.Put_Line (Grüß);
   end Testu;

-- Ian ◎
RE: writing in an alphabet with fewer letters: letter replacements
The official transliteration for Hebrew to the Latin script was, in my opinion, based on German. Thus ו was w, etc. It was revised in 2011, but the revised version is not in common use. Kahn would normally be קאהן. Here is a press article about it (in Hebrew): http://www.nrg.co.il/online/1/ART1/438/793.html The “up to date” official version may be found at: http://hebrew-academy.huji.ac.il/hahlatot/TheTranscription/Documents/ATAR1.pdf Best regards, Jonathan (Jony) Rosenne

From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Stephan Stiller
Sent: Friday, 5 July 2013 12:57
To: unicode@unicode.org
Subject: Re: writing in an alphabet with fewer letters: letter replacements
ISO 2955
A topic that is different from but related to the current discussion writing in an alphabet with fewer letters: letter replacements is the question of writing units with limited character sets. This is not a merely academic question but a real existing problem in some situations. You might remember there was a standard named ISO 2955-1983 Information processing -- Representation of SI and other units in systems with limited character sets. It gave some rules for writing deg for °, u for µ, m2 for m², and so on. However, it was withdrawn in 2001. Does anybody know whether there is a successor standard? If not, does someone know the reason? DIN 66030 gives a solution for German; are there similar standards for other countries, especially Japan and China?

Let me briefly outline the context of the question. Even if ISO 2955 was designed with a focus on systems using ISO 646 (ASCII) as the character set, there are certain issues even in Unicode-based systems, especially in Japanese and Chinese contexts. Such a context can be briefly described as follows:
- Multi-language user interface (language switching is possible)
- Only one font per language, plain text only (no font linking, no font fallback, no formatted text)
- East Asian languages use standard third-party fonts, like TrueType fonts.
Any kind of embedded device is a good example. The simple and often followed rule for East Asian fonts is covering a well-known language-specific encoding standard, like GB2312 or GBK for Simplified Chinese, Shift-JIS for Japanese, EUC-KR or UHC for Korean, and BIG5 for Traditional Chinese. Since these standards contain several alphabets like Cyrillic, Latin, Greek, Bopomofo, Kana, and so on, it is the easiest way for font manufacturers to use the encoding standard as a guideline for the character-set coverage, even for fonts designed for Unicode-based systems. Now you buy and install a standard font and your Japanese or Chinese user interface looks beautiful on the target system.
All are happy. However, this is true only until you have texts with units like km·m/s² or µA/mm². The translation was done on a Unicode and SI basis, but the font covers only Shift-JIS, GBK, or BIG5, respectively. Thus U+00B2 SUPERSCRIPT TWO, U+00B3 SUPERSCRIPT THREE, and U+00B5 MICRO SIGN are missing from the font and cannot be displayed. Fonts without these limitations are not easy to obtain. Thanks in advance for any answer. Albrecht
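A fallback in the spirit of the ISO 2955 rules quoted above (deg for °, u for µ, m2 for m²) can be sketched as a character map. The "." replacement for the middle dot is an assumption for illustration, not taken from the withdrawn standard's text:

```python
# ISO 2955-style fallback for unit strings when superscripts and the
# micro sign are unavailable. The middle-dot -> "." mapping is an
# assumption, not quoted from the standard.
UNIT_FALLBACK = str.maketrans({
    "\u00b0": "deg",  # ° DEGREE SIGN
    "\u00b5": "u",    # µ MICRO SIGN
    "\u00b2": "2",    # ² SUPERSCRIPT TWO
    "\u00b3": "3",    # ³ SUPERSCRIPT THREE
    "\u00b7": ".",    # · MIDDLE DOT (assumed separator replacement)
})

assert "\u00b5A/mm\u00b2".translate(UNIT_FALLBACK) == "uA/mm2"
assert "km\u00b7m/s\u00b2".translate(UNIT_FALLBACK) == "km.m/s2"
```

Such a fold could be applied at display time when the selected font is known to lack these code points, which is exactly the embedded-device scenario described above.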
Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)
2013/7/5 Michael Everson ever...@evertype.com

On 5 Jul 2013, at 08:04, Denis Jacquerye moy...@gmail.com wrote: The problem is in pretending that a cedilla and a comma below are equivalent because some script fonts in France or Turkey routinely write some sort of undifferentiated tick for ç. :-) Sure they are not equivalent, but stop pretending it is only in some script fonts; the page http://typophile.com/node/49347 has plenty of examples where it is not in script fonts. In some languages the cedilla can have a shape similar to that of a comma; it's a fact.

Yes, well, if there are non-script fonts which have this feature, it is nevertheless derived from handwriting. Would any French primer for young children routinely use a full-formed C WITH COMMA BELOW C̦ c̦ regularly throughout? No. Would readers of Le Monde notice if all the fonts one day shifted to C̦ c̦? Of course they would -- and I'd wager €100 they would protest, and loudly.

Any native speaker will tell you the comma-like form and others are acceptable.

Not by any means in all contexts. In genuine taste-tests, Ç ç would be universally selected as the more correct form by French users. C̦ c̦ would not be.

Actually, users will only protest if the shape is not attached or is displayed too far below. But the exact shape does not matter much: an attached vertical tick, a comma touching the bottom of the letter (9-shaped, )-shaped, or a diagonal triangular stroke), or the 5-shaped standard cedilla will all be accepted. It will also be accepted if it's a small mirrored c, or a right half-circle, not connected to the base of the letter by some vertical or diagonal thin stroke, as long as it is coherent with the general font style, clearly visible, and not confused with a dot below. Handwritten French texts frequently do not use the standard 5-shaped glyph, but some attached diagonal stroke connected to the center bottom of the letter c.
[notes] With the exception of untranslated foreign toponyms and of people's names or possible trademarks, only the letters c and C have a cedilla in French, and most users cannot type the cedilla below the capital letter C with their standard keyboard layout, e.g. on Windows, MacOS, or Linux, without complication or without using personal customized layouts. Word processors or spell checkers for web browsers propose the correction on the frequent word Ça, the most common case where the cedilla is missing below C. There is also the expression Ç’a, which is the contraction of Ça followed by the conjugated auxiliary verb a, used in the compound past (passé composé) tense, but this rare form is avoided by most users, who use the imperfect (imparfait) or simple past (passé simple), i.e. non-compound tenses, or will use the synonym Cela before the compound past. (Some more advanced French users will avoid it because of the phonological alliteration of Cela a..., where Ç’a ... is still preferable to respect correct tense matching of sentences and correct phonology including the contraction; some users also do not use the contraction, and say or write Ça a...). Many users avoid the difficulty caused by Ça on their keyboard by writing its synonym Cela. Other cases for capital C with cedilla only include text written in all-caps styles (not to be abused, but generally limited to some paragraphs for strong notices, like disclaimers of responsibility in contracts and licences, or for strengthening a single word like GARÇON(S) vs. FILLE(S) within long sentences, but not in isolated cases like data column headers, which should still write Garçon(s) or garçon(s)).
Words containing a cedilla are frequent only because of the word ça; the presence of a cedilla outside the word ça is still low in French, and most occurrences are in conjugated verbs whose infinitive ends in -cer (like nous enlaçons). In other words, the real difficulty of the cedilla in French is to have it properly displayed below non-initial lowercase letters c; most of these cases are in conjugated verbs and a few common nouns like garçon(s) or glaçon(s), the less frequent words limaçon(s) and colimaçon(s), plus some well-known toponyms like Curaçao (the island in the Dutch Antilles, or the name of a liqueur well known for its blue color)... In all these cases, the shape of the cedilla does not really matter, as long as some mark is present below the c for correct reading. The initial forms Ca and C’, instead of the correct forms Ça and Ç’, are frequent, but do not cause a reading problem when they occur at the beginning of a sentence. In fact, this is perceived as a typographic problem more than an orthographic one (just like the shape of the apostrophe, which is preferably the 9-shaped high comma ’ but is also absent from keyboards, which offer only the vertical tick of the encoded ASCII apostrophe-quote).
Re: ISO 2955
2013-07-05 17:01, Dreiheller, Albrecht wrote: A topic that is different from but related to the current discussion (writing in an alphabet with fewer letters: letter replacements) is the question of writing units with limited character sets. This is not a somewhat academic question but a real existing problem in some situations. It is, even though many Unicode-minded people would like it to be definitely water under the bridge. But the problem is less common than many other people think. In most cases, it is a matter of not knowing how to type the correct characters. There are many tools for the purpose, but no really universal way, so people need to learn specific methods, which may vary by situation. It’s usually not the character repertoire that is limited but people’s ability to type characters. You might remember there was a standard named ISO 2955-1983, Information processing -- Representation of SI and other units in systems with limited character sets. It seems to be now freely available at http://freepdfdb.org/pdf/international-standard-68243320.html Whether that’s legal or not is fairly irrelevant in practice, as the standard has historical relevance only. However, it was withdrawn in 2001. Does anybody know whether there is a successor standard? There isn’t. If not, does someone know the reason? It was not useful. In the rare cases where the repertoire is really limited, people will use some ad hoc notations and hopefully explain them, or avoid the problem. In text, you can simply write “micrometers” if you cannot type “µm”. A standard on such a marginal issue might even be seen as misleading: it might make people think that the specific notations there are widely known and understood. They aren’t. So if you need to use some notations due to lack of some special characters, it is better that you realize that there is no generally known method to do so, and that you need to explain your notations. Yucca
Re: writing in an alphabet with fewer letters: letter replacements
On Thu, 04 Jul 2013 22:19:11 -0700 Stephan Stiller stephan.stil...@gmail.com wrote: Hi folks, For languages whose alphabets aren't too far apart (I'm thinking mostly of the set of Latin-derived alphabets), what is a good place for finding out how letter replacements for letters that are missing in a different country/locale are done? A good source might be the rules for the name in the first line of the 'machine readable passports'. Unfortunately, a quick hunt on the internet failed to find any such rules. Richard.
Re: writing in an alphabet with fewer letters: letter replacements
It seems that each country issuing a passport has its own national rules for transliterating people's names on passports. They will display the national alphabet, just extended with some national transliteration rules for other alphabets (to Basic Latin with a few extensions, or using just letters of the Latin alphabets commonly used in regional languages; in Japan or China, they could possibly use the simplified bopomofo alphabet or kana syllabaries), or they will ask people to define and register their own transliteration in the national registry. Passports also contain other transliterated items, such as toponyms and country names (but countries are also encoded or shown at least in English; European passports contain a few static preprinted texts in a dozen languages, using the Latin, Greek and Cyrillic alphabets for the national languages; but not everything is translated). They also contain dates and numbers, using Western Arabic digits. But official seals may display anything. I have absolutely no information about what is encoded in the machine-readable part of my passport, or in accompanying leaflets or stickers applied to it such as visas, or whether this data is updated when crossing a border. But I know that this data now contains some biometric data (more to come) and a digitally signed photograph and personal signature (the content of this data is subject to change, notably because of US demands; some travelers need a new accompanying form added to their existing passport for travelling to the US or via the US, which they will get when requesting a visa; some old passports are also refused and need to be replaced). Not everything is in the passport, and travel agencies will also request other information that will be transmitted before the trip is authorized. 
My opinion is that there is no stable standard; countries change their rules every year through mutual negotiations, additional national restrictions, or after special international events (and they will inform the travel agencies). Travelers should be informed by travel agencies about the procedures for visas and their scope (whether the visas or national identity cards are valid across multiple countries within a free-travel area whose member countries apply the same rules). Some day may come when some countries will request DNA identification data produced in the origin country and certified by its approved national labs, or will take sealed DNA samples when people enter their country (for short tourist trips, these may be analyzed later if the person does not leave the country after expiration of the visa or does not take the return flight sold by the travel agency, to save the costs of lab analysis in the visited country). But digitally signed biometric data or photos are completely out of scope for Unicode. 
Re: writing in an alphabet with fewer letters: letter replacements
On Fri, 5 Jul 2013 21:36:24 +0200 Philippe Verdy verd...@wanadoo.fr wrote: I have absolutely no information about what is encoded in the machine readable part of my passport, See http://www.icao.int/publications/Documents/9303_p1_v1_cons_fr.pdf , especially Appendice 8 (p IV-50). The English version is available as http://www.icao.int/publications/Documents/9303_p1_v1_cons_en.pdf , especially Appendix 8 (p IV-47). Richard.
Re: writing in an alphabet with fewer letters: letter replacements
See http://www.icao.int/publications/Documents/9303_p1_v1_cons_fr.pdf , especially Appendice 8 (p IV-50). The English version is available as http://www.icao.int/publications/Documents/9303_p1_v1_cons_en.pdf , especially Appendix 8 (p IV-47). I suppose you can't go wrong with what your own passport says :-) Some obligatory comments: * The -XX variants (Ñ → NXX (N) and Ü → UXX (UE)) can't be intended for human use. * Ŋ → N (shown there with lower-case ŋ – if the implication is that it can't occur word-initially, it's not stated and it's not clear all the other letters can) is surprising. * disallowed: Ä↛A , Ö↛O , Ü↛U (as are: Å↛A , Ø↛O) Stephan
Re: writing in an alphabet with fewer letters: letter replacements
On Fri, 5 Jul 2013 20:49:23 +0100 Richard Wordingham richard.wording...@ntlworld.com wrote: The English version is available as http://www.icao.int/publications/Documents/9303_p1_v1_cons_en.pdf , especially Appendix 8 (p IV-47). And there are recommended foldings in Appendix 9! Richard.
Re: writing in an alphabet with fewer letters: letter replacements
I suppose you can't go wrong with what your own passport says On second thought ... * disallowed: Ä↛A , Ö↛O , Ü↛U (as are: Å↛A , Ø↛O) ... I have a Turkish friend for whom it is Ö→O, not OE. This calls into question the general applicability of these rules. A few years ago he also told me that it's nice that Germany has recommended replacement rules because Turkey doesn't. The linked-to document is dated 2006, but he told me this after 2006. His knowledge might have been out of date (maybe Turkey now does Ö→OE), but in light of this the extent to which these rules reflect popular usage remains very much unclear, and we all seem to agree that practice is unlikely to be uniform anyway. Stephan
RE: writing in an alphabet with fewer letters: letter replacements
Latin and Cyrillic? That's it? Best regards, Jonathan (Jony) Rosenne -Original Message- From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Richard Wordingham Sent: Friday, 05 July 2013 22:49 To: Unicode Public Subject: Re: writing in an alphabet with fewer letters: letter replacements On Fri, 5 Jul 2013 21:36:24 +0200 Philippe Verdy verd...@wanadoo.fr wrote: I have absolutely no information about what is encoded in the machine readable part of my passport, See http://www.icao.int/publications/Documents/9303_p1_v1_cons_fr.pdf , especially Appendice 8 (p IV-50). The English version is available as http://www.icao.int/publications/Documents/9303_p1_v1_cons_en.pdf , especially Appendix 8 (p IV-47). Richard.
Re: writing in an alphabet with fewer letters: letter replacements
The only relevant part (translated from the French document) is: [quote]Machine-verifiable biometric element. A unique physical element of personal identification (for example an iris pattern, a fingerprint, or facial characteristics), stored on a travel document in a machine-readable and machine-verifiable form.[/quote] No more details about the data stored on the magnetic strip or in the RFID chip. The document only gives information about the printed text readable by OCR, plus a few other mechanical security systems for building the documents. There is also some small data perforated on some pages; I have no idea whether it is secured or contains anything other than a unique ID of the passport itself, the rest of the data being accessed over computer networks. 
Re: writing in an alphabet with fewer letters: letter replacements
On Fri, Jul 5, 2013 at 10:42 AM, Richard Wordingham richard.wording...@ntlworld.com wrote: This is no longer true of Ada, at least; both the standard and the popular GNAT implementation support Unicode: Grüß : constant String := "Grüezi mitenand"; That only demonstrates Latin-1 support! It seems the current Ada standard supports the original ISO 10646 31 bits. That may only demonstrate Latin-1 support, but full Unicode is still there. The base library includes one non-ASCII character in its specifications: Pi : constant := 3.14159_26535_89793_23846_26433_83279_50288_41971_69399_37511; π : constant := Pi; If you're looking at the current Ada standard, Ada 2012, it explicitly references ISO/IEC 10646:2011. -- Kie ekzistas vivo, ekzistas espero.
Re: weltformel.c asking for peer review
Dear Philippe, thank you for your observations, critiques and recommendations: On Fri, Jul 5, 2013 at 8:59 PM, Philippe Verdy verd...@wanadoo.fr wrote: Why did it reach the Unicode Sarasvati list? Because Unicode is about Finding The Best Representation Of The Universe's Codes (=Languages). If you ask for help about what's wrong with this undocumented program, maybe you should consider just these loops: for(w=0;--w;)for(z=0;z--;)for(y=0;y--;)for(x=0;x--;) - the w loop will run from -1 to -127 then 128 to 1, so it will not run w=0 - the z, y, x loops will never run (the initial value is 0, and you test it in the condition using post-decrement, so it is the initial value you test) Right, my Heureka was a premature ejaculation due to the compiler accepting my program as syntactically correct and my psyche's reasoning feeling overly confident that this was already it. I got suspicious after the computation process took 99% CPU load and a slim 384k for over a couple of hours, and eventually found and fixed the second problem myself, but not the w=0. Talking to Doctor http://de.wikipedia.org/wiki/Marc_Benecke, who certified that my weltformel seems lupenrein doktorabel [flawlessly doctorate-worthy] to him, I decided to rename w,x,y,z into h,i,j,k in honor of http://de.wikipedia.org/wiki/Plancksches_Wirkungsquantum and http://en.wikipedia.org/wiki/Quaternion_group and am still pondering whether to - go from signed char to unsigned char and just manually work with two's complement - go from char to int but count h from 1 to 127 and i,j,k from -127 to 127 - loophole my miniature universe so that information from 127 propagates at 1c to -128, even though the real universe seems to need at least (2^2^256h)^4 if not infinite timespace In other words, the initialization does nothing, and leaves all cells at their initial value 0 (from calloc); only the following line sets one differently. b(10,0,0,0)=7; /* urknall! 
10 is just a random choice */ This is a case of over-optimization, using some untested assumptions. Assuming my initialization had successfully filled the buffer with an equal oscillating load of minuses and pluses, I just wanted to visualize the electromagnetic vacuum wave before the urknall for ten steps. These could of course also be reduced to zero, but I need exactly one mutating urknall to turn the perfectly ordered harmony into a germ of life and chaos that can evolve and create biological structures like hydrogen and Brad Pitt and Angelina Jolie. Use integers in loops, and unsigned chars for your 4GB work buffer containing the hypercube: main(){int w,x,y,z; unsigned char*c=calloc(1L<<32,sizeof(unsigned char)); for(w=256;--w;)for(z=256;--z;)for(y=256;--y;)for(x=256;--x;)b(w,x,y,z)=(x+y+z)%2?1:4; Anyway, this program is unnecessarily slow for computing 6 sets of 256x256 PPM bitmaps with RGB color space. I realized I tried to recompute each tomographic slice 64k times, which definitely shows another part of the prototype's immaturity. You can certainly avoid allocating 4GB of memory (many systems have a total of 4GB, including for the OS and the shared memory space for transferring textures to video accelerators). You mean something like a growing cycle realloc(=4);ppm();free();? You should also avoid creating so many files per directory (the filesystem or your shell does not like sorting so many files; the implementers of web browser caches know that and distribute files across separate directories). First, you have 6 distinct sets, which could each have their own folder; then you have 65536 files per set, which you can split into 256 subsets of 256 files. 
- The file set p(w, x, 256, 256) should be in files wx/w/x.extension - The file set p(w, 256, y, 256) should be in files wy/w/y.extension - The file set p(w, 256, 256, z) should be in files wz/w/z.extension - The file set p(256, x, y, 256) should be in files xy/x/y.extension - The file set p(256, x, 256, z) should be in files xz/x/z.extension - The file set p(256, 256, y, z) should be in files yz/y/z.extension Perfect, I love your purely NoSQL model and will incorporate it and credit you! This just means creating the 6 subdirectories wx, wy, wz, xy, xz and yz, each one containing 256 subdirectories. And each of these 256 subdirs contains 256 subdirs containing 256 files containing 256 rows of 256 colors. They can each be further compressed by ppmlabel -text $filename $filename | pamenlarge 4 | pnmtopng $filename.png. And any 256 of them can further be combined as screenshots into three video/mp4 files through ppmtompeg $CONFIG or ffmpeg $ARGS These bitmaps are also overly large (PPM files for RGB color space with 1 bit per pixel are 8 times larger than necessary, using 1 full storage byte per pixel and per color plane, instead of just 1 bit). So each 256x256 bitmap takes 192KB + 13 bytes for the magic header. As you store all bitmaps in the same current directory, you'll get 256x256*6=393216 files, taking 193KB each, i.e. a giant storage space of
Re: weltformel.c asking for peer review
Oops, total URL fuckup, here's the tested errata: https://en.wikipedia.org/wiki/Mark_Benecke https://en.wikipedia.org/wiki/Wolfram%27s_2-state_3-symbol_Turing_machine https://en.wikipedia.org/wiki/Half-integers https://en.wikipedia.org/wiki/Quaternion_group https://en.wikipedia.org/wiki/Planck_constant https://en.wikipedia.org/wiki/Speed_of_light https://en.wikipedia.org/wiki/Theory_of_everything