Default bidi ranges
I tried to find something like a normative description of the default bidi class of unassigned code points. In UTR #9, it says (http://www.unicode.org/reports/tr9/tr9-23.html#Bidirectional_Character_Types): Unassigned characters are given strong types in the algorithm. This is an explicit exception to the general Unicode conformance requirements with respect to unassigned characters. As characters become assigned in the future, these bidirectional types may change. For assignments to character types, see DerivedBidiClass.txt [DerivedBIDI] in the [UCD]. The DerivedBidiClass.txt file, as far as I understand, is mainly a condensation of bidi classes into character ranges (rather than giving them for each codepoint independently as in UnicodeData.txt). I.e. it can at any moment be derived automatically from UnicodeData.txt, and is as such not normative. Why is it then that the default class assignments are only given in this file (unless I have overlooked something)? And why is it that they are only given in comments? I'm trying to create a program that takes all the bidi assignments (including default ones) and creates the data part of a bidi algorithm implementation, but I don't feel confident to code against stuff that's in comments. Any advice? Is it possible that this could be fixed (making it more normative, and putting it in a form that's easier to process automatically)? Regards, Martin.
Re: Arabic alif-lam ligature
11/8/2011 7:24 PM, Andreas Prilop wrote: There is a non-standard alif-lam ligature in the Arabic script. The logo of Al Arabiya shows an example. The logo as on page http://www.alarabiya.net looks like a rather special way of writing the name, but that’s what logos are. Which fonts have such an alif-lam ligature? Do some fonts have it, and does the ligature appear in text rendering, as opposed to display of logos? I would expect it to be a special rendering style, much like in handwriting we produce combinations of letters that correspond to ligatures. Should I write U+0627 ZWJ U+0644 to obtain the ligature? Or should I write U+0627 ZWNJ U+0644 to prevent the ligature? Those would be the character-level tools. But normally I would expect people to use higher-level protocols, such as commands in a typesetting program or style sheets applied to entire blocks of text. Or is alif-lam outside the scope of Unicode and just regarded as a logo? It’s not a logo as such, but any use that is restricted to logos should probably be considered as external to Unicode. If there are fonts that contain an alif-lam ligature, then I would expect it to be regarded as a possible rendering of a character pair. Typographic ligatures are normally encoded as characters in Unicode only if they exist as characters in some other character code in use. Yucca
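For readers who want to try the two character-level sequences mentioned in the message above, here is a minimal sketch in Python. It only builds the strings and lists their code points; the helper name `describe` is made up for this illustration. Whether any given font or renderer actually forms (or suppresses) an alif-lam ligature for these sequences is not something the code can show, and nothing here should be read as a recommendation over higher-level styling.

```python
import unicodedata

ALEF = "\u0627"  # ARABIC LETTER ALEF
LAM = "\u0644"   # ARABIC LETTER LAM
ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER


def describe(text):
    """Return 'U+XXXX NAME' for each character (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# Sequence asked about for requesting the ligature: U+0627 ZWJ U+0644
request_ligature = ALEF + ZWJ + LAM

# Sequence asked about for preventing the ligature: U+0627 ZWNJ U+0644
prevent_ligature = ALEF + ZWNJ + LAM

for label, seq in (("ZWJ", request_ligature), ("ZWNJ", prevent_ligature)):
    print(label, describe(seq))
```

Both strings are three code points long; the joiner characters are invisible in plain text, so listing code points is the easiest way to check what was actually typed.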
Re: Default bidi ranges
On 11/9/2011 1:18 AM, Martin J. Dürst wrote: I tried to find something like a normative description of the default bidi class of unassigned code points. In UTR #9, it says (http://www.unicode.org/reports/tr9/tr9-23.html#Bidirectional_Character_Types): Unassigned characters are given strong types in the algorithm. This is an explicit exception to the general Unicode conformance requirements with respect to unassigned characters. As characters become assigned in the future, these bidirectional types may change. For assignments to character types, see DerivedBidiClass.txt [DerivedBIDI] in the [UCD]. The DerivedBidiClass.txt file, as far as I understand, is mainly a condensation of bidi classes into character ranges (rather than giving them for each codepoint independently as in UnicodeData.txt). I.e. it can at any moment be derived automatically from UnicodeData.txt, and is as such not normative. Why is it then that the default class assignments are only given in this file (unless I have overlooked something)? And why is it that they are only given in comments? Because the UnicodeData.txt file has no header (for historical compatibility). Because, like the practice of putting style in HTML inside comments, these things (@missing) are in comments to protect older parsers. I'm trying to create a program that takes all the bidi assignments (including default ones) and creates the data part of a bidi algorithm implementation, but I don't feel confident to code against stuff that's in comments. Any advice? Is it possible that this could be fixed (making it more normative, and putting it in a form that's easier to process automatically)? I've confidently parsed these comments for years now. The one thing that's worse than parsing these comments is to move to an incompatible scheme.
That said, apparently, for some properties the default information is contained in the PropertyValueAliases.txt file, where it is inconveniently located for people who want to parse just one property, but conveniently located for those who want to assemble the whole database. (And, worse, where it adds a code-point dependency to the information in that file that wasn't there from the beginning - but at least the @missing syntax hasn't changed too much). A./
Re: Default bidi ranges
On 11/9/2011 9:30 AM, Asmus Freytag wrote: On 11/9/2011 1:18 AM, Martin J. Dürst wrote: I tried to find something like a normative description of the default bidi class of unassigned code points. In UTR #9, it says (http://www.unicode.org/reports/tr9/tr9-23.html#Bidirectional_Character_Types): Unassigned characters are given strong types in the algorithm. This is an explicit exception to the general Unicode conformance requirements with respect to unassigned characters. As characters become assigned in the future, these bidirectional types may change. For assignments to character types, see DerivedBidiClass.txt [DerivedBIDI] in the [UCD]. That *is* the normative description of the default Bidi_Class for unassigned code points. The DerivedBidiClass.txt file, as far as I understand, is mainly a condensation of bidi classes into character ranges (rather than giving them for each codepoint independently as in UnicodeData.txt). I.e. it can at any moment be derived automatically from UnicodeData.txt, and is as such not normative. Because the default values for Bidi_Class are complicated, and cannot be derived simply by parsing the values for *assigned* characters in UnicodeData.txt, the listing of the default values for Bidi_Class in DerivedBidiClass.txt has to be taken as normative for those values. Why is it then that the default class assignments are only given in this file (unless I have overlooked something)? And why is it that they are only given in comments? Because the UnicodeData.txt file has no header (for historical compatibility). Because, like the practice of putting style in HTML inside comments, these things (@missing) are in comments to protect older parsers. And to go beyond what Asmus said there, the @missing hack was created as a syntax for specifying *the* default values for properties where it makes sense that they have a *single* default value. It doesn't work for specifying multiple default values differing by code point range.
Hence the addition of the @missing comment in DerivedBidiClass.txt (or its potential addition to PropertyValueAliases.txt) doesn't suffice for the entire definition. I'm trying to create a program that takes all the bidi assignments (including default ones) and creates the data part of a bidi algorithm implementation, but I don't feel confident to code against stuff that's in comments. Any advice? Use the values in the comments. Remember that this is not *code* with comments that get stripped out before compiling. These are text data files for parsing. The fact that people are already parsing the @missing statements indicates that those are being treated normatively now. You could say the same thing for the titles, dates, and copyright notices on these data files: they aren't optional content to be ignored. Is it possible that this could be fixed (making it more normative, and putting it in a form that's easier to process automatically)? This is part of a very large problem for creating a more complete and machine-parseable means of accessing *all* of the Unicode character property data, including data about the *status* of properties and their default values. It won't, IMO, be fixed by individual file fixes one at a time, although incremental improvement can be helpful. Note that the UCD in XML was created to address this problem in part, but it still cannot answer many questions about the status of properties, their full derivations, their interactions, and their functions. --Ken
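As a rough illustration of "use the values in the comments", the following Python sketch reads both ordinary data lines and `# @missing:` comment lines from text in the DerivedBidiClass.txt layout. The `SAMPLE` text is a tiny made-up excerpt, not the real file, and the function names `parse_bidi_class` and `lookup` are invented for this example. Per the message above, a single `@missing` line cannot express defaults that differ by code point range, so this sketch covers only the one blanket default; any range-specific defaults described elsewhere in the file's header would need separate handling that is not shown here.

```python
import re

# Illustrative excerpt only (assumption): the real file is far longer and
# its exact contents depend on the Unicode version.
SAMPLE = """\
# @missing: 0000..10FFFF; Left_To_Right
0041..005A    ; L # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
05BE          ; R # Pd       HEBREW PUNCTUATION MAQAF
"""

# "# @missing: FIRST..LAST; Value" inside a comment line.
MISSING_RE = re.compile(
    r"^#\s*@missing:\s*([0-9A-Fa-f]+)\.\.([0-9A-Fa-f]+)\s*;\s*(\S+)"
)
# "FIRST..LAST ; Value" or "CODEPOINT ; Value" on a data line.
DATA_RE = re.compile(r"^([0-9A-Fa-f]+)(?:\.\.([0-9A-Fa-f]+))?\s*;\s*(\S+)")


def parse_bidi_class(text):
    """Return (defaults, explicit) as lists of (first, last, value)."""
    defaults, explicit = [], []
    for line in text.splitlines():
        m = MISSING_RE.match(line)
        if m:
            defaults.append((int(m.group(1), 16), int(m.group(2), 16), m.group(3)))
            continue
        body = line.split("#", 1)[0].strip()  # drop ordinary comments
        if not body:
            continue
        m = DATA_RE.match(body)
        if m:
            first = int(m.group(1), 16)
            last = int(m.group(2) or m.group(1), 16)
            explicit.append((first, last, m.group(3)))
    return defaults, explicit


def lookup(cp, defaults, explicit):
    """Explicit assignment if present, else the @missing default, else None."""
    for first, last, value in explicit:
        if first <= cp <= last:
            return value
    for first, last, value in defaults:
        if first <= cp <= last:
            return value
    return None


defaults, explicit = parse_bidi_class(SAMPLE)
print(lookup(0x0042, defaults, explicit))  # explicit data line
print(lookup(0x0378, defaults, explicit))  # falls back to the @missing default
```

Note that in this excerpt the data lines use short value aliases (`L`, `R`) while the `@missing` line uses a long name (`Left_To_Right`); a real implementation would presumably normalize the two, for example via PropertyValueAliases.txt.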
tips on writing character proposal
Hello! I'm new here, but have already read some of the online documentation for proposing new characters. I'm still a bit unsure how to go about it. Or even who can do it. Can individuals submit ideas, or do you need to be the representative of some agency or group? How much supporting background information is deemed sufficient? Where do I find details (more than just the pipeline table) of current pending proposals? Here are my ideas in very abbreviated form. If these are non-starters from the beginning, I'd as soon know it sooner rather than later. These first several self-descriptive shapes are simply things I've seen suggested and wished for online for some time.
2B5A CLOCKWISE SPIRAL
2B5B COUNTER-CLOCKWISE SPIRAL
2B5C CLOCKWISE DOUBLE SPIRAL
2B5D COUNTER-CLOCKWISE DOUBLE SPIRAL
The next several are a response to a perceived deficiency in standardization of religious symbols. I suggest starting these cultural symbols at 2BC0 to distinguish them from the generic/geometric symbols earlier in the block. Very brief description/background given.
2BC0 ICHTHYS = Jesus fish, symbol used by ancient Christians for identification, denotes non-denominational and inter-denominational Christianity in modern times
2BC1 TRIQUETRA = three-lobed vesicae piscis, used in Christianity and ancient/modern paganism
2BC2 MENORAH = 7-branched temple lamp, ancient symbol of Judaism
2BC3 HANUKIAH = 9-branched Hanukkah lamp
Thank you, Tim
Re: tips on writing character proposal
11/9/2011 10:58 PM, Larson, Timothy E. wrote: I'm new here, but have already read some of the online documentation for proposing new characters. I think that a key statement that you have missed is at the end of http://www.unicode.org/pending/symbol-guidelines.html which says: “The fact that a symbol merely ‘seems to be useful or potentially useful’ is precisely not a reason to code it. Demonstrated usage, or demonstrated demand, on the other hand, does constitute a good reason to encode the symbol.” Note that the usage or demand needs to relate to use in text, not as standalone symbols. Moreover, demonstrated actual usage in texts tends to have much better chances than even well-described demand. Yucca
Re: tips on writing character proposal
From: Larson, Timothy E. TELarson_at_west.com Hello! I'm new here, but have already read some of the online documentation for proposing new characters. I'm still a bit unsure how to go about it. Or even who can do it. Can individuals submit ideas, or do you need to be the representative of some agency or group? How much supporting background information is deemed sufficient? Where do I find details (more than just the pipeline table) of current pending proposals? You absolutely do not need to be a representative of any company, government, organization, or group. I am in no way associated with any such entity and successfully proposed a script with ~150 characters. All it takes is a dedication to serious research, a large amount of time to dedicate to the process, and the tenacity and perseverance to see a long and arduous process through to the end. The ability to produce PDFs is helpful, but not necessary, too. You can take a look at a large number of proposal documents from June by following links at the document register http://std.dkuug.dk/JTC1/SC2/WG2/docs/n4000.pdf. Note that many of the documents are commentaries, opinions, or discussions of proposals. Look for any documents called something like "Proposal to encode X" or "Preliminary Proposal to encode X". Note that preliminary proposals will necessarily be incomplete. [snip] Thank you, Tim You're welcome, Van
editorial: definitively broken link on CLDR online tools to external Unicode Fonts for Ancient Scripts
The CLDR online tools include a footer that suggests finding Unicode fonts for ancient scripts at a web site (greekfonts.teilar.gr) that is no longer available. It now redirects to a parking page without content. There's an archive of this page in the Google Cache, which shows that the site is not just temporarily unavailable, but that it has been closed indefinitely: http://webcache.googleusercontent.com/search?q=cache:1oTvcjcKed4J:greekfonts.teilar.gr/+Unicode+Fonts+for+Ancient+Scripts&cd=1&hl=fr&ct=clnk&gl=fr Can the online CLDR tools (referenced not just by the CLDR project documentation and examples, but also in some technical references of the Unicode Standard) drop the link "Unicode Fonts for Ancient Scripts" that appears at the bottom of pages (for example http://unicode.org/cldr/utility/bidi.jsp), or suggest another good guide to available fonts for old or rare scripts, if possible a non-commercial one (i.e. not a foundry site directly selling its own fonts)? For example, I can propose the "Gallery of Unicode Fonts" on the WAZU JAPAN site (http://www.wazu.jp/) as a complement to the existing "Large, multi-script Unicode fonts for Windows computers" on Alan Wood's Unicode Reference site (http://www.alanwood.net/unicode/fonts.html). This would be the second-largest online database with good content that is neutral toward font vendors, and we had better reference and keep it now, in case the WAZU page ever disappears. (It would be better to have a second one available now in case the only working one that remains ever has problems.) -- Philippe.
Re: tips on writing character proposal
On 11/09/2011 03:58 PM, Larson, Timothy E. wrote: Hello! I'm new here, but have already read some of the online documentation for proposing new characters. I'm still a bit unsure how to go about it. Or even who can do it. Can individuals submit ideas, or do you need to be the representative of some agency or group? How much supporting background information is deemed sufficient? Where do I find details (more than just the pipeline table) of current pending proposals? There are others here who will throw even more cold water on some of these ideas, but I can suggest that you read http://www.unicode.org/pending/symbol-guidelines.html for some ideas about what is encodable and what isn't. You'll probably find plenty of exceptions, but it's a start. Here are my ideas in very abbreviated form. If these are non-starters from the beginning, I'd as soon know it sooner rather than later. These first several self-descriptive shapes are simply things I've seen suggested and wished for online for some time.
2B5A CLOCKWISE SPIRAL
2B5B COUNTER-CLOCKWISE SPIRAL
2B5C CLOCKWISE DOUBLE SPIRAL
2B5D COUNTER-CLOCKWISE DOUBLE SPIRAL
These might well be non-starters. Think about the first question you'd be asked: Why should these be encoded? Is there any reason we should be considering these symbols plain text that need to be encoded as such? Or is it just because they're common simple geometric symbols? While it is true that a lot of simple geometric symbols have been encoded, it generally has not been *because* they are simple geometric symbols, but rather because they were encoded in some other standard once before, or because they are used as plain text in some settings. The next several are a response to a perceived deficiency in standardization of religious symbols. I suggest starting these cultural symbols at 2BC0 to distinguish them from the generic/geometric symbols earlier in the block. Very brief description/background given.
2BC0 ICHTHYS = Jesus fish, symbol used by ancient Christians for identification, denotes non-denominational and inter-denominational Christianity in modern times
2BC1 TRIQUETRA = three-lobed vesicae piscis, used in Christianity and ancient/modern paganism
2BC2 MENORAH = 7-branched temple lamp, ancient symbol of Judaism
2BC3 HANUKIAH = 9-branched Hanukkah lamp
Apply the same question. What makes these symbols plain text? To be sure, there are other religious symbols in Unicode, particularly in the MISCELLANEOUS SYMBOLS and DINGBATS blocks, but those are mainly there because they were formerly encoded in, say, Zapf Dingbats, or are commonly used as map symbols. (You might actually be able to find some support for these, though, but don't ask me where.) It's a very common mistake, in coming to Unicode, to think "Oh, it would be *so great* if these things were encoded!" But Unicode isn't about encoding what would be neat to encode. It's about encoding _text_ (including things that have been encoded before). ~mark
Re: tips on writing character proposal
On 11/9/2011 6:08 PM, Mark E. Shoulson wrote: On 11/09/2011 03:58 PM, Larson, Timothy E. wrote: Hello! I'm new here, but have already read some of the online documentation for proposing new characters. I'm still a bit unsure how to go about it. Or even who can do it. Can individuals submit ideas, or do you need to be the representative of some agency or group? How much supporting background information is deemed sufficient? Where do I find details (more than just the pipeline table) of current pending proposals? There are others here who will throw even more cold water on some of these ideas, but I can suggest that you read http://www.unicode.org/pending/symbol-guidelines.html for some ideas about what is encodable and what isn't. You'll probably find plenty of exceptions, but it's a start. Timothy, Before you get totally discouraged, I'd like to point out that there are few open and shut cases in character encoding. The chances for your proposed characters improve the better the use case and the better the documented examples of actual use (usually in print or in examples that should be convertible to print). The fact that you think a character is missing is evidence that there's at least one potential user. Your task, in writing a proposal, would be to document that you are not alone (far from it) and that these symbols are used in text(s) on equal footing with other symbols. Doing the research and writing a proposal does take some work, and critics will be hovering to point out all shortcomings. But that should help improve your proposal. Here are my ideas in very abbreviated form. If these are non-starters from the beginning, I'd as soon know it sooner rather than later. These first several self-descriptive shapes are simply things I've seen suggested and wished for online for some time.
2B5A CLOCKWISE SPIRAL
2B5B COUNTER-CLOCKWISE SPIRAL
2B5C CLOCKWISE DOUBLE SPIRAL
2B5D COUNTER-CLOCKWISE DOUBLE SPIRAL
These might well be non-starters.
Think about the first question you'd be asked: Why should these be encoded? Is there any reason we should be considering these symbols plain text that need to be encoded as such? Or is it just because they're common simple geometric symbols? While it is true that a lot of simple geometric symbols have been encoded, it generally has not been *because* they are simple geometric symbols, but rather because they were encoded in some other standard once before, or because they are used as plain text in some settings. Before you see this as a definite answer, let me give you a suggestion of a different opinion. A common usage of these symbols in text is in non-verbal speech bubbles in cartoons. While these bubbles may look hand-drawn, they are very often actually typeset. The one exception being just those strings of symbols. Since, in the examples that I am thinking of, they are presented as text and their layout (on a line) is in no way different than text presentation, it's not possible to simply rule these out categorically. When symbols, however arbitrary, can be demonstrated as being used as part of writing, there's no good rationale to refuse their encoding. Doing so would simply send the message that arbitrary symbols are fine if they occur in just a subset of (more formal, e.g. mathematical) texts or on electronic platforms, but not elsewhere. That seems in violation of precedent and in violation of the universal scope of the standard. Now, you may not find examples of all types of spiral. Unless logically required by formal notation, I would, in that case, propose only those that can be found as in use. Completion of the set can be an argument in favor of encoding, but not everything is a member of a set worth completing. The next several are a response to a perceived deficiency in standardization of religious symbols. I suggest starting these cultural symbols at 2BC0 to distinguish them from the generic/geometric symbols earlier in the block.
Very brief description/background given.
2BC0 ICHTHYS = Jesus fish, symbol used by ancient Christians for identification, denotes non-denominational and inter-denominational Christianity in modern times
2BC1 TRIQUETRA = three-lobed vesicae piscis, used in Christianity and ancient/modern paganism
2BC2 MENORAH = 7-branched temple lamp, ancient symbol of Judaism
2BC3 HANUKIAH = 9-branched Hanukkah lamp
Apply the same question. What makes these symbols plain text? To be sure, there are other religious symbols in Unicode, particularly in the MISCELLANEOUS SYMBOLS and DINGBATS blocks, but those are mainly there because they were formerly encoded in, say, Zapf Dingbats, or are commonly used as map symbols. (You might actually be able to find some support for these, though, but don't ask me where.) I think these are great research candidates. I concur with the skeptics here that the mere existence of