Default bidi ranges

2011-11-09 Thread Martin J. Dürst
I tried to find something like a normative description of the default 
bidi class of unassigned code points.


In UTR #9, it says 
(http://www.unicode.org/reports/tr9/tr9-23.html#Bidirectional_Character_Types):


Unassigned characters are given strong types in the algorithm. This is 
an explicit exception to the general Unicode conformance requirements 
with respect to unassigned characters. As characters become assigned in 
the future, these bidirectional types may change. For assignments to 
character types, see DerivedBidiClass.txt [DerivedBIDI] in the [UCD].


The DerivedBidiClass.txt file, as far as I understand, is mainly a 
condensation of bidi classes into character ranges (rather than giving 
them for each codepoint independently as in UnicodeData.txt). I.e. it 
can at any moment be derived automatically from UnicodeData.txt, and is 
as such not normative.
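
To make the derivation concrete, here is a minimal Python sketch. It assumes input lines in the semicolon-separated UnicodeData.txt layout, with the code point in field 0 and the bidi class in field 4. The sample lines and the function name condense are illustrative only, and the sketch ignores the First/Last range pairs the real file uses for large blocks. It covers assigned code points only, so it yields nothing about defaults for unassigned ones.

```python
# Illustrative sample in the UnicodeData.txt field layout (not the full file).
SAMPLE_LINES = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;",
    "05D0;HEBREW LETTER ALEF;Lo;0;R;;;;;N;;;;;",
]

def condense(lines):
    """Merge consecutive code points with the same bidi class into ranges.

    Returns a list of (first, last, bidi_class) tuples. Hypothetical helper,
    assuming the lines are sorted by code point.
    """
    ranges = []
    for line in lines:
        fields = line.split(";")
        cp, bidi = int(fields[0], 16), fields[4]
        if ranges and ranges[-1][2] == bidi and ranges[-1][1] + 1 == cp:
            # Extend the current range by one code point.
            ranges[-1] = (ranges[-1][0], cp, bidi)
        else:
            ranges.append((cp, cp, bidi))
    return ranges

condensed = condense(SAMPLE_LINES)
for first, last, bidi in condensed:
    print(f"{first:04X}..{last:04X} ; {bidi}")
```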


Why is it then that the default class assignments are only given in this 
file (unless I have overlooked something)? And why is it that they are 
only given in comments? I'm trying to create a program that takes all 
the bidi assignments (including default ones) and creates the data part 
of a bidi algorithm implementation, but I don't feel confident to code 
against stuff that's in comments. Any advice? Is it possible that this 
could be fixed (making it more normative, and putting it in a form 
that's easier to process automatically)?


Regards,   Martin.



Re: Arabic alif-lam ligature

2011-11-09 Thread Jukka K. Korpela

11/8/2011 7:24 PM, Andreas Prilop wrote:


There is a non-standard alif-lam ligature in the Arabic script.
The logo of Al Arabiya shows an example.


The logo as on page http://www.alarabiya.net looks like a rather special 
way of writing the name, but that’s what logos are.



Which fonts have such an alif-lam ligature?


Do some fonts have it, and does the ligature appear in text rendering, 
as opposed to display of logos? I would expect it to be a special 
rendering style, much like in handwriting we produce combinations of 
letters that correspond to ligatures.


 Should I write U+0627 ZWJ  U+0644 to obtain the ligature? Or
 should I write U+0627 ZWNJ U+0644 to prevent the ligature?

Those would be the character-level tools. But normally I would expect 
people to use higher-level protocols, such as commands in a typesetting 
program or style sheets applied to entire blocks of text.
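
As a minimal sketch of the character-level approach, assuming Python and its standard unicodedata module, the two sequences from the question can be built as follows. The variable and function names are illustrative. Whether any ligature actually appears depends entirely on the font and the rendering engine, so this only shows what would be stored in the text.

```python
import unicodedata

ALEF = "\u0627"   # ARABIC LETTER ALEF
LAM = "\u0644"    # ARABIC LETTER LAM
ZWJ = "\u200D"    # ZERO WIDTH JOINER
ZWNJ = "\u200C"   # ZERO WIDTH NON-JOINER

# The two sequences from the question: one requesting, one preventing joining.
request_ligature = ALEF + ZWJ + LAM
prevent_ligature = ALEF + ZWNJ + LAM

def describe(text):
    """List each character as 'U+XXXX NAME' for inspection."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

for label, text in (("with ZWJ", request_ligature),
                    ("with ZWNJ", prevent_ligature)):
    print(label, describe(text))
```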



Or is alif-lam outside the scope of Unicode and just
regarded as a logo?


It’s not a logo as such, but any use that is restricted to logos should 
probably be considered as external to Unicode. If there are fonts that 
contain an alif-lam ligature, then I would expect it to be regarded as a 
possible rendering of a character pair. Typographic ligatures are 
normally encoded as characters in Unicode only if they exist as 
characters in some other character code in use.


Yucca




Re: Default bidi ranges

2011-11-09 Thread Asmus Freytag

On 11/9/2011 1:18 AM, Martin J. Dürst wrote:
I tried to find something like a normative description of the default 
bidi class of unassigned code points.


In UTR #9, it says 
(http://www.unicode.org/reports/tr9/tr9-23.html#Bidirectional_Character_Types):


Unassigned characters are given strong types in the algorithm. This is 
an explicit exception to the general Unicode conformance requirements 
with respect to unassigned characters. As characters become assigned 
in the future, these bidirectional types may change. For assignments 
to character types, see DerivedBidiClass.txt [DerivedBIDI] in the [UCD].


The DerivedBidiClass.txt file, as far as I understand, is mainly a 
condensation of bidi classes into character ranges (rather than giving 
them for each codepoint independently as in UnicodeData.txt). I.e. it 
can at any moment be derived automatically from UnicodeData.txt, and 
is as such not normative.


Why is it then that the default class assignments are only given in 
this file (unless I have overlooked something)? And why is it that 
they are only given in comments?


Because the UnicodeData.txt file has no header (for historical 
compatibility).


Because, like the practice of putting style in HTML inside comments, 
these things (@missing) are in comments to protect older parsers.

I'm trying to create a program that takes all the bidi assignments 
(including default ones) and creates the data part of a bidi algorithm 
implementation, but I don't feel confident to code against stuff 
that's in comments. Any advice? Is it possible that this could be 
fixed (making it more normative, and putting it in a form that's 
easier to process automatically)?


I've confidently parsed these comments for years now.

The one thing that's worse than parsing these comments is to move to an 
incompatible scheme.


That said, apparently, for some properties the default information is 
contained in the PropertyValueAliases.txt file, where it is 
inconveniently located for people who want to parse just one property, 
but conveniently located for those who want to assemble the whole database.
(And, worse, where it adds a code-point dependency to the information in 
that file that wasn't there from the beginning - but at least the 
@missing syntax hasn't changed too much).
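
For readers who have not parsed these comments before, a minimal Python sketch follows. It assumes the comment takes the form "# @missing: first..last; value" in a single-property file, with an extra leading property-name field in PropertyValueAliases.txt. The regular expression, the function name, and the sample input are illustrative and are not copied from a real UCD file.

```python
import re

# Matches lines such as "# @missing: 0000..10FFFF; Left_To_Right".
MISSING_RE = re.compile(
    r"^#\s*@missing:\s*([0-9A-Fa-f]{4,6})\.\.([0-9A-Fa-f]{4,6})\s*;\s*(.+?)\s*$"
)

def parse_missing(lines):
    """Return (first, last, fields) for each @missing comment line.

    'fields' holds the remaining semicolon-separated values: one value in a
    single-property file, or property name plus value in an aliases file.
    """
    result = []
    for line in lines:
        m = MISSING_RE.match(line)
        if m:
            first = int(m.group(1), 16)
            last = int(m.group(2), 16)
            fields = [f.strip() for f in m.group(3).split(";")]
            result.append((first, last, fields))
    return result

# Illustrative input only.
SAMPLE = """\
# An ordinary comment, which is skipped.
# @missing: 0000..10FFFF; Left_To_Right
# @missing: 0000..10FFFF; Bidi_Class; Left_To_Right
0041..005A    ; L # an ordinary data line, also skipped here
"""

defaults = parse_missing(SAMPLE.splitlines())
print(defaults)
```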


A./



Re: Default bidi ranges

2011-11-09 Thread Ken Whistler

On 11/9/2011 9:30 AM, Asmus Freytag wrote:

On 11/9/2011 1:18 AM, Martin J. Dürst wrote:
I tried to find something like a normative description of the default 
bidi class of unassigned code points.


In UTR #9, it says 
(http://www.unicode.org/reports/tr9/tr9-23.html#Bidirectional_Character_Types):


Unassigned characters are given strong types in the algorithm. This 
is an explicit exception to the general Unicode conformance 
requirements with respect to unassigned characters. As characters 
become assigned in the future, these bidirectional types may change. 
For assignments to character types, see DerivedBidiClass.txt 
[DerivedBIDI] in the [UCD].


That *is* the normative description of the default Bidi_Class for 
unassigned code points.




The DerivedBidiClass.txt file, as far as I understand, is mainly a 
condensation of bidi classes into character ranges (rather than 
giving them for each codepoint independently as in UnicodeData.txt). 
I.e. it can at any moment be derived automatically from 
UnicodeData.txt, and is as such not normative.


Because the default values for Bidi_Class are complicated, and cannot be 
derived simply by parsing the values for *assigned* characters in 
UnicodeData.txt, the listing of the default values for Bidi_Class in 
DerivedBidiClass.txt has to be taken as normative for those values.
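
A rough Python sketch of what range-dependent defaults mean for a lookup table is shown below. The two ranges and the fallback value are placeholders chosen for illustration; the authoritative list is whatever DerivedBidiClass.txt states for the Unicode version in use. The function names are hypothetical as well.

```python
from bisect import bisect_right

# Placeholder default ranges as (first, last, bidi_class), sorted by 'first'.
# Assumption: real values must be taken from DerivedBidiClass.txt.
DEFAULT_RANGES = [
    (0x0590, 0x05FF, "R"),
    (0x0600, 0x07BF, "AL"),
]
FALLBACK = "L"  # assumed overall default outside the listed ranges

def default_bidi_class(cp, ranges=DEFAULT_RANGES, fallback=FALLBACK):
    """Return the default class for a code point with no explicit assignment."""
    starts = [r[0] for r in ranges]
    i = bisect_right(starts, cp) - 1
    if i >= 0 and ranges[i][0] <= cp <= ranges[i][1]:
        return ranges[i][2]
    return fallback

def bidi_class(cp, assigned):
    """Prefer the explicit value for assigned characters, else the default."""
    return assigned.get(cp) or default_bidi_class(cp)

# Tiny explicit table standing in for data parsed from UnicodeData.txt.
ASSIGNED = {0x0030: "EN", 0x05D0: "R"}
print(bidi_class(0x0030, ASSIGNED), bidi_class(0x05FF, ASSIGNED))
```

The point of the two-step lookup is that the explicit table alone cannot tell an unassigned code point in a right-to-left block from one elsewhere; the range list carries that information.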



Why is it then that the default class assignments are only given in 
this file (unless I have overlooked something)? And why is it that 
they are only given in comments?


Because the UnicodeData.txt file has no header (for historical 
compatibility).


Because, like the practice of putting style in HTML inside comments, 
these things (@missing) are in comments to protect older parsers.


And to go beyond what Asmus said there, the @missing hack was created as
a syntax for specifying *the* default value for properties where it makes
sense that they have a *single* default value. It doesn't work for
specifying multiple default values differing by code point range. Hence
the addition of the @missing comment in DerivedBidiClass.txt (or its
potential addition to PropertyValueAliases.txt) doesn't suffice for the
entire definition.

I'm trying to create a program that takes all the bidi assignments 
(including default ones) and creates the data part of a bidi 
algorithm implementation, but I don't feel confident to code against 
stuff that's in comments. Any advice?


Use the values in the comments.

Remember that this is not *code* with comments that get stripped out
before compiling. These are text data files for parsing. The fact that
people are already parsing the @missing statements indicates that those
are being treated normatively now. You could say the same thing for the
titles, dates, and copyright notices on these data files: they aren't
optional content to be ignored.

Is it possible that this could be fixed (making it more normative, 
and putting it in a form that's easier to process automatically)?


This is part of a very large problem for creating a more complete and
machine-parseable means of accessing *all* of the Unicode character
property data, including data about the *status* of properties and their
default values. It won't, IMO, be fixed by individual file fixes one at a
time, although incremental improvement can be helpful.

Note that the UCD in XML was created to address this problem in part, but
it still cannot answer many questions about the status of properties,
their full derivations, their interactions, and their functions.

--Ken






tips on writing character proposal

2011-11-09 Thread Larson, Timothy E.
Hello!

I'm new here, but have already read some of the online documentation for 
proposing new characters.  I'm still a bit unsure how to go about it.  Or even 
who can do it.  Can individuals submit ideas, or do you need to be the 
representative of some agency or group?  How much supporting background 
information is deemed sufficient?  Where do I find details (more than just the 
pipeline table) of current pending proposals?

Here are my ideas in very abbreviated form.  If these are non-starters from the 
beginning, I'd as soon know it sooner rather than later.

These first several self-descriptive shapes are simply things I've seen 
suggested and wished for online for some time.

2B5A	CLOCKWISE SPIRAL
2B5B	COUNTER-CLOCKWISE SPIRAL
2B5C	CLOCKWISE DOUBLE SPIRAL
2B5D	COUNTER-CLOCKWISE DOUBLE SPIRAL

The next several are a response to a perceived deficiency in standardization of 
religious symbols. I suggest starting these cultural symbols at 2BC0 to 
distinguish them from the generic/geometric symbols earlier in the block.  Very 
brief description/background given.

2BC0	ICHTHYS =Jesus fish, symbol used by ancient Christians for 
identification, denotes non-denominational and inter-denominational 
Christianity in modern times
2BC1	TRIQUETRA =three-lobed vesicae piscis, used in Christianity and 
ancient/modern paganism
2BC2	MENORAH =7-branched temple lamp, ancient symbol of Judaism
2BC3	HANUKIAH =9-branched Hanukkah lamp

Thank you,
Tim




Re: tips on writing character proposal

2011-11-09 Thread Jukka K. Korpela

11/9/2011 10:58 PM, Larson, Timothy E. wrote:


I'm new here, but have already read some of the online documentation
for proposing new characters.

I think that a key statement that you have missed is at the end of
http://www.unicode.org/pending/symbol-guidelines.html
which says:
“The fact that a symbol merely ‘seems to be useful or potentially 
useful’ is precisely not a reason to code it. Demonstrated usage, or 
demonstrated demand, on the other hand, does constitute a good reason to 
encode the symbol.”


Note that the usage or demand needs to relate to use in text, not as 
standalone symbols. Moreover, demonstrated actual usage in texts tends 
to have much better chances than even well-described demand.


Yucca





Re: tips on writing character proposal

2011-11-09 Thread vanisaac
From: Larson, Timothy E. TELarson_at_west.com

 Hello!
 
 I'm new here, but have already read some of the online documentation for 
 proposing new characters. I'm still a bit unsure how to go about it. Or even 
 who can do it. Can individuals submit ideas, or do you need to be the 
 representative of some agency or group? How much supporting background 
 information is deemed sufficient? Where do I find details (more than just the 
 pipeline table) of current pending proposals?

You absolutely do not need to be a representative of any company, government, 
organization, or group. I am in no way associated with any such entity 
and successfully proposed a script with ~150 characters. All it takes is a 
dedication to serious research, a large amount of time to dedicate to the 
process, and the tenacity and perseverance to see a long and arduous process 
through to the end. The ability to produce PDFs is also helpful, but not
necessary.

You can take a look at a large number of proposal documents from June by 
following links at the document register
http://std.dkuug.dk/JTC1/SC2/WG2/docs/n4000.pdf . Note that many of the
documents are commentaries, opinions, or discussions of proposals. Look for
any documents called something like "Proposal to encode X" or "Preliminary
Proposal to encode X". Note that preliminary proposals will necessarily be
incomplete.

[snip]

 Thank you,
 Tim 

You're welcome,
Van




editorial: definitively broken link on CLDR online tools to external Unicode Fonts for Ancient Scripts

2011-11-09 Thread Philippe Verdy
The CLDR online tools include a footer that suggests finding Unicode
fonts for Ancient scripts from a web site (greekfonts.teilar.gr) which
is no longer available. Now it redirects to a parking page without
contents.

There's an archive of this page in the Google Cache, which shows that
the site is not just temporarily unavailable, but that it has been
closed indefinitely:

http://webcache.googleusercontent.com/search?q=cache:1oTvcjcKed4J:greekfonts.teilar.gr/+Unicode+Fonts+for+Ancient+Scripts&cd=1&hl=fr&ct=clnk&gl=fr

Can the online CLDR tools (referenced not just by the CLDR project
documentation and examples, but also in some technical references
of the Unicode standard) drop this "Unicode Fonts for Ancient
Scripts" link appearing at the bottom of pages (for example
http://unicode.org/cldr/utility/bidi.jsp), or suggest another good
guide to available fonts for old/rare scripts, if possible not a
commercial one (i.e. not a foundry site directly selling its own fonts)?

For example, I can propose the "Gallery of Unicode Fonts" on the WAZU JAPAN
site (http://www.wazu.jp/) as a complement to the existing "Large,
multi-script Unicode fonts for Windows computers" page on Alan Wood's
Unicode Reference site (http://www.alanwood.net/unicode/fonts.html).
This would be the second largest online database with good content
and neutral to font vendors; we had better reference and keep it
now, in case the WAZU page ever disappears (it is better to have a
second one available now in case the only working one that remains
ever has problems).

-- Philippe.



Re: tips on writing character proposal

2011-11-09 Thread Mark E. Shoulson

On 11/09/2011 03:58 PM, Larson, Timothy E. wrote:

Hello!

I'm new here, but have already read some of the online documentation for 
proposing new characters.  I'm still a bit unsure how to go about it.  Or even 
who can do it.  Can individuals submit ideas, or do you need to be the 
representative of some agency or group?  How much supporting background 
information is deemed sufficient?  Where do I find details (more than just the 
pipeline table) of current pending proposals?


There are others here who will throw even more cold water on some of 
these ideas, but I can suggest that you read 
http://www.unicode.org/pending/symbol-guidelines.html for some ideas 
about what is encodable and what isn't.  You'll probably find plenty of 
exceptions, but it's a start.




Here are my ideas in very abbreviated form.  If these are non-starters from the 
beginning, I'd as soon know it sooner rather than later.

These first several self-descriptive shapes are simply things I've seen 
suggested and wished for online for some time.

2B5A	CLOCKWISE SPIRAL
2B5B	COUNTER-CLOCKWISE SPIRAL
2B5C	CLOCKWISE DOUBLE SPIRAL
2B5D	COUNTER-CLOCKWISE DOUBLE SPIRAL


These might well be non-starters.  Think about the first question you'd 
be asked: Why should these be encoded?  Is there any reason we should be 
considering these symbols plain text that need to be encoded as such?  
Or is it just because they're common simple geometric symbols?  While it 
is true that a lot of simple geometric symbols have been encoded, it 
generally has not been *because* they are simple geometric symbols, but 
rather because they were encoded in some other standard once before, or 
because they are used as plain text in some settings.


The next several are a response to a perceived deficiency in standardization of 
religious symbols. I suggest starting these cultural symbols at 2BC0 to 
distinguish them from the generic/geometric symbols earlier in the block.  Very 
brief description/background given.

2BC0	ICHTHYS =Jesus fish, symbol used by ancient Christians for 
identification, denotes non-denominational and inter-denominational Christianity in 
modern times
2BC1	TRIQUETRA =three-lobed vesicae piscis, used in Christianity and 
ancient/modern paganism
2BC2	MENORAH =7-branched temple lamp, ancient symbol of Judaism
2BC3	HANUKIAH =9-branched Hanukkah lamp

Apply the same question.  What makes these symbols plain text?  To be 
sure, there are other religious symbols in Unicode, particularly in the 
MISCELLANEOUS SYMBOLS and DINGBATS blocks, but those are mainly there 
because they were formerly encoded in, say, Zapf Dingbats, or are 
commonly used as map symbols.  (You might actually be able to find some 
support for these, though, but don't ask me where.)


It's a very common mistake, in coming to Unicode, to think "Oh, it would 
be *so great* if these things were encoded!"  But Unicode isn't about 
encoding what would be neat to encode.  It's about encoding _text_, 
(including things that have been encoded before).


~mark




Re: tips on writing character proposal

2011-11-09 Thread Asmus Freytag

On 11/9/2011 6:08 PM, Mark E. Shoulson wrote:

On 11/09/2011 03:58 PM, Larson, Timothy E. wrote:

Hello!

I'm new here, but have already read some of the online documentation 
for proposing new characters.  I'm still a bit unsure how to go about 
it.  Or even who can do it.  Can individuals submit ideas, or do you 
need to be the representative of some agency or group?  How much 
supporting background information is deemed sufficient?  Where do I 
find details (more than just the pipeline table) of current pending 
proposals?


There are others here who will throw even more cold water on some of 
these ideas, but I can suggest that you read 
http://www.unicode.org/pending/symbol-guidelines.html for some ideas 
about what is encodable and what isn't.  You'll probably find plenty 
of exceptions, but it's a start.


Timothy,

Before you get totally discouraged, I'd like to point out that there are 
few open and shut cases in character encoding. The chances of getting your 
proposed characters encoded improve the better the use case and the better 
the documented examples of actual use (usually in print or in examples that 
should be convertible to print). The fact that you think a character 
is missing is evidence that there's at least one potential user.


Your task, in writing a proposal, would be to document that you are not 
alone (far from it) and that these symbols are used in text(s) on equal 
footing with other symbols. Doing the research and writing a proposal 
does take some work, and critics will be hovering to point out all 
shortcomings. But that should help improve your proposal.




Here are my ideas in very abbreviated form.  If these are 
non-starters from the beginning, I'd as soon know it sooner rather 
than later.


These first several self-descriptive shapes are simply things I've 
seen suggested and wished for online for some time.


2B5A	CLOCKWISE SPIRAL
2B5B	COUNTER-CLOCKWISE SPIRAL
2B5C	CLOCKWISE DOUBLE SPIRAL
2B5D	COUNTER-CLOCKWISE DOUBLE SPIRAL


These might well be non-starters.  Think about the first question 
you'd be asked: Why should these be encoded?  Is there any reason we 
should be considering these symbols plain text that need to be 
encoded as such?  Or is it just because they're common simple 
geometric symbols?  While it is true that a lot of simple geometric 
symbols have been encoded, it generally has not been *because* they 
are simple geometric symbols, but rather because they were encoded in 
some other standard once before, or because they are used as plain 
text in some settings.


Before you see this as a definite answer, let me give you a suggestion 
of a different opinion.


A common usage of these symbols in text is in non-verbal speech 
bubbles in cartoons. While these bubbles may look hand-drawn, they are 
very often actually typeset. The one exception being just those strings 
of symbols.


Since, in the examples that I am thinking of, they are presented as 
text and their layout (on a line) is in no way different than text 
presentation, it's not possible to simply rule these out categorically.


When symbols, however arbitrary, can be demonstrated as being used as 
part of writing, there's no good rationale to refuse their encoding. 
Doing so would simply send the message that arbitrary symbols are fine 
if they occur in just a subset of (more formal, e.g. mathematical) texts 
or on electronic platforms, but not elsewhere. That seems in violation 
of precedent and in violation of the universal scope of the standard.


Now, you may not find examples of all types of spiral. Unless logically 
required by formal notation, I would, in that case, propose only those 
that can be found as in use. Completion of the set can be an argument 
in favor of encoding, but not everything is a member of a set worth 
completing.





The next several are a response to a perceived deficiency in 
standardization of religious symbols. I suggest starting these 
cultural symbols at 2BC0 to distinguish them from the 
generic/geometric symbols earlier in the block.  Very brief 
description/background given.


2BC0	ICHTHYS =Jesus fish, symbol used by ancient Christians for 
identification, denotes non-denominational and inter-denominational 
Christianity in modern times
2BC1	TRIQUETRA =three-lobed vesicae piscis, used in Christianity 
and ancient/modern paganism
2BC2	MENORAH =7-branched temple lamp, ancient symbol of Judaism
2BC3	HANUKIAH =9-branched Hanukkah lamp

Apply the same question.  What makes these symbols plain text?  To be 
sure, there are other religious symbols in Unicode, particularly in 
the MISCELLANEOUS SYMBOLS and DINGBATS blocks, but those are mainly 
there because they were formerly encoded in, say, Zapf Dingbats, or 
are commonly used as map symbols.  (You might actually be able to find 
some support for these, though, but don't ask me where.)


I think these are great research candidates. I concur with the skeptics 
here that the mere existence of