Re: Concise term for non-ASCII Unicode characters

2015-09-29 Thread Mark Davis ☕️
I think the term "non-ASCII Unicode" is just fine, and we don't need
anything beyond that. It clearly means those Unicode characters that aren't
in sense (2) of http://unicode.org/glossary/#ASCII.


Mark 

*— Il meglio è l’inimico del bene —*

On Tue, Sep 29, 2015 at 6:20 PM, Sean Leonard 
wrote:

> On 9/21/2015 5:17 PM, Peter Constable wrote:
>
>> If you think it's a serious problem that there isn't one conventional
>> term for "characters outside the ASCII repertoire" or "UTF-8
>> multi-code-unit encoded representations" (since different authors could
>> devise different terminology solutions), then I suggest you submit a
>> document to UTC explaining why it's a problem, documenting inconsistent or
>> unclear terminology that's been used in some standards / public
>> specifications, and requesting that Unicode formally define terminology for
>> these concepts. I can't guarantee that UTC will do it, but I can predict
>> with confidence that it _won't_ do anything of that nature if nobody
>> submits such a document. Peter
>>
>
> I am of the mind to do just that, then. I have seen different documents,
> standards, and standards bodies that have invented terminology around this
> term, and they are not always the same. Since these standards depend on
> Unicode, it would make a lot of sense for Unicode formally to define
> terminology for these concepts. With the proliferation of UTF-8 (among
> other things), the boundary between 0x7F - 0x80 is more significant than
> the boundary between 0xFFFF - 0x10000.
>
> Since this will be my first submission I would appreciate a co-author on
> this topic. Is anyone willing to help? Thanks in advance. Also, it is not
> clear if such a document is destined to become a Unicode Technical Report
> (UTR / PDUTR etc.), or if it should just be an informal write-up. I am
> guessing this is supposed to be somewhat informal but at the same time it
> (or the results of it) ought to appear in the UTC Document Search.
>
> The current terminology that I am considering pursuing is "beyond ASCII",
> in various permutations, such as "beyond the ASCII range", "characters
> beyond ASCII", "code points beyond ASCII", etc. The term "beyond" implies a
> certain directionality, and to that extent, implies the Unicode repertoire
> as well as a Unicode encoding. We have seen on this list the backflips
> required to clarify "non-ASCII", since things that are not ASCII literally
> could be a wide range of things.
>
> I think there is some confusion about whether the term "Basic Latin"
> excludes the C0 control character range. Formally the standard seems clear
> enough to me that it is coterminous with ASCII, but there is still
> confusion if you don't pore through the Standard. My thought is that maybe
> the Blocks.txt data should be modified to say "ASCII (Basic Latin)" instead
> of just "Basic Latin". (If we "go there", I would appreciate the wisdom of
> an experienced Unicode co-author. I am not confident touching that just by
> myself.)
>
> Sean
>


Re: Concise term for non-ASCII Unicode characters

2015-09-29 Thread Daniel Bünzli
I would say there's already enough terminology in the Unicode world to add more 
to it. This thread already hinted at enough ways of expressing what you'd like, 
the simplest one being "scalar values greater than U+001F". This is the 
clearest you can come up with and anybody who has basic knowledge of the 
Unicode standard will immediately understand what you are talking about without 
having to look up further definitions. 

Best, 

Daniel




Re: Concise term for non-ASCII Unicode characters

2015-09-29 Thread Sean Leonard

On 9/21/2015 5:17 PM, Peter Constable wrote:
If you think it's a serious problem that there isn't one conventional 
term for "characters outside the ASCII repertoire" or "UTF-8 
multi-code-unit encoded representations" (since different authors 
could devise different terminology solutions), then I suggest you 
submit a document to UTC explaining why it's a problem, documenting 
inconsistent or unclear terminology that's been used in some standards 
/ public specifications, and requesting that Unicode formally define 
terminology for these concepts. I can't guarantee that UTC will do it, 
but I can predict with confidence that it _won't_ do anything of that 
nature if nobody submits such a document. Peter 


I am of the mind to do just that, then. I have seen different documents, 
standards, and standards bodies that have invented terminology around 
this term, and they are not always the same. Since these standards 
depend on Unicode, it would make a lot of sense for Unicode formally to 
define terminology for these concepts. With the proliferation of UTF-8 
(among other things), the boundary between 0x7F - 0x80 is more 
significant than the boundary between 0xFFFF - 0x10000.
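(Editor's illustration, not part of the original message: the 0x7F/0x80 boundary is where UTF-8 switches from one byte per character to two or more, while the 0xFFFF/0x10000 boundary only separates three-byte from four-byte sequences. A minimal Python sketch:)

```python
# Hedged sketch: UTF-8 encoded length on each side of the two boundaries
# discussed in the message above.
for cp in (0x7F, 0x80, 0xFFFF, 0x10000):
    encoded = chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
# U+007F -> 1 byte(s): 7f
# U+0080 -> 2 byte(s): c2 80
# U+FFFF -> 3 byte(s): ef bf bf
# U+10000 -> 4 byte(s): f0 90 80 80
```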


Since this will be my first submission I would appreciate a co-author on 
this topic. Is anyone willing to help? Thanks in advance. Also, it is 
not clear if such a document is destined to become a Unicode Technical 
Report (UTR / PDUTR etc.), or if it should just be an informal write-up. 
I am guessing this is supposed to be somewhat informal but at the same 
time it (or the results of it) ought to appear in the UTC Document Search.


The current terminology that I am considering pursuing is "beyond 
ASCII", in various permutations, such as "beyond the ASCII range", 
"characters beyond ASCII", "code points beyond ASCII", etc. The term 
"beyond" implies a certain directionality, and to that extent, implies 
the Unicode repertoire as well as a Unicode encoding. We have seen on 
this list the backflips required to clarify "non-ASCII", since things 
that are not ASCII literally could be a wide range of things.


I think there is some confusion about whether the term "Basic Latin" 
excludes the C0 control character range. Formally the standard seems 
clear enough to me that it is coterminous with ASCII, but there is still 
confusion if you don't pore through the Standard. My thought is that 
maybe the Blocks.txt data should be modified to say "ASCII (Basic 
Latin)" instead of just "Basic Latin". (If we "go there", I would 
appreciate the wisdom of an experienced Unicode co-author. I am not 
confident touching that just by myself.)


Sean


Re: Concise term for non-ASCII Unicode characters

2015-09-29 Thread Daniel Bünzli
On Tuesday, 29 September 2015 at 21:03, Richard Wordingham wrote:
> Too wordy and clearly prone to error!

Yes and maybe that "average engineer" does not understand negation. So clearly 
any of non-ASCII, non-Basic Latin or greater than U+007F cannot fit. Bring in 
the bureaucrats, new terminology is needed, there are not enough useless 
definitions in the Unicode standard, let's add a few more.

Daniel





Re: Concise term for non-ASCII Unicode characters

2015-09-29 Thread Richard Wordingham
On Tue, 29 Sep 2015 20:27:28 +0100
Daniel Bünzli  wrote:

> On Tuesday, 29 September 2015 at 19:50, Ken Whistler wrote:
> > I agree that "scalar values greater than U+007F" doesn't just trip
> > off the tongue, and while technically accurate, it is bad
> > terminology -- precisely because it begs the question "wtf are
> > 'scalar values'?!" for the average engineer.
> 
> And an average engineer knows how to look up definitions, that one
> being precise and exceptionally well defined in the Unicode glossary
> — in stark contrast to the shady (and deceiving for the newbie)
> notion of "character" that you use subsequently in your message.

The glossary might fool a 'newbie' (the declared target audience), but
it's riddled with enough errors to dispel confidence.  Just looking at
the entries before 'ASCII':

OK: 'Abstract character sequence' (if one has a usable understanding of
'abstract character'); 'accent mark', 'acrophonic', 'akshara' (though
the spelling with neither an 'h' nor a dot below is weird);
'algorithm', 'alphabet' (though saying that modern Lao and pointed
Hebrew use alphabets is probably not very helpful),
'alphabetic' (though it's not obvious to me why ARABIC SUKUN is
alphabetic but potentially visible viramas are not), 'alphabetic
sorting', 'annotation', 'apparatus criticus', 'Arabic Indic
digits' (though are 'European digits' derived from the digits of the
eastern part of the Arab world?)

Dodgy:

'Abjad' (living abjads also mark vowels, with some vowels having
characters dignified as 'letters').  Does normal Egyptian hieroglyphic
writing constitute an abjad?

'Abstract character' - but then the definition makes no sense.

'Abugida' - needs 'consonants' and 'vowels' to be qualified by 'most',
otherwise it won't even work for Classical Sanskrit in Devanagari.
Vowel letters and visarga are the principal problems. 

'ANSI' - I don't think the Windows code pages for UTF-8 and UTF-16 are
'ANSI'. 

'Arabic digits' - aren't the European digits used in western Arabic as
native as the eastern Arabic digits (U+0660 etc.) used in eastern
Arabic?

11 more-or-less OK versus 5 dodgy does not generate a great deal of
confidence in the glossary.

I appreciate that the difference between abjad, abugida and alphabet is
difficult to capture, as abjads and abugidas can evolve into
alphabets.

Richard.



Re: Concise term for non-ASCII Unicode characters

2015-09-29 Thread Sean Leonard

On 9/29/2015 12:27 PM, Daniel Bünzli wrote:

On Tuesday, 29 September 2015 at 19:50, Ken Whistler wrote:

I agree that "scalar values greater than U+007F" doesn't just trip off the 
tongue,
and while technically accurate, it is bad terminology -- precisely because it
begs the question "wtf are 'scalar values'?!" for the average engineer.

And an average engineer knows how to look up definitions, that one being precise and 
exceptionally well defined in the Unicode glossary — in stark contrast to the shady (and 
deceiving for the newbie) notion of "character" that you use subsequently in 
your message.

This is not "bad terminology", it's *precise* terminology and what I would like 
to see used in protocols and standards.

Many programmers I talk to are confused by Unicode because their notion of Unicode 
"character" is a chaotic mix of scalar values, code points and their various 
*encodings* (i.e. byte level considerations).


+1

I like the definition of "character" in ASCII:
3.3 Character. A member of a set of elements used for the organization, 
control, or representation of data.


This, by the way, is the exact same definition as in ISO 646, ISO 2022, 
and yes, even ISO 10646 (2003). It was the best of times...


Sean


Re: Concise term for non-ASCII Unicode characters

2015-09-29 Thread Asmus Freytag (t)

On 9/29/2015 8:40 PM, Sean Leonard wrote:

I like the definition of "character" in ASCII:

3.3 Character. A member of a set of elements used for the
organization, control, or representation of data.

This, by the way, is the exact same definition as in ISO 646, ISO
2022, and yes, even ISO 10646 (2003). It was the best of times...


I've always thought that this was not a "definition" as much as some
necessary part of the description.

There surely must be other "elements" that are used that way, and
that are not characters.

A./




Re: Concise term for non-ASCII Unicode characters

2015-09-29 Thread Sean Leonard

On 9/29/2015 11:50 AM, Ken Whistler wrote:



On 9/29/2015 10:30 AM, Sean Leonard wrote:

On 9/29/2015 9:40 AM, Daniel Bünzli wrote:
I would say there's already enough terminology in the Unicode world 
to add more to it. This thread already hinted at enough ways of 
expressing what you'd like, the simplest one being "scalar values 
greater than U+001F". This is the clearest you can come up with and 
anybody who has basic knowledge of the Unicode standard

Uh...I think you mean U+007F? :)


I agree that "scalar values greater than U+007F" doesn't just trip off 
the tongue, and while technically accurate, it is bad terminology -- 
precisely because it begs the question "wtf are 'scalar values'?!" for 
the average engineer.



Perhaps it's because I'm writing to the Unicode crowd, but honestly 
there are a lot of very intelligent software engineers/standards 
folks who do not have the "basic knowledge of the Unicode standard" 
that is being presumed. They want to focus on other parts of their 
systems or protocols, and when it comes to the "text part", they just 
hand-wave and say "Unicode!" and call it a day. ...


Well, from this discussion, and from my experience as an engineer, I 
think this comes down to people in other standards, practices, and 
protocols dealing with the ages old problem of on beyond zebra for 
characters, where the comfortable assumptions that byte=character break 
down and people have to special case their code and documentation. 
Where buffers overrun, where black hat hackers rub their hands in glee, 
and where engineers exclaim, "Oh gawd! I can't just cast this 
character, because it's actually an array!"

And nowadays, we are in the age of universal Unicode. All (well, much, 
anyway) would be cool if everybody were using UTF-32, because then at 
least we'd be back to 32-bit-word=character, and the programming would 
be easier. But UTF-32 doesn't play well with existing protocols and 
APIs and storage and... So instead, we are in the age of "universal 
Unicode and almost always UTF-8."

So that leaves us with two types of characters:

1. "Good characters"

These are true ASCII: U+0000..U+007F. Good because they are all single 
bytes in UTF-8 and because then UTF-8 strings just work like the 
Computer Science God always intended, and we don't have to do anything 
special.

2. "Bad characters"

Everything else: U+0080..U+10FFFF. Bad because they require multiple 
bytes to represent
in UTF-8 and so break all the simple assumptions about string and 
buffer length.
They make for bugs and more bugs and why oh why do I have to keep 
dealing with
edge cases where character boundaries don't line up with allocated 
buffer boundaries?!!


I think we can agree that there are two types of characters -- and 
that those code point ranges correctly identify the sets in question.

The problem then just becomes a matter of terminology (in the 
standards sense of "terminology") -- coming up with usable, clear terms 
for the two sets. To be good terminology, the terms have to be 
identifiable and neither too generic ("good characters" and "bad 
characters") nor too abstruse or wordy ("scalar values less than or 
equal to U+007F" and "scalar values greater than U+007F").

They also need to not be confusing. For example, "single-byte UTF-8" 
and "multi-byte UTF-8" might work for engineers, but it is a confusing 
distinction, because UTF-8 as an encoding form is inherently 
multi-byte, and such terminology would undermine the meaning of UTF-8 
itself.

Finally, to be good terminology, the terms need to have some 
reasonable chance of catching on and actually being used. It is fairly 
pointless to have a "standardized way" of distinguishing the #1 and #2 
types of characters if people either don't know about that 
standardized way or find it misleading or not helpful, and instead 
continue groping about with their existing ad hoc terms anyway.



In the twenty minutes since my last post, I got two different 
responses...and as you pointed out, there are a lot of ways to 
express what one would like. I would prefer one, uniform way (hence, 
"standardized way").


Mark's point was that it is hard to improve on what we already have:

1. ASCII Unicode [characters] (i.e. U+0000..U+007F)

2. Non-ASCII Unicode [characters] (i.e. U+0080..U+10FFFF)

If we just highlight that terminology more prominently, emphasize it 
in the Unicode glossary, and promote it relentlessly, it might catch 
on more generally, and solve the problem.

More irreverently, perhaps we could come up with complete neologisms 
that might be catchy enough to go viral -- at least among the protocol 
writers and engineers who matter for this. Riffing on the small/big 
distinction and connecting it to "u-*nichar*" for the engineers, maybe 
something along the lines of:

1. skinnichar

2. baloonichar

Well, maybe not those! But you get the idea. I'm sure there is a 
budding terminologist out there who could improve on that suggestion!

At any rate, 

Re: Concise term for non-ASCII Unicode characters

2015-09-29 Thread Sean Leonard

On 9/29/2015 9:40 AM, Daniel Bünzli wrote:

I would say there's already enough terminology in the Unicode world to add more to it. 
This thread already hinted at enough ways of expressing what you'd like, the simplest one 
being "scalar values greater than U+001F". This is the clearest you can come up 
with and anybody who has basic knowledge of the Unicode standard

Uh...I think you mean U+007F? :)

Perhaps it's because I'm writing to the Unicode crowd, but honestly 
there are a lot of very intelligent software engineers/standards folks 
who do not have the "basic knowledge of the Unicode standard" that is 
being presumed. They want to focus on other parts of their systems or 
protocols, and when it comes to the "text part", they just hand-wave and 
say "Unicode!" and call it a day. In particular there is a flow-down 
effect where terms from one standards body don't match with another 
standards body, perhaps because they got redefined over time for various 
reasons. The distinction between "characters", "abstract characters", 
"code points", and "scalar values" is not intuitively obvious to people 
without specialized knowledge of text processing issues. The fact that 
(modern implementations of) UTF-8 encoders and decoders are not supposed 
to process the surrogate code points (arbitrarily), for example, is a 
rather advanced topic that presumes knowledge of the interaction between 
UTF-16, UTF-8, what surrogate code points actually are, and the security 
implications of doing so (UTR #36). Furthermore one has to parse the 
distinction between "well-formed" and "ill-formed".
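(Editor's sketch, assuming Python's strict UTF-8 codec: the point about surrogates above can be seen directly, since a well-formed UTF-8 encoder must refuse surrogate code points.)

```python
# Sketch: surrogate code points (U+D800..U+DFFF) are code points but not
# scalar values, so a conforming UTF-8 encoder rejects them.
try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError as exc:
    print("rejected:", exc.reason)  # CPython reports "surrogates not allowed"
```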


In the twenty minutes since my last post, I got two different 
responses...and as you pointed out, there are a lot of ways to express 
what one would like. I would prefer one, uniform way (hence, 
"standardized way"). Just surveying the various standards that have 
tried to tackle this distinction with their own organic terminology will 
probably be revealing. Evidence should be the yardstick.


Best regards,

Sean


Re: Concise term for non-ASCII Unicode characters

2015-09-29 Thread Daniel Bünzli
On Tuesday, 29 September 2015 at 18:30, Sean Leonard wrote:
> Uh...I think you mean U+007F? :)

Yes… see how easy it was to point out that the definition was wrong. It 
would also have been easy if this were code and we were talking about a 
protocol whose specification used this notation rather than a new 
Unicode concept.

> Perhaps it's because I'm writing to the Unicode crowd, but honestly
> there are a lot of very intelligent software engineers/standards folks  
> who do not have the "basic knowledge of the Unicode standard" that is  
> being presumed. They want to focus on other parts of their systems or  
> protocols, and when it comes to the "text part", they just hand-wave and  
> say "Unicode!" and call it a day.

Introducing more terminology and jargon is not going to help in this case. Make 
the definitions as obvious as possible and strive for minimality in the exposed 
concepts.

> The fact that (modern implementations of) UTF-8 encoders and decoders are not 
> supposed to process the surrogate code points (arbitrarily), for example, is a
> rather advanced topic

I wouldn't say this is advanced knowledge; this is basic knowledge any 
programmer dealing with Unicode text should have. FWIW this [1] is the 
absolute minimal knowledge I think programmers should have about Unicode 
(the last section can be skipped; it's specific to a programming 
language). This corresponds to maybe 3 to 4 A4 pages. If your 
programmers are not able to grok this small amount of knowledge, hire 
better ones.

Best,  

Daniel

[1] http://erratique.ch/software/uucp/doc/Uucp.html#uminimal 



Re: Concise term for non-ASCII Unicode characters

2015-09-29 Thread Ken Whistler



On 9/29/2015 10:30 AM, Sean Leonard wrote:

On 9/29/2015 9:40 AM, Daniel Bünzli wrote:
I would say there's already enough terminology in the Unicode world 
to add more to it. This thread already hinted at enough ways of 
expressing what you'd like, the simplest one being "scalar values 
greater than U+001F". This is the clearest you can come up with and 
anybody who has basic knowledge of the Unicode standard

Uh...I think you mean U+007F? :)


I agree that "scalar values greater than U+007F" doesn't just trip off 
the tongue, and while technically accurate, it is bad terminology -- 
precisely because it begs the question "wtf are 'scalar values'?!" for 
the average engineer.



Perhaps it's because I'm writing to the Unicode crowd, but honestly 
there are a lot of very intelligent software engineers/standards folks 
who do not have the "basic knowledge of the Unicode standard" that is 
being presumed. They want to focus on other parts of their systems or 
protocols, and when it comes to the "text part", they just hand-wave 
and say "Unicode!" and call it a day. ...


Well, from this discussion, and from my experience as an engineer, I 
think this comes down to people in other standards, practices, and 
protocols dealing with the ages old problem of on beyond zebra for 
characters, where the comfortable assumptions that byte=character break 
down and people have to special case their code and documentation. 
Where buffers overrun, where black hat hackers rub their hands in glee, 
and where engineers exclaim, "Oh gawd! I can't just cast this 
character, because it's actually an array!"

And nowadays, we are in the age of universal Unicode. All (well, much, 
anyway) would be cool if everybody were using UTF-32, because then at 
least we'd be back to 32-bit-word=character, and the programming would 
be easier. But UTF-32 doesn't play well with existing protocols and 
APIs and storage and... So instead, we are in the age of "universal 
Unicode and almost always UTF-8."

So that leaves us with two types of characters:

1. "Good characters"

These are true ASCII: U+0000..U+007F. Good because they are all single 
bytes in UTF-8 and because then UTF-8 strings just work like the 
Computer Science God always intended, and we don't have to do anything 
special.

2. "Bad characters"

Everything else: U+0080..U+10FFFF. Bad because they require multiple 
bytes to represent
in UTF-8 and so break all the simple assumptions about string and buffer 
length.
They make for bugs and more bugs and why oh why do I have to keep 
dealing with
edge cases where character boundaries don't line up with allocated 
buffer boundaries?!!
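(Editor's sketch of exactly that breakage: character count and UTF-8 byte count diverge as soon as a "bad character" appears.)

```python
# Sketch: once non-ASCII characters appear, length-in-characters no
# longer equals length-in-UTF-8-bytes -- the source of the buffer bugs.
for text in ("cafe", "caf\u00e9", "caf\u00e9 \u20ac"):
    data = text.encode("utf-8")
    print(f"{text!r}: {len(text)} characters, {len(data)} UTF-8 bytes")
# 'cafe': 4 characters, 4 UTF-8 bytes
# 'café': 4 characters, 5 UTF-8 bytes
# 'café €': 6 characters, 9 UTF-8 bytes
```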


I think we can agree that there are two types of characters -- and 
that those code point ranges correctly identify the sets in question.

The problem then just becomes a matter of terminology (in the 
standards sense of "terminology") -- coming up with usable, clear terms 
for the two sets. To be good terminology, the terms have to be 
identifiable and neither too generic ("good characters" and "bad 
characters") nor too abstruse or wordy ("scalar values less than or 
equal to U+007F" and "scalar values greater than U+007F").

They also need to not be confusing. For example, "single-byte UTF-8" 
and "multi-byte UTF-8" might work for engineers, but it is a confusing 
distinction, because UTF-8 as an encoding form is inherently 
multi-byte, and such terminology would undermine the meaning of UTF-8 
itself.

Finally, to be good terminology, the terms need to have some 
reasonable chance of catching on and actually being used. It is fairly 
pointless to have a "standardized way" of distinguishing the #1 and #2 
types of characters if people either don't know about that 
standardized way or find it misleading or not helpful, and instead 
continue groping about with their existing ad hoc terms anyway.



In the twenty minutes since my last post, I got two different 
responses...and as you pointed out, there are a lot of ways to express 
what one would like. I would prefer one, uniform way (hence, 
"standardized way").


Mark's point was that it is hard to improve on what we already have:

1. ASCII Unicode [characters] (i.e. U+0000..U+007F)

2. Non-ASCII Unicode [characters] (i.e. U+0080..U+10FFFF)

If we just highlight that terminology more prominently, emphasize it 
in the Unicode glossary, and promote it relentlessly, it might catch 
on more generally, and solve the problem.

More irreverently, perhaps we could come up with complete neologisms 
that might be catchy enough to go viral -- at least among the protocol 
writers and engineers who matter for this. Riffing on the small/big 
distinction and connecting it to "u-*nichar*" for the engineers, maybe 
something along the lines of:

1. skinnichar

2. baloonichar

Well, maybe not those! But you get the idea. I'm sure there is a 
budding terminologist out there who could improve on that suggestion!

At any rate, any formal contribution that suggests coming 

Re: Concise term for non-ASCII Unicode characters

2015-09-29 Thread Richard Wordingham
On Tue, 29 Sep 2015 17:40:47 +0100
Daniel Bünzli  wrote:

> I would say there's already enough terminology in the Unicode world
> to add more to it. This thread already hinted at enough ways of
> expressing what you'd like, the simplest one being "scalar values
> greater than U+001F".

Too wordy and clearly prone to error!

Richard.



Re: Concise term for non-ASCII Unicode characters

2015-09-28 Thread Sean Leonard

To follow up on this thread:

It appears that ASCII is in fact a defined term in the Unicode glossary, 
and this term is sufficiently broad.


http://unicode.org/glossary/#ASCII

ASCII is sufficient to identify the range 0 - 127, whether that is 
simply a "range", "characters", "code points", or "scalar values". 
(Since they are all the same in that range 0 - 127.)


This leaves open the question of how to define the range that is not 0 - 
127, but is 128 -> onwards. An e-mail will follow on the topic...


Sean

***
ASCII. (1) The American Standard Code for Information Interchange, a 
7-bit coded character set for information interchange. It is the U.S. 
national variant of ISO/IEC 646 and is formally the U.S. standard ANSI 
X3.4. It was proposed by ANSI in 1963 and finalized in 1968. (2) The set 
of 128 Unicode characters from U+0000 to U+007F, including control codes 
as well as graphic characters. (3) ASCII has been incorrectly used to 
refer to various 8-bit character encodings that include ASCII characters 
in the first 128 code points.
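(Editor's sketch: glossary sense (2) translates directly into a one-line test. The helper name `is_ascii` is hypothetical; Python 3.7+ also ships the equivalent built-in `str.isascii`.)

```python
# Sketch of glossary sense (2): ASCII = the 128 Unicode characters
# U+0000..U+007F, control codes included.
def is_ascii(text: str) -> bool:
    return all(ord(ch) <= 0x7F for ch in text)

print(is_ascii("plain\ttext\n"))  # True: tab and newline are ASCII control codes
print(is_ascii("na\u00efve"))     # False: U+00EF lies beyond U+007F
```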


Re: Concise term for non-ASCII Unicode characters

2015-09-22 Thread Richard Wordingham
On Tue, 22 Sep 2015 08:34:14 -0700
"Doug Ewell"  wrote:
 
> That's why I wrote "non  Basic  Latin."
> 
> But I realize that not all fonts will show this clearly, and that the
> distinction is lost in speech anyway.

I think the difference is actually clearer in speech.

Richard.


Re: Concise term for non-ASCII Unicode characters

2015-09-22 Thread Sean Leonard

On 9/21/2015 9:24 PM, Janusz S. Bien wrote:
Quote/Cytat - Sean Leonard  (Mon 21 Sep 
2015 10:51:42 PM CEST):



Related question as I am researching this:

How can I acquire (cheaply or free) the latest and most official copy 
of US-ASCII, namely, the version that Unicode references?


[...]


Thanks to all. I was able to locate a copy of ANSI X3.4-1986 (R1997) 
[hereinafter ASCII]. (See my subsequent e-mail about the term "ASCII".)




I've never seen the ASCII standard, but I think it is (almost?) 
identical to ISO/IEC 646, which in turn  is identical to the freely 
available ECMA-6:


http://www.ecma-international.org/publications/standards/Ecma-006.htm


Having just read both standards documents in some detail, I can attest 
that they are not the same. However, the practical effect for purposes 
of Unicode is the same.


ECMA-6 (1991) is indeed identical to ISO/IEC 646 (as far as I can tell; 
hereinafter ECMA-6). ECMA-6 "specifies a 7-bit coded character set with 
a number of options" (Clause 1.2). Specifically, the following positions 
are ambiguous or subject to national assignment:

2/3 NUMBER SIGN or POUND SIGN
2/4 DOLLAR SIGN or CURRENCY SIGN
4/0
5/11
5/12
5/13
5/14
6/0
7/11
7/12
7/13
7/14

ECMA-6 specifies an International Reference Version (IRV), which 
exercises the "options". The IRV fills in the graphic characters 
consistent with ASCII. However, ECMA-6 sort of leaves the C0 region 
blank...and the IRV (in Annex A, normative) says "if the C0 set [...] is 
used, it shall be the C0 set of Standard ECMA-48." Sort of fudging. 
Anyway, the IRV C0 set / ECMA-48 set is the same as ASCII.


Overall, the takeaway is that specifying ISO/IEC 646 / ECMA-6 is not 
sufficient; you need to include "IRV" as well, or ISO IR No. 6 for the 
G0 set and ISO IR No. 6 for the C0 set.


In contrast, if you say ASCII (ANSI X3.4-1986), all positions are fully 
defined.


Regards,

Sean


Re: Concise term for non-ASCII Unicode characters

2015-09-22 Thread Richard Wordingham
On Sun, 20 Sep 2015 16:52:29 +0000
Peter Constable  wrote:

> You already have been using "non-ASCII Unicode", which is about as
> concise and sufficiently accurate as you'll get. There's no term
> specifically defined in any standard or conventionally used for this.

As to standards, UTS#18 'Unicode Regular Expression' Requirement RL1.2
requires the support of the 'property' it calls 'ASCII', which is
defined in Section 1.2.1 as the property of being in the range U+ to
U+007F. This implicitly makes 'not ASCII' a derived property held by all
the other codepoints. If you fear that your audience will think that
Latin-1 characters are ASCII, you'll just have to go for the clumsy
'not 7-bit ASCII'  and accept that there isn't an unambiguous way in
English of turning that into an adjective or noun.
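(Editor's sketch: the derived "not ASCII" property above needs no new term in a regular expression; it is just the negated range. UTS #18-conformant engines may also spell the property `\p{ASCII}`, which Python's `re` does not support, so a character class is used here.)

```python
import re

# Sketch: the UTS #18 'ASCII' property is just U+0000..U+007F, so
# "not ASCII" is the negated character class [^\x00-\x7F].
non_ascii = re.compile(r"[^\x00-\x7F]")
print(non_ascii.findall("r\u00e9sum\u00e9 \u2013 na\u00efve"))
# ['é', 'é', '–', 'ï']
```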

If a term were invented, you'd generally have to explain it, and you
would do better just to remind readers what ASCII is.

Richard.


Re: Concise term for non-ASCII Unicode characters

2015-09-22 Thread Philippe Verdy
I would not use the clumsy "7-bit ASCII" due to the confusion created
since long ago, when it could refer to any national version of ISO 646,
which reassigns some code positions in the range 0x00 to 0x7F to other
characters outside the range U+0000 to U+007F, while still remaining a
7-bit encoding. So instead of "7-bit ASCII" I highly prefer the term
"US-ASCII" to make sure it refers to the encoding of 7-bit code
positions effectively to U+0000..U+007F.

So for code positions outside 0x00..0x7F, I would call them "not
US-ASCII" (none of them are bound to any Unicode "character" or "code
point" or "scalar value"; they are just "code positions" or, more
precisely, "octet values with their most significant bit set to 1",
which is really long: "not US-ASCII" is fine as a shorter term).

2015-09-22 9:43 GMT+02:00 Richard Wordingham <
richard.wording...@ntlworld.com>:

> On Sun, 20 Sep 2015 16:52:29 +0000
> Peter Constable  wrote:
>
> > You already have been using "non-ASCII Unicode", which is about as
> > concise and sufficiently accurate as you'll get. There's no term
> > specifically defined in any standard or conventionally used for this.
>
> As to standards, UTS#18 'Unicode Regular Expression' Requirement
> RL1.2 requires the support of the 'property' it calls 'ASCII', which is
> defined in Section 1.2.1 as the property of being in the range U+0000 to
> U+007F. This implicitly makes 'not ASCII' a derived property held by all
> the other codepoints. If you fear that your audience will think that
> Latin-1 characters are ASCII, you'll just have to go for the clumsy
> 'not 7-bit ASCII'  and accept that there isn't an unambiguous way in
> English of turning that into an adjective or noun.
>
> If a term were invented, you'd generally have to explain it, and you
> would do better just to remind readers what ASCII is.
>
> Richard.
>


Re: Concise term for non-ASCII Unicode characters

2015-09-22 Thread Sean Leonard

On 9/22/2015 2:27 AM, Sean Leonard wrote:
Overall, the takeaway is that specifying ISO/IEC 646 / ECMA-6 is not 
sufficient; you need to include "IRV" as well, or ISO IR No. 6 for the 
G0 set and ISO IR No. 6 for the C0 set.


...which the Unicode Standard does specify, by stating "IRV" explicitly 
(Section 2.8, Section 7.1). Hence, there is no Unicode problem.


[Correction: it's IR No. 1 for the C0 set.]



In contrast, if you say ASCII (ANSI X3.4-1986), all positions are 
fully defined.


Regards,

Sean




Re: Concise term for non-ASCII Unicode characters

2015-09-22 Thread Sean Leonard

On 9/22/2015 1:45 AM, Philippe Verdy wrote:
I would not use the clumsy "7-bit ASCII", because of the long-standing 
confusion with the national versions of ISO 646, which reassign some 
code positions in the range 0x00 to 0x7F to characters outside the 
range U+0000 to U+007F while still remaining 7-bit encodings.
So instead of "7-bit ASCII" I much prefer the term "US-ASCII", to make 
sure it refers to the encoding that maps the 7-bit code positions 
exactly to U+0000..U+007F.


So for code positions outside 0x00..0x7F, I would call them "not 
US-ASCII" (none of them are bound to any Unicode "character" or "code 
point" or "scalar value", they are just "code positions" or more 
precisely "octet values with their most significant bit set to 1" 
which is really long: "not US-ASCII" is fine as a shorter term).


Again having just read through ANSI X3.4-1986 (R1997), I would like to 
clarify some things.


The standard itself is titled:
American National Standard for Information Systems - Coded Character 
Sets - 7-Bit American National Standard Code for Information Interchange 
(7-Bit ASCII)


However, Clause 1.1 states:
This standard specifies a set of 128 characters (control characters and 
graphic characters, such as letters, digits, and symbols) with their 
coded representation. The American National Standard Code for 
Information Interchange may also be identified by the acronym ASCII 
(pronounced ask-ee). To explicitly designate a particular (perhaps 
prior) edition of this standard, the last two digits of the year of 
issue may be appended, as in "ASCII 68" or "ASCII 86".



According to the title, "7-Bit ASCII" is proper. However, according to 
the text, "ASCII" is sufficient. The "7-Bit" part really just emphasizes 
the fact that it is a 7-bit standard. The eighth bit is outside the 
scope of the standard (but see clause 2.1.1). (Incidentally, Clause 1.1 
is not Y2K compliant! Thus you should '86 that part of ASCII 86...hehe)


The term "US-ASCII" (see also RFC 2046 for a lot of discussion) is 
similarly redundant. After all, it is the *American* *National* Standard 
Code for Information Interchange. Even if you remove the term "National" 
(which does not appear in ASCII 68 or ASCII 63), it's still American. 
However, ASCII 68 (partially reprinted in RFC 20: 
) actually permits "the notation 
ASCII (pronounced as'-key) or USASCII (pronounced you-sas'-key) [...] to 
mean the code prescribed by the latest issue of the standard". That is 
probably the genesis of US-ASCII. I wasn't alive at the time so I don't 
know. My suspicion is that "US-ASCII" was meant to disambiguate ASCII 86 
from ASCII 68 (which is referred to as "ASCII" in RFC 821) without 
referring to the year, and since 68 and 86 are transposed numerals, 
"US-ASCII" eliminates possible mix-ups.



My conclusion here is that "ASCII" is sufficient when talking about the 
range of (code or character) positions 0 - 127, regardless of how they 
are encoded, so long as they logically evaluate to the bit combinations 
of the 7-bit code described in ANSI X3.4-1986.


"Basic Latin" also works if you want to avoid the historic reference. 
But there are many systems in use that are ASCII-based (including the 
Internet, as RFC 20 is still in force), and the term "ASCII" is peppered 
throughout the Unicode Standard 8.0 with greater frequency than "Basic 
Latin" (which is acknowledged to be a synonym for "ASCII" in Sections 
5.7 and 6.2).


Sean





RE: Concise term for non-ASCII Unicode characters

2015-09-22 Thread Peter Constable
> If a term were invented, you'd generally have to explain it, and you
> would do better just to remind readers what ASCII is.



+1





Peter

Sent from Outlook Mail<http://go.microsoft.com/fwlink/?LinkId=550987> for 
Windows 10





From: Richard Wordingham
Sent: Tuesday, September 22, 2015 12:51 AM
To: unicode@unicode.org
Subject: Re: Concise term for non-ASCII Unicode characters


On Sun, 20 Sep 2015 16:52:29 +
Peter Constable <peter...@microsoft.com> wrote:

> You already have been using "non-ASCII Unicode", which is about as
> concise and sufficiently accurate as you'll get. There's no term
> specifically defined in any standard or conventionally used for this.

As to standards, UTS #18 'Unicode Regular Expressions' Requirement RL1.2
requires the support of the 'property' it calls 'ASCII', which is
defined in Section 1.2.1 as the property of being in the range U+0000 to
U+007F. This implicitly makes 'not ASCII' a derived property held by all
the other code points. If you fear that your audience will think that
Latin-1 characters are ASCII, you'll just have to go for the clumsy
'not 7-bit ASCII' and accept that there isn't an unambiguous way in
English of turning that into an adjective or noun.

If a term were invented, you'd generally have to explain it, and you
would do better just to remind readers what ASCII is.

Richard.




Re: Concise term for non-ASCII Unicode characters

2015-09-22 Thread Doug Ewell
Martin J. Dürst wrote:

>> I was thinking that something like "non–Basic-Latin Unicode" might be
>
> Is that non-Basic Latin or not Basic-Latin?
>
>> useful. It avoids the confusion of referring to ASCII as a range of
>> code points instead of a separate encoding standard.
>
> But as a three-component term with unclear structure, it's confusing
> by itself.

That's why I wrote "non–Basic-Latin."

But I realize that not all fonts will show this clearly, and that the
distinction is lost in speech anyway.

--
Doug Ewell | http://ewellic.org | Thornton, CO 




Re: Concise term for non-ASCII Unicode characters

2015-09-21 Thread Janusz S. Bien
Quote/Cytat - Sean Leonard  (Mon 21 Sep  
2015 10:51:42 PM CEST):



Related question as I am researching this:

How can I acquire (cheaply or free) the latest and most official  
copy of US-ASCII, namely, the version that Unicode references?


[...]

I've never seen the ASCII standard, but I think it is (almost?)
identical to ISO/IEC 646, which in turn is identical to the freely
available ECMA-6:


http://www.ecma-international.org/publications/standards/Ecma-006.htm

Regards

Janusz


--
Prof. dr hab. Janusz S. Bień -  Uniwersytet Warszawski (Katedra  
Lingwistyki Formalnej)

Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/



RE: Concise term for non-ASCII Unicode characters

2015-09-21 Thread Tony Jollans
As an interested outsider may I suggest that the term "ASCII", indeed the 
concept of ASCII, is only of historical interest and should not be used in any 
modern context. Computing is riddled with terms, "word" being another in 
similar vein, that are used to mean something they are not and would be best 
forgotten.

These days, it is pretty sloppy coding that cares how many bytes an encoding of 
something requires, although there may be many circumstances where legacy 
support is required. You say that, in some contexts, one needs to be really 
clear that the octets 0x80 - 0xFF are Unicode. Either something "is" Unicode, 
or it isn't. Either something uses a recognised encoding, or it doesn't. Using 
these octets to represent Unicode code points is not ASCII, is not UTF-8, and 
is not UCS-2/UTF-16; it could, perhaps, be EBCDIC. Whatever it is, say so 
clearly and explicitly and, if necessary, say why; don't look for some 
mealy-mouthed expression to avoid so saying.

Just my twopenn'orth, and no offence meant, but I can't help thinking you're 
looking for something that shouldn't exist.

Best regards,
Tony Jollans


-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Sean Leonard
Sent: 21 September 2015 09:22
To: unicode@unicode.org
Subject: Re: Concise term for non-ASCII Unicode characters

First of all, thank you all for the responses thus far.

On 9/20/2015 5:51 PM, Martin J. Dürst wrote:
> Hello Sean,
>
> On 2015/09/20 23:48, Sean Leonard wrote:
>> What is the most concise term for characters or code points
>
> So we already have two different things we might need a term for. 

> [...]
>>
>> The terms "supplementary character" and "supplementary code point" 
>> are defined in the Unicode standard, referring to characters or code 
>> points above U+FFFF. I am looking for something like those, but for 
>> characters or code points above U+007F.
> Anyway, what I wanted to show is that depending on what you need it 
> for, there are so many different variations that it doesn't pay off to 
> create specific short terms for all of them, and the term you use 
> currently may be short enough.

Well what I am getting at is that when writing standards documents in various 
SDOs (or any other computer science text, for that matter), it is helpful to 
identify these characters/code points.

I think we can limit our inquiry to "characters" and "code points". Both of 
those are well-defined in Unicode (see <http://unicode.org/glossary/>). A 
[Unicode] code point is any value in the range 0 - 0x10FFFF. A [Unicode] 
character is an abstract character that is actually assigned a [Unicode] scalar 
value. Therefore the space is Unicode code point > Unicode scalar value > 
Unicode character.

"supplementary" means outside the BMP, i.e., 0x10000 - 0x10FFFF.
"BMP" means inside the Basic Multilingual Plane, i.e., 0x0 - 0xFFFF.

The problem is that the BMP / supplementary distinction makes sense in a
UCS-2 / UTF-16 universe. But for much interchange these days, UTF-8 is the way 
to go.

I wish that "non-ASCII characters" and "non-ASCII code points" (and non-ASCII 
scalar values) were sufficient for me. Maybe they can be. 
However, in contexts where ASCII is getting extended or supplemented (e.g., in 
the DNS or in e-mail), one needs to be really clear that the octets 0x80 - 0xFF 
are Unicode (specifically UTF-8, I suppose), and not something else.

The expressions "beyond [...] ASCII" or "beyond the ASCII range" (as in, 
characters beyond ASCII, code points beyond ASCII) have some support in the 
Unicode Standard; see, e.g., Section 2.5 "ASCII Transparency" 
paragraph. Additionally as Peter stated, an expression including "Basic Latin 
block" (e.g., characters beyond the Basic Latin block) could work.

FWIW, the term "non-ASCII" is used in e-mail address internationalization 
("EAI") in the IETF; its opposite is "all-ASCII" 
(or simply "ASCII"). (RFCs 6530, 6531, 6532). The term also appears in RFC 2047 
from November 1996 but there it has the more expansive meaning (i.e., not 
limited or targeted to Unicode).

Sean




Re: Concise term for non-ASCII Unicode characters

2015-09-21 Thread Daniel Bünzli
Le lundi, 21 septembre 2015 à 09:22, Sean Leonard a écrit :
> I think we can limit our inquiry to "characters" and "code points". Both
> of those are well-defined in Unicode (see
> <http://unicode.org/glossary/>).

I wouldn't say so. If you actually have a look at the definitions for character 
on this page, there are at least 4 different definitions for the notion of 
character, and if you take the one that has a formal definition attached, i.e. 
synonym for abstract character (D7), then an abstract character can actually be 
represented by a *sequence* of Unicode scalar values.

If you are operating in the context of a standard or technical documentation 
please do use either code points (D9, D10) or scalar values (D76). These 
notions have precise definitions, which makes for saner discussions and 
understanding.

> I wish that "non-ASCII characters" and "non-ASCII code points" (and  
> non-ASCII scalar values) were sufficient for me. Maybe they can be.  
> However, in contexts where ASCII is getting extended or supplemented  
> (e.g., in the DNS or in e-mail), one needs to be really clear that the  
> octets 0x80 - 0xFF are Unicode (specifically UTF-8, I suppose), and not  
> something else.

So it seems that you want terminology to talk about the *encoding* of Unicode 
scalar values, rather than the scalar values themselves. Then I think you should 
specifically avoid terminology like "octets of 0x80-0xFF are Unicode", since 
this doesn't really make sense: there is no Unicode property on octets. You should 
rather say something like "these octets may belong to the UTF-8 encoding scheme 
(D95) of Unicode scalar values greater than U+007F".

Best,  

Daniel





Re: Concise term for non-ASCII Unicode characters

2015-09-21 Thread Martin J. Dürst

Hello Doug,

On 2015/09/22 00:42, Doug Ewell wrote:


I was thinking that something like "non–Basic-Latin Unicode" might be


Is that non-Basic Latin or not Basic-Latin?


useful. It avoids the confusion of referring to ASCII as a range of code
points instead of a separate encoding standard.


But as a three-component term with unclear structure, it's confusing by 
itself.


Regards,   Martin.


RE: Concise term for non-ASCII Unicode characters

2015-09-21 Thread Peter Constable
Check here:

http://webstore.ansi.org/RecordDetail.aspx?sku=INCITS+4-1986%5bR2012%5d


-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Sean Leonard
Sent: Monday, September 21, 2015 1:52 PM
To: unicode@unicode.org
Subject: Re: Concise term for non-ASCII Unicode characters

Related question as I am researching this:

How can I acquire (cheaply or free) the latest and most official copy of 
US-ASCII, namely, the version that Unicode references?

The Unicode Standard 8.0 refers to the following document:

ANSI X3.4: American National Standards Institute. Coded character set—7-bit 
American national standard code for information interchange. New York: 1986. 
(ANSI X3.4-1986).

(See page 294.)

A quick Google search did not yield results. There are public/university 
library hard copies but they are hundreds of miles away from my location.

Sean




RE: Concise term for non-ASCII Unicode characters

2015-09-21 Thread Peter Constable
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Sean Leonard
Sent: Monday, September 21, 2015 1:22 AM

> Well what I am getting at is that when writing standards documents in various 
> SDOs (or any other
> computer science text, for that matter), it is helpful to identify these 
> characters/code points.

[snip]

> However, in contexts where ASCII is getting extended or supplemented (e.g., 
> in the DNS or in e-mail), 
> one needs to be really > clear that the octets 0x80 - 0xFF are Unicode 
> (specifically UTF-8, I suppose), 
> and not something else.

Well, if you are writing standards that "extend ASCII", then you need to be 
completely clear that what is being discussed is _not ASCII_. In that sense, I 
agree with Tony Jollans' comments: be clear about what it is that is being 
discussed — including what coded character set, or what encoding form for what 
coded character set.


> FWIW, the term "non-ASCII" is used in e-mail address internationalization 
> ("EAI") in the IETF; its 
> opposite is "all-ASCII" (or simply "ASCII"). (RFCs 6530, 6531, 6532). The 
> term also appears in RFC 
> 2047 from November 1996 but there it has the more expansive meaning (i.e., 
> not limited or 
> targeted to Unicode).

Glancing at the Introduction for RFC 6530, it seems to have clear terminology:

" Without the extensions specified in this document, the mailbox name is 
restricted to a subset of 7-bit ASCII [RFC5321].  Though MIME [RFC2045] enables 
the transport of non-ASCII data..."

Here, "ASCII" means ASCII — the 7-bit encoding originally defined as ANSI X3.4. 
And "non-ASCII data" appears to mean data involving any characters other than 
those in the ASCII coded character set, or any data represented in any other 
encoded representation but ASCII. The term "all-ASCII" is used in section 4.2, 
but it is immediately defined: 

"In this document, an address is "all-ASCII", or just an "ASCII address", if 
every character in the address is in the ASCII character repertoire [ASCII]; an 
address is "non-ASCII", or an "i18n-address", if any character is not in the 
ASCII character repertoire."

So, it seems like they had a similar terminology need to what you describe, and 
they handled it in a satisfactory, clear way.
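The RFC 6530 vocabulary quoted above is simple to apply in code. A minimal sketch (the helper name is ours, not the RFC's):

```python
# RFC 6530 terms: an address is "all-ASCII" if every character is in the
# ASCII repertoire, and "non-ASCII" (an "i18n-address") otherwise.
def classify_address(address: str) -> str:
    return "all-ASCII" if address.isascii() else "non-ASCII"

print(classify_address("user@example.com"))   # all-ASCII
print(classify_address("usér@example.com"))   # non-ASCII
```

`str.isascii()` checks exactly the repertoire condition RFC 6530 describes: every character in the range U+0000..U+007F.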


If what you need to describe is UTF-8 sequences of two or more bytes, then I 
would be clear that the context is Unicode UTF-8, not ASCII or any other coded 
character set / encoding form; and I would say, "Unicode UTF-8 code unit 
sequences of two to four bytes" or "Unicode UTF-8 multi-byte sequences" or 
something along those lines.

If you think it's a serious problem that there isn't one conventional term for 
"characters outside the ASCII repertoire" or "UTF-8 multi-code-unit encoded 
representations" (since different authors could devise different terminology 
solutions), then I suggest you submit a document to UTC explaining why it's a 
problem, documenting inconsistent or unclear terminology that's been used in 
some standards / public specifications, and requesting that Unicode formally 
define terminology for these concepts. I can't guarantee that UTC will do it, 
but I can predict with confidence that it _won't_ do anything of that nature if 
nobody submits such a document.



Peter



Re: Concise term for non-ASCII Unicode characters

2015-09-21 Thread Richard Wordingham
On Mon, 21 Sep 2015 20:54:23 +0100
"Tony Jollans"  wrote:

> Windows code pages and their ilk predate Unicode, and I would only
> ever expect to see them used in environments where legacy support is
> needed, and would not expect a significant amount of new
> documentation about them to be written.

So at what version did Windows ditch 'ANSI code pages' as the default
for users' 'plain text'?

> Nor, as
> far as I'm aware, do the 0x80 to 0xFF octets have any special meaning
> in Unicode that would require there to be a recognisable term to
> describe them. 

Such 8-bit *code units* are unambiguous indicators that one code
unit = one code point no longer applies.  The 16-bit analogue to ASCII
v. non-ASCII in scalar values, namely the BMP v. supplementary planes,
has a fair amount of terminology.  Indeed, there is a special
terminology for the 16-bit analogue of octets with the high bit set, the
existence of the Latin-1 Supplement block - the number 0xC2 serves a
double rôle as U+00C2 LATIN CAPITAL LETTER A WITH CIRCUMFLEX and as a
UTF-8 lead byte.
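Richard's double-rôle observation about the number 0xC2 can be checked directly — as a scalar value it names a character, while as an octet it is a UTF-8 lead byte:

```python
# The number 0xC2 in its two rôles: a scalar value (U+00C2) and a
# UTF-8 lead byte introducing a two-byte sequence.
import unicodedata

print(unicodedata.name(chr(0xC2)))       # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
print(chr(0xC2).encode("utf-8").hex())   # 'c382': U+00C2 itself needs a 0xC3 lead
print(bytes([0xC2, 0xA0]).decode("utf-8"))  # 0xC2 as lead byte yields U+00A0
```

Note the twist in the second line: the character U+00C2 is not encoded with the byte 0xC2 at all, which is exactly why the analogy with surrogates breaks down.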

> Code that processes arbitrary *character* sequences (for legibility
> or any other reason) should, surely, work with characters, which may
> be sequences of code points, each of which may be a sequence of
> bytes. I can think of no reason for chopping up byte sequences except
> where they are going to be recombined later, by the reverse
> treatment, and code, if required, that does so probably has no idea
> of, and need not have any idea of, meaning, and can only, surely,
> work with bytes.

In the case I have in mind, the catch is that the chopped up sequences
are being stored in an intentionally human readable intermediate file.
The reason for the file being readable is to allow debugging, and in
extreme cases, correction.  Now, the application is fairly old, and was
created when lines longer than 132 characters caused problems.
However, lines many thousands of characters long can still cause
problems, and are not amenable to line-by-line differencing.  In
principle, one might rewrite the presentation part of the package to be
aware of Unicode characters (or even grapheme clusters), and that would
cause havoc if the text chopped up contained multibyte characters and
the reading program assumed that each chunk contained no broken
characters. 

> The actual octets are, of course, used in combinations, but not
> singly in any way that requires them to be described in Unicode
> terms. Or am I missing something fundamental?

I believe the relevant distinction is simple that such octets are
associated with Unicode characters.  They do not occur in ASCII text.

Richard.



Re: Concise term for non-ASCII Unicode characters

2015-09-21 Thread Doug Ewell
Sean Leonard wrote:

> Additionally as Peter stated, an expression including "Basic Latin
> block" (e.g., characters beyond the Basic Latin block) could work.

I was thinking that something like "non–Basic-Latin Unicode" might be
useful. It avoids the confusion of referring to ASCII as a range of code
points instead of a separate encoding standard.

--
Doug Ewell | http://ewellic.org | Thornton, CO 




RE: Concise term for non-ASCII Unicode characters

2015-09-21 Thread Tony Jollans
Goodness, sorry, no, I didn't mean that at all!!!

What I meant was that a recognised encoding should be used consistently,
regardless of the number of bytes required, and all encodings of Unicode
code points are necessarily potentially multi-byte. Single-byte encodings
may save a little bit of space, and may be Windows-1252, or Windows-1253, or
one of many other encodings but not, in any sense, Unicode encodings.

Windows code pages and their ilk predate Unicode, and I would only ever
expect to see them used in environments where legacy support is needed, and
would not expect a significant amount of new documentation about them to be
written. When it is necessary to describe them, one should do so fully and
properly, which is whatever it is, but they really have no meaning in a
Unicode context. Nor, as far as I'm aware, do the 0x80 to 0xFF octets have
any special meaning in Unicode that would require there to be a recognisable
term to describe them. 

Code that processes arbitrary *character* sequences (for legibility or any
other reason) should, surely, work with characters, which may be sequences
of code points, each of which may be a sequence of bytes. I can think of no
reason for chopping up byte sequences except where they are going to be
recombined later, by the reverse treatment, and code, if required, that does
so probably has no idea of, and need not have any idea of, meaning, and can
only, surely, work with bytes.

The actual octets are, of course, used in combinations, but not singly in
any way that requires them to be described in Unicode terms. Or am I missing
something fundamental?

Best,
Tony

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Richard
Wordingham
Sent: 21 September 2015 19:18
To: unicode@unicode.org
Subject: Re: Concise term for non-ASCII Unicode characters

On Mon, 21 Sep 2015 12:46:48 +0100
"Tony Jollans" <t...@jollans.com> wrote:

> These days, it is pretty sloppy coding that cares how many bytes an 
> encoding of something requires, although there may be many 
> circumstances where legacy support is required.

Wow!  Are you saying that code chopping up arbitrary character sequences for
legibility (and editability!) and to avoid buffering issues should generally
assume it will be read as UTF-8, and avoid splitting well-formed UTF-8
characters?  (If the text is actually Windows-1252, there may be a lot of
apparently ill-formed UTF-8 characters/gibberish.)

> You say that, in some
> contexts, one needs to be really clear that the octets 0x80 - 0xFF are 
> Unicode. Either something "is" Unicode, or it isn't. Either something 
> uses a recognised encoding, or it doesn't. Using these octets to 
> represent Unicode code points is not ASCII, is not UTF-8, and is not 
> UCS-2/UTF-16; it could, perhaps, be EBCDIC.

But most of these octets *are* used to represent non-ASCII scalar values.
It's just that they have to operate in combinations for UTF-8. 

Richard.



Re: Concise term for non-ASCII Unicode characters

2015-09-21 Thread Richard Wordingham
On Mon, 21 Sep 2015 12:46:48 +0100
"Tony Jollans"  wrote:

> These days, it is pretty sloppy coding that cares how many bytes an
> encoding of something requires, although there may be many
> circumstances where legacy support is required.

Wow!  Are you saying that code chopping up arbitrary character sequences
for legibility (and editability!) and to avoid buffering issues should
generally assume it will be read as UTF-8, and avoid splitting
well-formed UTF-8 characters?  (If the text is actually Windows-1252,
there may be a lot of apparently ill-formed UTF-8 characters/gibberish.)
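The chunking concern Richard raises — splitting a byte stream for line length without cutting a well-formed UTF-8 sequence in half — can be sketched as follows. This is an illustrative helper, not any package's actual code, and it assumes the input is valid UTF-8 and the chunk limit is at least 4 bytes:

```python
# Split UTF-8 bytes into chunks of at most `limit` bytes, backing up at
# each cut so no multi-byte sequence is broken. Assumes valid UTF-8.
def split_utf8(data: bytes, limit: int):
    assert limit >= 4  # a UTF-8 sequence is at most 4 bytes long
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + limit, len(data))
        # Back up past continuation bytes (0b10xxxxxx) at the cut point.
        while end < len(data) and 0x80 <= data[end] <= 0xBF:
            end -= 1
        chunks.append(data[start:end])
        start = end
    return chunks

text = "héllo wörld".encode("utf-8")
for c in split_utf8(text, 4):
    print(c.decode("utf-8"))  # every chunk decodes cleanly on its own
```

A naive fixed-width split of the same bytes would raise `UnicodeDecodeError` whenever the cut landed inside a two-byte sequence.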

> You say that, in some
> contexts, one needs to be really clear that the octets 0x80 - 0xFF
> are Unicode. Either something "is" Unicode, or it isn't. Either
> something uses a recognised encoding, or it doesn't. Using these
> octets to represent Unicode code points is not ASCII, is not UTF-8,
> and is not UCS-2/UTF-16; it could, perhaps, be EBCDIC.

But most of these octets *are* used to represent non-ASCII scalar
values.  It's just that they have to operate in combinations for UTF-8. 

Richard.


Re: Concise term for non-ASCII Unicode characters

2015-09-21 Thread Philippe Verdy
2015-09-21 21:54 GMT+02:00 Tony Jollans :

> The actual octets are, of course, used in combinations, but not singly in
> any way that requires them to be described in Unicode terms. Or am I
> missing
> something fundamental?
>

The terms you are looking for are described in the parts of the standard
describing the standard Unicode encoding forms and schemes.

If you're speaking at the octet level, the proper term is "8-bit code
unit"; then look for the definition of "code units", not "code points",
"scalar values", or "characters".

"Character" has another definition in programming languages, but Unicode is
not normatively bound to any programming language, and the actual storage
or transport size of characters is not part of the standard. You'll need to
look into the technical documentation of each programming language,
transport protocol, or storage device: this is out of scope of the standard
itself, each environment describing its own API, library, or adapter to
interface with or convert data correctly between Unicode text elements,
sometimes with several competing interfaces or converters. On this list we
are focused only on standard interchange formats, but that problem was
solved long ago, notably by Internet standards and RFCs such as MIME, which
has its own definition of "characters" because those standards are not
exclusively bound to Unicode but also support other legacy standards.

But even in this case these definitions apply only at an upper layer; the
lower layers may use other conversions, including data compression
techniques, escaping modes, units smaller than octets or even smaller than
binary bits, or may multiplex some bits with a complex state
representation, for example in modems spreading bits over a matrix of
non-binary states with redundancy and autocorrection. Even the order of
bits is not defined in the Unicode standard or in the internal lower
layers of an interface (these are not the layers concerned with
interchange across a large network; they are specific to each physical or
virtual link between specific pairs of hosts, buses/cables, hubs,
switches, or routers, and at this level they do not even have to know
whether the data actually contains text or which upper-layer encoding
forms are used or implied).

So let's get back to your focus: you're wondering if there's a term for
octets with the high bit set, in the context of texts processed with some
standard Unicode algorithms.
- We have a term for 16-bit code units used in combinations to encode a
single code point: these are "surrogates".
- For 8-bit code units, there are at least 3 encoding schemes described:
UTF-8, CESU-8, and SCSU. Each one has its own subranges of octet values
processed differently. The best way to name these ranges is to look into
the standard documentation of those encoding schemes. And these
definitions are independent of those used in other encoding schemes/forms
(including those defined by TUS); they do not operate at the same level,
and these independent levels should (must?) be black-boxed: their scope is
strongly defined and transparent to all other layers of processing, and
each layer is replaceable by a competing encoding.
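The 16-bit case mentioned above — surrogates as "combinations of code units encoding a single code point" — can be computed directly from the UTF-16 definition. A sketch (the function name is ours):

```python
# UTF-16 encodes a supplementary code point (U+10000..U+10FFFF) as a
# surrogate pair: a high surrogate (0xD800..0xDBFF) followed by a low
# surrogate (0xDC00..0xDFFF), each carrying 10 bits of (cp - 0x10000).
def utf16_surrogates(cp: int):
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

hi, lo = utf16_surrogates(0x1F600)  # U+1F600 EMOJI GRINNING FACE
print(hex(hi), hex(lo))             # 0xd83d 0xde00
```

This is the 16-bit analogue of UTF-8's lead/continuation bytes: neither half of the pair is a scalar value on its own.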

Note that initially, even TUS did not define any encoding scheme below the
level of code points and their scalar values. There was then no concept of
"code units"; they were standardized only because a few encoding schemes
(UTFs) were integrated in a standard annex, then directly in TUS itself, as
they became ubiquitous for handling Unicode texts and outweighed all the
other (older) legacy standards (including Internet standards, which still
survive with their mandatory or optional support of legacy standards:
UTF-8 proved to be the easiest encoding offering a basic level of
compatibility with those older standards).


Re: Concise term for non-ASCII Unicode characters

2015-09-21 Thread Philippe Verdy
You actually don't need any copy to work with it: U+0000 to U+007F are
directly bound to US-ASCII. Unicode describes these characters with
character properties (and representative glyphs only for the range
U+0020..U+007E; the "C0" controls, in U+0000 to U+001F and U+007F, have a
pseudo-glyph in the charts which may only be usable if you work with them
in "visible controls" mode).

If you need ANSI X3.4, it's only about the intended usage of controls, but
only a few are prevalent in plain text: TAB, LF, CR (or CR+LF), and FF (NUL
and DEL are used as fillers depending on the environment, or may be used as
special escapes or terminators in terminal protocols). Most controls in
US-ASCII have their names and most common functions related to
console/keyboard/printer protocols and are not intended to be used in text
content. But there are so many competing protocols that even the ANSI X3.4
descriptions are just informative and deprecated: you'll need to look into
each protocol. Unicode (and MIME in Internet protocols) attempts to create
an equivalence for line termination only (with LF, CR, or CR+LF; Unicode
also added NEL from the C1 controls, only for compatibility with EBCDIC
data converters).



2015-09-21 22:51 GMT+02:00 Sean Leonard :

> Related question as I am researching this:
>
> How can I acquire (cheaply or free) the latest and most official copy of
> US-ASCII, namely, the version that Unicode references?
>
> The Unicode Standard 8.0 refers to the following document:
>
> ANSI X3.4: American National Standards Institute. Coded character
> set—7-bit American
> national standard code for information interchange. New York: 1986. (ANSI
> X3.4-1986).
>
> (See page 294.)
>
> A quick Google search did not yield results. There are public/university
> library hard copies but they are hundreds of miles away from my location.
>
> Sean
>
>


Re: Concise term for non-ASCII Unicode characters

2015-09-21 Thread Sean Leonard

Related question as I am researching this:

How can I acquire (cheaply or free) the latest and most official copy of 
US-ASCII, namely, the version that Unicode references?


The Unicode Standard 8.0 refers to the following document:

ANSI X3.4: American National Standards Institute. Coded character 
set—7-bit American
national standard code for information interchange. New York: 1986. 
(ANSI X3.4-1986).


(See page 294.)

A quick Google search did not yield results. There are public/university 
library hard copies but they are hundreds of miles away from my location.


Sean



Re: Concise term for non-ASCII Unicode characters

2015-09-21 Thread Sean Leonard

First of all, thank you all for the responses thus far.

On 9/20/2015 5:51 PM, Martin J. Dürst wrote:

Hello Sean,

On 2015/09/20 23:48, Sean Leonard wrote:

What is the most concise term for characters or code points


So we already have two different things we might need a term for. 



[...]


The terms "supplementary character" and "supplementary code point" are
defined in the Unicode standard, referring to characters or code points
above U+FFFF. I am looking for something like those, but for characters
or code points above U+007F.
Anyway, what I wanted to show is that depending on what you need it 
for, there are so many different variations that it doesn't pay off to 
create specific short terms for all of them, and the term you use 
currently may be short enough.


Well what I am getting at is that when writing standards documents in 
various SDOs (or any other computer science text, for that matter), it 
is helpful to identify these characters/code points.


I think we can limit our inquiry to "characters" and "code points". Both 
of those are well-defined in Unicode (see the Unicode glossary). A 
[Unicode] code point is any value in the range 0 - 0x10FFFF. A [Unicode] 
character is an abstract character that is actually assigned a [Unicode] 
scalar value. Therefore the space is Unicode code point > Unicode scalar 
value > Unicode character.
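The hierarchy in the paragraph above can be written down as two small
predicates; this is an illustrative sketch, not code from the thread:

```python
# Every scalar value is a code point, but the surrogate code points
# (U+D800..U+DFFF) are excluded from the scalar values, so:
# code points > scalar values (> assigned characters).
def is_code_point(value: int) -> bool:
    return 0x0000 <= value <= 0x10FFFF

def is_scalar_value(value: int) -> bool:
    return is_code_point(value) and not (0xD800 <= value <= 0xDFFF)
```

Whether a scalar value is additionally an encoded *character* depends on
the version of the standard, which is why the character set is the
narrowest of the three.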


"supplementary" means outside the BMP, i.e., 0x10000 - 0x10FFFF.
"BMP" means inside the Basic Multilingual Plane, i.e., 0x0 - 0xFFFF.

The problem is that the BMP / supplementary distinction makes sense in a 
UCS-2 / UTF-16 universe. But for much interchange these days, UTF-8 is 
the way to go.
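To make the preceding point concrete, here is a sketch (mine, not from
the thread) of which boundary matters for which encoding form:
U+007F/U+0080 for UTF-8, U+FFFF/U+10000 for UTF-16.

```python
def utf8_bytes(scalar: int) -> int:
    """Number of UTF-8 code units (octets) for a scalar value."""
    if scalar <= 0x7F:
        return 1
    if scalar <= 0x7FF:
        return 2
    if scalar <= 0xFFFF:
        return 3
    return 4

def utf16_units(scalar: int) -> int:
    """Number of UTF-16 code units for a scalar value."""
    return 1 if scalar <= 0xFFFF else 2

for cp in (0x7F, 0x80, 0xFFFF, 0x10000):
    print(f'U+{cp:04X}: UTF-8={utf8_bytes(cp)}, UTF-16={utf16_units(cp)}')
```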


I wish that "non-ASCII characters" and "non-ASCII code points" (and 
non-ASCII scalar values) were sufficient for me. Maybe they can be. 
However, in contexts where ASCII is getting extended or supplemented 
(e.g., in the DNS or in e-mail), one needs to be really clear that the 
octets 0x80 - 0xFF are Unicode (specifically UTF-8, I suppose), and not 
something else.


The expressions "beyond [...] ASCII" or "beyond the ASCII range" (as in, 
characters beyond ASCII, code points beyond ASCII) have some support in 
the Unicode Standard; see, e.g., Section 2.5 "ASCII Transparency" 
paragraph. Additionally as Peter stated, an expression including "Basic 
Latin block" (e.g., characters beyond the Basic Latin block) could work.


FWIW, the term "non-ASCII" is used in e-mail address 
internationalization ("EAI") in the IETF; its opposite is "all-ASCII" 
(or simply "ASCII"). (RFCs 6530, 6531, 6532). The term also appears in 
RFC 2047 from November 1996 but there it has the more expansive meaning 
(i.e., not limited or targeted to Unicode).


Sean


RE: Concise term for non-ASCII Unicode characters

2015-09-20 Thread Phillips, Addison
I agree, although I note that sometimes the additional (redundant) specificity 
of "non-7-bit-ASCII characters" is needed when talking to people unclear on 
what "ASCII" means.

Addison

> -Original Message-
> From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Peter
> Constable
> Sent: Sunday, September 20, 2015 9:52 AM
> To: Sean Leonard; unicode@unicode.org
> Subject: RE: Concise term for non-ASCII Unicode characters
> 
> You already have been using "non-ASCII Unicode", which is about as concise
> and sufficiently accurate as you'll get. There's no term specifically defined 
> in
> any standard or conventionally used for this.
> 
> 
> Peter
> 
> -Original Message-
> From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Sean
> Leonard
> Sent: Sunday, September 20, 2015 7:48 AM
> To: unicode@unicode.org
> Subject: Concise term for non-ASCII Unicode characters
> 
> What is the most concise term for characters or code points outside of the
> US-ASCII range (U+0000 - U+007F)? Sometimes I have referred to these as
> "extended characters" or "non-ASCII Unicode" but I do not find those terms
> precise. We are talking about the code points U+0080 - U+10FFFF. I suppose
> that this also refers to code points/scalar values that are not formally
> Unicode characters, such as U+FFFF. Basically, I am looking for a concise term
> for values that would require multiple UTF-8 octets if encoded in UTF-8
> (without referring to UTF-8 encoding specifically).
> "Non-ASCII" is not precise enough since character sets like Shift-JIS are non-
> ASCII.
> 
> Also a citation to a relevant standard (whether Unicode or otherwise) would
> be helpful.
> 
> The terms "supplementary character" and "supplementary code point" are
> defined in the Unicode standard, referring to characters or code points
> above U+FFFF. I am looking for something like those, but for characters or
> code points above U+007F.
> 
> Thank you,
> 
> Sean




Re: Concise term for non-ASCII Unicode characters

2015-09-20 Thread Martin J. Dürst

Hello Sean,

On 2015/09/20 23:48, Sean Leonard wrote:

What is the most concise term for characters or code points


So we already have two different things we might need a term for.


outside of
the US-ASCII range (U+0000 - U+007F)? Sometimes I have referred to these
as "extended characters"


Most of the characters outside the US-ASCII range are perfectly simple 
and basic characters. I don't think the term 'extended' fits well here. 
It gives the impression that everything except US-ASCII is somewhat 
extraordinary, which in this day and age shouldn't be the case anymore.



or "non-ASCII Unicode" but I do not find those
terms precise. We are talking about the code points U+0080 - U+10FFFF. I
suppose that this also refers to code points/scalar values that are not
formally Unicode characters, such as U+FFFF.


Again we may need different terms depending on whether these are 
included or not.



Basically, I am looking for
a concise term for values that would require multiple UTF-8 octets if
encoded in UTF-8 (without referring to UTF-8 encoding specifically).
"Non-ASCII" is not precise enough since character sets like Shift-JIS
are non-ASCII.


Well, the non-ASCII characters in Shift-JIS are also contained in 
Unicode, so depending on exactly what you want to talk about, Non-ASCII 
characters may be good enough.



Also a citation to a relevant standard (whether Unicode or otherwise)
would be helpful.

The terms "supplementary character" and "supplementary code point" are
defined in the Unicode standard, referring to characters or code points
above U+FFFF. I am looking for something like those, but for characters
or code points above U+007F.


And then in some cases, you may want to exclude the C0 area 
(U+0000-001F), or part of it, or some syntactically significant 
characters (e.g. punctuation) in the remaining part.


Anyway, what I wanted to show is that depending on what you need it for, 
there are so many different variations that it doesn't pay off to create 
specific short terms for all of them, and the term you use currently may 
be short enough.


Regards,   Martin.


Concise term for non-ASCII Unicode characters

2015-09-20 Thread Sean Leonard
What is the most concise term for characters or code points outside of 
the US-ASCII range (U+0000 - U+007F)? Sometimes I have referred to these 
as "extended characters" or "non-ASCII Unicode" but I do not find those 
terms precise. We are talking about the code points U+0080 - U+10FFFF. I 
suppose that this also refers to code points/scalar values that are not 
formally Unicode characters, such as U+FFFF. Basically, I am looking for 
a concise term for values that would require multiple UTF-8 octets if 
encoded in UTF-8 (without referring to UTF-8 encoding specifically). 
"Non-ASCII" is not precise enough since character sets like Shift-JIS 
are non-ASCII.


Also a citation to a relevant standard (whether Unicode or otherwise) 
would be helpful.


The terms "supplementary character" and "supplementary code point" are 
defined in the Unicode standard, referring to characters or code points 
above U+FFFF. I am looking for something like those, but for characters 
or code points above U+007F.


Thank you,

Sean


Re: Concise term for non-ASCII Unicode characters

2015-09-20 Thread Steve Swales
Exactly. I think the reason that non-ASCII feels non-concise is that there is 
widespread confusion between ASCII and Latin-1/ISO 8859-1 (which in turn is 
widely confused with Windows-1252).

-steve  




Sent from my iPhone


> On Sep 20, 2015, at 10:05 AM, Phillips, Addison <addi...@lab126.com> wrote:
> 
> I agree, although I note that sometimes the additional (redundant) 
> specificity of "non-7-bit-ASCII characters" is needed when talking to people 
> unclear on what "ASCII" means.
> 
> Addison
> 
>> -Original Message-
>> From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Peter
>> Constable
>> Sent: Sunday, September 20, 2015 9:52 AM
>> To: Sean Leonard; unicode@unicode.org
>> Subject: RE: Concise term for non-ASCII Unicode characters
>> 
>> You already have been using "non-ASCII Unicode", which is about as concise
>> and sufficiently accurate as you'll get. There's no term specifically 
>> defined in
>> any standard or conventionally used for this.
>> 
>> 
>> Peter
>> 
>> -Original Message-
>> From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Sean
>> Leonard
>> Sent: Sunday, September 20, 2015 7:48 AM
>> To: unicode@unicode.org
>> Subject: Concise term for non-ASCII Unicode characters
>> 
>> What is the most concise term for characters or code points outside of the
>> US-ASCII range (U+0000 - U+007F)? Sometimes I have referred to these as
>> "extended characters" or "non-ASCII Unicode" but I do not find those terms
>> precise. We are talking about the code points U+0080 - U+10FFFF. I suppose
>> that this also refers to code points/scalar values that are not formally
>> Unicode characters, such as U+FFFF. Basically, I am looking for a concise 
>> term
>> for values that would require multiple UTF-8 octets if encoded in UTF-8
>> (without referring to UTF-8 encoding specifically).
>> "Non-ASCII" is not precise enough since character sets like Shift-JIS are 
>> non-
>> ASCII.
>> 
>> Also a citation to a relevant standard (whether Unicode or otherwise) would
>> be helpful.
>> 
>> The terms "supplementary character" and "supplementary code point" are
>> defined in the Unicode standard, referring to characters or code points
>> above U+FFFF. I am looking for something like those, but for characters or
>> code points above U+007F.
>> 
>> Thank you,
>> 
>> Sean
> 
> 



RE: Concise term for non-ASCII Unicode characters

2015-09-20 Thread Peter Constable
You already have been using "non-ASCII Unicode", which is about as concise and 
sufficiently accurate as you'll get. There's no term specifically defined in 
any standard or conventionally used for this.


Peter

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Sean Leonard
Sent: Sunday, September 20, 2015 7:48 AM
To: unicode@unicode.org
Subject: Concise term for non-ASCII Unicode characters

What is the most concise term for characters or code points outside of the 
US-ASCII range (U+0000 - U+007F)? Sometimes I have referred to these as 
"extended characters" or "non-ASCII Unicode" but I do not find those terms 
precise. We are talking about the code points U+0080 - U+10FFFF. I suppose that 
this also refers to code points/scalar values that are not formally Unicode 
characters, such as U+FFFF. Basically, I am looking for a concise term for 
values that would require multiple UTF-8 octets if encoded in UTF-8 (without 
referring to UTF-8 encoding specifically). 
"Non-ASCII" is not precise enough since character sets like Shift-JIS are 
non-ASCII.

Also a citation to a relevant standard (whether Unicode or otherwise) would be 
helpful.

The terms "supplementary character" and "supplementary code point" are defined 
in the Unicode standard, referring to characters or code points above U+FFFF. I 
am looking for something like those, but for characters or code points above 
U+007F.

Thank you,

Sean



RE: Concise term for non-ASCII Unicode characters

2015-09-20 Thread Peter Constable
Well, if the point is to refer to characters that would require two or more 
code units in UTF-8, then _accurate_ expressions would be, "Unicode characters 
beyond the Basic Latin block" or "Unicode characters above U+007F".


Peter 

-Original Message-
From: Steve Swales [mailto:st...@swales.us] 
Sent: Sunday, September 20, 2015 11:00 AM
To: Phillips, Addison <addi...@lab126.com>
Cc: Peter Constable <peter...@microsoft.com>; Sean Leonard 
<lists+unic...@seantek.com>; unicode@unicode.org
Subject: Re: Concise term for non-ASCII Unicode characters

Exactly. I think the reason that non-ASCII feels non-concise is that there is 
widespread confusion between ASCII and Latin-1/ISO 8859-1 (which in turn is 
widely confused with Windows-1252).

-steve  




Sent from my iPhone


> On Sep 20, 2015, at 10:05 AM, Phillips, Addison <addi...@lab126.com> wrote:
> 
> I agree, although I note that sometimes the additional (redundant) 
> specificity of "non-7-bit-ASCII characters" is needed when talking to people 
> unclear on what "ASCII" means.
> 
> Addison
> 
>> -Original Message-
>> From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Peter 
>> Constable
>> Sent: Sunday, September 20, 2015 9:52 AM
>> To: Sean Leonard; unicode@unicode.org
>> Subject: RE: Concise term for non-ASCII Unicode characters
>> 
>> You already have been using "non-ASCII Unicode", which is about as 
>> concise and sufficiently accurate as you'll get. There's no term 
>> specifically defined in any standard or conventionally used for this.
>> 
>> 
>> Peter
>> 
>> -Original Message-----
>> From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Sean 
>> Leonard
>> Sent: Sunday, September 20, 2015 7:48 AM
>> To: unicode@unicode.org
>> Subject: Concise term for non-ASCII Unicode characters
>> 
>> What is the most concise term for characters or code points outside 
>> of the US-ASCII range (U+ - U+007F)? Sometimes I have referred to 
>> these as "extended characters" or "non-ASCII Unicode" but I do not 
>> find those terms precise. We are talking about the code points U+0080 
>> - U+10FFFF. I suppose that this also refers to code points/scalar 
>> values that are not formally Unicode characters, such as U+FFFF. 
>> Basically, I am looking for a concise term for values that would 
>> require multiple UTF-8 octets if encoded in UTF-8 (without referring to 
>> UTF-8 encoding specifically).
>> "Non-ASCII" is not precise enough since character sets like Shift-JIS 
>> are non- ASCII.
>> 
>> Also a citation to a relevant standard (whether Unicode or otherwise) 
>> would be helpful.
>> 
>> The terms "supplementary character" and "supplementary code point" 
>> are defined in the Unicode standard, referring to characters or code 
>> points above U+FFFF. I am looking for something like those, but for 
>> characters or code points above U+007F.
>> 
>> Thank you,
>> 
>> Sean
> 
> 



Re: Concise term for non-ASCII Unicode characters

2015-09-20 Thread Daniel Bünzli
On Sunday, September 20, 2015 at 18:59, Steve Swales wrote:
> Exactly. I think the reason that non-ASCII feels non-concise is that there is 
> widespread confusion between ASCII and Latin-1/ISO 8859-1 (which in turn is 
> widely confused with Windows-1252).

For this reason I usually use the term US-ASCII, which is the IANA name for the 
7-bit-ASCII characters [1].

Someone referring to the non-US-ASCII scalar values of Unicode would make 
precise sense to me. But then maybe Peter's very last suggestion is actually 
the most precise you can get.

Also if you are talking about UTF-8 I would use the term scalar values rather 
than "characters" or "code points" since surrogates can't be encoded in UTF-8.
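That point can be demonstrated directly in Python (a quick illustration,
not part of the original message): any scalar value encodes to UTF-8,
while a lone surrogate code point is rejected by the encoder.

```python
# The largest scalar value, U+10FFFF, encodes to four octets...
assert len('\U0010FFFF'.encode('utf-8')) == 4

# ...but a surrogate code point cannot be represented in UTF-8.
try:
    '\ud800'.encode('utf-8')
except UnicodeEncodeError as err:
    print('rejected:', err.reason)
```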

Best,

Daniel

[1] http://www.iana.org/assignments/character-sets