Re: Corrigendum #9

2014-06-03 Thread David Starner
On Mon, Jun 2, 2014 at 4:33 PM, Richard Wordingham
richard.wording...@ntlworld.com wrote:
 Much as I don't like their uninvited use, it is possible to pass them
 and other undesirables through most applications by a slight bit of
 recoding at the application's boundaries.  Using 99 = (3 + 32 + 64) PUA
 characters, one can ape UTF-16 surrogates and encode:

What's the point? If we can use the PUA, then we don't need the
noncharacters; we can just use the PUA directly. If we have to play
around with remapping them, they're pointless; they're no easier to
use in that case than ESC or '\' or PUA characters.

-- 
Kie ekzistas vivo, ekzistas espero.
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-03 Thread Mark Davis ☕️
On Mon, Jun 2, 2014 at 10:32 PM, David Starner prosfil...@gmail.com wrote:

 Why? It seems you're changing the rules
 ...


This isn't "are changing"; it is "has changed". The Corrigendum was issued
at the start of 2013, about 16 months ago; applicable to all relevant
earlier versions. It was the result of fairly extensive debate inside the
UTC; there hasn't been a single issue on this thread that wasn't considered
during the discussions there. And as far back as 2001, the UTC made it
clear that noncharacters *are* scalar values, and are to be converted by
UTF converters. E.g., see
http://www.unicode.org/mail-arch/unicode-ml/y2001-m09/0149.html (by chance,
one day before 9/11).

 probably trigger serious bugs in some lamebrained utility.

There were already plenty of programs that passed the noncharacters
through; very few would filter them (some would delete them, which is
horrible for security). Thinking that a utility would never encounter them
in input text was a pipe-dream. If a utility or library is so fragile that
it *breaks* on input of any valid UTF sequence, then it *is* a lamebrained
utility. A good unit test for any production chain would be to check there
is no crash on any input scalar value (and for that matter, any ill-formed
UTF text).
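
A minimal sketch of such a sweep, assuming the component under test can be wrapped as a Consumer<String>. The names ScalarValueSweep and countFailures are hypothetical and not taken from any library. It feeds every Unicode scalar value, one at a time, to the component and counts how many inputs make it throw; a fuller test would add ill-formed sequences as well.

```java
import java.util.function.Consumer;

// Hypothetical helper: sweeps all Unicode scalar values through a component.
final class ScalarValueSweep {
    /** Returns how many single-scalar-value inputs made the component throw. */
    static int countFailures(Consumer<String> component) {
        int failures = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            // Surrogate code points are not scalar values, so skip them.
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                continue;
            }
            try {
                component.accept(new String(Character.toChars(cp)));
            } catch (RuntimeException e) {
                failures++; // only runtime exceptions are counted here
            }
        }
        return failures;
    }
}
```

For instance, ScalarValueSweep.countFailures(s -> s.toUpperCase()) sweeps String.toUpperCase; noncharacters such as U+FFFE are included in the sweep because they are scalar values.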


Re: Corrigendum #9

2014-06-03 Thread Martin J. Dürst

On 2014/06/03 07:08, Asmus Freytag wrote:

 On 6/2/2014 2:53 PM, Markus Scherer wrote:

  On Mon, Jun 2, 2014 at 1:32 PM, David Starner prosfil...@gmail.com
  wrote:

   I would especially discourage any web browser from handling
   these; they're noncharacters used for unknown purposes that are
   undisplayable and if used carelessly for their stated purpose, can
   probably trigger serious bugs in some lamebrained utility.

  I don't expect "handling these" in web browsers and lamebrained
  utilities. I expect "treat like unassigned code points".

Expecting them to be treated like unassigned code points shows that
their use is a bad idea: Since when does the Unicode Consortium use
unassigned code points (and the like) in plain sight?

 I can't shake the suspicion that Corrigendum #9 is not actually solving
 a general problem, but is a special favor to CLDR as being run by
 insiders, and in the process muddying the waters for everyone else.


I have to fully agree with Asmus, Richard, Shawn and others that the use 
of non-characters in CLDR is a very bad and dangerous example.


However convenient the misuse of some of these codepoints in CLDR may 
be, it sets a very bad example for everybody else. Unicode itself should 
not just be twice as careful with the use of its own codepoints, but 10 
times as careful.


I'd strongly suggest that completely independent of when and how 
Corrigendum #9 gets tweaked or fixed, a quick and firm plan gets worked 
out for how to get rid of these codepoints in CLDR data. The sooner, the 
better.


Regards,   Martin.


Re: Corrigendum #9

2014-06-03 Thread Richard Wordingham
On Mon, 2 Jun 2014 23:21:38 -0700
David Starner prosfil...@gmail.com wrote:

 On Mon, Jun 2, 2014 at 4:33 PM, Richard Wordingham
 richard.wording...@ntlworld.com wrote:
  Using 99 = (3 +
  32 + 64) PUA characters, one can ape UTF-16 surrogates and encode:

 What's the point? If we can use the PUA, then we don't need the
 noncharacters; we can just use the PUA directly. If we have to play
 around with remapping them, they're pointless; they're no easier to
 use in that case than ESC or '\' or PUA characters.

A search for the 2-character string '\n' would also find a substring
of the 4-character string 'a\\n'.  The PUA is in general not available
for general utilities to make special use of.
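
The false match can be illustrated with a small sketch. It assumes a backslash-escaping convention in which a doubled backslash stands for one literal backslash; the class EscapeSearch and both methods are hypothetical names used only for illustration.

```java
// Hypothetical illustration of searching backslash-escaped text.
final class EscapeSearch {
    /** Naive search: treats the escaped text as an opaque string. */
    static int naiveFind(String escaped, String needle) {
        return escaped.indexOf(needle);
    }

    /**
     * Escape-aware search for the two-character escape (backslash, c).
     * Returns the index of the backslash, or -1 if the escape is absent.
     */
    static int findEscape(String escaped, char c) {
        for (int i = 0; i + 1 < escaped.length(); i++) {
            if (escaped.charAt(i) == '\\') {
                if (escaped.charAt(i + 1) == c) {
                    return i;
                }
                i++; // skip the escaped character so a doubled backslash is one unit
            }
        }
        return -1;
    }
}
```

In Java source the 4-character text is written "a\\\\n" and the 2-character needle "\\n". The naive search reports a hit inside the doubled backslash, while the escape-aware scan has to parse the text from the start to know that no such escape is present.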

Richard. 



Re: Corrigendum #9

2014-06-03 Thread David Starner
On Mon, Jun 2, 2014 at 11:55 PM, Mark Davis ☕️ m...@macchiato.com wrote:
 Thinking that a utility would never encounter them in input text
 was a pipe-dream.

Thinking that a utility would never mangle them if encountered in
input text was a pipe-dream.

 If a utility or library is so fragile that it breaks on
 input of any valid UTF sequence, then it is a lamebrained utility.

And?  The world is filled with lamebrained utilities, and being
cautious about what you take in can prevent one of those lamebrained
utilities from turning into an exploit.

 A good
 unit test for any production chain would be to check there is no crash on
 any input scalar value (and for that matter, any ill-formed UTF text).

Right; and if you filter out stuff at the frontend, like ill-formed
UTF text and noncharacters, you don't have to worry about what the
middle end will do with them.
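
A rough sketch of one such frontend filter, under the assumption that offending input is replaced with U+FFFD rather than deleted (deletion was flagged earlier in this thread as a security problem). The class FrontendFilter and its methods are hypothetical names, and whether noncharacters should be filtered at all is exactly what this thread disputes.

```java
// Hypothetical frontend filter: one possible policy, not a recommendation.
final class FrontendFilter {
    /** True for U+FDD0..U+FDEF and for the last two code points of each plane. */
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    /** Replaces unpaired surrogates and noncharacters with U+FFFD. */
    static String sanitize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i); // yields the surrogate itself if unpaired
            i += Character.charCount(cp);
            boolean loneSurrogate =
                    cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE;
            out.appendCodePoint(loneSurrogate || isNoncharacter(cp) ? 0xFFFD : cp);
        }
        return out.toString();
    }
}
```

Well-formed text without noncharacters passes through unchanged, so the later stages only ever see scalar values that the filter has let through.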

I don't get what the goal of these changes was. It seems you've taken
these characters away from programmers to use them in programs and
given them to CLDR and anyone else willing to make their plain text
files skirt the limits.

-- 
Kie ekzistas vivo, ekzistas espero.



Re: Corrigendum #9

2014-06-03 Thread David Starner
On Tue, Jun 3, 2014 at 12:31 AM, Richard Wordingham
richard.wording...@ntlworld.com wrote:
 On Mon, 2 Jun 2014 23:21:38 -0700
 David Starner prosfil...@gmail.com wrote:

 On Mon, Jun 2, 2014 at 4:33 PM, Richard Wordingham
 richard.wording...@ntlworld.com wrote:
  Using 99 = (3 +
  32 + 64) PUA characters, one can ape UTF-16 surrogates and encode:

 The PUA is in general not available for
 general utilities to make special use of.

No, the PUA is not. Then where are you getting the 99 PUA characters
you suggested using?

-- 
Kie ekzistas vivo, ekzistas espero.


Re: Corrigendum #9

2014-06-03 Thread Richard Wordingham
On Tue, 3 Jun 2014 08:55:09 +0200
Mark Davis ☕️ m...@macchiato.com wrote:

 On Mon, Jun 2, 2014 at 10:32 PM, David Starner prosfil...@gmail.com
 wrote:
 
  Why? It seems you're changing the rules
  ...
 
 
 This isn't "are changing"; it is "has changed". The Corrigendum was
 issued at the start of 2013, about 16 months ago; applicable to all
 relevant earlier versions. It was the result of fairly extensive
 debate inside the UTC; there hasn't been a single issue on this
 thread that wasn't considered during the discussions there. And as
 far back as 2001, the UTC made it clear that noncharacters *are*
 scalar values, and are to be converted by UTF converters. Eg, see
 http://www.unicode.org/mail-arch/unicode-ml/y2001-m09/0149.html (by
 chance, one day before 9/11).

But that says U+FDD0 is not to be externally interchanged!

Richard.



Re: Corrigendum #9

2014-06-03 Thread Mark Davis ☕️
On Tue, Jun 3, 2014 at 9:41 AM, David Starner prosfil...@gmail.com wrote:

 Thinking that a utility would never mangle them if encountered in
 input text was a pipe-dream.


I didn't say not mangle, I said break, as in crash.

I don't think this thread is going anywhere productive, so I'm signing
off from it.


Re: Corrigendum #9

2014-06-03 Thread Philippe Verdy
I think his point is that an application may want to encapsulate in valid
text any arbitrary stream of code points (including noncharacters, PUAs,
or isolated surrogate code units found in 16-bit or 32-bit streams that are
invalid UTF-16 or UTF-32 streams, or even arbitrary invalid 8-bit bytes in
streams that are not valid UTF-8).

For 8-bit streams, using ESC or \ is generally a good choice of escape to
derive a valid UTF-8 text stream. But for 16-bit and 32-bit streams, PUAs
are more economical (though PUA code units found in the stream still need
to be escaped).

If you think about the Java regexp \\uD800, it does not designate a code
point but only a code unit, which is not valid plain text alone as it
violates UTF-16 encoding rules. Trying to match it in a valid UTF-16 stream
can work only if you can represent isolated code units for a specific
encoding like UTF-16, even if the target stream to search for this match
uses any other valid UTF (not necessarily UTF-16: decode the target text,
reencode it to UTF-16 to generate a 16-bit stream in which you'll look for
isolated 16-bit code units with the regexp).

So yes, the regexp \\u (in Java source) is not used to match a single
valid character.


2014-06-03 8:21 GMT+02:00 David Starner prosfil...@gmail.com:

 On Mon, Jun 2, 2014 at 4:33 PM, Richard Wordingham
 richard.wording...@ntlworld.com wrote:
  Much as I don't like their uninvited use, it is possible to pass them
  and other undesirables through most applications by a slight bit of
  recoding at the application's boundaries.  Using 99 = (3 + 32 + 64) PUA
  characters, one can ape UTF-16 surrogates and encode:

 What's the point? If we can use the PUA, then we don't need the
 noncharacters; we can just use the PUA directly. If we have to play
 around with remapping them, they're pointless; they're no easier to
 use in that case than ESC or '\' or PUA characters.

 --
 Kie ekzistas vivo, ekzistas espero.


Use of Unicode Symbol 26A0

2014-06-03 Thread Papendick, Michelle
Good Day -

Just wondering if Unicode provides, or anyone knows of, documentation for
standard usage around the following symbol:

[image: U+26A0 WARNING SIGN]

I noticed that it is used in many applications as a general warning or error
symbol, but upon research it is also the symbol for personal injury, so there
appears to be a conflict of meaning.

Any information around standard usage of the symbol in software applications is
appreciated.

Thank you!
Michelle



Re: Corrigendum #9

2014-06-03 Thread Asmus Freytag

On 6/2/2014 3:08 PM, Asmus Freytag wrote:

 On 6/2/2014 2:53 PM, Markus Scherer wrote:

  On Mon, Jun 2, 2014 at 1:32 PM, David Starner prosfil...@gmail.com
  wrote:

   I would especially discourage any web browser from handling
   these; they're noncharacters used for unknown purposes that are
   undisplayable and if used carelessly for their stated purpose, can
   probably trigger serious bugs in some lamebrained utility.

  I don't expect "handling these" in web browsers and lamebrained
  utilities. I expect "treat like unassigned code points".

 I can't shake the suspicion that Corrigendum #9 is not actually
 solving a general problem, but is a special favor to CLDR as being run
 by insiders, and in the process muddying the waters for everyone else.


Clarifying:

I still haven't heard from anyone that this solves a general problem 
that is widespread. The only actual example has always been CLDR, and 
its decision to ship these code points in XML. Shipping these code 
points in files was pretty far down the list of what not to do when 
they were originally adopted. My view continues to be that this was a 
questionable design decision by CLDR, given what was on the record. The 
reaction of several outside implementers during this discussion makes 
clear that viewing that design as problematic is not just my personal view.


Usually, if there's a discrepancy between an implementation and Unicode, 
the reaction is not to retract conformance language. I think arriving at 
this decision was easier for the UTC, because CLDR is not a random, 
unrelated implementation. And, as in any group, it's perhaps easier to 
not be as keenly aware of the impact on external implementations.


So, I'd like to clarify that this is the sense in which I meant
"special favor", which is therefore not the most felicitous
expression to describe what I had in mind.


A./




Re: Use of Unicode Symbol 26A0

2014-06-03 Thread Asmus Freytag

Michelle,

Unicode normally does not document all known usages of symbols. 
Occasionally, if a symbol is used in ways that might be unexpected from 
its name, the standard may add an alias or annotation. This is done in 
particular, when there is a question of whether a given symbol is the 
correct choice for a given application - especially if Unicode contains 
multiple, similar symbols.


Here, that does not seem to be the case. The symbol is used for a
variety of purposes, from warning to error to alerting readers to
important information. These all seem to fit in the same general usage
as suggested by the name, and the symbol is distinct enough that
there is no other symbol in Unicode that might suggest itself as an
alternate.


The use to warn about risk of personal injury would not seem to demand 
additional clarification.


A./





Re: Corrigendum #9

2014-06-03 Thread Asmus Freytag

Nicely put.

A./

On 6/3/2014 12:09 AM, Martin J. Dürst wrote:

On 2014/06/03 07:08, Asmus Freytag wrote:

On 6/2/2014 2:53 PM, Markus Scherer wrote:

On Mon, Jun 2, 2014 at 1:32 PM, David Starner prosfil...@gmail.com
mailto:prosfil...@gmail.com wrote:

I would especially discourage any web browser from handling
these; they're noncharacters used for unknown purposes that are
undisplayable and if used carelessly for their stated purpose, can
probably trigger serious bugs in some lamebrained utility.


I don't expect "handling these" in web browsers and lamebrained
utilities. I expect "treat like unassigned code points".


Expecting them to be treated like unassigned code points shows that 
their use is a bad idea: Since when does the Unicode Consortium use 
unassigned code points (and the like) in plain sight?



I can't shake the suspicion that Corrigendum #9 is not actually solving
a general problem, ...


I have to fully agree with Asmus, Richard, Shawn and others that the 
use of non-characters in CLDR is a very bad and dangerous example.


However convenient the misuse of some of these codepoints in CLDR may 
be, it sets a very bad example for everybody else. Unicode itself 
should not just be twice as careful with the use of its own 
codepoints, but 10 times as careful.


I'd strongly suggest that completely independent of when and how 
Corrigendum #9 gets tweaked or fixed, a quick and firm plan gets 
worked out for how to get rid of these codepoints in CLDR data. The 
sooner, the better.


Regards,   Martin.


Re: Use of Unicode Symbol 26A0

2014-06-03 Thread Jukka K. Korpela

2014-06-03 19:13, Asmus Freytag wrote:


Unicode normally does not document all known usages of symbols.


Not to mention unknown usages. Characters will be used in different 
ways, no matter what the Unicode Standard says, and it would be mostly 
pointless to put restrictions on it. In some cases, however, some types 
of usage are warned against, or better approaches are suggested.



The symbol is used for a
variety of purposes, from warning to error to alerting readers to
important information. These all seem to fit in the same general usage
as suggested by the name, and the symbol is distinct enough that
there is no other symbol in Unicode that might suggest itself as an
alternate.


Right, but if we consider the use of WARNING SIGN as a text character, 
or contexts where an image resembling WARNING SIGN is used and WARNING 
SIGN could well be used (with the usual caveats), then it seems to 
generally indicate a warning message as opposed to an error message, on 
one hand, and a purely informative note, on the other.


The use of graphic symbols similar to WARNING SIGN e.g. in traffic signs 
is really a different issue and external to Unicode, as it is not about 
characters, though it might be tangentially related.



The use to warn about risk of personal injury would not seem to demand
additional clarification.


On the practical side, it might be in order to warn against usage that 
relies on some particular interpretation like that. What I mean is that 
it is OK to use WARNING SIGN as warning about risk of personal injury, 
but questionable to expect that people will generally take it that way 
(and not more loosely as warning of some kind).


Yucca





UTF-16 Encoding Scheme and U+FFFE

2014-06-03 Thread Richard Wordingham
How do I read definition D98 in TUS Version 6.3.0 Chapter 3 to prohibit
a file in the UTF-16 encoding scheme from starting with U+FFFE?  Or is
U+FFFE actually allowed to start such a file?

Is an implementation that deduces the encoding scheme of a plain
text file from a leading BOM to be characterised as reckless?
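
The ambiguity behind this question can be shown with a short sketch. It assumes a sniffer that looks only at the first two bytes of the file; the class BomSniff and its method are hypothetical names, and real detectors usually weigh more evidence than this.

```java
// Hypothetical BOM sniffer that trusts the first two bytes only.
final class BomSniff {
    /** Returns "BE", "LE", or "none" according to the leading two bytes. */
    static String sniff(byte[] data) {
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return "BE"; // bytes of U+FEFF in big-endian order
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return "LE"; // U+FEFF little-endian, or U+FFFE big-endian
            }
        }
        return "none";
    }
}
```

The bytes FF FE 00 41 are what U+FFFE followed by U+0041 looks like in big-endian order, yet this sniffer reports "LE" for them; a reader that then decodes the rest as little-endian sees U+4100 instead of U+0041.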

Richard.


Re: Corrigendum #9

2014-06-03 Thread Richard Wordingham
On Tue, 03 Jun 2014 16:09:27 +0900
Martin J. Dürst due...@it.aoyama.ac.jp wrote:

 I'd strongly suggest that completely independent of when and how 
 Corrigendum #9 gets tweaked or fixed, a quick and firm plan gets
 worked out for how to get rid of these codepoints in CLDR data. The
 sooner, the better.

I suspect this has already been done.  I know of no CLDR text files
still containing them.

Richard.



RE: UTF-16 Encoding Scheme and U+FFFE

2014-06-03 Thread Peter Constable
There's never been anything preventing a file from containing and beginning 
with U+FFFE. It's just not a very useful thing to do, hence not very likely.


Peter



Re: Unicode Regular Expressions, Surrogate Points and UTF-8

2014-06-03 Thread Xueming Shen

On 06/02/2014 01:01 PM, Richard Wordingham wrote:

On Mon, 2 Jun 2014 11:29:09 +0200
Mark Davis ☕️m...@macchiato.com  wrote:


\uD808\uDF45 specifies a sequence of two codepoints.

That is simply incorrect.

The above is in the sample notation of UTS #18 Version 17 Section 1.1.

 From what I can make out, the corresponding Java notation would be
\x{D808}\x{DF45}.  I don't *know* what \x{D808} and \x{DF45} match in
Java, or whether they are even acceptable.  The only thing UTS #18
RL1.7 permits them to match in Java is lone surrogates, but I don't
know if Java complies.


The notation \uD808\uDF45 is interpreted as a supplementary codepoint and
is represented internally as a pair of surrogates in String.

  Pattern.compile("\\x{D808}\\x{DF45}").matcher("\ud808\udf45").find()  -> false
  Pattern.compile("\uD808\uDF45").matcher("\ud808\udf45").find()        -> true
  Pattern.compile("\\x{D808}").matcher("\ud808\udf45").find()           -> false
  Pattern.compile("\\x{D808}").matcher("\ud808_\udf45").find()          -> true
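
For readers who want to try this themselves, the following is a minimal sketch built on java.util.regex.Pattern and standard String methods. The class SurrogatePairDemo and its helper names are hypothetical; the helpers only wrap existing JDK calls so that code-unit counts, code-point counts and regex matches can be compared side by side.

```java
import java.util.regex.Pattern;

// Hypothetical helpers for comparing code units, code points and regex matches.
final class SurrogatePairDemo {
    /** Number of UTF-16 code units in the string. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of code points; a well-formed surrogate pair counts once. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** True if the regular expression matches anywhere in the input. */
    static boolean find(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }
}
```

Calling codeUnits and codePoints on the surrogate pair D808 DF45 shows the two views of the same string (two code units, one supplementary code point, U+12345), and find can be called with the four pattern and input combinations listed above.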

-Sherman


All UTS #18 says for sure about regular expressions matching code units
is that they don't satisfy RL1.1, though Section 1.7 appears to ban
them when it says, A fundamental requirement is that Unicode text be
interpreted semantically by code point, not code units.  Perhaps it's
a fundamental requirement of something other than UTS #18.  I thought
matching parts of characters in terms of their canonical equivalences
was awkward enough, without having the additional option of matching
some of the code units!





Re: Unicode Regular Expressions, Surrogate Points and UTF-8

2014-06-03 Thread Richard Wordingham
On Tue, 03 Jun 2014 15:06:30 -0700
Xueming Shen xueming.s...@oracle.com wrote:

 On 06/02/2014 01:01 PM, Richard Wordingham wrote:
  On Mon, 2 Jun 2014 11:29:09 +0200
  Mark Davis ☕️m...@macchiato.com  wrote:
 
  \uD808\uDF45 specifies a sequence of two codepoints.
  That is simply incorrect.
  The above is in the sample notation of UTS #18 Version 17 Section
  1.1.
 
   From what I can make out, the corresponding Java notation would be
  \x{D808}\x{DF45}.  I don't *know* what \x{D808} and \x{DF45} match
  in Java, or whether they are even acceptable.  The only thing UTS
  #18 RL1.7 permits them to match in Java is lone surrogates, but I
  don't know if Java complies.
 
 The notation \uD808\uDF45 is interpreted as a supplementary
 codepoint and is represented internally as a pair of surrogates in
 String.
 
 Pattern.compile("\\x{D808}\\x{DF45}").matcher("\ud808\udf45").find()  -> false
 Pattern.compile("\uD808\uDF45").matcher("\ud808\udf45").find()        -> true
 Pattern.compile("\\x{D808}").matcher("\ud808\udf45").find()           -> false
 Pattern.compile("\\x{D808}").matcher("\ud808_\udf45").find()          -> true

Thank you for providing examples confirming that what in the UTS #18
*sample* notation would be written \uD808\uDF45, i.e. \x{D808}\x{DF45}
in Java notation, matches nothing in any 16-bit Unicode string.

Richard.



Re: UTF-16 Encoding Scheme and U+FFFE

2014-06-03 Thread Richard Wordingham
On Tue, 3 Jun 2014 21:28:05 +
Peter Constable peter...@microsoft.com wrote:

 There's never been anything preventing a file from containing and
 beginning with U+FFFE. It's just not a very useful thing to do, hence
 not very likely.

Well, while U+FFFE was apparently prohibited from public interchange,
one could be very confident of not finding it in an external file.  As
an internally generated file, it would then be much more likely to be
in the UTF-16BE or UTF-16LE encoding scheme.

Richard.


RE: UTF-16 Encoding Scheme and U+FFFE

2014-06-03 Thread Whistler, Ken
You cannot even be very confident of not finding actual ill-formed
UTF-16, like unpaired surrogates, in an external file, let alone
noncharacters.

As for the noncharacters, take a look at the collation test files
that we distribute with each version of UCA. The test data includes
test strings like the following, to verify that UCA implementations
do the correct thing when faced with unusual edge cases:

FFFE 0021
FFFE 003F
FFFE 0061
FFFE 0041
FFFE 0062
1FFFE 0021
1FFFE 003F
1FFFE 0334
...

As well as test strings starting with unpaired surrogates:

D800 0021
D800 003F
D800 0061
D800 0041
D800 0062

And while it is true that the *file* CollationTest_SHIFTED.txt doesn't
start with either a noncharacter or an unpaired surrogate -- because
all of the test data in it is represented in ASCII hex strings instead of
directly in UTF-16 -- the issue in any case isn't whether a *file* starts
with a noncharacter, but whether a UTF-16 *string* starts with a
noncharacter. Any one of those test strings could be trivially turned
into a text file by piping out that one UTF-16 string to a file. And I
could then write conformant test software that would read UTF-16
string input data from that file and run it through the UCA algorithm
to construct sortkeys for it.
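
As a rough sketch of the first step described above, the hex notation used in the test data can be turned into a UTF-16 string as follows. It assumes each line holds space-separated hexadecimal code points and nothing else; the class HexTestString and its method are hypothetical names, and the actual test files carry more structure (such as comments) than this handles.

```java
// Hypothetical parser for lines such as "FFFE 0021" or "D800 0021".
final class HexTestString {
    /** Converts space-separated hex code points into a UTF-16 String. */
    static String parse(String line) {
        StringBuilder sb = new StringBuilder();
        for (String field : line.trim().split("\\s+")) {
            // appendCodePoint accepts any value up to 10FFFF, including
            // noncharacters and surrogate code points, without complaint.
            sb.appendCodePoint(Integer.parseInt(field, 16));
        }
        return sb.toString();
    }
}
```

HexTestString.parse("FFFE 0021") yields a two-unit string that begins with the noncharacter U+FFFE, and HexTestString.parse("D800 0021") yields one that begins with an unpaired surrogate, which is the kind of input the test data is meant to exercise.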

As Peter said, the main thing that prevents running into these is
that it isn't very *useful* to start off files (or strings) with U+FFFE. (And,
additionally, in the case of UTF-16 text data files, it would be
confusing and possibly lead to misinterpretation of byte order,
if you were somehow depending solely on initial BOMs -- which
I wouldn't advise, anyway.)

Basically, the rules of standards (e.g., you shouldn't try to
publicly interchange noncharacters) are not like laws of
physics. Just because the standard says you shouldn't do
it doesn't mean it doesn't happen.

--Ken



