Re: Is the binaryness/textness of a data format a property?

2020-03-22 Thread Markus Scherer via Unicode
On Sat, Mar 21, 2020 at 12:35 PM Doug Ewell via Unicode 
wrote:

> I thought the whole premise of GB18030 was that it was Unicode mapped into
> a GB2312 framework. What characters exist in GB18030 that don't exist in
> Unicode, and have they been proposed for Unicode yet, and why was none of
> the PUA space considered appropriate for that in the meantime?
>

My memory of GB18030 is that its code space has 1.6M code points, of which
1.1M are a permutation of Unicode. For the rest you would have to go beyond
the Unicode code space for 1:1 round-trip mappings.
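For what it's worth, a rough back-of-the-envelope sketch (my own arithmetic
from the usual GB18030 byte ranges, not a quote from the spec) lands near
those figures:

public class Gb18030CodeSpace {
    public static void main(String[] args) {
        long oneByte  = 128;                   // 0x00..0x7F (ASCII)
        long twoByte  = 126L * 190;            // lead 0x81..0xFE; trail 0x40..0x7E, 0x80..0xFE
        long fourByte = 126L * 10 * 126 * 10;  // 0x81..0xFE, 0x30..0x39, 0x81..0xFE, 0x30..0x39
        System.out.println(oneByte + twoByte + fourByte);  // 1611668, roughly 1.6M
        System.out.println(0x110000 - 0x800);               // 1112064 Unicode scalar values, roughly 1.1M
    }
}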

Just please don't call it UTF-8.

markus


Re: Egyptian Hieroglyph Man with a Laptop

2020-02-12 Thread Markus Scherer via Unicode
On Wed, Feb 12, 2020 at 11:37 AM Marius Spix via Unicode <
unicode@unicode.org> wrote:

> In my opinion, this is an invalid character, which should not be
> included in Unicode.
>

Please remember that feedback that you want the committee to look at needs
to go through http://www.unicode.org/reporting.html

Best regards,
markus


Fwd: ICU 66preview available

2019-12-05 Thread Markus Scherer via Unicode
Dear Unicoders,

If you use ICU, then testing with ICU 66 *preview* is a good way of trying
out Unicode 13 *beta*.

(Just please don't use these snapshots in production releases.)

Best regards,
markus

-- Forwarded message -
Dear friends and users of ICU,

We are pleased to announce a preview of ICU 66. ICU 66 (scheduled for
release in 2020 March) will update to Unicode 13, and include some bug
fixes. This will be a low-impact release with no other significant feature
additions or implementation changes.

ICU 66 preview updates to Unicode 13 beta, including new characters,
scripts, emoji, and corresponding API constants. It also updates to CLDR
36.1 preview with Unicode 13 updates and bug fixes. For details please see
site.icu-project.org/download/66.

Please test this preview on your platforms and report bugs and regressions
by Tuesday, 2020-jan-07.

Please do not use this preview in production.

The preliminary API reference documents are published on
unicode-org.github.io/icu-docs/ – follow the “Dev” links there.

Best regards,

Markus Scherer for the ICU Project


Re: Proposal to add Roman transliteration schemes to ISO 15924.

2019-12-02 Thread Markus Scherer via Unicode
On Mon, Dec 2, 2019 at 5:47 PM विश्वासो वासुकिजः (Vishvas Vasuki) via
Unicode  wrote:

> But that says that the definitions are at
>> https://github.com/unicode-org/cldr/releases/tag/latest/common/bcp47/transform.xml,
>> but all one currently gets from that is an error message 'XML Parsing
>> Error: no element found'.
>>
>
> Yes - that needs to be fixed (+markda...@google.com - could you please? )
>
> https://github.com/unicode-org/cldr/blob/master/common/bcp47/transform.xml
> shows iast!
>

FYI A working link to the version in the latest release is
https://github.com/unicode-org/cldr/blob/latest/common/bcp47/transform.xml

The subtag I would use for IAST seems to be:
> sa-Latn-t-sa-m0-iast (https://r12a.github.io/app-subtags/ is unable to
> confirm that the extension t-sa-m0-iast is all right though.. Could
> someone confirm?)
>

I assume that the second "sa" is unnecessary, but I am not very familiar
with the -t- extension.
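A quick well-formedness check with plain java.util.Locale (just a sketch;
it validates BCP 47 syntax, not whether the -t- field values are
registered):

import java.util.Locale;

public class TagCheck {
    public static void main(String[] args) {
        Locale loc = Locale.forLanguageTag("sa-Latn-t-sa-m0-iast");
        System.out.println(loc.toLanguageTag());    // sa-Latn-t-sa-m0-iast
        System.out.println(loc.getExtension('t'));  // sa-m0-iast (RFC 6497 -t- extension)
    }
}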

Then, the next step seems to be to propose to add the below to
> https://github.com/unicode-org/cldr/blob/master/common/bcp47/transform.xml
> :
> ISO 15919, Kyoto-Harvard, ITRANS, Velthuis, SLP1, WX, National Library at
> Kolkata romanisation
> How to proceed with that?
>

I would start with filing a CLDR ticket:
http://cldr.unicode.org/index/bug-reports

Best regards,
markus


Re: Proposal to add Roman transliteration schemes to ISO 15924.

2019-12-02 Thread Markus Scherer via Unicode
On Mon, Dec 2, 2019 at 8:42 AM Roozbeh Pournader via Unicode <
unicode@unicode.org> wrote:

> You don't need an ISO 15924 script code. You need to think in terms of BCP
> 47. Sanskrit in Latin would be sa-Latn.
>

Right!

Now, if you want to distinguish the different transcription systems for
> writing Sanskrit in Latin, you can apply to registry a BCP 47 variant.
> There are also BCP 47 extension T, which may also be useful to you:
>
> https://tools.ietf.org/html/rfc6497
>

And that extension is administered by Unicode, with documentation and data
here:
http://www.unicode.org/reports/tr35/tr35.html#t_Extension

Best regards,
markus


Re: Encoding the Nsibidi script (African) for writing the Igbo language

2019-11-11 Thread Markus Scherer via Unicode
On Mon, Nov 11, 2019 at 4:03 AM Philippe Verdy via Unicode <
unicode@unicode.org> wrote:

> But first there's still no code in ISO 15924 (first step easy to complete
> before encoding in the UCS).
>

That's not first; it's nearly last.

The script code standard says "In general, script codes shall be added to
ISO 15924 when the script has been coded in ISO/IEC 10646, and when the
script is agreed, by experts in ISO 15924/RA-JAC to be unique and a *candidate
for encoding in the UCS*."

We generally assign the script code when the script is in the pipeline for
a near-future version of Unicode, which demonstrates that it's "a candidate
for encoding". We also want the name of the script to be settled, so that
the script code can be roughly mnemonic for the name.

markus


Re: Pure Regular Expression Engines and Literal Clusters

2019-10-11 Thread Markus Scherer via Unicode
On Fri, Oct 11, 2019 at 12:05 PM Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> On Thu, 10 Oct 2019 15:23:00 -0700
> Markus Scherer via Unicode  wrote:
>
> > [c \q{ch}]h should work like (ch|c)h. Note that the order matters in
> > the alternation -- so this works equivalently if longer strings are
> > sorted first.
>
> Thanks for answering the question.
>
> Does conformance UTS#18 to level 2 mandate the choice of matching
> substring? This would appear to prohibit compliance to POSIX rules,
> where the length of overall match counts.
>

We just had a discussion this week. Mark will revise the proposed update.

The idea is currently to specify properties-of-strings (and I think a
range/class with "clusters") behaving like an alternation where the longest
strings are first, and leaving it up to the regex engine exactly what that
means.

In general, UTS #18 offers a lot of things that regex implementers may or
may not adopt.

If you have specific ideas, please send them as PRI feedback.
(Discussion on the list is good and useful, but does not guarantee that it
gets looked at when it counts.)

Best regards,
markus


Re: Will TAGALOG LETTER RA, currently in the pipeline, be in the next version of Unicode?

2019-10-11 Thread Markus Scherer via Unicode
On Fri, Oct 11, 2019 at 4:37 AM Fred Brennan via Unicode <
unicode@unicode.org> wrote:

> Many users are asking me and I'm not sure of the answer (nor how to find
> it
> out).
>

You can find out by looking at the data files that are being developed for
Unicode 13.
Look at the latest UnicodeData.txt in
https://www.unicode.org/Public/13.0.0/ucd/

I don't see a TAGALOG LETTER RA there.

DerivedAge.txt there shows Tagalog characters only from Unicode 3.2.

The next place to check would be the pipeline page:
https://www.unicode.org/alloc/Pipeline.html

It shows TAGALOG LETTER RA in the section "Characters Accepted or In Ballot
for Future Versions".
UTC accepted it just in July of this year, but it's not yet in ISO ballot.

If all goes well, it could go into Unicode 14, March 2021.

Best regards,
markus


Re: Pure Regular Expression Engines and Literal Clusters

2019-10-10 Thread Markus Scherer via Unicode
On Tue, Oct 8, 2019 at 7:28 AM Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> An example UTS#18 gives for matching a literal cluster can be simplified
> to, in its notation:
>
> [c \q{ch}]
>
> This is interpreted as 'match against "ch" if possible, otherwise
> against "c".  Thus the strings "ca" and "cha" would both match the
> expression
>
> [c \q{ch}]a
>
> while "chh" but not "ch" would match against
>
> [c \q{ch}]h
>

Right. We just independently discussed this today in the UTC meeting,
connected with the "properties of strings" discussion in the proposed
update.

[c \q{ch}]h should work like (ch|c)h. Note that the order matters in the
alternation -- so this works equivalently if longer strings are sorted
first.
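For illustration, a plain java.util.regex sketch (not exactly UTS #18
semantics) of how alternation order changes what a backtracking engine
matches at a given position:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationOrder {
    public static void main(String[] args) {
        Matcher longerFirst  = Pattern.compile("(ch|c)h").matcher("chh");
        Matcher shorterFirst = Pattern.compile("(c|ch)h").matcher("chh");
        if (longerFirst.lookingAt()) {
            System.out.println(longerFirst.group());   // "chh" -- the ch alternative is tried first
        }
        if (shorterFirst.lookingAt()) {
            System.out.println(shorterFirst.group());  // "ch" -- c followed by h already succeeds
        }
    }
}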

May I correctly argue instead that matching against literal clusters
> would be satisfied by instead supporting, for this example, the regular
> subexpression "(c|ch)" or the UnicodeSet expression "[c{ch}]"?
>

ICU UnicodeSet [c{ch}] is equivalent to UTS #18 [c\q{ch}].

ICU's UnicodeSet syntax is simpler, the UTS #18 syntax is more
backward-compatible.
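A small ICU4J sketch of that UnicodeSet syntax (API from memory, so treat
it as illustrative):

import com.ibm.icu.text.UnicodeSet;

public class ClusterSet {
    public static void main(String[] args) {
        // "{ch}" makes the two-character string "ch" an element of the set.
        UnicodeSet set = new UnicodeSet("[c{ch}]").freeze();
        System.out.println(set.contains('c'));   // true
        System.out.println(set.contains("ch"));  // true, the string element
        System.out.println(set.contains('h'));   // false, h alone is not in the set
    }
}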

Best regards,
markus


Re: Manipuri/Meitei customary writing system

2019-10-04 Thread Markus Scherer via Unicode
On Fri, Oct 4, 2019 at 2:05 PM Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> > >> Is the use of the Meitei script aspirational or customary?
> > >> Which script is being used for major newspapers, popular books,
> > >> and video captions?
> > >
> > > This may give you some more information:
> > > https://www.atypi.org/conferences/tokyo-2019/programme/activity?a=906
>
> >
> > Sorry, this should have been two separate URIs (about the same talk).
> >
> > > https://www.youtube.com/watch?v=S8XxVZkfUkk
> > >
> > > It's a recent talk at ATypI in Tokyo (sponsored by Google, among
> > > others).
>
> So newspaper sales tell us that the Bengali script is still the *usual*
> script for the language.


Yes. FYI in the video, the relevant part is at 14:04-14:34.
My transcription:

"Due to the lack of readership of Meetei Mayek, local newspapers continue
to use Bengali script. On 21st September 2008, Hueiyen Lanpao, a newspaper
company, published the first Meetei Mayek newspaper set entirely using
Meetei Mayek script. Although there have been small columns for Meetei
Mayek in other newspapers, Hueiyen Lanpao is still the only local newspaper
in all of Manipur to be printed using Meetei Mayek script till date."


Earlier the presenter says that Bengali is starting to disappear from
public signage.

Is that a different question to what the 'customary' script is?
>

To me, things like newspapers are among the most indicative of customary
use.

From what I understand, someone who wants to support this language should
prepare to support both Beng and Mtei, with emphasis on Beng now and Mtei
later.

markus


Manipuri/Meitei customary writing system

2019-10-03 Thread Markus Scherer via Unicode
Dear Unicoders,

Is Manipuri/Meitei customarily written in Bangla/Bengali script or
in Meitei script?

I am looking at
https://en.wikipedia.org/wiki/Meitei_language#Writing_systems which seems
to describe writing practice in transition, and I can't quite tell where it
stands.

Is the use of the Meitei script aspirational or customary?
Which script is being used for major newspapers, popular books, and video
captions?

Thanks,
markus


Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Markus Scherer via Unicode
There are lots of ways to implement the UCA.

When you want fast string comparison, the zero weights are useful for
processing -- and you don't actually assemble a sort key.

People who want sort keys usually want them to be short, so you spend time
on compression. You probably also build sort keys as byte vectors not
uint16 vectors (because byte vectors fit into more APIs and tend to be
shorter), like ICU does using the CLDR collation data file. The CLDR root
collation data file remunges all weights into fractional byte sequences,
and leaves gaps for tailoring.
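As a concrete illustration (an ICU4J sketch; locale and strings are
arbitrary):

import com.ibm.icu.text.Collator;
import com.ibm.icu.util.ULocale;

public class SortKeySketch {
    public static void main(String[] args) {
        Collator coll = Collator.getInstance(new ULocale("de"));

        // Fast comparison: no sort key is assembled at all.
        System.out.println(coll.compare("Bär", "Baer"));

        // Sort key: a compressed byte vector (CLDR fractional weights), not uint16 weights.
        byte[] key = coll.getCollationKey("Bär").toByteArray();
        System.out.println(key.length);
    }
}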

markus


Re: Dealing with Georgian capitalization in programming languages

2018-10-02 Thread Markus Scherer via Unicode
On Tue, Oct 2, 2018 at 12:50 AM Martin J. Dürst via Unicode <
unicode@unicode.org> wrote:

> ... The only
> operation that can cause problems is 'capitalize'.
>
> When I say "cause problems", I mean producing mixed-case output. I
> originally thought that 'capitalize' would be fine. It is fine for
> lowercase input: I stays lowercase because Unicode Data indicates that
> titlecase for lowercase Georgian letters is the letter itself. But it
> will produce the apparently undesirable Mixed Case for ALL UPPERCASE input.
>
> My questions here are:
> - Has this been considered when Georgian Mtavruli was discussed in the
>UTC?
> - How have any other implementers (ICU,...) addressed this, in
>particular the operation that's called 'capitalize' in Ruby?
>

By default, ICU toTitle() functions titlecase at word boundaries (with
adjustment) and lowercase all else.
That is, we implement Unicode chapter 3.13 Default Case Conversions R3
toTitlecase(x), except that we modified the default boundary adjustment.

You can customize the boundaries (e.g., only the start of the string).
We have options for whether and how to adjust the boundaries (e.g., adjust
to the next cased letter) and for copying, not lowercasing, the other
characters.
See C++ and Java class CaseMap and the relevant options.
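For example, a minimal ICU4J sketch (the Mtavruli sample word and the class
name are placeholders; I have not verified its output here):

import com.ibm.icu.text.CaseMap;
import java.util.Locale;

public class GeorgianTitling {
    public static void main(String[] args) {
        // Sample all-uppercase (Mtavruli) Georgian word, written with escapes.
        String upper = "\u1C92\u1C90\u1C9B\u1C90\u1CA0\u1CAF\u1C9D\u1C91\u1C90";
        Locale ka = new Locale("ka");

        // Default: titlecase at word boundaries (with adjustment), lowercase all else.
        String titled = CaseMap.toTitle().apply(ka, null, upper);

        // Option: copy, rather than lowercase, the characters after each titlecased one.
        String copied = CaseMap.toTitle().noLowercase().apply(ka, null, upper);

        System.out.println(titled);
        System.out.println(copied);
    }
}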

markus


Re: Diacritic marks in parentheses

2018-07-26 Thread Markus Scherer via Unicode
I would not expect Ä + U+1ABB COMBINING PARENTHESES ABOVE = Ä᪻ to look
right except with specialized fonts.
http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%84%5Cu1ABB==0

Even if it worked widely, I think it would be confusing.
I think you are best off writing Arzt/Ärztin.

Viele Grüße,
markus


Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols

2018-05-15 Thread Markus Scherer via Unicode
On Tue, May 15, 2018 at 10:47 AM, Johnny Farraj via Unicode <
unicode@unicode.org> wrote:

> Dear Unicode list members,
>
> I wish to get feedback about a new symbol submission proposal.
>

Just to clarify, this is a discussion list where you may get some useful
feedback. This is not where you would submit an actual proposal.

See https://www.unicode.org/pending/proposals.html

I am proposing the addition of 2 new characters to the Musical Symbols
> table:
>
> - the half-flat sign (lowers a note by a quarter tone)
> - the half-sharp sign (raises a note by a quarter tone)
>

In an actual proposal, I would expect a discussion of whether you are
proposing to encode established symbols, or whether you are proposing new
symbols to be adopted by the community (in which case Unicode would
probably wait & see if they get established).

A proposal should also show evidence of usage and glyph variations.

Best regards,
markus


Re: [Unicode] Re: Fonts and font sizes used in the Unicode

2018-03-05 Thread Markus Scherer via Unicode
On Mon, Mar 5, 2018 at 9:03 AM, suzuki toshiya via Unicode <
unicode@unicode.org> wrote:

> I have a question; if some people try to make a
> translated version of Unicode, they should contact
> all font contributors and ask for the license?
> Unicode Consortium cannot give any sublicense?
>

If you want to translate the Unicode Standard or its companion standards
(UAX, UTS, ...), then please contact the Unicode Consortium.

Thus, I guess, it would not be so irrelevant to ask
> the permission to JTC1, about the fonts used in
> ISO/IEC 10646 - although it does not mean that
> JTC1 would permit anything. If I'm misunderstanding,
> please correct me.
>

The production of the ISO 10646 standard is done by the Unicode Consortium.
I am fuzzy on what exactly that means for copyright. If you need to find
out, then please contact the consortium.

markus


Re: Fonts and font sizes used in the Unicode

2018-03-04 Thread Markus Scherer via Unicode
On Sun, Mar 4, 2018 at 6:10 AM, Helena Miton via Unicode <
unicode@unicode.org> wrote:

> Greetings. Is there a way to know which font and font size have been used
> in the Unicode charts (for various writing systems)? Many thanks!
>

What are you trying to do?

Many of the fonts are unique to the Unicode chart production, and are not
licensed for other uses. Some are not even generally usable.

markus


Re: Emoji blooper

2018-02-13 Thread Markus Scherer via Unicode
On my machine (Chromebox+Gmail), the trumpets point down to the lower left.

If you want to convey precise images, then send images...

markus


Re: Internationalization & Unicode Conference 2018

2018-01-24 Thread Markus Scherer via Unicode
If your presentation is accepted for the conference, you should get a hotel
discount.
markus


Re: Minimal Implementation of Unicode Collation Algorithm

2017-12-04 Thread Markus Scherer via Unicode
On Mon, Dec 4, 2017 at 5:30 AM, Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> May a collation algorithm that always compares all strings as equal be a
> compliant implementation of the Unicode Collation Algorithm (UTS #10)?
> If not, by which clause is it not compliant?  Formally, this algorithm
> would require that all weights be zero.
>

I think so. The algorithm would be equivalent to an implementation of the
UCA with a degenerate CET that maps every character to a Completely
Ignorable Collation Element.

Would an implementation that supported no characters be compliant?
>

I guess so. I assume that would mean that the CET maps nothing, and that
the implementation does implement the implicit weighting of Han characters
and unassigned (here: unmapped) code points. It would also have to do NFD
first.

It used to be that for an implementation to be claimed as compliant, it
> also had to pass a specific conformance test.  This requirement has now
> been abandoned, perhaps because the Default Unicode Collation Element
> Table (DUCET) is incompatible with the CLDR Collation Algorithm.
>

The DUCET is missing some things that are needed by the CLDR Collation
Algorithm, but that has nothing to do with UCA compliance.

The simple fact is that tailorings are common, and it has to be possible to
conform to the algorithm without forbidding tailorings.

markus


Re: implicit weight base for U+2CEA2

2017-09-27 Thread Markus Scherer via Unicode
On Wed, Sep 27, 2017 at 4:07 PM, James Tauber  wrote:

> Ah yes, I was just going by membership in the CJK Unified Ideographs
> Extension E block, not actual assignment.
>
> So the lack of assignment means it should fail the Unified_Ideograph
> membership in http://unicode.org/reports/tr10/#Values_For_Base_Table
>

Right.

http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt

3400..4DB5; Unified_Ideograph # Lo [6582] CJK UNIFIED
IDEOGRAPH-3400..CJK UNIFIED IDEOGRAPH-4DB5
4E00..9FEA; Unified_Ideograph # Lo [20971] CJK UNIFIED
IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FEA
FA0E..FA0F; Unified_Ideograph # Lo   [2] CJK COMPATIBILITY
IDEOGRAPH-FA0E..CJK COMPATIBILITY IDEOGRAPH-FA0F
FA11  ; Unified_Ideograph # Lo   CJK COMPATIBILITY IDEOGRAPH-FA11
FA13..FA14; Unified_Ideograph # Lo   [2] CJK COMPATIBILITY
IDEOGRAPH-FA13..CJK COMPATIBILITY IDEOGRAPH-FA14
FA1F  ; Unified_Ideograph # Lo   CJK COMPATIBILITY IDEOGRAPH-FA1F
FA21  ; Unified_Ideograph # Lo   CJK COMPATIBILITY IDEOGRAPH-FA21
FA23..FA24; Unified_Ideograph # Lo   [2] CJK COMPATIBILITY
IDEOGRAPH-FA23..CJK COMPATIBILITY IDEOGRAPH-FA24
FA27..FA29; Unified_Ideograph # Lo   [3] CJK COMPATIBILITY
IDEOGRAPH-FA27..CJK COMPATIBILITY IDEOGRAPH-FA29
20000..2A6D6  ; Unified_Ideograph # Lo [42711] CJK UNIFIED
IDEOGRAPH-20000..CJK UNIFIED IDEOGRAPH-2A6D6
2A700..2B734  ; Unified_Ideograph # Lo [4149] CJK UNIFIED
IDEOGRAPH-2A700..CJK UNIFIED IDEOGRAPH-2B734
2B740..2B81D  ; Unified_Ideograph # Lo [222] CJK UNIFIED
IDEOGRAPH-2B740..CJK UNIFIED IDEOGRAPH-2B81D
2B820..2CEA1  ; Unified_Ideograph # Lo [5762] CJK UNIFIED
IDEOGRAPH-2B820..CJK UNIFIED IDEOGRAPH-2CEA1
2CEB0..2EBE0  ; Unified_Ideograph # Lo [7473] CJK UNIFIED
IDEOGRAPH-2CEB0..CJK UNIFIED IDEOGRAPH-2EBE0

# Total code points: 87882

https://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AUnified_Ideograph%3A%5D=on==
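If you'd rather check the property programmatically than read the data
file, an ICU4J sketch (results reflect the Unicode version your ICU ships
with):

import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.lang.UProperty;

public class UnifiedIdeographCheck {
    public static void main(String[] args) {
        System.out.println(UCharacter.hasBinaryProperty(0x4E00, UProperty.UNIFIED_IDEOGRAPH));   // true
        System.out.println(UCharacter.hasBinaryProperty(0x2CEA2, UProperty.UNIFIED_IDEOGRAPH));  // false, unassigned
    }
}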

markus


Re: implicit weight base for U+2CEA2

2017-09-27 Thread Markus Scherer via Unicode
On Wed, Sep 27, 2017 at 1:49 PM, James Tauber via Unicode <
unicode@unicode.org> wrote:

> I recently updated pyuca[1], my pure Python implementation of the Unicode
> Collation Algorithm to work with 8.0.0, 9.0.0, and 10.0.0 but to get all
> the tests to work, I had to special case the implicit weight base for
> U+2CEA2. The spec seems to suggest the base should be FB80 but I had to
> override just that code point to have a base of FBC0 for the tests to pass.
>
> Is this a known issue with the spec or something I've missed?
>

2CEA2..2CEAF are unassigned code points for which the UCA+DUCET uses a base
of FBC0.
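A minimal sketch of the implicit-weight computation from UTS #10 ("Values
for Base Table"); the property and block lookups are stubbed out as
parameters here:

public class ImplicitWeights {
    static int[] implicitPrimary(int cp, boolean isUnifiedIdeograph, boolean isCoreHanBlock) {
        int base;
        if (isUnifiedIdeograph) {
            base = isCoreHanBlock ? 0xFB40 : 0xFB80;  // core CJK blocks vs. other unified ideographs
        } else {
            base = 0xFBC0;  // everything else, including unassigned code points like 2CEA2..2CEAF
        }
        int aaaa = base + (cp >> 15);
        int bbbb = (cp & 0x7FFF) | 0x8000;
        return new int[] {aaaa, bbbb};
    }

    public static void main(String[] args) {
        int[] w = implicitPrimary(0x2CEA2, false, false);
        System.out.printf("%04X %04X%n", w[0], w[1]);  // FBC5 CEA2
    }
}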

markus


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-09-23 Thread Markus Scherer via Unicode
FYI, I changed the ICU behavior for the upcoming ICU 60 release (pending
code review).

Proposal & description:
https://sourceforge.net/p/icu/mailman/message/35990833/

Code changes: http://bugs.icu-project.org/trac/review/13311

Best regards,
markus

On Thu, Aug 3, 2017 at 5:34 PM, Mark Davis ☕️  wrote:

> FYI, the UTC retracted the following.
>
> *[151-C19] Consensus:* Modify the section on "Best Practices for Using
> FFFD" in section "3.9 Encoding Forms" of TUS per the recommendation in
> L2/17-168, for Unicode version 11.0.
>
> Mark
>


Re: Emoji Space

2017-07-17 Thread Markus Scherer via Unicode
On Mon, Jul 17, 2017 at 5:25 AM, Christoph Päper via Unicode <
unicode@unicode.org> wrote:

> As you may know, the combined original Japanese emoji set included three
> whitespace characters: one was the full width of a (square) emoji, one was
> half that and the last one was a quarter blank. Their KDDI Shift-JIS codes
> were F7A9, F7AA and F7AB, respectively, and their internal numeric IDs were
> #173, #174 and #175, respectively. They were apparently not adapted as new
> Unicode characters and no existing space character gained the Emoji
> property.
>

They were among the 115 or so emoji unified with Unicode 5.2-and-earlier
characters.
http://www.unicode.org/Public/UCD/latest/ucd/EmojiSources.txt

2002;;F7AA;
2003;;F7A9;
2005;;F7AB;


markus


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-03 Thread Markus Scherer via Unicode
On Wed, May 31, 2017 at 5:12 AM, Henri Sivonen  wrote:

> On Sun, May 21, 2017 at 7:37 PM, Mark Davis ☕️ via Unicode
>  wrote:
> > There is plenty of time for public comment, since it was targeted at
> > Unicode 11, the release for about a year from now, not Unicode 10, due
> > this year.
> > When the UTC "approves a change", that change is subject to comment, and
> the
> > UTC can always reverse or modify its approval up until the meeting before
> > release date. So there are ca. 9 months in which to comment.
>
> What should I read to learn how to formulate an appeal correctly?
>

I suggest you submit a write-up via http://www.unicode.org/reporting.html

and make the case there that you think the UTC should retract

http://www.unicode.org/L2/L2017/17103.htm#151-C19

*B.13.3.3 Illegal UTF-8 [Scherer, L2/17-168]*

*[151-C19] Consensus:* Modify the section on "Best Practices for Using
FFFD" in section "3.9 Encoding Forms" of TUS per the recommendation in
L2/17-168, for Unicode version 11.0.

Does it matter if a proposal/appeal is submitted as a non-member
> implementor person, as an individual person member or as a liaison
> member?


The reporting.html form exists for gathering feedback from the public. The
UTC regularly reviews and considers such feedback in its quarterly meetings.

Also, since Chromium/Blink/v8 are using ICU, I suggest you submit an ICU
ticket via http://bugs.icu-project.org/trac/newticket

and make the case there, too, that you think (assuming you do) that ICU
should change its handling of illegal UTF-8 sequences.

> If people really believed that the guidelines in that section should have
> > been conformance clauses, they should have proposed that at some point.
>
> It seems to me that this thread does not support the conclusion that
> the Unicode Standard's expression of preference for the number of
> REPLACEMENT CHARACTERs should be made into a conformance requirement
> in the Unicode Standard. This thread could be taken to support a
> conclusion that the Unicode Standard should not express any preference
> beyond "at least one and at most as many as there were bytes".
>

Given the discussion and controversy here, in my opinion, the standard
should probably tone down the "best practice" and "recommendation" language.

> Aside from UTF-8 history, there is a reason for preferring a more
> > "structural" definition for UTF-8 over one purely along valid sequences.
> > This applies to code that *works* on UTF-8 strings rather than just
> > converting them. For UTF-8 *processing* you need to be able to iterate
> both
> > forward and backward, and sometimes you need not collect code points
> while
> > skipping over n units in either direction -- but your iteration needs to
> be
> > consistent in all cases. This is easier to implement (especially in fast,
> > short, inline code) if you have to look only at how many trail bytes
> follow
> > a lead byte, without having to look whether the first trail byte is in a
> > certain range for some specific lead bytes.
>
> But the matter at hand is decoding potentially-invalid UTF-8 input
> into a valid in-memory Unicode representation, so later processing is
> somewhat a red herring as being out of scope for this step. I do agree
> that if you already know that the data is valid UTF-8, it makes sense
> to work from the bit pattern definition only.


No, it's not a red herring. Not every piece of software has a neat "inside"
with all valid text, and with a controllable surface to the "outside".

In a large project with a small surface for text to enter the system, such
as a browser with a centralized chunk of code for handling streams of input
text, it might well work to validate once and then assume "on the inside"
that you only ever see well-formed text.

In a library with API of the granularity of "compare two strings",
"uppercase a string" or "normalize a string", you have no control over your
input; you cannot assume that your input is valid; you cannot crash when
it's not valid; you cannot overrun your buffer; you cannot go into an
endless loop. It's also cumbersome to fail with an error whenever you
encounter invalid text, because you need more code for error detection &
handling, and because significant C++ code bases do not allow exceptions.
(Besides, ICU also offers C APIs.)

Processing potentially-invalid UTF-8, iterating over it, and looking up
data for it, *can* definitely be simpler (take less code etc.) if for any
given lead byte you always collect the same maximum number of trail bytes,
and if you have fewer distinct types of lead bytes with their corresponding
sequences.

Best regards,
markus


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Markus Scherer via Unicode
On Fri, May 26, 2017 at 3:28 AM, Martin J. Dürst 
wrote:

> But there's plenty in the text that makes it absolutely clear that some
> things cannot be included. In particular, it says
>
> 
> The term “maximal subpart of an ill-formed subsequence” refers to the code
> units that were collected in this manner. They could be the start of a
> well-formed sequence, except that the sequence lacks the proper
> continuation. Alternatively, the converter may have found an continuation
> code unit, which cannot be the start of a well-formed sequence.
> 
>
> And the "in this manner" refers to:
> 
> A sequence of code units will be processed up to the point where the
> sequence either can be unambiguously interpreted as a particular Unicode
> code point or where the converter recognizes that the code units collected
> so far constitute an ill-formed subsequence.
> 
>
> So we have the same thing twice: Bail out as soon as something is
> ill-formed.


The UTF-8 conversion code that I wrote for ICU, and apparently the code
that various other people have written, collects sequences starting from
lead bytes, according to the original spec, and at the end looks at whether
the assembled code point is too low for the lead byte, or is a surrogate,
or is above 10. Stopping at a non-trail byte is quite natural, and
reading the PRI text accordingly is quite natural too.

Aside from UTF-8 history, there is a reason for preferring a more
"structural" definition for UTF-8 over one purely along valid sequences.
This applies to code that *works* on UTF-8 strings rather than just
converting them. For UTF-8 *processing* you need to be able to iterate both
forward and backward, and sometimes you need not collect code points while
skipping over n units in either direction -- but your iteration needs to be
consistent in all cases. This is easier to implement (especially in fast,
short, inline code) if you have to look only at how many trail bytes follow
a lead byte, without having to look whether the first trail byte is in a
certain range for some specific lead bytes.
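To make that concrete, a small sketch (not ICU's actual code) of forward
iteration along those lines; error classification is omitted:

public class Utf8Skip {
    // The lead byte alone determines how many trail bytes to collect;
    // stop early at any byte that is not a trail byte.
    static int nextBoundary(byte[] s, int i) {
        int b = s[i] & 0xFF;
        int trail = b < 0xC0 ? 0 : b < 0xE0 ? 1 : b < 0xF0 ? 2 : 3;
        while (trail-- > 0 && i + 1 < s.length && (s[i + 1] & 0xC0) == 0x80) {
            i++;
        }
        return i + 1;  // index of the first byte of the next sequence
    }
}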

(And don't say that everyone can validate all strings once and then all
code can assume they are valid: That just does not work for library code,
you cannot assume anything about your input strings, and you cannot crash
when they are ill-formed.)

markus


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-24 Thread Markus Scherer via Unicode
On Wed, May 24, 2017 at 3:56 PM, Karl Williamson 
wrote:

> On 05/24/2017 12:46 AM, Martin J. Dürst wrote:
>
>> That's wrong. There was a public review issue with various options and
>> with feedback, and the recommendation has been implemented and in use
>> widely (among else, in major programming language and browsers) without
>> problems for quite some time.
>>
>
> Could you supply a reference to the PRI and its feedback?
>

http://www.unicode.org/review/resolved-pri-100.html#pri121

The PRI did not discuss possible different versions of "maximal subpart",
and the examples there yield the same results either way. (No non-shortest
forms.)

The recommendation in TUS 5.2 is "Replace each maximal subpart of an
> ill-formed subsequence by a single U+FFFD."
>

You are right.

http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf shows a slightly
expanded example compared with the PRI.

The text simply talked about a "conversion process" stopping as soon as it
encounters something that does not fit, so these edge cases would depend on
whether the conversion process treats original-UTF-8 sequences as single
units.

And I agree with that.  And I view an overlong sequence as a maximal
> ill-formed subsequence that should be replaced by a single FFFD. There's
> nothing in the text of 5.2 that immediately follows that recommendation
> that indicates to me that my view is incorrect.
>
> Perhaps my view is colored by the fact that I now maintain code that was
> written to parse UTF-8 back when overlongs were still considered legal
> input.  An overlong was a single unit.  When they became illegal, the code
> still considered them a single unit.
>

Right.

I can understand how someone who comes along later could say C0 can't be
> followed by any continuation character that doesn't yield an overlong,
> therefore C0 is a maximal subsequence.
>

Right.

But I assert that my interpretation is just as valid as that one.  And
> perhaps more so, because of historical precedent.
>

I agree.

markus


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Markus Scherer via Unicode
On Tue, May 23, 2017 at 7:05 AM, Asmus Freytag via Unicode <
unicode@unicode.org> wrote:

> So, if the proposal for Unicode really was more of a "feels right" and not
> a "deviate at your peril" situation (or necessary escape hatch), then we
> are better off not making a RECOMMENDATION that goes against collective
> practice.
>

I think the standard is quite clear about this:

Although a UTF-8 conversion process is required to never consume
well-formed subsequences as part of its error handling for ill-formed
subsequences, such a process is not otherwise constrained in how it deals
with any ill-formed subsequence itself. An ill-formed subsequence
consisting of more than one code unit could be treated as a single error or
as multiple errors.


markus


Re: Comparing Raw Values of the Age Property

2017-05-22 Thread Markus Scherer via Unicode
On Mon, May 22, 2017 at 2:44 PM, Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> Given two raw values of the Age property, defined in UCD file
> DerivedAge.txt, how is a computer program supposed to compare them?
> Apart from special handling for the value "Unassigned" and its short
> alias "NA", one used to be able to compare short values against short
> values and long values against long values by simple string
> comparison.  However, now we are coming to Version 10.0 of Unicode,
> this no longer works - "1.1" < "10.0" < "2.0".
>

This is normal for numbers, and for multi-field version numbers.
If you want numeric sorting, then you need to either use a collator with
that option, or parse the versions into tuples of integers and sort those.
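For example, a minimal sketch of the parse-and-compare approach (the
special values "Unassigned"/"NA" would still need their own handling):

public class VersionCompare {
    // Compare dotted version strings numerically, field by field.
    static int compareVersions(String a, String b) {
        String[] as = a.split("\\."), bs = b.split("\\.");
        int n = Math.max(as.length, bs.length);
        for (int i = 0; i < n; i++) {
            int ai = i < as.length ? Integer.parseInt(as[i]) : 0;
            int bi = i < bs.length ? Integer.parseInt(bs[i]) : 0;
            if (ai != bi) {
                return Integer.compare(ai, bi);
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(compareVersions("1.1", "10.0") < 0);  // true
        System.out.println(compareVersions("10.0", "2.0") > 0);  // true
    }
}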

There are some possibilities - the values appear in order in
> PropertyValueAliases.txt and in DerivedAge.txt.


You should not rely on the order of values in data files, unless the file
explicitly states that order matters.

Can one rely on the FULL STOP being the field
> divider,


I think so. Dots are extremely common for version numbers. I see no reason
for Unicode to use something else.

and can one rely on there never being any grouping characters
> in the short values?


I don't know what "grouping characters" you have in mind.

I think the format is pretty self-evident.

markus


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Markus Scherer via Unicode
Let me try to address some of the issues raised here.

The proposal changes a recommendation, not a requirement. Conformance
applies to finding and interpreting valid sequences properly. This includes
not consuming parts of valid sequences when dealing with illegal ones, as
explained in the section "Constraints on Conversion Processes".

Otherwise, what you do with illegal sequences is a matter of what you think
makes sense -- a matter of opinion and convenience. Nothing more.

I wrote my first UTF-8 handling code some 18 years ago, before joining the
ICU team. At the time, I believe the ISO UTF-8 definition was not yet
limited to U+10FFFF, and decoding overlong sequences and those yielding
surrogate code points was regarded as a misdemeanor. The spec has been
tightened up, but I am pretty sure that most people familiar with how UTF-8
came about would recognize such sequences as single sequences.

I believe that the discussion of how to handle illegal sequences came out
of security issues a few years ago from some implementations including
valid single and lead bytes with preceding illegal sequences. Beyond the
"Constraints on Conversion Processes", there was evidently also a desire to
recommend how to handle illegal sequences.

I think that the current recommendation was an extrapolation of common
practice for non-UTF encodings, such as Shift-JIS or GB 18030. It's ok for
UTF-8, too, but "it feels like" (yes, that's the level of argument for
stuff that doesn't really matter) not treating overlong sequences as single
sequences is "weird".

Why do we care how we carve up an illegal sequence into subsequences? Only
for debugging and visual inspection. Maybe some process is using illegal,
overlong sequences to encode something special (à la Java string
serialization, "modified UTF-8"), and for that it might be convenient too
to treat overlong sequences as single errors.

If you don't like some recommendation, then do something else. It does not
matter. If you don't reject the whole input but instead choose to replace
illegal sequences with something, then make sure the something is not
nothing -- replacing with an empty string can cause security issues.
Otherwise, what the something is, or how many of them you put in, is not
very relevant. One or more U+FFFDs is customary.

When the current recommendation came in, I thought it was reasonable but
didn't like the edge cases. At the time, I didn't think it was important to
twiddle with the text in the standard, and I didn't care that ICU didn't
exactly implement that particular recommendation.

I have seen implementations that clobber every byte in an illegal sequence
with a space, because it's easier than writing an U+FFFD for each byte or
for some subsequences. Fine. Someone might write a single U+FFFD for an
arbitrarily long illegal subsequence; that's fine, too.

Karl Williamson sent feedback to the UTC, "In short, I believe the best
practices are wrong." I think "wrong" is far too strong, but I got an
action item to propose a change in the text. I proposed a modified
recommendation. Nothing gets elevated to "right" that wasn't, nothing gets
demoted to "wrong" that was "right".

None of this is motivated by which UTF is used internally.

It is true that it takes a tiny bit more thought and work to recognize a
wider set of sequences, but a capable implementer will optimize
successfully for valid sequences, and maybe even for a subset of those for
what might be expected high-frequency code point ranges. Error handling can
go into a slow path. In a true state table implementation, it will require
more states but should not affect the performance of valid sequences.

Many years ago, I decided for ICU to add a small amount of slow-path
error-handling code for more human-friendly illegal-sequence reporting. In
other words, this was not done out of convenience; it was an inconvenience
that seemed justified by nicer error reporting. If you don't like to do so,
then don't.

Which UTF is better? It depends. They all have advantages and problems.
It's all Unicode, so it's all good.

ICU largely uses UTF-16 but also UTF-8. It has data structures and code for
charset conversion, property lookup, sets of characters (UnicodeSet), and
collation that are co-optimized for both UTF-16 and UTF-8. It has a slowly
growing set of APIs working directly with UTF-8.

So, please take a deep breath. No conformance requirement is being touched,
no one is forced to do something they don't like, no special consideration
is given for one UTF over another.

Best regards,
markus