Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-29 Thread Henri Sivonen via Unicode
On Sat Jun 3 23:09:01 CDT 2017 Markus Scherer wrote: > I suggest you submit a write-up via http://www.unicode.org/reporting.html > > and make the case there that you think the UTC should retract > > http://www.unicode.org/L2/L2017/17103.htm#151-C19 The submission has

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-08-04 Thread Henri Sivonen via Unicode
On Fri, Aug 4, 2017 at 3:34 AM, Mark Davis ☕️ via Unicode wrote: > FYI, the UTC retracted the following. > > [151-C19] Consensus: Modify the section on "Best Practices for Using FFFD" > in section "3.9 Encoding Forms" of TUS per the recommendation in L2/17-168, > for Unicode

Re: Inadvertent copies of test data in L2/17-197 ?

2017-08-07 Thread Henri Sivonen via Unicode
On Mon, Aug 7, 2017 at 9:53 AM, Martin J. Dürst wrote: > I just had a look at http://www.unicode.org/L2/L2017/17197-utf8-retract.pdf > to use the test data in there for Ruby. > I was under the impression from previous looks at it that it contained a lot > of test data. It

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Henri Sivonen via Unicode
On Mon, May 15, 2017 at 6:37 PM, Alastair Houghton <alast...@alastairs-place.net> wrote: > On 15 May 2017, at 11:21, Henri Sivonen via Unicode <unicode@unicode.org> > wrote: >> >> In reference to: >> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-18 Thread Henri Sivonen via Unicode
On Thu, May 18, 2017 at 2:41 AM, Asmus Freytag via Unicode wrote: > On 5/17/2017 2:31 PM, Richard Wordingham via Unicode wrote: > > There's some sort of rule that proposals should be made seven days in > advance of the meeting. I can't find it now, so I'm not sure whether >

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 9:36 PM, Markus Scherer wrote: > Let me try to address some of the issues raised here. Thank you. > The proposal changes a recommendation, not a requirement. This is a very bad reason in favor of the change. If anything, this should be a reason why

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 10:22 AM, Asmus Freytag wrote: > but I think the way he raises this point is needlessly antagonistic. I apologize. My level of dismay at the proposal's ICU-centricity overcame me. On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode wrote: > I’m not sure how the discussion of “which is better” relates to the > discussion of ill-formed UTF-8 at all. Clearly, the "which is better" issue is distracting from the underlying issue. I'll clarify what I

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 6:23 AM, Karl Williamson <pub...@khwilliamson.com> wrote: > On 05/15/2017 04:21 AM, Henri Sivonen via Unicode wrote: >> >> In reference to: >> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf >> >> I think Unico

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 1:09 PM, Alastair Houghton <alast...@alastairs-place.net> wrote: > On 16 May 2017, at 09:31, Henri Sivonen via Unicode <unicode@unicode.org> > wrote: >> >> On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton >> <alast...@alastairs-p

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 9:50 AM, Henri Sivonen wrote: > Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick > test with three major browsers that use UTF-16 internally and have > independent (of each other) implementations of UTF-8 decoding > (Firefox, Edge

Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Henri Sivonen via Unicode
In reference to: http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf I think Unicode should not adopt the proposed change. The proposal is to make ICU's spec violation conforming. I think there is both a technical and a political reason why the proposal is a bad idea. First, the technical

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Henri Sivonen via Unicode
On Wed, May 31, 2017 at 8:11 PM, Richard Wordingham via Unicode <unicode@unicode.org> wrote: > On Wed, 31 May 2017 15:12:12 +0300 > Henri Sivonen via Unicode <unicode@unicode.org> wrote: >> I am not claiming it's too difficult to implement. I think it >> inappropriat

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Henri Sivonen via Unicode
I've researched this more. While the old advice dominates the handling of non-shortest forms, there is more variation than I previously thought when it comes to truncated sequences and CESU-8-style surrogates. Still, the ICU behavior is an outlier considering the set of implementations that I
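For readers unfamiliar with the three categories named in the message above (non-shortest forms, truncated sequences, and CESU-8-style surrogates), here is a minimal sketch that feeds one sample of each to a decoder and counts the U+FFFD characters it emits. Python and its built-in `bytes.decode` are used purely for illustration and are not taken from the thread; the helper name `count_replacements` and the sample byte strings are assumptions made for this example. The counts reflect only the decoder being run and say nothing about ICU or any other implementation discussed in the thread.

```python
# Illustrative sketch only: shows how many U+FFFD characters one particular
# decoder (CPython's built-in UTF-8 codec) emits for ill-formed input.
# The helper name and the sample inputs are hypothetical, chosen for this example.

REPLACEMENT = "\ufffd"


def count_replacements(data: bytes) -> int:
    """Decode with errors='replace' and count the U+FFFD characters produced."""
    return data.decode("utf-8", errors="replace").count(REPLACEMENT)


samples = {
    # C0 AF is an overlong (non-shortest) encoding of '/'.
    "non-shortest form (C0 AF)": b"\xc0\xaf",
    # E2 82 is the start of a three-byte sequence cut off before its last byte.
    "truncated sequence (E2 82, then 'A')": b"\xe2\x82A",
    # ED A0 80 would encode the surrogate U+D800, as found in CESU-8-style data.
    "CESU-8-style surrogate (ED A0 80)": b"\xed\xa0\x80",
}

for label, data in samples.items():
    print(f"{label}: {count_replacements(data)} x U+FFFD")
```

Running the same byte strings through a different decoder is a quick way to see whether it emits one U+FFFD per rejected byte or one per longer ill-formed run, which is the kind of variation the message describes.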

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-06 Thread Henri Sivonen via Unicode
On Mon, Jun 4, 2018 at 10:49 PM, Manish Goregaokar via Unicode wrote: > The Rust community is considering adding non-ascii identifiers, which follow > UAX #31 (XID_Start XID_Continue*, with tweaks). UAX #31 is rather light on documenting its rationale. I realize that XML is a different case

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-08 Thread Henri Sivonen via Unicode
On Wed, Jun 6, 2018 at 2:55 PM, Henri Sivonen wrote: > Considering that ruling out too much can be a problem later, but just > treating anything above ASCII as opaque hasn't caused trouble (that I > know of) for HTML other than compatibility issues with XML's stricter > stance, why should a

PDF restrictions on the Unicode Standard 10.0

2018-01-13 Thread Henri Sivonen via Unicode
I was reading https://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf on a Sony Digital Paper device and tried to scribble some notes and make highlights but I couldn't. I still couldn't after ensuring that the pen was charged and could write on other PDFs. Since Evince told me

Re: Unicode String Models

2018-09-11 Thread Henri Sivonen via Unicode
On Sat, Sep 8, 2018 at 7:36 PM Mark Davis ☕️ via Unicode wrote: > > I recently did some extensive revisions of a paper on Unicode string models > (APIs). Comments are welcome. > > https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit# * The Grapheme Cluster Model

Re: Unicode String Models

2018-09-11 Thread Henri Sivonen via Unicode
On Tue, Sep 11, 2018 at 2:13 PM Eli Zaretskii wrote: > > > Date: Tue, 11 Sep 2018 13:12:40 +0300 > > From: Henri Sivonen via Unicode > > > > * I suggest splitting the "UTF-8 model" into three substantially > > different models: > > > >

Is the Editor's Draft public?

2018-04-20 Thread Henri Sivonen via Unicode
Is the Editor's Draft of the Unicode Standard visible publicly? Use case: Checking if things that I might send feedback about have already been addressed since the publication of Unicode 10.0. -- Henri Sivonen hsivo...@hsivonen.fi https://hsivonen.fi/

Re: Is the Editor's Draft public?

2018-04-20 Thread Henri Sivonen via Unicode
On Fri, Apr 20, 2018 at 12:16 PM, Martin J. Dürst wrote: > On 2018/04/20 18:12, Martin J. Dürst wrote: > >> There was an announcement for a public review period just recently. The >> review period is up to the 23rd of April. I'm not sure whether the >> announcement is up

Re: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

2018-12-10 Thread Henri Sivonen via Unicode
We're about to remove the U+FFFD generation for the case where there is no content between two ISO-2022-JP escape sequences from the WHATWG Encoding Standard. Is there anything wrong with my analysis that U+FFFD generation in that case is not a useful security measure when unnecessary transitions

Re: Unicode String Models

2018-11-22 Thread Henri Sivonen via Unicode
On Tue, Oct 2, 2018 at 3:04 PM Mark Davis ☕️ wrote: > > * The Python 3.3 model mentions the disadvantages of memory usage >> cliffs but doesn't mention the associated performance cliffs. It would >> be good to also mention that when a string manipulation causes the >> storage to expand or

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-11-22 Thread Henri Sivonen via Unicode
reply. Why is excluding junk important? > On Fri, Jun 8, 2018 at 11:07 AM, Henri Sivonen via Unicode wrote: >> >> On Wed, Jun 6, 2018 at 2:55 PM, Henri Sivonen wrote: >> > Considering that ruling out too much can be a problem later, but just >> > treating anythin

Re: Unicode String Models

2018-09-12 Thread Henri Sivonen via Unicode
On Wed, Sep 12, 2018 at 11:37 AM Hans Åberg via Unicode wrote: > The idea is to extend Unicode itself, so that those bytes can be represented > by legal codepoints. Extending Unicode itself would likely create more problems than it would solve. Extending the value space of Unicode scalar values

Re: Proposing mostly invisible characters

2019-09-12 Thread Henri Sivonen via Unicode
On Thu, Sep 12, 2019, 15:53 Christoph Päper via Unicode wrote: > ISHY/SIHY is especially useful for encoding (German) noun compounds in > wrapped titles, e.g. on product labeling, where hyphens are often > suppressed for stylistic reasons, e.g. orthographically correct > _Spargelsuppe_,

Re: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

2020-08-17 Thread Henri Sivonen via Unicode
/tr36/? > > Mark > > > On Mon, Dec 10, 2018 at 11:10 AM Henri Sivonen via Unicode > wrote: >> >> We're about to remove the U+FFFD generation for the case where there >> is no content between two ISO-2022-JP escape sequences from the WHATWG >> Encoding