On Wed, 31 May 2017 19:24:04 +
Shawn Steele via Unicode wrote:
> It seems to me that if a data stream of ambiguous quality is to be
> used in another application with predictable results, then that
> stream should be “repaired” prior to being handed over. Then both
> endpoints would be usin
On Wed, 31 May 2017 17:43:08 +
Shawn Steele via Unicode wrote:
> There also appears to be a special weight given to
> non-minimally-encoded sequences. It would seem to me that none of
> these illegal sequences should appear in practice, so we have either:
> I do not understand the energy
> And *that* is what the specification says. The whole problem here is that
> someone elevated
> one choice to the status of “best practice”, and it’s a choice that some of
> us don’t think *should*
> be considered best practice.
> Perhaps “best practice” should simply be altered to say that yo
Henri Sivonen wrote:
> If anything, I hope this thread results in the establishment of a
> requirement for proposals to come with proper research about what
> multiple prominent implementations do about the subject matter of a
> proposal concerning changes to text about implementation behavior.
> it’s more meaningful for whoever sees the output to see a single U+FFFD
> representing
> the illegally encoded NUL than it is to see two U+FFFDs, one for an invalid
> lead byte and
> then another for an “unexpected” trailing byte.
I disagree. It may be more meaningful for some applications
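For context on the “one U+FFFD vs. two” disagreement, here is a quick check of what Python 3 (one of the implementations cited later in the thread) does with the illegally encoded NUL. Its decoder follows the maximal-subpart approach, so it emits one U+FFFD per bogus byte rather than one per ill-formed sequence:

```python
# Python 3's UTF-8 decoder: 0xC0 can never begin a valid sequence,
# so it and the trailing 0x80 are each replaced independently.
overlong_nul = b"\xc0\x80"  # non-shortest-form encoding of U+0000
decoded = overlong_nul.decode("utf-8", errors="replace")
assert decoded == "\ufffd\ufffd"  # two replacement characters, not one
```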
> For implementations that emit FFFD while handling text conversion and repair
> (i.e., converting ill-formed
> UTF-8 to well-formed), it is best for interoperability if they get the same
> results, so that indices within the
> resulting strings are consistent across implementations for all the cor
> I do not understand the energy being invested in a case that shouldn't
happen, especially in a case that is a subset of all the other bad cases
that could happen.
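The index-consistency concern can be made concrete. Under per-maximal-subpart replacement (what Python 3 does), an ill-formed run followed by ASCII yields a different string length than under a policy that collapses the whole run into one U+FFFD; the single-FFFD result below is constructed by hand for comparison, not produced by any particular decoder:

```python
data = b"\xf0\x80\x80A"  # 0xF0 lead, two invalid continuations, then 'A'

# Python 3: 0x80 is not a valid continuation after 0xF0, so each of
# the three bad bytes becomes its own U+FFFD.
per_byte = data.decode("utf-8", errors="replace")
assert per_byte == "\ufffd\ufffd\ufffdA"
assert per_byte.index("A") == 3

# A decoder emitting a single U+FFFD for the whole ill-formed run
# would produce a 2-character string: 'A' lands at index 1 instead.
single_fffd = "\ufffdA"  # hand-built for illustration
assert single_fffd.index("A") == 1
```

Two conforming-looking decoders thus disagree on every index past the damage, which is exactly the interoperability hazard the quoted paragraph describes.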
I think Richard stated the most compelling reason:
… The bug you mentioned arose from two different ways of
counting the string leng
On 31 May 2017, at 18:43, Shawn Steele via Unicode wrote:
>
> It is unclear to me what the expected behavior would be for this corruption
> if, for example, there were merely a half dozen 0x80 in the middle of ASCII
> text? Is that garbage a single "character"? Perhaps because it's a
> conse
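For what it’s worth, under the per-maximal-subpart approach the “half dozen 0x80” question has an unambiguous answer: a stray continuation byte can never join anything, so each one is an ill-formed sequence of length one and gets its own U+FFFD. Python 3 shown as one such implementation:

```python
data = b"ASCII \x80\x80\x80\x80\x80\x80 text"
decoded = data.decode("utf-8", errors="replace")
# Each lone continuation byte is replaced individually: six U+FFFDs.
assert decoded == "ASCII " + "\ufffd" * 6 + " text"
```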
> On 30 May 2017, at 18:11, Shawn Steele via Unicode wrote:
>
>> Which is to completely reverse the current recommendation in Unicode 9.0.
>> While I agree that this might help you fending off a bug report, it would
>> create chances for bug reports for Ruby, Python3, many if not all Web
>
> > In either case, the bad characters are garbage, so neither approach is
> > "better" - except that one or the other may be more conducive to the
> > requirements of the particular API/application.
> There's a potential issue with input methods that indirectly edit the backing
> store. For e
On Wed, 31 May 2017 15:12:12 +0300
Henri Sivonen via Unicode wrote:
> The write-up mentions
> https://bugs.chromium.org/p/chromium/issues/detail?id=662822#c13 . I'd
> like to draw everyone's attention to that bug, which is real-world
> evidence of a bug arising from two UTF-8 decoders within one
I've researched this more. While the old advice dominates the handling
of non-shortest forms, there is more variation than I previously
thought when it comes to truncated sequences and CESU-8-style
surrogates. Still, the ICU behavior is an outlier considering the set
of implementations that I teste
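A sketch of the three categories mentioned, as handled by Python 3’s decoder; other implementations may differ on the first and third, which is the variation the research found:

```python
# Non-shortest form: 0xE0 0x80 0xAF is an overlong encoding of U+002F.
# 0x80 is not a valid continuation after 0xE0, so three U+FFFDs.
assert b"\xe0\x80\xaf".decode("utf-8", errors="replace") == "\ufffd" * 3

# Truncated sequence: 0xE2 0x82 is a valid prefix of U+20AC cut short;
# the maximal subpart spans both bytes, so a single U+FFFD.
assert b"\xe2\x82".decode("utf-8", errors="replace") == "\ufffd"

# CESU-8-style surrogate: 0xED 0xA0 0x80 would encode U+D800; 0xA0 is
# outside the range allowed after 0xED, so three separate U+FFFDs.
assert b"\xed\xa0\x80".decode("utf-8", errors="replace") == "\ufffd" * 3
```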