> On 16 May 2017, at 20:43, Richard Wordingham via Unicode
> wrote:
>
> On Tue, 16 May 2017 11:36:39 -0700
> Markus Scherer via Unicode wrote:
>
>> Why do we care how we carve up an illegal sequence into subsequences?
>> Only for debugging and visual inspection. Maybe some process is using
>>
Hans Åberg wrote:
> It would be useful, for use with filesystems, to have Unicode
> codepoint markers that indicate how UTF-8, including non-valid
> sequences, is translated into UTF-32 in a way that the original
> octet sequence can be restored.
I have always argued strongly against this idea,
Richard Wordingham wrote:
>> It is not at all clear what the intent of the encoder was - or even
>> if it's not just a problem with the data stream. E0 80 80 is not
>> permitted, it's garbage. An encoder can't "intend" it.
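
As an editorial illustration of why E0 80 80 is rejected: it is an "overlong" form, meaning that plain bit arithmetic on its payload bits yields U+0000, which a conformant encoder would write as the single byte 00. The sketch below is not taken from any message in the thread; `naive_decode` and `strict_rejects` are hypothetical helper names, and the naive decoder deliberately omits the checks a compliant decoder must make.

```python
def naive_decode(seq):
    """Decode ONE UTF-8-style sequence by bit arithmetic only.

    Deliberately non-compliant: no overlong, surrogate, or range checks.
    """
    lead = seq[0]
    if lead < 0x80:
        return lead
    if lead >> 5 == 0b110:        # 110xxxxx: 2-byte form
        cp = lead & 0x1F
    elif lead >> 4 == 0b1110:     # 1110xxxx: 3-byte form
        cp = lead & 0x0F
    elif lead >> 3 == 0b11110:    # 11110xxx: 4-byte form
        cp = lead & 0x07
    else:
        raise ValueError("not a lead byte")
    for b in seq[1:]:
        if b >> 6 != 0b10:        # trail bytes must look like 10xxxxxx
            raise ValueError("not a trail byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp


def strict_rejects(seq):
    """True if Python's built-in strict UTF-8 decoder refuses the bytes."""
    try:
        seq.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True


for seq in (b"\x00", b"\xc0\x80", b"\xe0\x80\x80"):
    print(seq.hex(" "), "->", hex(naive_decode(seq)),
          "| strict decoder rejects:", strict_rejects(seq))
```

All three byte strings carry the payload value 0 under the naive arithmetic, but only the one-byte form is well-formed UTF-8.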
>
> It was once a legal way of encoding NUL, just like C0 80, which is
> st
Henri Sivonen wrote:
> I find it shocking that the Unicode Consortium would change a
> widely-implemented part of the standard (regardless of whether Unicode
> itself officially designates it as a requirement or suggestion) on
> such flimsy grounds.
>
> I'd like to register my feedback that I beli
> On 17 May 2017, at 22:36, Doug Ewell via Unicode wrote:
>
> Hans Åberg wrote:
>
>> It would be useful, for use with filesystems, to have Unicode
>> codepoint markers that indicate how UTF-8, including non-valid
>> sequences, is translated into UTF-32 in a way that the original
>> octet sequen
Hans Åberg wrote:
>> Far from solving the stated problem, it would introduce a new one:
>> conversion from the "bad data" Unicode code points, currently
>> well-defined, would become ambiguous.
>
> Actually not: just translate the invalid UTF-8 sequences into invalid
> UTF-32.
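
One way to picture the suggestion above, as a rough sketch only: each byte that is not part of a well-formed UTF-8 sequence is mapped to a 32-bit value outside the Unicode code space, so the original octets can be recovered. The thread does not specify a mapping; the base value `INVALID_BASE` and the function names are assumptions made for this illustration, and the sketch relies on Python's standard `surrogateescape` error handler to locate the invalid bytes.

```python
INVALID_BASE = 0x110000  # assumption: first value past U+10FFFF, so never a valid code point


def utf8_to_units(data):
    """Map bytes to 32-bit units; each invalid byte becomes INVALID_BASE + byte."""
    # surrogateescape turns every undecodable byte 0xNN into the lone surrogate U+DCNN.
    text = data.decode("utf-8", errors="surrogateescape")
    units = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:          # an escaped invalid byte
            units.append(INVALID_BASE + (cp - 0xDC00))
        else:                               # an ordinary scalar value
            units.append(cp)
    return units


def units_to_utf8(units):
    """Inverse mapping: restore the exact original octet sequence."""
    out = bytearray()
    for u in units:
        if u >= INVALID_BASE:
            out.append(u - INVALID_BASE)    # put the raw byte back
        else:
            out += chr(u).encode("utf-8")
    return bytes(out)


original = b"a\xe0\x80\x80b"
units = utf8_to_units(original)
print([hex(u) for u in units])
assert units_to_utf8(units) == original
```

Because the out-of-range values cannot collide with real code points, the mapping is reversible; whether such values should ever be exchanged between processes is exactly what the thread disputes.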
Far from solving th
> On 17 May 2017, at 23:18, Doug Ewell wrote:
>
> Hans Åberg wrote:
>
>>> Far from solving the stated problem, it would introduce a new one:
>>> conversion from the "bad data" Unicode code points, currently
>>> well-defined, would become ambiguous.
>>
>> Actually not: just translate the invali
On Wed, 17 May 2017 13:41:56 -0700
Doug Ewell via Unicode wrote:
> Perhaps surprisingly, it's already too late. UTC approved this change
> the day after the proposal was written.
>
> http://www.unicode.org/L2/L2017/17103.htm#151-C19
Approved for Unicode 11.0. Unicode 10.0 has yet to be release
On Wed, 17 May 2017 13:37:51 -0700
Doug Ewell via Unicode wrote:
> Richard Wordingham wrote:
>
> >> It is not at all clear what the intent of the encoder was - or even
> >> if it's not just a problem with the data stream. E0 80 80 is not
> >> permitted, it's garbage. An encoder can't "intend" it
Richard Wordingham wrote:
> So it was still a legal way for a non-UTF-8-compliant process!
Anything is possible if you are non-compliant. You can encode U+263A
with 9,786 FF bytes followed by a terminating FE byte and call that
"UTF-8," if you are willing to be non-compliant enough.
> Note for e
On Wed, 17 May 2017 15:31:56 -0700
Doug Ewell via Unicode wrote:
> Richard Wordingham wrote:
>
> > So it was still a legal way for a non-UTF-8-compliant process!
>
> Anything is possible if you are non-compliant. You can encode U+263A
> with 9,786 FF bytes followed by a terminating FE byte an
On 5/17/2017 2:31 PM, Richard Wordingham via Unicode wrote:
There's some sort of rule that proposals should be made seven days in
advance of the meeting. I can't find it now, so I'm not sure whether
the actual rule was followed, let alone what authority it has.
I find it intriguing that the update intends to enforce the decoding of the
**shortest** sequences, but now wants to treat **maximal sequences** as a
single unit with arbitrary length. UTF-8 was designed to work only with
some state machines that would NEVER need to parse more than 4 bytes.
For me,
Richard Wordingham wrote:
I'm afraid I don't get the analogy.
You can't build a full Unicode system out of Unicode-compliant parts.
Others will have to address Richard's point about canonical-equivalent
sequences.
However, having dug out Unicode Version 2 Appendix A Section 2 UTF-8
(in h
On Thu, 18 May 2017 02:04:55 +0200
Philippe Verdy via Unicode wrote:
> I find it intriguing that the update intends to enforce the decoding
> of the **shortest** sequences, but now wants to treat **maximal
> sequences** as a single unit with arbitrary length. UTF-8 was
> designed to work only with
On Thu, May 18, 2017 at 2:41 AM, Asmus Freytag via Unicode
wrote:
> On 5/17/2017 2:31 PM, Richard Wordingham via Unicode wrote:
>
> There's some sort of rule that proposals should be made seven days in
> advance of the meeting. I can't find it now, so I'm not sure whether
> the actual rule was fo