On Sat, 19 Mar 2022, 'Pascal Jasmin' via Programming wrote:
(It probably would help everyone if there was a shorter retelling of the
semantics, even assuming the reader was able to skim through most of
it.)
There's a bulleted list at the end. Everything preceding it is
rationale.
Is there a UCS-1 that is different from J's utf8? Is UCS-1 actually an
update of utf8 that has differences?
Sorry, I should have explained this. UCS-1 is what I call J's unicode
encoding of one byte per code unit, one code unit per code point. (Not
very 'universal', as all it can represent is ASCII + a few unicode
characters, but I don't know what else to call it.) There is nothing
wrong with it, as an implementation strategy, but it should not be exposed
to the user. No one else uses it; it is completely uninteresting from an
interoperability standpoint.
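
To make the limitation concrete, here is a minimal Python sketch. It uses
Python's latin-1 codec as a stand-in for a one-byte-per-code-point encoding;
that substitution and the helper name encode_one_byte are assumptions for
illustration, not a description of J's implementation.

```python
# Stand-in for a one-byte-per-code-point encoding (assumption: Python's
# latin-1 codec, which maps code points 0..255 directly to single bytes).
def encode_one_byte(text):
    return text.encode("latin-1")

# U+00EF fits in one byte, so five code points become five bytes.
small = encode_one_byte("na\u00efve")

# U+03BB (Greek lambda) is above 255 and cannot be represented at all.
try:
    encode_one_byte("\u03bb")
    representable = True
except UnicodeEncodeError:
    representable = False
```

Anything above code point 255 simply has no encoding in such a scheme,
which is why it is of little use for interchange.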
Is your proposal's main concern some ability to handle malformed
unicode/utf8 sequences?
Random access to unicode, and no incoherent aliasing. Handling malformed
sequences is gravy.
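
A small Python sketch of the aliasing problem, for readers who have not
run into it. Python is used only for illustration; the variable names are
made up for this example.

```python
text = "h\u00e9llo"           # 5 code points
raw = text.encode("utf-8")    # 6 bytes: U+00E9 takes two bytes in utf8

# Indexing by code point yields a whole character.
second_char = text[1]

# Indexing the utf8 bytes lands on a lead byte, which is not a character.
second_byte = raw[1]

# The same data reports two different lengths depending on the view taken.
lengths = (len(text), len(raw))
```

Random access only makes sense on the code-point view; the byte view gives
an index that may point into the middle of a character.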
If handling means turn that "character" into null
[...]
The main idea may instead be that if there is malformed unicode, then
instead of figuring out some result, whoever sent this garbage should be
notified that it is garbage.
I proposed that both mechanisms be available, and the programmer can
choose from among them at will.
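
As a rough analogy for offering both behaviours, Python's decoder takes an
error-handling argument that the programmer selects per call. This is
Python's mechanism, shown as an assumption about what "both available"
could look like, not the interface I proposed for J; the function names
are hypothetical.

```python
bad = b"ab\xffcd"   # 0xFF never appears in well-formed utf8

def decode_strict(data):
    # Reject malformed input outright, so the sender can be told.
    return data.decode("utf-8", errors="strict")

def decode_lenient(data):
    # Substitute U+FFFD for each undecodable byte and carry on.
    return data.decode("utf-8", errors="replace")

try:
    decode_strict(bad)
    rejected = False
except UnicodeDecodeError:
    rejected = True

lenient = decode_lenient(bad)
```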
If handling means turn that "character" into null, how do you guarantee
the malformation wasn't a missing byte, and that the rest of the
"stream" would be well formed (and the intent of message) if that
missing byte could be guessed instead of consuming "the first byte of
next character".
Low-level stream processing will need to do low-level encoding handling.
This might entail handling the stream as a sequence of numbers rather than
characters.
I will also note that my 'nulls' are typed; you get a separate one for
every potentially bad source byte, so no information is thrown away. But
you might not want to use that for byte-slices of a valid utf8 stream.
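
The closest existing mechanism I know of is Python's "surrogateescape"
error handler, which maps each undecodable byte to its own distinct
placeholder code point. The sketch below uses it only as an analogy (an
assumption on my part, not the design in the proposal) to show that
per-byte placeholders lose nothing.

```python
bad = b"a\xff\xfeb"   # two bytes that are invalid in utf8

# Each bad byte 0xNN becomes the lone surrogate U+DCNN, so the two bad
# bytes stay distinguishable from each other.
decoded = bad.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_trip = decoded.encode("utf-8", errors="surrogateescape")

# A byte-slice of valid utf8 also decodes "successfully", which may not be
# what you want: half of U+00E9 becomes a placeholder.
half = "\u00e9".encode("utf-8")[:1].decode("utf-8", errors="surrogateescape")
```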
I believe you are also saying that UCS-1 or utf8 are ubiquitous in the
outside world. I can only understand the appeal as one of
space/bandwidth saving. A better space saving encoding is lempel-ziv
(zip) or better compression on unicode4.
I am not proposing that utf8 be used as an internal representation.
Emphatically the opposite. But because utf8 is ubiquitous, all
interoperation should default to encoding/decoding as utf8.
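
In sketch form, with Python standing in and hypothetical helper names
to_wire and from_wire: the program works on code points internally and
utf8 appears only at the boundary.

```python
def to_wire(text):
    # Leaving the program: encode code points as utf8 bytes.
    return text.encode("utf-8")

def from_wire(data):
    # Entering the program: decode utf8 bytes back to code points.
    return data.decode("utf-8")

message = "\u03bb-calculus"   # 10 code points internally
wire = to_wire(message)       # 11 bytes on the wire
```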