Hello Ralph.

Ralph Corderoy wrote in <20181218113855.e29b61f...@orac.inputplus.co.uk>:
 |>> The usage of the Content-Length: tag would be very perfect
 |>
 |> mutt(1) updates these when it writes messages, too.  We actively strip
 |> them (except when *keep-content-length* is set, at least until v15, as
 |> documented), since we are not able to keep them up-to-date.  If others
 |> do too, they are not reliable.
 |
 |I too have found the `Content-Length' header a pain over the years.
 |Something carried in from the less rigorous world of Usenet IIRC.  mbox
 |and email manipulators don't have to know about it and that means it
 |easily gets out of step with `From␣' headers, giving two different
 |versions of `truth'.

Yes.  And what can you do with the info?  What would make sense is
an external index file which also has offset and such, header
info, whatever.  For the header summary well, blablabla :)

 |> I mean, if the data would start with a UTF-8 encoded Unicode BOM,
 |
 |It would be nice to not see more of these enter existence.  :-)

Well i do not know.  I do hate the BOM and said so on the Unicode
list in the past :).  However, as time goes by, i found it not to
be so bad, it has its purpose as a marker, here and there.  And
why not at the beginning of a text file.  UTF-16 has its merits
too, i would have massively opposed this in the past...
I do not use a BOM for my stuff nonetheless, except that i did offer
it for binary datastreams more than fifteen years ago, with/for
automatic endianness adjustment (read_bom(), write_bom() etc.).
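
A minimal sketch of how a BOM can drive endianness adjustment on a
binary datastream.  Only the names read_bom() and write_bom() come from
the text above; the signatures, the bom_order enum and the host_order()
and read_u32() helpers are assumptions made for illustration, not the
original code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum bom_order { BOM_NONE, BOM_BIG, BOM_LITTLE };

/* Byte order of the machine running this code. */
enum bom_order host_order(void) {
    const uint16_t probe = 0xFEFF;
    unsigned char b[2];

    memcpy(b, &probe, sizeof b);
    return b[0] == 0xFE ? BOM_BIG : BOM_LITTLE;
}

/* Store U+FEFF in host byte order at buf; returns the bytes written. */
size_t write_bom(unsigned char *buf) {
    const uint16_t bom = 0xFEFF;

    memcpy(buf, &bom, sizeof bom);
    return sizeof bom;
}

/* Look at the first two bytes and report the byte order of the writer. */
enum bom_order read_bom(const unsigned char *buf, size_t len) {
    if (len < 2)
        return BOM_NONE;
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_BIG;
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_LITTLE;
    return BOM_NONE;
}

/* Decode a 32-bit value written in the given order.
 * Anything other than BOM_BIG is decoded as little-endian. */
uint32_t read_u32(const unsigned char *p, enum bom_order order) {
    if (order == BOM_BIG)
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8) | (uint32_t)p[3];
    return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
           ((uint32_t)p[1] << 8) | (uint32_t)p[0];
}
```

The writer never swaps anything, it just emits the marker in its own
byte order; the reader calls read_bom() once and passes the result to
every read_u32() that follows.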

 |> We could possibly extend our MIME classifier, and when we have seen
 |> multiple UTF-8 sequences after reading all the body, and if and only
 |> if the current locale is "C" a.k.a. "POSIX" and if *ttycharset* is
 |> n_iconv_name_is_ascii(), _then_ we could, instead of using the normal
 |> *charset-8bit*, go for UTF-8.
 |
 |If the locale is `C', and it's valid ASCII then why not plump for 7-bit?
 |And if instead it's valid UTF-8 then it seems OK to declare it such as I
 |can't think of intended Latin1, say, that happens to be valid UTF-8,
 |e.g. not many want to write `Â£' and probably intended `£'.

That is a misunderstanding.  Our name_is_ascii() iterates over the
given name and checks whether it actually names the ASCII encoding,
for which there are many names (cp367, ANSI_X3.4-1968, US-ASCII, to
name a few).  Different systems use different names for ASCII,
the standard just does not offer a possibility to say, "yes, this
is ASCII".  IANA has a character set registry, and many names have
aliases: an official name, one from some other standard, a "preferred
MIME name" and so on.  On the Unix side this is usually solved with
charset.alias, but that does not help me as an application writer
who wants to know "what is the actual _real_ name of this
character set", so that after an iconv_open("foo") i could query the
actual name via something like iconv_character_set_name().
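
To illustrate the kind of check meant here, a minimal sketch that
compares a charset name case-insensitively against a table of ASCII
aliases.  This is a hypothetical stand-in, not the S-nail
implementation of n_iconv_name_is_ascii(); the alias table is a
non-exhaustive selection taken from the IANA registry entry for
US-ASCII plus the common plain "ASCII", and the helper
name_equal_nocase() is made up for the example.

```c
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive comparison of two charset names. */
static int name_equal_nocase(const char *a, const char *b) {
    for (; *a != '\0' && *b != '\0'; ++a, ++b)
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
    return *a == *b; /* both strings must end at the same position */
}

/* Returns 1 if name is one of the known names for ASCII, else 0. */
int name_is_ascii(const char *name) {
    static const char *const aliases[] = {
        "US-ASCII", "ASCII", "ANSI_X3.4-1968", "ANSI_X3.4-1986",
        "ISO_646.irv:1991", "ISO646-US", "iso-ir-6", "us",
        "IBM367", "cp367", "csASCII"
    };
    size_t i;

    if (name == NULL)
        return 0;
    for (i = 0; i < sizeof aliases / sizeof *aliases; ++i)
        if (name_equal_nocase(name, aliases[i]))
            return 1;
    return 0;
}
```

A table like this has to be maintained by hand, which is exactly the
problem: nothing in iconv(3) itself reports which of these names a
given system treats as canonical.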

And again, i do not know.  How do i know it is UTF-8?  Maybe it
is, maybe not.  Can i know from seeing four bytes that look like
UTF-8?  I may know from parsing an entire file and seeing only
7-bit bytes, and 8-bit ones that form valid UTF-8 sequences,
increasingly so as the number of such sequences rises.
If i see such in the plain "C" locale, with a user chosen
*ttycharset* that indeed is ASCII, then i _could_ deduce that the
input actually really is UTF-8.  That was what i was saying.
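
The "parse everything and count" idea can be sketched as follows.  The
function names, the return convention and the min_sequences threshold
are assumptions for illustration, nothing of this exists in S-nail.
The validator rejects overlong forms, surrogates and code points above
U+10FFFF, so pure 7-bit input yields a count of zero and typical
Latin1 input is rejected.

```c
#include <stddef.h>

/* Returns -1 if buf is not valid UTF-8, otherwise the number of
 * multi-byte sequences found (0 means pure 7-bit data). */
long utf8_count_multibyte(const unsigned char *buf, size_t len) {
    long count = 0;
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        unsigned long cp, min;
        size_t need, k;

        if (c < 0x80) {           /* plain 7-bit byte */
            ++i;
            continue;
        }
        if (c >= 0xC2 && c <= 0xDF) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if (c >= 0xF0 && c <= 0xF4) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return -1;            /* stray continuation, C0/C1, F5..FF */
        }
        if (len - i <= need)
            return -1;            /* sequence is cut off */
        for (k = 1; k <= need; ++k) {
            unsigned char t = buf[i + k];

            if ((t & 0xC0) != 0x80)
                return -1;        /* not a continuation byte */
            cp = (cp << 6) | (t & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;            /* overlong, out of range, surrogate */
        i += need + 1;
        ++count;
    }
    return count;
}

/* Heuristic: valid UTF-8 with at least min_sequences multi-byte
 * sequences.  A larger threshold means more confidence. */
int looks_like_utf8(const unsigned char *buf, size_t len,
                    long min_sequences) {
    long n = utf8_count_multibyte(buf, len);

    return n >= 0 && n >= min_sequences;
}
```

Whether a single valid sequence should already count as proof, or
whether the threshold has to be higher, is the open question; the code
only makes the threshold explicit.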

Then again, i am in doubt whether i should.  C.UTF-8 a.k.a.
POSIX.UTF-8 is entering the mainstream, musl C lib only has that,
OpenBSD effectively too i think, others will follow.
If a user explicitly says "LC_ALL=C mail ...", then how am
i allowed to do something like this?  If the user says "LC_ALL=C
mail" then he can also append a "-Sttycharset=BLA".

Of course it is a problem, iconv(3)ing from ASCII to UTF-8 fails
if there is some 8-bit data in the input, even LATIN1 fails
here(!)...  So i do not have any chance to do

  echo hä | LC_ALL=C s-nail ...

it will always fail, unless given an explicit -S
ttycharset=utf8/latin1/xy whatever it really is.
The problem with SuSE is that their port always added the luxury
of turning 8-bit on the input side into UTF-8 or LATIN1 otherwise,
if i understand the patch correctly.  People may have relied on
that, scripts may break, systems may start to misbehave ...
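
The iconv(3) failure with 8-bit input declared as ASCII, mentioned
above, can be reproduced with a few lines.  This is a sketch under
assumptions: ascii_to_utf8() is a made-up helper, the charset names
"US-ASCII" and "UTF-8" are assumed to be known to the local iconv, and
the EILSEQ result is what glibc and GNU libiconv are expected to
report; other implementations may behave differently.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Convert inlen bytes declared as US-ASCII to UTF-8.
 * Returns 0 on success, otherwise the errno value of the failure. */
int ascii_to_utf8(const char *in, size_t inlen, char *out, size_t outlen) {
    iconv_t cd = iconv_open("UTF-8", "US-ASCII");
    char *ip = (char *)in; /* iconv() wants char **, input is not modified */
    int err = 0;

    if (cd == (iconv_t)-1)
        return errno;
    if (iconv(cd, &ip, &inlen, &out, &outlen) == (size_t)-1)
        err = errno;       /* save before iconv_close() can change it */
    iconv_close(cd);
    return err;
}
```

The "hä" from the example arrives either as the bytes 68 C3 A4 (UTF-8
terminal) or 68 E4 (Latin1 terminal); both contain a byte above 0x7F,
so both are rejected when the input side is declared to be ASCII.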

To be honest, i do not know at the moment.  It could be a good
thing to have UTF-8 detection, but when is that sufficient (as
above)?  On the other hand, automatically falling back to LATIN1
if it is not UTF-8 cannot be it, for S-nail.

Maybe automatic UTF-8 detection with a trigger variable that
enables it, so that Werner can add it to the global SuSE mail.rc,
but off by default.  Falling back to nothing, ending with failure
if it is not UTF-8.  Something like *mime-utf8-autodetect* or so.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
