Hello Ralph.

Ralph Corderoy wrote in <20181218113855.e29b61f...@orac.inputplus.co.uk>:
|>> The usage of the Content-Length: tag would be very perfect
|>
|> mutt(1) updates these when it writes messages, too. We actively strip
|> them (except when *keep-content-length* is set, at least until v15, as
|> documented), since we are not able to keep them up-to-date. If others
|> do too, they are not reliable.
|
|I too have found the `Content-Length' header a pain over the years.
|Something carried in from the less rigorous world of Usenet IIRC. mbox
|and email manipulators don't have to know about it and that means it
|easily gets out of step with `From␣' headers, giving two different
|versions of `truth'.

Yes. And what can you do with the info? What would make sense is an
external index file which also has offset and such, header info,
whatever. For the header summary well, blablabla :)

|> I mean, if the data would start with a UTF-8 encoded Unicode BOM,
|
|It would be nice to not see more of these enter existence. :-)

Well i do not know. I do hate the BOM and said so on the Unicode list
in the past :). However, as time goes by, i found it not to be so bad,
it has its purpose for markers, here and there. And why not at the
beginning of a text file. UTF-16 has its merits too, i would have
massively opposed this in the past... I do not use BOM for my stuff
nonetheless, except i did offer it for the binary datastreams more than
fifteen years ago, with/for automatic endianness adjustment (read_bom(),
write_bom() etc.).

|> We could possibly extend our MIME classifier, and when we have seen
|> multiple UTF-8 sequences after reading all the body, and if and only
|> if the current locale is "C" a.k.a. and if *ttycharset* is
|> n_iconv_name_is_ascii(), _then_ we could, instead of using the normal
|> *charset-8bit*, go for UTF-8.
|
|If the locale is `C', and it's valid ASCII then why not plump for 7-bit?
|And if instead it's valid UTF-8 then it seems OK to declare it such as I
|can't think of intended Latin1, say, that happens to be valid UTF-8,
|e.g. not many want to write `£' and probably intended `£'.

That is a misunderstanding. Our name_is_ascii() iterates over the given
name and checks whether it actually names the ASCII encoding, for which
there are many names (cp367, ANSI_X3.4-1968, US-ASCII, to name a few).
Different systems use different names for ASCII, the standard just does
not offer a possibility to say, "yes, this is ASCII". IANA has a
character set registry, and many names have aliases, official name, some
other standard, "preferred MIME name" and so on.
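As an illustration of such an alias check, here is a minimal sketch in
Python rather than C. It is not S-nail's implementation: the function
name only mirrors the one mentioned above, the alias set is the IANA
list for US-ASCII plus the bare "ascii" that many iconv implementations
accept, and the case-insensitive comparison is an assumption.

```python
# Illustration only; the real check in S-nail is written in C.
# Aliases registered by IANA for US-ASCII, plus the bare "ascii"
# (an assumption: not an IANA alias, but widely accepted by iconv).
ASCII_ALIASES = frozenset({
    "us-ascii", "ansi_x3.4-1968", "ansi_x3.4-1986", "iso_646.irv:1991",
    "iso646-us", "iso-ir-6", "cp367", "ibm367", "us", "csascii",
    "ascii",
})


def name_is_ascii(name: str) -> bool:
    """Return True if `name` is one of the known labels for ASCII."""
    # Charset names are compared case-insensitively (assumption).
    return name.strip().lower() in ASCII_ALIASES
```

With this, name_is_ascii("cp367") and name_is_ascii("ANSI_X3.4-1968")
are both true, while name_is_ascii("UTF-8") is false. Note that the
check looks at the charset name only, never at message data.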
On the Unix side this is usually solved with charset.alias, but that
does not help me as an application writer, who wants to know "what is
the actual _real_ name of this character set", so that after an
iconv_open("foo") i can query the actual name by
iconv_character_set_name().

And again, i do not know. How do i know it is UTF-8? Maybe it is,
maybe not. Can i know from seeing four bytes that look like UTF-8?
I may know from parsing an entire file and seeing only 7-bit bytes, and
8-bit ones that form valid UTF-8 sequences, increasingly so as the
number of such sequences rises. If i see such in the plain "C" locale,
with a user chosen *ttycharset* that indeed is ASCII, then i _could_
deduce that the input actually really is UTF-8. That was what i was
saying.

Then again, i am in doubt whether i should. C.UTF-8 a.k.a.
POSIX.UTF-8 is coming to the mainstream, musl C lib only has that,
OpenBSD effectively too i think, others will follow. If a user
explicitly says "LC_ALL=C mail ...", then how am i allowed to do
something like this? If the user says "LC_ALL=C mail" then he can also
append a "-Sttycharset=BLA".

Of course it is a problem, iconv(3)ing from ASCII to UTF-8 fails if
there is some 8-bit data in the input, even LATIN1 fails here(!)...
So i do not have any chance to do

  echo hä | LC_ALL=C s-nail ...

it will always fail, unless given an explicit -S ttycharset=utf8/latin1/xy,
whatever it really is. The problem with SuSE is that their port always
added the luxury of turning 8-bit on the input side into UTF-8, or
LATIN1 otherwise, if i understand the patch correctly. People may have
relied on that, scripts may break, systems may start to misbehave ...

To be honest, i do not know at the moment. It could be a good thing to
have UTF-8 detection, but when is that sufficient (as above)? On the
other hand, automatically falling back to LATIN1 if it is not UTF-8
cannot be it, for S-nail.
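The whole-body check described above could be sketched as follows, in
Python for brevity. This is only an illustration under stated
assumptions, not S-nail code: count_utf8_sequences(), looks_like_utf8()
and the min_sequences threshold are hypothetical, and Python's strict
decoder stands in for a hand-written validator.

```python
from typing import Optional


def count_utf8_sequences(data: bytes) -> Optional[int]:
    """Count multi-byte UTF-8 sequences in the complete input.

    Returns None if any 8-bit byte is not part of a valid sequence.
    """
    try:
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
    # Every non-ASCII code point came from exactly one multi-byte sequence.
    return sum(1 for ch in text if ord(ch) > 0x7F)


def looks_like_utf8(data: bytes, min_sequences: int = 2) -> bool:
    """Guess UTF-8 only after seeing enough valid multi-byte sequences.

    min_sequences is a hypothetical confidence threshold.
    """
    count = count_utf8_sequences(data)
    return count is not None and count >= min_sequences
```

Pure 7-bit input yields a count of 0 and so never triggers the guess;
Latin1 bytes such as b"h\xe4" are rejected outright, with no fallback to
another character set.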
Maybe automatic UTF-8 detection with a trigger variable that enables it,
so that Werner can add it to the global SuSE mail.rc, but off by
default. Falling back to nothing, ending with failure if it is not
UTF-8. Something like *mime-utf8-autodetect* or so.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)