On approximately 11/6/2008 3:59 AM, came the following characters from the keyboard of Stephen J. Turnbull:
Glenn Linderman writes:

> There is no reference to the word emacs or types in any of the messages
> you've posted in this thread, maybe you are referring to another thread
> somewhere? Sorry, I'm new to this party, but I have read the whole
> thread... unless my mail reader has missed part of it.

I'm sorry, you are right; the relevant message was never sent.  Here
it is; I've looked it over briefly and it seems intelligible, but from
your point of view it may seem out of context now.


Stuff happens. Apology accepted. The goal here isn't to score points or play one-up; the goal is to figure out whether making a more complex interface (having both bytes and Unicode interfaces) is beneficial. I'm certain that I don't see all the issues yet; but if the issues can be stated clearly, and the alternative solutions outlined, then I would get educated, which is good for me, but perhaps annoying for you. Progress gets made faster if we stay out of the flame-fanning.

I've read the other responses received to date, but choose to compose my response to this message, as it is the most meaty. The others discuss only particular (interesting) details.

Summary of issues is at the end. Skip directly to the summary before reading the interspersed comments, if you wish. Search for "summarize".


Comment on general data handling. It is good to follow the rules, of course, but not everyone does. When they don't, it is not clear that a program can cure the problem by itself.

1) If the data is already corrupted by using the wrong encoding, potentially it could be reversed if the proper encoding could be intuited.

1a) If it is returned as bytes, then once the proper encoding is intuited, the data can be decoded properly into Unicode.

1b) If it is returned as Latin-1 decoded Unicode, then once the proper encoding is intuited, the Unicode data can be re-encoded as bytes using Latin-1 (a fully reversible, lossless transformation), and then decoded properly into Unicode.

The hard part here is intuiting the proper encoding; 1b is less efficient than 1a, but no less possible. Intuiting the proper encoding is most likely done by human choice (iterating over: try this encoding, does it look better?)
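
To make 1b concrete, here's a minimal sketch (Python 3 semantics assumed; the sample bytes are invented for illustration):

    # Bytes that are really UTF-8, but were punted through Latin-1.
    raw = 'héllo'.encode('utf-8')        # b'h\xc3\xa9llo'
    punted = raw.decode('latin-1')       # 'hÃ©llo' -- legible-ish garbage

    # Once the proper encoding is intuited, reverse the punt losslessly:
    recovered = punted.encode('latin-1') # identical bytes, no data loss
    assert recovered == raw
    text = recovered.decode('utf-8')     # 'héllo' -- the intended text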

2) If the data is already corrupted by using multiple encodings when only one is claimed, then again it could be reversed if the proper encodings, as well as the boundaries between them, could be intuited.

The same parts a) and b) apply as in #1, but greatly complicated by the need to find the boundaries. Again it seems that human choice is required: select a range of text, and try displaying it in a different encoding to see if it makes more sense.

For both 1 & 2, I would expect the user interaction to be much more time consuming than the three-stage decode, re-encode, and re-decode process itself.

More below.


Glenn Linderman writes:

 > This is where you use the Latin-1 conversion.  Don't throw an error
 > when it doesn't conform, but don't go to heroic efforts to provide
 > bytes alternatives... just convert the bytes to Unicode, and the
 > way the mail RFCs are written, and the types of encodings used, it
 > is mostly readable.  And if it isn't encoded, it is even more
 > readable.

This is what XEmacs/Mule does.  It's a PITA for everybody (except the
Mule implementers, whose life is dramatically simplified by punting
this way).  For one thing, what's readable to a human being may be
death to a subprogram that expects valid MIME.  GNU Emacs is even
worse; it does provide both a bytes-like type and a unicode-like type,
but then it turns around and provides a way to "cast" unicodes to
bytes and vice-versa, thus exposing implementation in an unclean (and
often buggy) way.

 > And so how much is it a problem?  What are the effects of the problem?

In Emacs, the problem is that strings that are punted get concatenated
with strings that are properly decoded, and when reencoding is
attempted, you get garbage or a coding error.


Uh-huh. Garbage (wrongly decoded, then re-encoded), I would expect. Coding errors I would not expect, since codepoints that came from a Latin-1 decode are certainly re-encodable (creating legal-looking garbage out of originally illegal garbage). Can you give me an example of a coding error, or is this just FUD?


Since Mule discarded
the type (punt vs. decode) information, the app loses.


This is precisely the problem that was faced for "fake unicode file handling" that was the topic of a thread a few weeks ago. While the Latin-1 transform (or UTF-8b, or others mentioned there) can provide a round-trip decode/encode, it is only useful and usable if the knowledge that the transform was performed is retained. The choice there was to have a binary interface, and build a Unicode interface on top of it that can't see the byte sequences that do not conform to UTF-8. The problem there is that existing programs expect to be able to manipulate file names as text, but existing operating systems provide bytes interfaces.
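
As an aside, here is a sketch of such a round-trippable transform where the knowledge lives in the error handler rather than in a separate type (this assumes an interpreter that provides the 'surrogateescape' error handler; nothing here is the email module's behavior):

    # A UTF-8 stream with one byte that is not valid UTF-8.
    raw = b'abc\xffdef'

    # Non-decodable bytes become lone surrogates instead of errors;
    # the surrogates *are* the retained knowledge of the transform.
    text = raw.decode('utf-8', 'surrogateescape')    # 'abc\udcffdef'

    # Round trip: the original bytes come back exactly.
    assert text.encode('utf-8', 'surrogateescape') == raw

    # The knowledge is enforced, too: encoding without the handler
    # fails, which is how mixed/punted data would get detected.
    try:
        text.encode('utf-8')
    except UnicodeEncodeError:
        pass    # lone surrogate U+DCFF is not encodable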


There's no way to recover.


Not automatically. Point 2) above addresses this. It would require human intelligence to attempt to recover, and even the human would find it extremely painstaking to assist in the recovery process.


The apps most at risk are things like MUAs (which Emacs
does well) and web browsers (which it doesn't), and even AUCTeX (a
mode for handling LaTeX documents---TeX is not Unicode-aware so its
error messages are frequently truncated in the middle of a UTF-8
character) and they go to great lengths to keep track of what is valid
and what is not in the app.  They don't always succeed.  I think Emacs
should be doing this for them, somehow (and I'm an XEmacs implementer,
not an MUA implementer!)


So your belief that Emacs should be doing this for them somehow is nice; perhaps it should. However, it doesn't sound like you have a solution for Emacs... How should it keep track? How is that helpful? If TeX is not Unicode-aware, what is it doing dealing with UTF-8 data? Or is it dealing with Latin-1-transformed UTF-8 garbage?


The situation in Python will be strongly analogous, I believe.


And so are you proposing that a binary interface to the data, rather than a Unicode interface to the Latin-1-transformed data, will be more usable by a Python solution analogous to the Emacs solution, which hasn't been figured out yet?

Once the boundaries and encoding have been lost by the original buggy MUA that injected the data into the email message, only human intelligence has a chance of recreating the original message in all cases, and even then it may take more than one human to achieve it.

There may be cases where heuristics can be applied, when human intelligence figures out the type of bugs in the original MUA, and can recognize patterns that allow it to rediscover the boundaries. This is unlikely to work in all cases, but could perhaps work in some cases.

Even in the cases where it can work with some measurable success, I claim that the heuristics could be coded against the Latin-1-transformed Unicode just as effectively as against the bytes.
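
As a sketch of that claim (the helper name is mine, invented for illustration), a "does this look like UTF-8?" heuristic runs over either representation, because the Latin-1 transform is lossless:

    def looks_like_utf8(data):
        """Heuristic: True if data decodes cleanly as UTF-8.

        Accepts raw bytes or a Latin-1-transformed str; the two are
        interchangeable because Latin-1 round-trips every byte.
        """
        if isinstance(data, str):
            data = data.encode('latin-1')   # reverse the transform
        try:
            data.decode('utf-8')
            return True
        except UnicodeDecodeError:
            return False

    raw = 'résumé'.encode('utf-8')
    assert looks_like_utf8(raw)                    # bytes form
    assert looks_like_utf8(raw.decode('latin-1'))  # transformed form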


 > I'm not suggesting making it worse than what it already is, in
 > bytes form; just to translate the bytes to Unicode codepoints so
 > that they can be returned on a Unicode interface.

Which *does* make it worse, unless you enforce a type difference so
that punted strings can't be mixed with decoded strings without
effort.  That type difference may as well be bytes vs. Unicode as some
subclass of Unicode vs. Unicode.


138 is still 138 whether it is a byte or a Unicode codepoint. Yes, concatenating stuff that is transformed with stuff that is properly decoded would be stupid. Enforcing a type difference is purely an application thing, though.

Each piece of data retrieved would have a consistent decoding provided: either the proper decoding as specified in the message, or the Latin-1 or current code page decode if no encoding is specified. Either is reversible if the application doesn't like the results and wants to try a different encoding. The APIs could have optional parameters and results that specify the encoding to use, or the encoding that was used, to decode the results.

If the app wishes to keep that data separate, and convert it to a different type to help it stay separate, that is the app's privilege. If the app wishes to concatenate with other data, that is the app's choice. (Having the interface define a bunch of different types for different decodings wouldn't really help the ignorant app, which would simply convert the different types back to strings and then concatenate, or the smart app, which could do its own type encapsulations if it thinks that would help.)
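
If an app does want such a type encapsulation, it is only a few lines. A hypothetical sketch (the class, attribute, and method names are mine, not a proposed API):

    class PuntedText(str):
        """str subclass marking text decoded with a fallback charset
        rather than one declared by the message."""

        def __new__(cls, raw_bytes, charset='latin-1'):
            self = super().__new__(cls, raw_bytes.decode(charset))
            self.charset = charset        # remember how we got here
            return self

        def redecode(self, charset):
            """Reverse the fallback decode; try a different charset."""
            return self.encode(self.charset).decode(charset)

    s = PuntedText(b'h\xc3\xa9llo')          # punted via Latin-1
    assert s.redecode('utf-8') == 'héllo'    # human picked utf-8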


"Why would you mix strings?"  Well, for one example there are multiple
address headers which get collected into an addressee list for purpose
of constructing a reply.  If one of the headers is broken and another
is not, you get mixed mode.


Sure. Now you have mixed mode. Try to send the reply message... if the email address part is OK, then it gets sent, with a gibberish name. If the email address part is not OK, that destination bounces.

Now what? Seriously, what else could be done? You could try a bunch of different encodings to attempt to resolve the broken email address or name, but it requires human intelligence to decide which is correct; when the bounce message comes, the human will get involved. If the bounce message doesn't come, then all is well (the problem only affected the name part, not the email address part).


The same thing can happen for
multilingual message bodies: they get split into a multipart with
different charsets for different parts, and if one is broken but
another is not, you get mixed mode.


First, if the multilingual message bodies are known to be multilingual when they are encoded, and are in different multiparts, what are the chances that an application that knows to correctly keep the multilingual parts separate is dumb enough to encode one correctly and one incorrectly? Is this a real scenario? What software/version does this?

If it is a real scenario, it still requires human intelligence to resolve: to choose different encodings, and decide which one "looks right". Since it is in separate parts, the boundaries are not lost, so this is case 1 above.

If the boundaries are lost, the human can direct the program to go back to the original message, which still has its boundaries, and start over from there with different encodings, if the app wants to be smart enough to provide such features. You might write such an app just for fun; I might or might not, depending on whether someone pays me or I have some other incentive.

Given boundaries, it is case 1) above. If the boundaries are lost, it is case 2). How is it easier if the bytes are preserved, vs translated via Latin-1 to a Unicode string?



> So they'll use the Unicode API for text, and the bytes APIs for binary
> attachments, because that is what is natural.

Well, as I see it there won't be bytes APIs for text.  The APIs will
return Unicode text if they succeed, and raise an error if not.  If
the error is caught, the offending object will be available as bytes.


Sure; I'd proposed a way to get a whole message as bytes for archiving, logging, message store, etc. I'd proposed a way to get a particular MIME part as bytes for binary parts.

You seem to be proposing a way to get text MIME parts as binary if they fail to decode. I have no particular problem with the API providing that ability.

I have a specific question here: what encodings, when the attempt is made to decode to Unicode, will ever fail?

For 8-bit encodings, the answer is none. You may get gibberish, but not a failure, because every 8-bit encoding has every byte value used, and Unicode contains all those characters.
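
A quick sanity check of that, using Latin-1 specifically (which assigns a character to all 256 byte values):

    # Every byte value decodes without error under Latin-1; the
    # result may be gibberish, but it is never a failure.
    all_bytes = bytes(range(256))
    text = all_bytes.decode('latin-1')
    assert len(text) == 256
    assert text.encode('latin-1') == all_bytes   # and it round-trips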

So you've mentioned Asian encodings, and certainly these could fail to convert to Unicode if the decoder finds inappropriate sequences. I don't know enough about all the multi-byte encodings to know whether all of them can fail, or whether applying a particular decoding might produce gibberish but not fail. The ones I know about use a particular range of byte values to represent the first byte of a pair, but what I don't know is whether any byte can follow the first byte, or only certain bytes. I do know that for some multi-byte encodings the first byte can be followed by second bytes in the ASCII range; I don't know whether it is illegal for the first byte to be followed by another byte in the first-byte range. Certainly there could be 2-byte pairs that have no associated character, although I don't know that this exists for any particular encoding.

Can you cite a particular multi-byte encoding that has byte sequences that are illegal, and can be used to detect failure? Or can failure only be detected by the human determining that it is gibberish?
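
To make the question concrete, here are the kinds of detectable failures I mean; as far as I can tell, both of these specific cases do fail under Python's codecs:

    # UTF-8 rejects, e.g., a continuation byte with no lead byte.
    try:
        b'abc\x80def'.decode('utf-8')
    except UnicodeDecodeError as e:
        print('utf-8 failure detected:', e.reason)

    # Shift JIS: 0x81 is a lead byte, and not every trail byte is
    # legal after it; 0x20 is not.
    try:
        b'\x81\x20'.decode('shift_jis')
    except UnicodeDecodeError as e:
        print('shift_jis failure detected:', e.reason)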


> If improperly encoded messages are received, and appropriate
> transliterations are made so that the bytes get converted (default code
> page) or passed through (Latin-1 transformation), then the data may be
> somewhat garbled for characters in the non-ASCII subset. But that is
> not different than the handling done by any 8-bit email client, nor, I
> suspect (a little uncertainty here) different than the handling done by
> Python < 3.0 mail libraries.

Which is exactly how we got to this point.  Experience with GNU
Mailman and other such applications indicates that the implementation
in the existing Python email module needs work, and Barry Warsaw and
others who have tried to work on it say that it's not that easy, and
that the API may need to change to accommodate needed changes in the
implementation.


So let me try to summarize. I may have raised some inappropriate issues or reached some wrong conclusions; I'm willing to be corrected. But I'd much prefer to be corrected by specific cases that can be detected and corrected via a bytes interface but cannot be detected and corrected via a bytes-transliterated-to-Unicode interface, complete with the specific encodings that are used, properly or improperly, to arrive at the case, and the specific APIs that must be changed to achieve the goal.

A) An attempt to decode text to Unicode may fail.
A1) doesn't apply to 8-bit encodings.
A2) doesn't apply to some multi-byte encodings.
A3) applies to UTF-8.
A4) may apply to some other multi-byte encodings.

B) User sees gibberish because of decoding problems. What can be done? Can the app provide features to help? Do any of the features depend on API features? Let's assume that the app wants to help, and provides features. User must also get involved, because the app/API can't tell the difference between gibberish and valid text.

B1) User can see a map of the components of the email, and their encodings, and whether they were provided by the email message, or were the default for the app. User chooses a different decoding for a component, and the app reprocesses that component. API requirement: a way for the user/app to specify an override to the decoding for a component.

B2) User chooses binary for a particular component. App reprocesses the component, and asks what file to store the binary in. API requirement: a way for the user/app to specify an override to the decoding for a component.
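
A sketch of what that API requirement might look like for B1 and B2 (get_component and its parameter names are hypothetical, mine for illustration):

    def get_component(wire_bytes, charset_override=None,
                      declared_charset='latin-1'):
        """Return a component decoded per the override, per the
        declared charset, or as raw bytes if overridden to 'binary'."""
        if charset_override == 'binary':
            return wire_bytes                    # B2: raw bytes back
        charset = charset_override or declared_charset
        return wire_bytes.decode(charset)        # B1: user redecode

    part = b'h\xc3\xa9llo'
    assert get_component(part) == 'hÃ©llo'          # default punt
    assert get_component(part, 'utf-8') == 'héllo'  # user override (B1)
    assert get_component(part, 'binary') == part    # binary path (B2)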


I've now looked briefly at the email module APIs. They seem quite flexible to me. I don't know what happens under the covers, but it seems that the API is already set up flexibly enough to handle both bytes and Unicode!!! Perhaps it is just the implementation that should be adjusted. (I'm not saying that isn't too big a job for 3.0; I haven't read the code.)

It seems that get_/set_payload might want to be able to return/accept either string or bytes, depending on the other parameters involved.

Let's talk again about creation of messages first.

If a string is supplied, it is Unicode. The encoding parameter describes what encoding should be applied to convert the message to wire-protocol bytes. The data should be saved as Unicode until the request is made to convert it to wire protocol, so that set_charset can be called a few dozen times if desired (not clear why that would be done, though) to change the encoding. Perhaps it is appropriate to verify that the encoding can happen without using the substitution character, or perhaps that should be the user's responsibility. This choice should be documented.

If bytes are supplied, an encoding must also be supplied. The data should be saved in this encoding until the request is made to convert it to wire-protocol. This encoding should be used if possible, otherwise converted to an encoding that is acceptable to the wire protocol. Perhaps it is appropriate to verify that the translation, if necessary, can happen without using the substitution character, or perhaps that should be the user's responsibility. This choice should be documented.

It seems that charset None implies ASCII, for historical reasons; perhaps that can be overloaded to alternately mean binary, as the handling would be roughly the same, but perhaps a new 'binary' charset should be created to make it clear that charset changes don't make sense, and to reject attempts to convert binary data to character data.
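
A toy model of the creation-side semantics described above (Payload is a hypothetical stand-in of mine, not the email module's actual class):

    class Payload:
        """Store text (or bytes plus charset) as given; convert only
        when the wire-protocol form is requested."""

        def __init__(self, data, charset=None):
            if isinstance(data, bytes) and charset is None:
                charset = 'binary'       # bytes, no charset: binary
            self.data, self.charset = data, charset

        def set_charset(self, charset):
            if self.charset == 'binary':
                raise TypeError('binary data cannot become text')
            self.charset = charset       # cheap: nothing re-encoded yet

        def to_wire(self):
            if self.charset == 'binary':
                return self.data                       # pass through
            if isinstance(self.data, str):
                return self.data.encode(self.charset)  # deferred encode
            return self.data             # bytes already in self.charset

    p = Payload('Grüße')
    p.set_charset('utf-8')               # change as often as desired
    p.set_charset('latin-1')
    assert p.to_wire() == 'Grüße'.encode('latin-1')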

For an incoming message, the wire-protocol format should be used as the primary data store. Cached pointers and lengths to various MIME parts and subparts (individual headers, body, preamble, epilogue) would be appropriate. get_ operations would find the data, and interpret it according to the current (defaults to message content, overridden by set_ operations) charset and encoding. Requesting a Unicode charset would imply decoding the part to Unicode from the current charset and would return a string; requesting other character sets would imply converting from the message charset to the specified charset and returning bytes; requesting binary (or possibly 'None', see above) would return the wire-protocol bytes unchanged. Then the application could do what it wants to attempt to decode that data to text using other encodings (i.e. not starting the conversion from the encoding declared explicitly or implicitly in the message part).
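
And the incoming side, with the same caveat (a sketch of the described semantics, not the email module as it stands):

    def get_part(wire_bytes, message_charset, requested_charset='unicode'):
        """Interpret one cached MIME part per the semantics above."""
        if requested_charset in (None, 'binary'):
            return wire_bytes                          # raw wire bytes
        if requested_charset == 'unicode':
            return wire_bytes.decode(message_charset)  # -> str
        # any other charset: transcode, hand back bytes
        return wire_bytes.decode(message_charset).encode(requested_charset)

    part = 'naïve'.encode('utf-8')
    assert get_part(part, 'utf-8') == 'naïve'                 # str
    assert get_part(part, 'utf-8', 'latin-1') == b'na\xefve'  # bytes
    assert get_part(part, 'utf-8', 'binary') == part          # untouched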

The as_string() method becomes a misnomer in Python 3.0; since it is Python 3.0, that can be changed, no? It should become as_wire_protocol, and would default to returning bytes of binary data, which is what the wire-protocol APIs need. A variety that returns the bytes as Unicode codepoints could be implemented, for the purpose of "View source" type operations on the wire-protocol form... but that would and should only be a direct Latin-1 transliteration to Unicode.

Now that I've looked at the API, I don't see why it should be changed significantly for Python 3.0. I have no clue how much of the guts would have to be changed to achieve the equivalent of what I described above. I do believe that what I outlined above would use the present API to achieve both the "I want Unicode only" philosophy that you ascribe to me, and the "I want to do bit-flipping" (whatever that means) philosophy that you claimed for yourself.

Setting headers via the msg['Subject'] syntax to Unicode values is no problem; just make sure that they get properly encoded to ASCII at the end. msg['Subject'] and msg[b'Subject'] could be made equivalent, but I'd never use the latter; it has an annoying b character to distract from the meaning. The syntax should permit the use of Unicode, in other words, but:

* to encode non-ASCII data with full control over what parts get encoded
  and how, the Header API is still appropriate
* as an alternative, the API could be extended to include a default
  header encoding
* strings supplied via the msg['Subject'] = 'some string' interface are
  handled as follows: if 'some string' is in the ASCII subset, no
  problem. If not, and if the default header encoding has not been set,
  then an exception is raised. Otherwise, the default header encoding is
  used to encode the Unicode string as necessary (see the sketch below).
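
A sketch of the third bullet's rule (set_default_header_charset and encode_header are hypothetical names of mine, riding on the existing Header class):

    from email.header import Header

    _default_header_charset = None     # hypothetical module default

    def set_default_header_charset(charset):
        global _default_header_charset
        _default_header_charset = charset

    def encode_header(value):
        try:
            value.encode('ascii')
            return value               # ASCII subset: no problem
        except UnicodeEncodeError:
            if _default_header_charset is None:
                raise                  # no default set: exception
            return Header(value, _default_header_charset).encode()

    set_default_header_charset('utf-8')
    print(encode_header('Grüße'))      # e.g. '=?utf-8?b?R3LDvMOfZQ==?='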


So I see the API as quite robust, although its current implementation may not be as described above, and I can't scope the effort to achieve the above.

I'd like to see a "headers_as_wire_protocol" API added for generating bounce messages. It is easy enough to extract from as_wire_protocol, but common enough to be useful, methinks, and avoids allocating space for a huge message just to get its headers.
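
A sketch of that, splitting at the first blank line (headers_as_wire_protocol is my proposed name, not an existing API):

    def headers_as_wire_protocol(wire_bytes):
        """Return just the header block of a wire-format message,
        without materializing the (possibly huge) body."""
        # RFC 2822: the header block ends at the first empty line.
        end = wire_bytes.find(b'\r\n\r\n')
        if end == -1:
            return wire_bytes             # headers only, no body
        return wire_bytes[:end + 2]       # keep the final CRLF

    msg = b'Subject: hi\r\nTo: you@example.com\r\n\r\nbig body...'
    assert headers_as_wire_protocol(msg) == (
        b'Subject: hi\r\nTo: you@example.com\r\n')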


What specific problems are perceived, that the present API can't handle?

Are there areas in which it behaves differently than I outline above?

If so, is my outline an improvement, or confusing, and why?

Are there other issues?

Barry said:
> Yes, Python 2.x's email package handles broken messages, and email-ng
> must too.  "Handling it" means:
>
> 1) never throw an exception
> 2) record defects in a usable way for upstream consumers of the message
> to handle
>
> it currently also means
>
> 3) ignore idempotency for defective messages.

I'm not sure what "ignore idempotency" means in this context...


If the above outline is perceived as a useful set of semantics for the 3.0 email library, I might be able to find a little time (don't tell my wife) to help work on them, assuming that they are mostly implemented in Python and/or C. But I'd need a bit of hand-holding to get started, since I haven't yet figured out how to compile my own Python.


--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking