On approximately 11/6/2008 3:59 AM, came the following characters from the keyboard of Stephen J. Turnbull:
Glenn Linderman writes:

> There is no reference to the word emacs or types in any of the messages
> you've posted in this thread, maybe you are referring to another thread
> somewhere? Sorry, I'm new to this party, but I have read the whole
> thread... unless my mail reader has missed part of it.

I'm sorry, you are right; the relevant message was never sent.  Here
it is; I've looked it over briefly and it seems intelligible, but from
your point of view it may seem out of context now.


Stuff happens. Apology accepted. The goal here isn't to score points or play one-up; the goal is to figure out whether making a more complex interface (having both bytes and Unicode interfaces) is beneficial. I'm certain that I don't see all the issues yet; but if the issues can be stated clearly, and the alternative solutions outlined, then I would get educated, which is good for me, but perhaps annoying for you. Progress gets made faster if we stay out of the flame-fanning.

I've read the other responses received to date, but choose to compose my response to this message, as it is the most meaty. The others discuss only particular (interesting) details.

Summary of issues is at the end. Skip directly to the summary before reading the interspersed comments, if you wish. Search for "summarize".


Comment on general data handling. It is good to follow the rules, of course, but not everyone does. When they don't, it is not clear that a program can cure the problem by itself.

1) If the data is already corrupted by using the wrong encoding, potentially it could be reversed if the proper encoding could be intuited.

1a) If it is returned as bytes, then once the proper encoding is intuited, the data can be decoded properly into Unicode.

1b) If it is returned as Latin-1 decoded Unicode, then once the proper encoding is intuited, the Unicode data can be re-encoded as bytes using Latin-1 (a fully reversible, lossless transformation), and then decoded properly into Unicode.

The hard part here is intuiting the proper encoding; 1b is less efficient than 1a, but no less possible. Intuiting the proper encoding is most likely done by human choice (iterating over: try this encoding, does it look better?)
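
To make 1b concrete, here's a minimal sketch (Python 3 semantics assumed; the sample bytes are invented for illustration):

    # Bytes that are really UTF-8, but were punted through Latin-1.
    raw = 'héllo'.encode('utf-8')        # b'h\xc3\xa9llo'
    punted = raw.decode('latin-1')       # 'hÃ©llo' -- legible-ish garbage

    # Once the proper encoding is intuited, reverse the punt losslessly:
    recovered = punted.encode('latin-1') # identical bytes, no data loss
    assert recovered == raw
    text = recovered.decode('utf-8')     # 'héllo' -- the intended text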

2) If the data is already corrupted by using multiple encodings when only one is claimed, then again it could be reversed if the proper encodings, as well as the boundaries between them, could be intuited.

The same parts a) and b) apply as in #1, but greatly complicated by the need to find the boundaries. Again it seems that human choice is required: select a range of text, and try displaying it in a different encoding to see if it makes more sense.

For both 1 & 2, I would expect the user interaction to be much more time consuming than the three-stage decode, re-encode, and re-decode process itself.

More below.


Glenn Linderman writes:

 > This is where you use the Latin-1 conversion.  Don't throw an error
 > when it doesn't conform, but don't go to heroic efforts to provide
 > bytes alternatives... just convert the bytes to Unicode, and the
 > way the mail RFCs are written, and the types of encodings used, it
 > is mostly readable.  And if it isn't encoded, it is even more
 > readable.

This is what XEmacs/Mule does.  It's a PITA for everybody (except the
Mule implementers, whose life is dramatically simplified by punting
this way).  For one thing, what's readable to a human being may be
death to a subprogram that expects valid MIME.  GNU Emacs is even
worse; it does provide both a bytes-like type and a unicode-like type,
but then it turns around and provides a way to "cast" unicodes to
bytes and vice-versa, thus exposing implementation in an unclean (and
often buggy) way.

 > And so how much is it a problem?  What are the effects of the problem?

In Emacs, the problem is that strings that are punted get concatenated
with strings that are properly decoded, and when reencoding is
attempted, you get garbage or a coding error.


Uh-huh. Garbage (wrongly decoded, then re-encoded), I would expect. Coding errors I would not expect, since codepoints that came from a Latin-1 decode are certainly re-encodable (creating legal-looking garbage out of originally illegal garbage). Can you give me an example of a coding error, or is this just FUD?


Since Mule discarded
the type (punt vs. decode) information, the app loses.


This is precisely the problem that was faced for "fake unicode file handling" that was the topic of a thread a few weeks ago. While the Latin-1 transform (or UTF-8b, or others mentioned there) can provide a round-trip decode/encode, it is only useful and usable if the knowledge that the transform was performed is retained. The choice there was to have a binary interface, and build a Unicode interface on top of it that can't see the byte sequences that do not conform to UTF-8. The problem there is that existing programs expect to be able to manipulate file names as text, but existing operating systems provide bytes interfaces.
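
As an aside, here is a sketch of such a round-trippable transform where the knowledge lives in the error handler rather than in a separate type (this assumes an interpreter that provides the 'surrogateescape' error handler; nothing here is the email module's behavior):

    # A UTF-8 stream with one byte that is not valid UTF-8.
    raw = b'abc\xffdef'

    # Non-decodable bytes become lone surrogates instead of errors;
    # the surrogates *are* the retained knowledge of the transform.
    text = raw.decode('utf-8', 'surrogateescape')    # 'abc\udcffdef'

    # Round trip: the original bytes come back exactly.
    assert text.encode('utf-8', 'surrogateescape') == raw

    # The knowledge is enforced, too: encoding without the handler
    # fails, which is how mixed/punted data would get detected.
    try:
        text.encode('utf-8')
    except UnicodeEncodeError:
        pass    # lone surrogate U+DCFF is not encodable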


There's no way to recover.


Not automatically. Point 2) above addresses this. It would require human intelligence to attempt to recover, and even the human would find it extremely painstaking to assist in the recovery process.


The apps most at risk are things like MUAs (which Emacs
does well) and web browsers (which it doesn't), and even AUCTeX (a
mode for handling LaTeX documents---TeX is not Unicode-aware so its
error messages are frequently truncated in the middle of a UTF-8
character) and they go to great lengths to keep track of what is valid
and what is not in the app.  They don't always succeed.  I think Emacs
should be doing this for them, somehow (and I'm an XEmacs implementer,
not an MUA implementer!)


So your belief that Emacs should be doing this for them somehow is nice; perhaps it should. However, it doesn't sound like you have a solution for Emacs... How should it keep track? How is that helpful? If TeX is not Unicode-aware, what is it doing dealing with UTF-8 data? Or is it dealing with Latin-1-transformed UTF-8 garbage?


The situation in Python will be strongly analogous, I believe.


And so are you proposing that a binary interface to the data, rather than a Unicode interface to the Latin-1-transformed data, will be more usable by a Python solution analogous to the Emacs solution, which hasn't been figured out yet?

Once the boundaries and encoding have been lost by the original buggy MUA that injected the data into the email message, only human intelligence has a chance of recreating the original message in all cases, and even then it may take more than one human to achieve it.

There may be cases where heuristics can be applied, when human intelligence figures out the type of bugs in the original MUA, and can recognize patterns that allow it to rediscover the boundaries. This is unlikely to work in all cases, but could perhaps work in some cases.

Even in the cases where it can work with some measurable success, I claim that the heuristics could be coded against the Latin-1-transformed Unicode just as effectively as against the bytes.
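
As a sketch of that claim (the helper name is mine, invented for illustration), a "does this look like UTF-8?" heuristic runs over either representation, because the Latin-1 transform is lossless:

    def looks_like_utf8(data):
        """Heuristic: True if data decodes cleanly as UTF-8.

        Accepts raw bytes or a Latin-1-transformed str; the two are
        interchangeable because Latin-1 round-trips every byte.
        """
        if isinstance(data, str):
            data = data.encode('latin-1')   # reverse the transform
        try:
            data.decode('utf-8')
            return True
        except UnicodeDecodeError:
            return False

    raw = 'résumé'.encode('utf-8')
    assert looks_like_utf8(raw)                    # bytes form
    assert looks_like_utf8(raw.decode('latin-1'))  # transformed form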


 > I'm not suggesting making it worse than what it already is, in
 > bytes form; just to translate the bytes to Unicode codepoints so
 > that they can be returned on a Unicode interface.

Which *does* make it worse, unless you enforce a type difference so
that punted strings can't be mixed with decoded strings without
effort.  That type difference may as well be bytes vs. Unicode as some
subclass of Unicode vs. Unicode.


138 is still 138 whether it is a byte or a Unicode codepoint. Yes, concatenating stuff that is transformed with stuff that is properly decoded would be stupid. Enforcing a type difference is purely an application thing, though.

Each piece of data retrieved would have a consistent decoding provided: either the proper decoding as specified in the message, or the Latin-1 or current code page decode if no encoding is specified. Either is reversible if the application doesn't like the results and wants to try a different encoding. The APIs could have optional parameters and results that specify the encoding to use, or the encoding that was used, to decode the results.

If the app wishes to keep that data separate, and convert it to a different type to help it stay separate, that is the app's privilege. If the app wishes to concatenate with other data, that is the app's choice. (Having the interface define a bunch of different types for different decodings wouldn't really help the ignorant app, which would simply convert the different types back to strings and then concatenate, or the smart app, which could do its own type encapsulations if it thinks that would help.)
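
If an app does want such a type encapsulation, it is only a few lines. A hypothetical sketch (the class, attribute, and method names are mine, not a proposed API):

    class PuntedText(str):
        """str subclass marking text decoded with a fallback charset
        rather than one declared by the message."""

        def __new__(cls, raw_bytes, charset='latin-1'):
            self = super().__new__(cls, raw_bytes.decode(charset))
            self.charset = charset        # remember how we got here
            return self

        def redecode(self, charset):
            """Reverse the fallback decode; try a different charset."""
            return self.encode(self.charset).decode(charset)

    s = PuntedText(b'h\xc3\xa9llo')          # punted via Latin-1
    assert s.redecode('utf-8') == 'héllo'    # human picked utf-8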


"Why would you mix strings?"  Well, for one example there are multiple
address headers which get collected into an addressee list for purpose
of constructing a reply.  If one of the headers is broken and another
is not, you get mixed mode.


Sure. Now you have mixed mode. Try to send the reply message... if the email address part is OK, then it gets sent, with a gibberish name. If the email address part is not OK, that destination bounces.

Now what? Seriously, what else could be done? You could try a bunch of different encodings to attempt to resolve the broken email address or name, but it requires human intelligence to decide which is correct; when the bounce message comes, the human will get involved. If the bounce message doesn't come, then all is well (the problem only affected the name part, not the email address part).


The same thing can happen for
multilingual message bodies: they get split into a multipart with
different charsets for different parts, and if one is broken but
another is not, you get mixed mode.


First, if the multilingual message bodies are known to be multilingual when they are encoded, and are in different multiparts, what are the chances that an application that knows to correctly keep the multilingual parts separate is dumb enough to encode one correctly and one incorrectly? Is this a real scenario? What software/version does this?

If it is a real scenario, it still requires human intelligence to resolve: to choose different encodings, and decide which one "looks right". Since it is in separate parts, the boundaries are not lost, so this is case 1 above.

If the boundaries are lost, the human can direct the program to go back to the original message, which still has its boundaries, and start over from there with different encodings, if the app wants to be smart enough to provide such features. You might write such an app just for fun; I might or might not, depending on whether someone pays me or I have some other incentive.

Given boundaries, it is case 1) above. If the boundaries are lost, it is case 2). How is it easier if the bytes are preserved, vs translated via Latin-1 to a Unicode string?



> So they'll use the Unicode API for text, and the bytes APIs for binary
> attachments, because that is what is natural.

Well, as I see it there won't be bytes APIs for text.  The APIs will
return Unicode text if they succeed, and raise an error if not.  If
the error is caught, the offending object will be available as bytes.


Sure; I'd proposed a way to get a whole message as bytes for archiving, logging, message store, etc. I'd proposed a way to get a particular MIME part as bytes for binary parts.

You seem to be proposing a way to get text MIME parts as binary if they fail to decode. I have no particular problem with the API providing that ability.

I have a specific question here: what encodings, when the attempt is made to decode to Unicode, will ever fail?

For 8-bit encodings, the answer is none. You may get gibberish, but not a failure, because every 8-bit encoding has every byte value used, and Unicode contains all those characters.
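
A quick sanity check of that, using Latin-1 specifically (which assigns a character to all 256 byte values):

    # Every byte value decodes without error under Latin-1; the
    # result may be gibberish, but it is never a failure.
    all_bytes = bytes(range(256))
    text = all_bytes.decode('latin-1')
    assert len(text) == 256
    assert text.encode('latin-1') == all_bytes   # and it round-trips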

So you've mentioned Asian encodings, and certainly these could fail to convert to Unicode if the decoder finds inappropriate sequences. I don't know enough about all the multi-byte encodings to know whether all of them can fail, or whether applying a particular decoding might produce gibberish but not fail. The ones I know about use a particular range of byte values to represent the first byte of a pair, but what I don't know is whether any byte can follow the first byte, or only certain bytes. I do know that for some multi-byte encodings the first byte can be followed by second bytes in the ASCII range; I don't know whether it is illegal for the first byte to be followed by another byte in the first-byte range. Certainly there could be 2-byte pairs that have no associated character, although I don't know that this exists for any particular encoding.

Can you cite a particular multi-byte encoding that has byte sequences that are illegal, and can be used to detect failure? Or can failure only be detected by the human determining that it is gibberish?
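
To make the question concrete, here are the kinds of detectable failures I mean; as far as I can tell, both of these specific cases do fail under Python's codecs:

    # UTF-8 rejects, e.g., a continuation byte with no lead byte.
    try:
        b'abc\x80def'.decode('utf-8')
    except UnicodeDecodeError as e:
        print('utf-8 failure detected:', e.reason)

    # Shift JIS: 0x81 is a lead byte, and not every trail byte is
    # legal after it; 0x20 is not.
    try:
        b'\x81\x20'.decode('shift_jis')
    except UnicodeDecodeError as e:
        print('shift_jis failure detected:', e.reason)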


> If improperly encoded messages are received, and appropriate
> transliterations are made so that the bytes get converted (default code
> page) or passed through (Latin-1 transformation), then the data may be
> somewhat garbled for characters in the non-ASCII subset. But that is
> not different than the handling done by any 8-bit email client, nor, I
> suspect (a little uncertainty here) different than the handling done by
> Python < 3.0 mail libraries.

Which is exactly how we got to this point.  Experience with GNU
Mailman and other such applications indicates that the implementation
in the existing Python email module needs work, and Barry Warsaw and
others who have tried to work on it say that it's not that easy, and
that the API may need to change to accommodate needed changes in the
implementation.


So let me try to summarize. I may have raised some inappropriate issues or reached some wrong conclusions; I'm willing to be corrected. But I'd much prefer to be corrected by specific cases that can be detected and corrected via a bytes interface but cannot be detected and corrected via a bytes-transliterated-to-Unicode interface, complete with the specific encodings that are used, properly or improperly, to arrive at the case, and the specific APIs that must be changed to achieve the goal.

A) An attempt to decode text to Unicode may fail.
A1) doesn't apply to 8-bit encodings.
A2) doesn't apply to some multi-byte encodings.
A3) applies to UTF-8.
A4) may apply to some other multi-byte encodings.

B) User sees gibberish because of decoding problems. What can be done? Can the app provide features to help? Do any of the features depend on API features? Let's assume that the app wants to help, and provides features. User must also get involved, because the app/API can't tell the difference between gibberish and valid text.

B1) User can see a map of the components of the email, and their encodings, and whether they were provided by the email message, or were the default for the app. User chooses a different decoding for a component, and the app reprocesses that component. API requirement: a way for the user/app to specify an override to the decoding for a component.

B2) User chooses binary for a particular component. App reprocesses the component, and asks what file to store the binary in. API requirement: a way for the user/app to specify an override to the decoding for a component.
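
A sketch of what that API requirement might look like for B1 and B2 (get_component and its parameter names are hypothetical, mine for illustration):

    def get_component(wire_bytes, charset_override=None,
                      declared_charset='latin-1'):
        """Return a component decoded per the override, per the
        declared charset, or as raw bytes if overridden to 'binary'."""
        if charset_override == 'binary':
            return wire_bytes                    # B2: raw bytes back
        charset = charset_override or declared_charset
        return wire_bytes.decode(charset)        # B1: user redecode

    part = b'h\xc3\xa9llo'
    assert get_component(part) == 'hÃ©llo'          # default punt
    assert get_component(part, 'utf-8') == 'héllo'  # user override (B1)
    assert get_component(part, 'binary') == part    # binary path (B2)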


I've now looked briefly at the email module APIs. They seem quite flexible to me. I don't know what happens under the covers, but it seems that the API is already set up flexibly enough to handle both bytes and Unicode!!! Perhaps it is just the implementation that should be adjusted. (I'm not saying that isn't too big a job for 3.0; I haven't read the code.)

It seems that get_/set_payload might want to be able to return/accept either string or bytes, depending on the other parameters involved.

Let's talk again about creation of messages first.

If a string is supplied, it is Unicode. The encoding parameter describes what encoding should be applied to convert the message to wire-protocol bytes. The data should be saved as Unicode until the request is made to convert it to wire protocol, so that set_charset can be called a few dozen times if desired (not clear why that would be done, though) to change the encoding. Perhaps it is appropriate to verify that the encoding can happen without using the substitution character, or perhaps that should be the user's responsibility. This choice should be documented.

If bytes are supplied, an encoding must also be supplied. The data should be saved in this encoding until the request is made to convert it to wire-protocol. This encoding should be used if possible, otherwise converted to an encoding that is acceptable to the wire protocol. Perhaps it is appropriate to verify that the translation, if necessary, can happen without using the substitution character, or perhaps that should be the user's responsibility. This choice should be documented.

It seems that charset None implies ASCII, for historical reasons; perhaps that can be overloaded to alternately mean binary, as the handling would be roughly the same, but perhaps a new 'binary' charset should be created to make it clear that charset changes don't make sense, and to reject attempts to convert binary data to character data.
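
A toy model of the creation-side semantics described above (Payload is a hypothetical stand-in of mine, not the email module's actual class):

    class Payload:
        """Store text (or bytes plus charset) as given; convert only
        when the wire-protocol form is requested."""

        def __init__(self, data, charset=None):
            if isinstance(data, bytes) and charset is None:
                charset = 'binary'       # bytes, no charset: binary
            self.data, self.charset = data, charset

        def set_charset(self, charset):
            if self.charset == 'binary':
                raise TypeError('binary data cannot become text')
            self.charset = charset       # cheap: nothing re-encoded yet

        def to_wire(self):
            if self.charset == 'binary':
                return self.data                       # pass through
            if isinstance(self.data, str):
                return self.data.encode(self.charset)  # deferred encode
            return self.data             # bytes already in self.charset

    p = Payload('Grüße')
    p.set_charset('utf-8')               # change as often as desired
    p.set_charset('latin-1')
    assert p.to_wire() == 'Grüße'.encode('latin-1')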

For an incoming message, the wire-protocol format should be used as the primary data store. Cached pointers and lengths to various MIME parts and subparts (individual headers, body, preamble, epilogue) would be appropriate. get_ operations would find the data, and interpret it according to the current (defaults to message content, overridden by set_ operations) charset and encoding. Requesting a Unicode charset would imply decoding the part to Unicode from the current charset and would return a string; requesting other character sets would imply converting from the message charset to the specified charset and returning bytes; requesting binary (or possibly 'None', see above) would return the wire-protocol bytes unchanged. Then the application could do what it wants to attempt to decode that data to text using other encodings (i.e. not starting the conversion from the encoding declared explicitly or implicitly in the message part).
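
And the incoming side, with the same caveat (a sketch of the described semantics, not the email module as it stands):

    def get_part(wire_bytes, message_charset, requested_charset='unicode'):
        """Interpret one cached MIME part per the semantics above."""
        if requested_charset in (None, 'binary'):
            return wire_bytes                          # raw wire bytes
        if requested_charset == 'unicode':
            return wire_bytes.decode(message_charset)  # -> str
        # any other charset: transcode, hand back bytes
        return wire_bytes.decode(message_charset).encode(requested_charset)

    part = 'naïve'.encode('utf-8')
    assert get_part(part, 'utf-8') == 'naïve'                 # str
    assert get_part(part, 'utf-8', 'latin-1') == b'na\xefve'  # bytes
    assert get_part(part, 'utf-8', 'binary') == part          # untouched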

The as_string() method becomes a misnomer in Python 3.0; since it is Python 3.0, that can be changed, no? It should become as_wire_protocol, and would default to returning bytes of binary data, which is what the wire-protocol APIs need. A variety that returns the bytes as Unicode codepoints could be implemented, for the purpose of "View source" type operations on the wire-protocol form... but that would and should only be a direct Latin-1 transliteration to Unicode.

Now that I've looked at the API, I don't see why it should be changed significantly for Python 3.0. I have no clue how much of the guts would have to be changed to achieve the equivalent of what I described above. I do believe that what I outlined above would use the present API to achieve both the "I want Unicode only" philosophy that you ascribe to me, and the "I want to do bit-flipping" (whatever that means) philosophy that you claimed for yourself.

Setting headers via the msg['Subject'] syntax to Unicode values is no problem; just make sure that they get properly encoded to ASCII at the end. msg['Subject'] and msg[b'Subject'] could be made equivalent, but I'd never use the latter; it has an annoying b character to distract from the meaning. The syntax should permit the use of Unicode, in other words, but:

* to encode non-ASCII data with full control over what parts get encoded
  and how, the Header API is still appropriate
* as an alternative, the API could be extended to include a default
  header encoding
* strings supplied via the msg['Subject'] = 'some string' interface are
  handled as follows: if 'some string' is in the ASCII subset, no
  problem. If not, and if the default header encoding has not been set,
  then an exception is raised. Otherwise, the default header encoding is
  used to encode the Unicode string as necessary (see the sketch below).
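
A sketch of the third bullet's rule (set_default_header_charset and encode_header are hypothetical names of mine, riding on the existing Header class):

    from email.header import Header

    _default_header_charset = None     # hypothetical module default

    def set_default_header_charset(charset):
        global _default_header_charset
        _default_header_charset = charset

    def encode_header(value):
        try:
            value.encode('ascii')
            return value               # ASCII subset: no problem
        except UnicodeEncodeError:
            if _default_header_charset is None:
                raise                  # no default set: exception
            return Header(value, _default_header_charset).encode()

    set_default_header_charset('utf-8')
    print(encode_header('Grüße'))      # e.g. '=?utf-8?b?R3LDvMOfZQ==?='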


So I see the API as quite robust, although its current implementation may not be as described above, and I can't scope the effort to achieve the above.

I'd like to see a "headers_as_wire_protocol" API added for generating bounce messages. It is easy enough to extract from as_wire_protocol, but common enough to be useful, methinks, and avoids allocating space for a huge message just to get its headers.
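
A sketch of that, splitting at the first blank line (headers_as_wire_protocol is my proposed name, not an existing API):

    def headers_as_wire_protocol(wire_bytes):
        """Return just the header block of a wire-format message,
        without materializing the (possibly huge) body."""
        # RFC 2822: the header block ends at the first empty line.
        end = wire_bytes.find(b'\r\n\r\n')
        if end == -1:
            return wire_bytes             # headers only, no body
        return wire_bytes[:end + 2]       # keep the final CRLF

    msg = b'Subject: hi\r\nTo: you@example.com\r\n\r\nbig body...'
    assert headers_as_wire_protocol(msg) == (
        b'Subject: hi\r\nTo: you@example.com\r\n')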


What specific problems are perceived, that the present API can't handle?

Are there areas in which it behaves differently than I outline above?

If so, is my outline an improvement, or confusing, and why?

Are there other issues?

Barry said:
> Yes, Python 2.x's email package handles broken messages, and email-ng
> must too.  "Handling it" means:
>
> 1) never throw an exception
> 2) record defects in a usable way for upstream consumers of the message
> to handle
>
> it currently also means
>
> 3) ignore idempotency for defective messages.

I'm not sure what "ignore idempotency" means in this context...


If the above outline is perceived as a useful set of semantics for the 3.0 email library, I might be able to find a little time (don't tell my wife) to help work on them, assuming that they are mostly implemented in Python and/or C. But I'd need a bit of hand-holding to get started, since I haven't yet figured out how to compile my own Python.


--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking