On 3/1/2011 12:40 PM, R. David Murray wrote:
This is a long email, for which my apologies.  I hope you all will
manage to find some time to read it and provide feedback, as it speaks
to fundamental design issues.

Indeed.  Good to discuss before designing with ready-mix.

Everything else is an implementation detail :)

Agreed.

We propose to create a new API to make all of this easier for
the application programmer.

YES!!

[*] There are current real-world use cases for this:  there are nntp
     servers that use utf-8 for headers, and the http protocol uses
     latin-1 (or sometimes, I think, utf-8)

All the tunables listed are relevant. The HTTP protocol standard claims to use Latin-1 + RFC 2047 encoding for non-Latin-1 characters; in practice, the browser implementations apparently use nearly _any_ encoding for headers!!! For <form> responses, when there is actually user-specified data involved, they use the encoding defined for the page containing the form, as the encoding of the MIME headers sent back. The "standard headers" seem to be ASCII, and somewhat immune to choice of encoding, except perhaps for those few encodings that are not ASCII supersets. (I have no clue how such are handled, if they are. Anyone want to write an EBCDIC page containing a <form> for testing?)

This is useful, as it reduces the amount of character escaping likely to be required: the designer of the page chooses a character set that can represent the page, which is likely in the language of the intended recipient, who is in turn likely to fill out the form using the same language.

It would be more useful if the browsers included a(n ASCII) header that specified the encoding of subsequent headers: they do not. Therefore, the server that receives the headers must somehow "know" the proper encoding. For the situation where the CGI (or equivalent) script both generates the page containing the <form> and receives the form data, this is simple. For the situation where the same web application designer creates the page containing the <form> and the CGI receiving the form data, and explicitly or implicitly declares the same encoding for both, this is functional, but there is the danger of someone changing the static pages to conform to a new standard encoding without realizing the consequences for the associated CGI scripts. It is also rather hard to create "form filling" applications that send form data to a server without first fetching the form itself... such applications must also "know" the proper encoding, and such applications are much more likely to be written outside the original development environment, and much less likely to be included in any planning to change encodings inside the application's <form>s and CGIs.

To support reading byte-stream HTTP headers, therefore, it is critical that the email API accept an encoding from the application which "knows" the encoding; presently cgi.py has to pre-decode incoming headers because email does not have such a parameter. On the other hand, maybe cgi.py shouldn't use email header parsing at all... since browsers don't use RFC 2047 encoding in practice, the parsing of headers without such is straightforward.
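
To make that last point concrete, here is a minimal sketch of header parsing where the application supplies the encoding and no RFC 2047 decoding is attempted. The function name parse_headers and its encoding parameter are hypothetical, invented for illustration; this is not how cgi.py or email work today.

```python
def parse_headers(raw, encoding="latin-1"):
    """Split a raw header block into (name, value) pairs.

    `encoding` comes from the application, which "knows" it
    (hypothetical parameter). No RFC 2047 decoding is done.
    """
    headers = []
    for line in raw.split(b"\r\n"):
        if not line:
            break  # a blank line ends the header block
        if line[:1] in (b" ", b"\t") and headers:
            # Continuation line: fold it onto the previous header value.
            name, value = headers[-1]
            headers[-1] = (name, value + " " + line.strip().decode(encoding))
            continue
        name, sep, value = line.partition(b":")
        if not sep:
            continue  # no colon: skip the malformed line
        # Header names are assumed to be ASCII; values use the given encoding.
        headers.append((name.decode("ascii").strip(),
                        value.strip().decode(encoding)))
    return headers


# The same bytes give different text depending on what the caller "knows".
raw = b'Content-Disposition: form-data; name="caf\xc3\xa9"\r\n\r\n'
as_utf8 = parse_headers(raw, "utf-8")
as_latin1 = parse_headers(raw, "latin-1")
```

Only the application can say which of the two results is the right one, which is why the encoding has to be a parameter.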

Further, HTTP data streams can be extremely large, and thus time-consuming to obtain over the wire. CGI applications cannot afford to keep large blocks of data in RAM during receipt, thus if email wishes to support CGI, it needs features for placing large blocks of data on disk instead of in RAM during the parsing phase; cgi.py presently has to preparse headers, to separate them from the data streams, which it then handles on its own, because of this issue.
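
As a rough illustration of the kind of feature I mean, the standard library's tempfile.SpooledTemporaryFile keeps data in memory up to a threshold and moves it to disk beyond that. The helper name spool_body and the default sizes below are my own assumptions, not anything email or cgi.py provides.

```python
import io
import tempfile


def spool_body(stream, max_in_memory=64 * 1024, chunk_size=8192):
    """Copy `stream` into a spooled temporary file, chunk by chunk.

    Data stays in RAM until it exceeds `max_in_memory` bytes, after
    which SpooledTemporaryFile rolls it over to a file on disk.
    """
    spool = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of the incoming data
        spool.write(chunk)
    spool.seek(0)  # rewind so the caller can read from the start
    return spool


# Usage: a small body stays in memory, a large one ends up on disk.
small = spool_body(io.BytesIO(b"field value"))
large = spool_body(io.BytesIO(b"x" * 200000), max_in_memory=1024)
```

The parser never holds more than one chunk plus the in-memory threshold at a time, whatever the size of the upload.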

Hence, cgi.py does so much preparsing and private handling of HTTP data streams that the only real benefit it seems to gain from using email at all is the handling of the complex RFC 2047 decoding... which in practice isn't used in HTTP data streams!

In any case, if email wants to promulgate itself as the "one true way" to process HTTP data streams, as well as SMTP and NNTP data streams, then it needs to address the issues above.

There is, by the way, room for improvement in the cgi.py handler for HTTP data streams; presently all large MIME objects are written to disk (but small ones are kept as string or byte streams), but it isn't necessarily the right disk, and the data must then be again copied, byte by byte, to its final file system location. I see that as abhorrent overhead. There is presently no provision for hooks that ask the CGI application what to do with the data being received, while it is being received, nor for policies to assist with better heuristics, with the goal in mind that a properly and completely received MIME object could then be renamed to its final location rather than copied.
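
A sketch of the rename-instead-of-copy idea, assuming a hook that asks the application where the data should finally live. The names receive_to and choose_destination are hypothetical; the point is only that writing the temporary file in the destination directory lets os.replace finish the job with a rename.

```python
import os
import tempfile


def receive_to(stream, choose_destination, part_headers=None, chunk_size=8192):
    """Write `stream` to a temp file beside its final location, then rename.

    `choose_destination` is a hypothetical application hook: it gets the
    headers of the part and returns the final path for the data.
    """
    final_path = os.path.abspath(choose_destination(part_headers or {}))
    directory = os.path.dirname(final_path)
    # Same directory means same file system, so the rename below is cheap.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".upload-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            while True:
                chunk = stream.read(chunk_size)
                if not chunk:
                    break
                tmp.write(chunk)
        # Only a completely received object is moved into place.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Incomplete or failed transfer: remove the partial file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    return final_path
```

An application would call it with something like receive_to(stream, lambda headers: "/some/upload/dir/report.pdf"), where the path is of course made up. A partially received object never appears under its final name.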

I guess I'm proposing, then, that there be an API version definition,
with two values as of Python3.3: email5 API, and email6 API.  We'll
figure out how we name and interrogate these formally later.

Question: It is pretty clear that enhanced behaviors are required to benefit new applications that use email, and that some new APIs may be incompatible with some existing APIs. Even so, might it be possible to design the new API first, and then build a compatibility layer on top that looks like the old API? There could then be policies for the new APIs that behave like the old APIs, to ease the implementation of such a layer. I'm not sure I fully understand the use of _factory or factory parameters, but for APIs that have _factory and grow a factory, could not the choice of which parameter is passed select the old or the new behavior?

(OK, this question comes after not looking at the email API during all the GSOC and your implementation efforts since the last big round of discussion, but your proposals here sound like this would be more possible with your current thinking than with your previous thinking.)

The Header registry in this vision is accessed through the Message class.
I have various thoughts about how this will work, but I'm going to leave
those for later, since this email is already long enough.  I also have
some additional thoughts about backward compatibility, but it is going
to require some experimentation to see if they are realistic.

Consider me an interested observer; I'll enjoy reading, thinking, and commenting about these ideas too, but sadly am unlikely to implement an email client this year :( But I have aspirations to do so, because none of the existing email clients exactly suit my preferences... (everyone should write an editor and an email client, no? I've done the former several times... what I want, though, is emacs-python, instead of emacs-lisp).

Glenn