Re: [Python-Dev] Dropping bytes "support" in json

James Y Knight Fri, 10 Apr 2009 08:08:40 -0700

On Apr 9, 2009, at 10:38 PM, Barry Warsaw wrote:

So, what I'm really asking is this. Let's say you agree that there are use cases for accessing a header value as either the raw encoded bytes or the decoded unicode.

As I said in the thread having nearly the same exact discussion on web-sig, except about WSGI headers...

What should this return:

>>> message['Subject']

The raw bytes or the decoded unicode?

Until you write a parser for every header, you simply cannot decode to unicode. The only sane choices are:

1) raw bytes
2) parsed structured data

There's no "decoded to unicode but not parsed" option: that's doing things in the wrong order. If you RFC2047-decode the header before doing tokenization and parsing, you will just have a *broken* implementation.

Here's an example where it matters. If you decode the RFC2047 part before parsing, you'd decide that there's two recipients to the message. There aren't. "<[email protected]>, " is the display-name of "[email protected]", not a second recipient.


  To: =?UTF-8?B?PGJyb2tlbkBleGFtcGxlLmNvbT4sIA==?= <[email protected]>
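To make the difference concrete, here is a minimal sketch using today's standard-library email.header and email.utils modules. It is only an illustration of the ordering problem, not a proposal for how the email package should work. The address real@example.com is a hypothetical stand-in (the actual address is obscured in the archived copy of this message), and decode_words is a helper name made up for the example.

```python
from email.header import decode_header
from email.utils import getaddresses

# Assumption: real@example.com is a hypothetical stand-in for the actual
# address, which is obscured in the archived copy of this message.
RAW_TO = "=?UTF-8?B?PGJyb2tlbkBleGFtcGxlLmNvbT4sIA==?= <real@example.com>"


def decode_words(value):
    """RFC 2047-decode any encoded-words in value and return a str."""
    return "".join(
        part.decode(charset or "ascii") if isinstance(part, bytes) else part
        for part, charset in decode_header(value)
    )


# Wrong order: decode first, then parse. The decoded "<...>, " text now
# looks exactly like real address syntax, so a phantom recipient appears.
decoded_first = getaddresses([decode_words(RAW_TO)])

# Right order: parse into (display-name, address) pairs first, then decode
# only the display-name of each pair.
parsed_first = [
    (decode_words(name), addr) for name, addr in getaddresses([RAW_TO])
]

print(len(decoded_first), [addr for _, addr in decoded_first])
print(len(parsed_first), parsed_first)
```

Decoding first yields two addresses; parsing first yields a single recipient whose display-name happens to contain angle brackets and a comma.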

Here's a quote from RFC2047:

NOTE: Decoding and display of encoded-words occurs *after* a structured field body is parsed into tokens. It is therefore possible to hide 'special' characters in encoded-words which, when displayed, will be indistinguishable from 'special' characters in the surrounding text. For this and other reasons, it is NOT generally possible to translate a message header containing 'encoded-word's to an unencoded form which can be parsed by an RFC 822 mail reader.

And another quote for good measure:

(2) Any header field not defined as '*text' should be parsed according to the syntax rules for that header field. However, any 'word' that appears within a 'phrase' should be treated as an 'encoded-word' if it meets the syntax rules in section 2. Otherwise it should be treated as an ordinary 'word'.
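Rule (2) can be read as a per-word test that runs after parsing. A rough sketch, assuming the phrase has already been split into words by a header-specific parser (not shown here). ENCODED_WORD is a simplified approximation of the section 2 syntax, not the full grammar (it ignores the 75-character limit, for one), and decode_phrase_word is a made-up helper name.

```python
import re
from email.header import decode_header

# Simplified approximation of the RFC 2047 section 2 encoded-word syntax:
# "=?" charset "?" encoding "?" encoded-text "?=", with no spaces inside.
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[BbQq]\?[^?\s]+\?=")


def decode_phrase_word(word):
    """Decode one already-tokenized 'word' from a 'phrase'."""
    if ENCODED_WORD.fullmatch(word):
        data, charset = decode_header(word)[0]
        return data.decode(charset)
    # Anything else is an ordinary 'word' and is left alone.
    return word


# Words as a parser might hand them over, one token at a time.
words = ["=?ISO-8859-1?Q?Andr=E9?=", "Pirard"]
decoded = [decode_phrase_word(w) for w in words]
print(decoded)
```

The point is that the decision is made token by token, on the output of the parser, never on the unparsed header text.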



Now, I suppose there's also a third possibility:

3) US-ASCII-only strings, unmolested except for doing a .decode('ascii'). That'll give you a string all right, but it's really just cheating. It's not actually a text string in any meaningful sense.
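A small sketch of what option 3 amounts to, under the same assumption as before that real@example.com is a hypothetical stand-in address; try_ascii is a made-up helper name.

```python
# Assumption: real@example.com is a hypothetical stand-in address.
RAW_TO = b"=?UTF-8?B?PGJyb2tlbkBleGFtcGxlLmNvbT4sIA==?= <real@example.com>"

# Option 3: the bytes become a str, but nothing has been interpreted.
ascii_str = RAW_TO.decode("ascii")

# The encoded-word is still sitting there, undecoded.
still_encoded = ascii_str.startswith("=?UTF-8?B?")


def try_ascii(raw):
    """Return raw as a str, or None if it contains non-ASCII bytes."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError:
        return None


# A header carrying raw 8-bit data cannot be represented this way at all.
eight_bit = try_ascii(b"caf\xc3\xa9")

print(still_encoded, eight_bit)
```

The result round-trips back to the original bytes unchanged, which is another way of saying it is the raw bytes wearing a str type.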

(in all this I'm assuming your question is not about the "Subject" header in particular; that is of course just unstructured text so the parse step doesn't actually do anything...).


James

