Re: [Python-Dev] bytes / unicode

2010-06-28 Thread Greg Ewing
R. David Murray wrote: Having such a poly_str type would probably make my life easier. A thought on this poly_str type: perhaps it could be called ascii, since that's what it would have to be restricted to, and have a'xxx' as a literal syntax for it, seeing as literals seem to be one of

Re: [Python-Dev] bytes / unicode

2010-06-28 Thread Senthil Kumaran
On Mon, Jun 28, 2010 at 08:28:45PM +1200, Greg Ewing wrote: A thought on this poly_str type: perhaps it could be called ascii, since that's what it would have to be restricted to, and have a'xxx' as a literal syntax for it, seeing as literals seem to be one of its main use cases. This

Re: [Python-Dev] bytes / unicode

2010-06-28 Thread R. David Murray
On Mon, 28 Jun 2010 13:55:26 +0530, Senthil Kumaran orsent...@gmail.com wrote: On Mon, Jun 28, 2010 at 08:28:45PM +1200, Greg Ewing wrote: Thinking way outside the square, and probably the pale as well, maybe @ could be pressed into service as an infix operator, with s@i

Re: [Python-Dev] bytes / unicode

2010-06-28 Thread Nick Coghlan
On Mon, Jun 28, 2010 at 6:28 PM, Greg Ewing greg.ew...@canterbury.ac.nz wrote: R. David Murray wrote: Having such a poly_str type would probably make my life easier. A thought on this poly_str type: perhaps it could be called ascii, since that's what it would have to be restricted to, and

Re: [Python-Dev] bytes / unicode

2010-06-27 Thread Antoine Pitrou
On Sat, 26 Jun 2010 23:49:11 -0400 P.J. Eby p...@telecommunity.com wrote: Remember, bytes and strings already have to detect mixed-type operations. Not in Python 3. They just raise a TypeError on bad (mixed-type) arguments. Regards Antoine.
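The behaviour described in this reply can be checked with a short sketch. The helper name `try_concat` is made up for illustration and does not come from the thread.

```python
def try_concat(left, right):
    """Concatenate two values; report the exception name if Python refuses."""
    try:
        return left + right
    except TypeError as exc:
        return type(exc).__name__


# Same-type operands concatenate normally.
same_text = try_concat("a", "b")
same_bytes = try_concat(b"a", b"b")

# Mixed operands are not upcast in Python 3; both orders raise TypeError.
mixed_text_first = try_concat("a", b"b")
mixed_bytes_first = try_concat(b"a", "b")
```

There is no implicit coercion step here: the built-in types reject the mixed operation outright instead of guessing an encoding.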

Re: [Python-Dev] bytes / unicode

2010-06-27 Thread Stephen J. Turnbull
P.J. Eby writes: At 12:42 PM 6/26/2010 +0900, Stephen J. Turnbull wrote: What I'm saying here is that if bytes are the signal of validity, and the stdlib functions preserve validity, then it's better to have the stdlib functions object to unicode data as an argument. Compare the

Re: [Python-Dev] bytes / unicode

2010-06-27 Thread P.J. Eby
At 03:53 PM 6/27/2010 +1000, Nick Coghlan wrote: We could talk about this even longer, but the most effective way forward is going to be a patch that improves the URL parsing situation. Certainly, it's the only practical solution for the immediate problems in 3.2. I only mentioned that I hate

Re: [Python-Dev] bytes / unicode

2010-06-27 Thread R. David Murray
I've been watching this discussion with intense interest, but have been so lagged in following the thread that I haven't replied. I got caught up today On Sun, 27 Jun 2010 15:53:59 +1000, Nick Coghlan ncogh...@gmail.com wrote: The difference is that we have three classes of algorithm here:

Re: [Python-Dev] bytes / unicode

2010-06-26 Thread P.J. Eby
At 12:42 PM 6/26/2010 +0900, Stephen J. Turnbull wrote: What I'm saying here is that if bytes are the signal of validity, and the stdlib functions preserve validity, then it's better to have the stdlib functions object to unicode data as an argument. Compare the alternative: it returns a

Re: [Python-Dev] bytes / unicode

2010-06-26 Thread Nick Coghlan
On Sun, Jun 27, 2010 at 4:17 AM, P.J. Eby p...@telecommunity.com wrote: The idea that I'm proposing is that the basic string and byte types should defer to user-defined string types for mixed type operations, so that polymorphism of string-manipulation functions is the *default* case, rather

Re: [Python-Dev] bytes / unicode

2010-06-26 Thread P.J. Eby
At 12:43 PM 6/27/2010 +1000, Nick Coghlan wrote: While full support for third party strings and byte sequence implementations is an interesting idea, I think it's overkill for the specific problem of making it easier to write str/bytes agnostic functions for tasks like URL parsing. OTOH, to

Re: [Python-Dev] bytes / unicode

2010-06-26 Thread Nick Coghlan
On Sun, Jun 27, 2010 at 1:49 PM, P.J. Eby p...@telecommunity.com wrote: I just hate the idea that functions taking strings should have to be *rewritten* to be explicitly type-agnostic.  It seems *so* un-Pythonic...  like if all the bitmasking functions you'd ever written using 32-bit int

Re: [Python-Dev] bytes / unicode

2010-06-25 Thread Stephen J. Turnbull
Guido van Rossum writes: On Thu, Jun 24, 2010 at 1:12 AM, Stephen J. Turnbull step...@xemacs.org wrote: Understood, but both the majority of str/bytes methods and several existing APIs (e.g. many in the os module, like os.listdir()) do it this way. Understood. Also, IMO a

Re: [Python-Dev] bytes / unicode

2010-06-25 Thread Stephen J. Turnbull
P.J. Eby writes: This doesn't have to be in the functions; it can be in the *types*. Mixed-type string operations have to do type checking and upcasting already, but if the protocol were open, you could make an encoded-bytes type that would handle the error checking. Don't you

Re: [Python-Dev] bytes / unicode

2010-06-25 Thread P.J. Eby
At 04:49 PM 6/25/2010 +0900, Stephen J. Turnbull wrote: P.J. Eby writes: This doesn't have to be in the functions; it can be in the *types*. Mixed-type string operations have to do type checking and upcasting already, but if the protocol were open, you could make an encoded-bytes type

Re: [Python-Dev] bytes / unicode

2010-06-25 Thread Ian Bicking
On Fri, Jun 25, 2010 at 2:05 AM, Stephen J. Turnbull step...@xemacs.org wrote: But join('x', 'y') -> 'x/y' and join(b'x', b'y') -> b'x/y' make sense to me. So, actually, I *don't* understand what you mean by needing LBYL. Consider docutils. Some folks assert that URIs *are* bytes and

Re: [Python-Dev] bytes / unicode

2010-06-25 Thread Stephen J. Turnbull
Ian Bicking writes: I don't get what you are arguing against. Are you worried that if we make URL code polymorphic that this will mean some code will treat URLs as bytes, and that code will be incompatible with URLs as text? No one is arguing we remove text support from any of these

Re: [Python-Dev] bytes / unicode

2010-06-25 Thread P.J. Eby
At 01:18 AM 6/26/2010 +0900, Stephen J. Turnbull wrote: It seems to me what is wanted here is something like Perl's taint mechanism, for *both* kinds of strings. Am I missing something? You could certainly view it as a kind of tainting. The part where the type would be bytes-based is indeed

Re: [Python-Dev] bytes / unicode

2010-06-25 Thread Stephen J. Turnbull
P.J. Eby writes: it's just that if you already have the bytes, and all you want to do is tag them (e.g. the WSGI headers case), the extra encoding step seems pointless. Well, I'll have to concede that unless and until I get involved in the WSGI development effort. <wink> But with your

Re: [Python-Dev] bytes / unicode

2010-06-24 Thread Stephen J. Turnbull
Guido van Rossum writes: For example: how we can make the suite of functions used for URL processing more polymorphic, so that each developer can choose for herself how URLs need to be treated in her application. While you have come down on the side of polymorphism (as opposed to separate

Re: [Python-Dev] bytes / unicode

2010-06-24 Thread Lennart Regebro
On Tue, Jun 22, 2010 at 20:07, James Y Knight f...@fuhm.net wrote: Yeah. This is a real issue I have with the direction Python3 went: it pushes you into decoding everything to unicode early, even when you don't care -- Well, yes, maybe even if *you* don't care. But often the functions you need

Re: [Python-Dev] bytes / unicode

2010-06-24 Thread M.-A. Lemburg
Lennart Regebro wrote: On Tue, Jun 22, 2010 at 20:07, James Y Knight f...@fuhm.net wrote: Yeah. This is a real issue I have with the direction Python3 went: it pushes you into decoding everything to unicode early, even when you don't care -- Well, yes, maybe even if *you* don't care. But

Re: [Python-Dev] bytes / unicode

2010-06-24 Thread Michael Foord
On 24/06/2010 11:58, M.-A. Lemburg wrote: Lennart Regebro wrote: On Tue, Jun 22, 2010 at 20:07, James Y Knight f...@fuhm.net wrote: Yeah. This is a real issue I have with the direction Python3 went: it pushes you into decoding everything to unicode early, even when you don't care --

Re: [Python-Dev] bytes / unicode

2010-06-24 Thread Guido van Rossum
On Thu, Jun 24, 2010 at 1:12 AM, Stephen J. Turnbull step...@xemacs.org wrote: Guido van Rossum writes:   For example: how we can make the suite of functions used for URL   processing more polymorphic, so that each developer can choose for   herself how URLs need to be treated in her

Re: [Python-Dev] bytes / unicode

2010-06-24 Thread Nick Coghlan
On Fri, Jun 25, 2010 at 12:33 AM, Guido van Rossum gu...@python.org wrote: Also, IMO a polymorphic function should *not* accept *mixed* bytes/text input -- join('x', b'y') should be rejected. But join('x', 'y') -> 'x/y' and join(b'x', b'y') -> b'x/y' make sense to me. A policy of allowing

Re: [Python-Dev] bytes / unicode

2010-06-24 Thread Guido van Rossum
On Thu, Jun 24, 2010 at 8:25 AM, Nick Coghlan ncogh...@gmail.com wrote: On Fri, Jun 25, 2010 at 12:33 AM, Guido van Rossum gu...@python.org wrote: Also, IMO a polymorphic function should *not* accept *mixed* bytes/text input -- join('x', b'y') should be rejected. But join('x', 'y') -> 'x/y' and
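The policy quoted above (same type in, same type out, mixed input rejected) could look roughly like the sketch below. This `join` is a hypothetical stand-in written for illustration, not a stdlib function or code from the thread.

```python
def join(base, part):
    """Hypothetical polymorphic join: str+str -> str, bytes+bytes -> bytes."""
    if isinstance(base, str) and isinstance(part, str):
        sep = "/"
    elif isinstance(base, bytes) and isinstance(part, bytes):
        sep = b"/"
    else:
        # Mixed bytes/text input is rejected rather than silently coerced.
        raise TypeError("join() arguments must both be str or both be bytes")
    return base + sep + part


text_result = join("x", "y")
bytes_result = join(b"x", b"y")
```

Calling `join("x", b"y")` raises `TypeError`, which is the rejection the quoted message asks for.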

Re: [Python-Dev] bytes / unicode

2010-06-24 Thread Baptiste Carvello
P.J. Eby a écrit : [...] stdlib constants are almost always ASCII, and the main use cases for ebytes would involve ascii-extended encodings.) Then, how about a new ascii string literal? This would produce a special kind of string that would coerce to a normal string when mixed with a str,

Re: [Python-Dev] bytes / unicode

2010-06-24 Thread P.J. Eby
At 05:12 PM 6/24/2010 +0900, Stephen J. Turnbull wrote: Guido van Rossum writes: For example: how we can make the suite of functions used for URL processing more polymorphic, so that each developer can choose for herself how URLs need to be treated in her application. While you have come

Re: [Python-Dev] bytes / unicode

2010-06-24 Thread Nick Coghlan
On Fri, Jun 25, 2010 at 3:07 AM, P.J. Eby p...@telecommunity.com wrote: (Btw, in some earlier emails, Stephen, you implied that this could be fixed with codecs -- but it can't, because the problem isn't with the bytes containing invalid Unicode, it's with the Unicode containing invalid bytes

Re: [Python-Dev] bytes / unicode

2010-06-24 Thread Nick Coghlan
On Fri, Jun 25, 2010 at 1:41 AM, Guido van Rossum gu...@python.org wrote: I don't think we should abuse sum for this. A simple idiom to get the *empty* string of a particular type is x[:0] so you could write something like this to concatenate a list or strings or bytes: xs[:0].join(xs). Note
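A small sketch of the `x[:0]` idiom mentioned in this snippet. The function name `concat` is invented here, and taking the slice from the first element (`xs[0][:0]`) is an assumption about how the idiom applies to a list, since a list object itself has no `join` method.

```python
def concat(xs):
    """Concatenate a non-empty list of str or a non-empty list of bytes."""
    # Slicing with [:0] yields the empty value of the element's own type:
    # "" for str, b"" for bytes.
    empty = xs[0][:0]
    return empty.join(xs)


joined_text = concat(["ab", "cd"])
joined_bytes = concat([b"ab", b"cd"])
```

The sketch needs at least one element to discover the type; an empty list raises `IndexError`.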

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Stephen J. Turnbull
Ian Bicking writes: Just for perspective, I don't know if I've ever wanted to deal with a URL like that. Ditto, I do many times a day for Japanese media sites and Wikipedia. I know how it is supposed to work, and I know what a browser does with that, but so many tools will clean that

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Stephen J. Turnbull
James Y Knight writes: The surrogateescape method is a nice workaround for this, but I can't help thinking that it might've been better to just treat stuff as possibly-invalid-but-probably-utf8 byte-strings from input, through processing, to output. This is the world we already
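For readers unfamiliar with the `surrogateescape` error handler discussed here, a minimal round-trip sketch follows. The sample byte string is made up for illustration.

```python
# Latin-1 encoded bytes; the final byte 0xE9 is not valid UTF-8 on its own.
raw = b"caf\xe9"

# Undecodable bytes are smuggled into the str as lone surrogates (U+DC80..U+DCFF).
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("utf-8", errors="surrogateescape")

# Without the handler, the lone surrogate cannot be encoded.
try:
    text.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True
```

The resulting `str` is usable for slicing and joining, but it only encodes back to bytes when the same handler is used.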

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread M.-A. Lemburg
Nick Coghlan wrote: On Wed, Jun 23, 2010 at 4:09 AM, M.-A. Lemburg m...@egenix.com wrote: It would be great if we could have something like the above as builtin method: x.split(''.as(x)) As per my other message, another possible (and reasonably intuitive) spelling would be:

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Nick Coghlan
On Wed, Jun 23, 2010 at 7:18 PM, M.-A. Lemburg m...@egenix.com wrote: Note that the point of using a builtin method was to get better performance. Such type adaptions are often needed in loops, so adding a few extra Python function calls just to convert a str object to a bytes object or

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread P.J. Eby
At 08:34 PM 6/22/2010 -0400, Glyph Lefkowitz wrote: I suspect the practical problem here is that there's no CharacterString ABC That, and the absence of a string coercion protocol so that mixing your custom string with standard strings will do the right thing for your intended use.

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Tres Seaver
Stephen J. Turnbull wrote: We do need str-based implementations of modules like urllib. Why would that be? URLs aren't text, and never will be. The fact that to the eye they may seem to be text-ish doesn't make them text. This *is* a case where

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Guido van Rossum
On Wed, Jun 23, 2010 at 8:30 AM, Tres Seaver tsea...@palladion.com wrote: Stephen J. Turnbull wrote: We do need str-based implementations of modules like urllib. Why would that be?  URLs aren't text, and never will be.  The fact that to the eye they may seem to be text-ish doesn't make them

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Barry Warsaw
On Jun 23, 2010, at 08:43 AM, Guido van Rossum wrote: So I propose that we drop the discussion "are URLs text or bytes" and try to find something more pragmatic to discuss. email has exactly the same question, and the answer is yes. <wink> For example: how we can make the suite of functions used

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Bill Janssen
Tres Seaver tsea...@palladion.com wrote: Stephen J. Turnbull wrote: We do need str-based implementations of modules like urllib. Why would that be? URLs aren't text, and never will be. The fact that to the eye they may seem to be text-ish doesn't make them text. This URLs are exactly

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Ian Bicking
On Wed, Jun 23, 2010 at 10:30 AM, Tres Seaver tsea...@palladion.com wrote: Stephen J. Turnbull wrote: We do need str-based implementations of modules like urllib. Why would that be? URLs aren't text, and never will be. The fact that to the eye they may seem to be text-ish doesn't make

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Bill Janssen
Guido van Rossum gu...@python.org wrote: So I propose that we drop the discussion "are URLs text or bytes" and try to find something more pragmatic to discuss. For example: how we can make the suite of functions used for URL processing more polymorphic, so that each developer can choose for

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Ian Bicking
Oops, I forgot some important quoting (important for the algorithm, maybe not actually for the discussion)... from urllib.parse import urlsplit, urlunsplit import encodings.idna # urllib.parse.quote both always returns str, and is not as conservative in quoting as required here... def

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Glyph Lefkowitz
On Jun 22, 2010, at 8:57 PM, Robert Collins wrote: bzr has a cache of decoded strings in it precisely because decode is slow. We accept slowness encoding to the users locale because thats typically much less data to examine than we've examined while generating the commit/diff/whatever. We

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Tres Seaver
Bill Janssen wrote: The bigger problem seems to be that we're revisiting the design discussion about urllib.parse from the summer of 2008. See http://bugs.python.org/issue3300 if you want to recall how we hashed this out 2 years ago. I didn't

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Antoine Pitrou
On Wed, 23 Jun 2010 14:23:33 -0400 Tres Seaver tsea...@palladion.com wrote: Perhaps such decisions need revisiting in light of subsequent experience / pain / learning. E.g: - - the repeated inability of the web-sig to converge on appropriate semantics for a Python3-compatible version of

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Toshio Kuratomi
On Wed, Jun 23, 2010 at 09:36:45PM +0200, Antoine Pitrou wrote: On Wed, 23 Jun 2010 14:23:33 -0400 Tres Seaver tsea...@palladion.com wrote: - - the slow adoption / porting rate of major web frameworks and libraries to Python 3. Some of the major web frameworks and libraries have a ton

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Antoine Pitrou
On Wed, 23 Jun 2010 17:30:22 -0400 Toshio Kuratomi a.bad...@gmail.com wrote: Note that this assumption seems optimistic to me. I started talking to Graham Dumpleton, author of mod_wsgi a couple years back because mod_wsgi and paste do decoding of bytes to unicode at different layers which

Re: [Python-Dev] bytes / unicode

2010-06-23 Thread Toshio Kuratomi
On Wed, Jun 23, 2010 at 11:35:12PM +0200, Antoine Pitrou wrote: On Wed, 23 Jun 2010 17:30:22 -0400 Toshio Kuratomi a.bad...@gmail.com wrote: Note that this assumption seems optimistic to me. I started talking to Graham Dumpleton, author of mod_wsgi a couple years back because mod_wsgi

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Toshio Kuratomi
On Tue, Jun 22, 2010 at 11:58:57AM +0900, Stephen J. Turnbull wrote: Toshio Kuratomi writes: One comment here -- you can also have uri's that aren't decodable into their true textual meaning using a single encoding. Apache will happily serve out uris that have utf-8, shift-jis,

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Glyph Lefkowitz
On Jun 21, 2010, at 10:58 PM, Stephen J. Turnbull wrote: The RFC says that URIs are text, and therefore they can (and IMO should) be operated on as text in the stdlib. No, *blue* is the best color for a shed. Oops, wait, let me try that again. While I broadly agree with this statement, it

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Raymond Hettinger
On Jun 21, 2010, at 10:31 PM, Glyph Lefkowitz wrote: This is a common pain-point for porting software to 3.x - you had a string, it kinda worked most of the time before, but now you need to keep track of text too and the functions which seemed to work on bytes no longer do. Thanks Glyph.

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Stephen J. Turnbull
Glyph Lefkowitz writes: On Jun 21, 2010, at 10:58 PM, Stephen J. Turnbull wrote: Note also that the complete solution argument cuts both ways. Eg, a complete solution should implement UTS 39 confusables detection[1] and IDNA[2]. Good luck doing that with bytes! And good luck

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Stephen J. Turnbull
Toshio Kuratomi writes: I'll definitely buy that. Would urljoin(b_base, b_subdir) = bytes and urljoin(u_base, u_subdir) = unicode be acceptable though? Probably. But it doesn't matter what I say, since Guido has defined that as polymorphism and approved it in principle. (I think,

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Guido van Rossum
[Just addressing one little issue here; generally I'm just happy that we're discussing this issue in such detail from so many points of view.] On Mon, Jun 21, 2010 at 10:50 PM, Toshio Kuratomi a.bad...@gmail.com wrote: [...] Would urljoin(b_base, b_subdir) = bytes and urljoin(u_base, u_subdir) =

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Ian Bicking
On Tue, Jun 22, 2010 at 6:31 AM, Stephen J. Turnbull step...@xemacs.org wrote: Toshio Kuratomi writes: I'll definitely buy that. Would urljoin(b_base, b_subdir) = bytes and urljoin(u_base, u_subdir) = unicode be acceptable though? Probably. But it doesn't matter what I say, since

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Toshio Kuratomi
On Tue, Jun 22, 2010 at 08:31:13PM +0900, Stephen J. Turnbull wrote: Toshio Kuratomi writes: unicode handling redesign. I'm stating my reading of the RFC not to defend the use case Philip has, but because I think that the outlook that non-text uris (before being percentencoded) are

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread James Y Knight
On Jun 22, 2010, at 1:03 PM, Ian Bicking wrote: Similarly I'd expect (from experience) that a programmer using Python to want to take the same approach, sticking with unencoded data in nearly all situations. Yeah. This is a real issue I have with the direction Python3 went: it pushes you

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread M.-A. Lemburg
Guido van Rossum wrote: [Just addressing one little issue here; generally I'm just happy that we're discussing this issue in such detail from so many points of view.] On Mon, Jun 21, 2010 at 10:50 PM, Toshio Kuratomi a.bad...@gmail.com wrote: [...] Would urljoin(b_base, b_subdir) = bytes

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Terry Reedy
On 6/22/2010 1:22 AM, Glyph Lefkowitz wrote: The thing that I have heard in passing from a couple of folks with experience in this area is that some older software in asia would present characters differently if they were originally encoded in a japanese encoding versus a chinese encoding, even

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Terry Reedy
On 6/22/2010 12:53 PM, Guido van Rossum wrote: On Mon, Jun 21, 2010 at 11:47 PM, Raymond Hettinger raymond.hettin...@gmail.com wrote: On Jun 21, 2010, at 10:31 PM, Glyph Lefkowitz wrote: This is a common pain-point for porting software to 3.x - you had a string, it kinda worked most of

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Ian Bicking
On Tue, Jun 22, 2010 at 1:07 PM, James Y Knight f...@fuhm.net wrote: The surrogateescape method is a nice workaround for this, but I can't help thinking that it might've been better to just treat stuff as possibly-invalid-but-probably-utf8 byte-strings from input, through processing, to

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Robert Collins
On Wed, Jun 23, 2010 at 6:09 AM, M.-A. Lemburg m...@egenix.com wrote:           return constant.encode('utf-8') So now you can write x.split(literal_as('', x)). This polymorphism is what we used in Python2 a lot to write code that works for both Unicode and 8-bit strings. Unfortunately,
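Only the last line of the `literal_as` helper survives in this snippet, so the version below is a reconstruction under assumptions: the signature `literal_as(constant, example)` is taken from the quoted call, while the branch structure and the `"&"` separator used in the usage lines are guesses made for illustration.

```python
def literal_as(constant, example):
    """Return a str constant converted to the same type as `example`."""
    if isinstance(example, str):
        return constant
    if isinstance(example, bytes):
        # Matches the one surviving line of the quoted helper.
        return constant.encode("utf-8")
    raise TypeError("example must be str or bytes")


def split_fields(x):
    """Split on '&' whether x is text or bytes (separator is an assumption)."""
    return x.split(literal_as("&", x))


text_fields = split_fields("a=1&b=2")
bytes_fields = split_fields(b"a=1&b=2")
```

The cost raised elsewhere in the thread is visible here: every literal in a polymorphic function pays for an extra Python-level call.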

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Nick Coghlan
On Wed, Jun 23, 2010 at 2:17 AM, Guido van Rossum gu...@python.org wrote: (1) Literals. If you write something like x.split('') you are implicitly assuming x is text. I don't see a very clean way to overcome this; you'll have to implement some kind of type check e.g.    x.split('') if

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Nick Coghlan
On Wed, Jun 23, 2010 at 4:09 AM, M.-A. Lemburg m...@egenix.com wrote: It would be great if we could have something like the above as builtin method: x.split(''.as(x)) As per my other message, another possible (and reasonably intuitive) spelling would be: x.split(x.coerce('')) Writing it

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Michael Foord
On 22/06/2010 22:40, Robert Collins wrote: On Wed, Jun 23, 2010 at 6:09 AM, M.-A. Lemburg m...@egenix.com wrote: return constant.encode('utf-8') So now you can write x.split(literal_as('', x)). This polymorphism is what we used in Python2 a lot to write code that works

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Michael Foord
On 22/06/2010 19:07, James Y Knight wrote: On Jun 22, 2010, at 1:03 PM, Ian Bicking wrote: Similarly I'd expect (from experience) that a programmer using Python to want to take the same approach, sticking with unencoded data in nearly all situations. Yeah. This is a real issue I have with

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Ian Bicking
On Tue, Jun 22, 2010 at 11:17 AM, Guido van Rossum gu...@python.org wrote: (2) Data sources. These can be functions that produce new data from non-string data, e.g. str(int), read it from a named file, etc. An example is read() vs. write(): it's easy to create a (hypothetical) polymorphic

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread P.J. Eby
At 07:41 AM 6/23/2010 +1000, Nick Coghlan wrote: Then my example above could be made polymorphic (for ASCII compatible encodings) by writing: [x for x in seq if x.endswith(x.coerce(b))] I'm trying to see downsides to this idea, and I'm not really seeing any (well, other than 2.7 being almost

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Glyph Lefkowitz
On Jun 22, 2010, at 12:53 PM, Guido van Rossum wrote: On Mon, Jun 21, 2010 at 11:47 PM, Raymond Hettinger raymond.hettin...@gmail.com wrote: On Jun 21, 2010, at 10:31 PM, Glyph Lefkowitz wrote: This is a common pain-point for porting software to 3.x - you had a string, it kinda worked

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Glyph Lefkowitz
On Jun 22, 2010, at 2:07 PM, James Y Knight wrote: Yeah. This is a real issue I have with the direction Python3 went: it pushes you into decoding everything to unicode early, even when you don't care -- all you really wanted to do is pass it from one API to another, with some well-defined

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Glyph Lefkowitz
On Jun 22, 2010, at 7:23 PM, Ian Bicking wrote: This is a place where bytes+encoding might also have some benefit. XML is someplace where you might load a bunch of data but only touch a little bit of it, and the amount of data is frequently large enough that the efficiencies are

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Mike Klaas
On Tue, Jun 22, 2010 at 4:23 PM, Ian Bicking i...@colorstudy.com wrote: This reminds me of the optimization ElementTree and lxml made in Python 2 (not sure what they do in Python 3?) where they use str when a string is ASCII to avoid the memory and performance overhead of unicode. An

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Robert Collins
On Wed, Jun 23, 2010 at 12:25 PM, Glyph Lefkowitz gl...@twistedmatrix.com wrote: I can also appreciate what's been said in this thread a bunch of times: to my knowledge, nobody has actually shown a profile of an application where encoding is significant overhead.  I believe that encoding

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Stephen J. Turnbull
Robert Collins writes: Also, url's are bytestrings - by definition; Eh? RFC 3896 explicitly says A URI is an identifier consisting of a sequence of characters matching the syntax rule named URI in Section 3. (where the phrase sequence of characters appears in all ancestors I found

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Lennart Regebro
2010/6/21 Stephen J. Turnbull step...@xemacs.org: IMO, the UI is right.  Something like the above ought to work. Right. That said, many times when you want to do urlparse etc they might be binary, and you might want binary. So maybe the methods should work with both? -- Lennart Regebro:

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Nick Coghlan
On Mon, Jun 21, 2010 at 12:30 PM, P.J. Eby p...@telecommunity.com wrote: I also find it weird that there seem to be two camps on this subject, one of which claims that All Is Well And There Is No Problem -- but I do not recall seeing anyone who was in the What do I do; this doesn't seem ready

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Stephen J. Turnbull
Lennart Regebro writes: 2010/6/21 Stephen J. Turnbull step...@xemacs.org: IMO, the UI is right.  Something like the above ought to work. Right. That said, many times when you want to do urlparse etc they might be binary, and you might want binary. So maybe the methods should work

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread P.J. Eby
At 10:51 PM 6/21/2010 +1000, Nick Coghlan wrote: It may be that there are places where we need to rewrite standard library algorithms to be bytes/str neutral (e.g. by using length one slices instead of indexing). It may be that there are more APIs that need to grow encoding keyword arguments
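The remark about "length one slices instead of indexing" refers to a difference that a short sketch makes concrete. The function names below are invented for illustration.

```python
def first_by_index(data):
    """Indexing: str gives a 1-character str, bytes gives an int."""
    return data[0]


def first_by_slice(data):
    """Slicing: str gives str, bytes gives bytes, so the type is preserved."""
    return data[:1]


indexed_text = first_by_index("abc")
indexed_bytes = first_by_index(b"abc")
sliced_text = first_by_slice("abc")
sliced_bytes = first_by_slice(b"abc")
```

Code written against slices therefore behaves the same for both types, whereas code written against indexing has to special-case bytes.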

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Michael Foord
On 21/06/2010 17:46, P.J. Eby wrote: At 10:51 PM 6/21/2010 +1000, Nick Coghlan wrote: It may be that there are places where we need to rewrite standard library algorithms to be bytes/str neutral (e.g. by using length one slices instead of indexing). It may be that there are more APIs that need

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread P.J. Eby
At 01:08 AM 6/22/2010 +0900, Stephen J. Turnbull wrote: But if you need that everywhere, what's so hard about def urljoin_wrapper (base, subdir): return urljoin(str(base, 'latin-1'), subdir).encode('latin-1') Now, note how that pattern fails as soon as you want to use non-ISO-8859-1
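The Latin-1 wrapper quoted in this snippet can be sketched as runnable code. The name `urljoin_bytes` and the decision to decode both arguments are assumptions made here; the quoted one-liner only decodes `base`, and the example URL is invented.

```python
from urllib.parse import urljoin


def urljoin_bytes(base, subdir):
    """Join two bytes URLs by round-tripping through Latin-1 text."""
    # Latin-1 maps every byte 0x00..0xFF to one code point, so decoding
    # never fails and encoding the joined result restores the raw bytes.
    joined = urljoin(base.decode("latin-1"), subdir.decode("latin-1"))
    return joined.encode("latin-1")


plain = urljoin_bytes(b"http://example.com/a/", b"b")
high_byte = urljoin_bytes(b"http://example.com/caf\xe9/", b"x")
```

The round trip is lossless only while everything in the pipeline stays within Latin-1, which is the limitation the reply goes on to raise.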

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Toshio Kuratomi
On Tue, Jun 22, 2010 at 01:08:53AM +0900, Stephen J. Turnbull wrote: Lennart Regebro writes: 2010/6/21 Stephen J. Turnbull step...@xemacs.org: IMO, the UI is right.  Something like the above ought to work. Right. That said, many times when you want to do urlparse etc they might

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Terry Reedy
On 6/20/2010 11:56 PM, Terry Reedy wrote: The specific example is urllib.parse.parse_qsl('a=b%e0') [('a', 'b�')] where the character after 'b' is white ? in dark diamond, indicating an error. parse_qsl() splits that input on '=' and sends each piece to urllib.parse.unquote unquote()
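The example in this message can be reproduced with current `urllib.parse`. The sketch below also shows two alternatives: the `encoding` keyword of `parse_qsl` and `unquote_to_bytes`. Both exist in the stdlib today, but whether they were available when this message was written is not established by the snippet.

```python
from urllib.parse import parse_qsl, unquote_to_bytes

# Default: percent-escapes are decoded as UTF-8 with errors="replace",
# so the lone byte 0xE0 becomes U+FFFD (the replacement character).
default_pairs = parse_qsl("a=b%e0")

# Naming an encoding that covers the byte gives a real character instead.
latin1_pairs = parse_qsl("a=b%e0", encoding="latin-1")

# Unquoting to bytes sidesteps the decoding question entirely.
raw_value = unquote_to_bytes("b%e0")
```

The replacement character in the default result is the "white ? in dark diamond" the message describes.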

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Guido van Rossum
On Mon, Jun 21, 2010 at 9:46 AM, P.J. Eby p...@telecommunity.com wrote: At 10:51 PM 6/21/2010 +1000, Nick Coghlan wrote: It may be that there are places where we need to rewrite standard library algorithms to be bytes/str neutral (e.g. by using length one slices instead of indexing). It may

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread P.J. Eby
At 05:49 PM 6/21/2010 +0100, Michael Foord wrote: Why is your proposed bstr wrapper not practical to implement outside the core and use in your own libraries and frameworks? __contains__ doesn't have a converse operation, so you can't code a type that works around this (Python 3.1 shown):
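The point about `__contains__` having no converse operation can be illustrated with a hypothetical wrapper. The class name `BStr` and its Latin-1 policy are inventions for this sketch, not the `bstr` type proposed in the thread.

```python
class BStr(bytes):
    """Hypothetical bytes subclass that tolerates str operands."""

    def __contains__(self, item):
        # Works when BStr is the container: "a" in BStr(b"abc")
        if isinstance(item, str):
            item = item.encode("latin-1")
        return super().__contains__(item)

    def __radd__(self, other):
        # Addition has a reflected hook, so "x" + BStr(b"y") reaches us.
        if isinstance(other, str):
            return other + self.decode("latin-1")
        return NotImplemented


contained = "a" in BStr(b"abc")
reflected_add = "x" + BStr(b"y")

# There is no reflected hook for `in`: with a plain str as the container,
# str.__contains__ is called and the wrapper never gets a say.
try:
    BStr(b"a") in "abc"
    converse_failed = False
except TypeError:
    converse_failed = True
```

Binary operators such as `+` consult the right-hand operand through `__radd__`, but `in` only ever consults the container, which is why a third-party type cannot fix this case on its own.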

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Terry Reedy
On 6/21/2010 8:51 AM, Nick Coghlan wrote: I don't know that the all is well camp actually exists. The camp that I do see existing is the one that says without a bug report, inconsistencies in the standard library's unicode handling won't get fixed. The issues picked up by the regression test

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread P.J. Eby
At 12:56 PM 6/21/2010 -0400, Toshio Kuratomi wrote: One comment here -- you can also have uri's that aren't decodable into their true textual meaning using a single encoding. Apache will happily serve out uris that have utf-8, shift-jis, and euc-jp components inside of their path but the

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread P.J. Eby
At 10:29 AM 6/21/2010 -0700, Guido van Rossum wrote: Perhaps there are more situations where a polymorphic API would be helpful. Such APIs are not always so easy to implement, because they have to be careful with literals or other constants (and even more so mutable state) used internally -- but

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Robert Collins
2010/6/21 Stephen J. Turnbull step...@xemacs.org: Robert Collins writes:   Also, url's are bytestrings - by definition; Eh?  RFC 3896 explicitly says ?Definitions of Managed Objects for the DS3/E3 Interface Type Perhaps you mean 3986 ? :)    A URI is an identifier consisting of a sequence

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Terry Reedy
On 6/21/2010 1:29 PM, P.J. Eby wrote: At 05:49 PM 6/21/2010 +0100, Michael Foord wrote: Why is your proposed bstr wrapper not practical to implement outside the core and use in your own libraries and frameworks? __contains__ doesn't have a converse operation, so you can't code a type that

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Terry Reedy
On 6/21/2010 1:29 PM, Guido van Rossum wrote: Actually, the big problem with Python 2 is that if you mix str and unicode, things work or crash depending on whether any of the str objects involved contain non-ASCII bytes. If one API decides to upgrade to Unicode, the result, when passed to

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Stephen J. Turnbull
Toshio Kuratomi writes: One comment here -- you can also have uri's that aren't decodable into their true textual meaning using a single encoding. Apache will happily serve out uris that have utf-8, shift-jis, and euc-jp components inside of their path but the textual representation

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Stephen J. Turnbull
Robert Collins writes: Perhaps you mean 3986 ? :) Thank you for the correction.    A URI is an identifier consisting of a sequence of characters    matching the syntax rule named URI in Section 3. (where the phrase sequence of characters appears in all ancestors I found back to

Re: [Python-Dev] bytes / unicode

2010-06-21 Thread Glyph Lefkowitz
On Jun 21, 2010, at 2:17 PM, P.J. Eby wrote: One issue I remember from my enterprise days is some of the Asian-language developers at NTT/Verio explaining to me that unicode doesn't actually solve certain issues -- that there are use cases where you really *do* need bytes plus encoding in

[Python-Dev] bytes / unicode

2010-06-20 Thread Antoine Pitrou
On Sun, 20 Jun 2010 14:40:56 -0400 P.J. Eby p...@telecommunity.com wrote: Actually, I would say that it's more that (in the network protocol case) we *have* bytes, some of which we would like to *treat* as text, yet do not wish to constantly convert back and forth to full-blown unicode

Re: [Python-Dev] bytes / unicode

2010-06-20 Thread Benjamin Peterson
2010/6/20 Antoine Pitrou solip...@pitrou.net: On Sun, 20 Jun 2010 14:40:56 -0400 P.J. Eby p...@telecommunity.com wrote: Actually, I would say that it's more that (in the network protocol case) we *have* bytes, some of which we would like to *treat* as text, yet do not wish to constantly

Re: [Python-Dev] bytes / unicode

2010-06-20 Thread Robert Collins
Also, url's are bytestrings - by definition; if the standard library has made them unicode objects in 3, I expect a lot of pain in the webserver space. -Rob

Re: [Python-Dev] bytes / unicode

2010-06-20 Thread Terry Reedy
On 6/20/2010 5:55 PM, Benjamin Peterson wrote: 2010/6/20 Antoine Pitrou solip...@pitrou.net: On Sun, 20 Jun 2010 14:40:56 -0400 P.J. Eby p...@telecommunity.com wrote: Actually, I would say that it's more that (in the network protocol case) we *have* bytes, some of which we would like to

Re: [Python-Dev] bytes / unicode

2010-06-20 Thread P.J. Eby
At 07:33 PM 6/20/2010 -0400, Terry Reedy wrote: Do you have in mind any tools that could and should operate on both, but do not? From http://mail.python.org/pipermail/web-sig/2009-September/004105.html : The problem which arises is that unquoting of URLs in Python 3.X stdlib can only be done

Re: [Python-Dev] bytes / unicode

2010-06-20 Thread P.J. Eby
At 11:47 PM 6/20/2010 +0200, Antoine Pitrou wrote: On Sun, 20 Jun 2010 14:40:56 -0400 P.J. Eby p...@telecommunity.com wrote: Actually, I would say that it's more that (in the network protocol case) we *have* bytes, some of which we would like to *treat* as text, yet do not wish to constantly

Re: [Python-Dev] bytes / unicode

2010-06-20 Thread Terry Reedy
On 6/20/2010 9:33 PM, P.J. Eby wrote: At 07:33 PM 6/20/2010 -0400, Terry Reedy wrote: Do you have in mind any tools that could and should operate on both, but do not? From http://mail.python.org/pipermail/web-sig/2009-September/004105.html : Thanks for the concrete examples in this and your
