Re: [Python-Dev] bytes / unicode
R. David Murray wrote:

> Having such a poly_str type would probably make my life easier.

A thought on this poly_str type: perhaps it could be called ascii, since
that's what it would have to be restricted to, and have a'xxx' as a
literal syntax for it, seeing as literals seem to be one of its main use
cases.

> I also would like to just vent a little frustration at having to use
> single-character-slice notation when I want to index a character in a
> string in my algorithms

Thinking way outside the square, and probably the pale as well, maybe @
could be pressed into service as an infix operator, with s@i being
equivalent to s[i:i+1].

--
Greg

___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] bytes / unicode
On Mon, Jun 28, 2010 at 08:28:45PM +1200, Greg Ewing wrote:

> A thought on this poly_str type: perhaps it could be called ascii,
> since that's what it would have to be restricted to, and have a'xxx'
> as a literal syntax for it, seeing as literals seem to be one of its
> main use cases.

This seems like a good idea.

> Thinking way outside the square, and probably the pale as well, maybe
> @ could be pressed into service as an infix operator, with s@i being
> equivalent to s[i:i+1]

And this is way beyond being intuitive.

--
Senthil
Re: [Python-Dev] bytes / unicode
On Mon, 28 Jun 2010 13:55:26 +0530, Senthil Kumaran orsent...@gmail.com wrote:

> On Mon, Jun 28, 2010 at 08:28:45PM +1200, Greg Ewing wrote:
>> Thinking way outside the square, and probably the pale as well, maybe
>> @ could be pressed into service as an infix operator, with s@i being
>> equivalent to s[i:i+1]
>
> And this is way beyond being intuitive.

Agreed, -1 on that. Like I said, I was just venting. The decision to
have indexing bytes return an int is set in stone now and I just have to
live with it.

--
R. David Murray www.bitdance.com
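[The behaviour Murray is resigned to here is easy to demonstrate; this quick illustration is an editorial addition, not part of the original thread:]

```python
data = b"abc"

# In Python 3, indexing a bytes object yields an int...
print(data[0])       # 97

# ...so code that wants a one-byte bytes object must slice instead.
print(data[0:1])     # b'a'

# str behaves differently: indexing gives a length-1 str.
assert "abc"[0] == "a"

# The proposed s@i operator would have been sugar for exactly this slice.
assert data[0:1] == b"a"
```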
Re: [Python-Dev] bytes / unicode
On Mon, Jun 28, 2010 at 6:28 PM, Greg Ewing greg.ew...@canterbury.ac.nz wrote:

> R. David Murray wrote:
>> Having such a poly_str type would probably make my life easier.
>
> A thought on this poly_str type: perhaps it could be called ascii,
> since that's what it would have to be restricted to, and have a'xxx'
> as a literal syntax for it, seeing as literals seem to be one of its
> main use cases.

One of the virtues of doing this as a helper type in a module somewhere
(probably string) is that we can defer that kind of decision until
later.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] bytes / unicode
On Sat, 26 Jun 2010 23:49:11 -0400 P.J. Eby p...@telecommunity.com wrote:

> Remember, bytes and strings already have to detect mixed-type
> operations.

Not in Python 3. They just raise a TypeError on bad (mixed-type)
arguments.

Regards
Antoine.
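[The behaviour Antoine describes is the one Python 3 still has; a quick editorial illustration:]

```python
# Python 3 rejects mixed str/bytes operations outright rather than coercing:
try:
    b"abc" + "def"
except TypeError:
    print("concatenating bytes and str raises TypeError")

# Membership tests are type-checked the same way.
try:
    "py" in b"python"
except TypeError:
    print("mixed-type membership tests raise too")
```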
Re: [Python-Dev] bytes / unicode
P.J. Eby writes:

> At 12:42 PM 6/26/2010 +0900, Stephen J. Turnbull wrote:
>> What I'm saying here is that if bytes are the signal of validity, and
>> the stdlib functions preserve validity, then it's better to have the
>> stdlib functions object to unicode data as an argument. Compare the
>> alternative: it returns a unicode object which might get passed
>> around for a while before one of your functions receives it and
>> identifies it as unvalidated data.
>
> I still don't follow,

OK, I give up, since it was your use case that concerned me. I obviously
misunderstood. Sorry for the confusion.

Sign me,
+1 on polymorphic functions in Tsukuba Japan

>> In general this is a hard problem, though. Polymorphism, OK, one-way
>> tainting OK, but in general combining related types is pretty
>> arbitrary, and as in the encoded-bytes case, the result type often
>> varies depending on expectations of callers, not the types of the
>> data.
>
> But the caller can enforce those expectations by passing in arguments
> whose types do what they want in such cases, as long as the string
> literals used by the function don't get to override the relevant parts
> of the string protocol(s).

This simply isn't true for encoded bytes as proposed. For encoded text,
the current encoding has no deterministic relationship to the desired
encoding (at the level of generality of the stdlib; of course in
specific applications it may be mandated by a standard or private
convention).

I will have to pass on your other user-defined string types. I've never
tried to implement one. I only wanted to point out that a
user-controllable tainted string type would be preferable to confounding
unicode with tainted.
Re: [Python-Dev] bytes / unicode
At 03:53 PM 6/27/2010 +1000, Nick Coghlan wrote:

> We could talk about this even longer, but the most effective way
> forward is going to be a patch that improves the URL parsing
> situation.

Certainly, it's the only practical solution for the immediate problems
in 3.2. I only mentioned that I hate the idea because I'd be more
comfortable if it was explicitly declared to be a temporary hack to work
around the absence of a string coercion protocol, due to the moratorium
on language changes.

But, since the moratorium *is* in effect, I'll try to make this my last
post on string protocols for a while... and maybe wait until I've looked
at the code (str/bytes C implementations) in more detail and can make a
more concrete proposal for what the protocol would be and how it would
work. (Not to mention closer to the end of the moratorium.)

> There are a *very small* number of APIs where it is appropriate to be
> polymorphic

This is only true if you focus exclusively on bytes vs. unicode, rather
than the general issue that it's currently impractical to pass *any*
sort of user-defined string type through code that you don't directly
control (stdlib or third-party).

> The virtues of a separate poly_str type are that:
> 1. It can be simple and implemented in Python, dispatching to str or
>    bytes as appropriate (probably in the strings module)
> 2. No chance of impacting the performance of the core interpreter (as
>    builtins are not affected)

Note that adding a string coercion protocol isn't going to change core
performance for existing cases, since any place where the protocol would
be invoked would be a code branch that either throws an error or
*already* falls back to some other protocol (e.g. the buffer protocol).

> 3. Lower impact if it turns out to have been a bad idea

How many protocols have been added that turned out to be bad ideas? The
only ones that have been removed in 3.x, IIRC, are three-way compare,
slice-specific operations, and __coerce__... and I'm going to miss
__cmp__. ;-)

However, IIUC, the reason these protocols were dropped isn't because
they were bad ideas. Rather, they're things that can be implemented in
terms of a finer-grained protocol. i.e., if you want __cmp__ or
__getslice__ or __coerce__, you can always implement them via a mixin
that converts the newer fine-grained protocols into invocations of the
older protocol. (As I plan to do for __cmp__ in the handful of places I
use it.)

At the moment, however, this isn't possible for multi-string operations
outside of __add__/__radd__ and comparison -- the coercion rules are
hard-wired and can't be overridden by user-defined types.
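[The mixin Eby alludes to, converting the fine-grained rich comparison protocol back into a 2.x-style __cmp__, might look like this editorial sketch; the class names are illustrative, not from the thread:]

```python
class CmpMixin:
    """Implement Python 3 rich comparisons in terms of an old-style __cmp__."""
    def __eq__(self, other): return self.__cmp__(other) == 0
    def __ne__(self, other): return self.__cmp__(other) != 0
    def __lt__(self, other): return self.__cmp__(other) < 0
    def __le__(self, other): return self.__cmp__(other) <= 0
    def __gt__(self, other): return self.__cmp__(other) > 0
    def __ge__(self, other): return self.__cmp__(other) >= 0

class Version(CmpMixin):
    # A user class only has to supply the single coarse-grained method.
    def __init__(self, n):
        self.n = n
    def __cmp__(self, other):
        return (self.n > other.n) - (self.n < other.n)

assert Version(1) < Version(2)
assert Version(3) >= Version(3)
```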
Re: [Python-Dev] bytes / unicode
I've been watching this discussion with intense interest, but have been
so lagged in following the thread that I haven't replied. I got caught
up today.

On Sun, 27 Jun 2010 15:53:59 +1000, Nick Coghlan ncogh...@gmail.com wrote:

> The difference is that we have three classes of algorithm here:
> - those that work only on octet sequences
> - those that work only on character sequences
> - those that can work on either
>
> Python 2 lumped all 3 classes of algorithm together through the
> multi-purpose 8-bit str type. The unicode type provided some scope to
> separate out the second category, but the divisions were rather
> blurry. Python 3 forces the first two to be separated by using either
> octets (bytes/bytearray) or characters (str).
>
> There are a *very small* number of APIs where it is appropriate to be
> polymorphic, but this is currently difficult due to the need to supply
> literals of the appropriate type for the objects being operated on.
> This isn't ever going to happen automagically due to the need to
> explicitly provide two literals (one for octet sequences, one for
> character sequences).

In email6 I'm currently handling this by putting the algorithm on a base
class and the literals on 'Bytes...' and 'String...' subclasses as class
variables. Slightly ugly, but it works.

The current design also speaks to an earlier point someone made about
the fact that we are really dealing with more complex, and domain
specific, data, not simply byte strings. A BytesMessage contains lots of
structured encoding information as well as the possibility of 'garbage'
bytes. A StringMessage contains text and data decoded into objects (ex:
an image object), possibly with some PEP 383 surrogates included
(haven't quite figured that part out yet). So, a BytesMessage object
isn't just a byte string, it's a load of structured data that requires
the associated algorithms to convert into meaningful text and objects.
Going the other way, the decisions made about character encodings need
to be encoded into the structured bytes representation that could
ultimately go out on the wire.

I suspect that the same thing needs to be done for URIs/IRIs, and
html/MIME and the corresponding text and objects. It is my hope that the
email6 work will lay a firm foundation for the latter, but URI/IRI is a
whole different protocol that I'm glad I don't have to deal with :)

> The virtues of a separate poly_str type are that:

Having such a poly_str type would probably make my life easier.

I also would like to just vent a little frustration at having to use
single-character-slice notation when I want to index a character in a
string in my algorithms.

--
R. David Murray www.bitdance.com
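[The pattern Murray describes, the algorithm on a base class with literals as class variables on Bytes.../String... subclasses, can be sketched roughly as below; the class names and the header-parsing method are made up for illustration and are not taken from the actual email6 code:]

```python
class _HeaderParseBase:
    # Subclasses supply a literal of the right type; the algorithm is shared.
    COLON = None

    def parse_header(self, line):
        # Split "Name: value" into a (name, value) pair of the input's type.
        name, _, value = line.partition(self.COLON)
        return name.strip(), value.strip()

class BytesHeaderParse(_HeaderParseBase):
    COLON = b":"

class StringHeaderParse(_HeaderParseBase):
    COLON = ":"

assert BytesHeaderParse().parse_header(b"To: guido") == (b"To", b"guido")
assert StringHeaderParse().parse_header("To: guido") == ("To", "guido")
```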
Re: [Python-Dev] bytes / unicode
At 12:42 PM 6/26/2010 +0900, Stephen J. Turnbull wrote:

> What I'm saying here is that if bytes are the signal of validity, and
> the stdlib functions preserve validity, then it's better to have the
> stdlib functions object to unicode data as an argument. Compare the
> alternative: it returns a unicode object which might get passed around
> for a while before one of your functions receives it and identifies it
> as unvalidated data.

I still don't follow, since passing in bytes should return bytes.
Returning unicode would be an error, in the case of a polymorphic
function (per Guido).

> But you agree that there are better mechanisms for validation
> (although not available in Python yet), so I don't see this as a
> potential obstacle to polymorphism now.

Nope. I'm just saying that, given two bytestrings to url-join or path
join or whatever, a polymorph should hand back a bytestring. This seems
pretty uncontroversial.

>> What I want is for the stdlib to create stringlike objects of a type
>> determined by the types of the inputs --
>
> In general this is a hard problem, though. Polymorphism, OK, one-way
> tainting OK, but in general combining related types is pretty
> arbitrary, and as in the encoded-bytes case, the result type often
> varies depending on expectations of callers, not the types of the
> data.

But the caller can enforce those expectations by passing in arguments
whose types do what they want in such cases, as long as the string
literals used by the function don't get to override the relevant parts
of the string protocol(s).

The idea that I'm proposing is that the basic string and byte types
should defer to user-defined string types for mixed type operations, so
that polymorphism of string-manipulation functions is the *default*
case, rather than a *special* case. This makes tainting easier to
implement, as well as optimizing and other special cases (like my source
string w/file and line info, or a string with font/formatting
attributes).
Re: [Python-Dev] bytes / unicode
On Sun, Jun 27, 2010 at 4:17 AM, P.J. Eby p...@telecommunity.com wrote:

> The idea that I'm proposing is that the basic string and byte types
> should defer to user-defined string types for mixed type operations,
> so that polymorphism of string-manipulation functions is the *default*
> case, rather than a *special* case. This makes tainting easier to
> implement, as well as optimizing and other special cases (like my
> source string w/file and line info, or a string with font/formatting
> attributes).

Rather than building this into the base string type, perhaps it would be
better (at least initially) to add in a polymorphic str subtype that
worked along the following lines:

1. Has an encoded argument in the constructor
   (e.g. poly_str('/', encoded=b'/'))
2. If given objects with an encode() method, assumes they're strings and
   uses its own parent class methods
3. If given objects with a decode() method, assumes they're encoded and
   delegates to the encoded attribute

str/bytes agnostic functions would need to invoke poly_str deliberately,
while bytes-only and text-only algorithms could just use the appropriate
literals. Third party types would be supported to some degree (by having
either encode or decode methods), although they could still run into
trouble with some operations. (While full support for third party
strings and byte sequence implementations is an interesting idea, I
think it's overkill for the specific problem of making it easier to
write str/bytes agnostic functions for tasks like URL parsing.)

Regards,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
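[A rough editorial rendering of the three points above; the dispatch rule and names are guesses at what the sketch intends, not an actual implementation from the thread:]

```python
class poly_str(str):
    """Sketch: a str that carries its encoded twin and dispatches on input type."""

    def __new__(cls, text, encoded):
        self = super().__new__(cls, text)
        self.encoded = encoded   # point 1: constructor takes the bytes form
        return self

    def join(self, items):
        items = list(items)
        if items and hasattr(items[0], "decode"):
            # point 3: bytes-like input -> delegate to the encoded attribute
            return self.encoded.join(items)
        # point 2: str-like input -> use the ordinary str machinery
        return str(self).join(items)

slash = poly_str("/", encoded=b"/")
assert slash.join(["a", "b"]) == "a/b"     # text in, text out
assert slash.join([b"a", b"b"]) == b"a/b"  # bytes in, bytes out
```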
Re: [Python-Dev] bytes / unicode
At 12:43 PM 6/27/2010 +1000, Nick Coghlan wrote:

> While full support for third party strings and byte sequence
> implementations is an interesting idea, I think it's overkill for the
> specific problem of making it easier to write str/bytes agnostic
> functions for tasks like URL parsing.

OTOH, to write your partial implementation is almost as complex -- it
still must take into account joining and formatting, and so by that
point, you've just proposed a new protocol for coercion... so why not
just make the coercion protocol explicit in the first place, rather than
hardwiring a third type's worth of special cases?

Remember, bytes and strings already have to detect mixed-type
operations. If there was an API for that, then the hardcoded special
cases would just be replaced, or supplemented with type slot checks and
calls after the special cases.

To put it another way, if you already have two types special-casing
their interactions with each other, then rather than add a *third* type
to that mix, maybe it's time to have a protocol instead, so that the
types that care can do the special-casing themselves, and you generalize
to N user types.

(Btw, those who are saying that the resulting potential for N*N
interaction makes the feature unworkable seem to be overlooking
metaclasses and custom numeric types -- two Python features that in
principle have the exact same problem, when you use them beyond a
certain scope. At least with those features, though, you can generally
mix your user-defined metaclasses or numeric types with the
Python-supplied basic ones and call arbitrary Python functions on them,
without as much heartbreak as you'll get with a from-scratch stringlike
object.)

All that having been said, a new protocol probably falls under the
heading of the language moratorium, unless it can be considered new
methods on builtins? (But that seems like a stretch even to me.)

I just hate the idea that functions taking strings should have to be
*rewritten* to be explicitly type-agnostic. It seems *so* un-Pythonic...
like if all the bitmasking functions you'd ever written using 32-bit int
constants had to be rewritten just because we added longs to the
language, and you had to upcast them to be compatible or something.
Sounds too much like C or Java or some other non-Python language, where
dynamism and polymorphy are the special case, instead of the general
rule.
Re: [Python-Dev] bytes / unicode
On Sun, Jun 27, 2010 at 1:49 PM, P.J. Eby p...@telecommunity.com wrote:

> I just hate the idea that functions taking strings should have to be
> *rewritten* to be explicitly type-agnostic. It seems *so*
> un-Pythonic... like if all the bitmasking functions you'd ever written
> using 32-bit int constants had to be rewritten just because we added
> longs to the language, and you had to upcast them to be compatible or
> something. Sounds too much like C or Java or some other non-Python
> language, where dynamism and polymorphy are the special case, instead
> of the general rule.

The difference is that we have three classes of algorithm here:
- those that work only on octet sequences
- those that work only on character sequences
- those that can work on either

Python 2 lumped all 3 classes of algorithm together through the
multi-purpose 8-bit str type. The unicode type provided some scope to
separate out the second category, but the divisions were rather blurry.
Python 3 forces the first two to be separated by using either octets
(bytes/bytearray) or characters (str).

There are a *very small* number of APIs where it is appropriate to be
polymorphic, but this is currently difficult due to the need to supply
literals of the appropriate type for the objects being operated on. This
isn't ever going to happen automagically due to the need to explicitly
provide two literals (one for octet sequences, one for character
sequences).

The virtues of a separate poly_str type are that:
1. It can be simple and implemented in Python, dispatching to str or
   bytes as appropriate (probably in the strings module)
2. No chance of impacting the performance of the core interpreter (as
   builtins are not affected)
3. Lower impact if it turns out to have been a bad idea

We could talk about this even longer, but the most effective way forward
is going to be a patch that improves the URL parsing situation.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
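[The literal problem Nick describes shows up even in a trivial function: to stay str/bytes agnostic today, the separator literal has to be chosen by hand to match the argument's type. This helper is an editorial illustration, not code from the thread:]

```python
def first_segment(path):
    # The algorithm is identical either way; only the literal differs,
    # which is exactly what poly_str would paper over.
    sep = "/" if isinstance(path, str) else b"/"
    return path.split(sep, 1)[0]

assert first_segment("usr/lib/python") == "usr"
assert first_segment(b"usr/lib/python") == b"usr"
```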
Re: [Python-Dev] bytes / unicode
Guido van Rossum writes:

> On Thu, Jun 24, 2010 at 1:12 AM, Stephen J. Turnbull step...@xemacs.org wrote:
>> Understood, but both the majority of str/bytes methods and several
>> existing APIs (e.g. many in the os module, like os.listdir()) do it
>> this way.
>
> Understood. Also, IMO a polymorphic function should *not* accept
> *mixed* bytes/text input -- join('x', b'y') should be rejected.

Agreed.

> But join('x', 'y') -> 'x/y' and join(b'x', b'y') -> b'x/y' make sense
> to me. So, actually, I *don't* understand what you mean by needing
> LBYL.

Consider docutils. Some folks assert that URIs *are* bytes and should be
manipulated as such. So base URIs should be bytes. But there are various
ways to refer to a base URI and combine it with a relative URI taken
from literal text in reST. That literal text will be represented as str.
So you want to use urljoin, but this usage isn't polymorphic. If you
forget to do a conversion here, urljoin will raise, of course. But late
conversion may not be appropriate. AIUI Philip at least wants ways to
raise exceptions earlier than that on some code paths. That's LBYL, no?
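[For context: the patch this thread eventually produced made urllib.parse polymorphic in exactly this all-or-nothing way (from Python 3.2 on). Mixing still raises immediately, which is the early error being discussed; an editorial illustration:]

```python
from urllib.parse import urljoin

# All-str and all-bytes calls both work...
assert urljoin("http://example.com/a/", "b") == "http://example.com/a/b"
assert urljoin(b"http://example.com/a/", b"b") == b"http://example.com/a/b"

# ...but mixing the two types is rejected up front.
try:
    urljoin(b"http://example.com/a/", "b")
except TypeError:
    print("mixed str/bytes urljoin raises TypeError")
```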
Re: [Python-Dev] bytes / unicode
P.J. Eby writes:

> This doesn't have to be in the functions; it can be in the *types*.
> Mixed-type string operations have to do type checking and upcasting
> already, but if the protocol were open, you could make an
> encoded-bytes type that would handle the error checking.

Don't you realize that encoded-bytes is equivalent to use of a very
limited profile of ISO 2022 coding extensions? Such as Emacs/MULE
internal encoding or TRON code? It has been tried. It does not work.

I understand how types can do such checking; my point is that the
encoded-bytes type doesn't have enough information to do it in the cases
where you think it is better than converting to str. There are *no
useful operations* that can be done on two encoded-bytes with different
encodings unless you know the ultimate target codec. The only sensible
way to define the concatenation of ('ascii', 'English') with
('euc-jp', '日本語') is something like ('ascii', 'English', 'euc-jp',
'日本語'), and *not* ('euc-jp', 'English日本語'), because you don't know
that the ultimate target codec is 'euc-jp'-compatible.

Worse, you need to build in all the information about which codecs are
mutually compatible into the encoded-bytes type. For example, if the
ultimate target is known to be 'shift_jis', it's trivially compatible
with 'ascii' and 'euc-jp' requires a conversion, but latin-9 you can't
have.

> (Btw, in some earlier emails, Stephen, you implied that this could be
> fixed with codecs -- but it can't, because the problem isn't with the
> bytes containing invalid Unicode, it's with the Unicode containing
> invalid bytes -- i.e., characters that can't be encoded to the
> ultimate codec target.)

No, the problem is not with the Unicode, it is with the code that allows
characters not encodable with the target codec. If you don't have a
target codec, there are ascii-safe source codecs, such as 'latin-1' or
'ascii' with surrogateescape, that will work any time that
bytes-oriented processing can work.
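[The ascii-safe escape hatch Turnbull mentions is the PEP 383 surrogateescape error handler, which round-trips arbitrary bytes through str losslessly; an editorial illustration:]

```python
raw = b"caf\xe9 \xff\xfe"   # not valid ASCII (or UTF-8)

# Decoding with surrogateescape maps undecodable bytes to lone surrogates...
text = raw.decode("ascii", "surrogateescape")

# ...and encoding back with the same handler restores the exact bytes.
assert text.encode("ascii", "surrogateescape") == raw
```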
Re: [Python-Dev] bytes / unicode
At 04:49 PM 6/25/2010 +0900, Stephen J. Turnbull wrote:

> Don't you realize that encoded-bytes is equivalent to use of a very
> limited profile of ISO 2022 coding extensions? Such as Emacs/MULE
> internal encoding or TRON code? It has been tried. It does not work.
>
> I understand how types can do such checking; my point is that the
> encoded-bytes type doesn't have enough information to do it in the
> cases where you think it is better than converting to str. There are
> *no useful operations* that can be done on two encoded-bytes with
> different encodings unless you know the ultimate target codec.

I do know the ultimate target codec -- that's the point. IOW, I want to
be able to do all my operations by passing target-encoded strings to
polymorphic functions. Then, the moment something creeps in that won't
go to the target codec, I'll be able to track down the hole in the
legacy code that's letting bad data creep in.

> The only sensible way to define the concatenation of ('ascii',
> 'English') with ('euc-jp', '日本語') is something like ('ascii',
> 'English', 'euc-jp', '日本語'), and *not* ('euc-jp', 'English日本語'),
> because you don't know that the ultimate target codec is
> 'euc-jp'-compatible. Worse, you need to build in all the information
> about which codecs are mutually compatible into the encoded-bytes
> type. For example, if the ultimate target is known to be 'shift_jis',
> it's trivially compatible with 'ascii' and 'euc-jp' requires a
> conversion, but latin-9 you can't have.

The interaction won't be with other encoded bytes, it'll be with other
*unicode* strings. Ones coming from other code, and literals embedded in
the stdlib.

> No, the problem is not with the Unicode, it is with the code that
> allows characters not encodable with the target codec.

And which code that is, precisely, is the thing that may be very
difficult to find, unless I can identify it at the first point it enters
(and corrupts) my output data. When dealing with a large code base, this
may be a nontrivial problem.
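[Eby's "known target codec" discipline can be enforced with an ordinary encode call at the boundary, so bad data fails where it enters rather than at output time. This sketch is an editorial addition; 'shift_jis' stands in for whatever the application's wire encoding happens to be:]

```python
def accept(s, target="shift_jis"):
    # Fail fast: raise at the point unencodable text enters the system,
    # instead of discovering it when the final output is encoded.
    s.encode(target)
    return s

accept("English and 日本語")    # ASCII and Japanese both fit in Shift JIS

try:
    accept("café")              # é has no Shift JIS encoding
except UnicodeEncodeError:
    print("caught bad data at the entry point")
```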
Re: [Python-Dev] bytes / unicode
On Fri, Jun 25, 2010 at 2:05 AM, Stephen J. Turnbull step...@xemacs.org wrote:

>> But join('x', 'y') -> 'x/y' and join(b'x', b'y') -> b'x/y' make sense
>> to me. So, actually, I *don't* understand what you mean by needing
>> LBYL.
>
> Consider docutils. Some folks assert that URIs *are* bytes and should
> be manipulated as such. So base URIs should be bytes.

I don't get what you are arguing against. Are you worried that if we
make URL code polymorphic that this will mean some code will treat URLs
as bytes, and that code will be incompatible with URLs as text? No one
is arguing we remove text support from any of these functions, only that
we allow bytes.

--
Ian Bicking | http://blog.ianbicking.org
Re: [Python-Dev] bytes / unicode
Ian Bicking writes:

> I don't get what you are arguing against. Are you worried that if we
> make URL code polymorphic that this will mean some code will treat
> URLs as bytes, and that code will be incompatible with URLs as text?
> No one is arguing we remove text support from any of these functions,
> only that we allow bytes.

No, I understand what Guido means by polymorphic. I'm arguing that, as I
understand one of Philip Eby's use cases, bytes is a misspelling of
validated and unicode is a misspelling of unvalidated. In case of some
kind of bug, polymorphic stdlib functions would allow propagation of
unvalidated/unicode within the validated zone, aka errors passing
silently.

Now that I understand that that use case doesn't actually care about
bytes vs. unicode *string* semantics at all, the argument becomes moot,
I guess.
Re: [Python-Dev] bytes / unicode
At 01:18 AM 6/26/2010 +0900, Stephen J. Turnbull wrote:

> It seems to me what is wanted here is something like Perl's taint
> mechanism, for *both* kinds of strings. Am I missing something?

You could certainly view it as a kind of tainting. The part where the
type would be bytes-based is indeed somewhat incidental to the actual
use case -- it's just that if you already have the bytes, and all you
want to do is tag them (e.g. the WSGI headers case), the extra encoding
step seems pointless.

A string coercion protocol (that would be used by .join(), .format(),
__contains__, __mod__, etc.) would allow you to do whatever sort of
tainted-string or tainted-bytes implementations one might wish to have.
I suppose that tainting user inputs (as in Perl) would be just as useful
an application of the same coercion protocol.

Actually, I have another use case for this custom string coercion, which
is that I once wrote a string subclass whose purpose was to track the
original file and line number of some text. Even though only my code was
manipulating the strings, it was very difficult to get the tainting to
work correctly without extreme care as to the string methods used. (For
example, I had to use string addition rather than %-formatting.)

> But with your architecture, it seems to me that you actually don't
> want polymorphic functions in the stdlib. You want the stdlib
> functions to be bytes-oriented if and only if they are reliable. (This
> is what I was saying to Guido elsewhere.)

I'm not sure I follow you. What I want is for the stdlib to create
stringlike objects of a type determined by the types of the inputs --
where the logic for deciding this coercion can be controlled by the
input objects' types, rather than putting this in the hands of the
stdlib function. And of course, this applies to non-stdlib functions,
too -- anything that simply manipulates user-defined string classes
should allow the user-defined classes to determine the coercion of the
result.

> BTW, this was a little unclear to me:
>
>> [Collisions will] be with other *unicode* strings. Ones coming from
>> other code, and literals embedded in the stdlib.
>
> What about the literals in the stdlib? Are you saying they contain
> invalid code points for your known output encoding? Or are you saying
> that with a non-polymorphic unicode stdlib, you get lots of false
> positives when combining with your validated bytes?

No, I mean that the current string coercion rules cause everything to be
converted to unicode, thereby discarding the tainting information, so to
speak. This applies equally to other tainting use cases, and other uses
for custom stringlike objects.
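[The difficulty Eby hit with his file/line-tracking subclass is easy to reproduce: only __add__/__radd__ can be intercepted by a subclass, while every other str operation silently returns a plain str. The class below is an editorial illustration, not his actual code:]

```python
class SourceStr(str):
    """Hypothetical subclass that wants to survive string operations."""
    def __add__(self, other):
        # Addition is one of the few hooks a subclass can override...
        return SourceStr(str(self) + other)

s = SourceStr("hello")

assert type(s + " world") is SourceStr  # __add__ preserves the subclass...
assert type(s.upper()) is str           # ...but ordinary methods drop it
assert type("%s!" % s) is str           # ...and so does %-formatting
```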
Re: [Python-Dev] bytes / unicode
P.J. Eby writes:

> it's just that if you already have the bytes, and all you want to do
> is tag them (e.g. the WSGI headers case), the extra encoding step
> seems pointless.

Well, I'll have to concede that unless and until I get involved in the
WSGI development effort. <wink>

>> But with your architecture, it seems to me that you actually don't
>> want polymorphic functions in the stdlib. You want the stdlib
>> functions to be bytes-oriented if and only if they are reliable.
>> (This is what I was saying to Guido elsewhere.)
>
> I'm not sure I follow you.

What I'm saying here is that if bytes are the signal of validity, and
the stdlib functions preserve validity, then it's better to have the
stdlib functions object to unicode data as an argument. Compare the
alternative: it returns a unicode object which might get passed around
for a while before one of your functions receives it and identifies it
as unvalidated data.

But you agree that there are better mechanisms for validation (although
not available in Python yet), so I don't see this as a potential
obstacle to polymorphism now.

> What I want is for the stdlib to create stringlike objects of a type
> determined by the types of the inputs --

In general this is a hard problem, though. Polymorphism, OK, one-way
tainting OK, but in general combining related types is pretty arbitrary,
and as in the encoded-bytes case, the result type often varies depending
on expectations of callers, not the types of the data.
Re: [Python-Dev] bytes / unicode
Guido van Rossum writes: For example: how we can make the suite of functions used for URL processing more polymorphic, so that each developer can choose for herself how URLs need to be treated in her application. While you have come down on the side of polymorphism (as opposed to separate functions), I'm a little nervous about it. Specifically, Philip Eby expressed a desire for earlier type errors, while polymorphism seems to ensure that you'll need to Look Before You Leap to get early error detection.
Re: [Python-Dev] bytes / unicode
On Tue, Jun 22, 2010 at 20:07, James Y Knight f...@fuhm.net wrote: Yeah. This is a real issue I have with the direction Python3 went: it pushes you into decoding everything to unicode early, even when you don't care -- Well, yes, maybe even if *you* don't care. But often the functions you need to call must care, and then you need to decode to unicode, even if you personally don't care. And in those cases, you should decode as early as possible. In the cases where neither you nor the functions you call care, then you don't have to decode, and you can happily pass binary data from one function to another. So this is not really a question of the direction Python 3 went. It's more a case that some methods that *could* do their transformations in a well defined way on bytes don't, and then force you to decode to unicode. But that's not a problem with direction, it's just a missing feature in the stdlib. -- Lennart Regebro: http://regebro.wordpress.com/ Python 3 Porting: http://python3porting.com/
Re: [Python-Dev] bytes / unicode
Lennart Regebro wrote: On Tue, Jun 22, 2010 at 20:07, James Y Knight f...@fuhm.net wrote: Yeah. This is a real issue I have with the direction Python3 went: it pushes you into decoding everything to unicode early, even when you don't care -- Well, yes, maybe even if *you* don't care. But often the functions you need to call must care, and then you need to decode to unicode, even if you personally don't care. And in those cases, you should decode as early as possible. In the cases where neither you nor the functions you call care, then you don't have to decode, and you can happily pass binary data from one function to another. So this is not really a question of the direction Python 3 went. It's more a case that some methods that *could* do their transformations in a well defined way on bytes don't, and then force you to decode to unicode. But that's not a problem with direction, it's just a missing feature in the stdlib. The discussion is showing that in at least a few application spaces, the stdlib should be able to work on both bytes and Unicode, preferably using the same interfaces via polymorphism, i.e. some_function(bytes) -> bytes, some_function(str) -> str. In Python 2 this partially works due to the automatic bytes -> str conversion (in some cases you get some_function(bytes) -> str), the codec base class implementations being a prime example. In Python 3, things have to be done explicitly, and I think we need to add a few helpers to make writing such str/bytes interfaces easier. We've already had some suggestions in that area, but probably need to collect a few more ideas based on real-life porting attempts. I'd like to make this a topic at the upcoming language summit in Birmingham, if Michael agrees. -- Marc-Andre Lemburg, eGenix.com
Re: [Python-Dev] bytes / unicode
On 24/06/2010 11:58, M.-A. Lemburg wrote: Lennart Regebro wrote: On Tue, Jun 22, 2010 at 20:07, James Y Knight f...@fuhm.net wrote: Yeah. This is a real issue I have with the direction Python3 went: it pushes you into decoding everything to unicode early, even when you don't care -- Well, yes, maybe even if *you* don't care. But often the functions you need to call must care, and then you need to decode to unicode, even if you personally don't care. And in those cases, you should decode as early as possible. In the cases where neither you nor the functions you call care, then you don't have to decode, and you can happily pass binary data from one function to another. So this is not really a question of the direction Python 3 went. It's more a case that some methods that *could* do their transformations in a well defined way on bytes don't, and then force you to decode to unicode. But that's not a problem with direction, it's just a missing feature in the stdlib. The discussion is showing that in at least a few application spaces, the stdlib should be able to work on both bytes and Unicode, preferably using the same interfaces via polymorphism, i.e. some_function(bytes) -> bytes, some_function(str) -> str. In Python 2 this partially works due to the automatic bytes -> str conversion (in some cases you get some_function(bytes) -> str), the codec base class implementations being a prime example. In Python 3, things have to be done explicitly, and I think we need to add a few helpers to make writing such str/bytes interfaces easier. We've already had some suggestions in that area, but probably need to collect a few more ideas based on real-life porting attempts. I'd like to make this a topic at the upcoming language summit in Birmingham, if Michael agrees. Yep, it sounds like a great topic for the language summit.
Michael -- http://www.ironpythoninaction.com/
Re: [Python-Dev] bytes / unicode
On Thu, Jun 24, 2010 at 1:12 AM, Stephen J. Turnbull step...@xemacs.org wrote: Guido van Rossum writes: For example: how we can make the suite of functions used for URL processing more polymorphic, so that each developer can choose for herself how URLs need to be treated in her application. While you have come down on the side of polymorphism (as opposed to separate functions), I'm a little nervous about it. Specifically, Philip Eby expressed a desire for earlier type errors, while polymorphism seems to ensure that you'll need to Look Before You Leap to get early error detection. Understood, but both the majority of str/bytes methods and several existing APIs (e.g. many in the os module, like os.listdir()) do it this way. Also, IMO a polymorphic function should *not* accept *mixed* bytes/text input -- join('x', b'y') should be rejected. But join('x', 'y') -> 'x/y' and join(b'x', b'y') -> b'x/y' make sense to me. So, actually, I *don't* understand what you mean by needing LBYL. --Guido van Rossum (python.org/~guido)
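A minimal sketch of the all-text-or-all-bytes policy Guido describes (illustrative only, not the actual posixpath implementation): same-type arguments work, and a mixture raises TypeError early, which also addresses the earlier-type-errors concern:

```python
def join(*parts):
    # All arguments must share one type: all str, or all bytes.
    kind = str if isinstance(parts[0], str) else bytes
    if not all(isinstance(p, kind) for p in parts):
        raise TypeError("can't mix str and bytes path components")
    sep = '/' if kind is str else b'/'
    return sep.join(parts)

assert join('x', 'y') == 'x/y'
assert join(b'x', b'y') == b'x/y'
```

Calling join('x', b'y') raises immediately at the boundary, instead of producing a mixed-type result that fails somewhere downstream.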
Re: [Python-Dev] bytes / unicode
On Fri, Jun 25, 2010 at 12:33 AM, Guido van Rossum gu...@python.org wrote: Also, IMO a polymorphic function should *not* accept *mixed* bytes/text input -- join('x', b'y') should be rejected. But join('x', 'y') -> 'x/y' and join(b'x', b'y') -> b'x/y' make sense to me. A policy of allowing arguments to be either str or bytes, but not a mixture, actually avoids one of the more painful aspects of the 2.x "promote mixed operations to unicode" approach. Specifically, you either had to scan all the arguments up front to check for unicode, or else you had to stop what you were doing and start again with the unicode version if you encountered unicode partway through. Neither was particularly nice to implement. As you noted elsewhere, literals and string methods are still likely to be a major sticking point with that approach - common operations like ''.join(seq) and b''.join(seq) aren't polymorphic, so functions that use them won't be polymorphic either. (It's only the str -> unicode promotion behaviour in 2.x that works around this problem there.) Would it be heretical to suggest that sum() be allowed to work on strings to at least eliminate ''.join() as something that breaks bytes processing? It already works for bytes, although it then fails with a confusing message for bytearray:

>>> sum(b"a b c".split(), b'')
b'abc'
>>> sum(bytearray(b"a b c").split(), bytearray(b''))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sum() can't sum bytes [use b''.join(seq) instead]
>>> sum("a b c".split(), '')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sum() can't sum strings [use ''.join(seq) instead]

Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] bytes / unicode
On Thu, Jun 24, 2010 at 8:25 AM, Nick Coghlan ncogh...@gmail.com wrote: On Fri, Jun 25, 2010 at 12:33 AM, Guido van Rossum gu...@python.org wrote: Also, IMO a polymorphic function should *not* accept *mixed* bytes/text input -- join('x', b'y') should be rejected. But join('x', 'y') -> 'x/y' and join(b'x', b'y') -> b'x/y' make sense to me. A policy of allowing arguments to be either str or bytes, but not a mixture, actually avoids one of the more painful aspects of the 2.x "promote mixed operations to unicode" approach. Specifically, you either had to scan all the arguments up front to check for unicode, or else you had to stop what you were doing and start again with the unicode version if you encountered unicode partway through. Neither was particularly nice to implement. Right. Polymorphic functions should *not* allow mixing text and bytes. It's all text or all bytes. As you noted elsewhere, literals and string methods are still likely to be a major sticking point with that approach - common operations like ''.join(seq) and b''.join(seq) aren't polymorphic, so functions that use them won't be polymorphic either. (It's only the str -> unicode promotion behaviour in 2.x that works around this problem there.) Would it be heretical to suggest that sum() be allowed to work on strings to at least eliminate ''.join() as something that breaks bytes processing? It already works for bytes, although it then fails with a confusing message for bytearray:

>>> sum(b"a b c".split(), b'')
b'abc'
>>> sum(bytearray(b"a b c").split(), bytearray(b''))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sum() can't sum bytes [use b''.join(seq) instead]
>>> sum("a b c".split(), '')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sum() can't sum strings [use ''.join(seq) instead]

I don't think we should abuse sum for this.
A simple idiom to get the *empty* string of a particular type is x[:0], so you could write something like this to concatenate a list of strings or bytes: xs[:0].join(xs). Note that if xs is empty we wouldn't know what to do anyway, so this should be disallowed. --Guido van Rossum (python.org/~guido)
Re: [Python-Dev] bytes / unicode
P.J. Eby wrote: [...] stdlib constants are almost always ASCII, and the main use cases for ebytes would involve ascii-extended encodings.) Then, how about a new ascii string literal? This would produce a special kind of string that would coerce to a normal string when mixed with a str, and to bytes using the ascii codec when mixed with bytes. Then you could write a'/'.join((base, path)) and not worry whether base and path are both str, or both bytes (mixed being of course forbidden). B.
Re: [Python-Dev] bytes / unicode
At 05:12 PM 6/24/2010 +0900, Stephen J. Turnbull wrote: Guido van Rossum writes: For example: how we can make the suite of functions used for URL processing more polymorphic, so that each developer can choose for herself how URLs need to be treated in her application. While you have come down on the side of polymorphism (as opposed to separate functions), I'm a little nervous about it. Specifically, Philip Eby expressed a desire for earlier type errors, while polymorphism seems to ensure that you'll need to Look Before You Leap to get early error detection. This doesn't have to be in the functions; it can be in the *types*. Mixed-type string operations have to do type checking and upcasting already, but if the protocol were open, you could make an encoded-bytes type that would handle the error checking. (Btw, in some earlier emails, Stephen, you implied that this could be fixed with codecs -- but it can't, because the problem isn't with the bytes containing invalid Unicode, it's with the Unicode containing invalid bytes -- i.e., characters that can't be encoded to the ultimate codec target.)
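As a toy sketch of such an open mixing protocol (entirely hypothetical; `ebytes` here is not a real type, just an illustration of the idea), bytes tagged with a target encoding can do the error checking at mixing time, failing early when the str side can't be represented in that encoding:

```python
class ebytes(bytes):
    """Toy: bytes that remember the encoding they must stay valid in."""
    def __new__(cls, data, encoding):
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def __add__(self, other):
        if isinstance(other, str):
            # Early error: raises UnicodeEncodeError right here if `other`
            # contains characters the target encoding can't represent.
            other = other.encode(self.encoding)
        return ebytes(bytes(self) + other, self.encoding)

hdr = ebytes(b'Location: ', 'ascii')
assert hdr + '/home' == b'Location: /home'
```

Appending 'é' to an ASCII-tagged ebytes would raise at the point of mixing, which is exactly the early type/encoding error PJE is asking for.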
Re: [Python-Dev] bytes / unicode
On Fri, Jun 25, 2010 at 3:07 AM, P.J. Eby p...@telecommunity.com wrote: (Btw, in some earlier emails, Stephen, you implied that this could be fixed with codecs -- but it can't, because the problem isn't with the bytes containing invalid Unicode, it's with the Unicode containing invalid bytes -- i.e., characters that can't be encoded to the ultimate codec target.) That's what the surrogateescape error handler is for though - it will happily accept mojibake on input (putting invalid bytes into the PUA), and happily generate mojibake on output (recreating the invalid bytes from the PUA) as well. Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
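A small demonstration of the surrogateescape round trip (in CPython the smuggled bytes are carried as lone surrogate code points):

```python
# Bytes that are invalid UTF-8 survive a decode/encode round trip
# via the surrogateescape error handler.
raw = b'caf\xe9'  # Latin-1 encoded, not valid UTF-8
text = raw.decode('utf-8', errors='surrogateescape')
assert text == 'caf\udce9'  # the invalid byte became U+DCE9
assert text.encode('utf-8', errors='surrogateescape') == raw
```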
Re: [Python-Dev] bytes / unicode
On Fri, Jun 25, 2010 at 1:41 AM, Guido van Rossum gu...@python.org wrote: I don't think we should abuse sum for this. A simple idiom to get the *empty* string of a particular type is x[:0] so you could write something like this to concatenate a list or strings or bytes: xs[:0].join(xs). Note that if xs is empty we wouldn't know what to do anyway so this should be disallowed. That's a good trick, although there's a [0] missing from your join example (type(xs[0])() is another way to spell the same idea, but the subscripting version would likely be faster since it skips the builtin lookup). Promoting that over explicit use of empty str and bytes literals is probably step 1 in eliminating gratuitous breakage of bytes/str polymorphism (this trick also has the benefit of working with non-builtin character sequence types). Use of non-empty bytes/str literals is going to be harder to handle - actually trying to apply a polymorphic philosophy to the Python 3 URL parsing libraries may be a good way to learn more on that front. Cheers, Nick. P.S. I'm off to Sydney for PyconAU this evening, so I'm not sure how much time I'll get to follow python-dev until next week. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
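With Nick's correction applied (the empty string comes from the first *element*, xs[0][:0], since xs itself is a list), the idiom looks like this; `concat` is just an illustrative name:

```python
def concat(xs):
    # Derive the empty str/bytes from the first element's type,
    # so one function concatenates both without hard-coded literals.
    if not xs:
        raise ValueError("can't infer the result type from an empty sequence")
    return xs[0][:0].join(xs)

assert concat(['a', 'b', 'c']) == 'abc'
assert concat([b'a', b'b', b'c']) == b'abc'
```

As Nick notes, this also works for any third-party character sequence type whose slices and join() behave like the builtins'.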
Re: [Python-Dev] bytes / unicode
Ian Bicking writes: Just for perspective, I don't know if I've ever wanted to deal with a URL like that. Ditto, I do many times a day for Japanese media sites and Wikipedia. I know how it is supposed to work, and I know what a browser does with that, but so many tools will clean that URL up *or* won't be able to deal with it at all that it's not something I'll be passing around. I'm not suggesting that is something you want to be passing around; it's a presentation form, and I prefer that the internal form use Unicode. While it's nice to be correct about encodings, sometimes it is impractical. And it is far nicer to avoid the situation entirely. But you cannot avoid it entirely. Processing bytes means you are assuming ASCII compatibility. Granted, this is a pretty good assumption, especially if you got the bytes off the wire, but it's not universally so. Maybe it's a YAGNI, but one reason I prefer the decode-process-encode paradigm is that the choice of codec is a specification of the assumptions you're making about encoding. So the Know-Nothing codec described above assumes just enough ASCII compatibility to parse the scheme. You could also have codecs which assume just enough ASCII compatibility to parse a hierarchical scheme, etc. That is, decoding content you don't care about isn't just inefficient, it's complicated and can introduce errors. That depends on the codec(s) used. Similarly I'd expect (from experience) a programmer using Python to want to take the same approach, sticking with unencoded data in nearly all situations. Indeed, a programmer using Python 2 would want to do so, because all her literal strings are bytes by default (ie, if she doesn't mark them with `u'), and interactive input is, too. This is no longer so obvious in Python 3, which takes the attitude that things that are expected to be human-readable should be processed as str.
The obvious example in URI space is the file:/// URL, which you'll typically build up from a user string or a file browser, which will call the os.path stuff which returns str. Text editors and viewers will also use str for their buffers, and if they provide a way to fish out URIs for their users, they'll probably return str. I won't pretend to judge the relative importance of such use cases. But use cases for urllib which naturally favor str until you put the URI on the wire do exist, as does the debugging presentation aspect.
Re: [Python-Dev] bytes / unicode
James Y Knight writes: The surrogateescape method is a nice workaround for this, but I can't help thinking that it might've been better to just treat stuff as possibly-invalid-but-probably-utf8 byte-strings from input, through processing, to output. This is the world we already have, modulo s/utf8/ascii + random GR charset/. It doesn't work, and it can't, in Japan or China or Korea, and probably not in Russia or Kazakhstan, for some time yet. That's not to say that byte-oriented processing doesn't have its place. And in many cases it's reasonable (but not secure or bulletproof!) to assume ASCII compatibility of the byte stream, passing through syntactically unimportant bytes verbatim. Syntactic analysis of such streams will surely have a lot in common with that for text streams, so the same tools should be available. (That's the point of Guido's endorsement of polymorphism, AIUI.) But it's just not reasonable to assume that will work in a context where text streams from various sources are mixed with byte streams. In that case, the byte streams need to be converted to text before mixing. (You can't do it the other way around because there is no guarantee that the text is compatible with the current encoding of the byte stream, nor that all the byte streams have the same encoding.) We do need str-based implementations of modules like urllib.
Re: [Python-Dev] bytes / unicode
Nick Coghlan wrote: On Wed, Jun 23, 2010 at 4:09 AM, M.-A. Lemburg m...@egenix.com wrote: It would be great if we could have something like the above as builtin method: x.split(''.as(x)) As per my other message, another possible (and reasonably intuitive) spelling would be: x.split(x.coerce('')) You are right: there are two ways to adapt one object to another. You can either adapt object 1 to object 2, or object 2 to object 1. This is what the Python 2 coercion protocol does for operators. I just wanted to avoid using that term, since Python 3 removes the coercion protocol. Writing it as a helper function is also possible, although it may be trickier to remember the correct argument ordering:

    def coerce_to(target, obj, encoding=None, errors='surrogateescape'):
        if isinstance(obj, type(target)):
            return obj
        if encoding is None:
            encoding = sys.getdefaultencoding()
        try:
            convert = obj.decode
        except AttributeError:
            convert = obj.encode
        return convert(encoding, errors)

    x.split(coerce_to(x, ''))

Perhaps something to discuss at the language summit at EuroPython. Too bad we can't add such porting enhancements to Python 2 anymore. Well, we can if we really want to, it just entails convincing Benjamin to reschedule the 2.7 final release. Given the UserDict/ABC/old-style classes issue, there's a fair chance there's going to be at least one more 2.7 RC anyway. That said, since this kind of coercion can be done in a helper function, that should be adequate for the 2.x to 3.x conversion case (for 2.x, the helper function can be defined to just return the second argument since bytes and str are the same type, while the 3.x version would look something like the code above). True. Note that the point of using a builtin method was to get better performance. Such type adaptations are often needed in loops, so adding a few extra Python function calls just to convert a str object to a bytes object or vice-versa is a bit too much overhead.
-- Marc-Andre Lemburg, eGenix.com
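For reference, a self-contained, runnable rendering of the helper sketched above (Python 3), with a quick check in both directions; str has no .decode, so the AttributeError fallback selects .encode for str input:

```python
import sys

def coerce_to(target, obj, encoding=None, errors='surrogateescape'):
    """Adapt obj (str or bytes) to the type of target."""
    if isinstance(obj, type(target)):
        return obj
    if encoding is None:
        encoding = sys.getdefaultencoding()
    try:
        convert = obj.decode   # bytes -> str
    except AttributeError:
        convert = obj.encode   # str -> bytes
    return convert(encoding, errors)

# The separator is adapted to whichever type x happens to be.
assert 'a b'.split(coerce_to('a b', b' ')) == ['a', 'b']
assert b'a b'.split(coerce_to(b'a b', ' ')) == [b'a', b'b']
```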
Re: [Python-Dev] bytes / unicode
On Wed, Jun 23, 2010 at 7:18 PM, M.-A. Lemburg m...@egenix.com wrote: Note that the point of using a builtin method was to get better performance. Such type adaptations are often needed in loops, so adding a few extra Python function calls just to convert a str object to a bytes object or vice-versa is a bit too much overhead. I actually agree with that, I just think we need more real world experience as to what works with the Python 3 text model before we start messing with the APIs for the builtin objects (fair point that 'coerce' is a loaded term given the existence of the old coercion protocol; it's the right word for the task, though). One of the key points coming out of this thread (to my mind) is the lack of a Text ABC or other way of making an object that can be passed to functions expecting a str instance with a reasonable expectation of having it work. Are there some core string capabilities that can be identified and then expanded out to a full str-compatible API? (i.e. something along the lines of what collections.MutableMapping now provides for dict-alikes). However, even if something like that was added, PJE is correct in pointing out that builtin strings still don't play well with others in many cases (usually due to underlying optimisations or other sound reasons, but perhaps sometimes gratuitously). Most of the string binary operations can be dealt with through their reflected forms, but str.__mod__ will never return NotImplemented, __contains__ has no reflected form, and the actual method calls are of course right out (e.g. the arguments to str.join() or str.split() calls have no ability to affect the type of the result). Third party number implementations couldn't provide comparable functionality to builtin int and long objects until the __index__ protocol was added.
Perhaps PJE is right that what this is really crying out for is a way to have third party real string implementations, so that there can actually be genuine experimentation in the Unicode handling space outside the language core (comparable to the difference between the "you can turn me into an int" __int__ method and the "I am an int" equivalent __index__ method). That may be tapping in a nail with a sledgehammer (and would raise significant moratorium questions if pursued further), but I think it's a valid question to at least ask. Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] bytes / unicode
At 08:34 PM 6/22/2010 -0400, Glyph Lefkowitz wrote: I suspect the practical problem here is that there's no CharacterString ABC That, and the absence of a string coercion protocol so that mixing your custom string with standard strings will do the right thing for your intended use.
Re: [Python-Dev] bytes / unicode
Stephen J. Turnbull wrote: We do need str-based implementations of modules like urllib. Why would that be? URLs aren't text, and never will be. The fact that to the eye they may seem to be text-ish doesn't make them text. This *is* a case where "don't make me think" is a losing proposition: programmers who work with URLs in any non-opaque way as text are eventually going to be bitten by this issue no matter how hard we wave our hands. Tres. -- Tres Seaver, Palladion Software
Re: [Python-Dev] bytes / unicode
On Wed, Jun 23, 2010 at 8:30 AM, Tres Seaver tsea...@palladion.com wrote: Stephen J. Turnbull wrote: We do need str-based implementations of modules like urllib. Why would that be? URLs aren't text, and never will be. The fact that to the eye they may seem to be text-ish doesn't make them text. This *is* a case where "don't make me think" is a losing proposition: programmers who work with URLs in any non-opaque way as text are eventually going to be bitten by this issue no matter how hard we wave our hands. This has been asserted and contested several times now, and I don't see the two positions getting any closer. So I propose that we drop the "are URLs text or bytes" discussion and try to find something more pragmatic to discuss. For example: how we can make the suite of functions used for URL processing more polymorphic, so that each developer can choose for herself how URLs need to be treated in her application. --Guido van Rossum (python.org/~guido)
Re: [Python-Dev] bytes / unicode
On Jun 23, 2010, at 08:43 AM, Guido van Rossum wrote: So I propose that we drop the "are URLs text or bytes" discussion and try to find something more pragmatic to discuss. email has exactly the same question, and the answer is yes. <wink> For example: how we can make the suite of functions used for URL processing more polymorphic, so that each developer can choose for herself how URLs need to be treated in her application. I think email package hackers should watch this effort closely. RDM has written some stuff up on how we think we're going to handle this, though it's probably pretty email package specific. Maybe there's a better, general, or conventional approach lurking around somewhere. http://wiki.python.org/moin/Email%20SIG -Barry
Re: [Python-Dev] bytes / unicode
Tres Seaver tsea...@palladion.com wrote: Stephen J. Turnbull wrote: We do need str-based implementations of modules like urllib. Why would that be? URLs aren't text, and never will be. The fact that to the eye they may seem to be text-ish doesn't make them text. URLs are exactly text (strings, representable as Unicode strings in Py3K), and were designed as such from the start. The fact that some of the things tunneled or carried in URLs are string representations of non-string data shouldn't obscure that point. They're not text-ish, they're text. They're not opaque, either; they break down in well-specified ways, mainly into strings. The trouble comes in when we try to go beyond the spec, or handle things that don't conform to the spec. Sure, a path component of a URI might actually be a %-escaped sequence of arbitrary bytes, even bytes that don't represent a string in any known encoding, but that's only *after* reversing the %-escapes, which should happen in a scheme-specific piece of code, not in generic URL parsing or manipulation. Bill
Re: [Python-Dev] bytes / unicode
On Wed, Jun 23, 2010 at 10:30 AM, Tres Seaver tsea...@palladion.com wrote: Stephen J. Turnbull wrote: We do need str-based implementations of modules like urllib. Why would that be? URLs aren't text, and never will be. The fact that to the eye they may seem to be text-ish doesn't make them text. This *is* a case where "don't make me think" is a losing proposition: programmers who work with URLs in any non-opaque way as text are eventually going to be bitten by this issue no matter how hard we wave our hands. HTML is text, and URLs are embedded in that text, so it's easy to get a URL that is text. Though, with a little testing, I notice that text alone can't tell you what the right URL really is (at least the intended URL when unsafe characters are embedded in HTML). To test I created two pages, one in Latin-1 and another in UTF-8, and put in the link: ./test.html?param=Réunion On the Latin-1 page it created a link to test.html?param=R%E9union and on the UTF-8 page it created a link to test.html?param=R%C3%A9union (the second link displays in the URL bar as test.html?param=Réunion but copies with percent encoding). Though if you link to ./Réunion.html then both pages create UTF-8 links. And both pages also link http://Réunion.com to http://xn--runion-bva.com/. So really neither bytes nor text works completely; query strings receive the encoding of the page, which would be handled transparently if you worked on the page's bytes. Path and domain are consistently encoded with UTF-8 and punycode respectively and so would be handled best when treated as text. And of course if you are a page with a non-ASCII-compatible encoding you really must handle encodings before the URL is sensible. Another issue here is that there's no encoding for turning a URL into bytes if the URL is not already ASCII.
A proper way to encode a URL would be: (Totally as an aside, as I remind myself of new module names I notice it's not easy to google specifically for Python 3 docs, e.g. "python 3 urlsplit" gives me 2.6 docs)

    from urllib.parse import urlsplit, urlunsplit
    import encodings.idna

    def encode_http_url(url, page_encoding='ASCII', errors='strict'):
        scheme, netloc, path, query, fragment = urlsplit(url)
        scheme = scheme.encode('ASCII', errors)
        auth = port = None
        if '@' in netloc:
            auth, netloc = netloc.split('@', 1)
        if ':' in netloc:
            netloc, port = netloc.split(':', 1)
        netloc = encodings.idna.ToASCII(netloc)
        if port:
            netloc = netloc + b':' + port.encode('ASCII', errors)
        if auth:
            netloc = auth.encode('UTF-8', errors) + b'@' + netloc
        path = path.encode('UTF-8', errors)
        query = query.encode(page_encoding, errors)
        fragment = fragment.encode('UTF-8', errors)
        return urlunsplit_bytes((scheme, netloc, path, query, fragment))

Where urlunsplit_bytes handles bytes (urlunsplit does not). It's helpful for me at least to look at that code specifically:

    def urlunsplit(components):
        scheme, netloc, url, query, fragment = components
        if netloc or (scheme and scheme in uses_netloc and url[:2] != '//'):
            if url and url[:1] != '/':
                url = '/' + url
            url = '//' + (netloc or '') + url
        if scheme:
            url = scheme + ':' + url
        if query:
            url = url + '?' + query
        if fragment:
            url = url + '#' + fragment
        return url

In this case it really would be best to have Python 2's system where things are coerced to ASCII implicitly. Or, more specifically, if all those string literals in that routine could be implicitly converted to bytes using ASCII. Conceptually I think this is reasonable, as for URLs (at least with HTTP, but in practice I think this applies to all URLs) the ASCII bytes really do have meaning. That is, '/' (*in the context of urlunsplit*) really is \x2f specifically. Or another example, making a GET request really means sending the bytes \x47\x45\x54 and there is no other set of bytes that has that meaning.
The WebSockets specification for instance defines things like colon: http://tools.ietf.org/html/draft-hixie-thewebsocketprotocol-76#page-5 -- in an earlier version they even used bytes to describe HTTP ( http://tools.ietf.org/html/draft-hixie-thewebsocketprotocol-54#page-13), though this annoyed many people.

-- Ian Bicking | http://blog.ianbicking.org
Re: [Python-Dev] bytes / unicode
Guido van Rossum gu...@python.org wrote: So I propose that we drop the discussion "are URLs text or bytes" and try to find something more pragmatic to discuss. For example: how we can make the suite of functions used for URL processing more polymorphic, so that each developer can choose for herself how URLs need to be treated in her application.

While I agree with "find something more pragmatic to discuss", it also seems to me that introducing polymorphic URL processing might make things more confusing and error-prone. The bigger problem seems to be that we're revisiting the design discussion about urllib.parse from the summer of 2008. See http://bugs.python.org/issue3300 if you want to recall how we hashed this out 2 years ago. I didn't particularly like that design, but I had to go off on vacation :-), and things got settled while I was away. I haven't heard much from Matt Giuca since he stopped by and lobbed that patch into the standard library. But since Guido is the one who settled it, why are we talking about it again?

Bill
Re: [Python-Dev] bytes / unicode
Oops, I forgot some important quoting (important for the algorithm, maybe not actually for the discussion)...

    from urllib.parse import urlsplit, urlunsplit
    import encodings.idna

    # urllib.parse.quote both always returns str, and is not as
    # conservative in quoting as required here...
    def quote_unsafe_bytes(b):
        result = []
        for c in b:
            if c < 0x20 or c >= 0x80:
                result.extend(('%%%02X' % c).encode('ASCII'))
            else:
                result.append(c)
        return bytes(result)

    def encode_http_url(url, page_encoding='ASCII', errors='strict'):
        scheme, netloc, path, query, fragment = urlsplit(url)
        scheme = scheme.encode('ASCII', errors)
        auth = port = None
        if '@' in netloc:
            auth, netloc = netloc.split('@', 1)
        if ':' in netloc:
            netloc, port = netloc.split(':', 1)
        netloc = encodings.idna.ToASCII(netloc)
        if port:
            netloc = netloc + b':' + port.encode('ASCII', errors)
        if auth:
            netloc = quote_unsafe_bytes(auth.encode('UTF-8', errors)) + b'@' + netloc
        path = quote_unsafe_bytes(path.encode('UTF-8', errors))
        query = quote_unsafe_bytes(query.encode(page_encoding, errors))
        fragment = quote_unsafe_bytes(fragment.encode('UTF-8', errors))
        return urlunsplit_bytes((scheme, netloc, path, query, fragment))

-- Ian Bicking | http://blog.ianbicking.org
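Restated standalone (with the `<` and `>=` comparisons written out, since HTML-escaping in archives tends to eat them), the quote_unsafe_bytes helper behaves like this:

```python
def quote_unsafe_bytes(b):
    # Percent-encode control bytes and anything non-ASCII;
    # pass printable ASCII bytes through unchanged.
    result = []
    for c in b:
        if c < 0x20 or c >= 0x80:
            result.extend(('%%%02X' % c).encode('ASCII'))
        else:
            result.append(c)
    return bytes(result)

print(quote_unsafe_bytes('café'.encode('utf-8')))  # b'caf%C3%A9'
print(quote_unsafe_bytes(b'abc'))                  # b'abc'
```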
Re: [Python-Dev] bytes / unicode
On Jun 22, 2010, at 8:57 PM, Robert Collins wrote: bzr has a cache of decoded strings in it precisely because decode is slow. We accept slowness encoding to the user's locale because that's typically much less data to examine than we've examined while generating the commit/diff/whatever. We also face memory pressure on a regular basis, and that has been, at least partly, due to UCS4 - our translation cache helps there because we have less duplicate UCS4 strings.

Thanks for setting the record straight - apologies if I missed this earlier in the thread. It does seem vaguely familiar.
Re: [Python-Dev] bytes / unicode
Bill Janssen wrote: The bigger problem seems to be that we're revisiting the design discussion about urllib.parse from the summer of 2008. See http://bugs.python.org/issue3300 if you want to recall how we hashed this out 2 years ago. I didn't particularly like that design, but I had to go off on vacation :-), and things got settled while I was away. I haven't heard much from Matt Giuca since he stopped by and lobbed that patch into the standard library. But since Guido is the one who settled it, why are we talking about it again?

Perhaps such decisions need revisiting in light of subsequent experience / pain / learning. E.g.:

- the repeated inability of the web-sig to converge on appropriate semantics for a Python3-compatible version of the WSGI spec;
- the subsequent quirkiness of the Python3 wsgiref implementation;
- the breakage in cgi.py which prevents handling file uploads in a web application;
- the slow adoption / porting rate of major web frameworks and libraries to Python 3.

Tres.

-- Tres Seaver +1 540-429-0999 tsea...@palladion.com Palladion Software "Excellence by Design" http://palladion.com
Re: [Python-Dev] bytes / unicode
On Wed, 23 Jun 2010 14:23:33 -0400 Tres Seaver tsea...@palladion.com wrote: Perhaps such decisions need revisiting in light of subsequent experience / pain / learning. E.g.: - the repeated inability of the web-sig to converge on appropriate semantics for a Python3-compatible version of the WSGI spec; - the subsequent quirkiness of the Python3 wsgiref implementation;

The way wsgiref was adapted is admittedly suboptimal. It was totally broken at first, and PJE didn't want to look very deeply into it. We therefore had to settle on a series of small modifications that seemed rather reasonable, but without any in-depth discussion of what WSGI had to look like under Python 3 (since it was not our job and responsibility). Therefore, I don't think wsgiref should be taken as a guide to what a cleaned up, Python 3-specific WSGI must look like.

- the slow adoption / porting rate of major web frameworks and libraries to Python 3.

Some of the major web frameworks and libraries have a ton of dependencies, which would explain why they really haven't bothered yet. I don't think you can claim, though, that Python 3 makes things significantly harder for these frameworks. The proof is that many of them already give the user unicode strings in Python 2.x. They must have somehow got the decoding right.

Regards

Antoine.
Re: [Python-Dev] bytes / unicode
On Wed, Jun 23, 2010 at 09:36:45PM +0200, Antoine Pitrou wrote: On Wed, 23 Jun 2010 14:23:33 -0400 Tres Seaver tsea...@palladion.com wrote: - the slow adoption / porting rate of major web frameworks and libraries to Python 3. Some of the major web frameworks and libraries have a ton of dependencies, which would explain why they really haven't bothered yet. I don't think you can claim, though, that Python 3 makes things significantly harder for these frameworks. The proof is that many of them already give the user unicode strings in Python 2.x. They must have somehow got the decoding right.

Note that this assumption seems optimistic to me. I started talking to Graham Dumpleton, author of mod_wsgi, a couple years back because mod_wsgi and paste do decoding of bytes to unicode at different layers, which caused problems for application level code that should otherwise run fine when being served by mod_wsgi or paste httpserver. That was the beginning of Graham starting to talk about what the wsgi spec really should look like under python3 instead of the broken way that the appendix to the current wsgi spec states.

-Toshio
Re: [Python-Dev] bytes / unicode
On Wed, 23 Jun 2010 17:30:22 -0400 Toshio Kuratomi a.bad...@gmail.com wrote: Note that this assumption seems optimistic to me. I started talking to Graham Dumpleton, author of mod_wsgi a couple years back because mod_wsgi and paste do decoding of bytes to unicode at different layers which caused problems for application level code that should otherwise run fine when being served by mod_wsgi or paste httpserver. That was the beginning of Graham starting to talk about what the wsgi spec really should look like under python3 instead of the broken way that the appendix to the current wsgi spec states.

Ok, but the reason would be that the WSGI spec is broken. Not Python 3 itself.

Regards

Antoine.
Re: [Python-Dev] bytes / unicode
On Wed, Jun 23, 2010 at 11:35:12PM +0200, Antoine Pitrou wrote: On Wed, 23 Jun 2010 17:30:22 -0400 Toshio Kuratomi a.bad...@gmail.com wrote: Note that this assumption seems optimistic to me. I started talking to Graham Dumpleton, author of mod_wsgi a couple years back because mod_wsgi and paste do decoding of bytes to unicode at different layers which caused problems for application level code that should otherwise run fine when being served by mod_wsgi or paste httpserver. That was the beginning of Graham starting to talk about what the wsgi spec really should look like under python3 instead of the broken way that the appendix to the current wsgi spec states. Ok, but the reason would be that the WSGI spec is broken. Not Python 3 itself.

Agreed. Neither python2 nor python3 is broken. It's the wsgi spec and the implementation of that spec where things fall down. From your first post, I thought you were claiming that python3 was broken since web frameworks got decoding right on python2 and I just wanted to defend python3 by showing that python2 wasn't all sunshine and roses.

-Toshio
Re: [Python-Dev] bytes / unicode
On Tue, Jun 22, 2010 at 11:58:57AM +0900, Stephen J. Turnbull wrote: Toshio Kuratomi writes: One comment here -- you can also have uri's that aren't decodable into their true textual meaning using a single encoding. Apache will happily serve out uris that have utf-8, shift-jis, and euc-jp components inside of their path but the textual representation that was intended will be garbled (or be represented by escaped byte sequences). For that matter, apache will serve requests that have no true textual representation as it is working on the byte level rather than the character level. Sure. I've never seen that combination, but I have seen Shift JIS and KOI8-R in the same path. But in that case, just using 'latin-1' as the encoding allows you to use the (unicode) string operations internally, and then spew your mess out into the world for someone else to clean up, just as using bytes would.

This is true. I'm giving this as a real-world counter example to the assertion that URIs are text. In fact, I think you're confusing things a little by asserting that the RFC says that URIs are text. I'll address that in two sections down.

So a complete solution really should allow the programmer to pass in uris as bytes when the programmer knows that they need it.

Other than passing bytes into a constructor, I would argue if a complete solution requires, eg, an interface that allows urljoin(base, subdir) where the types of base and subdir are not required to match, then it doesn't belong in the stdlib. For stdlib usage, that's premature optimization IMO.

I'll definitely buy that. Would urljoin(b_base, b_subdir) => bytes and urljoin(u_base, u_subdir) => unicode be acceptable though? (I think, given other options, I'd rather see two separate functions, though. It seems more discoverable and less prone to taking bad input some of the time to have two functions that clearly only take one type of data apiece.)
The RFC says that URIs are text, and therefore they can (and IMO should) be operated on as text in the stdlib.

If I'm reading the RFC correctly, you're actually operating on two different levels here. Here's the section 2 that you quoted earlier, now in its entirety::

    2. Characters

    The URI syntax provides a method of encoding data, presumably for the
    sake of identifying a resource, as a sequence of characters. The URI
    characters are, in turn, frequently encoded as octets for transport or
    presentation. This specification does not mandate any particular
    character encoding for mapping between URI characters and the octets
    used to store or transmit those characters. When a URI appears in a
    protocol element, the character encoding is defined by that protocol;
    without such a definition, a URI is assumed to be in the same character
    encoding as the surrounding text.

    The ABNF notation defines its terminal values to be non-negative
    integers (codepoints) based on the US-ASCII coded character set
    [ASCII]. Because a URI is a sequence of characters, we must invert
    that relation in order to understand the URI syntax. Therefore, the
    integer values used by the ABNF must be mapped back to their
    corresponding characters via US-ASCII in order to complete the syntax
    rules.

    A URI is composed from a limited set of characters consisting of
    digits, letters, and a few graphic symbols. A reserved subset of those
    characters may be used to delimit syntax components within a URI while
    the remaining characters, including both the unreserved set and those
    reserved characters not acting as delimiters, define each component's
    identifying data.

So here's some data that matches those terms up to actual steps in the process::

    # We start off with some arbitrary data that defines a resource. This is
    # not necessarily text. It's the "data" from the first sentence:
    data = b"\xff\xf0\xef\xe0"

    # We encode that into text and combine it with the scheme and host to form
    # a complete uri. This is the "URI characters" mentioned in section 2.
    # It's also the "sequence of characters" mentioned in 1.1 as it is not
    # until this point that we actually have a URI.
    uri = b"http://host/" + percentencoded(data)
    #
    # Note1: percentencoded() needs to take any bytes or characters outside of
    # the characters listed in section 2.3 (ALPHA / DIGIT / "-" / "." / "_"
    # / "~") and percent encode them. The URI can only consist of characters
    # from this set and the reserved character set (2.2).
    #
    # Note2: in this simplistic example, we're only dealing with one piece of
    # data. With multiple pieces, we'd need to combine them with separators,
    # for instance like this:
    # uri = b'http://host/' + percentencoded(data1) + b'/'
    #       + percentencoded(data2)
    #
    # Note3: at this point, the uri could be stored as unicode or bytes in
    # python3. It doesn't matter. It will be a subset of ASCII in either
    # case.

    # Then we
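The percentencoded() helper assumed in that walkthrough could be sketched like this, quoting everything outside the RFC 3986 unreserved set; the name and the bytes-in, bytes-out signature are the message's own assumptions, not a stdlib API:

```python
# RFC 3986 section 2.3 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = (b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
              b'abcdefghijklmnopqrstuvwxyz'
              b'0123456789-._~')

def percentencoded(data):
    # data is arbitrary bytes; the result is ASCII bytes suitable
    # for splicing into a URI.
    out = bytearray()
    for c in data:
        if c in UNRESERVED:
            out.append(c)
        else:
            out.extend(b'%%%02X' % c)
    return bytes(out)

print(percentencoded(b'\xff\xf0\xef\xe0'))  # b'%FF%F0%EF%E0'
```

With it, the example URI from the message comes out as pure ASCII bytes regardless of what the original data was.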
Re: [Python-Dev] bytes / unicode
On Jun 21, 2010, at 10:58 PM, Stephen J. Turnbull wrote: The RFC says that URIs are text, and therefore they can (and IMO should) be operated on as text in the stdlib. No, *blue* is the best color for a shed. Oops, wait, let me try that again. While I broadly agree with this statement, it is really an oversimplification. An URI is a structured object, with many different parts, which are transformed from bytes to ASCII (or something latin1-ish, which is really just bytes with a nice face on them) to real, honest-to-goodness text via the IRI specification: http://tools.ietf.org/html/rfc3987. Note also that the complete solution argument cuts both ways. Eg, a complete solution should implement UTS 39 confusables detection[1] and IDNA[2]. Good luck doing that with bytes! And good luck doing that with just characters, too. You need a parsed representation of the URI that you can encode different parts of in different ways. (My understanding is that you should only really implement confusables detection in the netloc... while that may be a bogus example, you're certainly only supposed to do IDNA in the netloc!) You can just call urlsplit() all over the place to emulate this, but this does not give you the ability to go back to the original bytes, and thereby preserve things like brokenly-encoded segments, which seems to be what a lot of this hand-wringing is about. To put it another way, there is no possible information-preserving string or bytes type that will make everyone happy as a result from urljoin(). The only return-type that gives you *everything* is URI. just using 'latin-1' as the encoding allows you to use the (unicode) string operations internally, and then spew your mess out into the world for someone else to clean up, just as using bytes would. This is the limitation that everyone seems to keep dancing around. If you are using the stdlib, with functions that operate on sequences like 'str' or 'bytes', you need to choose from one of three options: 1. 
decode everything to latin1 (although I prefer to call it charmap when used in this way) so that you can have some mojibake that will fool a function that needs a unicode object, but not lose any information about your input so that it can be transformed back into exact bytes (and be very careful to never pass it somewhere that it will interact with real text!), 2. actually decode things to an appropriate encoding to be displayed to the user and manipulated with proper text-manipulation tools, and throw away information about the bytes, 3. keep both the bytes and the characters together (perhaps in a data structure) so that you can both display the data and encode it in situationally-appropriate ways. The stdlib as it is today is not going to handle the 3rd case for anyone. I think that's fine; it is not the stdlib's job to solve everyone's problems. I've been happy with it providing correctly-functioning pieces that can be used to build more elaborate solutions. This is what I meant when I said I agree with Stephen's first point: the stdlib *should* just keep operating entirely on strings, because URIs are defined, by the spec, to be sequences of ASCII characters. But that's not the whole story. PJE's bstr and ebytes proposals set my teeth on edge. I can totally understand the motivation for them, but I think it would be a big step backwards for python 3 to succumb to that temptation, even in the form of a third-party library. It is really trying to cram more information into a pile of bytes than truly exists there. (Also, if we're going to have encodings attached to bytes objects, I would very much like to add JPEG and FLAC to the list of possibilities.) The real tension there is that WSGI is desperately trying to avoid defining any data structures (i.e. classes), while still trying to work with structured data. An URI class with a 'child' method could handily solve this problem. 
You could happily call IRI(...).join(some bytes).join(some text) and then just say give me some bytes, it's time to put this on the network, or give me some characters, I have to show something to the user, or even give me some characters appropriate for an 'href=' target in some HTML I'm generating - although that last one could be left to the HTML generator, provided it could get enough information from the URI/IRI object's various parts itself. I don't mean to pick on WSGI, either. This is a common pain-point for porting software to 3.x - you had a string, it kinda worked most of the time before, but now you need to keep track of text too and the functions which seemed to work on bytes no longer do.
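Glyph's third option (keeping both the bytes and the characters together in a data structure) can be illustrated with a toy sketch; the class name and fallback policy here are invented for illustration, not part of any proposed API:

```python
class SegmentPair:
    """Keep the raw bytes of a URI segment alongside a best-effort text view."""

    def __init__(self, raw, encoding='utf-8'):
        self.raw = raw  # the exact original bytes, never lossy
        try:
            self.text = raw.decode(encoding)
        except UnicodeDecodeError:
            # Display-only fallback for brokenly-encoded segments.
            self.text = raw.decode('latin-1')

    def __repr__(self):
        return 'SegmentPair(%r, %r)' % (self.raw, self.text)

seg = SegmentPair(b'caf\xc3\xa9')
print(seg.text)  # café  (decoded view for display)
print(seg.raw)   # b'caf\xc3\xa9'  (exact bytes for the wire)
```

The point is that "give me bytes for the network" and "give me characters for the user" become two views of one object instead of a lossy one-way conversion.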
Re: [Python-Dev] bytes / unicode
On Jun 21, 2010, at 10:31 PM, Glyph Lefkowitz wrote: This is a common pain-point for porting software to 3.x - you had a string, it kinda worked most of the time before, but now you need to keep track of text too and the functions which seemed to work on bytes no longer do.

Thanks Glyph. That is a nice summary of one kind of challenge facing programmers.

Raymond
Re: [Python-Dev] bytes / unicode
Glyph Lefkowitz writes: On Jun 21, 2010, at 10:58 PM, Stephen J. Turnbull wrote: Note also that the complete solution argument cuts both ways. Eg, a complete solution should implement UTS 39 confusables detection[1] and IDNA[2]. Good luck doing that with bytes! And good luck doing that with just characters, too.

I agree with you, sorry. I meant to cast doubt on the idea of complete solutions, or at least claims that completeness is an excuse for putting it in the stdlib.

This is the limitation that everyone seems to keep dancing around. If you are using the stdlib, with functions that operate on sequences like 'str' or 'bytes', you need to choose from one of three options:

There's a *fourth* way: specially designed codecs to preserve as much metainformation as you need, while always using the str format internally. This can be done for at least 100,000 separate (character, encoding) pairs by multiplexing into private space with an auxiliary table of encodings and equivalences. That's probably overkill. In many cases, adding simple PEP 383 mechanism (to preserve uninterpreted bytes) might be enough though, and that's pretty plausible IMO.
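PEP 383's surrogateescape error handler (added in Python 3.1) is exactly this preserve-uninterpreted-bytes mechanism, and it round-trips cleanly:

```python
raw = b'caf\xe9/r\xc3\xa9union'  # mixed Latin-1 and UTF-8 bytes in one path

# Decode as UTF-8; the invalid byte 0xE9 is smuggled through as a
# lone surrogate (U+DCE9) instead of raising UnicodeDecodeError.
text = raw.decode('utf-8', 'surrogateescape')

# Encoding back with the same handler restores the exact original bytes.
roundtrip = text.encode('utf-8', 'surrogateescape')
assert roundtrip == raw
```

You get a str you can pass through str-based APIs, at the cost of the smuggled surrogates being unprintable and unsafe to hand to anything expecting real text.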
Re: [Python-Dev] bytes / unicode
Toshio Kuratomi writes: I'll definitely buy that. Would urljoin(b_base, b_subdir) => bytes and urljoin(u_base, u_subdir) => unicode be acceptable though? Probably. But it doesn't matter what I say, since Guido has defined that as polymorphism and approved it in principle. (I think, given other options, I'd rather see two separate functions, though. Yes.

If you want to deal with things like this:: http://host/café Yes. At that point you are no longer dealing with the sequence of characters talked about in the RFC. You are dealing with data which may or may not be text. That's right, and I think that in most cases that is what programmers want to be dealing with. Let the library make sure that what goes on the wire conforms to the RFC. I don't want to know about it, I want to work with the content of the URI.

The proliferation of encoding I agree is a thing that is ugly. Although, if I'm thinking correctly, that only matters when you want to allow mixing bytes and unicode, correct? Well you need to know a fair amount about the encoding: that the reserved bytes are used as defined in the RFC, for example.

For debugging, I'm either not understanding or you're wrong. If I'm given an arbitrary sequence of bytes how do I sanely store them as str internally? If it's really arbitrary, you use either a mapping to private space or PEP 383, and accept that it won't make sense. But in most cases you should be able to achieve a fair degree of sanity. If I transform them using an encoding that anticipates the full range of bytes I may be able to display some representation of them but it's not necessarily the sanest method of display (for instance, if I know that path element 1 is always going to be a utf8 encoded string and path element 2 is always shift-jis encoded, and path element 3 is binary data, I could construct a much saner display method than treating the whole thing as latin1).
And I think in most cases you will know, although the cases where you'll know will be because of a system-wide encoding. What is your basis for asserting that URIs that aren't sanely treated as text are garbage? I don't mean we can throw them away, I mean we can't do any sensible processing on them. You at least need to know about the reserved delimiters. In the same way that Philip used 'garbage' for the unknown encoding. And in the sense of garbage in, garbage out. unicode handling redesign. I'm stating my reading of the RFC not to defend the use case Philip has, but because I think that the outlook that non-text uris (before being percentencoded) are violations of the RFC That's not what I'm saying. What I'm trying to point out is that manipulating a bytes object as an URI sort of presumes a lot about its encoding as text. Since many of the URIs we deal with are more or less textual, why not take advantage of that?
Re: [Python-Dev] bytes / unicode
[Just addressing one little issue here; generally I'm just happy that we're discussing this issue in such detail from so many points of view.] On Mon, Jun 21, 2010 at 10:50 PM, Toshio Kuratomi a.bad...@gmail.com wrote: [...] Would urljoin(b_base, b_subdir) => bytes and urljoin(u_base, u_subdir) => unicode be acceptable though? (I think, given other options, I'd rather see two separate functions, though. It seems more discoverable and less prone to taking bad input some of the time to have two functions that clearly only take one type of data apiece.)

Hm. I'd rather see a single function (it would be "polymorphic" in my earlier terminology). After all a large number of string method calls (and some other utility function calls) already look the same regardless of whether they are handling bytes or text (as long as it's uniform). If the building blocks are all polymorphic it's easier to create additional polymorphic functions.

FWIW, there are two problems with polymorphic functions, though they can be overcome:

(1) Literals. If you write something like x.split('&') you are implicitly assuming x is text. I don't see a very clean way to overcome this; you'll have to implement some kind of type check, e.g.

    x.split('&') if isinstance(x, str) else x.split(b'&')

A handy helper function can be written:

    def literal_as(constant, variable):
        if isinstance(variable, str):
            return constant
        else:
            return constant.encode('utf-8')

So now you can write x.split(literal_as('&', x)).

(2) Data sources. These can be functions that produce new data from non-string data, e.g. str(int), read it from a named file, etc. An example is read() vs. write(): it's easy to create a (hypothetical) polymorphic stream object that accepts both f.write('booh') and f.write(b'booh'); but you need some other hack to make read() return something that matches a desired return type.
I don't have a generic suggestion for a solution; for streams in particular, the existing distinction between binary and text streams works, of course, but there are other situations where this doesn't generalize (I think some XML interfaces have this awkwardness in their API for converting a tree to a string).

-- --Guido van Rossum (python.org/~guido)
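Guido's literal_as helper makes a split polymorphic over str and bytes; a quick check (assuming the '&' separator that HTML-escaping appears to have stripped from the examples):

```python
def literal_as(constant, variable):
    # Return the str literal as-is, or encoded to bytes,
    # to match the type of the value being operated on.
    if isinstance(variable, str):
        return constant
    else:
        return constant.encode('utf-8')

def split_query(x):
    # Works on either str or bytes input, returning the matching type.
    return x.split(literal_as('&', x))

print(split_query('a=1&b=2'))   # ['a=1', 'b=2']
print(split_query(b'a=1&b=2'))  # [b'a=1', b'b=2']
```

One helper per literal keeps the polymorphic function free of explicit isinstance checks at every call site.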
Re: [Python-Dev] bytes / unicode
On Tue, Jun 22, 2010 at 6:31 AM, Stephen J. Turnbull step...@xemacs.org wrote: Toshio Kuratomi writes: I'll definitely buy that. Would urljoin(b_base, b_subdir) => bytes and urljoin(u_base, u_subdir) => unicode be acceptable though? Probably. But it doesn't matter what I say, since Guido has defined that as polymorphism and approved it in principle. (I think, given other options, I'd rather see two separate functions, though. Yes. If you want to deal with things like this:: http://host/café Yes.

Just for perspective, I don't know if I've ever wanted to deal with a URL like that. I know how it is supposed to work, and I know what a browser does with that, but so many tools will clean that URL up *or* won't be able to deal with it at all that it's not something I'll be passing around. So from a practical point of view this really doesn't come up, and if it did it would be in a situation where you could easily do something ad hoc (though there is not currently a routine to quote unsafe characters in a URL... that would be helpful, though maybe urllib.quote(url.encode('utf8'), '%/:') would do it). Also while it is problematic to treat the URL-unquoted value as text (because it has an unknown encoding, no encoding, or regularly a mixture of encodings), the URL-quoted value is pretty easy to pass around, and normalization (in this case to http://host/caf%C3%A9) is generally fine.

While it's nice to be correct about encodings, sometimes it is impractical. And it is far nicer to avoid the situation entirely. That is, decoding content you don't care about isn't just inefficient, it's complicated and can introduce errors. The encoding of the underlying bytes of a %-decoded URL is largely uninteresting. Browsers (whose behavior drives a lot of convention) don't touch any of that encoding except lately occasionally to *display* some data in a more friendly way. But it's only display, and errors just make it revert to the old encoded display.
Similarly I'd expect (from experience) that a programmer using Python would want to take the same approach, sticking with unencoded data in nearly all situations. -- Ian Bicking | http://blog.ianbicking.org ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
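Ian's parenthetical about urllib.quote translates to urllib.parse.quote in Python 3. A minimal sketch of the ad hoc normalization he describes, using an illustrative URL; quote() percent-encodes the UTF-8 bytes of non-ASCII characters while leaving the characters listed in `safe` alone:

```python
from urllib.parse import quote, unquote

# Illustrative path with a non-ASCII character.
path = "café"
print("http://host/" + quote(path, safe="/%"))  # http://host/caf%C3%A9

# Going the other way recovers the text, assuming UTF-8 underneath.
print(unquote("caf%C3%A9"))  # café
```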
Re: [Python-Dev] bytes / unicode
On Tue, Jun 22, 2010 at 08:31:13PM +0900, Stephen J. Turnbull wrote: Toshio Kuratomi writes: unicode handling redesign. I'm stating my reading of the RFC not to defend the use case Philip has, but because I think that the outlook that non-text uris (before being percent-encoded) are violations of the RFC

That's not what I'm saying. What I'm trying to point out is that manipulating a bytes object as an URI sort of presumes a lot about its encoding as text.

I think we're more or less in agreement now, but here I'm not sure. What manipulations are you thinking about? Which stage of URI construction are you considering? I've just taken a quick look at python3.1's urllib module and I see that there is a bit of confusion there. But it's not about unicode vs bytes but about whether a URI should be operated on at the real-URI level or the data-that-makes-a-URI level.

* all functions I looked at take python3 str rather than bytes, so there's no confusing stuff here
* urllib.request.urlopen takes a strict URI. That means that you must have a percent-encoded URI at this point
* urllib.parse.urljoin takes regular string values
* urllib.parse.urlparse and urllib.parse.urlunparse take regular string values

Since many of the URIs we deal with are more or less textual, why not take advantage of that?

Cool, so to summarize what I think we agree on:

* Percent-encoded URIs are text according to the RFC.
* The data that is used to construct the URI is not defined as text by the RFC.
* However, it is very often text in an unspecified encoding
* It is extremely convenient for programmers to be able to treat the data that is used to form a URI as text in nearly all common cases.

-Toshio
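A short illustration of the str-level behaviour summarized above, using illustrative URLs; a percent-encoded URI passes through urlparse/urlunparse unchanged:

```python
from urllib.parse import urljoin, urlparse, urlunparse

# All of these operate on str in Python 3.
print(urljoin("http://host/a/b", "c"))  # http://host/a/c

# Parsing keeps the path in its percent-encoded textual form...
parts = urlparse("http://host/caf%C3%A9?q=1")
print(parts.path)  # /caf%C3%A9

# ...and unparsing round-trips the URI exactly.
assert urlunparse(parts) == "http://host/caf%C3%A9?q=1"
```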
Re: [Python-Dev] bytes / unicode
On Jun 22, 2010, at 1:03 PM, Ian Bicking wrote: Similarly I'd expect (from experience) that a programmer using Python would want to take the same approach, sticking with unencoded data in nearly all situations.

Yeah. This is a real issue I have with the direction Python3 went: it pushes you into decoding everything to unicode early, even when you don't care -- all you really wanted to do is pass it from one API to another, with some well-defined transformations, which don't actually depend on it having been decoded properly. (For example, extracting the path from the URL and attempting to open it as a file on the filesystem.) This means that Python3 programs can become *more* fragile in the face of random data you encounter out in the real world, rather than less fragile, which was the goal of the whole exercise.

The surrogateescape method is a nice workaround for this, but I can't help thinking that it might've been better to just treat stuff as possibly-invalid-but-probably-utf8 byte-strings from input, through processing, to output. It seems kinda too late for that, though: next time someone designs a language, they can try that. :) James
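A minimal sketch of the surrogateescape round-trip mentioned above (the byte string is an invented example): bytes that are invalid UTF-8 are smuggled through str as lone surrogates and restored exactly when re-encoded with the same error handler:

```python
# latin-1 bytes for 'café path'; \xe9 is invalid as UTF-8.
data = b"caf\xe9 path"

# Decoding with surrogateescape never fails; the bad byte becomes
# the lone surrogate U+DCE9.
text = data.decode("utf-8", "surrogateescape")
print(ascii(text))  # 'caf\udce9 path'

# Encoding with the same handler restores the original bytes exactly.
assert text.encode("utf-8", "surrogateescape") == data
```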
Re: [Python-Dev] bytes / unicode
Guido van Rossum wrote: [Just addressing one little issue here; generally I'm just happy that we're discussing this issue in such detail from so many points of view.]

On Mon, Jun 21, 2010 at 10:50 PM, Toshio Kuratomi a.bad...@gmail.com wrote: [...] Would urljoin(b_base, b_subdir) = bytes and urljoin(u_base, u_subdir) = unicode be acceptable though? (I think, given other options, I'd rather see two separate functions, though. It seems more discoverable and less prone to taking bad input some of the time to have two functions that clearly only take one type of data apiece.)

Hm. I'd rather see a single function (it would be polymorphic in my earlier terminology). After all a large number of string method calls (and some other utility function calls) already look the same regardless of whether they are handling bytes or text (as long as it's uniform). If the building blocks are all polymorphic it's easier to create additional polymorphic functions.

FWIW, there are two problems with polymorphic functions, though they can be overcome: (1) Literals. If you write something like x.split('') you are implicitly assuming x is text. I don't see a very clean way to overcome this; you'll have to implement some kind of type check, e.g. x.split('') if isinstance(x, str) else x.split(b''). A handy helper function can be written:

def literal_as(constant, variable):
    if isinstance(variable, str):
        return constant
    else:
        return constant.encode('utf-8')

So now you can write x.split(literal_as('', x)).

This polymorphism is what we used in Python2 a lot to write code that works for both Unicode and 8-bit strings. Unfortunately, this no longer works as easily in Python3 due to the literals sometimes having the wrong type, and using such a helper function slows things down a lot. It would be great if we could have something like the above as a builtin method: x.split(''.as(x)) Perhaps something to discuss on the language summit at EuroPython.
Too bad we can't add such porting enhancements to Python2 anymore. -- Marc-Andre Lemburg, eGenix.com
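For reference, a runnable version of the literal_as helper quoted above; the ',' separator is my own illustrative literal, since the one in the archived post was lost:

```python
def literal_as(constant, variable):
    """Return the str literal `constant`, encoded to bytes if
    `variable` is a bytes object, so the types always match."""
    if isinstance(variable, str):
        return constant
    return constant.encode("utf-8")

# The same call works for both str and bytes inputs.
for x in ("a,b", b"a,b"):
    print(x.split(literal_as(",", x)))
# ['a', 'b']
# [b'a', b'b']
```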
Re: [Python-Dev] bytes / unicode
On 6/22/2010 1:22 AM, Glyph Lefkowitz wrote: The thing that I have heard in passing from a couple of folks with experience in this area is that some older software in asia would present characters differently if they were originally encoded in a japanese encoding versus a chinese encoding, even though they were really the same characters.

As I tried to say in another post, that to me is similar to wanting to present English text in different fonts depending on whether spoken by an American or a Brit, or a modern person versus a Renaissance person.

I do know that Han Unification is a giant political mess (http://en.wikipedia.org/wiki/Han_unification makes for some interesting reading), but my understanding is that it has handled enough of the cases by now that one can write software to display asian languages and it will basically work with a modern version of unicode. (And of course, there's always the private use area, as Stephen Turnbull pointed out.)

Thanks, I will take a look.

Regardless, this is another example where keeping around a string isn't really enough. If you need to display a japanese character in a distinct way because you are operating in the japanese *script*, you need a tag surrounding your data that is a hint to its presentation. The fact that these presentation hints were sometimes determined by their encoding is an unfortunate historical accident.

Yes. The asian languages I know anything about seem to natively have almost none of the symbols English has, many borrowed from math, that have been pressed into service for text markup. -- Terry Jan Reedy
Re: [Python-Dev] bytes / unicode
On 6/22/2010 12:53 PM, Guido van Rossum wrote: On Mon, Jun 21, 2010 at 11:47 PM, Raymond Hettinger raymond.hettin...@gmail.com wrote: On Jun 21, 2010, at 10:31 PM, Glyph Lefkowitz wrote: This is a common pain-point for porting software to 3.x - you had a string, it kinda worked most of the time before, but now you need to keep track of text too and the functions which seemed to work on bytes no longer do. Thanks Glyph. That is a nice summary of one kind of challenge facing programmers. Ironically, Glyph also described the pain in 2.x: it only kinda worked.

The people with problematic code to convert must include some who managed to tolerate and perhaps suppress the pain. I suspect that conversion attempts bring it back to the surface. It is natural to blame the re-surfacer rather than the original source. (As in 'blame the messenger'.) -- Terry Jan Reedy
Re: [Python-Dev] bytes / unicode
On Tue, Jun 22, 2010 at 1:07 PM, James Y Knight f...@fuhm.net wrote: The surrogateescape method is a nice workaround for this, but I can't help thinking that it might've been better to just treat stuff as possibly-invalid-but-probably-utf8 byte-strings from input, through processing, to output. It seems kinda too late for that, though: next time someone designs a language, they can try that. :)

surrogateescape does help a lot; my only problem with it is that it's out-of-band information. That is, if you have data that went through data.decode('utf8', 'surrogateescape') you can restore it to bytes or transcode it to another encoding, but you have to know that it was decoded specifically that way. And of course if you did have to transcode it (e.g., text.encode('utf8', 'surrogateescape').decode('latin1')) then if you had actually handled the text in any way you may have broken it; you don't *really* have valid text. A lazier solution feels like it would be easier and more transparent to work with.

But... I also don't see any major language constraint to having another kind of string that is bytes+encoding. I think PJE brought up a problem with a couple of coercion aspects. -- Ian Bicking | http://blog.ianbicking.org
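Ian's transcoding example, made concrete with invented data; the caveat he raises is that nothing in the resulting str records that surrogateescape was used, so the final step only works if you know the history of the value:

```python
# latin-1 bytes for 'café'; invalid as UTF-8.
raw = b"caf\xe9"

# Decoding with surrogateescape smuggles the bad byte through as U+DCE9.
text = raw.decode("utf-8", "surrogateescape")

# Re-encoding with surrogateescape restores the original bytes, which
# can then be decoded as latin-1 -- but only because we *know* that
# surrogateescape was applied in the first place.
fixed = text.encode("utf-8", "surrogateescape").decode("latin-1")
print(fixed)  # café
```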
Re: [Python-Dev] bytes / unicode
On Wed, Jun 23, 2010 at 6:09 AM, M.-A. Lemburg m...@egenix.com wrote: return constant.encode('utf-8') So now you can write x.split(literal_as('', x)). This polymorphism is what we used in Python2 a lot to write code that works for both Unicode and 8-bit strings. Unfortunately, this no longer works as easily in Python3 due to the literals sometimes having the wrong type and using such a helper function slows things down a lot.

It didn't work in 2 either - see for instance the traceback module with an Exception with unicode args and a non-ascii file path - the file path is in its bytes form, the string joining logic triggers an implicit upcast and *boom*.

Too bad we can't add such porting enhancements to Python2 anymore

Perhaps a 'py3compat' module on pypi, with things like the py._builtin reraise helper and so forth? -Rob
Re: [Python-Dev] bytes / unicode
On Wed, Jun 23, 2010 at 2:17 AM, Guido van Rossum gu...@python.org wrote: (1) Literals. If you write something like x.split('') you are implicitly assuming x is text. I don't see a very clean way to overcome this; you'll have to implement some kind of type check e.g. x.split('') if isinstance(x, str) else x.split(b'') A handy helper function can be written: def literal_as(constant, variable): if isinstance(variable, str): return constant else: return constant.encode('utf-8') So now you can write x.split(literal_as('', x)).

I think this is a key point. In checking the behaviour of the os module bytes APIs (see below), I used a simple filter along the lines of: [x for x in seq if x.endswith(b)] It would be nice if code along those lines could easily be made polymorphic. Maybe what we want is a new class method on bytes and str (this idea is similar to what MAL suggests later in the thread):

def coerce(cls, obj, encoding=None, errors='surrogateescape'):
    if isinstance(obj, cls):
        return obj
    if encoding is None:
        encoding = sys.getdefaultencoding()
    # This is the str version; bytes.coerce would use obj.encode() instead
    return obj.decode(encoding, errors)

Then my example above could be made polymorphic (for ASCII-compatible encodings) by writing: [x for x in seq if x.endswith(x.coerce(b))]

I'm trying to see downsides to this idea, and I'm not really seeing any (well, other than 2.7 being almost out the door and the fact we'd have to grant ourselves an exception to the language moratorium).

(2) Data sources. These can be functions that produce new data from non-string data, e.g. str(int), read it from a named file, etc. An example is read() vs. write(): it's easy to create a (hypothetical) polymorphic stream object that accepts both f.write('booh') and f.write(b'booh'); but you need some other hack to make read() return something that matches a desired return type.
I don't have a generic suggestion for a solution; for streams in particular, the existing distinction between binary and text streams works, of course, but there are other situations where this doesn't generalize (I think some XML interfaces have this awkwardness in their API for converting a tree to a string).

We may need to use the os and io modules as the precedents here:

os: normal API is text using the surrogateescape error handler, parallel bytes API exposes raw bytes. The parallel API is polymorphic if possible (e.g. os.listdir), but appends a 'b' to the name if the polymorphic approach isn't practical (e.g. os.environb, os.getcwdb, os.getenvb).

io: layered API, where both the raw bytes of the wire protocol and the decoded text of the text layer are available.

Regards, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
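The os-module precedent Nick describes can be seen directly; a small sketch, using the current directory as sample input:

```python
import os

# os.listdir is polymorphic: the argument type selects the result type.
str_names = os.listdir(".")
bytes_names = os.listdir(b".")
assert all(isinstance(n, str) for n in str_names)
assert all(isinstance(n, bytes) for n in bytes_names)

# Where polymorphism isn't practical, a parallel 'b' name is used instead.
print(type(os.getcwd()).__name__, type(os.getcwdb()).__name__)  # str bytes
```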
Re: [Python-Dev] bytes / unicode
On Wed, Jun 23, 2010 at 4:09 AM, M.-A. Lemburg m...@egenix.com wrote: It would be great if we could have something like the above as a builtin method: x.split(''.as(x))

As per my other message, another possible (and reasonably intuitive) spelling would be: x.split(x.coerce('')) Writing it as a helper function is also possible, although it would be trickier to remember the correct argument ordering:

def coerce_to(target, obj, encoding=None, errors='surrogateescape'):
    if isinstance(obj, type(target)):
        return obj
    if encoding is None:
        encoding = sys.getdefaultencoding()
    try:
        convert = obj.decode
    except AttributeError:
        convert = obj.encode
    return convert(encoding, errors)

x.split(coerce_to(x, ''))

Perhaps something to discuss on the language summit at EuroPython. Too bad we can't add such porting enhancements to Python2 anymore.

Well, we can if we really want to, it just entails convincing Benjamin to reschedule the 2.7 final release. Given the UserDict/ABC/old-style classes issue, there's a fair chance there's going to be at least one more 2.7 RC anyway. That said, since this kind of coercion can be done in a helper function, that should be adequate for the 2.x to 3.x conversion case (for 2.x, the helper function can be defined to just return the second argument since bytes and str are the same type, while the 3.x version would look something like the code above). Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
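A runnable sketch of the coerce_to helper above; the ',' literal is my own illustrative stand-in for the one lost in the archive:

```python
import sys

def coerce_to(target, obj, encoding=None, errors="surrogateescape"):
    """Coerce `obj` (str or bytes) to the type of `target`."""
    if isinstance(obj, type(target)):
        return obj
    if encoding is None:
        encoding = sys.getdefaultencoding()
    try:
        convert = obj.decode   # obj is bytes, target is str
    except AttributeError:
        convert = obj.encode   # obj is str, target is bytes
    return convert(encoding, errors)

# One call site handles both str and bytes inputs.
for x in ("a,b", b"a,b"):
    print(x.split(coerce_to(x, ",")))
# ['a', 'b']
# [b'a', b'b']
```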
Re: [Python-Dev] bytes / unicode
On 22/06/2010 22:40, Robert Collins wrote: On Wed, Jun 23, 2010 at 6:09 AM, M.-A. Lemburg m...@egenix.com wrote: return constant.encode('utf-8') So now you can write x.split(literal_as('', x)). This polymorphism is what we used in Python2 a lot to write code that works for both Unicode and 8-bit strings. Unfortunately, this no longer works as easily in Python3 due to the literals sometimes having the wrong type and using such a helper function slows things down a lot.

It didn't work in 2 either - see for instance the traceback module with an Exception with unicode args and a non-ascii file path - the file path is in its bytes form, the string joining logic triggers an implicit upcast and *boom*.

Yeah, there are still a few places in unittest where a unicode exception can cause the whole test run to bomb out. No-one has *yet* reported these as bugs, and I try to ferret them out as I find them. All the best, Michael

Too bad we can't add such porting enhancements to Python2 anymore

Perhaps a 'py3compat' module on pypi, with things like the py._builtin reraise helper and so forth? -Rob

-- http://www.ironpythoninaction.com/ http://www.voidspace.org.uk/blog

READ CAREFULLY. By accepting and reading this email you agree, on behalf of your employer, to release me from all obligations and waivers arising from any and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap, clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and acceptable use policies ("BOGUS AGREEMENTS") that I have entered into with your employer, its partners, licensors, agents and assigns, in perpetuity, without prejudice to my ongoing rights and privileges. You further represent that you have the authority to release me from any BOGUS AGREEMENTS on behalf of your employer.
Re: [Python-Dev] bytes / unicode
On 22/06/2010 19:07, James Y Knight wrote: On Jun 22, 2010, at 1:03 PM, Ian Bicking wrote: Similarly I'd expect (from experience) that a programmer using Python would want to take the same approach, sticking with unencoded data in nearly all situations. Yeah. This is a real issue I have with the direction Python3 went: it pushes you into decoding everything to unicode early,

Well, both .NET and Java take this approach as well. I wonder how they cope with the particular issues that have been mentioned for web applications - both platforms are used extensively for web apps. Having used IronPython, which has .NET unicode strings (although it does a lot of magic to *allow* you to store binary data in strings for compatibility with CPython), I have to say that this approach makes a lot of programming *so* much more pleasant.

We did a lot of I/O (can you do useful programming without I/O?) including working with databases, but I didn't work *much* with wire protocols (fetching a fair bit of data from the web, though, now I think about it). I think wire protocols can present particular problems; sometimes having mixed encodings in the same data, it seems. Where you don't have these problems, keeping bytes data and all Unicode text data separate and encoding / decoding at the boundaries is really much more sane and pleasant. It would be a real shame if we decided that the way forward for Python 3 was to try and move closer to how bytes/text was handled in Python 2. All the best, Michael

even when you don't care -- all you really wanted to do is pass it from one API to another, with some well-defined transformations, which don't actually depend on it having been decoded properly. (For example, extracting the path from the URL and attempting to open it as a file on the filesystem.) This means that Python3 programs can become *more* fragile in the face of random data you encounter out in the real world, rather than less fragile, which was the goal of the whole exercise.
The surrogateescape method is a nice workaround for this, but I can't help thinking that it might've been better to just treat stuff as possibly-invalid-but-probably-utf8 byte-strings from input, through processing, to output. It seems kinda too late for that, though: next time someone designs a language, they can try that. :) James
Re: [Python-Dev] bytes / unicode
On Tue, Jun 22, 2010 at 11:17 AM, Guido van Rossum gu...@python.org wrote: (2) Data sources. These can be functions that produce new data from non-string data, e.g. str(int), read it from a named file, etc. An example is read() vs. write(): it's easy to create a (hypothetical) polymorphic stream object that accepts both f.write('booh') and f.write(b'booh'); but you need some other hack to make read() return something that matches a desired return type. I don't have a generic suggestion for a solution; for streams in particular, the existing distinction between binary and text streams works, of course, but there are other situations where this doesn't generalize (I think some XML interfaces have this awkwardness in their API for converting a tree to a string).

This reminds me of the optimization ElementTree and lxml made in Python 2 (not sure what they do in Python 3?) where they use str when a string is ASCII to avoid the memory and performance overhead of unicode. At least lxml is also dealing with the divide between the internal libxml2 string representation and the Python representation. This is a place where bytes+encoding might also have some benefit. XML is someplace where you might load a bunch of data but only touch a little bit of it, and the amount of data is frequently large enough that the efficiencies are important. -- Ian Bicking | http://blog.ianbicking.org
Re: [Python-Dev] bytes / unicode
At 07:41 AM 6/23/2010 +1000, Nick Coghlan wrote: Then my example above could be made polymorphic (for ASCII compatible encodings) by writing: [x for x in seq if x.endswith(x.coerce(b))] I'm trying to see downsides to this idea, and I'm not really seeing any (well, other than 2.7 being almost out the door and the fact we'd have to grant ourselves an exception to the language moratorium)

Notice, however, that if multi-string operations used a coercion protocol (they currently have to do type checks already for byte/unicode mixes), then you could make the entire stdlib polymorphic by default, even for other kinds of strings that don't exist yet. If you invent a new numeric type, generally speaking you can pass it to existing stdlib functions taking numbers, as long as it implements the appropriate protocols. Why not do the same for strings?
Re: [Python-Dev] bytes / unicode
On Jun 22, 2010, at 12:53 PM, Guido van Rossum wrote: On Mon, Jun 21, 2010 at 11:47 PM, Raymond Hettinger raymond.hettin...@gmail.com wrote: On Jun 21, 2010, at 10:31 PM, Glyph Lefkowitz wrote: This is a common pain-point for porting software to 3.x - you had a string, it kinda worked most of the time before, but now you need to keep track of text too and the functions which seemed to work on bytes no longer do. Thanks Glyph. That is a nice summary of one kind of challenge facing programmers. Ironically, Glyph also described the pain in 2.x: it only kinda worked.

It was not my intention to be ironic about it - that was exactly what I meant :). 3.x is forcing you to confront an issue that you _should_ have confronted for 2.x anyway. (And, I hope, most libraries doing a 3.x migration will take the opportunity to make their 2.x APIs unicode-clean while still in 2to3 mode, and jump ship to 3.x source only _after_ there's a nice transition path for their clients that can be taken in 2 steps.)
Re: [Python-Dev] bytes / unicode
On Jun 22, 2010, at 2:07 PM, James Y Knight wrote: Yeah. This is a real issue I have with the direction Python3 went: it pushes you into decoding everything to unicode early, even when you don't care -- all you really wanted to do is pass it from one API to another, with some well-defined transformations, which don't actually depend on it having been decoded properly. (For example, extracting the path from the URL and attempting to open it as a file on the filesystem.)

But you _do_ need to decode it in this case. If you got your URL from some funky UTF-32 datasource, b'\x00\x00\x00/' is not a path separator, '/' is. Plus, you should really be separating path segments and looking at them individually so that you don't fall victim to %2F bugs. And if you want your code to be portable, you need a Unicode representation of your pathname anyway for Windows; plus, there, you need to care about \ as well as /. The fact that your wire-bytes were probably ASCII(-ish) and your filesystem probably encodes pathnames as UTF-8, and so everything looks like it lines up, is no excuse not to be explicit about your expectations there. You may want to transcode your characters into some other characters later, but that shouldn't stop you from treating them as characters of some variety in the meanwhile.

The surrogateescape method is a nice workaround for this, but I can't help thinking that it might've been better to just treat stuff as possibly-invalid-but-probably-utf8 byte-strings from input, through processing, to output. It seems kinda too late for that, though: next time someone designs a language, they can try that. :)

I can think of lots of optimizations that might be interesting for Python (or perhaps some other runtime less concerned with cleverness overload, like PyPy) to implement, like a UTF-8 combining-characters overlay that would allow for fast indexing, lazily populated as random access dictates.
But this could all be implemented as smartness inside .encode() and .decode() and the str and bytes types without changing the way the API works. I realize that there are implications at the C level, but as long as you can squeeze a function call in to certain places, it could still work.

I can also appreciate what's been said in this thread a bunch of times: to my knowledge, nobody has actually shown a profile of an application where encoding is significant overhead. I believe that encoding _will_ be a significant overhead for some applications (and actually I think it will be very significant for some applications that I work on), but optimizations should really be implemented once that's been demonstrated, so that there's a better understanding of what the overhead is, exactly. Is memory a big deal? Is CPU? Is it both? Do you want to tune for the tradeoff? etc, etc. Clever data-structures seem premature until someone has a good idea of all those things.
Re: [Python-Dev] bytes / unicode
On Jun 22, 2010, at 7:23 PM, Ian Bicking wrote: This is a place where bytes+encoding might also have some benefit. XML is someplace where you might load a bunch of data but only touch a little bit of it, and the amount of data is frequently large enough that the efficiencies are important.

Different encodings have different characteristics, though, which makes them amenable to different types of optimizations. If you've got an ASCII string or a latin1 string, the optimizations of unicode are pretty obvious; if you've got one in UTF-16 with no multi-code-unit sequences, you could also hypothetically cheat for a while if you're on a UCS4 build of Python. I suspect the practical problem here is that there's no CharacterString ABC in the collections module for third-party libraries to provide their own peculiarly-optimized implementations that could lazily turn into real 'str's as needed. I'd volunteer to write a PEP if I thought I could actually get it done :-\. If someone else wants to be the primary author though, I'll try to help out.
Re: [Python-Dev] bytes / unicode
On Tue, Jun 22, 2010 at 4:23 PM, Ian Bicking i...@colorstudy.com wrote: This reminds me of the optimization ElementTree and lxml made in Python 2 (not sure what they do in Python 3?) where they use str when a string is ASCII to avoid the memory and performance overhead of unicode.

An optimization that forces me to typecheck the return value of the function, and that I only discovered after code started breaking. I can't say I was enthused about that decision when I discovered it. -Mike
Re: [Python-Dev] bytes / unicode
On Wed, Jun 23, 2010 at 12:25 PM, Glyph Lefkowitz gl...@twistedmatrix.com wrote: I can also appreciate what's been said in this thread a bunch of times: to my knowledge, nobody has actually shown a profile of an application where encoding is significant overhead. I believe that encoding _will_ be a significant overhead for some applications (and actually I think it will be very significant for some applications that I work on), but optimizations should really be implemented once that's been demonstrated, so that there's a better understanding of what the overhead is, exactly. Is memory a big deal? Is CPU? Is it both? Do you want to tune for the tradeoff? etc, etc. Clever data-structures seem premature until someone has a good idea of all those things.

bzr has a cache of decoded strings in it precisely because decode is slow. We accept slowness encoding to the user's locale because that's typically much less data to examine than we've examined while generating the commit/diff/whatever. We also face memory pressure on a regular basis, and that has been, at least partly, due to UCS4 - our translation cache helps there because we have fewer duplicate UCS4 strings. You're welcome to dig deeper into this, but I don't have more detail paged into my head at the moment. -Rob
Re: [Python-Dev] bytes / unicode
Robert Collins writes: Also, url's are bytestrings - by definition; Eh? RFC 3986 explicitly says A URI is an identifier consisting of a sequence of characters matching the syntax rule named URI in Section 3. (where the phrase sequence of characters appears in all ancestors I found back to RFC 1738), and 2. Characters The URI syntax provides a method of encoding data, presumably for the sake of identifying a resource, as a sequence of characters. The URI characters are, in turn, frequently encoded as octets for transport or presentation. This specification does not mandate any particular character encoding for mapping between URI characters and the octets used to store or transmit those characters. When a URI appears in a protocol element, the character encoding is defined by that protocol; without such a definition, a URI is assumed to be in the same character encoding as the surrounding text. if the standard library has made them unicode objects in 3, I expect a lot of pain in the webserver space. Yup. But pain is inevitable if people are treating URIs (whether URLs or otherwise) as octet sequences. Then your base URL is gonna be b'mailto:step...@xemacs.org', but the natural thing the UI will want to do is formurl = baseurl + '?subject=うるさいやつだなぁ…' IMO, the UI is right. Something like the above ought to work. So the function that actually handles composing the URL should take a string (ie, unicode), and do all escaping. The UI code should not need to know about escaping. If nothing escapes except the function that puts the URL in composed form, and that function always escapes, life is easy. Of course, in real life it's not that easy. But it's possible to make things unnecessarily hard for the users of your URI API(s), and one way to do that is to make URIs into just bytes (and just unicode is probably nearly as bad, except that at least you know it's not ready for the wire).
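As it happens, the stdlib eventually went the route implied here: in current Python 3 (3.2 and later, not the 3.1 under discussion), urllib.parse is polymorphic -- all-str arguments give str back, all-bytes give bytes back, and mixing the two is rejected. A quick sketch of that behavior:

```python
from urllib.parse import urljoin

# all-str in, str out
print(urljoin('http://example.com/a/b', 'c'))    # http://example.com/a/c

# all-bytes in, bytes out
print(urljoin(b'http://example.com/a/b', b'c'))  # b'http://example.com/a/c'

# mixing str and bytes is rejected outright
try:
    urljoin('http://example.com/a/b', b'c')
except TypeError as e:
    print(e)  # Cannot mix str and non-str arguments
```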
Re: [Python-Dev] bytes / unicode
2010/6/21 Stephen J. Turnbull step...@xemacs.org: IMO, the UI is right. Something like the above ought to work. Right. That said, many times when you want to do urlparse etc they might be binary, and you might want binary. So maybe the methods should work with both? -- Lennart Regebro: http://regebro.wordpress.com/ Python 3 Porting: http://python3porting.com/ +33 661 58 14 64
Re: [Python-Dev] bytes / unicode
On Mon, Jun 21, 2010 at 12:30 PM, P.J. Eby p...@telecommunity.com wrote: I also find it weird that there seem to be two camps on this subject, one of which claims that All Is Well And There Is No Problem -- but I do not recall seeing anyone who was in the What do I do; this doesn't seem ready camp who switched sides and took the time to write down what made them realize that they were wrong about there being a problem, and what steps they had to take. The existence of one or more such documents would certainly ease my mind, and I imagine that of other people who are less waiting for others' libraries, than for the stdlib (and/or language) itself to settle. (Or more precisely, for it to be SEEN to have settled.) I don't know that the all is well camp actually exists. The camp that I do see existing is the one that says without a bug report, inconsistencies in the standard library's unicode handling won't get fixed. The issues picked up by the regression test suite have already been dealt with, but that suite is unfortunately far from comprehensive. Just like a lot of Python code that is out there, the standard library isn't immune to the poor coding practices that were permitted by the blurry lines between text and octet streams in 2.x. It may be that there are places where we need to rewrite standard library algorithms to be bytes/str neutral (e.g. by using length one slices instead of indexing). It may be that there are more APIs that need to grow encoding keyword arguments that they then pass on to the functions they call or use to convert str arguments to bytes (or vice-versa). But without people trying to port affected libraries and reporting bugs when they find issues, the situation isn't going to improve. Now, if these bugs are already being reported against 3.1 and just aren't getting fixed, that's a completely different story... Cheers, Nick. 
-- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
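The length-one-slice technique Nick mentions can be sketched with a hypothetical helper (not stdlib code): indexing bytes yields an int in Python 3, so `x[0]` breaks type-neutral code, while `x[:1]` preserves the argument's type.

```python
def starts_with_sep(path, seps):
    # path[0] on bytes yields an int in Python 3, so `path[0] in seps`
    # misbehaves; a length-one slice preserves the argument's type and
    # the same line works for both str and bytes
    return path[:1] in seps

print(starts_with_sep('/usr/bin', '/\\'))    # True
print(starts_with_sep(b'usr/bin', b'/\\'))   # False
```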
Re: [Python-Dev] bytes / unicode
Lennart Regebro writes: 2010/6/21 Stephen J. Turnbull step...@xemacs.org: IMO, the UI is right. Something like the above ought to work. Right. That said, many times when you want to do urlparse etc they might be binary, and you might want binary. So maybe the methods should work with both? First, a caveat: I'm a Unicode/encodings person, not an experienced web programmer. My opinions on whether this would work well in practice should be taken with a grain of salt. Speaking for myself, I live in a country where the natives have saddled themselves with no less than 4 encodings in common use, and I would never want binary since none of them would display as anything useful in a traceback. Wherever possible, I decode blobs into structured objects, I do it as soon as possible, and if for efficiency reasons I want to do this lazily, I store the blob in a separate .raw_object attribute. If they're textual, I decode them to text. I can't see an efficiency argument for decoding URIs lazily in most applications. In the case of structured text like URIs, I would create a separate class for handling them with string-like operations. Internally, all text would be raw Unicode (ie, not url-encoded); repr(uri) would use some kind of readable quoting convention (not url-encoding) to disambiguate random reserved characters from separators, while str(uri) would produce an url-encoded string. Converting to and from wire format is just .encode and .decode, then, and in this country you need to be flexible about which encoding you use. Agreed, this stuff is really annoying. But I think that just comes with the territory. PJE reports that folks don't like doing encoding and decoding all over the place. I understand that, but if they're doing a lot of that, I have to wonder why. Why not define the one line function and get on with life? The thing is, where I live, it's not going to be a one line function. 
I'm going to be dealing with URLs that are url-encoded representations of UTF-8, Shift-JIS, EUC-JP, and occasionally RFC 2047! So I need an API that explicitly encodes and decodes. And I need an API that presents Japanese as Japanese rather than as line noise. Eg, PJE writes Ugh. I meant:

newurl = urljoin(str(base, 'latin-1'), 'subdir').encode('latin-1')

Which just goes to the point of how ridiculous it is to have to convert things to strings and back again to use APIs that ought to just handle bytes properly in the first place. But if you need that everywhere, what's so hard about

def urljoin_wrapper(base, subdir):
    return urljoin(str(base, 'latin-1'), subdir).encode('latin-1')

Now, note how that pattern fails as soon as you want to use non-ISO-8859-1 languages for subdir names. In Python 3, the code above is just plain buggy, IMHO. The original author probably will never need the generalization. But her name will be cursed unto the nth generation by people who use her code on a different continent. The net result is that bytes are *not* a programmer- or user-friendly way to do this, except for the minority of the world for whom Latin-1 is a good approximation to their daily-use unibyte encoding (eg, it's probably usable for debugging in Dansk, but you won't win any popularity contests in Tel Aviv or Shanghai).
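Stephen's objection can be made concrete: the latin-1 round-trip wrapper (PJE's hypothetical helper, quoted above) works only while the text half of the operation stays inside Latin-1's repertoire, whereas percent-encoding the text explicitly does not have that restriction.

```python
from urllib.parse import urljoin, quote

def urljoin_wrapper(base, subdir):
    # PJE's pattern: smuggle bytes through str via latin-1 and back
    return urljoin(str(base, 'latin-1'), subdir).encode('latin-1')

print(urljoin_wrapper(b'http://example.com/a/', 'sub'))  # fine: ASCII subdir

try:
    urljoin_wrapper(b'http://example.com/a/', 'うるさい')
except UnicodeEncodeError as e:
    print('fails outside Latin-1:', e.encoding)

# treating the subdir as *text* and percent-encoding it explicitly
# (Stephen's position) has no such restriction; quote() defaults to UTF-8
print(urljoin('http://example.com/a/', quote('うるさい')))
```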
Re: [Python-Dev] bytes / unicode
At 10:51 PM 6/21/2010 +1000, Nick Coghlan wrote: It may be that there are places where we need to rewrite standard library algorithms to be bytes/str neutral (e.g. by using length one slices instead of indexing). It may be that there are more APIs that need to grow encoding keyword arguments that they then pass on to the functions they call or use to convert str arguments to bytes (or vice-versa). But without people trying to port affected libraries and reporting bugs when they find issues, the situation isn't going to improve. Now, if these bugs are already being reported against 3.1 and just aren't getting fixed, that's a completely different story... The overall impression, though, is that this isn't really a step forward. Now, bytes are the special case instead of unicode, but that special case isn't actually handled any better by the stdlib - in fact, it's arguably worse. And, the burden of addressing this seems to have been shifted from the people who made the change, to the people who are going to use it. But those people are not necessarily in a position to tell you anything more than, give me something that works with bytes. What I can tell you is that before, since string constants in the stdlib were ascii bytes, and transparently promoted to unicode, stdlib behavior was *predictable* in the presence of special cases: you got back either bytes or unicode, but either way, you could idempotently upgrade the result to unicode, or just pass it on. APIs were str safe, unicode aware. If you passed in bytes, you weren't going to get unicode without a warning, and if you passed in unicode, it'd work and you'd get unicode back. Now, the APIs are neither safe nor aware -- if you pass bytes in, you get unpredictable results back. Ironically, it almost *would* have been better if bytes simply didn't work as strings at all, *ever*, but if you could wrap them with a bstr() to *treat* them as text. 
You could still have restrictions on combining them, as long as it was a restriction on the unicode you mixed with them. That is, if you could combine a bstr and a str if the *str* was restricted to ASCII. If we had the Python 3 design discussions to do over again, I think I would now have stuck with the position of not letting bytes be string-compatible at all, and instead proposed an explicit bstr() wrapper/adapter to use them as strings, that would (in that case) force coercion in the direction of bytes rather than strings. (And bstr need not have been a builtin - it could have been something you import, to help discourage casual usage.) Might this approach lead to some people doing things wrong in the case of porting? Sure. But there'd be little reason to use it in new code that didn't have a real need for bytestring manipulation. It might've been a better balance between practicality and purity, in that it keeps the language pure, while offering a practical way to deal with things in bytes if you really need to. And, bytes wouldn't silently succeed *some* of the time, leading to a trap. An easy inconsistency is worse than a bit of uniform chicken-waving. Is it too late to make that tradeoff? Probably. Certainly it's not practical to *implement* outside the language core, and removing string methods would fux0r anybody whose currently-ported code relies on bytes objects having string-like methods. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
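A minimal sketch of the bstr idea as described here (hypothetical -- my reading of the proposal, not an implementation from the thread): coercion runs toward bytes, and only ASCII-restricted str is allowed into the mix.

```python
class bstr(bytes):
    """Hypothetical adapter: bytes usable in string-ish concatenation.

    Mixing with str coerces the *str* side to bytes, and only if it
    is pure ASCII -- the restriction PJE describes above.
    """
    def __add__(self, other):
        if isinstance(other, str):
            other = other.encode('ascii')  # non-ASCII str raises here
        return bstr(bytes(self) + other)

    def __radd__(self, other):
        if isinstance(other, str):
            other = other.encode('ascii')
        return bstr(other + bytes(self))

print(bstr(b'GET ') + '/index')   # b'GET /index' -- result stays bytes
try:
    bstr(b'GET ') + 'é'
except UnicodeEncodeError:
    print('non-ASCII str rejected early')
```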
Re: [Python-Dev] bytes / unicode
On 21/06/2010 17:46, P.J. Eby wrote: [snip -- P.J. Eby's message quoted in full above] Why is your proposed bstr wrapper not practical to implement outside the core and use in your own libraries and frameworks? Michael -- http://www.ironpythoninaction.com/ http://www.voidspace.org.uk/blog
Re: [Python-Dev] bytes / unicode
At 01:08 AM 6/22/2010 +0900, Stephen J. Turnbull wrote: But if you need that everywhere, what's so hard about

def urljoin_wrapper(base, subdir):
    return urljoin(str(base, 'latin-1'), subdir).encode('latin-1')

Now, note how that pattern fails as soon as you want to use non-ISO-8859-1 languages for subdir names. Bear in mind that the use cases I'm talking about here are WSGI stacks with components written by multiple authors -- each of whom may have to define that function, and still get it right. Sure, there are some things that could go in wsgiref in the stdlib. However, as of this moment, there's only a very uneasy rough consensus in Web-Sig as to how the heck WSGI should actually *work* on Python 3, because of issues like these. That makes it tough to actually say what should happen in the stdlib -- e.g., which things should be classed as stdlib bugs, which things should be worked around with wrappers or new functions, etc.
Re: [Python-Dev] bytes / unicode
On Tue, Jun 22, 2010 at 01:08:53AM +0900, Stephen J. Turnbull wrote: [snip -- Stephen's message quoted in full above] The net result is that bytes are *not* a programmer- or user-friendly way to do this, except for the minority of the world for whom Latin-1 is a good approximation to their daily-use unibyte encoding. One comment here -- you can also have URIs that aren't decodable into their true textual meaning using a single encoding. Apache will happily serve out URIs that have utf-8, shift-jis, and euc-jp components inside of their path, but the textual representation that was intended will be garbled (or be represented by escaped byte sequences). For that matter, Apache will serve requests that have no true textual representation, as it is working on the byte level rather than the character level. So a complete solution really should allow the programmer to pass in URIs as bytes when the programmer knows that they need it.
-Toshio
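Toshio's point can be illustrated with the byte-oriented escape handling that did land in the stdlib (current Python 3 API; the Shift-JIS example segment is my own): unquote_to_bytes() keeps the byte level lossless, while text-level unquote() with the wrong encoding garbles it.

```python
from urllib.parse import unquote, unquote_to_bytes

# a path segment whose percent-escapes are the Shift-JIS bytes for 日本
segment = '%93%FA%96%7B'

raw = unquote_to_bytes(segment)   # lossless: the original octets
print(raw.decode('shift_jis'))    # 日本 -- correct, given the right encoding

# decoding with the default utf-8 substitutes replacement characters
print(unquote(segment))
```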
Re: [Python-Dev] bytes / unicode
On 6/20/2010 11:56 PM, Terry Reedy wrote: The specific example is

>>> urllib.parse.parse_qsl('a=b%e0')
[('a', 'b�')]

where the character after 'b' is white ? in dark diamond, indicating an error. parse_qsl() splits that input on '=' and sends each piece to urllib.parse.unquote. unquote() attempts to Replace %xx escapes by their single-character equivalent. unquote has an encoding parameter that defaults to 'utf-8' in *its* call to .decode. parse_qsl does not have an encoding parameter. If it did, and it passed that to unquote, then the above example would become (simulated interaction)

>>> urllib.parse.parse_qsl('a=b%e0', encoding='latin-1')
[('a', 'bà')]

I got that output by copying the file and adding encoding='latin-1' to the unquote call. Does this solve this problem? Has anything like this been added for 3.2? Should it be? With a little searching, I found http://bugs.python.org/issue5468 with Miles Kaufmann's year-old comment parse_qs and parse_qsl should also grow encoding and errors parameters to pass to the underlying unquote(). Patch review is needed. Terry Jan Reedy
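For the record, these parameters did land: in Python 3.2 and later, parse_qs() and parse_qsl() accept encoding and errors arguments that are passed through to unquote(), so Terry's simulated interaction works unpatched.

```python
from urllib.parse import parse_qsl, unquote

print(unquote('b%e0'))                          # 'b\ufffd' -- utf-8 default, replaced
print(unquote('b%e0', encoding='latin-1'))      # 'bà'
print(parse_qsl('a=b%e0', encoding='latin-1'))  # [('a', 'bà')]
```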
Re: [Python-Dev] bytes / unicode
On Mon, Jun 21, 2010 at 9:46 AM, P.J. Eby p...@telecommunity.com wrote: [snip -- P.J. Eby's reply to Nick Coghlan, quoted in full above] APIs were str safe, unicode aware. If you passed in bytes, you weren't going to get unicode without a warning, and if you passed in unicode, it'd work and you'd get unicode back. Actually, the big problem with Python 2 is that if you mix str and unicode, things work or crash depending on whether any of the str objects involved contain non-ASCII bytes.
If one API decides to upgrade to Unicode, the result, when passed to another API, may well cause a UnicodeError because not all arguments have had the same treatment. Now, the APIs are neither safe nor aware -- if you pass bytes in, you get unpredictable results back. This seems an overgeneralization of a particular bug. There are APIs that are strictly text-in, text-out. There are others that are bytes-in, bytes-out. Let's call all those *pure*. For some operations it makes sense that the API is *polymorphic*, with which I mean that text-in causes text-out, and bytes-in causes byte-out. All of these are fine. Perhaps there are more situations where a polymorphic API would be helpful. Such APIs are not always so easy to implement, because they have to be careful with literals or other constants (and even more so mutable state) used internally -- but it can be done, and there are plenty of examples in the stdlib. The real problem apparently lies in (what I believe is only a few rare) APIs that are text-or-bytes-in and always-text-out (or always-bytes-out). Let's call them *hybrid*. Clearly, mixing hybrid APIs in a stream of pure or polymorphic API calls is a problem, because they turn a pure or polymorphic overall operation into a hybrid one. There are also text-in, bytes-out or bytes-in, text-out APIs that are intended for encoding/decoding of course, but these are in a totally different class. Abstractly, it would be good if there were as few as possible hybrid APIs, many pure or polymorphic APIs (which it should be in a particular case is a pragmatic choice), and a limited number of encoding/decoding APIs, which should generally be invoked at the edges of the program (e.g., I/O). Ironically, it almost *would* have been better if bytes simply didn't work as strings at all, *ever*, but if you could wrap them with a bstr() to *treat* them as text. You could still have restrictions on combining them, as long as it was a restriction on the unicode you mixed with them. 
That is, if you could combine a bstr and a str if the *str* was restricted to ASCII. ISTR that we considered something like this and decided to stay away from it. At this point I think that a successful 3rd party bstr implementation would be required before we rush to add one to the stdlib. If we had the Python 3 design discussions to do over again, I think I would now have stuck with the position of not letting bytes be string-compatible at all, They aren't, unless you consider the presence of some methods with similar behavior (.lower(), .split() and so on) and the existence of some polymorphic APIs (see above) as compatibility. and instead proposed an explicit bstr() wrapper/adapter to use them as strings, that would (in that case) force coercion in the direction of bytes rather than strings. (And bstr need not have been a builtin - it could have been something you import, to help discourage casual usage.) I'm still unclear on exactly what bstr is supposed to be, but it sounds a bit like one of the rejected proposals for having a single (Unicode-capable) str type that is implemented using different width encodings (Latin-1, UCS-2, UCS-4) underneath.
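Guido's taxonomy in code: a sketch of a *polymorphic* function (a hypothetical helper, not stdlib code). The implementation care he mentions is visible in the literal handling -- the constant has to be derived from the argument's type rather than written as a bare str literal.

```python
def strip_leading_dots(s):
    # polymorphic: text-in/text-out and bytes-in/bytes-out
    dot = '.' if isinstance(s, str) else b'.'
    while s[:1] == dot:        # length-1 slice keeps the code type-neutral
        s = s[1:]
    return s

print(strip_leading_dots('..name'))    # 'name'
print(strip_leading_dots(b'..name'))   # b'name'
```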
Re: [Python-Dev] bytes / unicode
At 05:49 PM 6/21/2010 +0100, Michael Foord wrote: Why is your proposed bstr wrapper not practical to implement outside the core and use in your own libraries and frameworks? __contains__ doesn't have a converse operation, so you can't code a type that works around this (Python 3.1 shown):

>>> from os.path import join
>>> join(b'x', 'y')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Python31\lib\ntpath.py", line 161, in join
    if b[:1] in seps:
TypeError: Type str doesn't support the buffer API
>>> join('y', b'x')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Python31\lib\ntpath.py", line 161, in join
    if b[:1] in seps:
TypeError: 'in <string>' requires string as left operand, not bytes

IOW, only one of these two cases can be worked around by using a bstr (or ebytes) that doesn't have support from the core string type. I'm not sure if the in operator is the only case where implementing such a type would fail, but it's the most obvious one. String formatting, of both the % and .format() varieties is another. (__rmod__ doesn't help if your bytes object is one of several data items in a tuple or dict -- the common case for % formatting.)
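The asymmetry PJE describes is easy to reproduce with a toy wrapper (ebytes here is a hypothetical class, not a real type): a user-defined type can make *itself* tolerant as a container, but `a in b` always dispatches to type(b), and there is no __rcontains__ for the member side to hook.

```python
class ebytes(bytes):
    # can make *itself* tolerant as a container...
    def __contains__(self, item):
        if isinstance(item, str):
            item = item.encode('ascii')
        return bytes.__contains__(self, item)

print('x' in ebytes(b'xyz'))   # True: our __contains__ runs

# ...but not as a member: containment dispatches to the container's
# type (str here), which rejects bytes-like left operands
try:
    ebytes(b'x') in 'xyz'
except TypeError as e:
    print(e)
```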
Re: [Python-Dev] bytes / unicode
On 6/21/2010 8:51 AM, Nick Coghlan wrote: [snip -- Nick's message quoted in full above] Now, if these bugs are already being reported against 3.1 and just aren't getting fixed, that's a completely different story... Some of the above have been, over a year ago. See, for instance, http://bugs.python.org/issue5468 I am getting the impression that the people who use the web modules tend, like me, to not have the tools to write and test patches. So they can squeak but not grease. Terry Jan Reedy
Re: [Python-Dev] bytes / unicode
At 12:56 PM 6/21/2010 -0400, Toshio Kuratomi wrote: [snip -- Toshio's comment quoted in full above] So a complete solution really should allow the programmer to pass in URIs as bytes when the programmer knows that they need it. ebytes(somebytes, 'garbage'), perhaps, which would be like ascii, but where combining with non-garbage would result in another 'garbage' ebytes?
Re: [Python-Dev] bytes / unicode
At 10:29 AM 6/21/2010 -0700, Guido van Rossum wrote:

Perhaps there are more situations where a polymorphic API would be helpful. Such APIs are not always so easy to implement, because they have to be careful with literals or other constants (and even more so mutable state) used internally -- but it can be done, and there are plenty of examples in the stdlib.

What if we could use the time machine to make the APIs that *were* polymorphic regain their previously-polymorphic status, without needing to actually *change* any of the code of those functions? That's what Barry's ebytes proposal would do, with appropriate coercion rules. Passing ebytes into such a function would yield back ebytes, even if the function used strings internally, as long as those strings could be encoded back to bytes using the ebytes' encoding. (Which would normally be the case, since stdlib constants are almost always ASCII, and the main use cases for ebytes would involve ascii-extended encodings.)

I'm still unclear on exactly what bstr is supposed to be, but it sounds a bit like one of the rejected proposals for having a single (Unicode-capable) str type that is implemented using different width encodings (Latin-1, UCS-2, UCS-4) underneath.

Not quite -- as modified by Barry's proposal (which I like better than mine), it'd be an object that just combines bytes with an attribute indicating the underlying encoding. When it interacts with strings, the strings are *encoded* to bytes, rather than upgrading the bytes to text. This is actually a big advantage for error detection in any application where you're working with data that *must* be encodable in a specific encoding for output, as it allows you to catch errors much *earlier* than you would if you only did the encoding at your output boundary. Anyway, this would not be the normal bytes type or string type; it's bytes with an encoding.
It's also more general than Unicode, in the sense that it allows you to work with character sets that don't really *have* a proper Unicode mapping. One issue I remember from my enterprise days is some of the Asian-language developers at NTT/Verio explaining to me that unicode doesn't actually solve certain issues -- that there are use cases where you really *do* need bytes plus encoding in order to properly express something. Unfortunately, I never quite wrapped my head around the idea; I just remember it had something to do with the fact that Unicode has single character codes that mean different things in different languages, such that you were actually losing information by converting to unicode, or something like that. (Or maybe the characters were expressed differently in certain encodings according to what language they came from, so you couldn't round-trip them through unicode without losing information. I think that's probably what it was; maybe somebody here can chime in more on that point.)

Anyway, a type like this would need to have at least a bit of support from the core language, because the str type would need to be able to handle at least the __contains__ and %/.format() coercion cases, since these functions don't have __r*__ equivalents that a user-implemented type could provide... and strings don't have anything like a '__coerce__' either. If sufficient hooks existed, then an ebytes could be implemented outside the stdlib, and still used within it.
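For concreteness, here is a minimal sketch of the kind of type being discussed: bytes carrying an encoding attribute, which coerces str operands *down* to bytes rather than promoting itself to text. Everything here (the name ebytes, the attribute, the coercion rule) is hypothetical; nothing like this exists in the stdlib, and real coercion with str literals would need the core-language hooks mentioned above.

```python
class ebytes(bytes):
    """Hypothetical 'bytes with an encoding' type sketched in this
    thread; illustrative only, not a real stdlib type."""

    def __new__(cls, data, encoding):
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def __add__(self, other):
        if isinstance(other, str):
            # Coerce str operands down to bytes using *our* encoding,
            # instead of upgrading ourselves to text. A str that cannot
            # be encoded raises UnicodeEncodeError right here -- the
            # "catch errors earlier than the output boundary" property.
            other = other.encode(self.encoding)
        return ebytes(bytes(self) + bytes(other), self.encoding)

    def __str__(self):
        # Decoding is always well-defined, since we carry our encoding.
        return bytes(self).decode(self.encoding)
```

For example, `ebytes(b'caf', 'utf-8') + 'é'` yields an ebytes equal to `b'caf\xc3\xa9'` that still remembers it is utf-8, while `ebytes(b'x', 'ascii') + 'é'` fails immediately with UnicodeEncodeError.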
Re: [Python-Dev] bytes / unicode
2010/6/21 Stephen J. Turnbull step...@xemacs.org:

Robert Collins writes: Also, url's are bytestrings - by definition;

Eh? RFC 3896 explicitly says...

RFC 3896 is "Definitions of Managed Objects for the DS3/E3 Interface Type". Perhaps you mean 3986? :)

"A URI is an identifier consisting of a sequence of characters matching the syntax rule named URI in Section 3." (where the phrase "sequence of characters" appears in all ancestors I found back to RFC 1738), and...

Sure, ok, let me unpack what I meant just a little. An abstract URI is neither unicode nor bytes per se - see section 1.2.1: "A URI is a sequence of characters from a very limited set: the letters of the basic Latin alphabet, digits, and a few special characters."

URI interpretation is fairly strictly separated between producers and consumers. A consumer can manipulate a url with other url fragments - e.g. doing urljoin. But it needs to keep the url as a url and not try to decode it to a unicode representation. The producer of the url, however, can decode via whatever heuristics it wants - because it defines the encoding used to go from unicode to URL encoding. As an example, if I give the uri http://server/%c3%83, rendering that as http://server/Ã can lead to transcription errors and reinterpretation problems unless you know - out of band - that the server is using utf-8 to encode. Conversely, if someone enters http://server/Ã in their browser window, choosing utf-8 or their local encoding is quite arbitrary and may not match how the server would represent that resource. Beyond that, producers can do odd things - like when there are a series of servers stacked and forwarding requests amongst themselves - where they generate different parts of the same URL using different encodings.

2. Characters

The URI syntax provides a method of encoding data, presumably for the sake of identifying a resource, as a sequence of characters. The URI characters are, in turn, frequently encoded as octets for transport or presentation.
This specification does not mandate any particular character encoding for mapping between URI characters and the octets used to store or transmit those characters. When a URI appears in a protocol element, the character encoding is defined by that protocol; without such a definition, a URI is assumed to be in the same character encoding as the surrounding text.

That's true, but it's been taken out of context; the set of characters permitted in a URL is a strict subset of the characters found in ASCII; there is a BNF that defines it, and it is quite precise. While it doesn't define a set of octets, it also doesn't define support for unicode characters - individual schemes need to define the mapping used between characters defined as safe and those that get percent encoded. E.g. unicode (abstract) -> utf-8 -> percent encoded. See also the section on comparing URLs - Unicode isn't at all relevant.

if the standard library has made them unicode objects in 3, I expect a lot of pain in the webserver space.

Yup. But pain is inevitable if people are treating URIs (whether URLs or otherwise) as octet sequences.

Then your base URL is gonna be b'mailto:step...@xemacs.org', but the natural thing the UI will want to do is formurl = baseurl + '?subject=うるさいやつだなぁ…' IMO, the UI is right. Something like the above ought to work.

I wish it would. The problem is not in Python here though - and casual handwaving will exacerbate it, not fix it. Modelling URLs as string-like things is great from a convenience perspective, but, like file paths, they are much more complex and difficult. For your particular case, subject contains characters outside the URL specification, so someone needs to choose an encoding to get them into a sequence-of-bytes-that-can-be-percent-escaped. Section 2.5, "identifying data", goes into this to some degree. Note a trap - the last paragraph says "when a *NEW* URI scheme..." (emphasis mine). Existing schemes do not mandate UTF-8, which is why the producer/consumer split matters.
I spent a few minutes looking, but it's lost in the minutiae somewhere - HTTP does not specify UTF-8 (though I wish it would) for its URIs, and std66 is the generic definition and rules for new URI schemes, preserving intact the mistake of HTTP.

So the function that actually handles composing the URL should take a string (i.e., unicode), and do all escaping. The UI code should not need to know about escaping. If nothing escapes except the function that puts the URL in composed form, and that function always escapes, life is easy.

Arg. The problem is very similar to the file system problem:
- We get given a sequence of bytes
- we have some rules that will let us manipulate the sequence to get hostnames, query parameters and so forth
- and others to let us walk a directory structure
- and no guarantee that any of the data is in any particular encoding other than 'URL'. In
Re: [Python-Dev] bytes / unicode
On 6/21/2010 1:29 PM, P.J. Eby wrote:

At 05:49 PM 6/21/2010 +0100, Michael Foord wrote: Why is your proposed bstr wrapper not practical to implement outside the core and use in your own libraries and frameworks?

__contains__ doesn't have a converse operation, so you can't code a type that works around this (Python 3.1 shown):

from os.path import join
join(b'x', 'y')
join('y', b'x')

I am really unclear what result you intend for such mixed pairs, for all possible mixed pairs, sensible or not. It would seem to me best to write your own pjoin function that did exactly what you want over the whole input domain.

-- Terry Jan Reedy
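The missing "converse operation" Eby refers to can be seen with a toy wrapper class (invented here for illustration): defining __contains__ helps when the wrapper is the container, but when the wrapper appears as the *left* operand of `in` against a plain str, str.__contains__ runs and the wrapper gets no hook, because `in` has no reflected protocol the way __add__/__radd__ does.

```python
class Tagged:
    """Toy str-like wrapper (hypothetical), demonstrating the
    asymmetry of the `in` operator."""

    def __init__(self, s):
        self.s = s

    def __contains__(self, item):
        # Consulted only when Tagged is the *container* (right operand).
        return item in self.s

t = Tagged("hello")
print("ell" in t)          # True: Tagged.__contains__ runs
try:
    t in "hello world"     # str.__contains__ runs and rejects non-str
except TypeError as e:
    print("TypeError:", e)
```

This is why Eby argues such a type needs core-language support: a third-party type cannot intercept `myobj in some_str` no matter what methods it defines.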
Re: [Python-Dev] bytes / unicode
On 6/21/2010 1:29 PM, Guido van Rossum wrote:

Actually, the big problem with Python 2 is that if you mix str and unicode, things work or crash depending on whether any of the str objects involved contain non-ASCII bytes. If one API decides to upgrade to Unicode, the result, when passed to another API, may well cause a UnicodeError because not all arguments have had the same treatment. Now, the APIs are neither safe nor aware -- if you pass bytes in, you get unpredictable results back.

This seems an overgeneralization of a particular bug.

There are APIs that are strictly text-in, text-out. There are others that are bytes-in, bytes-out. Let's call all those *pure*. For some operations it makes sense that the API is *polymorphic*, with which I mean that text-in causes text-out, and bytes-in causes bytes-out. All of these are fine. Perhaps there are more situations where a polymorphic API would be helpful. Such APIs are not always so easy to implement, because they have to be careful with literals or other constants (and even more so mutable state) used internally -- but it can be done, and there are plenty of examples in the stdlib.

The real problem apparently lies in (what I believe is only a few rare) APIs that are text-or-bytes-in and always-text-out (or always-bytes-out). Let's call them *hybrid*. Clearly, mixing hybrid APIs in a stream of pure or polymorphic API calls is a problem, because they turn a pure or polymorphic overall operation into a hybrid one. There are also text-in, bytes-out or bytes-in, text-out APIs that are intended for encoding/decoding of course, but these are in a totally different class. Abstractly, it would be good if there were as few as possible hybrid APIs, many pure or polymorphic APIs (which it should be in a particular case is a pragmatic choice), and a limited number of encoding/decoding APIs, which should generally be invoked at the edges of the program (e.g., I/O).

Nice summary of part of the 'why' for Python3.
I still believe that the instances of bytes silently succeeding *some* of the time refers to specific bugs in specific APIs, either intentional because of misguided compatibility desires, or accidental in the haste of trying to convert the entire stdlib to Python 3 in a finite time.

I think http://bugs.python.org/issue5468 reports one aspect of haste: missing encoding and errors parameters. But it has not gotten much attention.

-- Terry Jan Reedy
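Guido's point about polymorphic APIs having to "be careful with literals" is concrete in the stdlib: any internal constant must match the argument's type. A hedged sketch of the pattern (the function itself is invented for illustration; the type-matched-constant trick is the same one modules like posixpath use for their separators):

```python
def strip_fragment(url):
    """Drop a '#fragment' suffix. Polymorphic in Guido's sense:
    str in -> str out, bytes in -> bytes out. Illustrative only."""
    # A bare '#' literal would raise TypeError on bytes input, so the
    # constant is selected to match the argument's type.
    hash_sign = '#' if isinstance(url, str) else b'#'
    head, _, _ = url.partition(hash_sign)
    return head

print(strip_fragment('http://h/p#frag'))    # str in, str out
print(strip_fragment(b'http://h/p#frag'))   # bytes in, bytes out
```

A *hybrid* API, by contrast, would return str (or bytes) regardless of the input type, which is exactly what turns an otherwise pure pipeline into an unpredictable one.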
Re: [Python-Dev] bytes / unicode
Toshio Kuratomi writes:

One comment here -- you can also have uris that aren't decodable into their true textual meaning using a single encoding. Apache will happily serve out uris that have utf-8, shift-jis, and euc-jp components inside of their path, but the textual representation that was intended will be garbled (or be represented by escaped byte sequences). For that matter, apache will serve requests that have no true textual representation, as it is working on the byte level rather than the character level.

Sure. I've never seen that combination, but I have seen Shift JIS and KOI8-R in the same path. But in that case, just using 'latin-1' as the encoding allows you to use the (unicode) string operations internally, and then spew your mess out into the world for someone else to clean up, just as using bytes would.

So a complete solution really should allow the programmer to pass in uris as bytes when the programmer knows that they need it.

Other than passing bytes into a constructor, I would argue that if a "complete solution" requires, e.g., an interface that allows urljoin(base, subdir) where the types of base and subdir are not required to match, then it doesn't belong in the stdlib. For stdlib usage, that's premature optimization IMO. The RFC says that URIs are text, and therefore they can (and IMO should) be operated on as text in the stdlib.

It's not just a matter of manipulating the URIs themselves, where working directly on bytes will work just as well and with the same string operations (as long as everything is bytes). It's also a question of API complexity (e.g., Barry's bugaboo of proliferation of encoding= parameters) and of debugging (if URIs are internally str, then they will display sanely in tracebacks and the interpreter). The cases where URIs can't be sanely treated as text are garbage input, and the stdlib should not try to provide a solution. Just passing in bytes and getting out bytes is GIGO.
Trying to do some error-checking is going to be insufficient much of the time and overly strict most of the rest of the time. The programmer in the trenches is going to need to decide what to allow and what not; I don't think there are general answers, because we know that allowing random URLs on the web leads to various kinds of problems. Some sites will need to address some of them.

Note also that the "complete solution" argument cuts both ways. E.g., a complete solution should implement UTS 39 confusables detection[1] and IDNA[2]. Good luck doing that with bytes!

If you *need* bytes (rather than simply trying to avoid conversion overhead), you're in a hazmat handling situation. Passing bytes in to stdlib APIs here is the equivalent of carrying around kilograms of fissionables in an open bucket. While the Tokaimura comparison is hyperbole, it can't be denied that use of bytes here shortcuts a lot of processing strongly suggested by the RFCs, and prevents use of various programming conveniences (such as reasonable display of URI values in debugging). Does the efficiency really justify including that in the stdlib? I dunno, I'm not a web programmer in the trenches. But I take my cue from MvL and MAL, who don't seem real enthusiastic about this. And as Martin says, there is as yet no evidence offered that the overhead of conversion is a general problem.

Footnotes:
[1] http://www.unicode.org/reports/tr39/
[2] http://www.rfc-editor.org/rfc/rfc3490.txt
Re: [Python-Dev] bytes / unicode
Robert Collins writes:

Perhaps you mean 3986 ? :)

Thank you for the correction.

"A URI is an identifier consisting of a sequence of characters matching the syntax rule named URI in Section 3." (where the phrase "sequence of characters" appears in all ancestors I found back to RFC 1738), and...

Sure, ok, let me unpack what I meant just a little. An abstract URI is neither unicode nor bytes per se - see section 1.2.1: "A URI is a sequence of characters from a very limited set: the letters of the basic Latin alphabet, digits, and a few special characters."

My position is that this describes the network protocol, not the abstract URI. It in no way suggests that uri-encoded forms should be handled internally. And the RFC explicitly says this is text, and therefore sanctions the user- and programmer-friendly practice of doing internal processing as text. Note that in a hypothetical bytes-oriented API

base = convert_uri_to_wire_format('http://www.example.org/')
formuri = uri_join(base, b'/home/steve/public_html')

the bytes literal b'/home/steve/public_html' clearly is intended as readable text. This is mixing types in the programmer's mind, even though base is internally in bytes format and the relative URI is also in bytes format. This is un-Pythonic IMO.

URI interpretation is fairly strictly separated between producers and consumers. A consumer can manipulate a url with other url fragments - e.g. doing urljoin. But it needs to keep the url as a url and not try to decode it to a unicode representation.

Unfortunately, outside of Kansas and Canberra, it don't work that way. How do you propose to uri_join base as above and '/home/スティーブ/public_html'? Encoding and/or decoding must be done somewhere, and it would be damn unfriendly to make the browser user do it!
In the bytes-oriented API, the programmer must be continually making decisions about whether and how to handle non-ASCII components from outside (or, more likely, cursing the existence of the damned foreigners, and then ignoring the possibility ... let them eat UnicodeException!).

As an example, if I give the uri http://server/%c3%83, rendering that as http://server/Ã can lead to transcription errors and reinterpretation problems unless you know - out of band - that the server is using utf-8 to encode. Conversely, if someone enters http://server/Ã in their browser window, choosing utf-8 or their local encoding is quite arbitrary and may not match how the server would represent that resource.

Sure. Using bytes doesn't solve either problem. It just allows you to wash your hands of it and pass it on to someone else, who probably has even less information than you do. E.g., passing the uri http://server/%c3%83 to someone else without telling them the encoding means that effectively they're limited to ASCII if they want to append meaningful relative paths without guessing the encoding. In the case of the user entering http://server/Ã, you have to do *something* to produce bytes eventually. When was the last time you typed %c3%83 at the end of a URL in a browser address field?

2. Characters

The URI syntax provides a method of encoding data, presumably for the sake of identifying a resource, as a sequence of characters. The URI characters are, in turn, frequently encoded as octets for transport or presentation. This specification does not mandate any particular character encoding for mapping between URI characters and the octets used to store or transmit those characters. When a URI appears in a protocol element, the character encoding is defined by that protocol; without such a definition, a URI is assumed to be in the same character encoding as the surrounding text.
That's true, but it's been taken out of context; the set of characters permitted in a URL is a strict subset of characters found in ASCII;

No. Again, you're confounding the URL with its network format. There's no question that the network format is in bytes, and before putting the URI into a wire protocol, you need to encode non-URI characters. However, the abstract URI is text, and may not even be represented by octets or Unicode at all (e.g., represented by carbon residue on recycled wood pulp).

See also the section on comparing URLs - Unicode isn't at all relevant.

Not to the RFC, which talks about *characters* and gives examples that imply transcoding (e.g., between EBCDIC and UTF-16); see the section you cite. However, Unicode is the canonical representation of text inside Python, and therefore TOOWTDI for URL comparison in Python. Thank you for that killer argument for my position; I hadn't thought of it.

I wish it would. The problem is not in Python here though - and casual handwaving will exacerbate it, not fix it.

Using bytes "because we just don't know" is exactly casual handwaving. Well, maybe not
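The ambiguity in the http://server/%c3%83 example can be shown directly with urllib.parse on Python 3: percent-encoding is only invertible if the encoding is known out of band, exactly as both sides of this exchange observe.

```python
from urllib.parse import quote, unquote

# 'Ã' (U+00C3) percent-encoded via UTF-8 gives the two escapes from
# the example; via Latin-1 it is a single escape.
print(quote('Ã', safe=''))                       # '%C3%83'
print(quote('Ã', safe='', encoding='latin-1'))   # '%C3'

# Going the other way, '%c3%83' decodes differently depending on the
# encoding assumed -- the out-of-band knowledge problem.
print(unquote('%c3%83'))                      # 'Ã' under the utf-8 default
print(unquote('%c3%83', encoding='latin-1'))  # 'Ã\x83' under latin-1
```

Both quote() and unquote() take encoding and errors parameters in Python 3, which is the stdlib's way of making the producer's choice explicit.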
Re: [Python-Dev] bytes / unicode
On Jun 21, 2010, at 2:17 PM, P.J. Eby wrote:

One issue I remember from my enterprise days is some of the Asian-language developers at NTT/Verio explaining to me that unicode doesn't actually solve certain issues -- that there are use cases where you really *do* need bytes plus encoding in order to properly express something.

The thing that I have heard in passing from a couple of folks with experience in this area is that some older software in Asia would present characters differently if they were originally encoded in a Japanese encoding versus a Chinese encoding, even though they were really the same characters. I do know that Han unification is a giant political mess (http://en.wikipedia.org/wiki/Han_unification makes for some interesting reading), but my understanding is that it has handled enough of the cases by now that one can write software to display Asian languages and it will basically work with a modern version of unicode. (And of course, there's always the private use area, as Stephen Turnbull pointed out.)

Regardless, this is another example where keeping around a string isn't really enough. If you need to display a Japanese character in a distinct way because you are operating in the Japanese *script*, you need a tag surrounding your data that is a hint to its presentation. The fact that these presentation hints were sometimes determined by their encoding is an unfortunate historical accident.
Re: [Python-Dev] bytes / unicode
2010/6/20 Antoine Pitrou solip...@pitrou.net:

On Sun, 20 Jun 2010 14:40:56 -0400 P.J. Eby p...@telecommunity.com wrote: Actually, I would say that it's more that (in the network protocol case) we *have* bytes, some of which we would like to *treat* as text, yet do not wish to constantly convert back and forth to full-blown unicode

Well, then why don't you just stick with a bytes object?

There are not many tools for treating bytes as text.

-- Regards, Benjamin
Re: [Python-Dev] bytes / unicode
Also, url's are bytestrings - by definition; if the standard library has made them unicode objects in 3, I expect a lot of pain in the webserver space.

-Rob
Re: [Python-Dev] bytes / unicode
On 6/20/2010 5:55 PM, Benjamin Peterson wrote:

2010/6/20 Antoine Pitrou solip...@pitrou.net: On Sun, 20 Jun 2010 14:40:56 -0400 P.J. Eby p...@telecommunity.com wrote: Actually, I would say that it's more that (in the network protocol case) we *have* bytes, some of which we would like to *treat* as text, yet do not wish to constantly convert back and forth to full-blown unicode

Well, then why don't you just stick with a bytes object?

There are not many tools for treating bytes as text.

If one writes a function (most easily in Python)
1. in terms of the methods and operations shared by unicode and bytes, which is nearly all of them, and
2. does not gratuitously (and dare I say, unpythonically) do a class check to unnecessarily exclude one or the other, and
3. does not specialize by assuming only one of the possible values for type-specific constants, such as the number of chars/codes, and
4. does not do something unicode-specific such as normalization,
then the function should be agnostic and operate generically.

I think there was some temptation to be 'pure' and limit text methods to str and enforce the decode-manipulate-encode paradigm (which is extremely common in various forms, and nothing unusual). But for practicality and efficiency, that was not done.

Do you have in mind any tools that could and should operate on both, but do not? (I realize that at the C level, code is not just specialized to 'unicode', but to 2-byte versus 4-byte representations.)

Terry Jan Reedy
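A function meeting all four conditions above needs no class check at all: no-argument split() and slicing are shared by str and bytes, and s[:0] yields an empty value of whichever type came in. A small illustrative example (invented here, not from the thread):

```python
def first_token(s):
    """Return the first whitespace-separated token of a str or bytes
    value, or an empty value of the input's own type."""
    parts = s.split()   # no-argument split works on both types
    # s[:0] is '' for str and b'' for bytes, so even the fallback
    # constant is derived from the input rather than hard-coded.
    return parts[0] if parts else s[:0]

print(first_token('  hello world '))    # str in, str out
print(first_token(b'  hello world '))   # bytes in, bytes out
```

The function is polymorphic (str in, str out; bytes in, bytes out) purely by writing to the shared subset of the two APIs, which is Terry's point.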
Re: [Python-Dev] bytes / unicode
At 07:33 PM 6/20/2010 -0400, Terry Reedy wrote:

Do you have in mind any tools that could and should operate on both, but do not?

From http://mail.python.org/pipermail/web-sig/2009-September/004105.html :

"The problem which arises is that unquoting of URLs in Python 3.X stdlib can only be done on unicode strings. If though a string contains non UTF-8 encoded characters it can fail."

I don't have any direct experience with the specific issue demonstrated in that post, but in the context of the discussion as a whole, I understood the overall issue as "if you pass bytes to certain stdlib functions, you might get back unicode, an explicit error, or (at least in the case shown above) something that's just plain wrong."
Re: [Python-Dev] bytes / unicode
At 11:47 PM 6/20/2010 +0200, Antoine Pitrou wrote:

On Sun, 20 Jun 2010 14:40:56 -0400 P.J. Eby p...@telecommunity.com wrote: Actually, I would say that it's more that (in the network protocol case) we *have* bytes, some of which we would like to *treat* as text, yet do not wish to constantly convert back and forth to full-blown unicode

Well, then why don't you just stick with a bytes object?

Because the stdlib is not consistent in how well it handles bytes objects.

While reading over this thread, I'm wondering whether at least my (WSGI-related) problems in this area would be solved by the availability of a type (say bstr) that was simply a wrapper providing string-like behavior over an underlying bytes, byte array, or memoryview, that would produce objects of compatible type when combined with strings (by encoding them to match).

This really sounds horrible. Python 3 was designed precisely to discourage ad hoc mixing of bytes and unicode.

Who said ad hoc mixing? The point is to have a simple way to ensure that my bytes don't get implicitly converted to unicode, and (ideally) don't have to get converted *back*, either. The idea that by passing bytes to the stdlib, I randomly get back either bytes or unicode (i.e. undocumentedly and inconsistently between different stdlib APIs, as well as possibly dependent on runtime conditions), is NOT discouraging ad hoc mixing.

seems so much saner than writing *this* everywhere:

newurl = str(urljoin(str(base, 'latin-1'), 'subdir'), 'latin-1')

urljoin already returns an str object. Why do you want to decode it again?

Ugh. I meant:

newurl = urljoin(str(base, 'latin-1'), 'subdir').encode('latin-1')

Which just goes to the point of how ridiculous it is to have to convert things to strings and back again to use APIs that ought to just handle bytes properly in the first place.
(I don't know if there are actually any problems in the case of urljoin; I wasn't the person who originally brought up the "stdlib not treating URLs as bytestrings in 3.x" issue on the Web-SIG. Somewhere along the line I got the impression that urljoin was one such API, but in researching the issue it looks like maybe the canonical example was parse_qsl.)

It's possible that the stdlib situation has improved tremendously since then, of course. I don't know if the bug was reported, or how many remain. And it's precisely the part where I don't know how many remain that keeps me from doing more than idly thinking about porting any of my libraries (let alone apps) to Python 3.x. The fact that the stdlib itself has these sorts of issues raises major red flags to me about whether the One Obvious Way has yet been found. If the stdlib maintainers don't agree on the One Obvious Way, that seems even worse. Or if there is such a Way, but nobody has documented its practices yet, that's almost the same thing.

I also find it weird that there seem to be two camps on this subject, one of which claims that All Is Well And There Is No Problem -- but I do not recall seeing anyone who was in the "What do I do; this doesn't seem ready" camp who switched sides and took the time to write down what made them realize that they were wrong about there being a problem, and what steps they had to take. The existence of one or more such documents would certainly ease my mind, and I imagine that of other people who are less waiting for others' libraries than for the stdlib (and/or language) itself to settle. (Or more precisely, for it to be SEEN to have settled.)
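For what it's worth, the corrected one-liner from the exchange above does work on Python 3, because latin-1 maps every byte 0-255 to the same code point and back losslessly (the base URL here is an invented example):

```python
from urllib.parse import urljoin

base = b'http://example.com/a/b/'  # bytes straight off the wire
# Decode via latin-1 (lossless for arbitrary bytes), join as text,
# then encode straight back to bytes -- the dance being objected to.
newurl = urljoin(str(base, 'latin-1'), 'subdir').encode('latin-1')
print(newurl)   # b'http://example.com/a/b/subdir'
```

The round trip is correct but verbose, which is precisely Eby's complaint: the conversion carries no information, it only satisfies the type signatures.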
Re: [Python-Dev] bytes / unicode
On 6/20/2010 9:33 PM, P.J. Eby wrote:

At 07:33 PM 6/20/2010 -0400, Terry Reedy wrote: Do you have in mind any tools that could and should operate on both, but do not? From http://mail.python.org/pipermail/web-sig/2009-September/004105.html :

Thanks for the concrete examples in this and your other post. I am cc-ing the author of the above.

"The problem which arises is that unquoting of URLs in Python 3.X stdlib can only be done on unicode strings."

Actually, I believe this is an encoding rather than a bytes-versus-unicode issue.

"If though a string contains non UTF-8 encoded characters it can fail."

Which is to say, I believe, if the ascii text in the (unicode) string has a % encoding of a byte that is not a legal utf-8 encoding of anything. The specific example is

urllib.parse.parse_qsl('a=b%e0')
[('a', 'b�')]

where the character after 'b' is a white '?' in a dark diamond, indicating an error.

parse_qsl() splits that input on '=' and sends each piece to urllib.parse.unquote. unquote() attempts to "Replace %xx escapes by their single-character equivalent." unquote has an encoding parameter that defaults to 'utf-8' in *its* call to .decode. parse_qsl does not have an encoding parameter. If it did, and it passed that to unquote, then the above example would become (simulated interaction)

urllib.parse.parse_qsl('a=b%e0', encoding='latin-1')
[('a', 'bà')]

I got that output by copying the file and adding encoding='latin-1' to the unquote call. Does this solve this problem? Has anything like this been added for 3.2? Should it be?

I don't have any direct experience with the specific issue demonstrated in that post, but in the context of the discussion as a whole, I understood the overall issue as "if you pass bytes to certain stdlib functions, you might get back unicode, an explicit error, or (at least in the case shown above) something that's just plain wrong."

As indicated above, I so far think that the problem is with the application of the new model, not the model itself.
Just for 'fun', I tried feeding bytes to the function:

p.parse_qsl(b'a=b%e0')
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    p.parse_qsl(b'a=b%e0')
  File "C:\Programs\Python31\lib\urllib\parse.py", line 377, in parse_qsl
    pairs = [s2 for s1 in qs.split('&') for s2 in s1.split(';')]
TypeError: Type str doesn't support the buffer API

I do not know if that message is correct, but certainly trying to split bytes with unicode is (now, at least) a mistake. This could be 'fixed' by replacing the typed literals with expressions that match the type of the input. But I am not sure if that is sensible, since the next step is to unquote and decode to unicode anyway. I just do not know the use case.

Terry Jan Reedy
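As it turned out, parse_qsl did grow encoding and errors parameters (in Python 3.2), so Terry's "simulated interaction" runs as-is on modern versions:

```python
from urllib.parse import parse_qsl

# Default: escapes are decoded as UTF-8 with errors='replace', so the
# lone %e0 byte comes back as U+FFFD, the replacement character --
# the '?' in a dark diamond described above.
print(parse_qsl('a=b%e0'))                     # [('a', 'b\ufffd')]

# With the encoding stated explicitly, the byte decodes as intended.
print(parse_qsl('a=b%e0', encoding='latin-1')) # [('a', 'bà')]
```

This is an instance of the "APIs that need to grow encoding keyword arguments that they then pass on to the functions they call" pattern from earlier in the thread.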
Re: [Python-Dev] bytes / unicode
On Sun, Jun 20, 2010 at 23:55, Benjamin Peterson benja...@python.org wrote:

There are not many tools for treating bytes as text.

Well, what tools would you need that can be used also on bytes? Bytes objects have a lot of the same methods that strings do, and that will cover 99% of the cases. Most text tools assume that the text really is text, and much of it doesn't make sense unless you've converted it to Unicode first. But most of the things you would need to do, such as in a web server, don't really involve treating the text as something linguistic; it's a matter of replacing and escaping and such, and that could be done while the text is in bytes form. But the tools for that exist... Is there some specific tool that is missing?

-- Lennart Regebro: http://regebro.wordpress.com/ Python 3 Porting: http://python3porting.com/ +33 661 58 14 64