On 16 July 2013 19:18, Terry Reedy <tjre...@udel.edu> wrote:
> I wonder if the change was an artifact of changing the code to prohibit
> mixing Unicode and bytes.

I'm pretty sure the only thing we changed in 3.x was to migrate re
to the PEP 3118 buffer API, and the behavioural change Guido is seeing
is actually the one between the 2.x buffer (which returns 8-bit
strings when sliced) and other types (including memoryview) which
return instances of themselves.
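
To illustrate the slicing side of that on 3.x, here is a minimal sketch
(the variable names are just for illustration, and it only shows plain
slicing, not what re itself does internally):

```python
# Slicing a bytearray or a memoryview gives back the same type,
# unlike the 2.x buffer, which gave back an 8-bit string.
data = bytearray(b"aaabbbcccddd")

ba_slice = data[9:12]              # copies the three bytes into a new bytearray
mv_slice = memoryview(data)[9:12]  # a view onto the same memory, no copy

print(type(ba_slice).__name__)     # bytearray
print(type(mv_slice).__name__)     # memoryview
print(bytes(mv_slice))             # b'ddd'
```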

Getting the old buffer behaviour in 3.x without an extra copy
operation should just be a matter of wrapping the input with
memoryview (to avoid copying the group elements in the match object)
and the output with bytes (to avoid keeping the entire original object
alive just to reference a few small pieces of it that were matched by
the regex):

>>> import re
>>> data = bytearray(b"aaabbbcccddd")
>>> re.match(b"(a*)b*c*(d*)", data).group(2)
bytearray(b'ddd')
>>> bytes(re.match(b"(a*)b*c*(d*)", memoryview(data)).group(2))
b'ddd'

Given that, I'm inclined to keep the existing behaviour on backwards
compatibility grounds. To make the above code work on both 2.x *and*
3.x without making an extra copy, it's possible to keep the bytes call
(it should be a no-op on 2.x) and dynamically switch the type used to
wrap the input between buffer in 2.x and memoryview in 3.x
(unfortunately, the 2.x memoryview doesn't work for this case, as the
2.x re API doesn't accept it as valid input).
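
As a rough sketch of that switch (the names wrap_input and
match_group_bytes are made up for this example and are not part of the
re module; I'm assuming a simple major-version check is good enough):

```python
import re
import sys

# Choose the zero-copy wrapper for the running interpreter:
# memoryview on 3.x, the old buffer type on 2.x.
if sys.version_info[0] >= 3:
    wrap_input = memoryview
else:
    wrap_input = buffer  # 2.x-only builtin; this branch never runs on 3.x


def match_group_bytes(pattern, data, group):
    """Hypothetical helper: match against data without copying it up
    front, then return the requested group as an independent bytes
    object (or None if the pattern does not match)."""
    m = re.match(pattern, wrap_input(data))
    if m is None:
        return None
    # bytes() detaches the result from the original object, so a small
    # match does not keep a large input alive.
    return bytes(m.group(group))


data = bytearray(b"aaabbbcccddd")
result = match_group_bytes(b"(a*)b*c*(d*)", data, 2)
print(result)
```

The bytes call is kept unconditionally so the caller sees the same
result type whichever wrapper was used.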

Cheers,
Nick.

--
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia