Hi,

On Thursday 09 August 2007 06:07:12 Guido van Rossum wrote:
> A quick temporary hack is to use buffer(b'abc') instead. (buffer() is
> so incredibly broken that it lets you hash() even if the underlying
> object is broken. :-)

I prefer str8, which looks like a good candidate for a "frozenbytes" type.

> The correct solution is to fix the re library to avoid using hash()
> directly on the underlying data type altogether; that never had sound
> semantics (as proven by the buffer() hack above).

The re module uses a dictionary to cache compiled expressions. The key is a 
tuple (pattern, flags), where pattern is a bytes (str8) or str object and 
flags is an int.
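
To make the hashing problem concrete, here is a minimal sketch of such a
cache, written for current Python 3 rather than for the py3k branch discussed
here. It assumes bytearray as a stand-in for the mutable bytes type and
immutable bytes as a stand-in for str8; _cache and cached_compile are
hypothetical names, not the re module's internals.

```python
import re

_cache = {}  # hypothetical cache: (pattern, flags) -> compiled pattern

def cached_compile(pattern, flags=0):
    """Compile pattern once per (pattern, flags) key and reuse the result."""
    if isinstance(pattern, bytearray):
        # A mutable byte sequence is unhashable, so the tuple below could not
        # be used as a dict key. Freeze it first (the patch uses str8 for this).
        pattern = bytes(pattern)
    key = (pattern, flags)
    if key not in _cache:
        _cache[key] = re.compile(pattern, flags)
    return _cache[key]
```

With this sketch, cached_compile(bytearray(b"a+")) and cached_compile(b"a+")
share one cache entry, whereas hash((bytearray(b"a+"), 0)) on its own raises
TypeError.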

re module bugs:
 1. _compile() doesn't support bytes
 2. escape() doesn't support bytes

My attached patch fixes both bugs:
 - convert bytes to str8 in compile(), before it calls _compile(), so that
   the pattern can be hashed
 - add a special version of escape() for bytes

I don't know the best way to build a bytes object in a for loop. In Python 
2.x, the best way to build a string is to collect the pieces in a list and 
call ''.join(). Since bytes is mutable, I chose to use append() and in-place 
concatenation (a += b).
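
As an illustration of the append() and += approach, here is a sketch of a
bytes-only escape for current Python 3. It assumes bytearray as the mutable
buffer (today's bytes is immutable); escape_bytes and _ALNUM_BYTES are
hypothetical names, and this is not the code from the attached patch.

```python
# Byte values of the ASCII letters and digits. Iterating over a bytes
# object yields ints, so the set holds ints as well.
_ALNUM_BYTES = frozenset(
    b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")

def escape_bytes(pattern):
    """Backslash-escape every non-alphanumeric byte in pattern."""
    out = bytearray()
    for c in pattern:            # c is an int in range(256)
        if c in _ALNUM_BYTES:
            out.append(c)
        elif c == 0:
            out += b"\\000"      # NUL is written as an octal escape
        else:
            out.append(0x5C)     # 0x5C is the backslash
            out.append(c)
    return bytes(out)            # freeze the buffer before returning it
```

For example, escape_bytes(b"a.b") gives b"a\\.b". Collecting the pieces in a
list and calling b"".join() at the end would work too; the mutable buffer
just avoids creating one small object per byte.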

I also added a new unit test for the escape() function with a bytes argument.

You may not want to apply my patch directly: I don't know Python 3000 or the 
Python coding style very well. But my patch should help to fix the bugs ;-)

-----

Why does the re module still have code for Python < 2.2 (the optional 
finditer() function)? Since the code is now specific to Python 3000, we 
should use new types like set (a set for _alphanum instead of a dictionary) 
and functions like enumerate (in escape(), for the str block).
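
A rough sketch of what that cleanup could look like for the str case, in
current Python 3. The names escape_str and _ALNUM are hypothetical; the
logic mirrors the existing str branch of escape(), only with a set lookup
and enumerate().

```python
# A set replaces the {char: 1} dictionary used for membership tests.
_ALNUM = frozenset(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")

def escape_str(pattern):
    """Backslash-escape every non-alphanumeric character in pattern."""
    s = list(pattern)
    # enumerate() yields the index and the character together, replacing
    # the range(len(pattern)) loop and the pattern[i] lookup.
    for i, c in enumerate(pattern):
        if c not in _ALNUM:
            s[i] = "\\000" if c == "\000" else "\\" + c
    return "".join(s)
```

For example, escape_str("a.b") gives "a\\.b", and a NUL character becomes the
four-character sequence backslash, 0, 0, 0.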

Victor Stinner
http://hachoir.org/
Index: Lib/re.py
===================================================================
--- Lib/re.py	(revision 56838)
+++ Lib/re.py	(working copy)
@@ -177,6 +177,9 @@
 
 def compile(pattern, flags=0):
     "Compile a regular expression pattern, returning a pattern object."
+    if isinstance(pattern, bytes):
+        # Use str8 instead of bytes because bytes isn't hashable
+        pattern = str8(pattern)
     return _compile(pattern, flags)
 
 def purge():
@@ -193,18 +196,34 @@
     _alphanum[c] = 1
 del c
 
+_alphanum_bytes = set(b'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890')
+
 def escape(pattern):
     "Escape all non-alphanumeric characters in pattern."
-    s = list(pattern)
-    alphanum = _alphanum
-    for i in range(len(pattern)):
-        c = pattern[i]
-        if c not in alphanum:
-            if c == "\000":
-                s[i] = "\\000"
+    if isinstance(pattern, bytes):
+        alphanum = _alphanum_bytes
+        s = b''
+        for c in pattern:
+            if c not in alphanum:
+                if not c:
+                    s += b"\\000"
+                else:
+                    s.append(92)
+                    s.append(c)
             else:
-                s[i] = "\\" + c
-    return pattern[:0].join(s)
+                s.append(c)
+        return s
+    else:
+        alphanum = _alphanum
+        s = list(pattern)
+        for i in range(len(pattern)):
+            c = pattern[i]
+            if c not in alphanum:
+                if c == "\000":
+                    s[i] = "\\000"
+                else:
+                    s[i] = "\\" + c
+        return ''.join(s)
 
 # --------------------------------------------------------------------
 # internals
Index: Lib/test/test_re.py
===================================================================
--- Lib/test/test_re.py	(revision 56838)
+++ Lib/test/test_re.py	(working copy)
@@ -397,18 +397,32 @@
         self.assertEqual(re.search("\s(b)", " b").group(1), "b")
         self.assertEqual(re.search("a\s", "a ").group(0), "a ")
 
-    def test_re_escape(self):
-        p=""
-        for i in range(0, 256):
-            p = p + chr(i)
-            self.assertEqual(re.match(re.escape(chr(i)), chr(i)) is not None,
-                             True)
-            self.assertEqual(re.match(re.escape(chr(i)), chr(i)).span(), (0,1))
+    def _test_re_escape(self, use_bytes):
+        if use_bytes:
+            p=bytes()
+            for i in range(0, 256):
+                p.append(i)
+                self.assertEqual(re.match(re.escape(bytes([i])), bytes([i])) is not None,
+                                 True)
+                self.assertEqual(re.match(re.escape(bytes([i])), bytes([i])).span(), (0,1))
+        else:
+            p=""
+            for i in range(0, 256):
+                p = p + chr(i)
+                self.assertEqual(re.match(re.escape(chr(i)), chr(i)) is not None,
+                                 True)
+                self.assertEqual(re.match(re.escape(chr(i)), chr(i)).span(), (0,1))
 
         pat=re.compile(re.escape(p))
         self.assertEqual(pat.match(p) is not None, True)
         self.assertEqual(pat.match(p).span(), (0,256))
 
+    def test_re_escape_str(self):
+        self._test_re_escape(False)
+
+    def test_re_escape_bytes(self):
+        self._test_re_escape(True)
+
     def test_pickling(self):
         import pickle
         self.pickle_test(pickle)
_______________________________________________
Python-3000 mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-3000