Re: [PATCH v2] dfa: optimize UTF-8 period

Paolo Bonzini Tue, 20 Apr 2010 02:12:33 -0700

On 04/20/2010 12:47 AM, Eric Blake wrote:

On 04/19/2010 06:14 AM, Paolo Bonzini wrote:

+  /* A valid UTF-8 character is
+
+          ([0x00-0x7f]
+           |[0xc2-0xdf][0x80-0xbf]
+           |[0xe0-0xef][0x80-0xbf][0x80-0xbf]
+           |[0xf0-0xf7][0x80-0xbf][0x80-0xbf][0x80-0xbf])


Yes, but per POSIX XBD 9.3.4
(http://www.opengroup.org/onlinepubs/9699919799/toc.htm), ANYCHAR does
not match NUL.  Do you need to adjust this patch to exclude 0x00?


Yes (following the syntax bits).

Does this seem okay?

Paolo

diff --git a/gnulib b/gnulib
index 5fbd6e3..bfffe40 160000
--- a/gnulib
+++ b/gnulib
@@ -1 +1 @@
-Subproject commit 5fbd6e3e571c6e59270fa486bd7c83dfe04c87cf
+Subproject commit bfffe408f8b375fd0989266bd8c01580be26d1a8
diff --git a/src/dfa.c b/src/dfa.c
index 61322d1..d9c5ba2 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -1487,7 +1487,17 @@ add_utf8_anychar (void)
   /* Define the five character classes that are needed below.  */
   if (dfa->utf8_anychar_classes[0] == 0)
     for (i = 0; i < n; i++)
-      dfa->utf8_anychar_classes[i] = CSET + charclass_index(utf8_classes[i]);
+      {
+        charclass c = utf8_classes[i];
+        if (i == 1)
+          {
+            if (!(syntax_bits & RE_DOT_NEWLINE))
+              clrbit (c, eolbyte);
+            if (syntax_bits & RE_DOT_NOT_NULL)
+              clrbit (c, '\0');
+          }
+        dfa->utf8_anychar_classes[i] = CSET + charclass_index(c);
+      }
 
   /* A valid UTF-8 character is
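
For readers who want to try the idea outside of dfa.c, here is a minimal
standalone sketch of what the hunk above does to the single-byte
alternative of ".": start from a copy of 0x00-0x7f, then drop the
end-of-line byte and NUL depending on the syntax bits.  Everything in it
is an assumption made for illustration: byteset, set_range, clear_byte,
has_byte, dot_ascii_class, DOT_NEWLINE and DOT_NOT_NULL are hypothetical
names, not the charclass helpers from dfa.c or the RE_* values from
regex.h.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for the syntax bits; NOT the values from regex.h.  */
enum { DOT_NEWLINE = 1u << 0, DOT_NOT_NULL = 1u << 1 };

/* A 256-bit set of byte values, loosely modelled on a character class.  */
enum { NBYTES = 256, WORD_BITS = sizeof (unsigned) * CHAR_BIT };
typedef struct { unsigned w[NBYTES / WORD_BITS]; } byteset;

/* Add every byte in [lo, hi] to the set.  */
static void
set_range (byteset *s, unsigned lo, unsigned hi)
{
  assert (lo <= hi && hi < NBYTES);
  for (unsigned b = lo; b <= hi; b++)
    s->w[b / WORD_BITS] |= 1u << (b % WORD_BITS);
}

/* Remove a single byte from the set.  */
static void
clear_byte (byteset *s, unsigned char b)
{
  s->w[b / WORD_BITS] &= ~(1u << (b % WORD_BITS));
}

/* Test whether a byte is in the set.  */
static bool
has_byte (const byteset *s, unsigned char b)
{
  return (s->w[b / WORD_BITS] >> (b % WORD_BITS)) & 1u;
}

/* Build the one-byte alternative of "." in a fresh set, so that any
   shared 0x00-0x7f table is left unmodified.  */
static byteset
dot_ascii_class (unsigned syntax, unsigned char eol)
{
  byteset c;
  memset (&c, 0, sizeof c);
  set_range (&c, 0x00, 0x7f);
  if (!(syntax & DOT_NEWLINE))
    clear_byte (&c, eol);      /* "." must not match end of line */
  if (syntax & DOT_NOT_NULL)
    clear_byte (&c, '\0');     /* "." must not match NUL */
  return c;
}
```

With these assumed names, dot_ascii_class (DOT_NOT_NULL, '\n') gives a
set that contains 'a' but neither NUL nor newline, while
dot_ascii_class (DOT_NEWLINE, '\n') keeps both.  Only the one-byte
alternative needs adjusting, because NUL and the end-of-line byte never
occur as lead or continuation bytes of a multibyte UTF-8 sequence.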
