Greets,
I just committed a test to trunk which verifies that utf8proc's normalization
is idempotent: normalizing a second time is a no-op. However, I had to
disable the test because utf8proc chokes when fed strings containing either
control characters or noncharacter code points.
http://svn.apache.org/viewvc?view=revision&revision=1213996
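For reference, here's a standalone sketch of the property the test checks,
written directly against utf8proc rather than through Normalizer. The option
set is illustrative; it may not match the options Normalizer actually passes.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include "utf8proc.h"

    // Return 1 if normalizing twice yields the same bytes as normalizing
    // once, 0 if not, -1 if utf8proc reports an error.
    static int
    normalization_is_idempotent(const uint8_t *input, ssize_t len) {
        uint8_t *once  = NULL;
        uint8_t *twice = NULL;
        int options = UTF8PROC_STABLE | UTF8PROC_COMPOSE | UTF8PROC_COMPAT;

        ssize_t len_once = utf8proc_map(input, len, &once, options);
        if (len_once < 0) { return -1; }

        ssize_t len_twice = utf8proc_map(once, len_once, &twice, options);
        if (len_twice < 0) { free(once); return -1; }

        int matched = len_once == len_twice
                      && memcmp(once, twice, (size_t)len_once) == 0;
        free(once);
        free(twice);
        return matched;
    }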
The test uses random UTF-8 data, generated by TestUtils_random_string(). With
the hack below my sig, the test passes.
Strings which contain control characters are valid UTF-8, as are strings which
contain noncharacters. Noncharacters are not supposed to be used for
interchange, but Lucy is a library, not an application, and thus should pass
noncharacters cleanly.
http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters
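For the record, the noncharacters are U+FDD0 through U+FDEF plus the last two
code points of every plane (U+xxFFFE and U+xxFFFF), 66 in all, and a single
mask catches the latter group. A sketch of the predicate (the function name
is mine, not anything in Lucy):

    #include <stdbool.h>
    #include <stdint.h>

    // True for any of the 66 Unicode noncharacters.
    static bool
    is_noncharacter(uint32_t code_point) {
        return (code_point >= 0xFDD0 && code_point <= 0xFDEF)
               || (code_point & 0xFFFE) == 0xFFFE;
    }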
Looking at the code for Lucy::Analysis::Normalizer, it seems that if utf8proc
reports an error, we simply leave the token alone. That seems appropriate in
the case of malformed UTF-8, but I question whether it is appropriate for
valid UTF-8 sequences containing control characters or noncharacter code
points.
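utf8proc's return codes would let us tell those cases apart. A hypothetical
sketch of the branch; the control flow is illustrative, not Normalizer's
actual code:

    ssize_t result = utf8proc_map(text, len, &normalized, options);
    if (result >= 0) {
        // Success: adopt the normalized bytes.
    }
    else if (result == UTF8PROC_ERROR_INVALIDUTF8) {
        // Malformed UTF-8: leaving the token alone seems right here.
    }
    else {
        // utf8proc rejected input that may still be valid UTF-8; arguably
        // the token should pass through unchanged by design, not by
        // accident of the error path.
    }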
Marvin Humphrey
Index: core/Lucy/Test/TestUtils.c
===================================================================
--- core/Lucy/Test/TestUtils.c (revision 1213967)
+++ core/Lucy/Test/TestUtils.c (working copy)
@@ -17,6 +17,7 @@
#define C_LUCY_TESTUTILS
#include "Lucy/Util/ToolSet.h"
#include <string.h>
+#include <ctype.h>
#include "Lucy/Test/TestUtils.h"
#include "Lucy/Test.h"
@@ -106,6 +107,15 @@
         if (code_point > 0xD7FF && code_point < 0xE000) {
             continue; // UTF-16 surrogate.
         }
+        if (code_point < 0x80 && iscntrl((int)code_point)) {
+            continue; // Control char; bounds check keeps iscntrl() defined.
+        }
+        if ((code_point & 0xFFFF) == 0xFFFE
+            || (code_point & 0xFFFF) == 0xFFFF
+            || (code_point >= 0xFDD0 && code_point <= 0xFDEF)
+           ) {
+            continue; // Unicode noncharacter code point.
+        }
         break;
     }
     return code_point;