Greets,
I just committed a test to trunk which verifies that utf8proc's normalization
is idempotent: normalizing a second time is a no-op. However, I had to
disable the test because utf8proc chokes when fed strings containing either
control characters or noncharacter code points.
http://svn.apache.org/viewvc?view=revision&revision=1213996
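For reference, here's a standalone sketch of the property the test checks,
written directly against utf8proc rather than through Normalizer. The option
set is illustrative; it may not match the options Normalizer actually passes.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include "utf8proc.h"

    // Return 1 if normalizing twice yields the same bytes as normalizing
    // once, 0 if not, -1 if utf8proc reports an error.
    static int
    normalization_is_idempotent(const uint8_t *input, ssize_t len) {
        uint8_t *once  = NULL;
        uint8_t *twice = NULL;
        int options = UTF8PROC_STABLE | UTF8PROC_COMPOSE | UTF8PROC_COMPAT;

        ssize_t len_once = utf8proc_map(input, len, &once, options);
        if (len_once < 0) { return -1; }

        ssize_t len_twice = utf8proc_map(once, len_once, &twice, options);
        if (len_twice < 0) { free(once); return -1; }

        int matched = len_once == len_twice
                      && memcmp(once, twice, (size_t)len_once) == 0;
        free(once);
        free(twice);
        return matched;
    }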
The test uses random UTF-8 data, generated by TestUtils_random_string(). With
the hack below my sig, the test passes.
Strings which contain control characters are valid UTF-8, as are strings which
contain noncharacters. Noncharacters are not supposed to be used for
interchange, but Lucy is a library, not an application, and thus should pass
noncharacters cleanly.
http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters
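For the record, the noncharacters are U+FDD0 through U+FDEF plus the last two
code points of every plane (U+xxFFFE and U+xxFFFF), 66 in all, and a single
mask catches the latter group. A sketch of the predicate (the function name
is mine, not anything in Lucy):

    #include <stdbool.h>
    #include <stdint.h>

    // True for any of the 66 Unicode noncharacters.
    static bool
    is_noncharacter(uint32_t code_point) {
        return (code_point >= 0xFDD0 && code_point <= 0xFDEF)
               || (code_point & 0xFFFE) == 0xFFFE;
    }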
Looking at the code for Lucy::Analysis::Normalizer, it seems that if utf8proc
reports an error, we simply leave the token alone. That seems appropriate in
the case of malformed UTF-8, but I question whether it is appropriate for
valid UTF-8 sequences containing control characters or noncharacter code
points.
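utf8proc's return codes would let us tell those cases apart. A hypothetical
sketch of the branch; the control flow is illustrative, not Normalizer's
actual code:

    ssize_t result = utf8proc_map(text, len, &normalized, options);
    if (result >= 0) {
        // Success: adopt the normalized bytes.
    }
    else if (result == UTF8PROC_ERROR_INVALIDUTF8) {
        // Malformed UTF-8: leaving the token alone seems right here.
    }
    else {
        // utf8proc rejected input that may still be valid UTF-8; arguably
        // the token should pass through unchanged by design, not by
        // accident of the error path.
    }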
Marvin Humphrey
Index: core/Lucy/Test/TestUtils.c
===================================================================
--- core/Lucy/Test/TestUtils.c (revision 1213967)
+++ core/Lucy/Test/TestUtils.c (working copy)
@@ -17,6 +17,7 @@
#define C_LUCY_TESTUTILS
#include "Lucy/Util/ToolSet.h"
#include <string.h>
+#include <ctype.h>
#include "Lucy/Test/TestUtils.h"
#include "Lucy/Test.h"
@@ -106,6 +107,15 @@
         if (code_point > 0xD7FF && code_point < 0xE000) {
             continue; // UTF-16 surrogate.
         }
+        if (code_point < 0x80 && iscntrl((int)code_point)) {
+            continue; // Control char; bounds check keeps iscntrl() defined.
+        }
+        if ((code_point & 0xFFFF) == 0xFFFE
+            || (code_point & 0xFFFF) == 0xFFFF
+            || (code_point >= 0xFDD0 && code_point <= 0xFDEF)
+           ) {
+            continue; // Unicode noncharacter code point.
+        }
         break;
     }
     return code_point;