Pádraig Brady wrote:
> In the first 65535 code points there are also 404 chars which are
> not classed as combining in the unicode database, but are classed
> as zero width in the glibc locale data at least (zero-width space
> being one of them like you mentioned). I determined this with the
> attached progs:
> 
> ./zw | python unidata.py | grep " 0 " | wc -l


Hi Pádraig,

Wow, I knew there were some stand-alone zero-width characters, but I had
no idea there were so many!

I poked around a little in gnulib and found uc_combining_class (declared
in unictype.h), which returns the canonical combining class of a Unicode
character.
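
For a rough feel of what such a lookup returns, here is a minimal Python
sketch. It uses the standard library's unicodedata.combining rather than
gnulib, so it is only an analogue of the C function, and the sample
characters are illustrative picks, not data from this thread.

```python
import unicodedata

# Illustrative sample characters (assumed; not taken from the thread's test data).
samples = {
    "LATIN SMALL LETTER E": "e",
    "COMBINING ACUTE ACCENT": "\u0301",
    "ZERO WIDTH SPACE": "\u200b",
}


def combining_classes(chars):
    """Map each label to the canonical combining class of its character."""
    return {name: unicodedata.combining(ch) for name, ch in chars.items()}


classes = combining_classes(samples)
for name, ccc in classes.items():
    print(f"{name}: {ccc}")
```

Note that ZERO WIDTH SPACE reports class 0 here, so a test on the combining
class alone leaves stand-alone zero-width characters counted.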

I think the attached patch does what you were intending to do, and it
also counts all of the stand-alone zero-width characters you found:

----
$ ./zw | python unidata.py | grep " 0 " | perl packu.pl | src/wc -m
404

$ src/wc -m 2char
2 2char
----
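
The counting rule can be approximated in Python as follows. This is a
sketch under assumptions, not the patched wc: count_chars is a hypothetical
helper, unicodedata.combining stands in for uc_combining_class, the
whitespace and width handling in wc.c is ignored, and the sample strings
are guesses at what such test input might look like.

```python
import unicodedata


def count_chars(text):
    """Count code points, skipping those with a nonzero combining class."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# Two base letters, each followed by COMBINING ACUTE ACCENT (assumed sample).
two_chars = "e\u0301e\u0301"
# ZERO WIDTH SPACE has combining class 0, so it is still counted.
with_zwsp = "a\u200bb"

print(count_chars(two_chars))
print(count_chars(with_zwsp))
```

The first string has four code points but only two are counted; the second
keeps all three, since the zero-width space is not a combining character.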

Please note that this requires re-running `./bootstrap', since the patch
pulls an additional module (unictype/combining-class) in from gnulib.

Hope that helps.

Bo
diff --git a/bootstrap.conf b/bootstrap.conf
index 8bde0ad..ef5a328 100644
--- a/bootstrap.conf
+++ b/bootstrap.conf
@@ -82,6 +82,7 @@ gnulib_modules="
 	stpncpy
 	strftime
 	strpbrk strtoimax strtoumax strverscmp sys_stat timespec tzset
+	unictype/combining-class
 	unicodeio unistd-safer unlink-busy unlinkdir unlocked-io
 	uptime
 	useless-if-before-free
diff --git a/src/wc.c b/src/wc.c
index 61ab485..ed6630c 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -32,6 +32,8 @@
 #include "readtokens0.h"
 #include "safe-read.h"
 
+#include "unictype.h"
+
 #if !defined iswspace && !HAVE_ISWSPACE
 # define iswspace(wc) \
     ((wc) == to_uchar (wc) && isspace (to_uchar (wc)))
@@ -368,6 +370,8 @@ wc (int fd, char const *file_x, struct fstatus *fstatus)
 			    linepos += width;
 			  if (iswspace (wide_char))
 			    goto mb_word_separator;
+			  else if (uc_combining_class (wide_char) != 0)
+			    chars--; /* don't count combining chars */
 			  in_word = true;
 			}
 		      break;

Attachment: packu.pl
Description: Perl program

éé