Daiki Ueno <[email protected]> writes:

> I have rebased the patch against the latest git master and pushed into
> 'ueno/unicode-9.0.0' branch in the gnulib repository:
> http://git.savannah.gnu.org/cgit/gnulib.git/log/?h=ueno/unicode-9.0.0
Attached is the corresponding documentation change for libunistring.

Bruno, did you have time to look at the Gnulib changes? I would like to
merge the branch soon, before I completely forget about it ;-)

Regards,
-- 
Daiki Ueno
From 3968938b1d21d87e2f6c03e9fe5453bf413d7d7c Mon Sep 17 00:00:00 2001
From: Daiki Ueno <[email protected]>
Date: Mon, 13 Nov 2017 17:48:27 +0100
Subject: [PATCH] unigbrk: Update from Gnulib

---
 ChangeLog                | 10 ++++++++++
 autogen.sh               |  1 +
 doc/unigbrk.texi         | 30 +++++++++++++++++++++++++++++-
 lib/unigbrk/.gitignore   |  2 ++
 tests/unigbrk/.gitignore |  2 ++
 5 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/ChangeLog b/ChangeLog
index f2e4563..c600849 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,13 @@
+2017-11-13 Daiki Ueno <[email protected]>
+
+	* autogen.sh (GNULIB_MODULES): Pull unigbrk/uc-grapheme-breaks.
+	* doc/unigbrk.texi (Grapheme cluster breaks in a string): Mention
+	the limitations of *_grapheme_next and *_grapheme_prev functions
+	and suggest *_grapheme_breaks instead.
+	(Grapheme cluster break property): Document newly added
+	properties; mention the limitations of uc_is_grapheme_break and
+	suggest to use uc_grapheme_breaks instead.
+
 2017-10-21 Bruno Haible <[email protected]>
 
 	Upgrade to newer libtool.
diff --git a/autogen.sh b/autogen.sh
index e836db8..269ab0d 100755
--- a/autogen.sh
+++ b/autogen.sh
@@ -329,6 +329,7 @@ if test $skip_gnulib = false; then
     unigbrk/uc-gbrk-prop
     unigbrk/uc-is-grapheme-break
     unigbrk/ulc-grapheme-breaks
+    unigbrk/uc-grapheme-breaks
     uniwbrk/base
     uniwbrk/u8-wordbreaks
     uniwbrk/u16-wordbreaks
diff --git a/doc/unigbrk.texi b/doc/unigbrk.texi
index 196bd9f..d7847cc 100644
--- a/doc/unigbrk.texi
+++ b/doc/unigbrk.texi
@@ -44,6 +44,11 @@ clusters in a string.
 Returns the start of the next grapheme cluster following @var{s}, or
 @var{end} if no grapheme cluster break is encountered before it.
 Returns NULL if and only if @code{@var{s} == @var{end}}.
+
+Note that these functions do not handle the case when a character
+outside of the range between @var{s} and @var{end} is needed to
+determine the boundary. Use @func{_grapheme_breaks} functions for such
+cases.
 @end deftypefun
 
 @deftypefun void u8_grapheme_prev (const uint8_t *@var{s}, const uint8_t *@var{start})
@@ -52,6 +57,11 @@ Returns NULL if and only if @code{@var{s} == @var{end}}.
 Returns the start of the grapheme cluster preceding @var{s}, or
 @var{start} if no grapheme cluster break is encountered before it.
 Returns NULL if and only if @code{@var{s} == @var{start}}.
+
+Note that these functions do not handle the case when a character
+outside of the range between @var{start} and @var{s} is needed to
+determine the boundary. Use @func{_grapheme_breaks} functions for such
+cases.
 @end deftypefun
 
 The following functions determine all of the grapheme cluster
@@ -61,8 +71,9 @@ boundaries in a string.
 @deftypefunx void u16_grapheme_breaks (const uint16_t *@var{s}, size_t @var{n}, char *@var{p})
 @deftypefunx void u32_grapheme_breaks (const uint32_t *@var{s}, size_t @var{n}, char *@var{p})
 @deftypefunx void ulc_grapheme_breaks (const char *@var{s}, size_t @var{n}, char *@var{p})
+@deftypefunx void uc_grapheme_breaks (const ucs_t *@var{s}, size_t @var{n}, char *@var{p})
 Determines the grapheme cluster break points in @var{s}, an array of
-@var{n} units, and stores the result at @code{@var{p}[0..@var{n}-1]}.
+@var{n} units, and stores the result at @code{@var{p}[0..@var{nx}-1]}.
 @table @asis
 @item @code{@var{p}[i] = 1}
 means that there is a grapheme cluster boundary between
@@ -73,6 +84,13 @@ same grapheme cluster.
 @end table
 @code{@var{p}[0]} is always set to 1, because there is always a grapheme
 cluster break at start of text.
+
+In addition to the above variants for UTF-8, UTF-16, and UTF-32 strings,
+@code{<unigbrk.h>} provides another variant: @func{uc_grapheme_breaks}.
+
+This is similar to @func{u32_grapheme_breaks}, but it accepts any
+characters which may not be represented in UTF-32, such as control
+characters.
 @end deftypefun
 
 @node Grapheme cluster break property
@@ -99,6 +117,12 @@ property. More values may be added in the future.
 @deftypevrx Constant int GBP_T
 @deftypevrx Constant int GBP_LV
 @deftypevrx Constant int GBP_LVT
+@deftypevrx Constant int GBP_RI
+@deftypevrx Constant int GBP_ZWJ
+@deftypevrx Constant int GBP_EB
+@deftypevrx Constant int GBP_EM
+@deftypevrx Constant int GBP_GAZ
+@deftypevrx Constant int GBP_EBG
 @end deftypevr
 
 The following function looks up the grapheme cluster break property of a
@@ -123,4 +147,8 @@ of text, respectively.
 This implements the extended (not legacy) grapheme cluster rules
 described in the Unicode standard, because the standard says that they
 are preferred.
+
+Note that this function do not handle the case when three ore more
+consecutive characters are needed to determine the boundary. Use
+@func{uc_grapheme_breaks} for such cases.
 @end deftypefun
diff --git a/lib/unigbrk/.gitignore b/lib/unigbrk/.gitignore
index a7507c9..a9ae5e6 100644
--- a/lib/unigbrk/.gitignore
+++ b/lib/unigbrk/.gitignore
@@ -1,5 +1,6 @@
 # Files brought in by gnulib-tool:
 /gbrkprop.h
+/u-grapheme-breaks.h
 /u16-grapheme-breaks.c
 /u16-grapheme-next.c
 /u16-grapheme-prev.c
@@ -10,6 +11,7 @@
 /u8-grapheme-next.c
 /u8-grapheme-prev.c
 /uc-gbrk-prop.c
+/uc-grapheme-breaks.c
 /uc-is-grapheme-break.c
 /ulc-grapheme-breaks.c
 
diff --git a/tests/unigbrk/.gitignore b/tests/unigbrk/.gitignore
index 9e1dc4c..a8f7f51 100644
--- a/tests/unigbrk/.gitignore
+++ b/tests/unigbrk/.gitignore
@@ -11,6 +11,8 @@
 /test-u8-grapheme-prev.c
 /test-uc-gbrk-prop.c
 /test-uc-gbrk-prop.h
+/test-uc-grapheme-breaks.c
+/test-uc-grapheme-breaks.sh
 /test-uc-is-grapheme-break.c
 /test-uc-is-grapheme-break.sh
 /test-ulc-grapheme-breaks.c
-- 
2.13.6
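
For readers who want to see how the break array documented in the patch might be consumed, here is a minimal sketch. It does not call libunistring at all: it only assumes the documented format (p[i] = 1 means a grapheme cluster boundary before unit i, p[i] = 0 means unit i continues the current cluster). The helper names count_grapheme_clusters and next_cluster_start are hypothetical and are not part of <unigbrk.h>.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of libunistring: count the grapheme
   clusters described by a break array P of N entries, where
   P[i] == 1 marks a boundary before unit i.  */
size_t
count_grapheme_clusters (const char *p, size_t n)
{
  size_t count = 0;

  assert (p != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    if (p[i] == 1)
      count++;
  return count;
}

/* Hypothetical helper, not part of libunistring: return the index of
   the first boundary strictly after position I, or N if the cluster
   containing unit I runs to the end of the array.  */
size_t
next_cluster_start (const char *p, size_t n, size_t i)
{
  assert (p != NULL && i < n);
  for (i++; i < n; i++)
    if (p[i] == 1)
      break;
  return i;
}
```

As an illustration, for the UTF-8 bytes 65 CC 81 78 ("e", U+0301 COMBINING ACUTE ACCENT, "x"), one would expect a break array of { 1, 0, 0, 1 } under the documented format. With that hand-filled array, count_grapheme_clusters reports two clusters, and next_cluster_start from index 0 yields 3.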
