On Thu, 16 Dec 2021 02:45:54 -0800
Michael Forney <[email protected]> wrote:
Dear Michael,
I know this thread is already long enough, but I took my time now to
read deeper into the topic. Please read below, as we might come to a
conclusion there now.
> Both of these observations are true, but just because uint8_t is 8-bit
> and unsigned char is 8-bit doesn't mean that uint8_t == unsigned char.
> A C implementation can have implementation-defined extended integer
> types, so it is possible that it defines uint8_t as an 8-bit extended
> integer type, distinct from unsigned char (similar to how long long
> and long may be distinct 64-bit integer types). As far as I know, this
> would be still be POSIX compliant.
>
> Yes, I believe this is a possibility.
>
> If you are assuming that unsigned char == uint8_t, I think you should
> just use unsigned char in your API. You could document the API as
> expecting one UTF-8 code unit per byte if you are worried about
> confusion regarding CHAR_BIT.
I found that _a lot_ of code relies on casting to and from (uint8_t *),
but this, as you already explained very well, breaks strict aliasing
as uint8_t is not a character type. This is not a problem in practice
because only gcc enforces strict aliasing and uint8_t is typedef'd to
unsigned char in all (?) cases, which lets uint8_t inherit the
aliasing-exception, however, nobody stops an implementer to define a
separate integral type that then does not work.
Many projects I found casting to and from (uint8_t *) explicitly
disable strict aliasing with the flag -fno-strict-aliasing and
technically have no problem in this regard, but this is such a
technical thing most users of the library, if we also pretty much
forced them to cast to and from (uint8_t *)), would just not know.
Interestingly, there was even an internal discussion on the
gcc-bugtracker[0] about this. They were thinking about adding an
attribute __attribute__((no_alias)) to the uint8_t typedef so it would
explicitly lose the aliasing-exception.
There's a nice rant on [1] and a nice discussion on [2] about this
whole thing. And to be honest, at this point I still wasn't 100%
satisfied.
What convinced me was how they added UTF-8-literals in C11. There you
can define explicit UTF-8 literals as u8"Hällö Wörld!" and they're of
type char[]. So even though char * is a bit ambiguous, we document well
that we expect an UTF-8 string. C11 goes further and accomodates us
with ways to portably define them.
> Ah, okay, I see what you mean. To be honest I'm not really sure how
> something like file encoding and I/O would work on such a system, but
> I was assuming that files would contain one code unit per byte, rather
> than packing multiple code units into a single byte. For instance, on
> a hypothetical system with 9-bit bytes, I wouldn't expect a code unit
> to cross the byte boundary.
To also address this point, here's what we can do to make us all happy:
1) Change the API to accept char*
2) Cast the pointers internally to (unsigned char *) for bitwise
modifications. We may do that as we may alias with char, unsigned
char and signed char.
3) Treat it as an invalid code point when any bit higher than the 9th
is set. This is actually already in the implementation, as we have
strict ranges.
Please take a look at the attached diff and let me know what you think.
Is this portable and am I correct to assume we might even handle
chars longer than 8 bit properly?
There's just one open question: Do you know of a better way than to do
(char *)(unsigned char[]){ 0xff, 0xef, 0xa0 }
to specify a literal char-array with specific bit-patterns?
With best regards and thanks again for your help and this very
interesting discussion!
Laslo
[0]:https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66110
[1]:https://gist.github.com/jibsen/da6be27cde4d526ee564
[2]:https://github.com/RIOT-OS/RIOT/issues/5497
diff --git a/grapheme.h b/grapheme.h
index bd5244b..3294c8e 100644
--- a/grapheme.h
+++ b/grapheme.h
@@ -19,11 +19,11 @@ typedef struct lg_internal_segmentation_state {
#define LG_CODEPOINT_INVALID UINT32_C(0xFFFD)
-size_t lg_grapheme_nextbreak(const uint8_t *);
+size_t lg_grapheme_nextbreak(const char *);
bool lg_grapheme_isbreak(uint_least32_t, uint_least32_t, LG_SEGMENTATION_STATE *);
-size_t lg_utf8_decode(const uint8_t *, size_t, uint_least32_t *);
-size_t lg_utf8_encode(uint_least32_t, uint8_t *, size_t);
+size_t lg_utf8_decode(const char *, size_t, uint_least32_t *);
+size_t lg_utf8_encode(uint_least32_t, char *, size_t);
#endif /* GRAPHEME_H */
diff --git a/man/lg_grapheme_nextbreak.3 b/man/lg_grapheme_nextbreak.3
index 795e1b4..ff78395 100644
--- a/man/lg_grapheme_nextbreak.3
+++ b/man/lg_grapheme_nextbreak.3
@@ -7,7 +7,7 @@
.Sh SYNOPSIS
.In grapheme.h
.Ft size_t
-.Fn lg_grapheme_nextbreak "const uint8_t *str"
+.Fn lg_grapheme_nextbreak "const char *str"
.Sh DESCRIPTION
.Fn lg_grapheme_nextbreak
computes the offset (in bytes) to the next grapheme
@@ -52,7 +52,7 @@ main(void)
/* print each grapheme cluster with byte-length */
for (; *s != '\\0';) {
- len = lg_grapheme_nextbreak((uint8_t *)s);
+ len = lg_grapheme_nextbreak(s);
printf("%2zu bytes | %.*s\\n", len, (int)len, s, len);
s += len;
}
diff --git a/src/grapheme.c b/src/grapheme.c
index 56993af..78d0993 100644
--- a/src/grapheme.c
+++ b/src/grapheme.c
@@ -179,7 +179,7 @@ hasbreak:
}
size_t
-lg_grapheme_nextbreak(const uint8_t *str)
+lg_grapheme_nextbreak(const char *str)
{
uint_least32_t cp0, cp1;
size_t ret, len = 0;
diff --git a/src/utf8.c b/src/utf8.c
index b21c920..327deea 100644
--- a/src/utf8.c
+++ b/src/utf8.c
@@ -48,7 +48,7 @@ static const struct {
};
size_t
-lg_utf8_decode(const uint8_t *s, size_t n, uint_least32_t *cp)
+lg_utf8_decode(const char *s, size_t n, uint_least32_t *cp)
{
size_t off, i;
@@ -60,13 +60,14 @@ lg_utf8_decode(const uint8_t *s, size_t n, uint_least32_t *cp)
/* identify sequence type with the first byte */
for (off = 0; off < LEN(lut); off++) {
- if (BETWEEN(s[0], lut[off].lower, lut[off].upper)) {
+ if (BETWEEN(((unsigned char *)s)[0], lut[off].lower,
+ lut[off].upper)) {
/*
* first byte is within the bounds; fill
* p with the the first bits contained in
* the first byte (by subtracting the high bits)
*/
- *cp = s[0] - lut[off].lower;
+ *cp = ((unsigned char *)s)[0] - lut[off].lower;
break;
}
}
@@ -74,6 +75,9 @@ lg_utf8_decode(const uint8_t *s, size_t n, uint_least32_t *cp)
/*
* first byte does not match a sequence type;
* set cp as invalid and return 1 byte processed
+ *
+ * this also includes the cases where bits higher than
+ * the 8th are set on systems with CHAR_BIT > 8
*/
*cp = LG_CODEPOINT_INVALID;
return 1;
@@ -92,12 +96,16 @@ lg_utf8_decode(const uint8_t *s, size_t n, uint_least32_t *cp)
* (i.e. between 0x80 (10000000) and 0xBF (10111111))
*/
for (i = 1; i <= off; i++) {
- if(!BETWEEN(s[i], 0x80, 0xBF)) {
+ if(!BETWEEN(((unsigned char *)s)[i], 0x80, 0xBF)) {
/*
* byte does not match format; return
* number of bytes processed excluding the
* unexpected character as recommended since
* Unicode 6 (chapter 3)
+ *
+ * this also includes the cases where bits
+ * higher than the 8th are set on systems
+ * with CHAR_BIT > 8
*/
*cp = LG_CODEPOINT_INVALID;
return 1 + (i - 1);
@@ -106,7 +114,7 @@ lg_utf8_decode(const uint8_t *s, size_t n, uint_least32_t *cp)
* shift code point by 6 bits and add the 6 stored bits
* in s[i] to it using the bitmask 0x3F (00111111)
*/
- *cp = (*cp << 6) | (s[i] & 0x3F);
+ *cp = (*cp << 6) | (((unsigned char *)s)[i] & 0x3F);
}
if (*cp < lut[off].mincp ||
@@ -125,7 +133,7 @@ lg_utf8_decode(const uint8_t *s, size_t n, uint_least32_t *cp)
}
size_t
-lg_utf8_encode(uint_least32_t cp, uint8_t *s, size_t n)
+lg_utf8_encode(uint_least32_t cp, char *s, size_t n)
{
size_t off, i;
@@ -165,7 +173,7 @@ lg_utf8_encode(uint_least32_t cp, uint8_t *s, size_t n)
* We do not overwrite the mask because we guaranteed earlier
* that there are no bits higher than the mask allows.
*/
- s[0] = lut[off].lower | (uint8_t)(cp >> (6 * off));
+ ((unsigned char *)s)[0] = lut[off].lower | (uint8_t)(cp >> (6 * off));
for (i = 1; i <= off; i++) {
/*
@@ -174,7 +182,8 @@ lg_utf8_encode(uint_least32_t cp, uint8_t *s, size_t n)
* extract from the properly-shifted value using the
* mask 00111111 (0x3F)
*/
- s[i] = 0x80 | ((cp >> (6 * (off - i))) & 0x3F);
+ ((unsigned char *)s)[i] = 0x80 |
+ ((cp >> (6 * (off - i))) & 0x3F);
}
return 1 + off;
diff --git a/test/utf8-decode.c b/test/utf8-decode.c
index 0fd6f77..b4dc7f2 100644
--- a/test/utf8-decode.c
+++ b/test/utf8-decode.c
@@ -8,7 +8,7 @@
#include "util.h"
static const struct {
- uint8_t *arr; /* UTF-8 byte sequence */
+ char *arr; /* UTF-8 byte sequence */
size_t len; /* length of UTF-8 byte sequence */
size_t exp_len; /* expected length returned */
uint_least32_t exp_cp; /* expected code point returned */
@@ -28,7 +28,7 @@ static const struct {
* [ 11111101 ] ->
* INVALID
*/
- .arr = (uint8_t[]){ 0xFD },
+ .arr = (char *)(unsigned char[]){ 0xFD },
.len = 1,
.exp_len = 1,
.exp_cp = LG_CODEPOINT_INVALID,
@@ -38,7 +38,7 @@ static const struct {
* [ 00000001 ] ->
* 0000001
*/
- .arr = (uint8_t[]){ 0x01 },
+ .arr = (char *)(unsigned char[]){ 0x01 },
.len = 1,
.exp_len = 1,
.exp_cp = 0x1,
@@ -48,7 +48,7 @@ static const struct {
* [ 11000011 10111111 ] ->
* 00011111111
*/
- .arr = (uint8_t[]){ 0xC3, 0xBF },
+ .arr = (char *)(unsigned char[]){ 0xC3, 0xBF },
.len = 2,
.exp_len = 2,
.exp_cp = 0xFF,
@@ -58,7 +58,7 @@ static const struct {
* [ 11000011 ] ->
* INVALID
*/
- .arr = (uint8_t[]){ 0xC3 },
+ .arr = (char *)(unsigned char[]){ 0xC3 },
.len = 1,
.exp_len = 2,
.exp_cp = LG_CODEPOINT_INVALID,
@@ -68,7 +68,7 @@ static const struct {
* [ 11000011 11111111 ] ->
* INVALID
*/
- .arr = (uint8_t[]){ 0xC3, 0xFF },
+ .arr = (char *)(unsigned char[]){ 0xC3, 0xFF },
.len = 2,
.exp_len = 1,
.exp_cp = LG_CODEPOINT_INVALID,
@@ -78,7 +78,7 @@ static const struct {
* [ 11000001 10111111 ] ->
* INVALID
*/
- .arr = (uint8_t[]){ 0xC1, 0xBF },
+ .arr = (char *)(unsigned char[]){ 0xC1, 0xBF },
.len = 2,
.exp_len = 2,
.exp_cp = LG_CODEPOINT_INVALID,
@@ -88,7 +88,7 @@ static const struct {
* [ 11100000 10111111 10111111 ] ->
* 0000111111111111
*/
- .arr = (uint8_t[]){ 0xE0, 0xBF, 0xBF },
+ .arr = (char *)(unsigned char[]){ 0xE0, 0xBF, 0xBF },
.len = 3,
.exp_len = 3,
.exp_cp = 0xFFF,
@@ -98,7 +98,7 @@ static const struct {
* [ 11100000 ] ->
* INVALID
*/
- .arr = (uint8_t[]){ 0xE0 },
+ .arr = (char *)(unsigned char[]){ 0xE0 },
.len = 1,
.exp_len = 3,
.exp_cp = LG_CODEPOINT_INVALID,
@@ -108,7 +108,7 @@ static const struct {
* [ 11100000 01111111 10111111 ] ->
* INVALID
*/
- .arr = (uint8_t[]){ 0xE0, 0x7F, 0xBF },
+ .arr = (char *)(unsigned char[]){ 0xE0, 0x7F, 0xBF },
.len = 3,
.exp_len = 1,
.exp_cp = LG_CODEPOINT_INVALID,
@@ -118,7 +118,7 @@ static const struct {
* [ 11100000 10111111 ] ->
* INVALID
*/
- .arr = (uint8_t[]){ 0xE0, 0xBF },
+ .arr = (char *)(unsigned char[]){ 0xE0, 0xBF },
.len = 2,
.exp_len = 3,
.exp_cp = LG_CODEPOINT_INVALID,
@@ -128,7 +128,7 @@ static const struct {
* [ 11100000 10111111 01111111 ] ->
* INVALID
*/
- .arr = (uint8_t[]){ 0xE0, 0xBF, 0x7F },
+ .arr = (char *)(unsigned char[]){ 0xE0, 0xBF, 0x7F },
.len = 3,
.exp_len = 2,
.exp_cp = LG_CODEPOINT_INVALID,
@@ -138,7 +138,7 @@ static const struct {
* [ 11100000 10011111 10111111 ] ->
* INVALID
*/
- .arr = (uint8_t[]){ 0xE0, 0x9F, 0xBF },
+ .arr = (char *)(unsigned char[]){ 0xE0, 0x9F, 0xBF },
.len = 3,
.exp_len = 3,
.exp_cp = LG_CODEPOINT_INVALID,
@@ -148,7 +148,7 @@ static const struct {
* [ 11101101 10100000 10000000 ] ->
* INVALID
*/
- .arr = (uint8_t[]){ 0xED, 0xA0, 0x80 },
+ .arr = (char *)(unsigned char[]){ 0xED, 0xA0, 0x80 },
.len = 3,
.exp_len = 3,
.exp_cp = LG_CODEPOINT_INVALID,
@@ -158,7 +158,7 @@ static const struct {
* [ 11110011 10111111 10111111 10111111 ] ->
* 011111111111111111111
*/
- .arr = (uint8_t[]){ 0xF3, 0xBF, 0xBF, 0xBF },
+ .arr = (char *)(unsigned char[]){ 0xF3, 0xBF, 0xBF, 0xBF },
.len = 4,
.exp_len = 4,
.exp_cp = UINT32_C(0xFFFFF),
@@ -168,7 +168,7 @@ static const struct {
* [ 11110011 ] ->
* INVALID
*/
- .arr = (uint8_t[]){ 0xF3 },
+ .arr = (char *)(unsigned char[]){ 0xF3 },
.len = 1,
.exp_len = 4,
.exp_cp = LG_CODEPOINT_INVALID,
@@ -178,7 +178,7 @@ static const struct {
* [ 11110011 01111111 10111111 10111111 ] ->
* INVALID
*/
- .arr = (uint8_t[]){ 0xF3, 0x7F, 0xBF, 0xBF },
+ .arr = (char *)(unsigned char[]){ 0xF3, 0x7F, 0xBF, 0xBF },
.len = 4,
.exp_len = 1,
.exp_cp = LG_CODEPOINT_INVALID,
@@ -188,7 +188,7 @@ static const struct {
* [ 11110011 10111111 ] ->
* INVALID
*/
- .arr = (uint8_t[]){ 0xF3, 0xBF },
+ .arr = (char *)(unsigned char[]){ 0xF3, 0xBF },
.len = 2,
.exp_len = 4,
.exp_cp = LG_CODEPOINT_INVALID,
@@ -198,7 +198,7 @@ static const struct {
* [ 11110011 10111111 01111111 10111111 ] ->
* INVALID
*/
- .arr = (uint8_t[]){ 0xF3, 0xBF, 0x7F, 0xBF },
+ .arr = (char *)(unsigned char[]){ 0xF3, 0xBF, 0x7F, 0xBF },
.len = 4,
.exp_len = 2,
.exp_cp = LG_CODEPOINT_INVALID,
@@ -208,7 +208,7 @@ static const struct {
* [ 11110011 10111111 10111111 ] ->
* INVALID
*/
- .arr = (uint8_t[]){ 0xF3, 0xBF, 0xBF },
+ .arr = (char *)(unsigned char[]){ 0xF3, 0xBF, 0xBF },
.len = 3,
.exp_len = 4,
.exp_cp = LG_CODEPOINT_INVALID,
@@ -218,7 +218,7 @@ static const struct {
* [ 11110011 10111111 10111111 01111111 ] ->
* INVALID
*/
- .arr = (uint8_t[]){ 0xF3, 0xBF, 0xBF, 0x7F },
+ .arr = (char *)(unsigned char[]){ 0xF3, 0xBF, 0xBF, 0x7F },
.len = 4,
.exp_len = 3,
.exp_cp = LG_CODEPOINT_INVALID,
@@ -228,7 +228,7 @@ static const struct {
* [ 11110000 10000000 10000001 10111111 ] ->
* INVALID
*/
- .arr = (uint8_t[]){ 0xF0, 0x80, 0x81, 0xBF },
+ .arr = (char *)(unsigned char[]){ 0xF0, 0x80, 0x81, 0xBF },
.len = 4,
.exp_len = 4,
.exp_cp = LG_CODEPOINT_INVALID,
@@ -238,7 +238,7 @@ static const struct {
* [ 11110100 10010000 10000000 10000000 ] ->
* INVALID
*/
- .arr = (uint8_t[]){ 0xF4, 0x90, 0x80, 0x80 },
+ .arr = (char *)(unsigned char[]){ 0xF4, 0x90, 0x80, 0x80 },
.len = 4,
.exp_len = 4,
.exp_cp = LG_CODEPOINT_INVALID,
diff --git a/test/utf8-encode.c b/test/utf8-encode.c
index 99f5d48..9ebaccf 100644
--- a/test/utf8-encode.c
+++ b/test/utf8-encode.c
@@ -9,43 +9,43 @@
static const struct {
uint_least32_t cp; /* input code point */
- uint8_t *exp_arr; /* expected UTF-8 byte sequence */
+ char *exp_arr; /* expected UTF-8 byte sequence */
size_t exp_len; /* expected length of UTF-8 sequence */
} enc_test[] = {
{
/* invalid code point (UTF-16 surrogate half) */
.cp = UINT32_C(0xD800),
- .exp_arr = (uint8_t[]){ 0xEF, 0xBF, 0xBD },
+ .exp_arr = (char *)(unsigned char[]){ 0xEF, 0xBF, 0xBD },
.exp_len = 3,
},
{
/* invalid code point (UTF-16-unrepresentable) */
.cp = UINT32_C(0x110000),
- .exp_arr = (uint8_t[]){ 0xEF, 0xBF, 0xBD },
+ .exp_arr = (char *)(unsigned char[]){ 0xEF, 0xBF, 0xBD },
.exp_len = 3,
},
{
/* code point encoded to a 1-byte sequence */
.cp = 0x01,
- .exp_arr = (uint8_t[]){ 0x01 },
+ .exp_arr = (char *)(unsigned char[]){ 0x01 },
.exp_len = 1,
},
{
/* code point encoded to a 2-byte sequence */
.cp = 0xFF,
- .exp_arr = (uint8_t[]){ 0xC3, 0xBF },
+ .exp_arr = (char *)(unsigned char[]){ 0xC3, 0xBF },
.exp_len = 2,
},
{
/* code point encoded to a 3-byte sequence */
.cp = 0xFFF,
- .exp_arr = (uint8_t[]){ 0xE0, 0xBF, 0xBF },
+ .exp_arr = (char *)(unsigned char[]){ 0xE0, 0xBF, 0xBF },
.exp_len = 3,
},
{
/* code point encoded to a 4-byte sequence */
.cp = UINT32_C(0xFFFFF),
- .exp_arr = (uint8_t[]){ 0xF3, 0xBF, 0xBF, 0xBF },
+ .exp_arr = (char *)(unsigned char[]){ 0xF3, 0xBF, 0xBF, 0xBF },
.exp_len = 4,
},
};
@@ -59,7 +59,7 @@ main(int argc, char *argv[])
/* UTF-8 encoder test */
for (i = 0, failed = 0; i < LEN(enc_test); i++) {
- uint8_t arr[4];
+ char arr[4];
size_t len;
len = lg_utf8_encode(enc_test[i].cp, arr, LEN(arr));