Re: [PATCH v12 04/10] utf8: teach same_encoding() alternative UTF encoding names
> On 16 Mar 2018, at 19:19, Eric Sunshinewrote: > > On Fri, Mar 16, 2018 at 1:50 PM, Junio C Hamano wrote: >> Eric Sunshine writes: >>> However, I'm having a tough time imagining cases in which callers >>> would want same_encoding() to return true if both arguments are NULL, >>> but outright crash if only one is NULL (which is the behavior even >>> before this patch). In other words, same_encoding() takes advantage of >>> is_encoding_utf8() for its convenience, not for its NULL-handling. >>> Given that view, the two explicit is_encoding_utf8() calls in >>> same_encoding() seem redundant once the same_utf_encoding() call is >>> added. >> >> So... does that mean we'd want something like this, or do you have >> something else in mind? >> >>int same_encoding(const char *src, const char *dst) >>{ >>static const char utf8[] = "UTF-8"; >> >>if (!src) >>src = utf8; >>if (!dst) >>dst = utf8; >>if (same_utf_encoding(src, dst)) >>return 1; >>return !strcasecmp(src, dst); >>} > > I am not proposing anything like that for this patch or patch series. > I'm merely asking why, after this patch, same_encoding() still > contains the (in my mind) now-unneeded conditional: > >if (is_encoding_utf8(src) && is_encoding_utf8(dst)) >return 1; My main motivation was to keep the existing behavior "as-is" to avoid any regressions. However, I agree with your critic of the inconsistencies. Therefore, I will use Junio's suggestion above as it makes the intented behaivior clear. Thanks, Lars
Re: [PATCH v12 04/10] utf8: teach same_encoding() alternative UTF encoding names
On Fri, Mar 16, 2018 at 1:50 PM, Junio C Hamanowrote: > Eric Sunshine writes: >> However, I'm having a tough time imagining cases in which callers >> would want same_encoding() to return true if both arguments are NULL, >> but outright crash if only one is NULL (which is the behavior even >> before this patch). In other words, same_encoding() takes advantage of >> is_encoding_utf8() for its convenience, not for its NULL-handling. >> Given that view, the two explicit is_encoding_utf8() calls in >> same_encoding() seem redundant once the same_utf_encoding() call is >> added. > > So... does that mean we'd want something like this, or do you have > something else in mind? > > int same_encoding(const char *src, const char *dst) > { > static const char utf8[] = "UTF-8"; > > if (!src) > src = utf8; > if (!dst) > dst = utf8; > if (same_utf_encoding(src, dst)) > return 1; > return !strcasecmp(src, dst); > } I am not proposing anything like that for this patch or patch series. I'm merely asking why, after this patch, same_encoding() still contains the (in my mind) now-unneeded conditional: if (is_encoding_utf8(src) && is_encoding_utf8(dst)) return 1; If I'm reading the current code correctly, same_encoding() will crash when either (but not both) of its arguments is NULL and the non-NULL argument is not a variation of "UTF-8". If both arguments are NULL, then it won't crash, but I don't believe that it was intentional that it should crash for some NULL cases but not others; rather, not crashing when both are NULL is an _accidental_ side-effect of relying upon is_encoding_utf8(). If I understood him correctly, Lars's justification for retaining the conditional in question even after the patch is to maintain the non-crashing behavior for both arguments being NULL, even though it will continue to crash when only one is NULL. That justification doesn't makes sense to me since I can't imagine clients relying on accidental behavior of sometimes crashing, sometimes not, hence my question to Lars. As for your snippet above which ensures that 'src' and 'dst' are non-NULL, that (or some variation) would be fine if we ever expect callers of same_encoding() to pass NULL for either of those arguments, but such a fix is outside the scope of this patch and even this patch series (which does not require such a fix), IMHO.
Re: [PATCH v12 04/10] utf8: teach same_encoding() alternative UTF encoding names
Eric Sunshinewrites: > However, I'm having a tough time imagining cases in which callers > would want same_encoding() to return true if both arguments are NULL, > but outright crash if only one is NULL (which is the behavior even > before this patch). In other words, same_encoding() takes advantage of > is_encoding_utf8() for its convenience, not for its NULL-handling. > Given that view, the two explicit is_encoding_utf8() calls in > same_encoding() seem redundant once the same_utf_encoding() call is > added. So... does that mean we'd want something like this, or do you have something else in mind? int same_encoding(const char *src, const char *dst) { static const char utf8[] = "UTF-8"; if (!src) src = utf8; if (!dst) dst = utf8; if (same_utf_encoding(src, dst)) return 1; return !strcasecmp(src, dst); }
Re: [PATCH v12 04/10] utf8: teach same_encoding() alternative UTF encoding names
On Thu, Mar 15, 2018 at 7:35 PM, Lars Schneiderwrote: >> On 16 Mar 2018, at 00:25, Eric Sunshine wrote: >>>if (is_encoding_utf8(src) && is_encoding_utf8(dst)) >>>return 1; >>> + if (same_utf_encoding(src, dst)) >>> + return 1; >>>return !strcasecmp(src, dst); >>> } >> >> This seems odd. I would have expected the newly-added generalized >> conditional to replace the original UTF-8-specific conditional, not >> supplement it. That is, shouldn't the entire function body be: >> >>if (same_utf_encoding(src, dst)) >>return 1; >>return !strcasecmp(src, dst); > > No, because is_encoding_utf8() returns "true" (=1) if the encoding > is NULL. That is not the case for UTF-16 et al. The caller of > same_encoding() might expect that behavior. > I could have moved the "UTF-8" == NULL assumption into > same_utf_encoding() but that did not feel right. > Does this make sense? Not particularly. Looking at 677cfed56a (commit-tree: cope with different ways "utf-8" can be spelled., 2006-12-30), which introduced is_encoding_utf8() along with its builtin NULL-check, I can see why that function wants to treat NULL the same as UTF-8. However, I'm having a tough time imagining cases in which callers would want same_encoding() to return true if both arguments are NULL, but outright crash if only one is NULL (which is the behavior even before this patch). In other words, same_encoding() takes advantage of is_encoding_utf8() for its convenience, not for its NULL-handling. Given that view, the two explicit is_encoding_utf8() calls in same_encoding() seem redundant once the same_utf_encoding() call is added.
Re: [PATCH v12 04/10] utf8: teach same_encoding() alternative UTF encoding names
> On 16 Mar 2018, at 00:25, Eric Sunshinewrote: > > On Thu, Mar 15, 2018 at 6:57 PM, wrote: >> The function same_encoding() checked only for alternative UTF-8 encoding >> names. Teach it to check for all kinds of alternative UTF encoding >> names. >> >> Signed-off-by: Lars Schneider >> --- >> diff --git a/utf8.c b/utf8.c >> @@ -401,11 +401,27 @@ void strbuf_utf8_replace(struct strbuf *sb_src, int >> pos, int width, >> +static int same_utf_encoding(const char *src, const char *dst) >> +{ >> + if (istarts_with(src, "utf") && istarts_with(dst, "utf")) { >> + /* src[3] or dst[3] might be '\0' */ >> + int i = (src[3] == '-' ? 4 : 3); >> + int j = (dst[3] == '-' ? 4 : 3); >> + return !strcasecmp(src+i, dst+j); >> + } >> + return 0; >> +} > > Alternate implementation without magic numbers: > >if (iskip_prefix(src, "utf", ) && >iskip_prefix(dst, "utf", )) { >if (*src == '-') >src++; >if (*dst == '-') >dst++; >return !strcasecmp(src, dst); >} >return 0; > > ... assuming you add an iskip_prefix() function (patterned after > skip_prefix()). If a reroll is necessary, then I can do this. >> int is_encoding_utf8(const char *name) >> { >>if (!name) >>return 1; >> - if (!strcasecmp(name, "utf-8") || !strcasecmp(name, "utf8")) >> + if (same_utf_encoding("utf-8", name)) >>return 1; >>return 0; >> } >> @@ -414,6 +430,8 @@ int same_encoding(const char *src, const char *dst) >> { >>if (is_encoding_utf8(src) && is_encoding_utf8(dst)) >>return 1; >> + if (same_utf_encoding(src, dst)) >> + return 1; >>return !strcasecmp(src, dst); >> } > > This seems odd. I would have expected the newly-added generalized > conditional to replace the original UTF-8-specific conditional, not > supplement it. That is, shouldn't the entire function body be: > >if (same_utf_encoding(src, dst)) >return 1; >return !strcasecmp(src, dst); No, because is_encoding_utf8() returns "true" (=1) if the encoding is NULL. That is not the case for UTF-16 et al. The caller of same_encoding() might expect that behavior. I could have moved the "UTF-8" == NULL assumption into same_utf_encoding() but that did not feel right. Does this make sense? - Lars
Re: [PATCH v12 04/10] utf8: teach same_encoding() alternative UTF encoding names
On Thu, Mar 15, 2018 at 6:57 PM,wrote: > The function same_encoding() checked only for alternative UTF-8 encoding > names. Teach it to check for all kinds of alternative UTF encoding > names. > > Signed-off-by: Lars Schneider > --- > diff --git a/utf8.c b/utf8.c > @@ -401,11 +401,27 @@ void strbuf_utf8_replace(struct strbuf *sb_src, int > pos, int width, > +static int same_utf_encoding(const char *src, const char *dst) > +{ > + if (istarts_with(src, "utf") && istarts_with(dst, "utf")) { > + /* src[3] or dst[3] might be '\0' */ > + int i = (src[3] == '-' ? 4 : 3); > + int j = (dst[3] == '-' ? 4 : 3); > + return !strcasecmp(src+i, dst+j); > + } > + return 0; > +} Alternate implementation without magic numbers: if (iskip_prefix(src, "utf", ) && iskip_prefix(dst, "utf", )) { if (*src == '-') src++; if (*dst == '-') dst++; return !strcasecmp(src, dst); } return 0; ... assuming you add an iskip_prefix() function (patterned after skip_prefix()). > int is_encoding_utf8(const char *name) > { > if (!name) > return 1; > - if (!strcasecmp(name, "utf-8") || !strcasecmp(name, "utf8")) > + if (same_utf_encoding("utf-8", name)) > return 1; > return 0; > } > @@ -414,6 +430,8 @@ int same_encoding(const char *src, const char *dst) > { > if (is_encoding_utf8(src) && is_encoding_utf8(dst)) > return 1; > + if (same_utf_encoding(src, dst)) > + return 1; > return !strcasecmp(src, dst); > } This seems odd. I would have expected the newly-added generalized conditional to replace the original UTF-8-specific conditional, not supplement it. That is, shouldn't the entire function body be: if (same_utf_encoding(src, dst)) return 1; return !strcasecmp(src, dst); ?
[PATCH v12 04/10] utf8: teach same_encoding() alternative UTF encoding names
From: Lars SchneiderThe function same_encoding() checked only for alternative UTF-8 encoding names. Teach it to check for all kinds of alternative UTF encoding names. This function is used in a subsequent commit. Signed-off-by: Lars Schneider --- utf8.c | 20 +++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/utf8.c b/utf8.c index 2c27ce0137..c30daf4d34 100644 --- a/utf8.c +++ b/utf8.c @@ -401,11 +401,27 @@ void strbuf_utf8_replace(struct strbuf *sb_src, int pos, int width, strbuf_release(_dst); } +/* + * Returns true (1) if the src encoding name matches the dst encoding + * name directly or one of its alternative names. E.g. UTF-16BE is the + * same as UTF16BE. + */ +static int same_utf_encoding(const char *src, const char *dst) +{ + if (istarts_with(src, "utf") && istarts_with(dst, "utf")) { + /* src[3] or dst[3] might be '\0' */ + int i = (src[3] == '-' ? 4 : 3); + int j = (dst[3] == '-' ? 4 : 3); + return !strcasecmp(src+i, dst+j); + } + return 0; +} + int is_encoding_utf8(const char *name) { if (!name) return 1; - if (!strcasecmp(name, "utf-8") || !strcasecmp(name, "utf8")) + if (same_utf_encoding("utf-8", name)) return 1; return 0; } @@ -414,6 +430,8 @@ int same_encoding(const char *src, const char *dst) { if (is_encoding_utf8(src) && is_encoding_utf8(dst)) return 1; + if (same_utf_encoding(src, dst)) + return 1; return !strcasecmp(src, dst); } -- 2.16.2