Re: [PATCH v12 04/10] utf8: teach same_encoding() alternative UTF encoding names

2018-04-01 Thread Lars Schneider

> On 16 Mar 2018, at 19:19, Eric Sunshine  wrote:
> 
> On Fri, Mar 16, 2018 at 1:50 PM, Junio C Hamano  wrote:
>> Eric Sunshine  writes:
>>> However, I'm having a tough time imagining cases in which callers
>>> would want same_encoding() to return true if both arguments are NULL,
>>> but outright crash if only one is NULL (which is the behavior even
>>> before this patch). In other words, same_encoding() takes advantage of
>>> is_encoding_utf8() for its convenience, not for its NULL-handling.
>>> Given that view, the two explicit is_encoding_utf8() calls in
>>> same_encoding() seem redundant once the same_utf_encoding() call is
>>> added.
>> 
>> So... does that mean we'd want something like this, or do you have
>> something else in mind?
>> 
>>int same_encoding(const char *src, const char *dst)
>>{
>>static const char utf8[] = "UTF-8";
>> 
>>if (!src)
>>src = utf8;
>>if (!dst)
>>dst = utf8;
>>if (same_utf_encoding(src, dst))
>>return 1;
>>return !strcasecmp(src, dst);
>>}
> 
> I am not proposing anything like that for this patch or patch series.
> I'm merely asking why, after this patch, same_encoding() still
> contains the (in my mind) now-unneeded conditional:
> 
>if (is_encoding_utf8(src) && is_encoding_utf8(dst))
>return 1;

My main motivation was to keep the existing behavior "as-is"
to avoid any regressions.

However, I agree with your critic of the inconsistencies. 

Therefore, I will use Junio's suggestion above as it makes 
the intented behaivior clear.

Thanks,
Lars


Re: [PATCH v12 04/10] utf8: teach same_encoding() alternative UTF encoding names

2018-03-16 Thread Eric Sunshine
On Fri, Mar 16, 2018 at 1:50 PM, Junio C Hamano  wrote:
> Eric Sunshine  writes:
>> However, I'm having a tough time imagining cases in which callers
>> would want same_encoding() to return true if both arguments are NULL,
>> but outright crash if only one is NULL (which is the behavior even
>> before this patch). In other words, same_encoding() takes advantage of
>> is_encoding_utf8() for its convenience, not for its NULL-handling.
>> Given that view, the two explicit is_encoding_utf8() calls in
>> same_encoding() seem redundant once the same_utf_encoding() call is
>> added.
>
> So... does that mean we'd want something like this, or do you have
> something else in mind?
>
> int same_encoding(const char *src, const char *dst)
> {
> static const char utf8[] = "UTF-8";
>
> if (!src)
> src = utf8;
> if (!dst)
> dst = utf8;
> if (same_utf_encoding(src, dst))
> return 1;
> return !strcasecmp(src, dst);
> }

I am not proposing anything like that for this patch or patch series.
I'm merely asking why, after this patch, same_encoding() still
contains the (in my mind) now-unneeded conditional:

if (is_encoding_utf8(src) && is_encoding_utf8(dst))
return 1;

If I'm reading the current code correctly, same_encoding() will crash
when either (but not both) of its arguments is NULL and the non-NULL
argument is not a variation of "UTF-8". If both arguments are NULL,
then it won't crash, but I don't believe that it was intentional that
it should crash for some NULL cases but not others; rather, not
crashing when both are NULL is an _accidental_ side-effect of relying
upon is_encoding_utf8().

If I understood him correctly, Lars's justification for retaining the
conditional in question even after the patch is to maintain the
non-crashing behavior for both arguments being NULL, even though it
will continue to crash when only one is NULL. That justification
doesn't makes sense to me since I can't imagine clients relying on
accidental behavior of sometimes crashing, sometimes not, hence my
question to Lars.

As for your snippet above which ensures that 'src' and 'dst' are
non-NULL, that (or some variation) would be fine if we ever expect
callers of same_encoding() to pass NULL for either of those arguments,
but such a fix is outside the scope of this patch and even this patch
series (which does not require such a fix), IMHO.


Re: [PATCH v12 04/10] utf8: teach same_encoding() alternative UTF encoding names

2018-03-16 Thread Junio C Hamano
Eric Sunshine  writes:

> However, I'm having a tough time imagining cases in which callers
> would want same_encoding() to return true if both arguments are NULL,
> but outright crash if only one is NULL (which is the behavior even
> before this patch). In other words, same_encoding() takes advantage of
> is_encoding_utf8() for its convenience, not for its NULL-handling.
> Given that view, the two explicit is_encoding_utf8() calls in
> same_encoding() seem redundant once the same_utf_encoding() call is
> added.

So... does that mean we'd want something like this, or do you have
something else in mind?

int same_encoding(const char *src, const char *dst)
{
static const char utf8[] = "UTF-8";

if (!src)
src = utf8;
if (!dst)
dst = utf8;
if (same_utf_encoding(src, dst))
return 1;
return !strcasecmp(src, dst);
}


Re: [PATCH v12 04/10] utf8: teach same_encoding() alternative UTF encoding names

2018-03-15 Thread Eric Sunshine
On Thu, Mar 15, 2018 at 7:35 PM, Lars Schneider
 wrote:
>> On 16 Mar 2018, at 00:25, Eric Sunshine  wrote:
>>>if (is_encoding_utf8(src) && is_encoding_utf8(dst))
>>>return 1;
>>> +   if (same_utf_encoding(src, dst))
>>> +   return 1;
>>>return !strcasecmp(src, dst);
>>> }
>>
>> This seems odd. I would have expected the newly-added generalized
>> conditional to replace the original UTF-8-specific conditional, not
>> supplement it. That is, shouldn't the entire function body be:
>>
>>if (same_utf_encoding(src, dst))
>>return 1;
>>return !strcasecmp(src, dst);
>
> No, because is_encoding_utf8() returns "true" (=1) if the encoding
> is NULL. That is not the case for UTF-16 et al. The caller of
> same_encoding() might expect that behavior.
> I could have moved the "UTF-8" == NULL assumption into
> same_utf_encoding() but that did not feel right.
> Does this make sense?

Not particularly.

Looking at 677cfed56a (commit-tree: cope with different ways "utf-8"
can be spelled., 2006-12-30), which introduced is_encoding_utf8()
along with its builtin NULL-check, I can see why that function wants
to treat NULL the same as UTF-8.

However, I'm having a tough time imagining cases in which callers
would want same_encoding() to return true if both arguments are NULL,
but outright crash if only one is NULL (which is the behavior even
before this patch). In other words, same_encoding() takes advantage of
is_encoding_utf8() for its convenience, not for its NULL-handling.
Given that view, the two explicit is_encoding_utf8() calls in
same_encoding() seem redundant once the same_utf_encoding() call is
added.


Re: [PATCH v12 04/10] utf8: teach same_encoding() alternative UTF encoding names

2018-03-15 Thread Lars Schneider

> On 16 Mar 2018, at 00:25, Eric Sunshine  wrote:
> 
> On Thu, Mar 15, 2018 at 6:57 PM,   wrote:
>> The function same_encoding() checked only for alternative UTF-8 encoding
>> names. Teach it to check for all kinds of alternative UTF encoding
>> names.
>> 
>> Signed-off-by: Lars Schneider 
>> ---
>> diff --git a/utf8.c b/utf8.c
>> @@ -401,11 +401,27 @@ void strbuf_utf8_replace(struct strbuf *sb_src, int 
>> pos, int width,
>> +static int same_utf_encoding(const char *src, const char *dst)
>> +{
>> +   if (istarts_with(src, "utf") && istarts_with(dst, "utf")) {
>> +   /* src[3] or dst[3] might be '\0' */
>> +   int i = (src[3] == '-' ? 4 : 3);
>> +   int j = (dst[3] == '-' ? 4 : 3);
>> +   return !strcasecmp(src+i, dst+j);
>> +   }
>> +   return 0;
>> +}
> 
> Alternate implementation without magic numbers:
> 
>if (iskip_prefix(src, "utf", ) &&
>iskip_prefix(dst, "utf", )) {
>if (*src == '-')
>src++;
>if (*dst == '-')
>dst++;
>return !strcasecmp(src, dst);
>}
>return 0;
> 
> ... assuming you add an iskip_prefix() function (patterned after 
> skip_prefix()).

If a reroll is necessary, then I can do this.


>> int is_encoding_utf8(const char *name)
>> {
>>if (!name)
>>return 1;
>> -   if (!strcasecmp(name, "utf-8") || !strcasecmp(name, "utf8"))
>> +   if (same_utf_encoding("utf-8", name))
>>return 1;
>>return 0;
>> }
>> @@ -414,6 +430,8 @@ int same_encoding(const char *src, const char *dst)
>> {
>>if (is_encoding_utf8(src) && is_encoding_utf8(dst))
>>return 1;
>> +   if (same_utf_encoding(src, dst))
>> +   return 1;
>>return !strcasecmp(src, dst);
>> }
> 
> This seems odd. I would have expected the newly-added generalized
> conditional to replace the original UTF-8-specific conditional, not
> supplement it. That is, shouldn't the entire function body be:
> 
>if (same_utf_encoding(src, dst))
>return 1;
>return !strcasecmp(src, dst);

No, because is_encoding_utf8() returns "true" (=1) if the encoding
is NULL. That is not the case for UTF-16 et al. The caller of
same_encoding() might expect that behavior.

I could have moved the "UTF-8" == NULL assumption into 
same_utf_encoding() but that did not feel right.

Does this make sense?

- Lars

Re: [PATCH v12 04/10] utf8: teach same_encoding() alternative UTF encoding names

2018-03-15 Thread Eric Sunshine
On Thu, Mar 15, 2018 at 6:57 PM,   wrote:
> The function same_encoding() checked only for alternative UTF-8 encoding
> names. Teach it to check for all kinds of alternative UTF encoding
> names.
>
> Signed-off-by: Lars Schneider 
> ---
> diff --git a/utf8.c b/utf8.c
> @@ -401,11 +401,27 @@ void strbuf_utf8_replace(struct strbuf *sb_src, int 
> pos, int width,
> +static int same_utf_encoding(const char *src, const char *dst)
> +{
> +   if (istarts_with(src, "utf") && istarts_with(dst, "utf")) {
> +   /* src[3] or dst[3] might be '\0' */
> +   int i = (src[3] == '-' ? 4 : 3);
> +   int j = (dst[3] == '-' ? 4 : 3);
> +   return !strcasecmp(src+i, dst+j);
> +   }
> +   return 0;
> +}

Alternate implementation without magic numbers:

if (iskip_prefix(src, "utf", ) &&
iskip_prefix(dst, "utf", )) {
if (*src == '-')
src++;
if (*dst == '-')
dst++;
return !strcasecmp(src, dst);
}
return 0;

... assuming you add an iskip_prefix() function (patterned after skip_prefix()).

>  int is_encoding_utf8(const char *name)
>  {
> if (!name)
> return 1;
> -   if (!strcasecmp(name, "utf-8") || !strcasecmp(name, "utf8"))
> +   if (same_utf_encoding("utf-8", name))
> return 1;
> return 0;
>  }
> @@ -414,6 +430,8 @@ int same_encoding(const char *src, const char *dst)
>  {
> if (is_encoding_utf8(src) && is_encoding_utf8(dst))
> return 1;
> +   if (same_utf_encoding(src, dst))
> +   return 1;
> return !strcasecmp(src, dst);
>  }

This seems odd. I would have expected the newly-added generalized
conditional to replace the original UTF-8-specific conditional, not
supplement it. That is, shouldn't the entire function body be:

if (same_utf_encoding(src, dst))
return 1;
return !strcasecmp(src, dst);

?


[PATCH v12 04/10] utf8: teach same_encoding() alternative UTF encoding names

2018-03-15 Thread lars . schneider
From: Lars Schneider 

The function same_encoding() checked only for alternative UTF-8 encoding
names. Teach it to check for all kinds of alternative UTF encoding
names.

This function is used in a subsequent commit.

Signed-off-by: Lars Schneider 
---
 utf8.c | 20 +++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/utf8.c b/utf8.c
index 2c27ce0137..c30daf4d34 100644
--- a/utf8.c
+++ b/utf8.c
@@ -401,11 +401,27 @@ void strbuf_utf8_replace(struct strbuf *sb_src, int pos, 
int width,
strbuf_release(_dst);
 }
 
+/*
+ * Returns true (1) if the src encoding name matches the dst encoding
+ * name directly or one of its alternative names. E.g. UTF-16BE is the
+ * same as UTF16BE.
+ */
+static int same_utf_encoding(const char *src, const char *dst)
+{
+   if (istarts_with(src, "utf") && istarts_with(dst, "utf")) {
+   /* src[3] or dst[3] might be '\0' */
+   int i = (src[3] == '-' ? 4 : 3);
+   int j = (dst[3] == '-' ? 4 : 3);
+   return !strcasecmp(src+i, dst+j);
+   }
+   return 0;
+}
+
 int is_encoding_utf8(const char *name)
 {
if (!name)
return 1;
-   if (!strcasecmp(name, "utf-8") || !strcasecmp(name, "utf8"))
+   if (same_utf_encoding("utf-8", name))
return 1;
return 0;
 }
@@ -414,6 +430,8 @@ int same_encoding(const char *src, const char *dst)
 {
if (is_encoding_utf8(src) && is_encoding_utf8(dst))
return 1;
+   if (same_utf_encoding(src, dst))
+   return 1;
return !strcasecmp(src, dst);
 }
 
-- 
2.16.2