Re: [PATCH v3 4/4] urlmatch: allow globbing for the URL host part

Junio C Hamano Thu, 26 Jan 2017 12:48:08 -0800

Patrick Steinhardt <[email protected]> writes:

> The URL matching function computes for two URLs whether they match or not.
> The match is performed by splitting up the URL into different parts and
> then doing an exact comparison with the to-be-matched URL.
>
> The main user of `urlmatch` is the configuration subsystem. It allows
> setting certain configuration values based on the URL being connected
> to, via keys like `http.<url>.*`. A common use case is to set proxies
> for only those remotes which match the given URL. Unfortunately,
> having exact matches for all parts of the URL can become quite tedious
> in some setups. Imagine for example a corporate network where there are
> dozens or even hundreds of subdomains, which would have to be configured
> individually.
>
> This commit introduces the ability to use globbing in the host-part of
> the URLs. A user can simply specify a `*` as part of the host name to
> match all subdomains at this level. For example adding a configuration
> key `http.https://*.example.com.proxy` will match all subdomains of
> `https://example.com`.


This is probably a useful improvement.

Having said that, when I mentioned "glob", I meant to also support
something like this:

        https://www[1-4].ibm.com/

And when people read "glob", that is what they expect.

So calling this "the ability to use globbing" is misleading.
The last paragraph in the log message above needs a bit of
tweaking, perhaps like this:

        Allow users to write an asterisk '*' in place of any 'host'
        or 'subdomain' label as part of the host name.  For example,
        "http.https://*.example.com.proxy" sets "http.proxy" for all
        direct subdomains of "https://example.com",
        e.g. "https://foo.example.com", but not
        "https://foo.bar.example.com".

Fortunately, your update to config.txt, which is facing the end
users, does not misuse the word and instead is explicit that the
only thing the matcher does is to match '*' to a single hierarchy.
It is clear that even http://www*.ibm.com/ is not supported from
the description, which is good.

>  . Host/domain name (e.g., `example.com` in `https://example.com/`).
> -  This field must match exactly between the config key and the URL.
> +  This field must match between the config key and the URL. It is
> +  possible to specify a `*` as part of the host name to match all subdomains
> +  at this level. `https://*.example.com/` for example would match
> +  `https://foo.example.com/`, but not `https://foo.bar.example.com/`.

This is good as-is.

>  . Port number (e.g., `8080` in `http://example.com:8080/`).
>    This field must match exactly between the config key and the URL.
> diff --git a/t/t1300-repo-config.sh b/t/t1300-repo-config.sh
> index 923bfc5a2..ec545e092 100755
> --- a/t/t1300-repo-config.sh
> +++ b/t/t1300-repo-config.sh
> @@ -1177,6 +1177,42 @@ test_expect_success 'urlmatch' '
>       test_cmp expect actual
>  '
>  
> +test_expect_success 'glob-based urlmatch' '

This is not "glob".  A more generic term "wildcard" is OK.

> +     cat >.git/config <<-\EOF &&
> +     [http]
> +             sslVerify
> ...
> +static int match_host(const struct url_info *url_info,
> +                   const struct url_info *pattern_info)
> +{
> +     char *url = xmemdupz(url_info->url + url_info->host_off, url_info->host_len);
> +     char *pat = xmemdupz(pattern_info->url + pattern_info->host_off, pattern_info->host_len);
> +     char *url_tok, *pat_tok, *url_save, *pat_save;
> +     int matching;
> +
> +     url_tok = strtok_r(url, ".", &url_save);
> +     pat_tok = strtok_r(pat, ".", &pat_save);

Hmph, this will be the first use of strtok_r() in our codebase.
Does everybody have it?

For a use like this where your delimiter set is a singleton, it may
be simpler to do the usual strchrnul() or memchr() based loop.  The
attached is my attempt to do so on top of this patch.
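
To illustrate the memchr()-based style in isolation, here is a minimal
sketch that walks the '.'-separated labels of a length-bounded host
string without copying or modifying it.  The helper name count_labels()
is made up for this example and is not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: count the '.'-separated labels in the first
 * n bytes of s.  The input is neither copied nor modified.
 */
static size_t count_labels(const char *s, size_t n)
{
	size_t count = 0;

	assert(s != NULL);
	while (n) {
		const char *dot = memchr(s, '.', n);
		size_t len = dot ? (size_t)(dot - s) : n;

		count++;
		/* step over the label and, if present, its delimiter */
		if (dot)
			len++;
		s += len;
		n -= len;
	}
	return count;
}
```

Unlike strtok_r(), which skips runs of delimiters, this loop treats
"a..b" as three labels with an empty one in the middle, so an empty
label cannot silently disappear from the comparison.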

> +
> +     for (; url_tok && pat_tok; url_tok = strtok_r(NULL, ".", &url_save),
> +                                pat_tok = strtok_r(NULL, ".", &pat_save)) {
> +             if (!strcmp(pat_tok, "*"))
> +                     continue; /* a simple glob matches everything */

s/glob/asterisk/

Other than that, the patch looks OK.



diff --git a/urlmatch.c b/urlmatch.c
index 53ff972a60..8dfc7fd28a 100644
--- a/urlmatch.c
+++ b/urlmatch.c
@@ -63,36 +63,47 @@ static int append_normalized_escapes(struct strbuf *buf,
        return 1;
 }
 
+static const char *end_of_token(const char *s, int c, size_t n)
+{
+       const char *next = memchr(s, c, n);
+       if (!next)
+               next = s + n;
+       return next;
+}
+
 static int match_host(const struct url_info *url_info,
                      const struct url_info *pattern_info)
 {
-       char *url = xmemdupz(url_info->url + url_info->host_off, url_info->host_len);
-       char *pat = xmemdupz(pattern_info->url + pattern_info->host_off, pattern_info->host_len);
-       char *url_tok, *pat_tok, *url_save, *pat_save;
-       int matching;
-
-       url_tok = strtok_r(url, ".", &url_save);
-       pat_tok = strtok_r(pat, ".", &pat_save);
-
-       for (; url_tok && pat_tok; url_tok = strtok_r(NULL, ".", &url_save),
-                                  pat_tok = strtok_r(NULL, ".", &pat_save)) {
-               if (!strcmp(pat_tok, "*"))
-                       continue; /* a simple glob matches everything */
-
-               if (strcmp(url_tok, pat_tok)) {
-                       /* subdomains do not match */
-                       matching = 0;
-                       break;
-               }
+       const char *url = url_info->url + url_info->host_off;
+       const char *pat = pattern_info->url + pattern_info->host_off;
+       int url_len = url_info->host_len;
+       int pat_len = pattern_info->host_len;
+
+       while (url_len && pat_len) {
+               const char *url_next = end_of_token(url, '.', url_len);
+               const char *pat_next = end_of_token(pat, '.', pat_len);
+
+               if (pat_next == pat + 1 && pat[0] == '*')
+                       /* wildcard matches anything */
+                       ;
+               else if ((pat_next - pat) == (url_next - url) &&
+                        !memcmp(url, pat, url_next - url))
+                       /* the components are the same */
+                       ;
+               else
+                       return 0; /* found an unmatch */
+
+               if (url_next < url + url_len)
+                       url_next++;
+               url_len -= url_next - url;
+               url = url_next;
+               if (pat_next < pat + pat_len)
+                       pat_next++;
+               pat_len -= pat_next - pat;
+               pat = pat_next;
        }
 
-       /* matching if both URL and pattern are at their ends */
-       matching = (url_tok == NULL && pat_tok == NULL);
-
-       free(url);
-       free(pat);
-
-       return matching;
+       return !url_len && !pat_len; /* both must be exhausted */
 }
 
 static char *url_normalize_1(const char *url, struct url_info *out_info, char allow_globs)
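
For experimenting with the matching rule outside of git, the following
is a stand-alone sketch of the same label-by-label comparison.  It works
on NUL-terminated host names rather than struct url_info, and the names
label_end() and host_matches() are invented for this example.  It also
assumes that a match requires both host names to be fully consumed,
which is what the original strtok_r() version checked at the end.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the end of the label starting at s: the next '.' or s + n. */
static const char *label_end(const char *s, size_t n)
{
	const char *dot = memchr(s, '.', n);
	return dot ? dot : s + n;
}

/*
 * Hypothetical stand-alone variant of the host comparison.
 * Returns 1 if host matches pattern, 0 otherwise.
 */
static int host_matches(const char *host, const char *pattern)
{
	size_t host_len, pat_len;

	assert(host && pattern);
	host_len = strlen(host);
	pat_len = strlen(pattern);

	while (host_len && pat_len) {
		const char *host_next = label_end(host, host_len);
		const char *pat_next = label_end(pattern, pat_len);
		size_t hl = host_next - host;
		size_t pl = pat_next - pattern;

		/* a lone '*' matches any one label; otherwise compare bytes */
		if (!(pl == 1 && pattern[0] == '*') &&
		    (hl != pl || memcmp(host, pattern, hl)))
			return 0;

		/* step over the label and its trailing '.', if any */
		if (host_next < host + host_len)
			host_next++;
		host_len -= host_next - host;
		host = host_next;
		if (pat_next < pattern + pat_len)
			pat_next++;
		pat_len -= pat_next - pattern;
		pattern = pat_next;
	}

	/* assumption of this sketch: both names must be used up */
	return !host_len && !pat_len;
}
```

With this sketch, host_matches("foo.example.com", "*.example.com") is
true, while "foo.bar.example.com" and "example.com" do not match that
pattern, and "www*.ibm.com" is compared literally rather than as a glob.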
