Re: [Bug-wget] [PATCH] Fix possible issues running in a turkish locale

Ángel González Thu, 20 Nov 2014 11:18:07 -0800

Darshit Shah wrote:

In short, I;m not sure if we want to use c_strcasecmp for in http.c or
recur.c since domain names allowed to be non-ASCII. And if I'm not
wrong, even the HTTP headers may contain non-ASCII data. In such cases
would c_strcasecmp work correctly?

I expect them to be in punycode at that point. Maybe I read it too fast.
And if it isn't, it would only work -by chance- if the page uses the same
encoding as your locale.

I don't think the headers would be a problem. Note that most cases only
deal with the ascii case-unsensitiveness (eg. html5 explicitely notes that).


Tim Ruehsen wrote:

On Thursday 20 November 2014 00:12:08 Ángel González wrote:

I am attaching it here fwiw. I generally changed them on a few more
places, although I think
some of your edits to init.c are incorrect, as well as those on
progress.c: as they are
user-parameters, they _might_ be introduced in the user locale (they
would misteriously fail
when run under C locale in cron, though. I'm not so sure it should be
supported).

Please be more specific.

I would have included the diff of the diffs, but that would have beenreally ugly to read. :)


Seemed better to provide the other patch.

Imaging user input --level=INF (or --level=inf) will be compared with "inf".
The turkish people will be used to enter the correct char in this case, namely
'I' or 'i' and not 'İ' or 'ı'. Else most programs would simply break. In this
case a ASCII comparison (c_str...) is absolutely ok. Using locale-aware
comparison would not work (well, the user could try it out since he gets
immediate response by Wget).

Then we can get rid of the strcasecmp comparisons left.

Notwithstanding with keeping parameters in user-locale case, I made the
accepts list C-case.
That's the most arguable one, but doesn't seem sensible to change the
code to support that.

I think this is not correct.

This comment was related to in_acclist. That function can performcomparisons

in several ways:
- Using fnmatch_nocase (this function explicitely uses C locale)

- Using match_tail() (I had changed it to c_strcasecmp, as it was usedfor eg. cookies)

- Using a strcasecmp

Thus, I changed the third branch to c_strcasecmp

The accepts and regexes are filename related.
Filenames are not limited to ASCII. What we have to do here is a normalization
to UTF-8 (using the users locale). Filenames/pathes found in HTML or CSS also
have to be converted to UTF-8 (using the page's locale). These UTF-8 strings
have to be compared with an appropriate function. str(n)casecmp would not be
correct here, a byte-by-byte comparison like c_str(n)casecmp is better but not
perfect. libunistring has functions for that.

You are probably right, which means we should redo this part of the code.

I would suggest that I push my patch.
We still have two weeks to inspect the changes... if in doubt, let's set up a
test case. Just give an example of what could go wrong and we can simply try
it out.

Ok, I rebased my patch on top of what you committed.

Changes:
-Remove strcase(n)cmp from cmpt.c -> Shouldn't be a problem
- Domain comparisons on cookies.c -> Perhaps not ok
- css-url.c and ftp-basic.c ->  hardcoded ascii text , shall be c_strcasecmp
- ftp.c -> it's a user-provided glob, we can keep strcasecmp

- hash -> Mixed. make_nocase_string_hash_table is used for hosts, tagsand attributes

- html-parse.c -> the tags are only C-insensitive. Shall be changed
- html-url.c -> same with attrs
- parse_content_range  -> this is a header reply. Change
- persistent_available_p (1) -> maybe not ok
- persistent_available_p (2-3) -> these are C constants. Change

- ensure_extension -> only called with either .html or .css. I wouldchange it.

- netrc -> Domain. Maybe not ok
- recur -> idem
- openssl.c -> clarifies a comment
- url.c -> We are comapring with a scheme, thus C comparison.
- utils.c -> Discussed above

So while some are completely clear, we should clarify hosts comparison.

Regards

>From fade2aab9a2a1914f48184f56ee31ee11182b4b0 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=81ngel=20Gonz=C3=A1lez?= <[email protected]>
Date: Wed, 19 Nov 2014 23:50:20 +0100
Subject: [PATCH] Replace strcasecmp and strncasecmp with c_strcasecmp and
 c_strncasecmp

---
 bootstrap.conf   |  1 +
 src/cmpt.c       | 63 --------------------------------------------------------
 src/cookies.c    |  4 ++--
 src/css-url.c    |  2 +-
 src/ftp-basic.c  |  3 +--
 src/ftp.c        |  4 ++--
 src/hash.c       |  2 +-
 src/html-parse.c |  2 +-
 src/html-url.c   |  2 +-
 src/http.c       | 14 ++++++-------
 src/mswindows.h  |  8 -------
 src/netrc.c      |  2 +-
 src/openssl.c    |  2 +-
 src/recur.c      |  4 ++--
 src/url.c        |  2 +-
 src/utils.c      |  4 ++--
 16 files changed, 24 insertions(+), 95 deletions(-)

diff --git a/bootstrap.conf b/bootstrap.conf
index 71aa913..045dbba 100644
--- a/bootstrap.conf
+++ b/bootstrap.conf
@@ -70,6 +70,7 @@ sigpipe
 snprintf
 socket
 stdbool
+strcase
 strcasestr
 strerror_r-posix
 strtok_r
diff --git a/src/cmpt.c b/src/cmpt.c
index 889413f..cb16253 100644
--- a/src/cmpt.c
+++ b/src/cmpt.c
@@ -47,69 +47,6 @@ as that of the covered work.  */
    new Wget-specific interfaces -- those should be placed in utils.c
    or elsewhere.  */
 
-/* strcasecmp and strncasecmp apparently originated with BSD 4.4.
-   SUSv3 seems to be the only standard out there (that I can find)
-   that requires their existence, so in theory there might be systems
-   still in use that lack them.  Note that these don't get defined
-   under Windows because mswindows.h defines them to the equivalent
-   Windows functions stricmp and strnicmp.  */
-
-#ifndef HAVE_STRCASECMP
-/* From GNU libc.  */
-/* Compare S1 and S2, ignoring case, returning less than, equal to or
-   greater than zero if S1 is lexiographically less than,
-   equal to or greater than S2.  */
-int
-strcasecmp (const char *s1, const char *s2)
-{
-  register const unsigned char *p1 = (const unsigned char *) s1;
-  register const unsigned char *p2 = (const unsigned char *) s2;
-  unsigned char c1, c2;
-
-  if (p1 == p2)
-    return 0;
-
-  do
-    {
-      c1 = c_tolower (*p1++);
-      c2 = c_tolower (*p2++);
-      if (c1 == '\0')
-        break;
-    }
-  while (c1 == c2);
-
-  return c1 - c2;
-}
-#endif /* not HAVE_STRCASECMP */
-
-#ifndef HAVE_STRNCASECMP
-/* From GNU libc.  */
-/* Compare no more than N characters of S1 and S2,
-   ignoring case, returning less than, equal to or
-   greater than zero if S1 is lexicographically less
-   than, equal to or greater than S2.  */
-int
-strncasecmp (const char *s1, const char *s2, size_t n)
-{
-  register const unsigned char *p1 = (const unsigned char *) s1;
-  register const unsigned char *p2 = (const unsigned char *) s2;
-  unsigned char c1, c2;
-
-  if (p1 == p2 || n == 0)
-    return 0;
-
-  do
-    {
-      c1 = c_tolower (*p1++);
-      c2 = c_tolower (*p2++);
-      if (c1 == '\0' || c1 != c2)
-        return c1 - c2;
-    } while (--n > 0);
-
-  return c1 - c2;
-}
-#endif /* not HAVE_STRNCASECMP */
-
 #ifndef HAVE_MEMRCHR
 /* memrchr is a GNU extension.  It is like the memchr function, except
    that it searches backwards from the end of the n bytes pointed to
diff --git a/src/cookies.c b/src/cookies.c
index 365f8d5..59f587b 100644
--- a/src/cookies.c
+++ b/src/cookies.c
@@ -560,7 +560,7 @@ no_psl:
   DEBUGP (("cdm: 2"));
 
   /* For the sake of efficiency, check for exact match first. */
-  if (0 == strcasecmp (cookie_domain, host))
+  if (0 == c_strcasecmp (cookie_domain, host))
     return true;
 
   DEBUGP ((" 3"));
@@ -895,7 +895,7 @@ cookie_matches_url (const struct cookie *cookie,
      equal to HOST.  If not, assume success on the grounds of the
      cookie's chain having been found by find_chains_of_host.  */
   if (cookie->domain_exact
-      && 0 != strcasecmp (host, cookie->domain))
+      && 0 != c_strcasecmp (host, cookie->domain))
     return false;
 
   pg = path_matches (path, cookie->path);
diff --git a/src/css-url.c b/src/css-url.c
index 8ee4e8c..cac9dbc 100644
--- a/src/css-url.c
+++ b/src/css-url.c
@@ -73,7 +73,7 @@ extern int yylex (void);
 static char *
 get_uri_string (const char *at, int *pos, int *length)
 {
-  if (0 != strncasecmp (at + *pos, "url(", 4))
+  if (0 != c_strncasecmp (at + *pos, "url(", 4))
     return NULL;
 
   *pos += 4;
diff --git a/src/ftp-basic.c b/src/ftp-basic.c
index 2f5765e..b9c22c5 100644
--- a/src/ftp-basic.c
+++ b/src/ftp-basic.c
@@ -45,7 +45,6 @@ as that of the covered work.  */
 #include "retr.h"
 #include "c-strcase.h"
 
-
 /* Get the response of FTP server and allocate enough room to handle
    it.  <CR> and <LF> characters are stripped from the line, and the
    line is 0-terminated.  All the response lines but the last one are
@@ -1208,7 +1207,7 @@ char
 ftp_process_type (const char *params)
 {
   if (params
-      && 0 == strncasecmp (params, "type=", 5)
+      && 0 == c_strncasecmp (params, "type=", 5)
       && params[5] != '\0')
     return c_toupper (params[5]);
   else
diff --git a/src/ftp.c b/src/ftp.c
index 8b4ce3b..1e3ed95 100644
--- a/src/ftp.c
+++ b/src/ftp.c
@@ -2331,10 +2331,10 @@ ftp_retrieve_glob (struct url *u, ccon *con, int action)
            * harmless.
            */
           int (*cmp) (const char *, const char *)
-            = opt.ignore_case ? strcasecmp : (int (*)())strcmp;
+            = opt.ignore_case ? c_strcasecmp : (int (*)())strcmp;
 #else /* def __VMS */
           int (*cmp) (const char *, const char *)
-            = opt.ignore_case ? strcasecmp : strcmp;
+            = opt.ignore_case ? c_strcasecmp : strcmp;
 #endif /* def __VMS [else] */
           f = start;
           while (f)
diff --git a/src/hash.c b/src/hash.c
index 9514d23..d7278f2 100644
--- a/src/hash.c
+++ b/src/hash.c
@@ -690,7 +690,7 @@ hash_string_nocase (const void *key)
 static int
 string_cmp_nocase (const void *s1, const void *s2)
 {
-  return !strcasecmp ((const char *)s1, (const char *)s2);
+  return !c_strcasecmp ((const char *)s1, (const char *)s2);
 }
 
 /* Like make_string_hash_table, but uses string_hash_nocase and
diff --git a/src/html-parse.c b/src/html-parse.c
index eaf833d..e8d8e21 100644
--- a/src/html-parse.c
+++ b/src/html-parse.c
@@ -352,7 +352,7 @@ tagstack_find (struct tagstack_item *tail, const char *tagname_begin,
     {
       if (len == (tail->tagname_end - tail->tagname_begin))
         {
-          if (0 == strncasecmp (tail->tagname_begin, tagname_begin, len))
+          if (0 == c_strncasecmp (tail->tagname_begin, tagname_begin, len))
             return tail;
         }
       tail = tail->prev;
diff --git a/src/html-url.c b/src/html-url.c
index fb4cad1..73da817 100644
--- a/src/html-url.c
+++ b/src/html-url.c
@@ -452,7 +452,7 @@ tag_find_urls (int tagid, struct taginfo *tag, struct map_context *ctx)
          has three attributes.  */
       for (i = first; i < size && tag_url_attributes[i].tagid == tagid; i++)
         {
-          if (0 == strcasecmp (tag->attrs[attrind].name,
+          if (0 == c_strcasecmp (tag->attrs[attrind].name,
                                tag_url_attributes[i].attr_name))
             {
               struct urlpos *up = append_url (link, ATTR_POS(tag,attrind,ctx),
diff --git a/src/http.c b/src/http.c
index b96d4a9..4cc26d9 100644
--- a/src/http.c
+++ b/src/http.c
@@ -871,7 +871,7 @@ parse_content_range (const char *hdr, wgint *first_byte_ptr,
   /* Ancient versions of Netscape proxy server, presumably predating
      rfc2068, sent out `Content-Range' without the "bytes"
      specifier.  */
-  if (0 == strncasecmp (hdr, "bytes", 5))
+  if (0 == c_strncasecmp (hdr, "bytes", 5))
     {
       hdr += 5;
       /* "JavaWebServer/1.1.1" sends "bytes: x-y/z", contrary to the
@@ -1336,7 +1336,7 @@ persistent_available_p (const char *host, int port, bool ssl,
 
   /* If the host is the same, we're in business.  If not, there is
      still hope -- read below.  */
-  if (0 != strcasecmp (host, pconn.host))
+  if (0 != c_strcasecmp (host, pconn.host))
     {
       /* Check if pconn.socket is talking to HOST under another name.
          This happens often when both sites are virtual hosts
@@ -2756,14 +2756,14 @@ read_header:
      of the multitude of broken CGI's that "forget" to generate the
      content-type.  */
   if (!type ||
-        0 == strncasecmp (type, TEXTHTML_S, strlen (TEXTHTML_S)) ||
-        0 == strncasecmp (type, TEXTXHTML_S, strlen (TEXTXHTML_S)))
+        0 == c_strncasecmp (type, TEXTHTML_S, strlen (TEXTHTML_S)) ||
+        0 == c_strncasecmp (type, TEXTXHTML_S, strlen (TEXTXHTML_S)))
     *dt |= TEXTHTML;
   else
     *dt &= ~TEXTHTML;
 
   if (type &&
-      0 == strncasecmp (type, TEXTCSS_S, strlen (TEXTCSS_S)))
+      0 == c_strncasecmp (type, TEXTCSS_S, strlen (TEXTCSS_S)))
     *dt |= TEXTCSS;
   else
     *dt &= ~TEXTCSS;
@@ -4081,8 +4081,8 @@ ensure_extension (struct http_stat *hs, const char *ext, int *dt)
     }
 
   if (last_period_in_local_filename == NULL
-      || !(0 == strcasecmp (last_period_in_local_filename, shortext)
-           || 0 == strcasecmp (last_period_in_local_filename, ext)))
+      || !(0 == c_strcasecmp (last_period_in_local_filename, shortext)
+           || 0 == c_strcasecmp (last_period_in_local_filename, ext)))
     {
       int local_filename_len = strlen (hs->local_file);
       /* Resize the local file, allowing for ".html" preceded by
diff --git a/src/mswindows.h b/src/mswindows.h
index 3b815f0..482edb1 100644
--- a/src/mswindows.h
+++ b/src/mswindows.h
@@ -57,14 +57,6 @@ as that of the covered work.  */
 /* Declares getpid(). */
 #include <process.h>
 
-/* We have strcasecmp and strncasecmp, just under different names.  */
-#ifndef HAVE_STRCASECMP
-# define strcasecmp stricmp
-#endif
-#ifndef HAVE_STRNCASECMP
-# define strncasecmp strnicmp
-#endif
-
 #include <stdio.h>
 
 /* Define a wgint type under Windows. */
diff --git a/src/netrc.c b/src/netrc.c
index dec9e2a..228003d 100644
--- a/src/netrc.c
+++ b/src/netrc.c
@@ -117,7 +117,7 @@ search_netrc (const char *host, const char **acc, const char **passwd,
     {
       if (!l->host)
         continue;
-      else if (!strcasecmp (l->host, host))
+      else if (!c_strcasecmp (l->host, host))
         break;
     }
   if (l)
diff --git a/src/openssl.c b/src/openssl.c
index 3a2ee5f..971829d 100644
--- a/src/openssl.c
+++ b/src/openssl.c
@@ -564,7 +564,7 @@ ssl_connect_wget (int fd, const char *hostname)
    not bar.com [or foo.bar.com]."
 
    If the pattern contain no wildcards, pattern_match(a, b) is
-   equivalent to !strcasecmp(a, b).  */
+   equivalent to !strcasecmp(a, b) in a C locale.  */
 
 static bool
 pattern_match (const char *pattern, const char *string)
diff --git a/src/recur.c b/src/recur.c
index 799fffa..b90a679 100644
--- a/src/recur.c
+++ b/src/recur.c
@@ -568,7 +568,7 @@ download_child_p (const struct urlpos *upos, struct url *parent, int depth,
      the parent page when in -p mode.  */
   if (opt.no_parent
       && schemes_are_similar_p (u->scheme, start_url_parsed->scheme)
-      && 0 == strcasecmp (u->host, start_url_parsed->host)
+      && 0 == c_strcasecmp (u->host, start_url_parsed->host)
       && (u->scheme != start_url_parsed->scheme
           || u->port == start_url_parsed->port)
       && !(opt.page_requisites && upos->link_inline_p))
@@ -625,7 +625,7 @@ download_child_p (const struct urlpos *upos, struct url *parent, int depth,
 
   /* 7. */
   if (schemes_are_similar_p (u->scheme, parent->scheme))
-    if (!opt.spanhost && 0 != strcasecmp (parent->host, u->host))
+    if (!opt.spanhost && 0 != c_strcasecmp (parent->host, u->host))
       {
         DEBUGP (("This is not the same hostname as the parent's (%s and %s).\n",
                  u->host, parent->host));
diff --git a/src/url.c b/src/url.c
index 742b0d7..d405f9b 100644
--- a/src/url.c
+++ b/src/url.c
@@ -429,7 +429,7 @@ url_scheme (const char *url)
   int i;
 
   for (i = 0; supported_schemes[i].leading_string; i++)
-    if (0 == strncasecmp (url, supported_schemes[i].leading_string,
+    if (0 == c_strncasecmp (url, supported_schemes[i].leading_string,
                           strlen (supported_schemes[i].leading_string)))
       {
         if (!(supported_schemes[i].flags & scm_disabled))
diff --git a/src/utils.c b/src/utils.c
index 618c12e..37b7932 100644
--- a/src/utils.c
+++ b/src/utils.c
@@ -1039,7 +1039,7 @@ match_tail (const char *string, const char *tail, bool fold_case)
   if (!fold_case)
     return !strcmp (string + pos, tail);
   else
-    return !strcasecmp (string + pos, tail);
+    return !c_strcasecmp (string + pos, tail);
 }
 
 /* Checks whether string S matches each element of ACCEPTS.  A list
@@ -1071,7 +1071,7 @@ in_acclist (const char *const *accepts, const char *s, bool backward)
           else
             {
               int cmp = opt.ignore_case
-                ? strcasecmp (s, *accepts) : strcmp (s, *accepts);
+                ? c_strcasecmp (s, *accepts) : strcmp (s, *accepts);
               if (cmp == 0)
                 return true;
             }
-- 
2.1.3

Re: [Bug-wget] [PATCH] Fix possible issues running in a turkish locale

Reply via email to