Hrvoje Niksic <[EMAIL PROTECTED]> writes:

> Jacques Beigbeder <[EMAIL PROTECTED]> writes:
>
>> I ran into trouble with:
>>      wget -m http://some/site
>> because of a line like:
>>      <img src="a.gif" v:shapes="...">
>> v:shapes contains a character ':', so a.gif isn't mirrored.
>>
>> Correction for wget 1.8.1:
>> (line 340 of src/html-parse.c)
>> #define NAME_CHAR_P(x) (ISALNUM (x) || (x) == '.' || (x) == '-' || (x) == '_' || x == ':')
>>                                                                               ^^^^^^^^^^^^
>> Hope this helps.
>
> Thanks for the report.  I think I'll make NAME_CHAR_P much more
> forgiving about the type of characters it uses.  Doing anything else
> is counter-productive, because too many pages use or leak weird
> characters in attribute names.

Here is a patch:

2002-05-27  Hrvoje Niksic  <[EMAIL PROTECTED]>

        * html-parse.c (NAME_CHAR_P): Allow almost any character here.

Index: src/html-parse.c
===================================================================
RCS file: /pack/anoncvs/wget/src/html-parse.c,v
retrieving revision 1.11
diff -u -r1.11 html-parse.c
--- src/html-parse.c    2002/05/18 02:16:22     1.11
+++ src/html-parse.c    2002/05/27 15:02:25
@@ -344,11 +344,24 @@
   return 1;
 }
 
-/* RFC1866: name [of attribute or tag] consists of letters, digits,
-   periods, or hyphens.  We also allow _, for compatibility with
-   brain-damaged generators.  */
-#define NAME_CHAR_P(x) (ISALNUM (x) || (x) == '.' || (x) == '-' || (x) == '_')
+/* Originally we used to adhere to RFC1866 here, and allowed only
+   letters, digits, periods, and hyphens as names (of tags or
+   attributes).  However, this broke too many pages which used
+   proprietary or strange attributes, e.g.  <img src="a.gif"
+   v:shapes="whatever">.
+
+   So now we allow any character except:
+     * whitespace
+     * 8-bit and control chars
+     * characters that clearly cannot be part of name:
+       '=', '>', '/'.
 
+   This only affects attribute and tag names; attribute values allow
+   an even greater variety of characters.  */
+
+#define NAME_CHAR_P(x) ((x) > 32 && (x) < 127                          \
+                       && (x) != '=' && (x) != '>' && (x) != '/')
+
 /* States while advancing through comments. */
 #define AC_S_DONE      0
 #define AC_S_BACKOUT   1
@@ -450,10 +463,10 @@
            }
          break;
        case AC_S_DCLNAME:
-         if (NAME_CHAR_P (ch))
-           ch = *p++;
-         else if (ch == '-')
+         if (ch == '-')
            state = AC_S_DASH1;
+         else if (NAME_CHAR_P (ch))
+           ch = *p++;
          else
            state = AC_S_DEFAULT;
          break;
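
To illustrate the effect of the change, here is a minimal standalone
sketch; it is not part of wget.  OLD_NAME_CHAR_P, NEW_NAME_CHAR_P,
old_name_len and new_name_len are hypothetical names made up for this
example, and the standard isalnum() stands in for wget's ISALNUM,
which is an assumption.

```c
#include <ctype.h>
#include <stddef.h>

/* Pre-patch rule: letters, digits, '.', '-' and '_'.
   Assumption: isalnum() stands in for wget's ISALNUM.  */
#define OLD_NAME_CHAR_P(x) (isalnum ((unsigned char) (x)) || (x) == '.' \
                            || (x) == '-' || (x) == '_')

/* Post-patch rule, as in the diff above: any printable ASCII
   character except '=', '>' and '/'.  */
#define NEW_NAME_CHAR_P(x) ((x) > 32 && (x) < 127 \
                            && (x) != '=' && (x) != '>' && (x) != '/')

/* Hypothetical helper: length of the leading name in S under the
   old rule.  */
size_t
old_name_len (const char *s)
{
  size_t n = 0;
  while (s[n] && OLD_NAME_CHAR_P (s[n]))
    ++n;
  return n;
}

/* Hypothetical helper: the same scan under the new rule.  */
size_t
new_name_len (const char *s)
{
  size_t n = 0;
  while (s[n] && NEW_NAME_CHAR_P (s[n]))
    ++n;
  return n;
}
```

For the input v:shapes="whatever", old_name_len returns 1 (the scan
stops at ':'), while new_name_len returns 8 (the whole of "v:shapes",
stopping at '=').  Note also that '-' satisfies both macros, which is
presumably why the second hunk tests for '-' before NAME_CHAR_P in
the AC_S_DCLNAME state: with the tests in the old order, the '-'
branch could not be reached.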
