Hrvoje Niksic <[EMAIL PROTECTED]> writes:

> Jacques Beigbeder <[EMAIL PROTECTED]> writes:
>
>> I ran into a trouble with:
>>      wget -m http://some/site
>> because of a line like:
>>      <img src="a.gif" v:shapes="...">
>> v:shapes contains a character ':', so a.gif isn't mirrored.
>>
>> Correction for wget 1.8.1:
>> (line 340 of src/html-parse.c)
>> #define NAME_CHAR_P(x) (ISALNUM (x) || (x) == '.' || (x) == '-' || (x) == '_' || x == ':')
>>                                                                               ^^^^^^^^^^^^
>> Hope this helps.
>
> Thanks for the report.  I think I'll make NAME_CHAR_P much more
> forgiving about the type of characters it uses.  Doing anything else
> is counter-productive, because too many pages use or leak weird
> characters in attribute names.

Here is a patch:

2002-05-27  Hrvoje Niksic  <[EMAIL PROTECTED]>

        * html-parse.c (NAME_CHAR_P): Allow almost any character here.

Index: src/html-parse.c
===================================================================
RCS file: /pack/anoncvs/wget/src/html-parse.c,v
retrieving revision 1.11
diff -u -r1.11 html-parse.c
--- src/html-parse.c    2002/05/18 02:16:22     1.11
+++ src/html-parse.c    2002/05/27 15:02:25
@@ -344,11 +344,24 @@
   return 1;
 }
 
-/* RFC1866: name [of attribute or tag] consists of letters, digits,
-   periods, or hyphens.  We also allow _, for compatibility with
-   brain-damaged generators.  */
-#define NAME_CHAR_P(x) (ISALNUM (x) || (x) == '.' || (x) == '-' || (x) == '_')
+/* Originally we used to adhere to RFC1866 here, and allowed only
+   letters, digits, periods, and hyphens as names (of tags or
+   attributes).  However, this broke too many pages which used
+   proprietary or strange attributes, e.g. <img src="a.gif"
+   v:shapes="whatever">.
+
+   So now we allow any character except:
+     * whitespace
+     * 8-bit and control chars
+     * characters that clearly cannot be part of name:
+       '=', '>', '/'.
 
+   This only affects attribute and tag names; attribute values allow
+   an even greater variety of characters.  */
+
+#define NAME_CHAR_P(x) ((x) > 32 && (x) < 127 \
+                        && (x) != '=' && (x) != '>' && (x) != '/')
+
 /* States while advancing through comments. */
 #define AC_S_DONE       0
 #define AC_S_BACKOUT    1
@@ -450,10 +463,10 @@
             }
           break;
         case AC_S_DCLNAME:
-          if (NAME_CHAR_P (ch))
-            ch = *p++;
-          else if (ch == '-')
+          if (ch == '-')
             state = AC_S_DASH1;
+          else if (NAME_CHAR_P (ch))
+            ch = *p++;
           else
             state = AC_S_DEFAULT;
           break;
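
For anyone who wants to see the effect without building wget, here is a
small standalone sketch comparing the two definitions.  It is not part of
the patch: the function names old_name_char_p, new_name_char_p and
name_len are made up for illustration, and plain isalnum() stands in for
wget's ISALNUM wrapper, which is an assumption on my part.

```c
#include <ctype.h>
#include <stddef.h>

/* Old rule from the removed lines of the patch.  isalnum() is used in
   place of wget's ISALNUM macro (assumed to be equivalent here).  */
static int
old_name_char_p (int c)
{
  return isalnum (c) || c == '.' || c == '-' || c == '_';
}

/* New rule from the added lines of the patch: printable 7-bit
   characters other than '=', '>' and '/'.  */
static int
new_name_char_p (int c)
{
  return c > 32 && c < 127 && c != '=' && c != '>' && c != '/';
}

/* Hypothetical helper: length of the leading run of S that PRED
   accepts as a tag or attribute name.  */
static size_t
name_len (const char *s, int (*pred) (int))
{
  size_t n = 0;
  /* Cast through unsigned char so 8-bit input is never negative.  */
  while (s[n] != '\0' && pred ((unsigned char) s[n]))
    n++;
  return n;
}
```

With the input v:shapes="whatever", the old rule stops at the colon after
one character, while the new rule accepts all eight characters of v:shapes
and stops at '='.  Both rules accept '-', which is why the second hunk has
to test for '-' before NAME_CHAR_P in AC_S_DCLNAME; in the old order the
'-' branch could never be reached.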