Micah Cowan wrote:
>Tony Lewis wrote:
>"hash" doesn't apply to URIs that wget would handle (it's called the
>"fragment" portion in relevant RFCs), as that's not normally part of
>what gets sent to the server.

But it can appear in the links within a page. Are you going to discard
the fragment portion before doing the match?

>I'd forgotten the port number... probably we should include that with
>domain, and consider calling it "host" instead. Actually, since Wget
>commonly uses the term "domain", we should at least provide that as an
>alternative name.

While we can't ignore it, I doubt that people want to match on the port
very often (if ever!). It's probably best to have :domain: refer just to
the host name portion of the URL.

>Note, of course, that -D is still an option, and -D site.com would be
>equivalent, so probably still the best choice.

So many different ways to accomplish the same thing. Yikes! I can just
see someone posting a question to the list and getting three correct
answers using different command line options (-D, -A, and -z).

>> Sounds OK, but I think you mean: [ : [ components ] [ / flags ] : ]
>Or something similar. '::' would be silly :)

I agree that '::' is silly and I'm assuming the parser would treat it as
a no-op.

>No, they're not. But we only have a handful of sane single-character
>options remaining, and only three pairings of uppercase/lowercase. The
>other two are -J/j and -G/g.

Hmm... -g for --grep? :-)

>So, that makes two against so far. Though I think I may have persuaded
>Matt Woehlke down to a slight, vague preference.

I could live with either solution, but I think explicit anchors make the
most intuitive sense when you're doing a pattern match.

>> In the more general case of anchoring, I think ':path:foo' should match
>> '/path/to/foo.html' and '/foo/baz/index.html'.
>Yeah, except when is that useful?

When foo is a CGI script such that /path/to/foo and
/path/to/foo/arg/u/ments invoke the same script.
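
To make the anchoring and fragment questions concrete, here is a minimal Python sketch. It is not Wget code: the helper names path_matches and host_only are made up for illustration, and it assumes that the fragment is dropped before matching and that :domain: means the host name without the port.

```python
import re
from urllib.parse import urlsplit


def path_matches(pattern: str, url: str) -> bool:
    """Search `pattern` anywhere in the path of `url`.

    urlsplit() keeps the fragment in a separate field, so it never
    takes part in the match (an assumption about the proposal).
    """
    return re.search(pattern, urlsplit(url).path) is not None


def host_only(url: str) -> str:
    """Return the host name without the port (assumed meaning of :domain:)."""
    return urlsplit(url).hostname or ""


urls = [
    "http://www.site.com/path/to/foo.html#section2",
    "http://www.site.com/foo/baz/index.html",
    "http://www.site.com/path/to/foo/arg/u/ments",
    "http://www.site.com:8080/bar/index.html#foo",
]

for url in urls:
    unanchored = path_matches(r"foo", url)      # like ':path:foo'
    anchored = path_matches(r"^/foo/", url)     # like ':path:^/foo/'
    print(f"{url}  foo={unanchored}  ^/foo/={anchored}  host={host_only(url)}")
```

With no implicit anchors, 'foo' matches the first three URLs, including the CGI-style /path/to/foo/arg/u/ments, while '^/foo/' matches only the second. The last URL matches neither, because its only "foo" is in the fragment.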
>>>If the components aren't specified, it would default to matching just
>>>the pathname portion of the URL.
>>I'm not sure this is the obvious behavior, but I would get used to it.
>It's open for discussion. What do you think the most obvious behavior
>would be?

I think :scheme-path: is the most obvious default.

>> What if you're recording unfollowed links to the SIDB? Don't you still want
>> those links to appear?
>Dunno. What do you think?

I would expect the URLs from the last traversal level to appear as
unfollowed links.

>>What's wrong with treating --traverse as meaning --traverse
>>':path/i:^.*\.html?$' and then having --traverse ':path/i:^.*\.php$'
>>override that behavior and only download PHP pages.
>Mainly, because I don't want to continue what I consider to be broken
>default behavior, if I can get away with it. :)

Either the current behavior is a bug, in which case it should be fixed,
or it's a feature that needs to be maintained. I was assuming it was a
feature and suggested a simple way to maintain the behavior while easily
allowing it to be overridden.

>>Given that the most common use case is to match against suffixes in the
>>path, perhaps ':path/i:^.*\.' and '$' should be implied so that --traverse
>>'(html?|php)' is interpreted as ':path/i:^.*\.(html?|php)$'.
>Again, I really want consistency with the regex rules.

OK. So how about adding :suffix: to the mix? Then one can say
--traverse ':suffix/i:(html?|php)'.

> Perhaps we should simply say, "most convenience be damned", and go for
> explicit anchors everywhere, even if that leads to a little more typing
> in most places. It certainly follows the principle of least surprise...

In all the places that I work with regular expressions, anchors are
explicitly specified, so *I* would be most surprised by having implicit
anchors.

>I'd probably go for --match ':path:.*/a/.*\.[Zz][Ii][Pp]' and --match
>':path:.*/b/.*\.[Jj][Pp][Ee]?[Gg]'. PCREs would make that somewhat nicer.
What about the possibility of including multiple components in the same
argument to match? For example, --match ':path:.*/a/.*:suffix/i:zip'.
This would mean that you have to escape a colon when it appears between
the scheme and domain as in ':url:http\://www.site.com/.*'.

In your proposal am I allowed to supply two --match parameters that are
OR'ed together?

Also, will URLs be converted to canonical form before the --match
operation is performed? Will the argument to match be canonicalized?
Will '=' match '=', '%3d', or '%3D'? Likewise, will '%61' match 'a' or
'%61'?
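
One possible answer to the canonicalization questions, sketched in Python, follows the percent-encoding normalization in RFC 3986 section 6.2.2: hex digits in escapes are uppercased, and escapes of unreserved characters are decoded. This is only an illustration of that rule, not a statement of what Wget does or will do, and the function name canonicalize is hypothetical.

```python
import re

# Unreserved characters per RFC 3986 section 2.3.
_UNRESERVED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def canonicalize(text: str) -> str:
    """Normalize percent-escapes: decode unreserved, uppercase the rest."""

    def fix(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        if char in _UNRESERVED:
            return char                      # '%61' becomes 'a'
        return "%" + match.group(1).upper()  # '%3d' becomes '%3D'

    return _ESCAPE.sub(fix, text)


print(canonicalize("/cgi?x%3d1"))  # -> /cgi?x%3D1
print(canonicalize("/%61bc"))      # -> /abc
print(canonicalize("/a=b"))        # -> /a=b
```

Under this rule '%3d' and '%3D' compare equal and '%61' matches 'a', but a literal '=' stays distinct from '%3D', because '=' is a reserved character and decoding its escape could change the meaning of the URL. Whether the argument to --match should get the same treatment is still an open question.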
