RE: [Bug-wget] Thoughts on regex support

Tony Lewis Thu, 24 Sep 2009 21:44:55 -0700

Micah Cowan wrote:

>Tony Lewis wrote:


>"hash" doesn't apply to URIs that wget would handle (it's called the
>"fragment" portion in relevant RFCs), as that's not normally part of
>what gets sent to the server.
But it can appear in the links within a page. Are you going to discard the
fragment portion before doing the match?
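
To make the question concrete, here is a minimal Python sketch (not wget code) of discarding the fragment from an extracted link before any pattern is applied. The helper name strip_fragment and the sample URLs are made up for illustration.

```python
from urllib.parse import urldefrag


def strip_fragment(link: str) -> str:
    """Return link without its '#fragment' part (hypothetical helper)."""
    url, _fragment = urldefrag(link)
    return url


# Two links from the same page that differ only in the fragment.
links = [
    "http://www.site.com/docs/index.html#install",
    "http://www.site.com/docs/index.html#usage",
]

# After stripping, both collapse to the same URL, so a regex would see
# one candidate rather than two.
unique = {strip_fragment(link) for link in links}
print(unique)
```

If the fragment were kept instead, a pattern anchored with '$' on the path would have to account for a trailing '#...' on in-page links.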

>I'd forgotten the port number... probably we should include that with
>domain, and consider calling it "host" instead. Actually, since Wget
>commonly uses the term "domain", we should at least provide that as an
>alternative name.

While we can't ignore it, I doubt that people want to match on the port very
often (if ever!). It's probably best to have :domain: refer just to the host
name portion of the URL.
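
A small Python sketch of the distinction, assuming :domain: would expose only the host name while a separate "host"-style component could keep the port. The function names are hypothetical and this is not how wget parses URLs.

```python
from urllib.parse import urlsplit


def domain_component(url: str) -> str:
    """Host name only: no port, no userinfo (assumed meaning of :domain:)."""
    return urlsplit(url).hostname or ""


def host_component(url: str) -> str:
    """Authority as written in the URL, including any port."""
    return urlsplit(url).netloc


url = "http://www.site.com:8080/index.html"
print(domain_component(url))
print(host_component(url))
```

With this split, a pattern such as 'site\.com$' works against the domain component whether or not the URL carries an explicit port.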

>Note, of course, that -D is still an option, and -D site.com would be
>equivalent, so probably still the best choice.

So many different ways to accomplish the same thing. Yikes! I can just see
someone posting a question to the list and getting three correct answers
using different command line options (-D, -A, and -z).

>> Sounds OK, but I think you mean: [ : [ components ] [ / flags ] : ]
>Or something similar. '::' would be silly :)

I agree that '::' is silly and I'm assuming the parser would treat it as a
no-op.

>No, they're not. But we only have a handful of sane single-character
>options remaining, and only three pairings of uppercase/lowercase. The
>other two are -J/j and -G/g.

Hmm... -g for --grep? :-)

>So, that makes two against so far. Though I think I may have persuaded
>Matt Woehlke down to a slight, vague preference.

I could live with either solution, but I think explicit anchors make the
most intuitive sense when you're doing a pattern match.

>> In the more general case of anchoring, I think ':path:foo' should match
>> '/path/to/foo.html' and '/foo/baz/index.html'.
>Yeah, except when is that useful?

When foo is a CGI script such that /path/to/foo and /path/to/foo/arg/u/ments
invoke the same script.
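
The CGI case can be sketched in Python by comparing an unanchored search with an implicitly anchored (whole-string) match. The paths and function names are illustrative only; re.search and re.fullmatch stand in for whatever regex engine wget would end up using.

```python
import re

paths = ["/path/to/foo", "/path/to/foo/arg/u/ments", "/path/to/bar"]


def unanchored(pattern: str, path: str) -> bool:
    """Pattern may match anywhere in the path (explicit anchors only)."""
    return re.search(pattern, path) is not None


def implicitly_anchored(pattern: str, path: str) -> bool:
    """Pattern must cover the whole path, as if wrapped in ^...$."""
    return re.fullmatch(pattern, path) is not None


for path in paths:
    print(path, unanchored("foo", path), implicitly_anchored(".*/foo", path))
```

The unanchored 'foo' accepts both forms of the script URL and rejects '/path/to/bar'. The implicitly anchored '.*/foo' accepts only the first form, so covering the second would need an extra '(/.*)?' or similar.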

>>>If the components aren't specified, it would default to matching just
>>>the pathname portion of the URL.
>>I'm not sure this is the obvious behavior, but I would get used to it.
>It's open for discussion. What do you think the most obvious behavior
>would be?

I think :scheme-path: is the most obvious default.

>> What if you're recording unfollowed links to the SIDB? Don't you still
>> want those links to appear?
>Dunno. What do you think?

I would expect the URLs from the last traversal level to appear as
unfollowed links.

>>What's wrong with treating --traverse as meaning --traverse
>>':path/i:^.*\.html?$' and then having --traverse ':path/i:^.*\.php$'
>>override that behavior and only download PHP pages.
>Mainly, because I don't want to continue what I consider to be broken
>default behavior, if I can get away with it. :)

Either the current behavior is a bug, in which case it should be fixed, or
it's a feature that needs to be maintained. I was assuming it was a feature
and suggested a simple way to maintain the behavior while easily allowing it
to be overridden.
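
As a rough Python sketch of that suggestion: a bare --traverse falls back to the HTML pattern, and a user-supplied pattern replaces it outright. The names DEFAULT_TRAVERSE and traverse_filter are hypothetical, and --traverse itself is only a proposed option in this thread.

```python
import re

# Assumed default, taken from the ':path/i:^.*\.html?$' suggestion.
DEFAULT_TRAVERSE = r"^.*\.html?$"


def traverse_filter(user_pattern=None):
    """Compile the user's path pattern, or the default if none is given."""
    pattern = DEFAULT_TRAVERSE if user_pattern is None else user_pattern
    # The '/i' flag in the proposal is modelled with re.IGNORECASE.
    return re.compile(pattern, re.IGNORECASE)


default_rx = traverse_filter()
php_rx = traverse_filter(r"^.*\.php$")

for path in ["/docs/Index.HTM", "/cgi/page.php"]:
    print(path, bool(default_rx.search(path)), bool(php_rx.search(path)))
```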

>>Given that the most common use case is to match against suffixes in the
>>path, perhaps ':path/i:^.*\.' and '$' should be implied so that --traverse
>>'(html?|php)' is interpreted as ':path/i:^.*\.(html?|php)$'.
>Again, I really want consistency with the regex rules.

OK. So how about adding :suffix: to the mix? Then one can say --traverse
':suffix/i:(html?|php)'.
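
One way to read :suffix: is as shorthand that expands to the anchored path pattern quoted earlier. The Python sketch below assumes exactly that expansion; suffix_to_path_regex is a made-up helper, not part of any wget proposal.

```python
import re


def suffix_to_path_regex(suffix_pattern: str) -> str:
    """Expand a suffix pattern to '^.*\\.(?:<pattern>)$' on the path."""
    return r"^.*\.(?:" + suffix_pattern + r")$"


# ':suffix/i:(html?|php)' with the '/i' flag modelled as re.IGNORECASE.
rx = re.compile(suffix_to_path_regex("(html?|php)"), re.IGNORECASE)

for path in ["/a/index.HTML", "/a/script.php", "/a/archive.zip"]:
    print(path, bool(rx.search(path)))
```

The regex the user types stays free of implied anchors; the anchoring lives in the definition of the component.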

> Perhaps we should simply say, "most convenience be damned", and go for
> explicit anchors everywhere, even if that leads to a little more typing
> in most places. It certainly follows the principle of least surprise...

In all the places that I work with regular expressions, anchors are
explicitly specified so *I* would be most surprised by having implicit
anchors.

>I'd probably go for --match ':path:.*/a/.*\.[Zz][Ii][Pp]' and --match
>':path:.*/b/.*\.[Jj][Pp][Ee]?[Gg]'. PCREs would make that somewhat nicer.

What about the possibility of including multiple components in the same
argument to match? For example, --match ':path:.*/a/.*:suffix/i:zip'. This
would mean that you have to escape a colon when it appears between the
scheme and domain as in ':url:http\://www.site.com/.*'.
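
A minimal Python sketch of how such an argument might be split, assuming every group has the form ':name[/flags]:regex' and that a literal colon inside a regex is written as '\:'. The function parse_match_spec is hypothetical, and the sketch does not handle a backslash that is itself escaped just before a colon.

```python
import re


def parse_match_spec(spec: str):
    """Split ':name/flags:regex' groups into (name, flags, regex) tuples."""
    # Split on colons that are not preceded by a backslash.
    tokens = re.split(r"(?<!\\):", spec)
    # Expect an empty first token, then alternating headers and regexes.
    if tokens[0] != "" or len(tokens) < 3 or len(tokens) % 2 == 0:
        raise ValueError(f"malformed match spec: {spec!r}")
    parsed = []
    for header, regex in zip(tokens[1::2], tokens[2::2]):
        name, _, flags = header.partition("/")
        # Turn the escaped colon back into a literal one.
        parsed.append((name, flags, regex.replace("\\:", ":")))
    return parsed


print(parse_match_spec(r":path:.*/a/.*:suffix/i:zip"))
print(parse_match_spec(r":url:http\://www.site.com/.*"))
```

The cost of this syntax shows up in the second call: any colon inside a regex has to be escaped, or the parser will read it as the start of another component.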

In your proposal am I allowed to supply two --match parameters that are
OR'ed together?

Also, will URLs be converted to canonical form before the --match operation
is performed? Will the argument to match be canonicalized? Will '=' match
'=', '%3d', or '%3D'? Likewise, will '%61' match 'a' or '%61'?
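
For reference, the Python sketch below applies the percent-encoding normalization described in RFC 3986 (section 6.2.2): hex digits are uppercased and escapes of unreserved characters are decoded, while escapes of reserved characters such as '=' are left in place. Whether wget would apply this to URLs, to patterns, or to neither is exactly the open question; normalize_escapes is a made-up name.

```python
import re
import string

# Unreserved characters per RFC 3986: letters, digits, '-', '.', '_', '~'.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def normalize_escapes(text: str) -> str:
    """Decode escapes of unreserved chars; uppercase all other escapes."""

    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()

    return re.sub(r"%([0-9A-Fa-f]{2})", fix, text)


print(normalize_escapes("/p%61th?x%3d1"))
print(normalize_escapes("/path?x=1"))
```

Under these rules '%61' and 'a' become the same thing and '%3d' becomes '%3D', but a literal '=' and '%3D' stay distinct, so a pattern containing '=' would still not match the escaped form.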
