What definition of regexp would you be following? Or would this be
making up something new? I'm not quite understanding the comment about
the comma and needing escaping for literal commas. That is true for any
character in the regexp language, so why the special concern for the comma?

I do like the [file|path|domain]: approach. Very nice and flexible.
(And it would be a huge help for one specific need I have!) I suggest also
including an "any" option as a shortcut for applying the same pattern to
all three fields.
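
For example (hypothetical syntax, mirroring the proposal quoted below):

  wget -r --filter=-any:yoyodyne

would be shorthand for:

  wget -r --filter=-file:yoyodyne --filter=-path:yoyodyne --filter=-domain:yoyodyne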

Jim



On Wed, 29 Mar 2006, Mauro Tortonesi wrote:

> 
> Hrvoje and I have recently been talking about adding regex support to wget. We
> were considering adding a new --filter option which, by supporting regular
> expressions, would allow more powerful ways of filtering URLs to download.
> 
> For instance, the new option could allow filtering on domain names, file
> names and URL paths. In the following example --filter is used to prevent any
> download from the www-*.yoyodyne.com domains and to restrict the download to
> .gif files:
> 
> wget -r --filter=-domain:'www-.*\.yoyodyne\.com' --filter=+file:'\.gif$'
> http://yoyodyne.com
> 
> (Notice that --filter interprets every given rule as a regex.)
> 
> I personally think the --filter option would be a great new feature for wget,
> and I have already started working on its implementation, but we still have a
> few open questions.
> 
> For instance, the syntax for --filter presented above is basically the
> following:
> 
> --filter=[+|-][file|path|domain]:REGEXP
> 
> Is it consistent? Is it flawed? Is there a more convenient alternative?
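> 
> To make the discussion more concrete, here is a rough sketch of how a
> single --filter argument could be parsed (illustrative only, using POSIX
> <regex.h>; the names are placeholders, not actual wget code):
> 
> #include <regex.h>
> #include <string.h>
> 
> enum filter_field { FIELD_FILE, FIELD_PATH, FIELD_DOMAIN };
> 
> struct filter_rule {
>   int allow;                  /* 1 for '+', 0 for '-' */
>   enum filter_field field;    /* which URL component the rule applies to */
>   regex_t re;                 /* compiled regexp */
> };
> 
> /* Parse one --filter argument of the form [+|-][file|path|domain]:REGEXP
>    into a compiled rule.  Returns 0 on success, -1 on a malformed spec. */
> static int
> parse_filter (const char *spec, struct filter_rule *rule)
> {
>   if (*spec == '+')
>     rule->allow = 1;
>   else if (*spec == '-')
>     rule->allow = 0;
>   else
>     return -1;                /* the sign is mandatory in this sketch */
>   spec++;
> 
>   const char *colon = strchr (spec, ':');
>   if (!colon)
>     return -1;
> 
>   size_t len = (size_t) (colon - spec);
>   if (len == 4 && strncmp (spec, "file", 4) == 0)
>     rule->field = FIELD_FILE;
>   else if (len == 4 && strncmp (spec, "path", 4) == 0)
>     rule->field = FIELD_PATH;
>   else if (len == 6 && strncmp (spec, "domain", 6) == 0)
>     rule->field = FIELD_DOMAIN;
>   else
>     return -1;
> 
>   /* Everything after the colon is taken verbatim as the regexp, so no
>      character needs escaping at this level.  */
>   return regcomp (&rule->re, colon + 1, REG_EXTENDED | REG_NOSUB) == 0 ? 0 : -1;
> }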
> 
> Please notice that supporting multiple comma-separated regexps in a single
> --filter option:
> 
> --filter=[+|-][file|path|domain]:REGEXP1,REGEXP2,...
> 
> would significantly complicate both the implementation and the usage of
> --filter, as it would require escaping of the "," character. Also notice
> that the current filtering options like -A/-R are somewhat broken in this
> respect, as they do not allow the use of the "," character in filtering
> rules.
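> 
> For example, with comma-separated rules a comma inside a regexp interval
> quantifier would be indistinguishable from a rule separator (hypothetical
> command):
> 
> wget -r --filter=+file:'ab{1,3}\.gif$,\.png$' http://yoyodyne.com
> 
> Here the parser could not tell whether the first "," separates two regexps
> or belongs to the {1,3} quantifier, so literal commas would have to be
> escaped.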
> 
> We also have to reach consensus on the filtering algorithm. For instance,
> should we simply require that a URL pass all the filtering rules before it
> is downloaded (just like the current -A/-R behaviour), or should we instead
> adopt a short-circuit algorithm that applies the rules in the order in which
> they were given on the command line and immediately allows the download of a
> URL as soon as it matches the first "allow" rule? Should we also support
> Apache-like deny-from-all and allow-from-all policies? And what would be the
> best syntax for triggering these policies?
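> 
> A minimal sketch of the short-circuit variant, reusing the struct
> filter_rule from the parsing sketch above (again illustrative only):
> 
> /* Apply the rules in command-line order: the first rule whose regexp
>    matches the relevant URL component decides the outcome.  If no rule
>    matches, fall back to a default policy (allow-from-all here; a
>    deny-from-all policy would return 0 instead).  */
> static int
> url_allowed (const struct filter_rule *rules, size_t nrules,
>              const char *file, const char *path, const char *domain)
> {
>   for (size_t i = 0; i < nrules; i++)
>     {
>       const char *subject = rules[i].field == FIELD_FILE ? file
>                             : rules[i].field == FIELD_PATH ? path
>                             : domain;
>       if (regexec (&rules[i].re, subject, 0, NULL, 0) == 0)
>         return rules[i].allow;
>     }
>   return 1;
> }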
> 
> I am looking forward to reading your opinions on this topic.
> 
> 
> P.S.: The new --filter option would replace and extend the old -D, -I/-X
> and -A/-R options, which will be deprecated but still supported.
> 
> -- 
> Aequam memento rebus in arduis servare mentem...
> 
> Mauro Tortonesi                          http://www.tortonesi.com
> 
> University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
> GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
> Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
> Ferrara Linux User Group                 http://www.ferrara.linux.it
> 
> 
