hrvoje and i have recently been talking about adding regex support to wget. we are considering adding a new --filter option which, by supporting regular expressions, would allow more powerful ways of filtering the urls to download.

for instance, the new option could allow filtering on domain names, file names and url paths. in the following case --filter is used to prevent any download from the www-*.yoyodyne.com domains and to restrict downloads to .gif files:

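wget -r --filter='-domain:www-.*\.yoyodyne\.com' --filter='+file:\.gif$' http://yoyodyne.com

(notice that --filter interprets every given rule as a regex, which is why the wildcard is written as ".*" and the literal dots are escaped. the rules are quoted so that the shell leaves the backslashes and the "*" alone).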

i personally think the --filter option would be a great new feature for wget, and i have already started working on its implementation, but we still have a few open questions.

for instance, the syntax for --filter presented above is basically the following:

--filter=[+|-][file|path|domain]:REGEXP

is it consistent? is it flawed? is there a more convenient one?
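
to make the question concrete, here is a minimal sketch of how a single rule in the syntax above could be parsed. it is written in python only for brevity (wget itself is c), and the names FilterRule and parse_filter, as well as the use of python's regex flavour, are assumptions made for illustration, not part of any existing implementation:

```python
import re
from typing import NamedTuple, Pattern


class FilterRule(NamedTuple):
    allow: bool           # True for "+", False for "-"
    target: str           # "file", "path" or "domain"
    regex: Pattern[str]   # the compiled REGEXP part


# assumed grammar: one sign, one target keyword, a colon, then the regex.
# only the first colon is a separator, so the regex may contain ":" itself.
_RULE = re.compile(r"^([+-])(file|path|domain):(.*)$", re.DOTALL)


def parse_filter(spec: str) -> FilterRule:
    """parse one value of the proposed --filter option (hypothetical helper)."""
    m = _RULE.match(spec)
    if m is None:
        raise ValueError("malformed filter rule: %r" % spec)
    sign, target, pattern = m.groups()
    return FilterRule(sign == "+", target, re.compile(pattern))
```

since the target keyword is one of a fixed set and is always followed by a colon, the rule can be split without any escaping, whatever the regex contains.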

please notice that supporting multiple comma-separated regexps in a single --filter option:

--filter=[+|-][file|path|domain]:REGEXP1,REGEXP2,...

would significantly complicate the implementation and usage of --filter, as it would require escaping of the "," character. also notice that current filtering options like -A/R are somewhat broken, as they do not allow the usage of the "," char in filtering rules.
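
the following sketch shows the kind of splitting a comma-separated form would need. it assumes, purely for illustration, that "\," stands for a literal comma and that every other backslash is passed through to the regex engine untouched. split_rules is a hypothetical helper, again in python for brevity:

```python
from typing import List


def split_rules(value: str) -> List[str]:
    """split REGEXP1,REGEXP2,... on unescaped commas (assumed convention)."""
    parts: List[str] = []
    current: List[str] = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value) and value[i + 1] == ",":
            current.append(",")      # "\," becomes a literal comma
            i += 2
        elif ch == ",":
            parts.append("".join(current))   # unescaped comma ends a regex
            current = []
            i += 1
        else:
            current.append(ch)       # other backslashes are kept for the regex
            i += 1
    parts.append("".join(current))
    return parts
```

under this convention a quantifier like a{1,3} has to be typed as a{1\,3}, on top of whatever quoting the shell already requires, which is the usability cost mentioned above.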

we also have to reach consensus on the filtering algorithm. for instance, should we simply require that a url passes all the filtering rules to allow its download (just like the current -A/R behaviour), or should we instead adopt a short-circuit algorithm that applies the rules in the same order in which they were given on the command line and immediately allows the download of a url as soon as it passes the first "allow" match? should we also support apache-like deny-from-all and allow-from-all policies? and what would be the best syntax to trigger the usage of these policies?

i am looking forward to reading your opinions on this topic.
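
to compare the two candidates, here is a rough python sketch of both. the rule representation (allow, target, regex), the helper names and the way a url is split into domain, path and file are all assumptions. the short-circuit variant shown is one possible reading, in which the first matching rule decides and a default policy (allow-from-all or deny-from-all) applies when no rule matches:

```python
import posixpath
import re
from urllib.parse import urlsplit


def url_parts(url):
    """split a url into the three targets a rule can refer to (assumed split)."""
    u = urlsplit(url)
    directory, filename = posixpath.split(u.path)
    return {"domain": u.hostname or "", "path": directory, "file": filename}


def rule_matches(rule, parts):
    allow, target, regex = rule
    return regex.search(parts[target]) is not None


def passes_all(rules, url):
    """every "+" rule must match and no "-" rule may match."""
    parts = url_parts(url)
    return all(rule_matches(r, parts) == r[0] for r in rules)


def first_match(rules, url, default=True):
    """rules are tried in order; the first one that matches decides.
    default=True models allow-from-all, default=False deny-from-all."""
    parts = url_parts(url)
    for r in rules:
        if rule_matches(r, parts):
            return r[0]
    return default


rules = [
    (False, "domain", re.compile(r"www-.*\.yoyodyne\.com")),
    (True, "file", re.compile(r"\.gif$")),
]
```

with these two rules both algorithms accept http://yoyodyne.com/img/logo.gif and reject http://www-1.yoyodyne.com/img/logo.gif. they disagree on http://yoyodyne.com/index.html: passes_all rejects it because the "+file" rule does not match, while first_match falls through to the default policy, so the answer depends on whether allow-from-all or deny-from-all is in effect.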


P.S.: the new --filter option would replace and extend the old -D, -I/X
      and -A/R options, which would be deprecated but still supported.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it
