hrvoje and i have recently been talking about adding regex support to
wget. we are considering adding a new --filter option which, by
supporting regular expressions, would allow more powerful ways of
filtering the urls to download.
for instance, the new option could filter on domain names, file names
and url paths. in the following example, --filter is used to prevent
any download from the www-*.yoyodyne.com domains and to restrict
downloads to .gif files:
wget -r --filter='-domain:www-.*\.yoyodyne\.com' --filter='+file:\.gif$' http://yoyodyne.com
(notice that --filter interprets every given rule as a regex).
i personally think the --filter option would be a great new feature for
wget, and i have already started working on its implementation, but we
still have a few open questions.
for instance, the syntax for --filter presented above is basically the
following:
--filter=[+|-][file|path|domain]:REGEXP
is it consistent? is it flawed? is there a more convenient one?
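
to make the proposed syntax concrete, here is a minimal python sketch of how a single rule of the form [+|-][file|path|domain]:REGEXP might be parsed. it is only an illustration: FilterRule and parse_filter are hypothetical names, python's re module stands in for whatever regex library wget would actually use, and none of this is existing wget code.

```python
import re
from typing import NamedTuple, Pattern


class FilterRule(NamedTuple):
    """one parsed --filter rule (hypothetical structure, for illustration)."""
    allow: bool           # True for "+", False for "-"
    target: str           # "file", "path" or "domain"
    regex: Pattern[str]   # the compiled REGEXP part


# assumed grammar: [+|-][file|path|domain]:REGEXP
# only the first ":" is a separator, so REGEXP itself may contain ":".
_RULE_RE = re.compile(r"([+-])(file|path|domain):(.+)", re.DOTALL)


def parse_filter(spec: str) -> FilterRule:
    """parse one rule string, raising ValueError if it is malformed."""
    m = _RULE_RE.fullmatch(spec)
    if m is None:
        raise ValueError("malformed --filter rule: %r" % spec)
    sign, target, pattern = m.groups()
    return FilterRule(sign == "+", target, re.compile(pattern))


# example: the two rules from the command line shown earlier
deny_domain = parse_filter(r"-domain:www-.*\.yoyodyne\.com")
allow_gif = parse_filter(r"+file:\.gif$")
```

one consequence of this shape is that the sign and the target are always a fixed prefix, so the REGEXP part never needs any extra escaping beyond what the shell requires.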
please note that supporting multiple comma-separated regexps in a
single --filter option:
--filter=[+|-][file|path|domain]:REGEXP1,REGEXP2,...
would significantly complicate both the implementation and the usage of
--filter, as it would require escaping the "," character. also note that
the current filtering options like -A/R are somewhat broken in this
respect, as they do not allow a "," character in filtering rules.
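
the escaping problem can be illustrated with a short python sketch. a regex such as [0-9]{1,3}\.png$ legitimately contains a comma, so a naive split on "," cuts it in two; an escape-aware splitter works, but then users have to write "\," inside their regexes. split_rules is a hypothetical helper and the "\," convention is an assumption made for this example, not a decided syntax.

```python
from typing import List


def split_rules(value: str) -> List[str]:
    """split on unescaped commas; "\\," is turned into a literal comma.

    every other backslash sequence (e.g. "\\.") is kept untouched so that
    it still reaches the regex engine.
    """
    parts: List[str] = []
    current: List[str] = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            nxt = value[i + 1]
            if nxt == ",":
                current.append(",")        # escaped comma: part of the regex
            else:
                current.append(ch + nxt)   # some other escape: keep as is
            i += 2
        elif ch == ",":
            parts.append("".join(current))  # unescaped comma: rule separator
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


# a naive split breaks a regex that uses a {m,n} quantifier
naive = r"[0-9]{1,3}\.png$".split(",")

# with escaping, the user must write "\," inside the regex
escaped = split_rules(r"\.gif$,[0-9]{1\,3}\.png$")
```

with one regex per --filter option, neither the splitter nor the extra escaping layer is needed.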
we also have to reach consensus on the filtering algorithm. for
instance, should we simply require that a url pass all the filtering
rules before it is downloaded (just like the current -A/R behaviour), or
should we instead adopt a short-circuit algorithm that applies the rules
in the order in which they were given on the command line and
immediately allows the download of a url as soon as it matches an
"allow" rule? should we also support apache-like deny-from-all and
allow-from-all policies? and what would be the best syntax to select
these policies?
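
to compare the two candidate algorithms, here is a small python sketch. it is one possible reading of the alternatives, not a settled design: passes_all and first_match are hypothetical names, each rule is simplified to an (allow, regex) pair matched against the whole url rather than against a file, path or domain part, and the default_allow flag stands in for an apache-like allow-from-all / deny-from-all policy.

```python
import re
from typing import List, Tuple

Rule = Tuple[bool, str]  # (True for "+" / False for "-", regex source)


def passes_all(rules: List[Rule], url: str) -> bool:
    """the url must match every "+" rule and must not match any "-" rule."""
    for allow, pattern in rules:
        matched = re.search(pattern, url) is not None
        if matched != allow:
            return False
    return True


def first_match(rules: List[Rule], url: str, default_allow: bool) -> bool:
    """rules are tried in order; the first one whose regex matches decides.

    if no rule matches, the default policy applies.
    """
    for allow, pattern in rules:
        if re.search(pattern, url) is not None:
            return allow
    return default_allow


rules = [
    (False, r"www-.*\.yoyodyne\.com"),  # "-" rule
    (True, r"\.gif$"),                  # "+" rule
]

page = "http://yoyodyne.com/index.html"
# the page matches no rule at all, so the two algorithms disagree on it
# unless the default policy is "deny":
strict = passes_all(rules, page)                          # False
lenient = first_match(rules, page, default_allow=True)    # True
denied = first_match(rules, page, default_allow=False)    # False
```

the difference shows up on urls that no rule mentions: the all-rules variant rejects them whenever a "+" rule is present, while the ordered variant leaves the decision to the default policy, which is why the choice of policy syntax matters.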
i am looking forward to reading your opinions on this topic.
P.S.: the new --filter option would replace and extend the old -D, -I/X
and -A/R options, which would be deprecated but still supported.
--
Aequam memento rebus in arduis servare mentem...
Mauro Tortonesi http://www.tortonesi.com
University of Ferrara - Dept. of Eng. http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux http://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it