On Wednesday 03 July 2013, [email protected] wrote:
> Is there means of saying that I want _only_ pages of MIME-type "text/html",
> whatever the extension?

I guess you are talking about recursive retrieval.

You could use a two-pass method using some scripting.

1. Create a list of URLs together with their Content-Type information
2. Retrieve only the wanted URLs

Step 1 is something like this (this is really naive!):
wget -d --spider -r www.example.com 2>&1 | egrep -i '^Dequeuing|^Content-Type:' |
grep -A1 ^Deq | cut -d' ' -f2 | grep -B1 '^text/html' |
grep -v -e '^text/html' -e '^--$' >my_urls.txt

Step 2 would be wget -i my_urls.txt

I guess a little awk or Perl script would be more elegant.
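
A rough sketch of how the awk variant of step 1 might look. It assumes
that the debug output of wget prints "Dequeuing <URL> at depth N" and the
"Content-Type:" response header at the start of a line, which is the same
assumption the grep pipeline makes. The function name filter_html_urls is
made up for this example and is not part of wget. It is just as naive
about redirects as the pipeline.

```shell
#!/bin/sh
# Sketch only: reads the output of `wget -d --spider -r ...` on stdin and
# prints the URLs that were answered with Content-Type text/html.
# filter_html_urls is a hypothetical helper name, not a wget feature.
filter_html_urls() {
    awk '
        # Remember the URL wget most recently took off its queue.
        /^Dequeuing / { url = $2; next }

        # First text/html header seen for that URL: print the URL once.
        tolower($1) == "content-type:" && tolower($2) ~ /^text\/html/ && url != "" {
            print url
            url = ""   # avoid duplicates when the header shows up again
        }
    '
}
```

It would be used in place of the grep/cut chain, for example:
wget -d --spider -r www.example.com 2>&1 | filter_html_urls >my_urls.txt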

Regards, Tim
