On Wednesday 03 July 2013, [email protected] wrote:
> Is there means of saying that I want _only_ pages of MIME-type "text/html",
> whatever the extension?
I guess you are talking about recursive retrieval.

You could use a two-pass method with some scripting:

1. Create a list of URLs together with their Content-Type information.
2. Retrieve only the wanted URLs.

Step 1 is something like this (this is really naive!):

    wget -d --spider -r www.example.com 2>&1 |
      egrep -i '^Dequeuing|^Content-Type:' |
      grep -A1 ^Deq | cut -d' ' -f2 |
      grep -B1 '^text/html' |
      grep -v -e '^text/html' -e '^--$' >my_urls.txt

The last grep also drops the "--" group separators that grep -A1/-B1 insert,
so that only URLs end up in my_urls.txt.

Step 2 would be:

    wget -i my_urls.txt

I guess a little awk or perl script would be more elegant.

Regards, Tim
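P.S. A rough sketch of what the awk variant of step 1 might look like. It is untested against a real site and just as naive as the pipeline: it assumes that wget's debug output prints a line of the form "Dequeuing <URL> at depth <N>" before the response headers of that URL, and it lets the first Content-Type header seen after that line decide. The function name filter_html_urls is made up for this example.

```sh
#!/bin/sh
# Read wget debug output on stdin and print every URL whose first
# Content-Type header after its "Dequeuing" line is text/html.
# Assumption: the URL is the second space-separated field of the
# "Dequeuing ..." line, as in the grep/cut pipeline.
filter_html_urls() {
    awk '
        # Remember the URL that wget is about to fetch.
        /^Dequeuing / { url = $2; next }

        # The first Content-Type header after that line decides.
        url != "" && tolower($0) ~ /^content-type:/ {
            if (tolower($0) ~ /^content-type:[ \t]*text\/html/ && !seen[url]++)
                print url      # print each matching URL only once
            url = ""           # ignore further headers for this URL
        }
    '
}
```

With that function defined, step 1 would become

    wget -d --spider -r www.example.com 2>&1 | filter_html_urls >my_urls.txt

and step 2 stays the same. Note that -d only works if your wget was built with debug support.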
