Richard Baron Penman wrote:
> hello,
>
> When mirroring a website, how do I download just the HTML content
> (whether static, PHP, ASP, etc.) and ignore images, CSS, JS, and
> everything else? At first I thought of creating an accept list, but I
> can't rely on the file extension, because many HTML pages do not
> include one (e.g. http://en.wikipedia.org/wiki/Foo). Then I thought
> of a reject list, but there are so many different kinds of non-HTML
> content.
>
> Is there a way to do this with wget?
Not really... at some point we'd like to supply content-type-based
accept/reject options, but this will also tend to increase the amount
of traffic, as we'd have to send extra requests to determine the
content type. Perhaps a robust version of it would use a mixture of
heuristics (e.g., when a filename extension exists, make assumptions
about the content type)...

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
Maintainer of GNU Wget and GNU Teseq
http://micah.cowan.name/
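For what it's worth, the approach described above can be scripted around wget today. Here's a minimal Python sketch (function names and structure are my own, not anything in wget): guess the content type from the filename extension when one exists, and only fall back to an extra request per URL — the very traffic cost mentioned above — when the URL gives no hint, as with http://en.wikipedia.org/wiki/Foo.

```python
# Sketch of content-type-based filtering: extension heuristic first,
# extra HEAD request only as a fallback. Illustrative only.
from urllib.parse import urlparse
import mimetypes

HTML_TYPES = {"text/html", "application/xhtml+xml"}

def guess_from_extension(url):
    """Return True/False if the extension decides it, None if no hint."""
    path = urlparse(url).path
    ctype, _ = mimetypes.guess_type(path)
    if ctype is None:
        return None  # extensionless page such as /wiki/Foo
    return ctype in HTML_TYPES

def probe_content_type(url):
    """The costly fallback: one extra HEAD request per undecided URL."""
    import urllib.request
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get_content_type() in HTML_TYPES

def should_download(url):
    hint = guess_from_extension(url)
    return hint if hint is not None else probe_content_type(url)
```

A wrapper script could feed only the URLs for which should_download() returns True to wget, at the cost of one HEAD round-trip for every extensionless URL.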
