On Mon, Jun 29, 2009 at 8:08 AM, Micah Cowan <[email protected]> wrote:
> Richard Baron Penman wrote:
> > hello,
> >
> > When mirroring a website how do I just download HTML content (whether
> > static, PHP, ASP, etc) and ignore images, css, js, and everything else?
> > At first I thought of creating an accept list, but I can't rely on the file
> > extension because many HTML pages do not include an extension (eg
> > http://en.wikipedia.org/wiki/Foo)
> > Then I thought of a reject list, but there are so many different kinds of
> > non-HTML content.
> >
> > Is there a way to do this with wget?
>
> Not really... at some point we'd like to supply content-type-based
> accept/reject options, but this will also tend to increase the amount of
> traffic, as we'd have to send extra requests to determine the content
> type. Perhaps a robust version of it would use a mixture of heuristic
> (e.g., when a filename extension exists, make assumptions about the
> content-type)...
>
> --
> Micah J. Cowan
> Programmer, musician, typesetting enthusiast, gamer.
> Maintainer of GNU Wget and GNU Teseq
> http://micah.cowan.name/

Ah, OK. Yeah, I can't think of a clean way to do it either without those
extra requests.

As a workaround, would you recommend using something like this?

    --reject=".js,.css,.jpg,.png,.gif"

Richard
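
For what it's worth, the mixed heuristic Micah describes (trust the filename extension when there is one, and only send an extra request when there isn't) could be sketched outside wget roughly as below. This is only an illustration in Python, not something wget does or exposes: the function names and both extension lists are made up for the example, and the HEAD fallback assumes the server reports an honest Content-Type.

```python
import posixpath
from urllib.parse import urlsplit
from urllib.request import Request, urlopen

# Illustrative lists only (assumptions for this sketch, not taken from wget).
HTML_EXTENSIONS = {".html", ".htm", ".php", ".asp", ".aspx"}
SKIP_EXTENSIONS = {".js", ".css", ".jpg", ".png", ".gif"}


def guess_from_extension(url):
    """Return "html", "skip", or "unknown" using only the URL path."""
    path = urlsplit(url).path  # drops any ?query or #fragment
    ext = posixpath.splitext(path)[1].lower()
    if ext in HTML_EXTENSIONS:
        return "html"
    if ext in SKIP_EXTENSIONS:
        return "skip"
    return "unknown"  # e.g. /wiki/Foo has no extension at all


def is_html(url, timeout=10):
    """Decide whether to fetch url, sending a HEAD request only if needed."""
    guess = guess_from_extension(url)
    if guess != "unknown":
        return guess == "html"
    # The extension told us nothing, so pay for one extra request.
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get_content_type() == "text/html"
```

With this split, a URL such as http://en.wikipedia.org/wiki/Foo is the only kind that costs an extra request; anything ending in a listed extension is decided locally, which is the traffic saving the heuristic is after.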
