Richard Baron Penman wrote:
> hello,
>
> When mirroring a website, how do I download just the HTML content
> (whether static, PHP, ASP, etc.) and ignore images, CSS, JS, and
> everything else? At first I thought of creating an accept list, but I
> can't rely on the file extension, because many HTML pages do not
> include one (e.g. http://en.wikipedia.org/wiki/Foo). Then I thought
> of a reject list, but there are so many different kinds of non-HTML
> content.
>
> Is there a way to do this with wget?
Not really... at some point we'd like to supply content-type-based
accept/reject options, but this will also tend to increase the amount
of traffic, as we'd have to send extra requests to determine the
content type. Perhaps a robust version of it would use a mixture of
heuristics (e.g., when a filename extension exists, make assumptions
about the content type)...

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
Maintainer of GNU Wget and GNU Teseq
http://micah.cowan.name/
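For what it's worth, the approach described above can be scripted around wget today. Here's a minimal Python sketch (function names and structure are my own, not anything in wget): guess the content type from the filename extension when one exists, and only fall back to an extra request per URL — the very traffic cost mentioned above — when the URL gives no hint, as with http://en.wikipedia.org/wiki/Foo.

```python
# Sketch of content-type-based filtering: extension heuristic first,
# extra HEAD request only as a fallback. Illustrative only.
from urllib.parse import urlparse
import mimetypes

HTML_TYPES = {"text/html", "application/xhtml+xml"}

def guess_from_extension(url):
    """Return True/False if the extension decides it, None if no hint."""
    path = urlparse(url).path
    ctype, _ = mimetypes.guess_type(path)
    if ctype is None:
        return None  # extensionless page such as /wiki/Foo
    return ctype in HTML_TYPES

def probe_content_type(url):
    """The costly fallback: one extra HEAD request per undecided URL."""
    import urllib.request
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get_content_type() in HTML_TYPES

def should_download(url):
    hint = guess_from_extension(url)
    return hint if hint is not None else probe_content_type(url)
```

A wrapper script could feed only the URLs for which should_download() returns True to wget, at the cost of one HEAD round-trip for every extensionless URL.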
