Richard Baron Penman wrote:
> On Mon, Jun 29, 2009 at 8:08 AM, Micah Cowan <[email protected]> wrote:
>> Richard Baron Penman wrote:
>>> hello,
>>>
>>> When mirroring a website how do I just download HTML content (whether
>>> static, PHP, ASP, etc) and ignore images, css, js, and everything else?
>>> At first I thought of creating an accept list, but I can't rely on the file
>>> extension because many HTML pages do not include an extension (eg
>>> http://en.wikipedia.org/wiki/Foo)
>>> Then I thought of a reject list, but there are so many different kinds of
>>> non-HTML content.
>>>
>>> Is there a way to do this with wget?
>>
>> Not really... at some point we'd like to supply content-type-based
>> accept/reject options, but this will also tend to increase the amount of
>> traffic, as we'd have to send extra requests to determine the content
>> type. Perhaps a robust version of it would use a mixture of heuristics
>> (e.g., when a filename extension exists, make assumptions about the
>> content-type)...
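[The extension-based heuristic Micah describes above could be sketched roughly as follows. This is not part of wget; the function names `guess_content_type` and `looks_like_html` are made up for illustration. It guesses a content type from the URL's extension where one exists, and treats extensionless URLs (like the /wiki/Foo example) as candidates for HTML, which in a real implementation would be confirmed with an extra HEAD request.]

```python
import mimetypes
from urllib.parse import urlparse

def guess_content_type(url):
    """Heuristic step: if the URL path ends in a known extension,
    assume its content type without touching the network.
    Returns None when there is no extension to go on."""
    path = urlparse(url).path
    ctype, _encoding = mimetypes.guess_type(path)
    return ctype

def looks_like_html(url):
    """Accept a URL when the heuristic says text/html, or when it
    gives no answer -- extensionless pages (e.g. /wiki/Foo) are
    often HTML, and only a HEAD request could say for sure."""
    ctype = guess_content_type(url)
    return ctype is None or ctype == "text/html"
```

With this, `/wiki/Foo` and `/index.html` pass the filter, while `/style.css` and `/logo.png` are rejected without any extra request.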
> ah OK. Yeah I can't think of a clean way to do it either without those extra
> requests.
> As a workaround would you recommend using something like this then:
> --reject=".js,.css,.jpg,.png,.gif"?

That's probably about as good as you're gonna get. It's likely to miss a
few things, so you may need to adjust it to get it "just right".

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
Maintainer of GNU Wget and GNU Teseq
http://micah.cowan.name/
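[A full invocation using the workaround discussed above might look like the sketch below. The target URL and the exact reject list are illustrative; the list will likely need extending as gaps turn up, as Micah notes.]

```shell
# Mirror a site, skipping common non-HTML assets by file extension.
# --reject matches against the file name, so extensionless HTML pages
# (e.g. /wiki/Foo) are still fetched.
wget --mirror --no-parent \
     --reject=".js,.css,.jpg,.png,.gif" \
     http://example.com/
```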
