Richard Baron Penman wrote:
> On Mon, Jun 29, 2009 at 8:08 AM, Micah Cowan <[email protected]> wrote:
>> Richard Baron Penman wrote:
>>> hello,
>>>
>>> When mirroring a website how do I just download HTML content (whether
>>> static, PHP, ASP, etc) and ignore images, css, js, and everything else?
>>> At first I thought of creating an accept list, but I can't rely on the file
>>> extension because many HTML pages do not include an extension (eg
>>> http://en.wikipedia.org/wiki/Foo)
>>> Then I thought of a reject list, but there are so many different kinds of
>>> non-HTML content.
>>>
>>> Is there a way to do this with wget?
>>
>> Not really... at some point we'd like to supply content-type-based
>> accept/reject options, but this will also tend to increase the amount of
>> traffic, as we'd have to send extra requests to determine the content
>> type. Perhaps a robust version of it would use a mixture of heuristics
>> (e.g., when a filename extension exists, make assumptions about the
>> content-type)...
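[The extension-based heuristic Micah describes above could be sketched roughly as follows. This is not part of wget; the function names `guess_content_type` and `looks_like_html` are made up for illustration. It guesses a content type from the URL's extension where one exists, and treats extensionless URLs (like the /wiki/Foo example) as candidates for HTML, which in a real implementation would be confirmed with an extra HEAD request.]

```python
import mimetypes
from urllib.parse import urlparse

def guess_content_type(url):
    """Heuristic step: if the URL path ends in a known extension,
    assume its content type without touching the network.
    Returns None when there is no extension to go on."""
    path = urlparse(url).path
    ctype, _encoding = mimetypes.guess_type(path)
    return ctype

def looks_like_html(url):
    """Accept a URL when the heuristic says text/html, or when it
    gives no answer -- extensionless pages (e.g. /wiki/Foo) are
    often HTML, and only a HEAD request could say for sure."""
    ctype = guess_content_type(url)
    return ctype is None or ctype == "text/html"
```

With this, `/wiki/Foo` and `/index.html` pass the filter, while `/style.css` and `/logo.png` are rejected without any extra request.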
> ah OK. Yeah I can't think of a clean way to do it either without those extra
> requests.
> As a workaround would you recommend using something like this then:
> --reject=".js,.css,.jpg,.png,.gif"?

That's probably about as good as you're gonna get. It's likely to miss a
few things, so you may need to adjust it to get it "just right".

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
Maintainer of GNU Wget and GNU Teseq
http://micah.cowan.name/
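[A full invocation using the workaround discussed above might look like the sketch below. The target URL and the exact reject list are illustrative; the list will likely need extending as gaps turn up, as Micah notes.]

```shell
# Mirror a site, skipping common non-HTML assets by file extension.
# --reject matches against the file name, so extensionless HTML pages
# (e.g. /wiki/Foo) are still fetched.
wget --mirror --no-parent \
     --reject=".js,.css,.jpg,.png,.gif" \
     http://example.com/
```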
