Many sites use server-side compression when sending most of their data. Google is an example; these are the response headers from loading the homepage (F12 in Chrome):
alt-svc: quic=":443"; p="1"; ma=604800
alternate-protocol: 443:quic,p=1
cache-control: private, max-age=0
content-encoding: gzip
content-type: text/html; charset=UTF-8
date: Wed, 23 Sep 2015 02:20:03 GMT

See the 'content-encoding' header? Chrome shows the encoded size of the
HTML document as 53 kB. If I download it with

  wget --user-agent mozilla www.google.com

I get the uncompressed size, 150 kB. If I instead use

  wget --user-agent mozilla --header="accept-encoding: gzip" www.google.com

it downloads a file of 51 kB - much closer to what Chrome sees (the
difference might be user agent and cookie handling, or the download does
not work properly: if I zcat the file, it seems cut off; a quick check for
this is in the P.S. at the end of this mail).

So, now with -p I want to load all page elements (images, scripts, CSS,
etc.), and with -H I make sure to get the elements from other domains as
well. (-r is not the right tool for this, as far as I know.) BTW, with no
user agent Google blocks the download; you actually need a full, valid
agent string. I also set robots=off so wget doesn't fetch robots.txt
first (saves time). Example:

  wget --user-agent "Mozilla/5.0 (Windows NT x.y; WOW64; rv:10.0) Gecko/20100101 Firefox/10.0" -e robots=off -p -H "www.google.com"

This gives me a whole list of files with a total of 378 kB:

  .
  ./ssl.gstatic.com
  ./ssl.gstatic.com/gb
  ./ssl.gstatic.com/gb/images
  ./ssl.gstatic.com/gb/images/i2_2ec824b0.png
  ./ssl.gstatic.com/gb/images/a
  ./ssl.gstatic.com/gb/images/a/f5cdd88b65.png
  ./ssl.gstatic.com/gb/images/p1_8b13e09b.png
  ./ssl.gstatic.com/gb/images/p2_5972b4fd.png
  ./ssl.gstatic.com/gb/images/i1_1967ca6a.png
  ./www.google.com
  ./www.google.com/index.html
  ./www.google.com/images
  ./www.google.com/images/nav_logo231.png
  ./www.google.com/images/branding
  ./www.google.com/images/branding/googlelogo
  ./www.google.com/images/branding/googlelogo/2x
  ./www.google.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png
  ./www.google.com/images/branding/product
  ./www.google.com/images/branding/product/ico
  ./www.google.com/images/branding/product/ico/googleg_lodp.ico

Now the same, but downloading compressed whatever the server will compress:

  wget --user-agent "Mozilla/5.0 (Windows NT x.y; WOW64; rv:10.0) Gecko/20100101 Firefox/10.0" -e robots=off --header="accept-encoding: gzip" -p -H "www.google.com"

This still only gives me 52 kB, and a single file: index.html.

So, accept-encoding seems to work, but only for the main file?

On Tue, Sep 22, 2015 at 3:51 PM, Ángel González <[email protected]> wrote:

> On 22/09/15 19:57, andreas wpv wrote:
>
>> Unfortunately this only pulls the html files (because where I pull
>> them they are compressed), and not all the other scripts and
>> stylesheets and so on, even though at least a few of these are
>> compressed as well.
>>
> From wget's point of view, the "html" is a binary blob. It scans it
> looking for scripts/stylesheets and finds none.
>
>> Ideas, tips?
>>
> What about implementing gzip Accept-encoding into wget? :)
>
> Someone asked about doing it not so long ago, but it wasn't done.
>
> * That should actually save the pages uncompressed, but I assume you
> are more interested in downloading the contents compressed than in
> storing them compressed locally. Otherwise, you can download them with
> current wget and then run a script compressing everything.
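Going by the "binary blob" explanation, one workaround I could try:
decompress the main page locally and feed it back to wget so its HTML
parser can see the links. A rough, untested sketch - the base URL and the
way wget handles the different link types from a local file (-F/-i/-B)
are assumptions on my part:

  # decompress the gzipped answer so wget can actually parse it
  zcat index.html > index.plain.html

  # let wget read the local file as HTML (-F -i) and fetch every link it
  # finds, resolving relative URLs against the original site (-B); the
  # fetched requisites are not parsed again, so asking for gzip on them
  # should be harmless
  wget --user-agent "Mozilla/5.0 (Windows NT x.y; WOW64; rv:10.0) Gecko/20100101 Firefox/10.0" \
       -e robots=off --header="accept-encoding: gzip" \
       -F -i index.plain.html -B "http://www.google.com/"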
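And regarding your last point: if downloading compressed stays limited to
the main file, compressing everything locally after a normal -p -H run
would look roughly like this (assuming the two host directories from the
listing above; note that gzip renames each file to <name>.gz):

  wget --user-agent "Mozilla/5.0 (Windows NT x.y; WOW64; rv:10.0) Gecko/20100101 Firefox/10.0" \
       -e robots=off -p -H "www.google.com"

  # compress every downloaded file in place (each one becomes <name>.gz)
  find www.google.com ssl.gstatic.com -type f -exec gzip -9 {} +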
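P.S. About the "seems cut off" note above: a quick way to check whether
the gzipped download is actually truncated (assuming the single-URL
download saved it as index.html in the current directory):

  # a truncated stream makes zcat fail with "unexpected end of file"
  zcat index.html > /dev/null && echo "stream intact" || echo "truncated"

  # uncompressed size - should come out near the 150 kB of the plain download
  zcat index.html | wc -c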
