Many sites use server-side compression when sending most of their data. Google is an example; these are the response headers from loading the homepage (F12 in Chrome):
alt-svc: quic=":443"; p="1"; ma=604800
alternate-protocol: 443:quic,p=1
cache-control: private, max-age=0
content-encoding: gzip
content-type: text/html; charset=UTF-8
date: Wed, 23 Sep 2015 02:20:03 GMT

See the 'content-encoding' header? Chrome shows the encoded size of the
HTML document as 53 kB. If I download it with

  wget --user-agent mozilla www.google.com

I get the uncompressed size, 150 kB. If I instead use

  wget --user-agent mozilla --header="accept-encoding: gzip" www.google.com

it downloads a file of 51 kB - much closer to what Chrome sees (the
difference might be user agent and cookie handling, or the download does
not work properly: if I zcat the file, it seems cut off; a quick check for
this is in the P.S. at the end of this mail).

So, now with -p I want to load all page elements (images, scripts, CSS,
etc.), and with -H I make sure to get the elements from other domains as
well. (-r is not the right tool for this, as far as I know.) BTW, with no
user agent Google blocks the download; you actually need a full, valid
agent string. I also set robots=off so wget doesn't fetch robots.txt
first (saves time). Example:

  wget --user-agent "Mozilla/5.0 (Windows NT x.y; WOW64; rv:10.0) Gecko/20100101 Firefox/10.0" -e robots=off -p -H "www.google.com"

This gives me a whole list of files with a total of 378 kB:

  .
  ./ssl.gstatic.com
  ./ssl.gstatic.com/gb
  ./ssl.gstatic.com/gb/images
  ./ssl.gstatic.com/gb/images/i2_2ec824b0.png
  ./ssl.gstatic.com/gb/images/a
  ./ssl.gstatic.com/gb/images/a/f5cdd88b65.png
  ./ssl.gstatic.com/gb/images/p1_8b13e09b.png
  ./ssl.gstatic.com/gb/images/p2_5972b4fd.png
  ./ssl.gstatic.com/gb/images/i1_1967ca6a.png
  ./www.google.com
  ./www.google.com/index.html
  ./www.google.com/images
  ./www.google.com/images/nav_logo231.png
  ./www.google.com/images/branding
  ./www.google.com/images/branding/googlelogo
  ./www.google.com/images/branding/googlelogo/2x
  ./www.google.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png
  ./www.google.com/images/branding/product
  ./www.google.com/images/branding/product/ico
  ./www.google.com/images/branding/product/ico/googleg_lodp.ico

Now the same, but downloading compressed whatever the server will compress:

  wget --user-agent "Mozilla/5.0 (Windows NT x.y; WOW64; rv:10.0) Gecko/20100101 Firefox/10.0" -e robots=off --header="accept-encoding: gzip" -p -H "www.google.com"

This still only gives me 52 kB, and a single file: index.html.

So, accept-encoding seems to work, but only for the main file?

On Tue, Sep 22, 2015 at 3:51 PM, Ángel González <[email protected]> wrote:

> On 22/09/15 19:57, andreas wpv wrote:
>
>> Unfortunately this only pulls the html files (because where I pull
>> them they are compressed), and not all the other scripts and
>> stylesheets and so on, even though at least a few of these are
>> compressed as well.
>>
> From wget's point of view, the "html" is a binary blob. It scans it
> looking for scripts/stylesheets and finds none.
>
>> Ideas, tips?
>>
> What about implementing gzip Accept-encoding into wget? :)
>
> Someone asked about doing it not so long ago, but it wasn't done.
>
> * That should actually save the pages uncompressed, but I assume you
> are more interested in downloading the contents compressed than in
> storing them compressed locally. Otherwise, you can download them with
> current wget and then run a script compressing everything.
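Going by the "binary blob" explanation, one workaround I could try:
decompress the main page locally and feed it back to wget so its HTML
parser can see the links. A rough, untested sketch - the base URL and the
way wget handles the different link types from a local file (-F/-i/-B)
are assumptions on my part:

  # decompress the gzipped answer so wget can actually parse it
  zcat index.html > index.plain.html

  # let wget read the local file as HTML (-F -i) and fetch every link it
  # finds, resolving relative URLs against the original site (-B); the
  # fetched requisites are not parsed again, so asking for gzip on them
  # should be harmless
  wget --user-agent "Mozilla/5.0 (Windows NT x.y; WOW64; rv:10.0) Gecko/20100101 Firefox/10.0" \
       -e robots=off --header="accept-encoding: gzip" \
       -F -i index.plain.html -B "http://www.google.com/"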
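And regarding your last point: if downloading compressed stays limited to
the main file, compressing everything locally after a normal -p -H run
would look roughly like this (assuming the two host directories from the
listing above; note that gzip renames each file to <name>.gz):

  wget --user-agent "Mozilla/5.0 (Windows NT x.y; WOW64; rv:10.0) Gecko/20100101 Firefox/10.0" \
       -e robots=off -p -H "www.google.com"

  # compress every downloaded file in place (each one becomes <name>.gz)
  find www.google.com ssl.gstatic.com -type f -exec gzip -9 {} +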
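P.S. About the "seems cut off" note above: a quick way to check whether
the gzipped download is actually truncated (assuming the single-URL
download saved it as index.html in the current directory):

  # a truncated stream makes zcat fail with "unexpected end of file"
  zcat index.html > /dev/null && echo "stream intact" || echo "truncated"

  # uncompressed size - should come out near the 150 kB of the plain download
  zcat index.html | wc -c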
