Re: [gentoo-user] [OT] Differences between wget and browser file retrieval?

2021-01-15 Thread Walter Dnes
On Fri, Jan 15, 2021 at 02:40:51AM -0500, Philip Webb wrote
>  
> Here in Toronto, I get the same result as Walter via his URL
> & similar results from the  2  longer versions above,
> except that the escaped version gives "ERROR 403: Forbidden".

  I get "ERROR 403: Forbidden" when downloading a non-existent file,
e.g. when I make a typo, or when the government site is late updating
and they haven't posted the file by the time I request it.
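
A minimal sketch of how a script could tell the two failure modes in this thread apart, assuming GNU wget, whose manual documents exit status 8 for "server issued an error response" (which covers a 403). The helper name classify_download and the file name report.pdf are hypothetical and do not come from the thread.

```shell
#!/bin/bash
# Hypothetical helper (not from the thread): classify the outcome of a
# wget run from its exit status and the file it was supposed to write.
classify_download() {
    local status=$1 file=$2
    if [ "$status" -eq 8 ]; then
        # GNU wget exits with 8 when the server sent an error response
        # such as 403 or 404.
        echo "server-error"
    elif [ "$status" -ne 0 ]; then
        # Any other non-zero status: network, TLS, I/O, usage errors, etc.
        echo "failed"
    elif [ ! -s "$file" ]; then
        # Exit status 0 but the file is missing or zero bytes long:
        # the "200 OK, Length: 0" case.
        echo "empty"
    else
        echo "ok"
    fi
}

# Usage (needs network access):
#   wget -q -O report.pdf "$url"; classify_download $? report.pdf
```

The size test matters because a zero-byte "200 OK" reply still leaves wget with exit status 0, so the exit status alone would not catch it.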

-- 
Walter Dnes 
I don't run "desktop environments"; I run useful applications



Re: [gentoo-user] [OT] Differences between wget and browser file retrieval?

2021-01-15 Thread Walter Dnes
On Thu, Jan 14, 2021 at 11:00:38PM +0100, David Haller wrote

> So, try:
> 
> wget -S --no-check-certificate -U 'Mozilla/5.0 ...' \
> https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf

  No luck.  For DNS, I use my ISP's servers (Teksavvy) with fallback to
Google 8.8.8.8.


[i3][waltdnes][/dev/shm]  wget -S --no-check-certificate \
  -U 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0' \
  https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
--2021-01-15 02:15:30--  https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
Resolving files.ontario.ca... 13.33.160.117, 13.33.160.123, 13.33.160.45, ...
Connecting to files.ontario.ca|13.33.160.117|:443... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  Content-Type: application/pdf
  Content-Length: 0
  Connection: keep-alive
  Date: Thu, 14 Jan 2021 15:15:50 GMT
  Last-Modified: Thu, 14 Jan 2021 15:15:50 GMT
  ETag: "d41d8cd98f00b204e9800998ecf8427e"
  x-amz-meta-ctime: 1610637349
  x-amz-meta-mode: 33188
  x-amz-meta-gid: 500
  x-amz-meta-uid: 500
  x-amz-meta-mtime: 1610637349
  Accept-Ranges: bytes
  Server: AmazonS3
  X-Cache: Hit from cloudfront
  Via: 1.1 47dbad48e25df8c5ccf2822e46c2aaa6.cloudfront.net (CloudFront)
  X-Amz-Cf-Pop: YTO50-C3
  X-Amz-Cf-Id: ARgHfF6QMVfUtkxqkr0AL5ljxIfE7Yd5xPmA4eDMx46NdPXOwIftnQ==
  Age: 57573
Length: 0 [application/pdf]
Saving to: 'moh-covid-19-report-en-2021-01-14.pdf'

moh-covid-19-report [ <=> ]   0  --.-KB/s    in 0s

2021-01-15 02:15:30 (0.00 B/s) - 'moh-covid-19-report-en-2021-01-14.pdf' saved [0/0]
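
One possible reading of the headers above, offered as an assumption rather than a diagnosis: for simple (non-multipart) uploads an S3 ETag is normally the MD5 of the object, and the ETag shown is the MD5 of zero bytes. Together with "X-Cache: Hit from cloudfront" and "Age: 57573", that would be consistent with an edge cache still serving an empty object that was stored at about 15:15 GMT on the 14th. The sketch below only checks the arithmetic; it needs md5sum from coreutils.

```shell
#!/bin/bash
# ETag and Age values copied from the response headers quoted above.
etag='d41d8cd98f00b204e9800998ecf8427e'
age=57573

# MD5 of zero bytes of input; md5sum prints "<hash>  -", keep the hash only.
empty_md5=$(printf '' | md5sum | cut -d' ' -f1)

if [ "$etag" = "$empty_md5" ]; then
    echo "ETag matches the MD5 of an empty object"
else
    echo "ETag does not match the MD5 of an empty object"
fi

# Age is in seconds; show it as whole hours and leftover minutes.
echo "cached for $((age / 3600))h $((age % 3600 / 60))m"
```

If that reading is right, a request that reaches a different edge location, or one made after the cached copy expires, could return the full PDF with no change on the client side.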



> BTW: you know that you can let date format that URL? e.g.:
> 
> wget -S --no-check-certificate -U 'Mozilla/5.0 ...' \
>   "$(date '+https://files.ontario.ca/moh-covid-19-report-en-%Y-%m-%d.pdf')"

  Nice, but civil servants get stat holidays off.  I downloaded Dec 25th
and 26th PDFs on the 26th.  Monday Dec 28th was a lieu day for Boxing
day, so I downloaded the 28th and 29th PDFs on the 29th.  And of course
Jan 1st and 2nd PDFs on Jan 2nd.  That's why I can't automate the date.
I have a script "getone"...

[i3][waltdnes][~/covid] cat getone 
#!/bin/bash
wget https://files.ontario.ca/moh-covid-19-report-en-2021-01-${1}.pdf

  On the 14th it was invoked as "../getone 14" (called from the working
directory, one level below the main "covid" directory).  I tweak the
script once a month to match year+month.  In a worst-case scenario, I
can go to
https://covid-19.ontario.ca/covid-19-epidemiologic-summaries-public-health-ontario#daily
to manually retrieve a daily PDF.  Note that on this page, they list
the date that the report is up to.  The report issued 10:15 AM on the
14th shows up in the listing as "COVID-19 in Ontario: January 13, 2021".
That's because it contains data up to the 13th.
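
A minimal sketch of a middle ground between the fixed "getone" script and fully automatic dates, assuming GNU date (for the -d option). The function name report_url is hypothetical. Passing full dates on the command line would avoid the monthly edit and still allow catching up after a stat holiday.

```shell
#!/bin/bash
# Hypothetical variant of "getone": build the report URL from any date
# that GNU date understands, defaulting to today.
report_url() {
    local when=${1:-today}
    date -d "$when" '+https://files.ontario.ca/moh-covid-19-report-en-%Y-%m-%d.pdf'
}

# Print the URL for every date given on the command line, e.g.
#   ./getone 2021-01-01 2021-01-02
# To download instead of print, use: wget "$(report_url "$day")"
for day in "$@"; do
    report_url "$day"
done
```

GNU date also accepts relative forms such as "yesterday", so "./getone yesterday today" would cover a two-day catch-up.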

-- 
Walter Dnes 
I don't run "desktop environments"; I run useful applications



Re: [gentoo-user] [OT] Differences between wget and browser file retrieval?

2021-01-14 Thread Philip Webb
210114 David Haller wrote:
> On Thu, 14 Jan 2021, Walter Dnes wrote:
>> I download daily a PDF.  Today, the command ...
>>  wget https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
>> returns a zero-byte file.  *BUT*, sticking the URL into the URL bar
>> of Pale Moon and Google Chrome brings up the PDF file just fine.
>> Is "wget" being blocked ?
> I could download that file just fine just now[1].
> Try running 'wget' with the '-S' option.
> Oh and :
>> WARNING: cannot verify files.ontario.ca's certificate, issued by
> So, try:
>   wget -S --no-check-certificate -U 'Mozilla/5.0 ...' \
>https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
> BTW: you know that you can let date format that URL? e.g.:
>   wget -S --no-check-certificate -U 'Mozilla/5.0 ...' \
>"$(date '+https://files.ontario.ca/moh-covid-19-report-en-%Y-%m-%d.pdf')"
 
Here in Toronto, I get the same result as Walter via his URL
& similar results from the  2  longer versions above,
except that the escaped version gives "ERROR 403: Forbidden".

When I drop Walter's URL into the address bar of Firefox, no problem :
a  1.75 MB  PDF which appears to have all the info.

It looks as if the site is refusing 'wget' requests from Ontario,
but allowing them from eg Germany (!).

What Walter is doing is well worthwhile.  Press reports are very shallow
& the Ontario government doesn't appear to have any clear idea
just where & how the virus is being spread between humans.  HTH.

-- 
,,
SUPPORT ___//___,   Philip Webb
ELECTRIC   /] [] [] [] [] []|   Cities Centre, University of Toronto
TRANSIT`-O--O---'   purslowatcadotinterdotnet




Re: [gentoo-user] [OT] Differences between wget and browser file retrieval?

2021-01-14 Thread David Haller
Hello,

On Thu, 14 Jan 2021, Walter Dnes wrote:
>  I'm bored, so I do a regular daily report at the DSL Reports "CanChat"
>sub-forum, on the Covid-19 case counts for Ontario, using provincial
>data.  I download 2 files daily as source data.  One of them is a PDF
>file, which is run through "pdftotext" and then parsed by a bash script
>(don't ask).  Today, the command...
>
>  wget https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
>
>...returns a zero-byte file.  *BUT*, sticking the URL into the URL bar
>of Pale Moon and Google Chrome (and I assume Firefox/etc) brings up the
>PDF file just fine.  Is "wget" being blocked?
[..]
>  I've tried setting --user-agent= with my browser's string as shown by
>https://www.whatismybrowser.com/detect/what-is-my-user-agent  but no
>luck.  Is there some way to get around this?  I have not updated this
>past week, so I don't think the problem is at my end.

I could download that file just fine just now[1]. Try running 'wget'
with the '-S' option. Oh and:

[..]
WARNING: cannot verify files.ontario.ca's certificate, issued by
[..]

If you sent stderr to /dev/null ...

So, try:

wget -S --no-check-certificate -U 'Mozilla/5.0 ...' \
https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf

BTW: you know that you can let date format that URL? e.g.:

wget -S --no-check-certificate -U 'Mozilla/5.0 ...' \
  "$(date '+https://files.ontario.ca/moh-covid-19-report-en-%Y-%m-%d.pdf')"

There just are no unescaped '%' allowed besides the format strings for
the date/time. So if a URL contains one, you need to escape those
with another '%', as in e.g.
$(date '+foo%%20bar-%Y-%m-%d.pdf')
^^ this fella

In your case, the URL is clean ;)
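
A small sketch of the escaping rule above, assuming GNU date so that -d can pin the date and make the output repeatable. The function names and the "foo%20bar" file name are purely illustrative.

```shell
#!/bin/bash
# "%%" in a date format yields one literal "%", so the URL-escaped
# space "%20" has to be written as "%%20" inside the format string.
escaped_name() {
    date -d "$1" '+foo%%20bar-%Y-%m-%d.pdf'
}

# Alternative: keep the literal text out of the format string entirely,
# so nothing in it needs escaping.
plain_name() {
    echo "foo%20bar-$(date -d "$1" '+%Y-%m-%d').pdf"
}

escaped_name 2021-01-14   # prints foo%20bar-2021-01-14.pdf
plain_name 2021-01-14     # prints foo%20bar-2021-01-14.pdf
```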

HTH,
-dnh

[1] $ TZ=America/Toronto date
Thu Jan 14 16:50:15 EST 2021

-- 
"Airplane travel is nature's way of making you look like your passport
photo." -- Al Gore



Re: [gentoo-user] [OT] Differences between wget and browser file retrieval?

2021-01-14 Thread Andreas Fink
On Thu, 14 Jan 2021 16:10:09 -0500
Jack  wrote:

> On 2021.01.14 15:49, Walter Dnes wrote:
> >   I'm bored, so I do a regular daily report at the DSL Reports
> > "CanChat" sub-forum, on the Covid-19 case counts for Ontario, using
> > provincial data.  I download 2 files daily as source data.  One of
> > them is a PDF file, which is run through "pdftotext" and then parsed
> > by a bash script (don't ask).  Today, the command...
> >
> >   wget https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
> >
> > ...returns a zero-byte file.  *BUT*, sticking the URL into the URL bar
> > of Pale Moon and Google Chrome (and I assume Firefox/etc) brings up
> > the PDF file just fine.  Is "wget" being blocked?  I have to do extra
> > steps to get from the browser-invoked PDF to get the PDF file saved to
> > the standard work area where my script expects it to be, so it can
> > work its magic and parse out the daily breakdown by PHU (Public Health
> > Unit).
> > BTW, today's posts requiring the PDF file are...
> > https://www.dslreports.com/forum/r33002718-
> > https://www.dslreports.com/forum/r33002752-
> >
> >   I've tried setting --user-agent= with my browser's string as shown
> > by https://www.whatismybrowser.com/detect/what-is-my-user-agent  but no
> > luck.  Is there some way to get around this?  I have not updated this
> > past week, so I don't think the problem is at my end.
>
> I just copy/pasted that wget command into my terminal, and it got me a
> 1.7M PDF doc.  I'm in the US, but I have no idea if location/IP is an
> issue or not.
>
> Jack
>

I could download the file too with the wget command that you posted. If
you still have trouble, you could try using curl and pretend that
you're a firefox:
curl 'https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' \
  -H 'Accept-Language: en,de;q=0.7,en-US;q=0.3' --compressed -H 'DNT: 1' \
  -H 'Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1' \
  -H 'Pragma: no-cache' -H 'Cache-Control: no-cache' \
  > moh-covid-19-report-en-2021-01-14.pdf

Andreas



Re: [gentoo-user] [OT] Differences between wget and browser file retrieval?

2021-01-14 Thread Jack

On 2021.01.14 15:49, Walter Dnes wrote:
>   I'm bored, so I do a regular daily report at the DSL Reports "CanChat"
> sub-forum, on the Covid-19 case counts for Ontario, using provincial
> data.  I download 2 files daily as source data.  One of them is a PDF
> file, which is run through "pdftotext" and then parsed by a bash script
> (don't ask).  Today, the command...
>
>   wget https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
>
> ...returns a zero-byte file.  *BUT*, sticking the URL into the URL bar
> of Pale Moon and Google Chrome (and I assume Firefox/etc) brings up the
> PDF file just fine.  Is "wget" being blocked?  I have to do extra steps
> to get from the browser-invoked PDF to get the PDF file saved to the
> standard work area where my script expects it to be, so it can work its
> magic and parse out the daily breakdown by PHU (Public Health Unit).
> BTW, today's posts requiring the PDF file are...
> https://www.dslreports.com/forum/r33002718-
> https://www.dslreports.com/forum/r33002752-
>
>   I've tried setting --user-agent= with my browser's string as shown by
> https://www.whatismybrowser.com/detect/what-is-my-user-agent  but no
> luck.  Is there some way to get around this?  I have not updated this
> past week, so I don't think the problem is at my end.


I just copy/pasted that wget command into my terminal, and it got me a  
1.7M PDF doc.  I'm in the US, but I have no idea if location/IP is an  
issue or not.


Jack



[gentoo-user] [OT] Differences between wget and browser file retrieval?

2021-01-14 Thread Walter Dnes
  I'm bored, so I do a regular daily report at the DSL Reports "CanChat"
sub-forum, on the Covid-19 case counts for Ontario, using provincial
data.  I download 2 files daily as source data.  One of them is a PDF
file, which is run through "pdftotext" and then parsed by a bash script
(don't ask).  Today, the command...

  wget https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf

...returns a zero-byte file.  *BUT*, sticking the URL into the URL bar
of Pale Moon and Google Chrome (and I assume Firefox/etc) brings up the
PDF file just fine.  Is "wget" being blocked?  I have to do extra steps
to get from the browser-invoked PDF to get the PDF file saved to the
standard work area where my script expects it to be, so it can work its
magic and parse out the daily breakdown by PHU (Public Health Unit).
BTW, today's posts requiring the PDF file are...
https://www.dslreports.com/forum/r33002718-
https://www.dslreports.com/forum/r33002752-

  I've tried setting --user-agent= with my browser's string as shown by
https://www.whatismybrowser.com/detect/what-is-my-user-agent  but no
luck.  Is there some way to get around this?  I have not updated this
past week, so I don't think the problem is at my end.

-- 
Walter Dnes 
I don't run "desktop environments"; I run useful applications