Re: [gentoo-user] [OT] Differences between wget and browser file retrieval?
On Fri, Jan 15, 2021 at 02:40:51AM -0500, Philip Webb wrote
>
> Here in Toronto, I get the same result as Walter via his URL
> & similar results from the 2 longer versions above,
> except that the escaped version gives "ERROR 403: Forbidden".

I get "ERROR 403: Forbidden" when downloading a non-existent file,
e.g. when I make a typo, or when the government site is late updating
and they haven't posted the file by the time I request it.

-- 
Walter Dnes
I don't run "desktop environments"; I run useful applications
Re: [gentoo-user] [OT] Differences between wget and browser file retrieval?
On Thu, Jan 14, 2021 at 11:00:38PM +0100, David Haller wrote

> So, try:
>
> wget -S --no-check-certificate -U 'Mozilla/5.0 ...' \
>     https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf

No luck. For DNS, I use my ISP's servers (Teksavvy) with fallback to
Google 8.8.8.8.

[i3][waltdnes][/dev/shm] wget -S --no-check-certificate -U 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0' https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
--2021-01-15 02:15:30--  https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
Resolving files.ontario.ca... 13.33.160.117, 13.33.160.123, 13.33.160.45, ...
Connecting to files.ontario.ca|13.33.160.117|:443... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Content-Type: application/pdf
  Content-Length: 0
  Connection: keep-alive
  Date: Thu, 14 Jan 2021 15:15:50 GMT
  Last-Modified: Thu, 14 Jan 2021 15:15:50 GMT
  ETag: "d41d8cd98f00b204e9800998ecf8427e"
  x-amz-meta-ctime: 1610637349
  x-amz-meta-mode: 33188
  x-amz-meta-gid: 500
  x-amz-meta-uid: 500
  x-amz-meta-mtime: 1610637349
  Accept-Ranges: bytes
  Server: AmazonS3
  X-Cache: Hit from cloudfront
  Via: 1.1 47dbad48e25df8c5ccf2822e46c2aaa6.cloudfront.net (CloudFront)
  X-Amz-Cf-Pop: YTO50-C3
  X-Amz-Cf-Id: ARgHfF6QMVfUtkxqkr0AL5ljxIfE7Yd5xPmA4eDMx46NdPXOwIftnQ==
  Age: 57573
Length: 0 [application/pdf]
Saving to: 'moh-covid-19-report-en-2021-01-14.pdf'

moh-covid-19-report     [ <=>                ]       0  --.-KB/s    in 0s

2021-01-15 02:15:30 (0.00 B/s) - 'moh-covid-19-report-en-2021-01-14.pdf' saved [0/0]

> BTW: you know that you can let date format that URL? e.g.:
>
> wget -S --no-check-certificate -U 'Mozilla/5.0 ...' \
>     "$(date '+https://files.ontario.ca/moh-covid-19-report-en-%Y-%m-%d.pdf')"

Nice, but civil servants get stat holidays off. I downloaded Dec 25th
and 26th PDFs on the 26th. Monday Dec 28th was a lieu day for Boxing
day, so I downloaded the 28th and 29th PDFs on the 29th. And of course
Jan 1st and 2nd PDFs on Jan 2nd. That's why I can't automate the date.
I have a script "getone"...

[i3][waltdnes][~/covid] cat getone
#!/bin/bash
wget https://files.ontario.ca/moh-covid-19-report-en-2021-01-${1}.pdf

On the 14th it was invoked as "../getone 14" (called from the working
directory, one level below the main "covid" directory). I tweak the
script once a month to match year+month.

In a worst-case scenario, I can go to
https://covid-19.ontario.ca/covid-19-epidemiologic-summaries-public-health-ontario#daily
to manually retrieve a daily PDF. Note that on this page, they list the
date that the report is up to. The report issued 10:15 AM on the 14th
shows up in the listing as "COVID-19 in Ontario: January 13, 2021".
That's because it contains data up to the 13th.

-- 
Walter Dnes
I don't run "desktop environments"; I run useful applications
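A possible middle ground between David's fully automatic date and the monthly edit of "getone", offered only as a sketch: let the script take an optional date and default to today, so the holiday catch-up days can still be fetched by naming them. The helper name report_url is hypothetical, and the -d option assumes GNU date.

```shell
#!/bin/bash
# Sketch only: report_url is a hypothetical helper, not part of the original getone.
# Assumes GNU date (the -d option is not POSIX).

# Build the report URL for a given day (YYYY-MM-DD); defaults to today.
report_url() {
    local day="${1:-today}"
    date -d "$day" '+https://files.ontario.ca/moh-covid-19-report-en-%Y-%m-%d.pdf'
}

# Typical use (left commented out so that sourcing this file fetches nothing):
#   wget "$(report_url)"              # today's report
#   wget "$(report_url 2020-12-25)"   # catch up after a holiday
```

With this shape the year and month no longer need to be edited by hand, while the explicit argument keeps the "download two days at once after a stat holiday" routine possible.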
Re: [gentoo-user] [OT] Differences between wget and browser file retrieval?
210114 David Haller wrote:
> On Thu, 14 Jan 2021, Walter Dnes wrote:
>> I download daily a PDF. Today, the command ...
>> wget https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
>> returns a zero-byte file. *BUT*, sticking the URL into the URL bar
>> of Pale Moon and Google Chrome brings up the PDF file just fine.
>> Is "wget" being blocked ?
> I could download that file just fine just now[1].
> Try running 'wget' with the '-S' option.
> Oh and :
>> WARNING: cannot verify files.ontario.ca's certificate, issued by
> So, try:
> wget -S --no-check-certificate -U 'Mozilla/5.0 ...' \
> https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
> BTW: you know that you can let date format that URL? e.g.:
> wget -S --no-check-certificate -U 'Mozilla/5.0 ...' \
> "$(date '+https://files.ontario.ca/moh-covid-19-report-en-%Y-%m-%d.pdf')"

Here in Toronto, I get the same result as Walter via his URL
& similar results from the 2 longer versions above,
except that the escaped version gives "ERROR 403: Forbidden".
When I drop Walter's URL into the address bar of Firefox,
no problem : a 1.75 MB PDF which appears to have all the info.
It looks as if the site is refusing 'wget' requests from Ontario,
but allowing them from eg Germany (!).

What Walter is doing is well worthwhile.
Press reports are very shallow & the Ontario government
doesn't appear to have any clear idea
just where & how the virus is being spread between humans.

HTH.

-- 
,, SUPPORT ___//___, Philip Webb
ELECTRIC /] [] [] [] [] []| Cities Centre, University of Toronto
TRANSIT`-O--O---' purslowatcadotinterdotnet
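One way to probe the "refused from Ontario" idea is to read the response headers rather than guess. In the "wget -S" output Walter posted in this thread, the ETag d41d8cd98f00b204e9800998ecf8427e is the MD5 digest of zero bytes, the reply is marked "X-Cache: Hit from cloudfront", and "Age: 57573" (about 16 hours) points back to roughly the time in the Date header. One possible reading, not confirmed anywhere in the thread, is that the nearby CDN edge was serving a cached empty object instead of blocking wget. The sketch below flags that header pattern; cached_empty_object is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch only: cached_empty_object is a hypothetical helper name.
# Reads HTTP response headers (e.g. saved "wget -S" output) on stdin and
# succeeds when they describe an empty object served from a CloudFront cache.

EMPTY_MD5='d41d8cd98f00b204e9800998ecf8427e'   # MD5 digest of zero bytes

cached_empty_object() {
    local headers
    headers="$(cat)"
    # All three header patterns must be present for the check to succeed.
    grep -qi '^[[:space:]]*Content-Length:[[:space:]]*0[[:space:]]*$' <<<"$headers" &&
    grep -qi "ETag:.*$EMPTY_MD5" <<<"$headers" &&
    grep -qi 'X-Cache:.*Hit from cloudfront' <<<"$headers"
}

# Typical use (commented out so that sourcing this file fetches nothing):
#   wget -S -O /dev/null "$url" 2>&1 | cached_empty_object && echo "stale empty copy?"
```

If the check succeeds, retrying later or from another network is probably more useful than changing the user agent, though that is an inference from the headers and not something the site documents.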
Re: [gentoo-user] [OT] Differences between wget and browser file retrieval?
Hello,

On Thu, 14 Jan 2021, Walter Dnes wrote:
> I'm bored, so I do a regular daily report at the DSL Reports "CanChat"
> sub-forum, on the Covid-19 case counts for Ontario, using provincial
> data. I download 2 files daily as source data. One of them is a PDF
> file, which is run through "pdftotext" and then parsed by a bash script
> (don't ask). Today, the command...
>
> wget https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
>
> ...returns a zero-byte file. *BUT*, sticking the URL into the URL bar
> of Pale Moon and Google Chrome (and I assume Firefox/etc) brings up the
> PDF file just fine. Is "wget" being blocked?
[..]
> I've tried setting --user-agent= with my browser's string as shown by
> https://www.whatismybrowser.com/detect/what-is-my-user-agent but no
> luck. Is there some way to get around this? I have not updated this
> past week, so I don't think the problem is at my end.

I could download that file just fine just now[1].

Try running 'wget' with the '-S' option. Oh and:

[..]
WARNING: cannot verify files.ontario.ca's certificate, issued by
[..]

If you sent stderr to /dev/null ...

So, try:

wget -S --no-check-certificate -U 'Mozilla/5.0 ...' \
    https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf

BTW: you know that you can let date format that URL? e.g.:

wget -S --no-check-certificate -U 'Mozilla/5.0 ...' \
    "$(date '+https://files.ontario.ca/moh-covid-19-report-en-%Y-%m-%d.pdf')"

There just are no unescaped '%' allowed besides the format strings
for the date/time. So if a URL contains one, you need to escape
it with another '%', as in e.g.

$(date '+foo%%20bar-%Y-%m-%d.pdf')
            ^^ this fella

In your case, the URL is clean ;)

HTH,
-dnh

[1] $ TZ=America/Toronto date
    Thu Jan 14 16:50:15 EST 2021

-- 
"Airplane travel is nature's way of making you look like your
passport photo." -- Al Gore
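To make the escaping rule concrete, here is David's made-up file name run through date with the day pinned, so the result does not depend on when it is run. This is only an illustration; the -d option assumes GNU date.

```shell
#!/bin/bash
# Sketch only; assumes GNU date for the -d option. The file name is David's
# made-up example, and the date is pinned so the result is reproducible.

# "%%" yields a literal "%", so "%%20" survives as the URL escape "%20".
escaped="$(date -d 2021-01-14 '+foo%%20bar-%Y-%m-%d.pdf')"
echo "$escaped"   # prints foo%20bar-2021-01-14.pdf
```

The doubled percent sign is consumed by date itself, so the shell and wget only ever see the single "%20" that belongs in the URL.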
Re: [gentoo-user] [OT] Differences between wget and browser file retrieval?
On Thu, 14 Jan 2021 16:10:09 -0500 Jack wrote:
> On 2021.01.14 15:49, Walter Dnes wrote:
> > I'm bored, so I do a regular daily report at the DSL Reports
> > "CanChat" sub-forum, on the Covid-19 case counts for Ontario, using
> > provincial data. I download 2 files daily as source data. One of
> > them is a PDF file, which is run through "pdftotext" and then parsed
> > by a bash script (don't ask). Today, the command...
> >
> > wget https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
> >
> > ...returns a zero-byte file. *BUT*, sticking the URL into the URL bar
> > of Pale Moon and Google Chrome (and I assume Firefox/etc) brings up
> > the PDF file just fine. Is "wget" being blocked? I have to do extra
> > steps to get from the browser-invoked PDF to get the PDF file saved
> > to the standard work area where my script expects it to be, so it can
> > work its magic and parse out the daily breakdown by PHU (Public
> > Health Unit). BTW, today's posts requiring the PDF file are...
> > https://www.dslreports.com/forum/r33002718-
> > https://www.dslreports.com/forum/r33002752-
> >
> > I've tried setting --user-agent= with my browser's string as shown
> > by https://www.whatismybrowser.com/detect/what-is-my-user-agent but
> > no luck. Is there some way to get around this? I have not updated
> > this past week, so I don't think the problem is at my end.
>
> I just copy/pasted that wget command into my terminal, and it got me a
> 1.7M PDF doc. I'm in the US, but I have no idea if location/IP is an
> issue or not.
>
> Jack

I could download the file too with the wget command that you posted.
If you still have trouble, you could try using curl and pretend that
you're a firefox:

curl 'https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' \
  -H 'Accept-Language: en,de;q=0.7,en-US;q=0.3' \
  --compressed \
  -H 'DNT: 1' \
  -H 'Connection: keep-alive' \
  -H 'Upgrade-Insecure-Requests: 1' \
  -H 'Pragma: no-cache' \
  -H 'Cache-Control: no-cache' \
  > moh-covid-19-report-en-2021-01-14.pdf

Andreas
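Whichever client is used, a zero-byte result can be caught before the file reaches pdftotext. The sketch below is one way to do that, under the assumption that a valid report starts with the usual "%PDF-" signature; looks_like_pdf is a hypothetical helper name, and the user agent is left elided as in the thread.

```shell
#!/bin/bash
# Sketch only: looks_like_pdf is a hypothetical helper name.
# Succeeds when the file is non-empty and begins with the PDF magic "%PDF-".
looks_like_pdf() {
    local file="$1"
    [ -s "$file" ] && [ "$(head -c 5 "$file")" = '%PDF-' ]
}

# Typical use (commented out so that sourcing this file fetches nothing):
#   f=moh-covid-19-report-en-2021-01-14.pdf
#   curl -fsS -A 'Mozilla/5.0 ...' -o "$f" "https://files.ontario.ca/$f"
#   looks_like_pdf "$f" || echo "download of $f is empty or not a PDF" >&2
```

The -f flag makes curl return a non-zero status on HTTP errors such as 403, while the helper covers the "200 OK but empty" case seen in this thread.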
Re: [gentoo-user] [OT] Differences between wget and browser file retrieval?
On 2021.01.14 15:49, Walter Dnes wrote:
> I'm bored, so I do a regular daily report at the DSL Reports "CanChat"
> sub-forum, on the Covid-19 case counts for Ontario, using provincial
> data. I download 2 files daily as source data. One of them is a PDF
> file, which is run through "pdftotext" and then parsed by a bash script
> (don't ask). Today, the command...
>
> wget https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
>
> ...returns a zero-byte file. *BUT*, sticking the URL into the URL bar
> of Pale Moon and Google Chrome (and I assume Firefox/etc) brings up the
> PDF file just fine. Is "wget" being blocked? I have to do extra steps
> to get from the browser-invoked PDF to get the PDF file saved to the
> standard work area where my script expects it to be, so it can work its
> magic and parse out the daily breakdown by PHU (Public Health Unit).
> BTW, today's posts requiring the PDF file are...
> https://www.dslreports.com/forum/r33002718-
> https://www.dslreports.com/forum/r33002752-
>
> I've tried setting --user-agent= with my browser's string as shown by
> https://www.whatismybrowser.com/detect/what-is-my-user-agent but no
> luck. Is there some way to get around this? I have not updated this
> past week, so I don't think the problem is at my end.

I just copy/pasted that wget command into my terminal, and it got me a
1.7M PDF doc. I'm in the US, but I have no idea if location/IP is an
issue or not.

Jack
[gentoo-user] [OT] Differences between wget and browser file retrieval?
I'm bored, so I do a regular daily report at the DSL Reports "CanChat"
sub-forum, on the Covid-19 case counts for Ontario, using provincial
data. I download 2 files daily as source data. One of them is a PDF
file, which is run through "pdftotext" and then parsed by a bash script
(don't ask). Today, the command...

wget https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf

...returns a zero-byte file. *BUT*, sticking the URL into the URL bar
of Pale Moon and Google Chrome (and I assume Firefox/etc) brings up the
PDF file just fine. Is "wget" being blocked? I have to do extra steps
to get from the browser-invoked PDF to get the PDF file saved to the
standard work area where my script expects it to be, so it can work its
magic and parse out the daily breakdown by PHU (Public Health Unit).
BTW, today's posts requiring the PDF file are...
https://www.dslreports.com/forum/r33002718-
https://www.dslreports.com/forum/r33002752-

I've tried setting --user-agent= with my browser's string as shown by
https://www.whatismybrowser.com/detect/what-is-my-user-agent but no
luck. Is there some way to get around this? I have not updated this
past week, so I don't think the problem is at my end.

-- 
Walter Dnes
I don't run "desktop environments"; I run useful applications