Hey, just repeating what you already wrote:
The --if-modified-since (which is enabled with -N) relies on the proper timestamp of the file when the file exists on your local disk.
When stopping wget in the middle of a download, the partial file's timestamp is the current timestamp. Restarting the download with -N will very likely get a 304 from the server because it is unlikely that the file content changed on the server side in the meantime.
Using the HEAD request instead of --if-modified-since helps in your special case. But I recently learned that not all servers even allow HEAD requests. And the HEAD request always adds another round trip (extra request + response), so this is a sub-optimal solution.
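To make the difference between the two checks concrete, here is a minimal sketch of the decision logic, not wget's actual code. The function names are made up for illustration, the server rule is simplified to a plain date comparison, and the local timestamp is an assumed value standing in for "the time the download was interrupted". The Last-Modified date and the sizes are taken from the logs quoted further down.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def server_sends_304(if_modified_since, last_modified):
    """Simplified server rule: answer 304 unless the resource is newer."""
    return last_modified <= if_modified_since


def head_style_needs_download(local_size, local_mtime, remote_size, last_modified):
    """HEAD-based check as described in this thread: refetch if newer or size differs."""
    return last_modified > local_mtime or remote_size != local_size


# Values from the quoted logs.
last_modified = parsedate_to_datetime("Thu, 23 May 2024 10:46:13 GMT")
remote_size = 185751081

# Partial file left by an interrupted run: 1 byte, stamped with "now" (assumed value).
local_size = 1
local_mtime = datetime(2024, 5, 24, 9, 42, 0, tzinfo=timezone.utc)

# If-Modified-Since carries the partial file's fresh timestamp, so the server says 304.
print(server_sends_304(local_mtime, last_modified))  # True

# The size comparison notices the truncated file.
print(head_style_needs_download(local_size, local_mtime, remote_size, last_modified))  # True
```

The first check never sees a size at all, which is why the partial file goes unnoticed.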
What other options do we have to make --if-modified-since workable in your scenario? (Apart from switching --if-modified-since off)
a) When you download a file, use a temporary file name. After wget exits, check the return status and if it is 0, rename the file.
The downside is that you always have to download the file, even if it didn't change.
b) Wget could make sure that the file timestamp is set properly when exiting. This doesn't always work, e.g. when the whole system is switched off, no exit function in wget will be executed.
c) After every write, the file's timestamp is set to the server's timestamp. This doubles the number of syscalls and will increase CPU usage.
d) Use the metalink protocol. It provides checksums for your files and if the checksum of the local file differs, only the corrupted parts are re-downloaded. It's a great protocol (working via HTTP/HTTPS) that sadly never got real traction. So most distributions don't compile it in and you have to build your own version of wget.
e) Use FTP(S)... the downside is that your admin likely isn't happy about it.
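Option a) could be scripted roughly as follows. This is only a sketch under assumptions: the function name, the ".part" suffix and the injectable `run` parameter are made up for illustration, and the only wget feature used is `-O` to choose the output file.

```python
import os
import subprocess


def download_then_rename(url, dest, run=subprocess.run):
    """Fetch url into a temporary file and rename it to dest only if wget exits with 0."""
    tmp = dest + ".part"  # hypothetical temporary name
    result = run(["wget", "-O", tmp, url])
    if result.returncode != 0:
        # Leave any existing dest untouched and drop the incomplete temporary file.
        if os.path.exists(tmp):
            os.remove(tmp)
        return False
    # os.replace is atomic when both names are on the same filesystem.
    os.replace(tmp, dest)
    return True
```

A call such as `download_then_rename(url, "myarchive.tar.lz")` never leaves a partial file under the final name, but as said above, it transfers the whole file on every run.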
Maybe b) is a compromise, even if it's not perfect!?
Again, I am not saying that switching back to HEAD requests is out of the race. First, I'd like to see more ideas / suggestions on this topic.
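For reference, the timestamp-setting step that b) and c) talk about boils down to one extra call per invocation. The sketch below only illustrates that step, written in Python rather than wget's C; the function name is made up, and it says nothing about where wget would hook it in (at exit for b, after each write for c).

```python
import os
from email.utils import parsedate_to_datetime


def apply_server_timestamp(path, last_modified_header):
    """Set the file's atime and mtime to the server's Last-Modified value."""
    ts = parsedate_to_datetime(last_modified_header).timestamp()
    os.utime(path, (ts, ts))  # one additional syscall each time this runs
    return ts
```

With the header from the quoted logs this would be called as `apply_server_timestamp("myarchive.tar.lz", "Thu, 23 May 2024 10:46:13 GMT")`.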
Regards, Tim
On 5/24/24 13:53, Romain Morotti (London) via Public discussion list for GNU Wget development wrote:
Hello,
Apologies for the long email, it is quite long and was quite difficult to debug. I hope you can roll a fix.
There are previous bug reports related to this issue, but they never reached a repro or an explanation.
TL;DR critical bug in wget, wget is leaving corrupted files when using the -N flag.
ROOT CAUSE: Change of behaviour in or around version v1.17. wget -N code was rewritten and a new flag was added --no-if-modified-since off by default, unfortunately the new code and behaviour is incorrect and leaves corrupted downloads.
FIX: -N must always be used together with --no-if-modified-since behavior, otherwise wget will leave corrupted files. The flag --no-if-modified-since should be set by default when -N is used.
WORKAROUND: As a workaround, you can set together “-N --no-if-modified-since” in the command line, however the flag does not exist on older versions of wget and will fail. You may have to detect wget versions and pass relevant flags if you plan to deploy on multiple systems with various wget versions.
CONTEXT: We use wget to download archives and large files to deploy. We started getting regular issues with corrupted archives after moving to ubuntu 22 and latest version of wget.
```
$ wget -N https://mycompany.com/myarchive.tar.gz
$ tar -xf myarchive.tar.gz
(stdin): File ends unexpectedly at pos 94479367
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
```
It took me forever to get to the bottom of it, it's an issue with wget leaving partial corrupted downloads. It is a bug in wget itself.
wget -N flag is meant to only (re)download a file when the timestamp of the file or the file size has changed. It simply stopped working as expected in recent versions, like the recent version in ubuntu 22.
We see the issue happening regularly in production, It triggers after wget is interrupted once.
Interruptions can happen for any reason: the user can Ctrl+C a script, a deployment can be cancelled, the process can be killed or the machine rebooted at any moment. When wget is interrupted, it leaves a partially downloaded file. The timestamp is newer but the size doesn't match the expected file size.
* In older versions of wget, wget was sending a HEAD request to get the file size and the timestamp, then it downloaded the file if the date changed or the size changed. wget worked as expected.
* In recent versions of wget, wget does not detect the file size is incorrect. wget is stuck with a bad file and can never recover.
Recovery requires intervention from a developer or SRE to go onto the affected machine and delete bad files left over by wget.
REPRO: You can Ctrl+C to interrupt wget or you can run “truncate” to simulate a partial download.
```
wget --version
wget -N https://mycompany.com/myarchive.tar.lz --debug --server-response
truncate --size 1 myarchive.tar.lz
wget -N https://mycompany.com/myarchive.tar.lz --debug --server-response
```
DEBUGGING: see logs below for the last call to wget, after truncate
Notice in recent versions, wget is sending a single GET request with an If-Modified-Since header, the server replies with a 304 response to tell the content did not change. The 304 response has no Content-Length header and no content.
This is an edge case of the HTTP spec. The Content-Length header is not required on a 304 response. The header may be set but it is not required. Having a look at the web server response (artifactory/tomcat), the Content-Length is not set.
See HTTP RFC https://datatracker.ietf.org/doc/html/rfc7232#section-4.1
This is a very interesting side effect of the HTTP spec and the real world. It prevents wget from knowing about the file size or getting the content. Turns out, detecting the file size is critical for "wget -N" to operate as expected.
Otherwise it will get into a bad state where a file on disk is bad but wget can’t detect the issue and can’t redownload. I think wget must always send a HEAD request first.
```
wget 1.14 on centos 7 works as expected, send a HEAD request, detect the size has changed, then redownload
wget -N --server-response https://mycompany.com/myarchive.tar.lz
--2024-05-24 10:42:52-- https://mycompany.com/myarchive.tar.lz
Resolving mycompany.com (mycompany.com)... 10.192.10.20
Connecting to mycompany.com (mycompany.com)|10.192.10.20|:443... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Fri, 24 May 2024 09:42:52 GMT
  Content-Type: application/octet-stream
  Content-Length: 185751081
  Connection: keep-alive
  Server: Artifactory
  X-Artifactory-Id: 5e06b1f8f8c7e195:5afd0284:18d702c2085:-8000
  X-Artifactory-Node-Id: dc09bebb5d42
  Last-Modified: Thu, 23 May 2024 10:46:13 GMT
  ETag: 25f65d47dde6ae2015c0fb7fe8fb895ec988ceb0
  X-Checksum-Sha1: 25f65d47dde6ae2015c0fb7fe8fb895ec988ceb0
  X-Checksum-Sha256: 12a219e5c632629f11cfcd954069c1bc5e2273c1684d0877fdea0cf60b2e0d78
  X-Checksum-Md5: 8b8a1d9db73eb2fbb635b45317320f19
  Accept-Ranges: bytes
  X-Artifactory-Filename: myarchive.tar.lz
  Content-Disposition: attachment; filename="myarchive.tar.lz"; filename*=UTF-8''myarchive.tar.lz
Length: 185751081 (177M) [application/octet-stream]
The sizes do not match (local 1) -- retrieving.
--2024-05-24 10:42:52-- https://mycompany.com/myarchive.tar.lz
Reusing existing connection to mycompany.com:443.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Fri, 24 May 2024 09:42:52 GMT
  Content-Type: application/octet-stream
  Content-Length: 185751081
  Connection: keep-alive
  Server: Artifactory
  X-Artifactory-Id: 5e06b1f8f8c7e195:5afd0284:18d702c2085:-8000
  X-Artifactory-Node-Id: dc09bebb5d42
  Last-Modified: Thu, 23 May 2024 10:46:13 GMT
  ETag: 25f65d47dde6ae2015c0fb7fe8fb895ec988ceb0
  X-Checksum-Sha1: 25f65d47dde6ae2015c0fb7fe8fb895ec988ceb0
  X-Checksum-Sha256: 12a219e5c632629f11cfcd954069c1bc5e2273c1684d0877fdea0cf60b2e0d78
  X-Checksum-Md5: 8b8a1d9db73eb2fbb635b45317320f19
  Accept-Ranges: bytes
  X-Artifactory-Filename: myarchive.tar.lz
  Content-Disposition: attachment; filename="myarchive.tar.lz"; filename*=UTF-8''myarchive.tar.lz
Length: 185751081 (177M) [application/octet-stream]
Saving to: ‘myarchive.tar.lz’
100%[==============================================================================>] 185,751,081 277MB/s in 0.6s
2024-05-24 10:42:53 (277 MB/s) - ‘myarchive.tar.lz’ saved [185751081/185751081]
```
```
wget 1.21 on ubuntu 22 doesn’t work. wget incorrectly think there is nothing to download.
wget -N --server-response https://mycompany.com/myarchive.tar.lz
--2024-05-24 10:42:11-- https://mycompany.com/myarchive.tar.lz
Resolving mycompany.com (mycompany.com)... 10.192.10.20
Connecting to mycompany.com (mycompany.com)|10.192.10.20|:443... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 304 Not Modified
  Date: Fri, 24 May 2024 09:42:11 GMT
  Connection: keep-alive
  Server: Artifactory
  X-Artifactory-Id: 5e06b1f8f8c7e195:5afd0284:18d702c2085:-8000
  X-Artifactory-Node-Id: dc09bebb5d42
  Last-Modified: Thu, 23 May 2024 10:46:13 GMT
  ETag: 25f65d47dde6ae2015c0fb7fe8fb895ec988ceb0
  X-Checksum-Sha1: 25f65d47dde6ae2015c0fb7fe8fb895ec988ceb0
  X-Checksum-Sha256: 12a219e5c632629f11cfcd954069c1bc5e2273c1684d0877fdea0cf60b2e0d78
  X-Checksum-Md5: 8b8a1d9db73eb2fbb635b45317320f19
  Accept-Ranges: bytes
  X-Artifactory-Filename: myarchive.tar.lz
  Content-Disposition: attachment; filename="myarchive.tar.lz"; filename*=UTF-8''myarchive.tar.lz
File ‘myarchive.tar.lz’ not modified on server. Omitting download.
```
Regards.