Hello,

Sorry for the delay in getting back to you.

I think the appropriate solution is to send the HEAD request by default, as wget did before. From the previous PRs that changed the behavior, it's not clear to me why it was changed in the first place. I think they missed the edge case where it leaves corrupted files. There were actually a few threads that reported the issue afterwards, but the buggy behavior was not reverted. I think people didn't manage to understand the root cause: the new behavior is buggy. I wonder if the purpose of the change was a micro-optimization to save a HEAD request?

The whole point of "wget -N" is to avoid redownloading large files if they are unchanged (in my case, GBs of files to deploy). Workarounds that require downloading the file and then comparing are not viable. ^^

Personally, I don't think the HEAD request needs to be optimized away. The "wget -N" flag is meant to avoid downloading large files; it's very reasonable to send one HEAD request to save MBs or GBs of downloads.

I think there could be an alternative behavior for wget using a temporary file, as suggested in the last 2 emails. Obviously this would need to be corrected in wget itself.

1) do the --if-modified-since with the file timestamp, when the file already exists.
2) download to a temporary file name (it must be in the same directory or you will have issues with rename across volumes)
3) set the timestamp on the temporary file upon completion
4) rename the temporary file

I can think of another workaround, if it's possible to set the timestamp initially. Wget can create the file, set the timestamp to the "oldest timestamp", write the content gradually, and finally set the real timestamp when the download is completed. However, that wouldn't work if every write sets the file timestamp to now; I don't know how the filesystem operates.

You mentioned an option "c) to write the timestamp after every write operation if needs be". Unfortunately, that doesn't fix the issue.
The download can be interrupted between the write and the timestamp-setting calls, leaving a corrupted file with a newer date.

Regards.

-----Original Message-----
From: Derek Martin <demar...@akamai.com>
Sent: Wednesday, June 12, 2024 7:14 PM
To: Tim Rühsen <tim.rueh...@gmx.de>
Cc: Romain Morotti (London) <romain.moro...@man.com>; wget-...@gnu.org; bug-wget <bug-wget@gnu.org>
Subject: Re: CRITICAL BUG: wget -N is leaving corrupted files

On Sat, Jun 08, 2024 at 07:21:14PM +0200, Tim Rühsen wrote:
> What other options do we have to make --if-modified-since workable in
> your scenario? (Apart from switching --if-modified-since off)
>
> a) When you download a file, use a temporary file name. After wget
> exits, check the return status and if it is 0, rename the file.
>
> The downside is that you always have to download the file, even if it
> didn't change.

I think this is probably the right solution, except:

1. ALWAYS rename the file, even if the download fails / is interrupted.
2. BEFORE the rename, set the timestamp appropriately:
   - set it to the original local file's timestamp if the transfer did not complete successfully
   - set it to the upstream file's timestamp if it did complete successfully.
3. To successfully do that for the most possible cases, you'll need to catch signals and delay their handling until the above is done, in addition to whatever other error handling is already required.

And probably:

4. Document that in cases where clean-up procedures can't catch every last case, temporary files may be left behind, so the user can expect them on errors and manually clean them up.
Probably also name the temporary file something like ${original_file_name}_tmp.XXXXXX so that the user can, if they so choose, rename it to ${original_file_name} and manually reset the timestamp to get wget to resume/redownload or whatever.
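
For illustration, the temporary-file approach discussed in this thread (download into a temporary file in the same directory, set the timestamp on completion, then rename) could look roughly like the Python sketch below. This is not wget code: `save_with_timestamp`, `chunks`, and `remote_mtime` are hypothetical names, the network transfer is stood in for by an iterable of byte chunks, and on failure this sketch simply deletes the temporary file instead of keeping it around for the user.

```python
import os
import tempfile


def save_with_timestamp(dest, chunks, remote_mtime):
    """Write `chunks` to `dest` so that `dest` is only ever replaced by a
    complete file carrying the upstream timestamp.

    Hypothetical helper for illustration only; not part of wget.
    """
    directory = os.path.dirname(os.path.abspath(dest))
    # Create the temporary file next to dest, so the final rename never
    # has to cross volumes. The name looks like "<dest name>_tmp.XXXXXXXX".
    fd, tmp_path = tempfile.mkstemp(
        prefix=os.path.basename(dest) + "_tmp.", dir=directory
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
            tmp.flush()
            os.fsync(tmp.fileno())
        # The data is complete: stamp it with the upstream modification time.
        os.utime(tmp_path, (remote_mtime, remote_mtime))
        # Swap the finished file into place in a single step.
        os.replace(tmp_path, dest)
    except BaseException:
        # Failed or interrupted: dest keeps its old content and timestamp.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Because the destination is only replaced after the data is complete and stamped, an interruption at any earlier point leaves the previous file and its timestamp intact, so a later timestamp comparison still sees the old date.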