URL: <https://savannah.gnu.org/bugs/?66468>
Summary: wget --no-clobber sometimes overwrites existing files
Group: GNU Wget
Submitter: None
Submitted: Wed 20 Nov 2024 12:04:38 PM UTC
Category: Program Logic
Severity: 3 - Normal
Priority: 5 - Normal
Status: None
Privacy: Public
Assigned to: None
Originator Name: Raven
Originator Email: ravenmobil...@gmail.com
Open/Closed: Open
Discussion Lock: Any
Release: 1.20
Operating System: GNU/Linux
Reproducibility: Intermittent
Fixed Release: None
Planned Release: None
Regression: None
Work Required: None
Patch Included: None

_______________________________________________________

Follow-up Comments:

-------------------------------------------------------
Date: Wed 20 Nov 2024 12:04:38 PM UTC    By: Anonymous

I have a Bash script that runs wget v1.21.3 on Debian 12. The script passes wget a file containing URLs to download, using --directory-prefix= and --no-clobber to skip already-downloaded files.

This script worked beautifully. I downloaded approximately 20,000 files without issue, and on rerunning the script it would skip all 20,000 existing files. Great.

A month later I rewrote parts of the script and slightly changed the wget command to use a different output folder with --directory-prefix. Suddenly the script started ignoring --no-clobber for around 5% of the 20,000 files, redownloading them every time I ran my script, fully reproducibly. I ran it dozens of times; it would redownload a group of URLs one after the other (a sequential block of URLs in the middle kept failing, while the other thousands of URLs were fine).

I then picked one URL from this group of always-failing URLs and tried to isolate the cause of the issue. I changed --directory-prefix to /tmp, and it worked fine, skipping the existing file. Then I changed it back to "2.Download_Pages_Data" and it failed again, overwriting the existing file every time (tried dozens of times). I enabled debug output with -d for both the working and the failing output directory, but it provided no new information: when downloading to /tmp it says the file exists, while when downloading to "2.Download_Pages_Data" it acts as if the output file does not exist and overwrites it every time.

I thought it might be a bug in the --directory-prefix parameter, so I tried "cd 2.Download_Pages_Data && wget ..." instead, but that failed in exactly the same manner. "cd /tmp && wget ..." worked fine.

I then wondered whether wget might be having trouble with the period "." in the output directory name "2.Download_Pages_Data", so I started trying other output directory names: "2_Download_Pages_Data" (failed), then "2_Download_Data" (failed), then "2_Data" (WORKED!), then back to "2.Download_Pages_Data" (ALSO WORKED!). Now each time I run the wget command it works, skipping the existing output file.

I'm not sure how it's even possible for wget to break intermittently like this: literally nothing but the output directory changes, and --no-clobber either overwrites files or doesn't. It's clearly a bug, because if --no-clobber were not in effect, wget should produce output_file.html.1, then output_file.html.2, etc., but instead it is overwriting existing files.

In case it's relevant, the URLs I was downloading all end in .html, similar to "Some_Random_Page_Name.html".

How does one even go about trying to debug this? Debug output with -d showed nothing relevant whether it was working or failing, and it goes from working to not working when changing nothing but the output directory.
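[For reference, a minimal sketch of the kind of invocation described above, plus the isolation test from the report. The URL-list filename and the example URL are illustrative assumptions, not taken from the report; only the flags and the directory names come from it.]

    #!/bin/bash
    # Sketch of the reported setup: read URLs from a file, save into a
    # prefix directory, and skip files that already exist locally.
    # "urls.txt" is an assumed filename for illustration.
    wget --input-file=urls.txt \
         --directory-prefix=2.Download_Pages_Data \
         --no-clobber

    # Isolation test described in the report: the same URL, with debug
    # output enabled, into two different output directories.
    URL="https://example.com/Some_Random_Page_Name.html"   # placeholder URL

    # Reported to work: wget notices the existing file and skips it.
    wget -d --no-clobber --directory-prefix=/tmp "$URL"

    # Reported to fail: wget acts as if the file is absent and overwrites it.
    wget -d --no-clobber --directory-prefix=2.Download_Pages_Data "$URL"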
_______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?66468>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/