URL: <https://savannah.gnu.org/bugs/?66468>
Summary: wget --no-clobber sometimes overwrites existing files
Group: GNU Wget
Submitter: None
Submitted: Wed 20 Nov 2024 12:04:38 PM UTC
Category: Program Logic
Severity: 3 - Normal
Priority: 5 - Normal
Status: None
Privacy: Public
Assigned to: None
Originator Name: Raven
Originator Email: ravenmobil...@gmail.com
Open/Closed: Open
Discussion Lock: Any
Release: 1.20
Operating System: GNU/Linux
Reproducibility: Intermittent
Fixed Release: None
Planned Release: None
Regression: None
Work Required: None
Patch Included: None

_______________________________________________________

Follow-up Comments:

-------------------------------------------------------
Date: Wed 20 Nov 2024 12:04:38 PM UTC    By: Anonymous

I have a Bash script that runs wget v1.21.3 on Debian 12. The script passes wget a file containing URLs to download, using --directory-prefix= and --no-clobber to skip already-downloaded files.

This script worked beautifully. I downloaded approximately 20,000 files without issue, and on rerunning the script it would skip all 20,000 existing files. Great.

A month later I rewrote parts of the script and slightly changed the wget command to use a different output folder with --directory-prefix. Suddenly the script started ignoring --no-clobber for around 5% of the 20,000 files, redownloading them every time I ran my script, fully reproducibly. I ran it dozens of times; it would redownload a group of URLs one after the other (a sequential block of URLs in the middle kept failing, while the other thousands of URLs were fine).

I then picked one URL from this group of always-failing URLs and tried to isolate the cause of the issue. I changed --directory-prefix to /tmp, and it worked fine, skipping the existing file. Then I changed it back to "2.Download_Pages_Data" and it failed again, overwriting the existing file every time (tried dozens of times). I enabled debug output with -d for both the working and the failing output directory, but it provided no new information: when downloading to /tmp it says the file exists, while when downloading to "2.Download_Pages_Data" it acts as if the output file does not exist and overwrites it every time.

I thought it might be a bug in the --directory-prefix parameter, so I tried "cd 2.Download_Pages_Data && wget ..." instead, but that failed in exactly the same manner. "cd /tmp && wget ..." worked fine.

I then wondered whether wget might be having trouble with the period "." in the output directory name "2.Download_Pages_Data", so I started trying other output directory names: "2_Download_Pages_Data" (failed), then "2_Download_Data" (failed), then "2_Data" (WORKED!), then back to "2.Download_Pages_Data" (ALSO WORKED!). Now each time I run the wget command it works, skipping the existing output file.

I'm not sure how it's even possible for wget to break intermittently like this: literally nothing but the output directory changes, and --no-clobber either overwrites files or doesn't. It's clearly a bug, because if --no-clobber were not in effect, wget should produce output_file.html.1, then output_file.html.2, etc., but instead it is overwriting existing files.

In case it's relevant, the URLs I was downloading all end in .html, similar to "Some_Random_Page_Name.html".

How does one even go about trying to debug this? Debug output with -d showed nothing relevant whether it was working or failing, and it goes from working to not working when changing nothing but the output directory.
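[For reference, a minimal sketch of the kind of invocation described above, plus the isolation test from the report. The URL-list filename and the example URL are illustrative assumptions, not taken from the report; only the flags and the directory names come from it.]

    #!/bin/bash
    # Sketch of the reported setup: read URLs from a file, save into a
    # prefix directory, and skip files that already exist locally.
    # "urls.txt" is an assumed filename for illustration.
    wget --input-file=urls.txt \
         --directory-prefix=2.Download_Pages_Data \
         --no-clobber

    # Isolation test described in the report: the same URL, with debug
    # output enabled, into two different output directories.
    URL="https://example.com/Some_Random_Page_Name.html"   # placeholder URL

    # Reported to work: wget notices the existing file and skips it.
    wget -d --no-clobber --directory-prefix=/tmp "$URL"

    # Reported to fail: wget acts as if the file is absent and overwrites it.
    wget -d --no-clobber --directory-prefix=2.Download_Pages_Data "$URL"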
_______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?66468>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/