Re: Problematic default file naming system (BUG?)

ge...@mweb.co.za Sun, 15 Oct 2023 12:40:46 -0700

Functioning as designed ... 

(Disclaimer: I am not an expert user of this program, but I have some 
experience that may help you:)


I guess you are Windows users. Unlike Unix and Linux systems, in Windows the 
last part of a file name (anything following the last, "rightmost", period) is 
considered the file extension and is used to determine which application 
opens the file by default (e.g. a .html file would be opened by a browser, 
a .doc (or nowadays a .docx) file would be given to a word processor such as 
Microsoft Word's winword.exe, the .exe indicating that this file contains 
executable code, etc.). 

That is one aspect of what is going on here - you downloaded something that 
was a .html file, but you didn't give it a name. Somewhere in the documentation 
it will tell you that (and presumably why) wget gives such a file a default 
file name of "index" followed by the file extension. 

The other aspect is what will happen if you download a file to a location where 
a file of the same name and extension is already present. There are a few 
options, between which you can choose using parameters on the command line - 
and these options make good sense in certain circumstances and none at all in 
certain other circumstances. (I'll let you dig through the documentation of 
wget, since that is an important part of testing (evaluating) the program as 
part of your project ;-) 

The most obvious choices you may want to try out are the following (and they 
apply regardless of whether you are downloading a file named index.html or an 
image file named JamesBond007.jpg - I'll go with index.html for an example): 

First option: 

Your existing file index.html is now outdated and the new version - with the 
same file name - will overwrite it. (hint: in the language of the 
documentation, it will "clobber" the file.) 

Second option: 

Your existing file should not be overwritten ("clobbered"), so even though your 
new file was meant to have the same name, it will be called index.html.1 or 
index.html.2 or, eventually, index.html.4711 and so on. This may not be pretty, 
but it is effective. Windows users would typically expect to see a different 
syntax (but wget is not just for Windows): index (1).html, index (2).html, 
..., index (4711).html might look more acceptable to you ...
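
To make the difference between the two numbering schemes concrete, here is a 
minimal Python sketch. It only models the naming behaviour described above; it 
is not wget's own code, and the function names wget_style_name and 
windows_style_name are made up for this illustration. "existing" stands for the 
set of file names already present in the download directory.

```python
import os


def wget_style_name(name, existing):
    """Return name if unused, else the first name.N (N = 1, 2, ...) that is free."""
    if name not in existing:
        return name
    n = 1
    while f"{name}.{n}" in existing:
        n += 1
    return f"{name}.{n}"


def windows_style_name(name, existing):
    """Return name if unused, else the first 'stem (N).ext' that is free."""
    if name not in existing:
        return name
    # splitext("index.html") gives ("index", ".html")
    stem, ext = os.path.splitext(name)
    n = 1
    while f"{stem} ({n}){ext}" in existing:
        n += 1
    return f"{stem} ({n}){ext}"
```

With only index.html present, the first function gives index.html.1 and the 
second gives index (1).html. The first scheme never has to guess where the 
extension starts, which may be part of why it is used on systems where the 
extension has no special meaning.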

Third option:

When downloading files across a notoriously unreliable line the process may be 
interrupted by line failure before the file is complete. Wget gives you the 
option then to continue downloading by adding the additional data from retrying 
the download to the end of the existing file - in my life that has been the 
option I used most, especially since Murphy's Law stipulates that the worse 
your line, the bigger your files. 
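
For the curious: over HTTP, continuing a download comes down to asking the 
server for only the bytes you do not have yet, using a Range header, and 
appending the answer to the partial file. The Python sketch below shows just 
that first step under simple assumptions; it is not how wget itself is written, 
and build_resume_request is a hypothetical helper name.

```python
import os
import urllib.request


def build_resume_request(url, local_path):
    """Build a request for the bytes that local_path is still missing.

    Returns the request and the number of bytes already on disk.
    """
    request = urllib.request.Request(url)
    have = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    if have > 0:
        # Ask the server to start at the first byte we do not have yet.
        request.add_header("Range", f"bytes={have}-")
    return request, have
```

The server has to support this (it answers with status 206, "Partial Content"). 
If it sends the whole file again instead, appending would corrupt your copy, 
which is one reason why this option makes no sense in some circumstances.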

Obviously, wget can't decide for you which of these options you need in any 
given situation. And it is pretty much impossible to fix the results after the 
fact if you chose the wrong one. What you can do, though, is rename all the 
.1, .2, .3, etc. files to something more sensible. And when you plan to 
download complete web sites or similar groups of files, wget offers you ways 
to drop them with sensible names (most likely taken from your source) into a 
suitable directory structure (e.g. to duplicate the source structure). 
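
Renaming the .1, .2, .3 files by hand gets tedious quickly, so here is a small 
Python sketch that does it for one directory. It assumes you want the Windows 
look, "index (1).html"; the names sensible_name and rename_numbered are 
invented for this example. It matches purely on the pattern name.ext.number, so 
check the directory first for files that legitimately end in a number.

```python
import re
from pathlib import Path

# Matches names such as "index.html.1": stem, original extension, numeric suffix.
NUMBERED = re.compile(r"^(?P<stem>.+)\.(?P<ext>[^.]+)\.(?P<n>\d+)$")


def sensible_name(filename):
    """Map 'index.html.1' to 'index (1).html'; return None if the pattern does not fit."""
    m = NUMBERED.match(filename)
    if m is None:
        return None
    return f"{m['stem']} ({m['n']}).{m['ext']}"


def rename_numbered(directory):
    """Rename every numbered file in directory; return the (old, new) name pairs."""
    renamed = []
    for path in sorted(Path(directory).iterdir()):
        new_name = sensible_name(path.name) if path.is_file() else None
        if new_name is None:
            continue
        target = path.with_name(new_name)
        if target.exists():
            # Never overwrite ("clobber") a file that is already there.
            continue
        path.rename(target)
        renamed.append((path.name, new_name))
    return renamed
```

Calling rename_numbered(".") in the download directory would turn index.html.1 
into "index (1).html" and leave index.html itself alone.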

Study the documentation that came with your downloaded copy of wget (or find it 
elsewhere on the web) and play with the program a bit more. Do come back here 
for more advice if/when needed. And I'll let the experts answer when their 
input is needed ;-)

Good luck, 

Gerd

 


----- Original Message -----
From: "Joel F Leppänen" <joel.f.leppa...@student.lut.fi>
To: "bug-wget" <bug-wget@gnu.org>
Sent: Sunday, October 15, 2023 4:44:33 PM
Subject: Problematic default file naming system (BUG?)

Hi all,

We’re testing wget version 1.24.4 for a school project. When downloading an 
.html file, if you don’t name it and download additional .html files, also 
unnamed, it saves the second and the following files after that in formats that 
don’t exist. The first one is saved as ”index.html” and the second one as 
”index.html.1”, the third one as ”index.html.2” and so forth. The files can of 
course be changed back to .html-formats afterwards, but I feel like this is a 
bug that affects user experience negatively (or it’s intended, but I can’t 
figure out why that would be).

Regards,
Joel Leppänen and Werneri Punavaara
LUT University
