Re: [Bug-wget] Shouldn't wget strip leading spaces from a URL?
L A Walsh writes:
>> But of course, no URL contains an embedded space.
> ---
> Why not?

Because that's what it says in RFC 3986, which is what *defines* what a URL *is*.

Now, someone can provide a string that contains spaces and claim it's a URL, but it isn't. The question is, what to do with it? My preference is to barf and tell the user that what they provided wasn't a proper URL.

Beyond that, one might do some simple tidying up, such as removing leading and trailing spaces. That fix, by the way, is known to be safe, *because a URL can't contain a space*, and so any trailing space can't actually be part of the URL.

It gets uglier when there are invalid characters in the middle of the URL, because simply deleting them is unlikely to produce the results the user expected.

Dale
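The two behaviours described above (tolerate padding at either end, but reject a string with whitespace in the middle) can be sketched as follows. This is a minimal Python illustration, not wget's code; the name `tidy_url` and the set of characters treated as padding are assumptions made for the example.

```python
import string

# Characters treated as strippable padding (an assumption for this sketch).
_PADDING = " \t\r\n"


def tidy_url(raw: str) -> str:
    """Strip leading/trailing padding; refuse anything with embedded whitespace."""
    url = raw.strip(_PADDING)
    if not url:
        raise ValueError("empty URL")
    # A URL cannot contain whitespace, so whatever is left after trimming
    # is a user error that should be reported rather than silently repaired.
    if any(ch in string.whitespace for ch in url):
        raise ValueError(f"not a proper URL (embedded whitespace): {raw!r}")
    return url
```

Trimming is safe for exactly the reason given in the message: the removed characters could never have been part of the URL.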
Re: [Bug-wget] Shouldn't wget strip leading spaces from a URL?
On 13/06/17 02:09, L A Walsh wrote:
>
>
> Dale R. Worley wrote:
>> L A Walsh writes:
>>> W/cut+paste into target line, where URL is double-quoted. More often
>>> than not, I find it safer to double-quote a URL than not, because, for
>>> example, shells react badly to embedded spaces, ampersands and
>>> question marks.
>>
>> But of course, no URL contains an embedded space.
> ---
> Why not?
>
> John Mueller of Google posted a note about spaces in the URL on Google+.
> You know, the URLs that look like www.domain.com/file name goes here.html.
>
> Should you fill those holes?
>
> John Mueller of Google said "the answer is not "no"" when it comes to the
> question "Should you encode spaces in URLs as "%20", "+" or
> as a space (" ")?"
>

But those are correctly handled by wget.

I guess the whole point of this thread is to handle leading (and probably also trailing) spaces. And the easiest way to 'handle' them is to remove them. So we need a patch that trims the URL; as Tim said (I think), it shouldn't be hard.

Other than that, there's nothing more to worry about, is there? Spaces within a path/query are already correctly dealt with, and spaces in any other place (htt ps://... !!!) are probably incorrect and should be reported as such.

> But what would someone at google know?
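The policy outlined in this reply (trim the ends, percent-encode spaces inside the path or query, report spaces anywhere else) might look like the sketch below. It is a Python illustration under stated assumptions, not wget's implementation: `normalize_url` is a hypothetical name, only the space character is handled, and the split into scheme, authority and remainder is deliberately simplistic.

```python
def normalize_url(raw: str) -> str:
    """Trim the URL, encode spaces in path/query, reject spaces elsewhere."""
    url = raw.strip()
    scheme, sep, rest = url.partition("://")
    if not sep or not scheme or " " in scheme:
        # e.g. "htt ps://..." is reported, not repaired.
        raise ValueError(f"invalid scheme in {raw!r}")
    authority, slash, tail = rest.partition("/")
    if " " in authority:
        raise ValueError(f"invalid host in {raw!r}")
    # Spaces after the authority belong to the path or query: encode them.
    return f"{scheme}://{authority}{slash}{tail.replace(' ', '%20')}"
```

For example, `normalize_url(" http://www.domain.com/file name goes here.html ")` yields the same URL with the padding removed and each inner space written as `%20`.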
Re: [Bug-wget] [GSoC Update] Week 2
> = NEXT STEPS ===
>
> Things which would be done in the coming week:
>
> * Finish wget_test_start_server() in order to call Libmicrohttpd as service for wget_test().
>   Problems and questions that need to be resolved:
>   - Decide on the best threading model for Libmicrohttpd. Currently using MHD_USE_INTERNAL_POLLING_THREAD, which uses external select. I am still checking the comparison with the legacy code that uses the Wget2 API wget_thread_start.

Choose any mechanism that uses select(). We can change the threading model at a later stage if it turns out to be a bottleneck. `epoll` is Linux-only and even `poll` isn't always available, so as long as you choose a `select` based implementation, it should be fine for now.

>   - http_server_port is still hardcoded.

This is important. The port should be a random number. Usually, passing 0 as the port number makes the kernel choose an open one for you. Having a randomised port is important to ensure that multiple runs don't step on each other.

>   - In ahc_eco() of Libmicrohttpd, the URL data still uses static checking for matching against the requested URLs. In other words, it's hardcoded. This needs to be changed to a dynamic method to accommodate variadic data.
>   - https is still not touched yet.
>   - What to do with the FTP and FTPS functions, since Libmicrohttpd only provides service for HTTP? Do we need to keep the functions for FTP{s}, or remove them?

We keep the FTP code intact for now. If time permits, we should look into different libraries that provide an FTP server in C and try to integrate that into your test suite as well. But for now, it is out of scope.

>   - The last check failed when the test tried to resolve a URL with a question mark, e.g. "/subdir1/subpage1.html?query". When I debug, it returns just "/subdir1/subpage1.html", so the result is 404 Not Found. I also checked using the logging example source code provided in the Libmicrohttpd tutorial [4]. When I access it using an HTTP client such as Wget2 or Firefox, the result is still the same: the URL omits the query part. I need to confirm with the Libmicrohttpd side whether this is intended behaviour or not.
> * Make sure the whole test suite runs correctly.
>
> [1]: https://gitlab.com/dstw/wget2
> [2]: https://www.gnu.org/software/libmicrohttpd/manual/libmicrohttpd.html#microhttpd_002ddauth
> [3]: https://gitlab.com/dstw/wget2/tree/use-mhd
> [4]: https://www.gnu.org/software/libmicrohttpd/tutorial.html#logging_002ec
>
> Regards,
> Didik Setiawan

--
Thanking You,
Darshit Shah
PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6
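The advice about passing 0 as the port number can be illustrated with a plain socket. The sketch below is in Python rather than C and does not touch Libmicrohttpd; `bind_ephemeral_port` is a hypothetical helper that shows how a test harness can learn which port the kernel picked.

```python
import socket


def bind_ephemeral_port(host: str = "127.0.0.1") -> tuple[socket.socket, int]:
    """Bind a listening TCP socket to a kernel-chosen port and return both."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Port 0 asks the kernel to pick any currently free port.
    sock.bind((host, 0))
    sock.listen()
    # getsockname() reveals the port that was actually assigned.
    return sock, sock.getsockname()[1]
```

Because every run asks the kernel for a free port, two test runs started at the same time do not collide the way they would with a hardcoded port. The caller is responsible for closing the returned socket.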
Re: [Bug-wget] overwrite
Hi Ansgar,

On 06/13/2017 09:40 AM, taschenberggr...@streber24.de wrote:
> Hi,
>
> a common question online is how to properly force wget to make an
> overwrite of an existing file name.
>
> The existing options are quite confusing and I am under the impression
> that even taking what works does not imply users understand what they
> do.
>
> The background of my question is
> https://bugs.winehq.org/show_bug.cgi?id=43100
> and your section in the manual
> "When running Wget without ‘-N’, ‘-nc’, ‘-r’, or ‘-p’, downloading the same file
> in the same directory will result in the original copy of file being preserved
> and the second copy being named ‘file.1’. If that file is downloaded yet again,
> the third copy will be named ‘file.2’, and so on."

I admit this is pretty confusing. It has grown historically, and we won't change it, so as not to break existing scripts etc.

> If I want to disable that default behaviour what option do I take? I go
> with -nc but I have no clue why I have to take n and c, and online
> recommendations vary from recommending n, c, and nc.
>
> I am looking for something intuitive like:
> wget --overwrite https://dl.winehq.org/wine-builds/Release.key

There are several possibilities.

I personally prefer to make a backup of existing files before downloading (just in case the download stops in the middle and leaves me with a broken file). That can be used to move an existing file out of the way:

  mv -f Release.key Release.key.bak
  wget https://dl.winehq.org/wine-builds/Release.key

Wget can do this without an additional command, and even rotates up to an arbitrary number of backups (see 'man wget'):

  wget --backup=3 https://dl.winehq.org/wine-builds/Release.key

But if you still want to replace a file in place (not recommended), you can

  wget -O Release.key https://dl.winehq.org/wine-builds/Release.key

or (basically the same)

  wget -O- https://dl.winehq.org/wine-builds/Release.key > Release.key

The -nc just switches off saving multiple versions of the file (.1, .2, ...).

Wget is designed not to overwrite files easily, in order to prevent accidental overwrites.

> Best,
> Ansgar

With Best Regards, Tim
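The default naming rule quoted from the manual (keep the original, then save as `file.1`, `file.2`, and so on) can be sketched in a few lines. This is a Python illustration of the rule as quoted, not wget's source; `next_free_name` and the idea of passing the existing names as a set are assumptions made for the example.

```python
def next_free_name(name: str, existing: set[str]) -> str:
    """Return the name a new download would get under the default rule."""
    if name not in existing:
        return name
    # The original copy is preserved; later copies get a numeric suffix.
    n = 1
    while f"{name}.{n}" in existing:
        n += 1
    return f"{name}.{n}"
```

With `existing = {"Release.key"}` the next download would be saved as `Release.key.1`, which is the behaviour the question wants to avoid.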
[Bug-wget] overwrite
Hi,

a common question online is how to properly force wget to make an overwrite of an existing file name.

The existing options are quite confusing and I am under the impression that even taking what works does not imply users understand what they do.

The background of my question is https://bugs.winehq.org/show_bug.cgi?id=43100 and your section in the manual

"When running Wget without ‘-N’, ‘-nc’, ‘-r’, or ‘-p’, downloading the same file in the same directory will result in the original copy of file being preserved and the second copy being named ‘file.1’. If that file is downloaded yet again, the third copy will be named ‘file.2’, and so on."

If I want to disable that default behaviour what option do I take? I go with -nc but I have no clue why I have to take n and c, and online recommendations vary from recommending n, c, and nc.

I am looking for something intuitive like:

  wget --overwrite https://dl.winehq.org/wine-builds/Release.key

Best,
Ansgar
Re: [Bug-wget] [GSoC Update] Week 2
On 06/13/2017 05:42 AM, Didik Setiawan wrote:
> - The last check failed when the test tried to resolve a URL with a question
>   mark, e.g. "/subdir1/subpage1.html?query". When I debug, it returns just
>   "/subdir1/subpage1.html", so the result is 404 Not Found. I also checked
>   using the logging example source code provided in the Libmicrohttpd
>   tutorial [4]. When I access it using an HTTP client such as Wget2 or
>   Firefox, the result is still the same: the URL omits the query part. I need
>   to confirm with the Libmicrohttpd side whether this is intended behaviour
>   or not.

Yes, that's intended. For URL parameters/arguments you need to use MHD_get_connection_values() with kind=MHD_GET_ARGUMENT_KIND to inspect them.

Happy hacking!

Christian
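The separation Christian describes, where the handler sees only the path and the arguments are fetched separately, has a close analogue in most HTTP stacks. The sketch below shows the same split in Python with the standard library; it is an analogy only and does not use or model the Libmicrohttpd API, and `split_request_target` is a hypothetical name.

```python
from urllib.parse import parse_qs, urlsplit


def split_request_target(target: str) -> tuple[str, dict[str, list[str]]]:
    """Split a request target into the path and its query arguments."""
    parts = urlsplit(target)
    # keep_blank_values=True keeps bare keys such as "?query" (no "=value").
    args = parse_qs(parts.query, keep_blank_values=True)
    return parts.path, args
```

A test server that matches on the path alone and then compares the arguments separately would find "/subdir1/subpage1.html?query" under "/subdir1/subpage1.html" with one argument named "query", instead of answering 404.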
Re: [Bug-wget] wget and srcset tag
Fixed in current master (fix release will be 1.19.2).

Thanks for your report and help!

With Best Regards, Tim

On 06/12/2017 06:07 PM, Chris wrote:
> Hi Tim,
>
> I just created a test page at -
> https://www.anfractuosity.com/files/test2.html
> where I still get the issue.
>
> The version is 'GNU Wget 1.19.1 built on linux-gnu.'
>
> cheers
> Chris
>
>
> On 12 June 2017 at 15:35, Tim Rühsen wrote:
>
>> On 06/12/2017 10:27 AM, chris wrote:
>>> Hi Tim,
>>>
>>> Thanks for your reply, I notice the following in the debug logs:
>>>
>>> """
>>> will convert url
>>> http://www.anfractuosity.com/wp-content/uploads/2014/02/fsk.png to local
>>> site_output/fsk.png
>>> will convert url
>>> https://www.anfractuosity.com/wp-content/uploads/2014/02/fsk.png to local
>>> site_output/fsk.png.html
>>> """
>>>
>>> The difference between those URLs seems to be one is https and one isn't.
>>> When I wget those URLs though, both seem to return a .png, with 'Length:
>>> 51068 (50K) [image/png]'.
>>>
>>> So I'm a bit confused why I get the fsk.png.html URL.
>>
>> What version of wget are you using ? (1.19.1 here)
>>
>> I tried some combinations of srcset (with https and http) and your
>> original options. I thought of an issue with redirection (because that's
>> an answer with text/html Content-Type).
>>
>> Could you create a small reproducer page ? e.g. like
>>
>> srcset="https://www.anfractuosity.com/wp-content/uploads/2014/02/fsk.png 533w,
>> http://www.anfractuosity.com/wp-content/uploads/2014/02/fsk-266x300.png 266w">
>>
>> With whatever paths you are using for the .png files.
>> I don't want to download tons of files (limited bandwidth here).
>>
>>> cheers
>>> Chris
>>>
>>> On Mon, Jun 12, 2017 at 9:08 AM, Tim Rühsen wrote:
>>>
>>>> Hi Chris,
>>>>
>>>> On 06/11/2017 05:24 PM, chris wrote:
>>>>> Hi,
>>>>>
>>>>> I'm just wondering if I've possibly found a bug, unless I'm just doing
>>>>> something incorrectly (which I assume is more likely).
>>>>>
>>>>> I grab my webpage using 'wget -T1 -t1 -E -k -H -nd -N -p -P site_output
>>>>> https://www.anfractuosity.com/projects/ultrasound-networking/ > note1 2> note2'
>>>>>
>>>>> But I notice the srcset tags in the resulting downloaded files produce
>>>>> 'srcset="fsk.png.html 533w, fsk-266x300.png 266w" sizes="(max-width: 533px)
>>>>> 100vw, 533px" />' in the output index.html.
>>>>>
>>>>> On the actual webpage it looks like "srcset="
>>>>> https://www.anfractuosity.com/wp-content/uploads/2014/02/fft.png 762w,"
>>>>> no .html extension on the .png.
>>>>
>>>> You requested -E (--adjust-extension) and -k (--convert-links). That
>>>> would change the file name when the server tags the file as
>>>> content-type 'text/html'. You could see that in the debug output
>>>> (options -d or --debug).
>>>>
>>>>> Cheers
>>>>> Chris
>>>>
>>>> With Best Regards, Tim
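The renaming that -E performs, as explained in the earliest reply of this thread, can be sketched roughly as below. This is a Python illustration of the HTML case as the wget manual describes it for --adjust-extension, not wget's source code; `adjust_extension` is a hypothetical helper, and other content types that the option may also handle are left out.

```python
import re

# Names already ending in .htm or .html (any letter case) are left alone.
_HTML_SUFFIX = re.compile(r"\.[Hh][Tt][Mm][Ll]?$")

_HTML_TYPES = ("text/html", "application/xhtml+xml")


def adjust_extension(local_name: str, content_type: str) -> str:
    """Append ".html" when the server labels a non-.html file as HTML."""
    # Drop parameters such as "; charset=UTF-8" before comparing.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime in _HTML_TYPES and not _HTML_SUFFIX.search(local_name):
        return local_name + ".html"
    return local_name
```

Under this rule, a response for fsk.png that arrives tagged as text/html (for example the body of a redirect, as suspected in the thread) is stored as fsk.png.html, which matches the name that showed up in the converted srcset attribute.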