Re: [Bug-wget] Shouldn't wget strip leading spaces from a URL?

2017-06-13 Thread Dale R. Worley
L A Walsh  writes:
>> But of course, no URL contains an embedded space.
> ---
> Why not?

Because that's what it says in RFC 3986, which is what *defines* what a
URL *is*.

Now, someone can provide a string that contains spaces and claim it's a
URL, but it isn't.  The question is, What to do with it?  My preference
is to barf and tell the user that what they provided wasn't a proper
URL.

Beyond that, one might do some simple tidying up, such as removing
leading and trailing spaces.  That fix, by the way, is known to be safe,
*because a URL can't contain a space*, and so any trailing space can't
actually be part of the URL.

It gets uglier when there are invalid characters in the middle of the
URL, because simply deleting them is unlikely to produce the results the
user expected.
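
As a rough illustration of the policy described above (this is not wget's
actual code; `tidy_url` is a hypothetical helper name), a Python sketch
could strip surrounding whitespace and refuse anything that still contains
whitespace afterwards:

```python
import string


def tidy_url(raw: str) -> str:
    """Strip leading/trailing whitespace; reject whitespace inside the URL.

    Hypothetical helper: trimming is safe because a valid URL cannot
    contain a space, so surrounding spaces can never be part of it.
    """
    url = raw.strip(string.whitespace)
    if any(ch in string.whitespace for ch in url):
        # Deleting characters from the middle would be guessing, so complain.
        raise ValueError(f"not a valid URL (embedded whitespace): {raw!r}")
    return url
```

With this sketch, "  https://example.com/a " is tidied silently, while
"htt ps://example.com" is reported to the user instead of being guessed at.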

Dale



Re: [Bug-wget] Shouldn't wget strip leading spaces from a URL?

2017-06-13 Thread Ander Juaristi


On 13/06/17 02:09, L A Walsh wrote:
> 
> 
> Dale R. Worley wrote:
>> L A Walsh  writes:
>>> W/cut+paste into target line, where URL is double-quoted.  More often
>>> than not, I find it safer to double-quote a URL than not, because, for
>>> example, shells react badly to embedded spaces, ampersands and
>>> question marks.
>>
>> But of course, no URL contains an embedded space.
> ---
> Why not?
> 
> John Mueller of Google posted a note about spaces in the URL on Google+.
> You know, the URLs that look like www.domain.com/file name goes here.html.
>
> Should you fill those holes?
>
> John Mueller of Google said "the answer is not "no"" when it comes to the
> question "Should you encode spaces in URLs as "%20", "+" or
> as a space (" ")?"
>
> 

But those are correctly handled by wget.

I guess the whole point of this thread is how to handle leading (and probably
also trailing) spaces. The easiest way to 'handle' them is to remove them. So
we need a patch that trims the URL, as Tim said (I think); that shouldn't be
hard.

Other than that, there's nothing more to worry about, is there? Spaces within
a path/query are already dealt with correctly, and spaces in any other place
(htt ps://... !!!) are probably incorrect and should be reported as such.

> 
> But what would someone at google know?



Re: [Bug-wget] [GSoC Update] Week 2

2017-06-13 Thread Darshit Shah


> === NEXT STEPS ===
> Things which would be done in the coming week:
>
> * Finish wget_test_start_server() in order to call Libmicrohttpd as a
>   service for wget_test(). Problems and questions that need to be resolved:
>   - Decide on the best threading model for Libmicrohttpd. Currently using
>     MHD_USE_INTERNALLY_POLLING_THREAD, which uses external select. I am
>     still checking the comparison with the legacy code that uses the Wget2
>     API wget_thread_start.

Choose any mechanism that uses select(). We can change the threading
model at a later stage if it turns out to be a bottleneck. `epoll` is
Linux-only and even `poll` isn't always available, so as long as you
choose a `select`-based implementation, it should be fine for now.



>   - http_server_port is still hardcoded.

This is important. The port should be a random number. Usually, passing
0 as the port number makes the kernel choose an open one for you.

Having a randomised port is important to ensure that multiple runs don't
step on each other.
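
The port-0 behaviour can be seen with plain sockets. The following Python
sketch is only an illustration (it does not use Libmicrohttpd, and
`bind_ephemeral` is a hypothetical helper name): it binds to port 0 and then
asks the socket which port the kernel actually assigned, which is the value
a test harness would need to pass on to the client.

```python
import socket


def bind_ephemeral(host: str = "127.0.0.1") -> socket.socket:
    """Return a listening TCP socket on a kernel-chosen free port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))  # port 0: let the kernel pick an unused port
    srv.listen(1)
    return srv


srv = bind_ephemeral()
port = srv.getsockname()[1]  # the port that was actually assigned
print(f"test server listening on port {port}")
srv.close()
```

Two test runs that each bind this way get different ports while both
sockets are open, so they cannot collide on a hardcoded number.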

>   - In ahc_eco() of Libmicrohttpd, the URL data still uses static checks to
>     match the requested URLs. In other words, it's hardcoded. This needs to
>     be changed to a dynamic method to accommodate variadic data.
>   - HTTPS is not touched yet.
>   - What to do with the FTP and FTPS functions, since Libmicrohttpd only
>     provides service for HTTP? Do we need to keep the functions for FTP{s},
>     or remove them?

We keep the FTP code intact for now. If time permits, we should look
into different libraries that provide an FTP server in C and try to
integrate that into your test suite as well. But for now, it is out of
scope.

>   - The last check failed when the test tries to resolve a URL with a
>     question mark, e.g. "/subdir1/subpage1.html?query". When I debug, it
>     returns just "/subdir1/subpage1.html", so the result is 404 Not Found.
>     I also checked using the logging example source code provided in the
>     Libmicrohttpd tutorial [4]. When I access it using an HTTP client such
>     as Wget2 or Firefox, the result is still the same: the URL omits the
>     query part. I need to confirm with the Libmicrohttpd side whether this
>     is intended behaviour or not.
> * Make sure the whole test suite runs correctly.
>
> [1]: https://gitlab.com/dstw/wget2
> [2]: https://www.gnu.org/software/libmicrohttpd/manual/libmicrohttpd.html#microhttpd_002ddauth
> [3]: https://gitlab.com/dstw/wget2/tree/use-mhd
> [4]: https://www.gnu.org/software/libmicrohttpd/tutorial.html#logging_002ec
>
> Regards,
> Didik Setiawan



--
Thanking You,
Darshit Shah
PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6




Re: [Bug-wget] overwrite

2017-06-13 Thread Tim Rühsen
Hi Ansgar,


On 06/13/2017 09:40 AM, taschenberggr...@streber24.de wrote:
>Hi,
> 
>a common question online is how to properly force wget to make an
>overwrite of an existing file name.
> 
>The existing options are quite confusing and I am under the impression
>that even taking what works does not imply users understand what they
>do.
> 
>The background of my question is
>https://bugs.winehq.org/show_bug.cgi?id=43100
>and your section in the manual
> "When running Wget without ‘-N’, ‘-nc’, ‘-r’, or ‘-p’, downloading the same
> file in the same directory will result in the original copy of file being
> preserved and the second copy being named ‘file.1’. If that file is
> downloaded yet again, the third copy will be named ‘file.2’, and so on."

I admit this is pretty confusing. It has grown historically, and we won't
change it, so as not to break existing scripts etc.

>If I want to disable that default behaviour what option do I take? I go
>with -nc but I have no clue why I have to take n and c, and online
>recommendations vary from recommending n, c, and nc.
> 
>I am looking for something intuitive like:
>wget --overwrite[1] https://dl.winehq.org/wine-builds/Release.key

There are several possibilities. I personally prefer to make a backup of
existing files before downloading (just in case the download stops in the
middle and leaves me with a broken file). Moving the existing file out of
the way does that:

mv -f Release.key Release.key.bak
wget https://dl.winehq.org/wine-builds/Release.key

Wget can do this without an additional command, and even rotates the
backups up to an arbitrary number (see 'man wget'):

wget --backups=3 https://dl.winehq.org/wine-builds/Release.key


But if you still want to replace a file in place (not recommended), you can

wget -O Release.key https://dl.winehq.org/wine-builds/Release.key

or (basically the same)

wget -O- https://dl.winehq.org/wine-builds/Release.key > Release.key


The -nc just switches off saving multiple versions of the file (.1, .2,
...).

Wget is designed not to overwrite files easily, i.e. to prevent
accidental overwrites.
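
To make the backup rotation mentioned above concrete, here is a minimal
Python sketch. It only emulates the idea (existing file moves to .1, older
backups shift to .2, .3, ... and the oldest is dropped); it is not wget's
implementation, and `rotate_backups` and `save_with_backups` are
hypothetical names.

```python
import os


def rotate_backups(path: str, keep: int) -> None:
    """Shift path -> path.1 -> path.2 ..., keeping at most `keep` backups."""
    if keep < 1 or not os.path.exists(path):
        return
    # Start with the oldest slot so no backup is overwritten too early;
    # whatever sat in slot `keep` is replaced and therefore lost.
    for i in range(keep - 1, 0, -1):
        src = f"{path}.{i}"
        if os.path.exists(src):
            os.replace(src, f"{path}.{i + 1}")
    os.replace(path, f"{path}.1")


def save_with_backups(path: str, data: bytes, keep: int = 3) -> None:
    """Move any existing file out of the way, then write the new copy."""
    rotate_backups(path, keep)
    with open(path, "wb") as fh:
        fh.write(data)
```

If the write is interrupted, the previous version is still available as
the .1 file, which is the point of backing up before downloading.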

>Best,
>Ansgar

With Best Regards, Tim





[Bug-wget] overwrite

2017-06-13 Thread taschenberggroup
   Hi,

   a common question online is how to properly force wget to make an
   overwrite of an existing file name.

   The existing options are quite confusing and I am under the impression
   that even taking what works does not imply users understand what they
   do.

   The background of my question is
   https://bugs.winehq.org/show_bug.cgi?id=43100
   and your section in the manual
"When running Wget without ‘-N’, ‘-nc’, ‘-r’, or ‘-p’, downloading the same file
 in the same directory will result in the original copy of file being preserved
and the second copy being named ‘file.1’. If that file is downloaded yet again,
the third copy will be named ‘file.2’, and so on."

   If I want to disable that default behaviour what option do I take? I go
   with -nc but I have no clue why I have to take n and c, and online
   recommendations vary from recommending n, c, and nc.

   I am looking for something intuitive like:
   wget --overwrite[1] https://dl.winehq.org/wine-builds/Release.key

   Best,
   Ansgar

References

   1. https://dl.winehq.org/wine-builds/Release.key


Re: [Bug-wget] [GSoC Update] Week 2

2017-06-13 Thread Christian Grothoff
On 06/13/2017 05:42 AM, Didik Setiawan wrote:
> - The last check failed when the test tries to resolve a URL with a
>   question mark, e.g. "/subdir1/subpage1.html?query". When I debug, it
>   returns just "/subdir1/subpage1.html", so the result is 404 Not Found.
>   I also checked using the logging example source code provided in the
>   Libmicrohttpd tutorial [4]. When I access it using an HTTP client such
>   as Wget2 or Firefox, the result is still the same: the URL omits the
>   query part. I need to confirm with the Libmicrohttpd side whether this
>   is intended behaviour or not.

Yes, that's intended. For URL parameters/arguments you need to use
MHD_get_connection_values() with kind=MHD_GET_ARGUMENT_KIND to inspect them.
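
The split between path and arguments can be pictured with a small Python
analogy. This is not the Libmicrohttpd API, only an illustration using the
standard library, and `split_request_target` is a hypothetical name: the
handler is handed the path alone, and the query arguments have to be looked
up separately as key/value pairs.

```python
from urllib.parse import parse_qsl, urlsplit


def split_request_target(target: str):
    """Return (path, [(key, value), ...]) for a request target."""
    parts = urlsplit(target)
    # keep_blank_values=True keeps bare keys such as "?query" (empty value)
    args = parse_qsl(parts.query, keep_blank_values=True)
    return parts.path, args


path, args = split_request_target("/subdir1/subpage1.html?query")
# path is "/subdir1/subpage1.html"; args is [("query", "")]
```

A test server that compares only the path against
"/subdir1/subpage1.html?query" will therefore never match; it has to
compare the path and the arguments separately.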

Happy hacking!

Christian





Re: [Bug-wget] wget and srcset tag

2017-06-13 Thread Tim Rühsen
Fixed in current master (fix release will be 1.19.2).

Thanks for your report and help !


With Best Regards, Tim



On 06/12/2017 06:07 PM, Chris wrote:
> Hi Tim,
> 
> I just created a test page at
> https://www.anfractuosity.com/files/test2.html
> where I still get the issue.
> 
> The version is 'GNU Wget 1.19.1 built on linux-gnu.'
> 
> cheers
> Chris
> 
> 
> On 12 June 2017 at 15:35, Tim Rühsen  wrote:
> 
>> On 06/12/2017 10:27 AM, chris wrote:
>>> Hi Tim,
>>>
>>> Thanks for your reply, I notice the following in the debug logs:
>>>
>>> """
>>> will convert url
>>> http://www.anfractuosity.com/wp-content/uploads/2014/02/fsk.png to local
>>> site_output/fsk.png
>>> will convert url
>>> https://www.anfractuosity.com/wp-content/uploads/2014/02/fsk.png to local
>>> site_output/fsk.png.html
>>> """
>>>
>>> The difference between those URLs seems to be one is https and one isn't.
>>> When I wget those URLs though, both seem to return a .png, with 'Length:
>>> 51068 (50K) [image/png]'.
>>>
>>> So I'm a bit confused why I get the fsk.png.html URL.
>>
>> What version of wget are you using ? (1.19.1 here)
>>
>> I tried some combinations of srcset (with https and http) and your
>> original options. I thought of an issue with redirection (because that's
>> an answer with text/html Content-Type).
>>
>> Could you create a small reproducer page ? e.g. like
>> 
>> > srcset="https://www.anfractuosity.com/wp-content/uploads/2014/02/fsk.png
>> 533w,
>> http://www.anfractuosity.com/wp-content/uploads/2014/02/fsk-266x300.png
>> 266w">
>> 
>>
>> With whatever paths you are using for the .png files.
>> I don't want to download tons of files (limited bandwidth here).
>>
>>> cheers
>>> Chris
>>>
>>> On Mon, Jun 12, 2017 at 9:08 AM, Tim Rühsen  wrote:
>>>
>>>> Hi Chris,
>>>>
>>>>
>>>> On 06/11/2017 05:24 PM, chris wrote:
>>>>> Hi,
>>>>>
>>>>> I'm just wondering if I've possibly found a bug, unless I'm just doing
>>>>> something incorrectly (which I assume is more likely).
>>>>>
>>>>> I grab my webpage using 'wget -T1 -t1 -E -k -H -nd -N -p -P site_output
>>>>> https://www.anfractuosity.com/projects/ultrasound-networking/ > note1 2> note2'
>>>>>
>>>>> But i notice the srcset tags in the resulting downloaded files produce
>>>>> 'srcset="fsk.png.html 533w, fsk-266x300.png 266w" sizes="(max-width: 533px)
>>>>> 100vw, 533px" />' in the output index.html.
>>>>>
>>>>> On the actual webpage it looks like "srcset="
>>>>> https://www.anfractuosity.com/wp-content/uploads/2014/02/fft.png 762w,"
>>>>> no .html extension on the .png.
>>>>
>>>> You requested -E (--adjust-extension) and -k (--convert-links).
>>>> That would change the file name when the server tags the file as
>>>> content-type 'text/html'. You could see that in the debug output
>>>> (options -d or --debug).
>>>>
>>>>>
>>>>> Cheers
>>>>> Chris
>>>>>
>>>>
>>>> With Best Regards, Tim
>>
>>
> 


