Re: [Bug-wget] Problem downloading with RIGHT SINGLE QUOTATION MARK (U+2019) in filename

2019-10-11 Thread Tim Rühsen
On 11.10.19 11:07, Eli Zaretskii wrote:
>> From: Cameron Tacklind 
>> Date: Thu, 10 Oct 2019 20:31:02 -0700
>>
>> The error is pretty clearly an encoding conversion issue, going from UTF-8,
>> assumed to be CP1252, converting into UTF-8, which becomes wrong.
> 
> I think you need to tell Wget that the page encoding is UTF-8, by
> using the --remote-encoding switch.  Did you try that?
> 

Cameron's html file contains a 'meta' tag with attribute
'charset=utf-8'. So wget should detect it and convert the URL correctly.
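
As a rough illustration of that expectation (this is not wget's actual code), the Python sketch below reads the charset from a 'meta' tag and percent-encodes the link using that charset. The sample markup, the class name LinkScanner and the helper links_from are assumptions made up for this example; the real test file is not reproduced here.

```python
from html.parser import HTMLParser
from urllib.parse import quote, urljoin


class LinkScanner(HTMLParser):
    """Collects the <meta charset=...> value and every <a href=...>."""

    def __init__(self):
        super().__init__()
        self.charset = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("charset"):
            self.charset = attrs["charset"].lower()
        elif tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])


# Hypothetical stand-in for the test page: one link whose target is U+2019.
SAMPLE = (
    '<html><head><meta charset="utf-8">'
    "<title>RIGHT SINGLE QUOTE TEST</title></head>"
    '<body><a href="\u2019">test</a></body></html>'
)


def links_from(raw: bytes, base: str, fallback: str = "cp1252"):
    # First pass: tag names and the charset attribute are plain ASCII,
    # so a lossy decode is enough to find the declared encoding.
    probe = LinkScanner()
    probe.feed(raw.decode("ascii", errors="replace"))
    encoding = probe.charset or fallback

    # Second pass: decode with the declared encoding, then percent-encode
    # each link in that same encoding before resolving it against the base.
    scanner = LinkScanner()
    scanner.feed(raw.decode(encoding))
    urls = [
        urljoin(base, quote(href, safe="/%", encoding=encoding))
        for href in scanner.hrefs
    ]
    return encoding, urls


encoding, urls = links_from(SAMPLE.encode("utf-8"), "http://localhost/quote.html")
print(encoding, urls)
```

With the charset honoured, the link resolves to the same /%E2%80%99 path that appears in the request further down.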

And I can confirm that wget is working properly here. My version is
1.20.3 and I am working on Linux.

I put this file onto my local apache web server and named it quote.html:



RIGHT SINGLE QUOTE TEST

test


My command line is
  wget -d -r http://localhost/quote.html

Output is
...
Decided to load it.
URI encoding = »utf-8«
Enqueuing http://localhost/%E2%80%99 at depth 1
Queue count 1, maxcount 1.
[IRI Enqueuing »http://localhost/%E2%80%99« with »utf-8«
Dequeuing http://localhost/%E2%80%99 at depth 1
Queue count 0, maxcount 1.
Converted file name 'localhost/’' (UTF-8) -> 'localhost/’' (UTF-8)
--2019-10-11 18:06:21--  http://localhost/%E2%80%99
...
---request begin---
GET /%E2%80%99 HTTP/1.1
Referer: http://localhost/quote.html
User-Agent: Wget/1.20.3 (linux-gnu)
Accept: */*
Accept-Encoding: identity
Host: localhost
Connection: Keep-Alive

---request end---
...


@Cameron: Your wget version seems ok, so I am a bit clueless right now...

Could you give me the output of 'wget --version'?
Could you test in the same way as I did above to see whether that is
reproducible for you?

Regards, Tim





Re: [Bug-wget] Problem downloading with RIGHT SINGLE QUOTATION MARK (U+2019) in filename

2019-10-11 Thread Eli Zaretskii
> From: Cameron Tacklind 
> Date: Thu, 10 Oct 2019 20:31:02 -0700
> 
> The error is pretty clearly an encoding conversion issue, going from UTF-8,
> assumed to be CP1252, converting into UTF-8, which becomes wrong.

I think you need to tell Wget that the page encoding is UTF-8, by
using the --remote-encoding switch.  Did you try that?



[Bug-wget] Problem downloading with RIGHT SINGLE QUOTATION MARK (U+2019) in filename

2019-10-10 Thread Cameron Tacklind
Hello,

I think I've found a bug with wget.

I originally came across this problem when recursively downloading folders
that were presented by nginx's fancy-index module. Sometimes a filename
would include a "’" [RIGHT SINGLE QUOTATION MARK (U+2019)] and wget would
always get a 404 error when downloading the file.

Downloading this simple html file (simplified output of nginx fancy-index)
shows the error:


RIGHT SINGLE QUOTE TEST

test


Full command line (Windows cmd.exe)
wget -d --no-verbose --tries 0
 --continue --show-progress --wait 0.1 --waitretry 5
 -e robots=off --rejected-log=rejected.log --recursive --level inf --reject
"index.html*,jpg,png,zip"
 --no-parent --no-host-directories --auth-no-challenge --user xxx
--password xxx -P output_dir
https://mydomain.com/test/

Debug Output:
DEBUG output created by Wget 1.20.3 on mingw32.

Reading HSTS entries from C:\ProgramData\chocolatey\lib\Wget\tools/.wget-hsts
URI encoding = 'CP1252'
iconv UTF-8 -> CP1252
iconv outlen=60 inlen=30
converted 'https://mydomain.com/test/' (CP1252) -> '
https://mydomain.com/test/' (UTF-8)
URI encoding = 'CP1252'
Enqueuing https://mydomain.com/test/ at depth 0
Queue count 1, maxcount 1.
[IRI Enqueuing 'https://mydomain.com/test/' with 'CP1252'
Dequeuing https://mydomain.com/test/ at depth 0
Queue count 0, maxcount 1.
iconv UTF-8 -> CP1252
iconv outlen=60 inlen=30
converted 'https://mydomain.com/test/' (CP1252) -> '
https://mydomain.com/test/' (UTF-8)
Converted file name 'test/index.html' (UTF-8) -> 'test/index.html' (CP1252)
Auth-without-challenge set, sending Basic credentials.
seconds 0.00, Caching mydomain.com => my.ip.add.ress
seconds 0.00, Created socket 4.
Releasing 0x00b3bf60 (new refcount 1).
Initiating SSL handshake.
seconds 900.00, Winsock error: 0
Handshake successful; connected socket 4 to SSL handle 0x00b52260
certificate:
  subject: CN=mydomain.com
  issuer:  CN=Let's Encrypt Authority X3,O=Let's Encrypt,C=US
X509 certificate successfully verified and matches host mydomain.com

---request begin---
GET /test/ HTTP/1.1

User-Agent: Wget/1.20.3 (mingw32)

Accept: */*

Accept-Encoding: identity

Authorization: Basic 

Host: mydomain.com

Connection: Keep-Alive



---request end---
seconds 900.00, Winsock error: 0

---response begin---
HTTP/1.1 200 OK

Server: nginx/1.14.1

Date: Fri, 11 Oct 2019 02:17:57 GMT

Content-Type: text/html

Content-Length: 185

Last-Modified: Fri, 11 Oct 2019 02:17:52 GMT

Connection: keep-alive

Keep-Alive: timeout=20

ETag: "5d9fe650-b9"

Accept-Ranges: bytes



---response end---
Registered socket 4 for persistent reuse.
seconds 900.00, Winsock error: 0

 0K   100%  282K=0
.001s2019-10-10 19:17:01 URL:https://mydomain.com/test/ [185/185] ->
"E:/test/poops/test/index.html.tmp" [1]
Loaded E:/test/poops/test/index.html.tmp (size 185).
URI encoding = 'CP1252'
E:/test/poops/test/index.html.tmp: merge('https://mydomain.com/test/',
'%E2%80%99') -> https://mydomain.com/test/%E2%80%99
iconv UTF-8 -> CP1252
iconv outlen=66 inlen=33
converted 'https://mydomain.com/test/%E2%80%99' (CP1252) -> '
https://mydomain.com/test/’' (UTF-8)
appending 'https://mydomain.com/test/%C3%A2%E2%82%AC%E2%84%A2' to urlpos.
URI content encoding = 'utf-8'
no-follow in E:/test/poops/test/index.html.tmp: 0
Deciding whether to enqueue "https://mydomain.com/test/%C3%A2%E2%82%AC%E2%84%A2".
Decided to load it.
URI encoding = 'utf-8'
Enqueuing https://mydomain.com/test/%C3%A2%E2%82%AC%E2%84%A2 at depth 1
Queue count 1, maxcount 1.
[IRI Enqueuing 'https://mydomain.com/test/%C3%A2%E2%82%AC%E2%84%A2' with
'utf-8'
Removing file due to recursive rejection criteria in recursive_retrieve():
Dequeuing https://mydomain.com/test/%C3%A2%E2%82%AC%E2%84%A2 at depth 1
Queue count 0, maxcount 1.
Converted file name 'test/’' (UTF-8) -> 'test/’' (CP1252)
Auth-without-challenge set, sending Basic credentials.
Reusing fd 4.

---request begin---
GET /test/%C3%A2%E2%82%AC%E2%84%A2 HTTP/1.1

Referer: https://mydomain.com/test/

User-Agent: Wget/1.20.3 (mingw32)

Accept: */*

Accept-Encoding: identity

Authorization: Basic 

Host: mydomain.com

Connection: Keep-Alive



---request end---
seconds 900.00, Winsock error: 0

---response begin---
HTTP/1.1 404 Not Found

Server: nginx/1.14.1

Date: Fri, 11 Oct 2019 02:17:58 GMT

Content-Type: text/html

Content-Length: 169

Connection: keep-alive

Keep-Alive: timeout=20



---response end---
Skipping 169 bytes of body: [seconds 900.00, Winsock error: 0


404 Not Found



404 Not Found

nginx/1.14.1





] done.
https://mydomain.com/test/%C3%A2%E2%82%AC%E2%84%A2:
2019-10-10 19:17:02 ERROR 404: Not Found.
FINISHED --2019-10-10 19:17:02--
Total wall clock time: 1.6s
Downloaded: 1 files, 185 in 0.001s (282 KB/s)

The error is pretty clearly an encoding conversion issue, going from UTF-8,
assumed to be CP1252, converting into UTF-8, which becomes wrong. This is
nicely described at the end of this page:
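
Separately from that page, the round trip seen in the debug log can be reproduced outside wget. The Python sketch below is only an illustration of the suspected mechanism, not wget's code; the helper name misread_as is made up for this example. It takes the UTF-8 bytes of U+2019, misreads them as CP1252, and percent-encodes the result as UTF-8 again.

```python
from urllib.parse import quote


def misread_as(text: str, assumed: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong codec."""
    return text.encode("utf-8").decode(assumed)


RSQUO = "\u2019"  # RIGHT SINGLE QUOTATION MARK

correct = quote(RSQUO)        # what the server expects
mojibake = misread_as(RSQUO)  # E2 80 99 read as three CP1252 characters
broken = quote(mojibake)      # those three characters re-encoded as UTF-8

print(correct)  # %E2%80%99
print(broken)   # %C3%A2%E2%82%AC%E2%84%A2
```

The second value matches the path in the failing GET request above, which is consistent with the URL having been converted once more than it should have been.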