Thanks for the responses. Indeed, that seems to be the case: Shift JIS replaces ASCII \ and ~ with ¥ and ‾, respectively (with exceptions as per Andries' message).
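To make the failure concrete, here is a minimal Python sketch of how the broken URL could arise. It assumes a converter that reads the two affected bytes strictly as JIS X 0201 and then percent-encodes the result as UTF-8. The `STRICT_SJIS_LOW` table and the `decode_strict_sjis_ascii` helper are hypothetical names written only for this illustration; they are not Wget or iconv code.

```python
from urllib.parse import quote

# Hypothetical table for this sketch: the two ASCII positions that a strict
# Shift JIS (JIS X 0201) reading assigns to different characters.
STRICT_SJIS_LOW = {0x5C: "\u00a5", 0x7E: "\u203e"}  # YEN SIGN, OVERLINE


def decode_strict_sjis_ascii(raw: bytes) -> str:
    """Decode bytes below 0x80 the way a strict Shift JIS converter would."""
    out = []
    for b in raw:
        if b >= 0x80:
            raise ValueError("sketch only handles single bytes below 0x80")
        out.append(STRICT_SJIS_LOW.get(b, chr(b)))
    return "".join(out)


# A path containing a tilde, as found in a page declared as Shift JIS.
path = decode_strict_sjis_ascii(b"/~user/index.html")

# Percent-encode the converted path as UTF-8, leaving "/" untouched.
encoded = quote(path, safe="/")
print(encoded)  # /%E2%80%BEuser/index.html
```

The overline (U+203E) is three bytes in UTF-8, which is where the "%E2%80%BE" in the failing requests comes from.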
In addition, RFC 3987 (Internationalized Resource Identifiers (IRIs)) section 6.3 states: "In cases where the document as a whole has a native character encoding, IRIs MUST also be encoded in this character encoding and converted accordingly by a parser or interpreter." This suggests that the observed behavior in Wget is correct and that the document is faulty.

I would also like to note that, even when the document's links don't contain a tilde, Wget will still fail to fetch the pages as long as there is a tilde in the URL that Wget was called with.

Best regards,
William Prescott

On Mon, Feb 6, 2017 at 6:29 PM, Andries E. Brouwer <andries.brou...@cwi.nl> wrote:
> On Mon, Feb 06, 2017 at 10:55:32PM +0100, Tim Rühsen wrote:
>> On Monday, 6 February 2017 05:02:57 CET William Prescott wrote:
>> > Hello,
>> >
>> > I'm encountering a problem when recursively downloading from a website when
>> > the URL contains a tilde and the page encoding claims to be Shift JIS.
>> >
>> > I've tried both Wget 1.17.1 (from Ubuntu 16.04) and 1.19 (from source,
>> > with Libidn2 0.16).
>> > I believe my local character encoding is UTF-8.
>> >
>> > The first page will download okay, but then most pages after it will get
>> > the tilde converted to "%E2%80%BE" ("‾"), which, as one would expect,
>> > doesn't work.
>>
>> Hi William,
>>
>> reproducible by:
>>
>> $ echo '~'|iconv -f SHIFT-JIS -t utf-8
>> ‾
>>
>> $ echo -n '~'|iconv -f SHIFT-JIS -t utf-8|od -t x1
>> 0000000 e2 80 be
>>
>> So this seems not to be a Wget issue, but maybe a general character conversion
>> issue. Not sure what Wget could do...
>>
>> Regards, Tim
>
>
> Shift JIS is not a single well-defined character set.
> There are x-sjis-unicode, x-sjis-cp932, x-sjis-jisx0221, x-sjis-jdk117
> that all are called "shift-jis", and are subtly different.
> See also https://www.w3.org/TR/japanese-xml/#sjis .
>
>
> SJIS and CP932 (the "Microsoft version of SJIS") are almost identical,
> and CP932 does contain a tilde.
>
> Java did (does?) treat SJIS 5c and 7e as ASCII 5c and 7e.
> The docs say "This is in keeping with standard industry practice within
> Japan."
>
> Can wget use a fallback? Use the given bytes converted from SJIS.
> When that fails use these bytes converted from CP932 (if different).
> When that fails use these bytes without any conversion?
>
>
> It looks like http://seesaawiki.jp/w/kou1okada/d/wget%20-%20troubleshooting
> describes the same problem. There, three successful suggestions are given
> (for wget 1.13.4): (i) Give one of ASCII, EUC-JP or UTF-8 with the
> --remote-encoding option, (ii) Give the --no-iri option, (iii) Export LANG=C.
>
> Andries
>
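The fallback order suggested in the quoted message could be sketched roughly as follows. This is only an illustration in Python, not a proposal for Wget's actual code: `first_working_url`, `decoders`, and `fetch_ok` are hypothetical names, and the decoders are passed in as plain functions because a given library's "shift_jis" converter may or may not map 0x7e to an overline.

```python
from typing import Callable, Optional, Sequence


def first_working_url(
    raw: bytes,
    decoders: Sequence[Callable[[bytes], str]],
    fetch_ok: Callable[[str], bool],
) -> Optional[str]:
    """Try each decoding in order, then the bytes without any conversion."""
    candidates = []
    for decode in decoders:
        try:
            candidates.append(decode(raw))
        except UnicodeDecodeError:
            continue  # this interpretation cannot represent the bytes
    # Last resort: keep every byte as the code point with the same value.
    candidates.append(raw.decode("latin-1"))

    tried = set()
    for url in candidates:
        if url in tried:
            continue  # "if different": do not retry an identical result
        tried.add(url)
        if fetch_ok(url):
            return url
    return None


# Usage with stand-in decoders: a strict reading that turns "~" into an
# overline, followed by one that leaves ASCII alone.
strict = lambda b: b.decode("ascii").replace("~", "\u203e")
plain = lambda b: b.decode("ascii")
result = first_working_url(b"/~user/", [strict, plain], lambda u: "\u203e" not in u)
print(result)  # /~user/
```

In a real client, `fetch_ok` would be an actual request, so each failed candidate costs an extra round trip to the server.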