On Monday, 6 February 2017 19:02:16 CET William Prescott wrote:
> Thanks for the responses.
>
> Indeed, that seems to be the case: Shift JIS replaces ASCII \ and ~
> with ¥ and ‾, respectively (with exceptions as per Andries' message).
>
> In addition, RFC 3987 (Internationalized Resource Identifiers (IRIs))
> section 6.3 states that:
> "In cases where the document as a whole has a
> native character encoding, IRIs MUST also be encoded in this
> character encoding and converted accordingly by a parser or
> interpreter."
> This would make it seem that the observed behavior in Wget is correct and
> that the document is faulty.
>
> I would also like to note that, even when the document's links don't
> contain a tilde, Wget will still fail to fetch the pages as long as there
> is a tilde in the URL that Wget was called with.
Hi William,

your locale is UTF-8, so copying and pasting a URL from the original
document does not perform the Shift JIS to UTF-8 conversion.

If your editor (or text viewer) is locale/charset aware (e.g. here on KDE I
use kate and can manually tell it that the charset encoding of the viewed
document is 'sjis'), set it to the right encoding and then try copy&paste.

Another way would be to translate your string from Shift JIS to UTF-8 as I
did in my example, like

  $ wget "$(echo 'http://domain.jp/~withtilde' | iconv -f SHIFT-JIS -t utf-8)"

Or you translate your whole document to UTF-8 with that trick, like

  $ iconv -f SHIFT-JIS -t utf-8 < shiftjis_text.html > utf8_text.html

Now you should be able to copy&paste URLs from that document.

Ah yes, that only works on Unix/Linux/BSD systems.

Regards, Tim

> On Mon, Feb 6, 2017 at 6:29 PM, Andries E. Brouwer <andries.brou...@cwi.nl> wrote:
> > On Mon, Feb 06, 2017 at 10:55:32PM +0100, Tim Rühsen wrote:
> >> On Monday, 6 February 2017 05:02:57 CET William Prescott wrote:
> >> > Hello,
> >> >
> >> > I'm encountering a problem when recursively downloading from a website
> >> > when the URL contains a tilde and the page encoding claims to be
> >> > Shift JIS.
> >> >
> >> > I've tried both Wget 1.17.1 (from Ubuntu 16.04) and 1.19 (from source,
> >> > with Libidn2 0.16).
> >> > I believe my local character encoding is UTF-8.
> >> >
> >> > The first page will download okay, but then most pages after it will
> >> > get the tilde converted to "%E2%80%BE" ("‾"), which, as one would
> >> > expect, doesn't work.
> >>
> >> Hi William,
> >>
> >> reproducible by:
> >>
> >> $ echo '~'|iconv -f SHIFT-JIS -t utf-8
> >> ‾
> >>
> >> $ echo -n '~'|iconv -f SHIFT-JIS -t utf-8|od -t x1
> >> 0000000 e2 80 be
> >>
> >> So this seems not to be a Wget issue, but maybe a general character
> >> conversion issue. Not sure what Wget could do...
> >>
> >> Regards, Tim
> >
> > Shift JIS is not a single well-defined character set.
> > There are x-sjis-unicode, x-sjis-cp932, x-sjis-jisx0221, x-sjis-jdk117
> > that all are called "shift-jis", and are subtly different.
> > See also https://www.w3.org/TR/japanese-xml/#sjis .
> >
> > SJIS and CP932 (the "Microsoft version of SJIS") are almost identical,
> > and CP932 does contain a tilde.
> >
> > Java did (does?) treat SJIS 5c and 7e as ASCII 5c and 7e.
> > The docs say "This is in keeping with standard industry practice within
> > Japan."
> >
> > Can wget use a fallback? Use the given bytes converted from SJIS.
> > When that fails use these bytes converted from CP932 (if different).
> > When that fails use these bytes without any conversion?
> >
> > It looks like
> > http://seesaawiki.jp/w/kou1okada/d/wget%20-%20troubleshooting
> > describes the same problem. Three successful suggestions are given there
> > (for wget 1.13.4): (i) Give one of ASCII, EUC-JP or UTF-8 with the
> > --remote-encoding option, (ii) Give the --no-iri option, (iii) Export
> > LANG=C.
> >
> > Andries
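
The fallback order Andries suggests above (SJIS, then CP932, then the raw
bytes) could be sketched roughly as follows. This is only an illustration in
Python, not anything Wget does: the function name url_candidates is
hypothetical, and the strict SJIS reading is emulated by decoding with
Python's cp932 codec (which keeps bytes 5c and 7e as ASCII) and then mapping
\ and ~ to ¥ and ‾ by hand.

```python
from urllib.parse import quote

# Strict (JIS X 0201) reading of the two bytes where Shift JIS and ASCII
# differ: 0x5c is YEN SIGN (U+00A5), 0x7e is OVERLINE (U+203E).
STRICT_SJIS = str.maketrans({"\\": "\u00a5", "~": "\u203e"})

# Keep the URL delimiters and the tilde unescaped when percent-encoding.
SAFE = ":/~"


def url_candidates(raw: bytes) -> list:
    """Return percent-encoded URLs to try, in the suggested fallback order.

    Hypothetical helper: raw holds the URL bytes exactly as they appear in
    a document labelled Shift JIS. Raises UnicodeDecodeError if the bytes
    are not valid CP932.
    """
    as_cp932 = raw.decode("cp932")                # 0x5c and 0x7e stay ASCII
    as_strict = as_cp932.translate(STRICT_SJIS)   # emulate the strict reading
    ordered = [
        quote(as_strict, safe=SAFE),  # 1. bytes converted from SJIS
        quote(as_cp932, safe=SAFE),   # 2. bytes converted from CP932
        quote(raw, safe=SAFE),        # 3. bytes without any conversion
    ]
    # Drop duplicates while keeping the order of first appearance.
    return list(dict.fromkeys(ordered))


if __name__ == "__main__":
    for url in url_candidates(b"http://domain.jp/~withtilde"):
        print(url)
```

For the example URL the first candidate contains %E2%80%BE (the failing
request reported in this thread) and the second one keeps the plain tilde, so
a client that tried the candidates in turn would succeed on the second
attempt.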