Hi, I have a problem when using --convert-links (-k) on a UTF-8-encoded web page.

How to reproduce:

    wget -k --restrict-file-names=nocontrol http://ja.wikipedia.org/wiki/%E3%81%A4%E3%81%8B%E3%81%93%E3%81%86%E3%81%B8%E3%81%84

(This is a Japanese wiki page; the file name is UTF-8.)

To check the UTF-8 sequence:

    iconv -f utf-8 -t utf-8 [downloaded file; name replaced for a non-UTF-8 environment] >/dev/null
    iconv: illegal input sequence at position 77822

(Alternatively, opening the file with gedit shows the corruption.)

Without the -k option the downloaded file is not broken. The corruption usually appears near the end of the file, typically as only one or two illegal UTF-8 bytes, and some data is missing around them. The inserted bytes are typically 0xe3 or 0xe3 0x83, but not only these. Whether the problem occurs depends on the input file: around 20% of the Japanese wiki pages I tried show it, while it has not happened on English or German wiki pages so far.
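For reference, here is a small Python sketch of the same check iconv performs; it reports the byte offset of the first illegal UTF-8 sequence (file names are taken from the command line):

    # utf8check.py - report the first illegal UTF-8 sequence in a file,
    # roughly equivalent to: iconv -f utf-8 -t utf-8 FILE >/dev/null
    import sys

    def check(path):
        with open(path, "rb") as f:
            data = f.read()
        try:
            data.decode("utf-8")
            print("%s: OK" % path)
        except UnicodeDecodeError as err:
            # err.start is the byte offset of the offending sequence,
            # comparable to iconv's "illegal input sequence at position N".
            print("%s: illegal sequence at byte %d: %s"
                  % (path, err.start, data[err.start:err.start + 4].hex()))

    if __name__ == "__main__":
        for name in sys.argv[1:]:
            check(name)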
I have not yet tried wget 1.13, and I could not find any related information on the web. I looked at convert.c, but I am not familiar with the code. The missing data is critical for me, so for now I am considering downloading the files without -k and converting the links with my own program.
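My current idea looks roughly like this. It is only a sketch: url_to_local is a placeholder mapping (the real one would have to be built from wget's output), and the regular expression only handles double-quoted href/src attributes. The point is to decode and re-encode the whole file as UTF-8, so a rewrite can never split a multi-byte sequence:

    # convertlinks.py - rough sketch of an external link converter.
    import re

    # Placeholder: maps remote URLs to local file names; a real run
    # would build this from wget's log or the download directory.
    url_to_local = {
        "http://ja.wikipedia.org/wiki/Example": "Example.html",
    }

    def convert(path):
        # Decode the whole file before editing, so that a rewrite can
        # never land in the middle of a multi-byte sequence -- the
        # failure mode I am seeing with -k.
        with open(path, "rb") as f:
            text = f.read().decode("utf-8")

        def repl(m):
            return m.group(1) + url_to_local.get(m.group(2), m.group(2)) + m.group(3)

        # Rewrite only double-quoted href="..." and src="..." attributes.
        text = re.sub(r'((?:href|src)=")([^"]*)(")', repl, text)
        with open(path + ".conv", "wb") as f:
            f.write(text.encode("utf-8"))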
Any hint is appreciated. Thank you!

---
hitoshi@hitoshi-VirtualBox[91]bash % wget --version
GNU Wget 1.12 built on linux-gnu.

+digest +ipv6 +nls +ntlm +opie +md5/openssl +https -gnutls +openssl -iri

Wgetrc: /etc/wgetrc (system)
Locale: /usr/share/locale
Compile: gcc -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/etc/wgetrc"
    -DLOCALEDIR="/usr/share/locale" -I. -I../lib -g -O2 -DNO_SSLv2
    -D_FILE_OFFSET_BITS=64 -O2 -g -Wall
Link: gcc -g -O2 -DNO_SSLv2 -D_FILE_OFFSET_BITS=64 -O2 -g -Wall
    -Wl,-Bsymbolic-functions /usr/lib/libssl.so /usr/lib/libcrypto.so
    -ldl -lrt ftp-opie.o openssl.o http-ntlm.o gen-md5.o ../lib/libgnu.a

---
Hitoshi Yamauchi