On 08/06/12 18:26, [email protected] wrote: > Hi, > > I have a problem when using --convert-links (-k) on a utf-8 encoded web page. > > How to reproduce is: > > wget -k --restrict-file-names=nocontrol > http://ja.wikipedia.org/wiki/%E3%81%A4%E3%81%8B%E3%81%93%E3%81%86%E3%81%B8%E3%81%84 > (This is a Japanese wiki page.) > > The file name is utf-8. To check the utf-8 sequence. > > iconv -f utf-8 -t utf-8 [downloadedfile(replaced for non-utf-8 env)] >> /dev/null > iconv: illegal input sequence at position 77822 > (or open with gedit show the corruption.) > > If I don't have -k option, there is no broken file. This usually happens > near end of the file. Typically only one or two bytes illegal utf-8 > characters. And at near the illegal characters, some of the data is > missing. Added illegal characters are typically 0xe3, or 0xe383, but not > limited to. This problem happens depends on the input file, around 20% of > Japanese wiki pages show this problem. > > I have not yet tried wget 1.13 and I could not find any regarding > information on the web. I looked up the convert.c, but, I am not familiar > with the code. I'm not seeing that error (wget 1.13.4).
I ran:

> wget http://ja.wikipedia.org/wiki/%E3%81%A4%E3%81%8B%E3%81%93%E3%81%86%E3%81%B8%E3%81%84 -O Without-k
> wget -k http://ja.wikipedia.org/wiki/%E3%81%A4%E3%81%8B%E3%81%93%E3%81%86%E3%81%B8%E3%81%84 -O With-k

A comparison of the two files shows only the expected changes. (I did find that it converts <a href="#cite_ref-0"> to <a href="With-k#cite_ref-0">, which is unneeded, but that would be a separate bugfix.)

The iconv conversion doesn't show any error either:

> iconv -f utf-8 -t utf-8 < With-k > /dev/null
> iconv -f utf-8 -t utf-8 < Without-k > /dev/null
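On the side issue of <a href="#cite_ref-0"> being rewritten to <a href="With-k#cite_ref-0">: a fragment-only href already points into the current document, so -k arguably need not touch it. A hypothetical sketch of the distinction (this is an illustration, not wget's actual convert.c logic), using Python's URL parsing:

```python
from urllib.parse import urlsplit

def is_fragment_only(href: str) -> bool:
    """True for links like "#cite_ref-0" that only name an anchor
    within the current page; rewriting such links is unnecessary."""
    parts = urlsplit(href)
    return (not parts.scheme and not parts.netloc
            and not parts.path and not parts.query
            and bool(parts.fragment))
```

Under this test, "#cite_ref-0" would be left alone, while "With-k#cite_ref-0" or a full URL would still be candidates for conversion.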
