On 09/06/12 21:03, Micah Cowan wrote:
> Could you attach an example of the broken file contents? ...the full
> file itself is perhaps a bit large to attach in a mailing list (~85k?),
> but perhaps you could use a pastebin, or otherwise throw it up on a
> server, or just post a snippet that illustrates exactly what sort of
> corruption is taking place in your setup.
>
> Good luck,
> -mjc

That wikipedia page hasn't been edited since April 7th, so we are all
probably working with the same content.

These are the md5sums of the files I worked with:

6d887f5796a00a24e8fb284d6f78791c  Without-k
341611e10271ffa117f873a56a467960  With-k

Hitoshi, if the md5 of the corrupted file is 3416... then I missed the
corruption. A simple wget should be 6d88... though.

A fragment of the relevant bytes (e.g. hexdump -C) from both the original
and the transformed (broken) file could be enough to find the cause.

The latest big change to convert.c was the CSS wonder-patch of 2008,
available in 1.12, so there shouldn't be any difference in the conversion
with the latest one. Still, I built and tried with
ftp://ftp.gnu.org/gnu/wget/wget-1.12.tar.bz2

I did find an interesting issue. Where the file converted with current
wget shows:

<!-- logo -->
<div id="p-logo"><a style="background-image: url(http://upload.wikimedia.org/...
<!-- /logo -->
<!-- navigation -->
<div class="portal" id='p-navigation'>
<h5>...
<div class="body">
<ul>
<li id="n-mainpage">...
<li id="n-portal">...
<li id="n-currentevents">...
<li id="n-newpages">...
<li id="n-recentchanges">...

The one converted with 1.12 shows:

<!-- panel -->
<div id="mw-panel" class="noprint">
<!-- logo -->
<div id="p-logo"><a style="background-image: url(//upload.wikimedia.org/....
<!-- /logo -->
<!-- navigation -->
<div class="portal" id='p-navigation'>
<h5>...
<div class="body">
<ul>
<li id="n-mainpage">...
<li id="n-portal">...
<li id="n-currentevents"><a href="/wiki/Portal:http://upload.wikimedia...
<!-- /logo -->
<!-- navigation -->
<div class="portal" id='p-navigation'>
<h5>...
<div class="body">
<ul>
<li id="n-mainpage"><a href="http://ja.wikipedia.org/wiki/...
<li id="n-portal"><a href="http://ja.wikipedia.org/wiki/....
<li id="n-currentevents"><a href="http://ja.wikipedia.org/wiki/Portal:%E6%9C%80%E8...
<li id="n-newpages"><a href="http://ja.wikipedia.org/...
<li id="n-recentchanges"><a href="http://ja.wikipedia.org/...

In summary, the protocol-relative link is not converted inside the inline
CSS (not a big bug), then the following 9 lines of the unconverted file
are copied, followed by the rest of the converted file, including those
9 lines again. On a different fetch, I get a slightly differently
corrupted file along the same lines. It is likely that, depending on how
the pieces happened to be copied, the UTF-8 bytes became invalid.

So there was indeed a bug in 1.12 link conversion, which seems to have
been fixed in the meantime.
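
If it helps anyone locate the damage in their own copy, here is a rough Python sketch (not part of wget, and not how I produced the output above). It reports the offset of the first invalid UTF-8 sequence in a file and prints the surrounding bytes, loosely in the style of hexdump -C. The function names first_invalid_utf8, hexdump and report are made up for this illustration, and it assumes the page is supposed to be UTF-8.

```python
def first_invalid_utf8(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start  # offset where decoding failed
    return None


def hexdump(data: bytes, start: int = 0, length: int = 64) -> str:
    """Format a slice of data roughly like `hexdump -C` (16 bytes per row)."""
    lines = []
    begin = start - (start % 16)  # align the first row to a 16-byte boundary
    end = min(len(data), start + length)
    for off in range(begin, end, 16):
        row = data[off:off + 16]
        hexpart = " ".join(f"{b:02x}" for b in row)
        # Show printable ASCII as-is and everything else as a dot.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in row)
        lines.append(f"{off:08x}  {hexpart:<47}  |{text}|")
    return "\n".join(lines)


def report(path: str, context: int = 32) -> str:
    """Describe where (if anywhere) the file at `path` stops being valid UTF-8."""
    with open(path, "rb") as fh:
        blob = fh.read()
    bad = first_invalid_utf8(blob)
    if bad is None:
        return f"{path}: valid UTF-8"
    dump = hexdump(blob, max(bad - context, 0), 3 * context)
    return f"{path}: first invalid byte at offset {bad}\n{dump}"
```

Calling print(report("With-k")) on the converted file (the file name is just the label I used above) would show whether a multi-byte character was cut where two copied pieces meet. Note that it only catches splices that break a character; a duplicated block that happens to split on ASCII would still decode cleanly.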
