On Saturday, 27 December 2014, 13:57:21, Tim Rühsen wrote:
> On Saturday, 27 December 2014, 10:39:25, Eli Zaretskii wrote:
> > > From: Tim Rühsen <[email protected]>
> > > Date: Thu, 25 Dec 2014 15:43:27 +0100
> > >
> > > > FAIL: Test-idn-headers.px
> > > > FAIL: Test-idn-meta.px
> > > >
> > > > These use EUC_JP encoded file name, but do not state
> > > > --local-encoding on the wget command line, so the non-ASCII
> > > > characters get mangled by Windows (because Windows tries to convert
> > > > non-Unicode non-ASCII strings to the current system codepage).
> > > > Test-idn-* tests that do state --local-encoding do succeed. Is it
> > > > possible that the tests assume something about the local encoding,
> > > > like that it's UTF-8?
> > >
> > > Let's start with 'Test-idn-meta'.
> > > No non-ASCII filename will be written to disk, the Content-type is
> > > stated correctly. --local-encoding set the encoding for when reading
> > > a local file or the command line. So it shouldn't influence this
> > > test. And i can't reproduce the stated behavior.
> > >
> > > Please send me the --debug output of this test with and without
> > > --local-encoding given.
> >
> > The output is attached. I collected that by redirecting the test
> > script's stderr to a file, I hope that's what you meant.
> >
> > I noticed that the output says:
> >
> >   converted 'http://<bunch of octal escapes>/' (CP1255) ->
> >   'http://<another bunch of octal escapes>/' (UTF-8)
> >
> > So I tried to use --local-encoding=EUC-JP, and that made the test
> > succeed. The third attachment below is from that successful run.
>
> Thanks, Eli.
>
> Your tests helped me to reproduce the problem:
> - install (and set) a non-UTF-8 and non-C/POSIX locale
> - use this locale for testing, e.g.:
>   TESTS_ENVIRONMENT="LC_ALL=de_DE.iso885915@euro" make check TESTS=Test-idn-meta
>
> And what I see in the logs Wget has a severe problem.
> When loading a saved (HTML) document, Wget parses it with the local-encoding
> instead of the encoding stated by the server (or document). Of course this
> can't work and this is the reason why your 3rd test works (setting the
> local-encoding to the real encoding of the document).
>
> After the 400 server response, Wget loads the document again, now with the
> correct encoding. But Wget 'remembers' some incorrect conversions from the
> first try and thus fails again.
>
> I would expect Wget to load the document with the correct encoding in the
> first place... but it looks that this 'double loading' has been done on
> purpose.
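
To make the mismatch described above concrete, here is a minimal Python sketch. It is not Wget code and does not reproduce Wget's internals. The label "日本語" is a hypothetical stand-in for the non-ASCII host name in the tests, and ISO-8859-15 is assumed as the local encoding because that is the locale in the reproduction recipe. It only shows that decoding EUC-JP bytes with the locale's charset yields a different string, and therefore a different IDN host name, than decoding with the declared charset.

```python
# Illustration only, not Wget code. The label below is a hypothetical
# stand-in for the non-ASCII host name used by the IDN tests.
label = "日本語"

# Bytes as they would appear in a document the server declares as EUC-JP.
raw = label.encode("euc_jp")

# Decoding with the locale's charset (ISO-8859-15, assumed here to match
# the de_DE locale from the reproduction recipe) raises no error but
# yields the wrong characters.
mis_decoded = raw.decode("iso8859_15")

# Decoding with the charset declared by the server or document round-trips.
decoded = raw.decode("euc_jp")

# IDNA conversion then produces two different ASCII host names; only the
# first one corresponds to the host the document actually refers to.
print(decoded.encode("idna"))
print(mis_decoded.encode("idna"))
```

The wrong decode does not fail, it just produces a different but valid-looking string, which would be consistent with the request going out and only then being rejected by the server.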

After having a deeper look into the IRI/IDN design of Wget, I have to correct myself.

IMHO, Wget's IRI support seems to be deeply broken. I guess it needs a redesign to fix it, and that exceeds the amount of time that I have.

Tim
