On Thursday, February 16, 2017 4:10:22 PM CET YX Hao wrote:
> My bad! I made a stupid mistake!
> Then, how can Tim's case pass the 'iconv' function? Maybe the
> 'from_encoding' in the 'convert_fname' function is the same as the
> 'to_encoding'. Did he download from a server with the same encoding???
>
> On 2017-02-16 at 14:07, "Eli Zaretskii" <e...@gnu.org> wrote:
> > Date: Thu, 16 Feb 2017 12:42:23 +0800 (CST)
> > From: "YX Hao" <lifenjoi...@163.com>
> >
> > I downloaded the 'mbox format' original, and found out the reason why you
> > can't reproduce the issue. The non-ASCII characters you use are encoded in
> > "iso-8859-1" in your email, and should be displayed correctly in your
> > environment. So, your encoding is compatible with 'UTF8', which is the
> > remote server's default encoding. That won't cause an iconv error :) Think
> > about 'UTF8' incompatible encoding environments ...
>
> Maybe I misunderstand, but ISO-8859-1 (a.k.a. "Latin-1") is NOT
> compatible with UTF-8. Trying to decode Latin-1 text as UTF-8 will
> get you errors from the conversion routines, because Latin-1 byte
> sequences are generally not valid UTF-8 sequences.

You might be right... I made up a test that reproduces the issue (I guess/hope).

The patch is attached for playing around, and here are the steps I took. They depend on the locales installed on my system.

$ locale
LANG=en_US.UTF-8
... (everything set to en_US.UTF-8)

$ locale -a
C
C.UTF-8
de_DE@euro
de_DE.iso885915@euro
de_DE.utf8
en_US.iso885915
en_US.utf8
POSIX
tr_TR.utf8

Convert a special character from UTF-8 to ISO-8859-15 and get its byte sequence:

$ echo -n ü|iconv -f utf-8 -t iso-8859-15|od -t x1
0000000 fc

Now I copied tests/Test-iri.px to Test-iri-P.px, amended it, and added it to Makefile.am (don't forget to recreate the Makefile with ./config.status in the main directory).

All I changed in the new test is

my $iso885915_path = "\xfc";
my $cmdline = $WgetTest::WGETPATH . " -d -P ${iso885915_path} --iri --trust-server-names --restrict-file-names=nocontrol -nH -r http://localhost:{{port}}/";

$ cd tests
$ LC_ALL=en_US.iso885915 make check TESTS=Test-iri-P

And voilà, in the .log file:

Incomplete or invalid multibyte sequence encountered
Failed to convert file name 'ü/index.html' (UTF-8) -> '?' (ISO-8859-15)

My editor (kwrite) auto-detected iso-8859-15, so after copy&pasting, the 'ü' above is in whatever encoding this email happens to have. But in the log it is correctly iso-8859-15 encoded (0xFC).

The above error occurs even before the first download (I guess when building the local filename). That means we can reduce the test much further...

Regards, Tim
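
For anyone who wants to see the same failure without building wget, here is a minimal Python sketch. It assumes, as the log message suggests but the wget source has not confirmed here, that the local file name bytes are treated as UTF-8 and converted to the locale charset. The helper convert_name_sketch is hypothetical and is not wget's convert_fname.

```python
# Hypothetical stand-in for a file name conversion step; NOT wget's code.
def convert_name_sketch(raw: bytes, from_encoding: str, to_encoding: str) -> bytes:
    """Decode raw bytes as from_encoding, then re-encode as to_encoding."""
    return raw.decode(from_encoding).encode(to_encoding)


# Local file name as in the test: the -P prefix byte 0xFC plus "/index.html".
fname = b"\xfc/index.html"

try:
    convert_name_sketch(fname, "utf-8", "iso-8859-15")
    failed = False
except UnicodeDecodeError:
    # 0xFC cannot start a UTF-8 sequence, so the input is rejected.
    failed = True

# The very same bytes are a valid ISO-8859-15 string.
as_latin9 = fname.decode("iso-8859-15")

# A name that really is UTF-8 encoded (C3 BC for the umlaut) converts fine.
converted = convert_name_sketch("ü/index.html".encode("utf-8"),
                                "utf-8", "iso-8859-15")

print(failed, as_latin9, converted)
```

The point of the sketch is only that the byte 0xFC is fine as ISO-8859-15 but is rejected when the converter is told the input is UTF-8, which matches the "invalid multibyte sequence" message in the log.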

From 5b262b1f0e006b31118706ac45d5089db08a80ba Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Tim=20R=C3=BChsen?= <tim.rueh...@gmx.de>
Date: Thu, 16 Feb 2017 12:05:06 +0100
Subject: [PATCH] Add tests/Test-iri-P.log

---
 tests/Makefile.am   |   1 +
 tests/Test-iri-P.px | 209 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 210 insertions(+)
 create mode 100755 tests/Test-iri-P.px

diff --git a/tests/Makefile.am b/tests/Makefile.am
index c27c4ce2..3e613fe4 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -85,6 +85,7 @@ PX_TESTS = \
              Test-idn-robots.px \
              Test-idn-robots-utf8.px \
              Test-iri.px \
+             Test-iri-P.px \
              Test-iri-percent.px \
              Test-iri-disabled.px \
              Test-iri-forced-remote.px \
diff --git a/tests/Test-iri-P.px b/tests/Test-iri-P.px
new file mode 100755
index 00000000..42f4e691
--- /dev/null
+++ b/tests/Test-iri-P.px
@@ -0,0 +1,209 @@
+#!/usr/bin/env perl
+
+use strict;
+use warnings;
+
+use WgetFeature qw(iri);
+use HTTPTest;
+
+# cf. http://en.wikipedia.org/wiki/Latin1
+#     http://en.wikipedia.org/wiki/ISO-8859-15
+
+###############################################################################
+#
+# mime : charset found in Content-Type HTTP MIME header
+# meta : charset found in Content-Type meta tag
+#
+# index.html                  mime + file = iso-8859-15
+# p1_français.html            meta + file = iso-8859-1, mime = utf-8
+# p2_één.html                 meta + file = utf-8, mime =iso-8859-1
+# p3_€€€.html                 meta + file = utf-8, mime = iso-8859-1
+# p4_méér.html                mime + file = utf-8
+#
+
+my $ccedilla_l15 = "\xE7";
+my $ccedilla_u8 = "\xC3\xA7";
+my $eacute_l1 = "\xE9";
+my $eacute_u8 = "\xC3\xA9";
+my $eurosign_l15 = "\xA4";
+my $eurosign_u8 = "\xE2\x82\xAC";
+
+my $pageindex = <<EOF;
+<html>
+<head>
+  <title>Main Page</title>
+</head>
+<body>
+  <p>
+    Link to page 1 <a href="http://localhost:{{port}}/p1_fran${ccedilla_l15}ais.html">La seule page en français</a>.
+    Link to page 3 <a href="http://localhost:{{port}}/p3_${eurosign_l15}${eurosign_l15}${eurosign_l15}.html">My tailor is rich</a>.
+  </p>
+</body>
+</html>
+EOF
+
+# specifying a wrong charset in http-equiv - it will be overridden by Content-Type HTTP header
+my $pagefrancais = <<EOF;
+<html>
+<head>
+  <title>La seule page en français</title>
+  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
+</head>
+<body>
+  <p>
+    Link to page 2 <a href="http://localhost:{{port}}/p2_${eacute_l1}${eacute_l1}n.html">Die enkele nerderlangstalige pagina</a>.
+  </p>
+</body>
+</html>
+EOF
+
+my $pageeen = <<EOF;
+<html>
+<head>
+  <title>Die enkele nederlandstalige pagina</title>
+  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
+</head>
+<body>
+  <p>
+    Één is niet veel maar toch meer dan nul.<br/>
+    Nerdelands is een mooie taal... dit zin stuckje spreekt vanzelf, of niet :)<br/>
+    <a href="http://localhost:{{port}}/p4_m${eacute_u8}${eacute_u8}r.html">Méér</a>
+  </p>
+</body>
+</html>
+EOF
+
+my $pageeuro = <<EOF;
+<html>
+<head>
+  <title>Euro page</title>
+  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
+</head>
+<body>
+  <p>
+    My tailor isn't rich anymore.
+  </p>
+</body>
+</html>
+EOF
+
+my $pagemeer = <<EOF;
+<html>
+<head>
+  <title>Bekende supermarkt</title>
+</head>
+<body>
+  <p>
+    Ik ben toch niet gek !
+  </p>
+</body>
+</html>
+EOF
+
+my $page404 = <<EOF;
+<html>
+<head>
+  <title>404</title>
+</head>
+<body>
+  <p>
+    Nop nop nop...
+  </p>
+</body>
+</html>
+EOF
+
+# code, msg, headers, content
+my %urls = (
+    '/index.html' => {
+        code => "200",
+        msg => "Ok",
+        headers => {
+            "Content-type" => "text/html; charset=ISO-8859-15",
+        },
+        content => $pageindex,
+    },
+    '/robots.txt' => {
+        code => "200",
+        msg => "Ok",
+        headers => {
+            "Content-type" => "text/plain",
+        },
+        content => "",
+    },
+    '/p1_fran%C3%A7ais.html' => { # UTF-8 encoded
+        code => "200",
+        msg => "Ok",
+        headers => {
+            # Content-Type header overrides http-equiv Content-Type
+            "Content-type" => "text/html; charset=ISO-8859-15",
+        },
+        content => $pagefrancais,
+    },
+    '/p2_%C3%A9%C3%A9n.html' => { # UTF-8 encoded
+        code => "200",
+        msg => "Ok",
+        request_headers => {
+            "Referer" => qr|http://localhost:[0-9]+/p1_fran%C3%A7ais.html|,
+        },
+        headers => {
+            "Content-type" => "text/html; charset=UTF-8",
+        },
+        content => $pageeen,
+    },
+    '/p3_%E2%82%AC%E2%82%AC%E2%82%AC.html' => { # UTF-8 encoded
+        code => "200",
+        msg => "Ok",
+        headers => {
+            "Content-type" => "text/plain; charset=ISO-8859-1",
+        },
+        content => $pageeuro,
+    },
+    '/p4_m%C3%A9%C3%A9r.html' => {
+        code => "200",
+        msg => "Ok",
+        request_headers => {
+            "Referer" => qr|http://localhost:[0-9]+/p2_%C3%A9%C3%A9n.html|,
+        },
+        headers => {
+            "Content-type" => "text/plain; charset=UTF-8",
+        },
+        content => $pagemeer,
+    },
+);
+
+my $iso885915_path = "\xfc";
+my $cmdline = $WgetTest::WGETPATH . " -d -P ${iso885915_path} --iri --trust-server-names --restrict-file-names=nocontrol -nH -r http://localhost:{{port}}/";
+
+my $expected_error_code = 0;
+
+my %expected_downloaded_files = (
+    'index.html' => {
+        content => $pageindex,
+    },
+    'robots.txt' => {
+        content => "",
+    },
+    "p1_fran${ccedilla_u8}ais.html" => {
+        content => $pagefrancais,
+    },
+    "p2_${eacute_u8}${eacute_u8}n.html" => {
+        content => $pageeen,
+    },
+    "p3_${eurosign_u8}${eurosign_u8}${eurosign_u8}.html" => {
+        content => $pageeuro,
+    },
+    "p4_m${eacute_u8}${eacute_u8}r.html" => {
+        content => $pagemeer,
+    },
+);
+
+###############################################################################
+
+my $the_test = HTTPTest->new (input => \%urls,
+                              cmdline => $cmdline,
+                              errcode => $expected_error_code,
+                              output => \%expected_downloaded_files);
+exit $the_test->run();
+
+# vim: et ts=4 sw=4
-- 
2.11.0