On Thursday, February 16, 2017 4:10:22 PM CET YX Hao wrote:
> My bad! I made a stupid mistake!
> Then, how can Tim's case pass the 'iconv' function? Maybe the
> 'from_encoding' in the 'convert_fname' function is the same as the
> 'to_encoding'. Did he download from a server that uses the same encoding?
> On 2017-02-16 at 14:07, "Eli Zaretskii" <e...@gnu.org> wrote:
> > Date: Thu, 16 Feb 2017 12:42:23 +0800 (CST)
> > From: "YX Hao" <lifenjoi...@163.com>
> > 
> > I downloaded the 'mbox format' original, and found out the reason why you
> > can't reproduce the issue. The non-ASCII characters you use are encoded in
> > "iso-8859-1" in your email, and should be displayed correctly in your
> > environment. So, your encoding is compatible with 'UTF8', which is the
> > remote server's default encoding. That won't cause an iconv error :) Think
> > about 'UTF8'-incompatible encoding environments ...
> 
> Maybe I misunderstand, but ISO-8859-1 (a.k.a. "Latin-1") is NOT
> compatible with UTF-8.  Trying to decode Latin-1 text as UTF-8 will
> get you errors from the conversion routines, because Latin-1 byte
> sequences are generally not valid UTF-8 sequences.
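
The point about Latin-1 versus UTF-8 can be checked quickly. The following is a minimal sketch in Python (not part of wget or its test suite), using Python's built-in codecs as a stand-in for iconv. The helper name decodes_as_utf8 is made up for illustration. It shows that pure ASCII is valid UTF-8, while no single Latin-1 byte in the range 0xA0..0xFF is valid UTF-8 on its own.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is a valid UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Printable Latin-1 characters above ASCII occupy the single bytes 0xA0..0xFF.
high_bytes = [bytes([b]) for b in range(0xA0, 0x100)]
valid_alone = [b for b in high_bytes if decodes_as_utf8(b)]

print(len(valid_alone))           # how many high bytes are valid UTF-8 alone
print(decodes_as_utf8(b"index"))  # pure ASCII is valid in both encodings
print(decodes_as_utf8("français".encode("iso-8859-1")))  # Latin-1 text
```

Some multi-byte Latin-1 strings do happen to form valid UTF-8 (for example the two bytes 0xC3 0xBC), which is why the claim is "generally" rather than "never".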

You might be right... I made up a test that reproduces the issue (I guess/hope).
The patch is attached for playing around with, and here are the steps I took,
which depend on the locales I have installed.

$ locale
LANG=en_US.UTF-8
... (everything set to en_US.UTF-8)

$ locale -a
C
C.UTF-8
de_DE@euro
de_DE.iso885915@euro
de_DE.utf8
en_US.iso885915
en_US.utf8
POSIX
tr_TR.utf8

Convert a special character from UTF-8 to ISO-8859-15 and get its byte sequence:
$ echo -n ü|iconv -f utf-8 -t iso-8859-15|od -t x1
0000000 fc
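
For comparison, the same bytes can be obtained without iconv. This is a small Python sketch, assuming Python's codec names "iso-8859-15" and "utf-8" correspond to the iconv encodings used above. It prints the encoded form of 'ü' in both encodings.

```python
ch = "\u00fc"  # 'ü', LATIN SMALL LETTER U WITH DIAERESIS

iso_bytes = ch.encode("iso-8859-15")
utf8_bytes = ch.encode("utf-8")

print(iso_bytes.hex())   # fc   (single byte, matches the od output)
print(utf8_bytes.hex())  # c3bc (two bytes in UTF-8)
```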

Then I copied tests/Test-iri.px to Test-iri-P.px, amended it, and added it to
Makefile.am (don't forget to recreate the Makefile with ./config.status in the
main directory).
All I changed in the new test is
my $iso885915_path = "\xfc";
my $cmdline = $WgetTest::WGETPATH . " -d -P ${iso885915_path} --iri --trust-server-names --restrict-file-names=nocontrol -nH -r http://localhost:{{port}}/";

$ cd tests
$ LC_ALL=en_US.iso885915 make check TESTS=Test-iri-P

And voila, in the .log file:
Incomplete or invalid multibyte sequence encountered
Failed to convert file name 'ü/index.html' (UTF-8) -> '?' (ISO-8859-15)
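
One plausible reading of this log (an assumption, not verified against the wget source) is that the raw 0xFC byte from -P ends up in a file name that is then treated as UTF-8 and converted to the locale encoding. The sketch below mimics that with Python's codecs instead of iconv. convert_name is a hypothetical helper for illustration, not wget's convert_fname.

```python
def convert_name(name: bytes, from_enc: str, to_enc: str) -> bytes:
    """Convert a file name between encodings, failing on invalid input."""
    return name.decode(from_enc).encode(to_enc)


# Prefix given as proper UTF-8: the conversion to ISO-8859-15 succeeds.
ok = convert_name(b"\xc3\xbc/index.html", "utf-8", "iso-8859-15")
print(ok)

# Prefix given as a raw ISO-8859-15 byte but labeled UTF-8: conversion fails,
# comparable to the "invalid multibyte sequence" message in the log.
try:
    convert_name(b"\xfc/index.html", "utf-8", "iso-8859-15")
    failed = False
except UnicodeDecodeError as err:
    failed = True
    print("conversion failed:", err.reason)
```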

My editor (kwrite) auto-detected iso-8859-15, so the 'ü' copied and pasted above
is in whatever encoding this email ends up using. In the log itself, though, it
is correctly iso-8859-15 encoded (0xFC).

The above error occurs even before the first download (I guess when building 
the local filename). That means, we can reduce the test much further...

Regards, Tim
From 5b262b1f0e006b31118706ac45d5089db08a80ba Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Tim=20R=C3=BChsen?= <tim.rueh...@gmx.de>
Date: Thu, 16 Feb 2017 12:05:06 +0100
Subject: [PATCH] Add tests/Test-iri-P.px

---
 tests/Makefile.am   |   1 +
 tests/Test-iri-P.px | 209 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 210 insertions(+)
 create mode 100755 tests/Test-iri-P.px

diff --git a/tests/Makefile.am b/tests/Makefile.am
index c27c4ce2..3e613fe4 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -85,6 +85,7 @@ PX_TESTS = \
              Test-idn-robots.px \
              Test-idn-robots-utf8.px \
              Test-iri.px \
+             Test-iri-P.px \
              Test-iri-percent.px \
              Test-iri-disabled.px \
              Test-iri-forced-remote.px \
diff --git a/tests/Test-iri-P.px b/tests/Test-iri-P.px
new file mode 100755
index 00000000..42f4e691
--- /dev/null
+++ b/tests/Test-iri-P.px
@@ -0,0 +1,209 @@
+#!/usr/bin/env perl
+
+use strict;
+use warnings;
+
+use WgetFeature qw(iri);
+use HTTPTest;
+
+# cf. http://en.wikipedia.org/wiki/Latin1
+#     http://en.wikipedia.org/wiki/ISO-8859-15
+
+###############################################################################
+#
+# mime : charset found in Content-Type HTTP MIME header
+# meta : charset found in Content-Type meta tag
+#
+# index.html                  mime + file = iso-8859-15
+# p1_français.html            meta + file = iso-8859-1, mime = utf-8
+# p2_één.html                 meta + file = utf-8, mime = iso-8859-1
+# p3_€€€.html                 meta + file = utf-8, mime = iso-8859-1
+# p4_méér.html                mime + file = utf-8
+#
+
+my $ccedilla_l15 = "\xE7";
+my $ccedilla_u8 = "\xC3\xA7";
+my $eacute_l1 = "\xE9";
+my $eacute_u8 = "\xC3\xA9";
+my $eurosign_l15 = "\xA4";
+my $eurosign_u8 = "\xE2\x82\xAC";
+
+my $pageindex = <<EOF;
+<html>
+<head>
+  <title>Main Page</title>
+</head>
+<body>
+  <p>
+    Link to page 1 <a href="http://localhost:{{port}}/p1_fran${ccedilla_l15}ais.html">La seule page en fran&ccedil;ais</a>.
+    Link to page 3 <a href="http://localhost:{{port}}/p3_${eurosign_l15}${eurosign_l15}${eurosign_l15}.html">My tailor is rich</a>.
+  </p>
+</body>
+</html>
+EOF
+
+# specifying a wrong charset in http-equiv - it will be overridden by Content-Type HTTP header
+my $pagefrancais = <<EOF;
+<html>
+<head>
+  <title>La seule page en français</title>
+  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
+</head>
+<body>
+  <p>
+    Link to page 2 <a href="http://localhost:{{port}}/p2_${eacute_l1}${eacute_l1}n.html">Die enkele nerderlangstalige pagina</a>.
+  </p>
+</body>
+</html>
+EOF
+
+my $pageeen = <<EOF;
+<html>
+<head>
+  <title>Die enkele nederlandstalige pagina</title>
+  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
+</head>
+<body>
+  <p>
+    &Eacute;&eacute;n is niet veel maar toch meer dan nul.<br/>
+    Nerdelands is een mooie taal... dit zin stuckje spreekt vanzelf, of niet :)<br/>
+    <a href="http://localhost:{{port}}/p4_m${eacute_u8}${eacute_u8}r.html">M&eacute&eacute;r</a>
+  </p>
+</body>
+</html>
+EOF
+
+my $pageeuro = <<EOF;
+<html>
+<head>
+  <title>Euro page</title>
+  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
+</head>
+<body>
+  <p>
+    My tailor isn't rich anymore.
+  </p>
+</body>
+</html>
+EOF
+
+my $pagemeer = <<EOF;
+<html>
+<head>
+  <title>Bekende supermarkt</title>
+</head>
+<body>
+  <p>
+    Ik ben toch niet gek !
+  </p>
+</body>
+</html>
+EOF
+
+my $page404 = <<EOF;
+<html>
+<head>
+  <title>404</title>
+</head>
+<body>
+  <p>
+    Nop nop nop...
+  </p>
+</body>
+</html>
+EOF
+
+# code, msg, headers, content
+my %urls = (
+    '/index.html' => {
+        code => "200",
+        msg => "Ok",
+        headers => {
+            "Content-type" => "text/html; charset=ISO-8859-15",
+        },
+        content => $pageindex,
+    },
+    '/robots.txt' => {
+        code => "200",
+        msg => "Ok",
+        headers => {
+            "Content-type" => "text/plain",
+        },
+        content => "",
+    },
+    '/p1_fran%C3%A7ais.html' => {	# UTF-8 encoded
+        code => "200",
+        msg => "Ok",
+        headers => {
+            # Content-Type header overrides http-equiv Content-Type
+            "Content-type" => "text/html; charset=ISO-8859-15",
+        },
+        content => $pagefrancais,
+    },
+    '/p2_%C3%A9%C3%A9n.html' => {	# UTF-8 encoded
+        code => "200",
+        msg => "Ok",
+        request_headers => {
+            "Referer" => qr|http://localhost:[0-9]+/p1_fran%C3%A7ais.html|,
+        },
+        headers => {
+            "Content-type" => "text/html; charset=UTF-8",
+        },
+        content => $pageeen,
+    },
+    '/p3_%E2%82%AC%E2%82%AC%E2%82%AC.html' => {	# UTF-8 encoded
+        code => "200",
+        msg => "Ok",
+        headers => {
+            "Content-type" => "text/plain; charset=ISO-8859-1",
+        },
+        content => $pageeuro,
+    },
+    '/p4_m%C3%A9%C3%A9r.html' => {
+        code => "200",
+        msg => "Ok",
+        request_headers => {
+            "Referer" => qr|http://localhost:[0-9]+/p2_%C3%A9%C3%A9n.html|,
+        },
+        headers => {
+            "Content-type" => "text/plain; charset=UTF-8",
+        },
+        content => $pagemeer,
+    },
+);
+
+my $iso885915_path = "\xfc";
+my $cmdline = $WgetTest::WGETPATH . " -d -P ${iso885915_path} --iri --trust-server-names --restrict-file-names=nocontrol -nH -r http://localhost:{{port}}/";
+
+my $expected_error_code = 0;
+
+my %expected_downloaded_files = (
+    'index.html' => {
+        content => $pageindex,
+    },
+    'robots.txt' => {
+        content => "",
+    },
+    "p1_fran${ccedilla_u8}ais.html" => {
+        content => $pagefrancais,
+    },
+    "p2_${eacute_u8}${eacute_u8}n.html" => {
+        content => $pageeen,
+    },
+    "p3_${eurosign_u8}${eurosign_u8}${eurosign_u8}.html" => {
+        content => $pageeuro,
+    },
+    "p4_m${eacute_u8}${eacute_u8}r.html" => {
+        content => $pagemeer,
+    },
+);
+
+###############################################################################
+
+my $the_test = HTTPTest->new (input => \%urls,
+                              cmdline => $cmdline,
+                              errcode => $expected_error_code,
+                              output => \%expected_downloaded_files);
+exit $the_test->run();
+
+# vim: et ts=4 sw=4
-- 
2.11.0
