Thu Aug 16 06:22:19 2018: Request 126280 was acted upon.
Transaction: Correspondence added by m...@xenu.pl
       Queue: PAR-Packer
     Subject: Re: [rt.cpan.org #126280] 90-rt122949.t fails when "Use Unicode 
UTF-8 for worldwide language support" is enabled
   Broken in: (no value)
    Severity: (no value)
       Owner: Nobody
  Requestors: x...@cpan.org
      Status: open
 Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=126280 >



On Thu, 16 Aug 2018 05:11:14 -0400
"Roderich Schupp via RT" <bug-par-pac...@rt.cpan.org> wrote:

> <URL: https://rt.cpan.org/Ticket/Display.html?id=126280 >
> 
> On 2018-08-15 19:56:57, XENU wrote:
> > "\357\277\275" is a REPLACEMENT CHARACTER. It seems that when the
> > UTF-8 checkbox is enabled, bytes that aren't valid UTF-8 are being
> > replaced with that character. "\x{85}" obviously isn't a valid UTF-8
> > character.
> 
> Nope, "\x{85}" is a valid Unicode code point (there's no such thing as a
> "UTF-8 character"), cf. http://www.unicode.org/charts/PDF/U0080.pdf

Of course U+0085 exists, but it's irrelevant because in this case we're
talking about raw bytes. And by "UTF-8 character" I meant "UTF-8 encoded
codepoint". "\xc2\x85" (or Encode::encode("UTF-8", "\x85")) would work
fine, I have tested that.
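
A minimal sketch of that distinction, using only the core Encode module. It shows that U+0085 encodes to the two bytes C2 85, while the lone byte 0x85 is malformed UTF-8 that a lenient decode replaces with U+FFFD, whose UTF-8 form is the "\357\277\275" from the test output. The helper hex_bytes is a made-up name for illustration, and whether Windows does an equivalent substitution internally is an assumption, not something this snippet proves.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join " ", map { sprintf "%02X", ord } split //, $bytes;
}

# The code point U+0085 becomes two bytes when encoded as UTF-8.
my $encoded = encode("UTF-8", "\x{85}");
print "U+0085 encoded: ", hex_bytes($encoded), "\n";

# The lone byte 0x85 is a stray continuation byte. With the default
# CHECK value, decode() substitutes U+FFFD for it.
my $raw      = "\x85";
my $decoded  = decode("UTF-8", $raw);
my $replaced = encode("UTF-8", $decoded);
print "lone 0x85 after decode/encode: ", hex_bytes($replaced), "\n";
```

EF BF BD in hex is \357\277\275 in the octal notation that Data::Dumper prints.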

> For background information, we're in a murky Windows area here:
> when you call the C-level function (somewhere in the guts of PAR::Packer)
> 
>   spawnvp(P_WAIT, "some.exe", argv)
> 
> you have to actually manipulate the strings in argv[] so that some.exe
> actually sees the original argv in its
> 
>    main(argc, argv)
> 
> The most obvious gotcha is when some argv[i] contains blanks, e.g.
> "foo bar quux", which will arrive at some.exe as *three* separate
> elements of argv[]: "foo", "bar", "quux". See Win32::ShellQuote for
> details, that's where I stole most of the test cases from.
> 
> Anyway, a 100% solution is probably not possible and "\x{85}", while
> legal Unicode, isn't a very relevant test case - it's a control char
> ("NEXT LINE"). So there may be a reason why Microsoft treats it
> differently under "Use Unicode UTF-8 for worldwide language support".
> Let's replace this test case with some more relevant cases of strings
> with non-ASCII chars:
> 
>   [ qq[german umlaute \x{E4}\x{F6}\x{FC}] ],
>   [ qq[chinese zhongwen \x{4E2D}\x{6587}] ],
> 
> Can you rerun the failing test with these modifications under
> "Use Unicode..."?

Both of them fail:

ok 110 - successfully ran 
"C:\Users\xenu\AppData\Local\Temp\qn5gz65wHX\packed.exe german umlaute "
not ok 111
#   Failed test at t\90-rt122949.t line 79.
#          got: '$VAR1 = [
#   "german umlaute \357\277\275\357\277\275\357\277\275"
# ];
# '
#     expected: '$VAR1 = [
#   "german umlaute \344\366\374"
# ];
# '
Wide character in print at C:/Strawberry/perl/lib/Test2/Formatter/TAP.pm line 
144.
ok 112 - successfully ran 
"C:\Users\xenu\AppData\Local\Temp\qn5gz65wHX\packed.exe chinese zhongwen ??"
not ok 113
#   Failed test at t\90-rt122949.t line 79.
#          got: '$VAR1 = [
#   "chinese zhongwen \344\270\255\346\226\207"
# ];
# '
#     expected: '$VAR1 = [
#   "chinese zhongwen \x{4e2d}\x{6587}"
# ];
# '
# Looks like you failed 2 tests of 113.

However, if I replace them with qq[german umlaute
\xc3\xa4\xc3\xb6\xc3\xbc] and qq[chinese zhongwen
\xe4\xb8\xad\xe6\x96\x87], the test passes.
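
For reference, byte strings of this kind can be derived from the original test cases rather than typed by hand. The sketch below encodes both strings to UTF-8 and prints the non-ASCII bytes as \x escapes. The function name utf8_escaped is invented for illustration and is not part of the test suite.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a character string to UTF-8 and show
# every byte outside printable ASCII as a \xNN escape.
sub utf8_escaped {
    my ($chars) = @_;
    my $bytes = encode("UTF-8", $chars);
    (my $escaped = $bytes) =~ s/([^\x20-\x7E])/sprintf("\\x%02x", ord $1)/ge;
    return $escaped;
}

my @cases = (
    "german umlaute \x{E4}\x{F6}\x{FC}",
    "chinese zhongwen \x{4E2D}\x{6587}",
);

print utf8_escaped($_), "\n" for @cases;
```

For the Chinese case these are the same bytes that test 113 reports as "got" (\344\270\255\346\226\207 in octal), which suggests the packed executable receives the UTF-8 encoding of the argument instead of the original characters.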

> 
> Cheers, Roderich
