Michael Felt <aixto...@felt.demon.nl> added the comment:

Starting this discussion again. Please take time to read. I have spent hours 
trying to understand what is failing. Please spend a few minutes reading.

Sadly, there is a lot of text - but I do not know what I could leave out 
without damaging the process of discovery.

The failing result is:

    self.assertEqual(args, ascii(expected), out)
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != 
- ['h\xc3\xa9\xe2\x82\xac']
+ ['h\udcc3\udca9\udce2\udc82\udcac']
 : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

The test code is:
    @unittest.skipIf(MS_WINDOWS, 'test specific to Unix')
    def test_cmd_line(self):
        arg = 'h\xe9\u20ac'.encode('utf-8')
        arg_utf8 = arg.decode('utf-8')
        arg_ascii = arg.decode('ascii', 'surrogateescape')
        code = 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))'

        def check(utf8_opt, expected, **kw):
            out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
            args = out.partition(':')[2].rstrip()
            self.assertEqual(args, ascii(expected), out)

        check('utf8', [arg_utf8])
        if sys.platform == 'darwin' or support.is_android:
            c_arg = arg_utf8
        else:
            c_arg = arg_ascii
        check('utf8=0', [c_arg], LC_ALL='C')
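
For reference, a minimal sketch of the three values the test builds and what ascii() makes of each. It uses only the expressions quoted above and runs in a single process, so it shows what the test expects, not what the child interpreter actually does on AIX.

```python
# The same three values the test builds, plus their ascii() forms.
arg = 'h\xe9\u20ac'.encode('utf-8')                 # raw bytes handed to the child
arg_utf8 = arg.decode('utf-8')                      # expected with -X utf8
arg_ascii = arg.decode('ascii', 'surrogateescape')  # expected with -X utf8=0, LC_ALL=C

print(ascii(arg))          # b'h\xc3\xa9\xe2\x82\xac'
print(ascii([arg_utf8]))   # ['h\xe9\u20ac']
print(ascii([arg_ascii]))  # ['h\udcc3\udca9\udce2\udc82\udcac']
```

The third line is the string the failing assertion wants to see, and the second is the one the passing check compares against.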

Question 1: why is Windows excluded? Because it does not use UTF-8 as its 
default (its default is CP1252)?

Question 2: It seems that what the test is 'checking' is that 
object.encode('utf-8') gets decoded by ascii() based on the utf8_mode set.

    out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)

rewrites (less indent) as:
    out = self.get_output('-X', utf8_opt, '-c', code, 'h\xe9\u20ac'.encode('utf-8'), **kw)

    out = self.get_output('-X', utf8_opt, '-c', code, b'h\xc3\xa9\xe2\x82\xac', **kw)

Finally, in Lib/test/support/script_helper.py we have
    print("\n", cmd_line) # debug info, ignore
    proc = subprocess.Popen(cmd_line, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            env=env, cwd=cwd)

Which gives:

 ['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8', 
'-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), 
ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']

Above - utf8=1 - is successful

 ['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8=0', 
'-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), 
ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']

Here: utf8=0 fails. The arg to the CLI is equal in both cases.
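
To make "the arg is equal in both cases" concrete, here is a small sketch that passes the same bytes argument to a child interpreter. It assumes a Unix-like system, where subprocess accepts a bytes item in the argument list; run_child is a hypothetical helper, not part of script_helper.py. Only the '-X utf8' case is shown with an expected result, because the outcome for 'utf8=0' depends on the platform's locale handling, which is exactly what is in question here.

```python
import subprocess
import sys

arg = 'h\xe9\u20ac'.encode('utf-8')
code = 'import sys; print(ascii(sys.argv[1:]))'

def run_child(utf8_opt):
    """Hypothetical helper: run a child with -X <utf8_opt>, return raw stdout."""
    cmd = [sys.executable, '-X', utf8_opt, '-c', code, arg]
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return proc.stdout

utf8_out = run_child('utf8')
print(utf8_out)   # the child decoded argv as UTF-8
```

The bytes reach the child unchanged; any difference in the printed result comes from how the child decodes them.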

## Going back to check() and what does it have:
## Add some debug. The first line is the 'raw' expected,
## the second line is ascii(decoded)
## the final is the value extracted from get_output

    def check(utf8_opt, expected, **kw):
        out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
        args = out.partition(':')[2].rstrip()
        print("")
        print("%s: expected\n%s:ascii(expected)\n%s:out" % (expected, ascii(expected), out))
        self.assertEqual(args, ascii(expected), out)

For: utf8 mode true, it works:
['h▒\u20ac']: expected

    check('utf8', [arg_utf8])

But not for utf8=0
    check('utf8=0', [c_arg], LC_ALL='C')
 # note, different values for LC_ALL='C' have been tried
['h\udcc3\udca9\udce2\udc82\udcac']: expected

## re: expected and ascii(expected)
When utf8=1, expected and ascii(expected) differ. "arg" looks different from 
both - but after processing by get_output(), expected and out match.

When utf8=0 there is no difference in the "arg" passed to "code".
However, within check() - the values for both expected and ascii(expected) are 
identical. And, sadly, the value coming back via get_output looks nothing like 

In short, when utf8=1 ascii(b'h\xc3\xa9\xe2\x82\xac') becomes ['h\xe9\u20ac'] 
which is what is desired. But when utf8=0 ascii(b'h\xc3\xa9\xe2\x82\xac') is 
b'h\xc3\xa9\xe2\x82\xac' not 'h\udcc3\udca9\udce2\udc82\udcac'

Finally, when I run the command from the command line (after rewrites)

What passes:
./python '-X' 'faulthandler' '-X' 'utf8=1' '-c' 'import locale, sys; 
print("%s:%s" % (locale.getpreferredencoding(), ascii(
sys.argv[1:])))' b'h\xc3\xa9\xe2\x82\xac'

encoding is UTF-8, but the result of ascii(argv[1]) is the same as argv[1]

./python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; 
print("%s:%s" % (locale.getpreferredencoding(), ascii(
sys.argv[1:])))' b'h\xc3\xa9\xe2\x82\xac'


Here, the only difference in the output is that the "UTF-8" has been changed to 
"ISO8859-1", i.e., I was expecting a difference in the result of 
ascii('bh\\xc3\\xa9\\xe2\\x82\\xac'). Instead, I see "bytes obj in", "bytes obj 
out" -- apparently unchanged. HOWEVER, the result returned by get_output is 
always different, even if it is just limited to removing the 'b' quality.

Again: test result includes:
 ISO8859-1:['h\xc3\xa9\xe2\x82\xac'] - which is not equal to manual CLI with 

So, I feel the issue is not with the test, but within what happens after:

    proc = subprocess.Popen(cmd_line, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            env=env, cwd=cwd)

Specifically: here.

    with proc:
        try:
            out, err = proc.communicate()
        finally:
            proc.kill()
            subprocess._cleanup()
    rc = proc.returncode
    err = strip_python_stderr(err)
    return _PythonRunResult(rc, out, err), cmd_line

['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8', 
'-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), 
ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']
 0 b"UTF-8:['h\\xe9\\u20ac']\n" b''

['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8=0', 
'-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), 
ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']
 0 b"ISO8859-1:['h\\xc3\\xa9\\xe2\\x82\\xac']\n" b''
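
One way to narrow this down is to ask which decoding of the argument bytes reproduces the string the child printed. The sketch below tries three candidates against the observed text copied from the output above. It only shows which decoding matches; it does not establish what the interpreter actually does on AIX.

```python
arg = 'h\xe9\u20ac'.encode('utf-8')
observed = r"['h\xc3\xa9\xe2\x82\xac']"   # copied from the utf8=0 output above

candidates = {
    'utf-8': arg.decode('utf-8'),
    'ascii+surrogateescape': arg.decode('ascii', 'surrogateescape'),
    'iso8859-1': arg.decode('iso8859-1'),
}

# Keep the names of the decodings whose ascii() form equals the observed text.
matches = [name for name, text in candidates.items()
           if ascii([text]) == observed]
print(matches)   # ['iso8859-1']
```

The observed output is what one would get if sys.argv had been decoded as ISO8859-1, the encoding reported by locale.getpreferredencoding() in that run, rather than as ASCII with surrogateescape.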

Seems the 'b' quality disappears somehow with:
    args = out.partition(':')[2].rstrip()

So, maybe it is in test - in that line.
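
A small check suggests the 'b' is not lost in that line itself. bytes.partition() rejects a str separator, so for partition(':') to work, out must already be a str by the time check() sees it. The sketch assumes (it is not shown above) that get_output() decodes the bytes returned by proc.communicate() before returning them.

```python
raw = b"ISO8859-1:['h\\xc3\\xa9\\xe2\\x82\\xac']\n"   # as returned by communicate()

try:
    raw.partition(':')        # str separator on a bytes object
    bytes_ok = True
except TypeError:
    bytes_ok = False          # bytes.partition() needs a bytes separator

text = raw.decode('ascii')    # assumption: get_output() decodes before returning
args = text.partition(':')[2].rstrip()
print(bytes_ok)   # False
print(args)       # ['h\xc3\xa9\xe2\x82\xac']
```

So the value compared in assertEqual is ordinary text; the b'' prefix in the debug output belongs to the raw pipe data, not to the argument.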

However, this goes well beyond my comprehension of python internal workings.

Hope this helps. Please comment.


Python tracker <rep...@bugs.python.org>
Python-bugs-list mailing list
