Bug#1010368: python3.10: python variables called _m lead to unreproducible pyc installations

2022-04-29 Thread Chris Lamb
Hey Johannes,

> I'm not familiar with the pyc format so I cannot tell what the bits that
> differ mean but maybe somebody who can, can figure this out given the
> hexdump difference from above.

As I understand it, a .pyc file consists of a .pyc-specific header but
the bulk of the file is "just" a marshalled PyCode object. The hexdump
you referenced has the change within this marshalled part. When I
disassemble this part using the dis module, there is no "semantic"
difference between two different .pyc files from your loop:

  1           0 LOAD_NAME                0 (_m)
              2 POP_TOP
              4 LOAD_CONST               0 (None)
              6 RETURN_VALUE
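
For anyone who wants to repeat the disassembly, here is a minimal sketch of the approach described above. It assumes the 16-byte .pyc header layout of PEP 552 (Python 3.7 and later) and that the .pyc was written by the same interpreter that reads it; load_code_from_pyc and compile_and_load are hypothetical helper names, not part of any library.

```python
import dis
import marshal
import os
import py_compile
import tempfile

# Assumption: PEP 552 header = 4 bytes magic + 4 bytes flags
# + 8 bytes of either (mtime, source size) or a source hash.
PYC_HEADER_SIZE = 16


def load_code_from_pyc(path):
    """Return the code object stored in a .pyc written by this interpreter."""
    with open(path, "rb") as f:
        data = f.read()
    # Everything after the header is a marshalled code object.
    return marshal.loads(data[PYC_HEADER_SIZE:])


def compile_and_load(source):
    """Byte-compile `source` in a temporary directory and load the result."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "decoder.py")
        with open(src, "wb") as f:
            f.write(source)
        pyc = py_compile.compile(
            src, cfile=os.path.join(tmp, "decoder.pyc"), doraise=True
        )
        return load_code_from_pyc(pyc)


code = compile_and_load(b"_m\n")
dis.dis(code)  # the exact listing depends on the Python version
```

Running load_code_from_pyc on both variants of the .pyc from the loop and comparing the dis output is how one would confirm that the two files are semantically identical.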


This suggests that the difference is some internal implementation
detail of the marshalled PyCode object which does not affect its
execution semantics. I could imagine that some kind of string
interning algorithm is resulting in nondeterministic hashmap
entry numbers... or something. Still, it might not even be an
implementation detail: it could merely be uninitialised memory that is
happily skipped over by the parser.
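
To make that guess a little more concrete, here is a minimal sketch that splits the two differing bytes from the hexdump (0xda and 0x5a) into a type code and a flag bit. The constants are my reading of Python/marshal.c in CPython 3.10 and should be treated as assumptions; decode_type_byte is a hypothetical helper, not a library function.

```python
# Assumed constants, as I read them in CPython's Python/marshal.c (3.10).
FLAG_REF = 0x80  # "remember this object for later back-references"
TYPE_NAMES = {
    ord("c"): "TYPE_CODE",
    ord("r"): "TYPE_REF",
    ord("z"): "TYPE_SHORT_ASCII",
    ord("Z"): "TYPE_SHORT_ASCII_INTERNED",
    ord(")"): "TYPE_SMALL_TUPLE",
}


def decode_type_byte(byte):
    """Split a marshal type byte into (type name, whether FLAG_REF is set)."""
    type_code = byte & 0x7F  # low seven bits carry the type
    has_ref_flag = bool(byte & FLAG_REF)
    return TYPE_NAMES.get(type_code, "unknown"), has_ref_flag


# The byte in front of "_m" in each of the two .pyc variants.
variant_a = decode_type_byte(0xDA)
variant_b = decode_type_byte(0x5A)
print(variant_a)
print(variant_b)
```

If those constants are right, both bytes describe the same interned short ASCII string and differ only in the reference flag, which would also fit the later back-reference index changing from 2 to 1. That would not explain why the flag is set only some of the time, though.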



As it happens, I don't think you are the first to discover the
peculiarity of "_m" — take a look at this enigmatic comment:

  https://github.com/python/cpython/issues/78903#issuecomment-1093799639


Regards,

-- 
  ,''`.
 : :'  : Chris Lamb
 `. `'`  la...@debian.org  chris-lamb.co.uk
   `-



Bug#1010368: python3.10: python variables called _m lead to unreproducible pyc installations

2022-04-29 Thread Johannes Schauer Marin Rodrigues
Source: python3.10
Version: 3.10.4-3
Severity: wishlist
Tags: patch
User: reproducible-bui...@lists.alioth.debian.org
Usertags: randomness
X-Debbugs-Cc: reproducible-b...@lists.alioth.debian.org

Hi,

if a package contains python code with a variable named _m, then after
installing that package the pyc file resulting from that code is
unreproducible because of some randomness. Minimal reproducer:

export SOURCE_DATE_EPOCH="$(date +%s)"
for i in `seq 1 10`; do
    mmdebstrap --quiet --variant=apt --include=python3.10 \
        --customize-hook='echo _m > "$1"/tmp/decoder.py' \
        --customize-hook='chroot "$1" python3.10 -m py_compile /tmp/decoder.py' \
        --customize-hook='cat "$1"/tmp/__pycache__/decoder.cpython-310.pyc | md5sum' \
        unstable /dev/null 2>&1
done | sort | uniq -c

The above will print something like:

  6 4662176a6024d5eec15033097cd7e588  -
  4 aeb00bedc784e7cca3eb42cf50e92f8d  -

If you run the loop more often, you can see that about 2/3 of the time
the pyc file has one hash and the remaining 1/3 of the time the other. So
there are two distinct possible contents that the pyc file generated
from the same python script just containing "_m" can have. Below you can
find a diff between the hexdumps of these two possible pyc versions.

I have no idea why this happens. But why does it matter? Since #1004558
got fixed, a Priority:standard chroot is now mostly bit-by-bit
identical. Only "mostly" because there is one remaining difference:

   /usr/lib/python3.10/json/__pycache__/decoder.cpython-310.pyc

But why does that pyc file differ (randomly) while all the others remain
stable? Even if it sounds ridiculous, I tracked it down to the use of
the variable _m in /usr/lib/python3.10/json/decoder.py.

Also, the problem only shows when compiling all pyc files in a fresh
chroot. Given the same chroot with all pyc files already generated, the
pyc file generated from the minimal test case (just a python script
containing the variable name "_m" as above) will remain stable. So the
following will *not* reproduce the problem:

echo _m > test.py
for i in `seq 1 100`; do
    rm -rf __pycache__
    python3.10 -m py_compile test.py
    md5sum __pycache__/test.cpython-310.pyc
done

It needs to be done in a fresh chroot. Since the pyc contents also rely
on the modification time of the python scripts involved, maybe the
reason for this behaviour is some unreproducible mtimes after
unpacking the packages? This is why I'm filing it here. This might as
well be some sort of packaging problem.
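
One way to check the mtime theory is to look at the pyc header directly. The sketch below assumes the header layout from PEP 552 (4 bytes magic, 4 bytes little-endian flags, then either mtime plus source size or an 8-byte source hash); parse_pyc_header and header_for are hypothetical helpers written for this mail.

```python
import os
import py_compile
import struct
import tempfile


def parse_pyc_header(data):
    """Decode the first 16 bytes of a .pyc, assuming the PEP 552 layout."""
    (flags,) = struct.unpack("<I", data[4:8])
    header = {"magic": data[:4], "flags": flags}
    if flags & 0x1:
        # Hash-based pyc: no mtime is stored, only a hash of the source.
        header["source_hash"] = data[8:16]
    else:
        # Timestamp-based pyc: source mtime and size are embedded.
        header["mtime"], header["source_size"] = struct.unpack("<II", data[8:16])
    return header


def header_for(mode):
    """Compile a file containing just "_m" with the given invalidation mode."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "test.py")
        with open(src, "wb") as f:
            f.write(b"_m\n")
        pyc = py_compile.compile(
            src, cfile=src + "c", doraise=True, invalidation_mode=mode
        )
        with open(pyc, "rb") as f:
            return parse_pyc_header(f.read())


timestamp_header = header_for(py_compile.PycInvalidationMode.TIMESTAMP)
hash_header = header_for(py_compile.PycInvalidationMode.CHECKED_HASH)
print(timestamp_header)
print(hash_header)
```

As far as I can tell from the py_compile documentation, setting SOURCE_DATE_EPOCH makes CHECKED_HASH the default mode, in which case no mtime would end up in the file at all. Running parse_pyc_header on a pyc from the chroot would show which case applies.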

For the minimal test case (a python script just containing the variable
name "_m"), the pyc file is very tiny and the diffoscope output will
display the whole file via the diff context:

@@ -1,8 +1,8 @@
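 00000000: 6f0d 0d0a 0300 0000 5371 fe33 17b6 dd59  o.......Sq.3...Y
 00000010: e300 0000 0000 0000 0000 0000 0000 0000  ................
 00000020: 0001 0000 0040 0000 0073 0800 0000 6500  .....@...s....e.
-00000030: 0100 6400 5300 2901 4e29 01da 025f 6da9  ..d.S.).N)..._m.
-00000040: 0072 0200 0000 7202 0000 00fa 0f2f 746d  .r....r....../tm
+00000030: 0100 6400 5300 2901 4e29 015a 025f 6da9  ..d.S.).N).Z._m.
+00000040: 0072 0100 0000 7201 0000 00fa 0f2f 746d  .r....r....../tm
 00000050: 702f 6465 636f 6465 722e 7079 da08 3c6d  p/decoder.py..<m
 00000060: 6f64 756c 653e 0100 0000 7302 0000 0008  odule>....s.....
 00000070: 00                                       .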

I'm not familiar with the pyc format so I cannot tell what the bits that
differ mean but maybe somebody who can, can figure this out given the
hexdump difference from above.

But it's crazy that a simple choice of variable name triggers randomness
in the pyc files, right? So to further test this theory, I patched the
python3.10 source package like this:

--- a/Lib/json/decoder.py
+++ b/Lib/json/decoder.py
@@ -67,7 +67,7 @@ def _decode_uXXXX(s, pos):
     raise JSONDecodeError(msg, s, pos)
 
 def py_scanstring(s, end, strict=True,
-        _b=BACKSLASH, _m=STRINGCHUNK.match):
+        _b=BACKSLASH, m=STRINGCHUNK.match):
     """Scan the string s for a JSON string. End is the index of the
     character in s after the quote that started the JSON string.
     Unescapes all valid JSON string escape sequences and raises ValueError
@@ -80,7 +80,7 @@ def py_scanstring(s, end, strict=True,
     _append = chunks.append
     begin = end - 1
     while 1:
-        chunk = _m(s, end)
+        chunk = m(s, end)
         if chunk is None:
             raise JSONDecodeError("Unterminated string starting at", s, begin)
         end = chunk.end()


This solves the problem of random unreproducibility. All pyc files in a
priority:standard chroot are now reproducible even when running the
reproducer from the top of this mail 100 times. This is why I'm tagging
this bug with "patch". I know this is just a workaround but maybe it can
be applied until the underlying problem is identified?

With the above patch, a priority:standard chroot is now finally always
bit-by-bit reproducible.  I know that I also claimed that this was the
case for the