Den sön 24 nov. 2024 kl 17:11 skrev Daniel Shahaf <d...@daniel.shahaf.name>:

> Daniel Sahlberg wrote on Sat, 23 Nov 2024 16:14 +00:00:
> > After reviewing history, I believe the above change is the "obvious fix".
> > One could argue if decode() should use some other encoding, I don't
> really
> > know this. I've committed as r1922041.
>
> I'm afraid this fix is incomplete.
>
> The key point, as you quote from the log message of r1742950, is that
> dumpfiles are binary data (because of how they represent files and
> properties).  That means they are not encoded in /any/ encoding.
>
> Formally, a dumpfile is a finite sequence of octets, which is not
> guaranteed to represent a sequence of characters, a sequence of Unicode
> codepoints, or anything of the sort.  It's the kind of file that one can
> reasonably gzip, base64, or download, but not spell check, wrap to
> 80 columns, or convert from one encoding or character set to another.
>
> In particular, the call «open('foo.dump', 'rb').read().decode(bar)» is
> incorrect for all values of BAR.
>
> Therefore, notwithstanding that calling .decode() improves the behaviour
> on the case at hand, it's not The Fix.
>

I understand this, but I'd like to point out that the code first verify if
the diffs are different using the binary content (no decode). "if
parsed_expected != parsed_actual", the code goes on to use ndiff() to
display the diff:ing lines in tests.log.


> Incidentally, decoding the binary dumpfile contents as UTF-8 in
> particular is problematic for two reasons:
>
> - Some dumpfiles are in fact not valid UTF-8.  On such dumpfiles,
>   compare_dump_files() will raise UnicodeDecodeError (from the .decode()
>   call, before ndiff() is ever entered).
>

Valid point. Avoiding an unnecessary exception is a good argument.


>   .
>   Usually, dumping a repository where any file contains invalid UTF-8
>   will result in a non-UTF-8 dump.
>
> - It's not obvious whether .decode("utf-8") is a one-to-one transformation,
>   which is to say, whether two different bytes values [sic] could possibly
>   decode to the same str value, in which case compare_dump_files() might
>   false negative.
>
> So, what /should/ the test be doing?
>
> - Compare the two lists of bytes objects as, well, two lists of bytes
>   objects?  That would be nice, if Python's standard library enables it.
>

We could use difflib.diff_bytes() but we would anyway have to convert it to
a string to include it in the log. Since the dump file mostly is a "text
file with binary data" (and the actual comparison is done earlier) I don't
see a reason to use this function.


>
> - .decode("iso-8859-1")?  This would work, sort of, insofar as it would
>   be one-to-one and wouldn't raise for any input value.  However, when
>   difflib prints a line, the octets printed would not equal the octets
>   in the file: for instance, the byte 0xD7 in the file would likely be
>   printed as the bytes 0xC3 0x97 [==
> b"\xD7".decode("iso-8859-1").encode("utf-8")].
>   While this transformation /is/ reversible, and won't slow down debugging
>   too much for anyone who knew to expect it and how to reverse it, for
>   the test harness to do this would be rather unintuitive and surprising.
>
> - I think the easiest solution is to change x.decode("utf-8") to
>   repr(x).  repr() is guaranteed to return a string
>   (https://docs.python.org/3/library/functions.html#repr), as required
>   by difflib, and shouldn't introduce either false positives or false
>   negatives.
>

This seems like a good solution. Committed as r1922064.


> - Also, the line should be changed to not exceed 80 columns.
>
> >
> > >
> > > 2. The next test was run.
>
[...]
Thanks for the review and the detailed explaination.
It was my intention to have time to go through this as well tonight but it
is getting way to late and I'll have to do this tomorrow instead.

Cheers,
Daniel

Reply via email to