Маша Глухова: > I believe the attached patch would provide the requested functionality.
Nice work! :)

> From: Maria Glukhova <siamez...@gmail.com>
> Date: Sat, 24 Dec 2016 12:29:57 +0200
> Subject: [PATCH] Add detection of order-only difference in plain text format.
>
> Detect if the text files' contents differ only in line ordering, and give an
> appropriate comment.
> […]
> +def order_only_difference(unified_diff):
> +    diff_lines = unified_diff.splitlines()
> +    added_lines = [line[1:] for line in diff_lines if line.startswith('+')]
> +    removed_lines = [line[1:] for line in diff_lines if line.startswith('-')]
> +    # Faster check: does the number of lines match?
> +    if len(added_lines) != len(removed_lines):
> +        return False
> +    # Compare the multisets of added and removed lines.
> +    return sorted(added_lines) == sorted(removed_lines)

I guess it's a fine approach to the problem, but I wonder if it would not be
better to use a slightly less accurate strategy that is nicer to memory and
CPU. What I have in mind is to incrementally compute a hash value that gives
the same result even if the lines are in a different order. Drawing from
discussions on StackOverflow [1], I think a sum of Python's hash() would work.
My test was:

    def unordered_hash(lines):
        h = 0
        for line in lines:
            h += hash(line)
        return h

    h1 = unordered_hash(open('tests/data/text_order1').readlines())
    h2 = unordered_hash(open('tests/data/text_order2').readlines())
    print(h1, h2, h1 == h2)

That way, it could probably be implemented directly in the difference module
and work for other file types than just text files.

[1]: https://stackoverflow.com/questions/30734848/order-independant-hash-algorithm

-- 
Lunar
lu...@debian.org
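[Editorial sketch of the summed-hash idea discussed above. One caveat worth noting: since Python 3.3, hash() of str is randomized per interpreter process, so the sum is only comparable within a single run. The variant below uses hashlib instead, which is stable across runs; the helper name unordered_digest is hypothetical and not part of the patch or the mail.]

```python
import hashlib


def unordered_digest(lines):
    """Order-independent digest of an iterable of text lines.

    Summing the per-line SHA-256 digests modulo 2**256 gives the same
    result regardless of line ordering, and unlike the built-in hash()
    it does not change between interpreter runs.
    """
    total = 0
    for line in lines:
        d = hashlib.sha256(line.encode("utf-8")).digest()
        total = (total + int.from_bytes(d, "big")) % (1 << 256)
    return total


# Reordering the lines leaves the digest unchanged...
assert unordered_digest(["b\n", "a\n", "c\n"]) == unordered_digest(["a\n", "b\n", "c\n"])
# ...while changing the content (with high probability) does not.
assert unordered_digest(["a\n", "b\n"]) != unordered_digest(["a\n", "c\n"])
```

As with the summed hash() approach, this trades accuracy for memory: collisions are possible in principle (e.g. it cannot distinguish files where one line appears more or fewer times if the sums happen to coincide), but incrementally summing digests avoids holding both line lists in memory at once.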
_______________________________________________
Reproducible-builds mailing list
Reproducible-builds@lists.alioth.debian.org
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/reproducible-builds