Маша Глухова: > I believe the attached patch would provide the requested functionality.
Nice work! :)

> From: Maria Glukhova <siamez...@gmail.com>
> Date: Sat, 24 Dec 2016 12:29:57 +0200
> Subject: [PATCH] Add detection of order-only difference in plain text format.
>
> Detect if the text files' contents differ only in line ordering, and give an
> appropriate comment.
> […]
> +def order_only_difference(unified_diff):
> +    diff_lines = unified_diff.splitlines()
> +    added_lines = [line[1:] for line in diff_lines if line.startswith('+')]
> +    removed_lines = [line[1:] for line in diff_lines if line.startswith('-')]
> +    # Faster check: does the number of lines match?
> +    if len(added_lines) != len(removed_lines):
> +        return False
> +    # Compare the multisets of added and removed lines.
> +    return sorted(added_lines) == sorted(removed_lines)

I guess it's a fine approach to the problem, but I wonder if it would not be
better to use a slightly less accurate strategy that is nicer to memory and
CPU. What I have in mind is to incrementally compute a hash value that gives
the same result even if the lines are in a different order. Drawing from
discussions on StackOverflow [1], I think a sum of Python's hash() would work.
My test was:

    def unordered_hash(lines):
        h = 0
        for line in lines:
            h += hash(line)
        return h

    h1 = unordered_hash(open('tests/data/text_order1').readlines())
    h2 = unordered_hash(open('tests/data/text_order2').readlines())
    print(h1, h2, h1 == h2)

That way, it could probably be implemented directly in the difference module
and work for other file types than just text files.

[1]: https://stackoverflow.com/questions/30734848/order-independant-hash-algorithm

-- 
Lunar
lu...@debian.org
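[Editorial sketch of the summed-hash idea discussed above. One caveat worth noting: since Python 3.3, hash() of str is randomized per interpreter process, so the sum is only comparable within a single run. The variant below uses hashlib instead, which is stable across runs; the helper name unordered_digest is hypothetical and not part of the patch or the mail.]

```python
import hashlib


def unordered_digest(lines):
    """Order-independent digest of an iterable of text lines.

    Summing the per-line SHA-256 digests modulo 2**256 gives the same
    result regardless of line ordering, and unlike the built-in hash()
    it does not change between interpreter runs.
    """
    total = 0
    for line in lines:
        d = hashlib.sha256(line.encode("utf-8")).digest()
        total = (total + int.from_bytes(d, "big")) % (1 << 256)
    return total


# Reordering the lines leaves the digest unchanged...
assert unordered_digest(["b\n", "a\n", "c\n"]) == unordered_digest(["a\n", "b\n", "c\n"])
# ...while changing the content (with high probability) does not.
assert unordered_digest(["a\n", "b\n"]) != unordered_digest(["a\n", "c\n"])
```

As with the summed hash() approach, this trades accuracy for memory: collisions are possible in principle (e.g. it cannot distinguish files where one line appears more or fewer times if the sums happen to coincide), but incrementally summing digests avoids holding both line lists in memory at once.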
_______________________________________________
Reproducible-builds mailing list
Reproducible-builds@lists.alioth.debian.org
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/reproducible-builds