On Mon, Sep 5, 2016 at 7:20 PM, Stefan Beller <sbel...@google.com> wrote:
>> If I understand correctly, this is to ensure that we don't keep re-hashing
>> each line, right?
> No, this is to ensure we have the context sensitivity of one prior line.
> In the collection phase we look at each line of the patch and make a hash of
> it. Then we store that hash temporarily (think of a state machine that goes
> line by line and always keeps the hash of the last line).
> What we store in the hashmaps is hash(current line) ^
> hash(previous applicable line).
> By "previous applicable line" I mean any line starting with " " or "+" when
> the current line starts with "+", and any line starting with " " or "-" when
> the current line starts with "-".
> When going through the second pass and actually emitting colored lines,
> we only find matches in the hash map if the current line AND the previous
> line match, as we look up by hash code. I.e. if we have a moved line but
> the previous line changed, we do not find it in the hashmap and we don't
> color it, such that the reviewer can spot a permutation.
> So in the second phase we also need access to the previous line, so maybe
> we could also go with just carrying the line along instead of two hash codes.
> But that implementation detail seems like a trade-off to me, where I'd lean
> toward keeping the hashes around, as lines may be very long in bad cases,
> whereas the hash code is short and cheap to compute.
> (I am referring to http://i.imgur.com/MnaSZ1D.png where in the malicious
> case all lines were moved there as well, but permuted.)
Good, that clears up a lot of my questions. It seems like a reasonable
algorithm, and I like the example of permuted lines.
I think if we can find some real-world data on how much slower this
is, we can figure out what kind of configuration we want to give it. It
seems incredibly helpful to me, but having a way out is
nice in case it ends up being too costly. I suspect that will only be
the case for a large patch.