Re: [PATCHv3 20/20] diff.c: color moved lines differently
On Fri, May 19, 2017 at 11:40 AM, Stefan Bellerwrote: >>> +static unsigned get_line_hash(struct buffered_patch_line *line, unsigned >>> ignore_ws) >>> +{ >>> + static struct strbuf sb = STRBUF_INIT; >>> + >>> + if (ignore_ws) { >>> + strbuf_reset(); >>> + get_ws_cleaned_string(line, ); >> >> Memory leak here, I think. > > It's static, so we don't care. > I can make it non-static and release the memory in a resend. Ah, I missed the "static". It seems that "static" is used elsewhere too, so these functions are not reentrant anyway, so this is fine.
Re: [PATCHv3 20/20] diff.c: color moved lines differently
On Fri, May 19, 2017 at 11:23 AM, Jonathan Tanwrote: > On Thu, 18 May 2017 12:37:46 -0700 > Stefan Beller wrote: > > [snip] > >> Instead this provides a dynamic programming greedy algorithm that > > Not sure if this is called "dynamic programming". https://loveforprogramming.quora.com/Backtracking-Memoization-Dynamic-Programming http://stackoverflow.com/questions/3592943/difference-between-back-tracking-and-dynamic-programming Instead of doing backtracking (finding the lengthiest hunk for each line), we keep a set of potential hunks around, this sounds very much like the examples given in these links. > The first part of the commit message could probably be written more > concisely, like the following: ... > Having said that, thanks - this version is much more like what I would > expect. Thanks for giving a more concise commit message, will fix in a reroll. > >> +static int buffered_patch_line_cmp_no_ws(const struct buffered_patch_line >> *a, > >> +static int buffered_patch_line_cmp(const struct buffered_patch_line *a, > > Instead of having 2 versions of all the comparison functions, could the > ws-ness be passed as the keydata? No, this is misuse use of the API, peff explains: https://public-inbox.org/git/20170513085050.plmau5ffvzn6i...@sigill.intra.peff.net/ > >> +static unsigned get_line_hash(struct buffered_patch_line *line, unsigned >> ignore_ws) >> +{ >> + static struct strbuf sb = STRBUF_INIT; >> + >> + if (ignore_ws) { >> + strbuf_reset(); >> + get_ws_cleaned_string(line, ); > > Memory leak here, I think. It's static, so we don't care. I can make it non-static and release the memory in a resend. > >> + return memhash(sb.buf, sb.len); >> + } else { >> + return memhash(line->line, line->len); >> + } >> +} > > [snip] > >> +static void add_lines_to_move_detection(struct diff_options *o) >> +{ >> + struct moved_entry *prev_line; > > gcc says (rightly) that this must be initialized. This is one of the last refactorings I did on this patch, moving the prev_line out of the diff_options struct (which is memset in its init), forgot to init it here. will fix. >> + int alt_flag = 0; > > Probably call this "use_alt_color" or something similar. Sounds better than alt_flag. >> + struct moved_entry *p = pmb[i]; >> + struct moved_entry *pnext = (p && p->next_line) ? >> + p->next_line : NULL; >> + if (pnext && >> + !buffered_patch_line_cmp(pnext->line, l, o)) { >> + pmb[i] = p->next_line; >> + } else { >> + pmb[i] = NULL; >> + } > > Memory leak of pmb[i] somewhere here? pmb[] holds pointers into moved)entry elements that are obtained via hashmap_get_next(hm, match), such that any pmb[] element is also part of a hashmap. When freeing the hashmap, we'll free the memory. This array doesn't own the underlying memory. >> @@ -4874,6 +5114,11 @@ static void diff_flush_patch_all_file_pairs(struct >> diff_options *o) >> >> if (o->use_buffer) { >> + if (o->color_moved) { > > Can you just declare the two hashmaps here, so that we do not need to > put them in o? They don't seem to be used outside this block anyway. Obviously. Thanks for that pointer as well. >> diff --git a/t/t4015-diff-whitespace.sh b/t/t4015-diff-whitespace.sh >> index 289806d0c7..232d9ad55e 100755 >> --- a/t/t4015-diff-whitespace.sh >> +++ b/t/t4015-diff-whitespace.sh > > As for the tests, also add a test checking the interaction with > whitespace highlighting, and a test showing that diff errors out if we > ask for both move coloring and word-by-word diffing. We do not error out, but ignore the move heuristic doesn't find any blocks. I can make it error out, instead. (and add tests) Thanks, Stefan
Re: [PATCHv3 20/20] diff.c: color moved lines differently
On Thu, 18 May 2017 12:37:46 -0700 Stefan Bellerwrote: [snip] > Instead this provides a dynamic programming greedy algorithm that Not sure if this is called "dynamic programming". > finds the largest moved hunk and then switches color to the > alternative color for the next hunk. By doing this any permutation is > recognized and displayed. That implies that there is no dedicated > boundary or inside-hunk color, but instead we'll have just two colors > alternating for hunks. [snip] I would title this "color moved blocks differently" to emphasize that we are treating the moves in terms of blocks, not just lines. The first part of the commit message could probably be written more concisely, like the following: When a patch consists mostly of moving blocks of code around, it can be quite tedious to ensure that the blocks are moved verbatim, and not undesirably modified in the move. To that end, color blocks that are moved within the same patch differently. For example (OM, del, add, and NM are different colors): [OM] -void sensitive_stuff(void) [OM] -{ [OM] -if (!is_authorized_user()) [OM] -die("unauthorized"); [OM] -sensitive_stuff(spanning, [OM] -multiple, [OM] -lines); [OM] -} void another_function() { [del] -printf("foo"); [add] +printf("bar"); } [NM] +void sensitive_stuff(void) [NM] +{ [NM] +if (!is_authorized_user()) [NM] +die("unauthorized"); [NM] +sensitive_stuff(spanning, [NM] +multiple, [NM] +lines); [NM] +} Adjacent blocks are colored differently. For example, in this potentially malicious patch, the swapping of blocks can be spotted: [OM] -void sensitive_stuff(void) [OM] -{ [OMA] -if (!is_authorized_user()) [OMA] -die("unauthorized"); [OM] -sensitive_stuff(spanning, [OM] -multiple, [OM] -lines); [OMA] -} void another_function() { [del] -printf("foo"); [add] +printf("bar"); } [NM] +void sensitive_stuff(void) [NM] +{ [NMA] +sensitive_stuff(spanning, [NMA] +multiple, [NMA] +lines); [NM] +if (!is_authorized_user()) [NM] +die("unauthorized"); [NMA] +} Having said that, thanks - this version is much more like what I would expect. > +static int buffered_patch_line_cmp_no_ws(const struct buffered_patch_line *a, > + const struct buffered_patch_line *b, > + const void *keydata) > +{ > + int ret; > + struct strbuf sba = STRBUF_INIT; > + struct strbuf sbb = STRBUF_INIT; > + > + get_ws_cleaned_string(a, ); > + get_ws_cleaned_string(b, ); > + ret = sba.len != sbb.len || strncmp(sba.buf, sbb.buf, sba.len); > + strbuf_release(); > + strbuf_release(); > + return ret; > +} > + > +static int buffered_patch_line_cmp(const struct buffered_patch_line *a, > +const struct buffered_patch_line *b, > +const void *keydata) > +{ > + return a->len != b->len || strncmp(a->line, b->line, a->len); > +} Instead of having 2 versions of all the comparison functions, could the ws-ness be passed as the keydata? > +static unsigned get_line_hash(struct buffered_patch_line *line, unsigned > ignore_ws) > +{ > + static struct strbuf sb = STRBUF_INIT; > + > + if (ignore_ws) { > + strbuf_reset(); > + get_ws_cleaned_string(line, ); Memory leak here, I think. > + return memhash(sb.buf, sb.len); > + } else { > + return memhash(line->line, line->len); > + } > +} [snip] > +static void add_lines_to_move_detection(struct diff_options *o) > +{ > + struct moved_entry *prev_line; gcc says (rightly) that this must be initialized. > + > + int n; > + for (n = 0; n < o->line_buffer_nr; n++) { > + int sign = 0; > + struct hashmap *hm; > + struct moved_entry *key; > + > + switch (o->line_buffer[n].sign) { > + case '+': > + sign = '+'; > + hm = o->added_lines; > + break; > + case '-': > + sign = '-'; > + hm = o->deleted_lines; > + break; > + case ' ': > + default: > + prev_line = NULL; > + continue; > + } > + > + key = prepare_entry(o, n); > + if (prev_line && > +
[PATCHv3 20/20] diff.c: color moved lines differently
When there is a lot of code moved around such as in 11979b9 (2005-11-18, "http.c: reorder to avoid compilation failure.") for example, the review process is quite hard, as it is not mentally challenging. It is a rather tedious process, that gets boring quickly. However you still need to read through all of the code to make sure the moved lines are there as supposed. While it is trivial to color up a patch like the following $ git diff diff --git a/file2.c b/file2.c index 9163a0f..8e66dc0 100644 --- a/file2.c +++ b/file2.c @@ -3,13 +3,6 @@ void *xmemdupz(const void *data, size_t len) return memcpy(xmallocz(len), data, len); } -int secure_foo(struct user *u) -{ - if (!u->is_allowed_foo) - return; - foo(u); -} - char *xstrndup(const char *str, size_t len) { char *p = memchr(str, '\0', len); diff --git a/test.c b/test.c index a95e6fe..81eb0eb 100644 --- a/test.c +++ b/test.c @@ -18,6 +18,13 @@ ssize_t pread_in_full(int fd, void *buf, size_t count, off_t offset) return total; } +int secure_foo(struct user *u) +{ + if (!u->is_allowed_foo) + return; + foo(u); +} + int xdup(int fd) { int ret = dup(fd); as in this patch all lines that add or remove lines should be colored in the new color that indicates moved lines. However the intention of this patch is to aid reviewers to spotting permutations in the moved code. So consider the following malicious move: diff --git a/file2.c b/file2.c index 9163a0f..8e66dc0 100644 --- a/file2.c +++ b/file2.c @@ -3,13 +3,6 @@ void *xmemdupz(const void *data, size_t len) return memcpy(xmallocz(len), data, len); } -int secure_foo(struct user *u) -{ - if (!u->is_allowed_foo) - return; - foo(u); -} - char *xstrndup(const char *str, size_t len) { char *p = memchr(str, '\0', len); diff --git a/test.c b/test.c index a95e6fe..a679c40 100644 --- a/test.c +++ b/test.c @@ -18,6 +18,13 @@ ssize_t pread_in_full(int fd, void *buf, size_t count, off_t offset) return total; } +int secure_foo(struct user *u) +{ + foo(u); + if (!u->is_allowed_foo) + return; +} + int xdup(int fd) { int ret = dup(fd); If the moved code is larger, it is easier to hide some permutation in the code, which is why we would not want to color all lines as "moved" in this case. So we do not just need to color lines differently that are added and removed in the same diff, we need to tweak the algorithm a bit more. As the reviewers attention should be brought to the places, where the difference is introduced to the moved code, we cannot just have one new color for all of moved code. First I implemented an alternative design, which would show a moved hunk in one color, and its boundaries in another color. This idea was error prone as it inspected each line and its neighboring lines to determine if the line was (a) moved and (b) if was deep inside a hunk by having matching neighboring lines. This is unreliable as the we can construct hunks which have equal neighbors that just exceed the number of lines inspected. (Think of 'AXYZBXYZCXYZD..' with each letter as a line, that is permutated to AXYZCXYZBXYZD..'). Instead this provides a dynamic programming greedy algorithm that finds the largest moved hunk and then switches color to the alternative color for the next hunk. By doing this any permutation is recognized and displayed. That implies that there is no dedicated boundary or inside-hunk color, but instead we'll have just two colors alternating for hunks. It would be a bit more UX friendly if the two corresponding hunks (of added and deleted lines) for one move would get the same color id. (Both get "regular moved" or "alternative moved"). This problem is deferred to a later patch for now. A note on the options '--submodule=diff' and '--color-words/--word-diff': In the conversion to use emit_line in the prior patches both submodules as well as word diff output carefully chose to call emit_line with sign=0. All output with sign=0 is ignored for move detection purposes in this patch, such that no weird looking output will be generated for these cases. This leads to another thought: We could pass on '--color-moved' to submodules such that they color up moved lines for themselves. If we'd do so only line moves within a repository boundary are marked up. Algorithm-by: Jonathan TanSigned-off-by: Stefan Beller --- Documentation/config.txt | 14 ++- diff.c | 266 +++-- diff.h | 11 +- t/t4015-diff-whitespace.sh | 229