Re: [PATCHv3 20/20] diff.c: color moved lines differently

2017-05-19 Thread Jonathan Tan
On Fri, May 19, 2017 at 11:40 AM, Stefan Beller  wrote:
>>> +static unsigned get_line_hash(struct buffered_patch_line *line, unsigned 
>>> ignore_ws)
>>> +{
>>> + static struct strbuf sb = STRBUF_INIT;
>>> +
>>> + if (ignore_ws) {
>>> + strbuf_reset();
>>> + get_ws_cleaned_string(line, );
>>
>> Memory leak here, I think.
>
> It's static, so we don't care.
> I can make it non-static and release the memory in a resend.

Ah, I missed the "static". It seems that "static" is used elsewhere
too, so these functions are not reentrant anyway, so this is fine.


Re: [PATCHv3 20/20] diff.c: color moved lines differently

2017-05-19 Thread Stefan Beller
On Fri, May 19, 2017 at 11:23 AM, Jonathan Tan  wrote:
> On Thu, 18 May 2017 12:37:46 -0700
> Stefan Beller  wrote:
>
> [snip]
>
>> Instead this provides a dynamic programming greedy algorithm that
>
> Not sure if this is called "dynamic programming".

https://loveforprogramming.quora.com/Backtracking-Memoization-Dynamic-Programming
http://stackoverflow.com/questions/3592943/difference-between-back-tracking-and-dynamic-programming

Instead of doing backtracking (finding the lengthiest hunk for
each line), we keep a set of potential hunks around, this sounds
very much like the examples given in these links.


> The first part of the commit message could probably be written more
> concisely, like the following:
...
> Having said that, thanks - this version is much more like what I would
> expect.

Thanks for giving a more concise commit message, will fix in a reroll.


>
>> +static int buffered_patch_line_cmp_no_ws(const struct buffered_patch_line 
>> *a,
>
>> +static int buffered_patch_line_cmp(const struct buffered_patch_line *a,
>
> Instead of having 2 versions of all the comparison functions, could the
> ws-ness be passed as the keydata?

No, this is misuse use of the API, peff explains:

https://public-inbox.org/git/20170513085050.plmau5ffvzn6i...@sigill.intra.peff.net/



>
>> +static unsigned get_line_hash(struct buffered_patch_line *line, unsigned 
>> ignore_ws)
>> +{
>> + static struct strbuf sb = STRBUF_INIT;
>> +
>> + if (ignore_ws) {
>> + strbuf_reset();
>> + get_ws_cleaned_string(line, );
>
> Memory leak here, I think.

It's static, so we don't care.
I can make it non-static and release the memory in a resend.

>
>> + return memhash(sb.buf, sb.len);
>> + } else {
>> + return memhash(line->line, line->len);
>> + }
>> +}
>
> [snip]
>
>> +static void add_lines_to_move_detection(struct diff_options *o)
>> +{
>> + struct moved_entry *prev_line;
>
> gcc says (rightly) that this must be initialized.

This is one of the last refactorings I did on this patch, moving
the prev_line out of the diff_options struct (which is memset in its
init), forgot to init it here. will fix.

>> + int alt_flag = 0;
>
> Probably call this "use_alt_color" or something similar.

Sounds better than alt_flag.

>> + struct moved_entry *p = pmb[i];
>> + struct moved_entry *pnext = (p && p->next_line) ?
>> + p->next_line : NULL;
>> + if (pnext &&
>> + !buffered_patch_line_cmp(pnext->line, l, o)) {
>> + pmb[i] = p->next_line;
>> + } else {
>> + pmb[i] = NULL;
>> + }
>
> Memory leak of pmb[i] somewhere here?

pmb[] holds pointers into moved)entry elements that
are obtained via  hashmap_get_next(hm, match), such that
any pmb[] element is also part of a hashmap.

When freeing the hashmap, we'll free the memory. This
array doesn't own the underlying memory.

>> @@ -4874,6 +5114,11 @@ static void diff_flush_patch_all_file_pairs(struct 
>> diff_options *o)
>>
>>   if (o->use_buffer) {
>> + if (o->color_moved) {
>
> Can you just declare the two hashmaps here, so that we do not need to
> put them in o? They don't seem to be used outside this block anyway.

Obviously. Thanks for that pointer as well.


>> diff --git a/t/t4015-diff-whitespace.sh b/t/t4015-diff-whitespace.sh
>> index 289806d0c7..232d9ad55e 100755
>> --- a/t/t4015-diff-whitespace.sh
>> +++ b/t/t4015-diff-whitespace.sh
>
> As for the tests, also add a test checking the interaction with
> whitespace highlighting, and a test showing that diff errors out if we
> ask for both move coloring and word-by-word diffing.

We do not error out, but ignore the move heuristic doesn't find any
blocks. I can make it error out, instead. (and add tests)

Thanks,
Stefan


Re: [PATCHv3 20/20] diff.c: color moved lines differently

2017-05-19 Thread Jonathan Tan
On Thu, 18 May 2017 12:37:46 -0700
Stefan Beller  wrote:

[snip]

> Instead this provides a dynamic programming greedy algorithm that

Not sure if this is called "dynamic programming".

> finds the largest moved hunk and then switches color to the
> alternative color for the next hunk. By doing this any permutation is
> recognized and displayed. That implies that there is no dedicated
> boundary or inside-hunk color, but instead we'll have just two colors
> alternating for hunks.

[snip]

I would title this "color moved blocks differently" to emphasize that we
are treating the moves in terms of blocks, not just lines.

The first part of the commit message could probably be written more
concisely, like the following:

  When a patch consists mostly of moving blocks of code around, it can
  be quite tedious to ensure that the blocks are moved verbatim, and not
  undesirably modified in the move. To that end, color blocks that are
  moved within the same patch differently. For example (OM, del, add,
  and NM are different colors):

[OM]  -void sensitive_stuff(void)
[OM]  -{
[OM]  -if (!is_authorized_user())
[OM]  -die("unauthorized");
[OM]  -sensitive_stuff(spanning,
[OM]  -multiple,
[OM]  -lines);
[OM]  -}

   void another_function()
   {
[del] -printf("foo");
[add] +printf("bar");
   }

[NM]  +void sensitive_stuff(void)
[NM]  +{
[NM]  +if (!is_authorized_user())
[NM]  +die("unauthorized");
[NM]  +sensitive_stuff(spanning,
[NM]  +multiple,
[NM]  +lines);
[NM]  +}

  Adjacent blocks are colored differently. For example, in this
  potentially malicious patch, the swapping of blocks can be spotted:

[OM]  -void sensitive_stuff(void)
[OM]  -{
[OMA] -if (!is_authorized_user())
[OMA] -die("unauthorized");
[OM]  -sensitive_stuff(spanning,
[OM]  -multiple,
[OM]  -lines);
[OMA] -}

   void another_function()
   {
[del] -printf("foo");
[add] +printf("bar");
   }

[NM]  +void sensitive_stuff(void)
[NM]  +{
[NMA] +sensitive_stuff(spanning,
[NMA] +multiple,
[NMA] +lines);
[NM]  +if (!is_authorized_user())
[NM]  +die("unauthorized");
[NMA] +}

Having said that, thanks - this version is much more like what I would
expect.

> +static int buffered_patch_line_cmp_no_ws(const struct buffered_patch_line *a,
> +  const struct buffered_patch_line *b,
> +  const void *keydata)
> +{
> + int ret;
> + struct strbuf sba = STRBUF_INIT;
> + struct strbuf sbb = STRBUF_INIT;
> +
> + get_ws_cleaned_string(a, );
> + get_ws_cleaned_string(b, );
> + ret = sba.len != sbb.len || strncmp(sba.buf, sbb.buf, sba.len);
> + strbuf_release();
> + strbuf_release();
> + return ret;
> +}
> +
> +static int buffered_patch_line_cmp(const struct buffered_patch_line *a,
> +const struct buffered_patch_line *b,
> +const void *keydata)
> +{
> + return a->len != b->len || strncmp(a->line, b->line, a->len);
> +}

Instead of having 2 versions of all the comparison functions, could the
ws-ness be passed as the keydata?

> +static unsigned get_line_hash(struct buffered_patch_line *line, unsigned 
> ignore_ws)
> +{
> + static struct strbuf sb = STRBUF_INIT;
> +
> + if (ignore_ws) {
> + strbuf_reset();
> + get_ws_cleaned_string(line, );

Memory leak here, I think.

> + return memhash(sb.buf, sb.len);
> + } else {
> + return memhash(line->line, line->len);
> + }
> +}

[snip]

> +static void add_lines_to_move_detection(struct diff_options *o)
> +{
> + struct moved_entry *prev_line;

gcc says (rightly) that this must be initialized.

> +
> + int n;
> + for (n = 0; n < o->line_buffer_nr; n++) {
> + int sign = 0;
> + struct hashmap *hm;
> + struct moved_entry *key;
> +
> + switch (o->line_buffer[n].sign) {
> + case '+':
> + sign = '+';
> + hm = o->added_lines;
> + break;
> + case '-':
> + sign = '-';
> + hm = o->deleted_lines;
> + break;
> + case ' ':
> + default:
> + prev_line = NULL;
> + continue;
> + }
> +
> + key = prepare_entry(o, n);
> + if (prev_line &&
> + 

[PATCHv3 20/20] diff.c: color moved lines differently

2017-05-18 Thread Stefan Beller
When there is a lot of code moved around such as in 11979b9 (2005-11-18,
"http.c: reorder to avoid compilation failure.") for example, the review
process is quite hard, as it is not mentally challenging.  It is a rather
tedious process, that gets boring quickly. However you still need to read
through all of the code to make sure the moved lines are there as supposed.

While it is trivial to color up a patch like the following

$ git diff
diff --git a/file2.c b/file2.c
index 9163a0f..8e66dc0 100644
--- a/file2.c
+++ b/file2.c
@@ -3,13 +3,6 @@ void *xmemdupz(const void *data, size_t len)
return memcpy(xmallocz(len), data, len);
 }

-int secure_foo(struct user *u)
-{
-   if (!u->is_allowed_foo)
-   return;
-   foo(u);
-}
-
 char *xstrndup(const char *str, size_t len)
 {
char *p = memchr(str, '\0', len);
diff --git a/test.c b/test.c
index a95e6fe..81eb0eb 100644
--- a/test.c
+++ b/test.c
@@ -18,6 +18,13 @@ ssize_t pread_in_full(int fd, void *buf, size_t count, 
off_t offset)
return total;
 }

+int secure_foo(struct user *u)
+{
+   if (!u->is_allowed_foo)
+   return;
+   foo(u);
+}
+
 int xdup(int fd)
 {
int ret = dup(fd);

as in this patch all lines that add or remove lines
should be colored in the new color that indicates moved
lines.

However the intention of this patch is to aid reviewers
to spotting permutations in the moved code. So consider the
following malicious move:

diff --git a/file2.c b/file2.c
index 9163a0f..8e66dc0 100644
--- a/file2.c
+++ b/file2.c
@@ -3,13 +3,6 @@ void *xmemdupz(const void *data, size_t len)
return memcpy(xmallocz(len), data, len);
 }

-int secure_foo(struct user *u)
-{
-   if (!u->is_allowed_foo)
-   return;
-   foo(u);
-}
-
 char *xstrndup(const char *str, size_t len)
 {
char *p = memchr(str, '\0', len);
diff --git a/test.c b/test.c
index a95e6fe..a679c40 100644
--- a/test.c
+++ b/test.c
@@ -18,6 +18,13 @@ ssize_t pread_in_full(int fd, void *buf, size_t count, 
off_t offset)
return total;
 }

+int secure_foo(struct user *u)
+{
+   foo(u);
+   if (!u->is_allowed_foo)
+   return;
+}
+
 int xdup(int fd)
 {
int ret = dup(fd);

If the moved code is larger, it is easier to hide some permutation in the
code, which is why we would not want to color all lines as "moved" in this
case. So we do not just need to color lines differently that are added and
removed in the same diff, we need to tweak the algorithm a bit more.

As the reviewers attention should be brought to the places, where the
difference is introduced to the moved code, we cannot just have one new
color for all of moved code.

First I implemented an alternative design, which would show a moved hunk
in one color, and its boundaries in another color. This idea was error
prone as it inspected each line and its neighboring lines to determine
if the line was (a) moved and (b) if was deep inside a hunk by having
matching neighboring lines. This is unreliable as the we can construct
hunks which have equal neighbors that just exceed the number of lines
inspected. (Think of 'AXYZBXYZCXYZD..' with each letter as a line, that
is permutated to AXYZCXYZBXYZD..').

Instead this provides a dynamic programming greedy algorithm that finds
the largest moved hunk and then switches color to the alternative color
for the next hunk. By doing this any permutation is recognized and
displayed. That implies that there is no dedicated boundary or
inside-hunk color, but instead we'll have just two colors alternating
for hunks.

It would be a bit more UX friendly if the two corresponding hunks
(of added and deleted lines) for one move would get the same color id.
(Both get "regular moved" or "alternative moved"). This problem is
deferred to a later patch for now.

A note on the options '--submodule=diff' and '--color-words/--word-diff':
In the conversion to use emit_line in the prior patches both submodules
as well as word diff output carefully chose to call emit_line with sign=0.
All output with sign=0 is ignored for move detection purposes in this
patch, such that no weird looking output will be generated for these
cases. This leads to another thought: We could pass on '--color-moved' to
submodules such that they color up moved lines for themselves. If we'd do
so only line moves within a repository boundary are marked up.

Algorithm-by: Jonathan Tan 
Signed-off-by: Stefan Beller 
---
 Documentation/config.txt   |  14 ++-
 diff.c | 266 +++--
 diff.h |  11 +-
 t/t4015-diff-whitespace.sh | 229