[PATCH v5 4/9] patch-id: make it stable against hunk reordering

2014-04-24 Thread Michael S. Tsirkin
Patch id changes if users
1. reorder file diffs that make up a patch
or
2. split a patch up to multiple diffs that touch the same path
(keeping hunks within a single diff ordered to make patch valid).

As the result is functionally equivalent, a different patch id is
surprising to many users.
In particular, reordering files using diff -O is helpful to make patches
more readable (e.g. API header diff before implementation diff).

Add an option to change patch-id behaviour making it stable against
these two kinds of patch change:
1. calculate SHA1 hash for each hunk separately and sum all hashes
(using a symmetrical sum) to get patch id
2. hash the file-level headers together with each hunk (not just the
first hunk)

We use a 20byte sum and not xor - since xor would give 0 output
for patches that have two identical diffs, which isn't all that
unlikely (e.g. append the same line in two places).

The new behaviour is enabled
- when patchid.stable is true
- when --stable flag is present

Using a new flag --unstable or setting patchid.stable to false force
the historical behaviour.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 builtin/patch-id.c | 89 --
 1 file changed, 73 insertions(+), 16 deletions(-)

diff --git a/builtin/patch-id.c b/builtin/patch-id.c
index 3cfe02d..037cf2f 100644
--- a/builtin/patch-id.c
+++ b/builtin/patch-id.c
@@ -1,17 +1,14 @@
 #include builtin.h
 
-static void flush_current_id(int patchlen, unsigned char *id, git_SHA_CTX *c)
+static void flush_current_id(int patchlen, unsigned char *id, unsigned char 
*result)
 {
-   unsigned char result[20];
char name[50];
 
if (!patchlen)
return;
 
-   git_SHA1_Final(result, c);
memcpy(name, sha1_to_hex(id), 41);
printf(%s %s\n, sha1_to_hex(result), name);
-   git_SHA1_Init(c);
 }
 
 static int remove_space(char *line)
@@ -56,10 +53,31 @@ static int scan_hunk_header(const char *p, int *p_before, 
int *p_after)
return 1;
 }
 
-static int get_one_patchid(unsigned char *next_sha1, git_SHA_CTX *ctx, struct 
strbuf *line_buf)
+static void flush_one_hunk(unsigned char *result, git_SHA_CTX *ctx)
 {
-   int patchlen = 0, found_next = 0;
+   unsigned char hash[20];
+   unsigned short carry = 0;
+   int i;
+
+   git_SHA1_Final(hash, ctx);
+   git_SHA1_Init(ctx);
+   /* 20-byte sum, with carry */
+   for (i = 0; i  20; ++i) {
+   carry += result[i] + hash[i];
+   result[i] = carry;
+   carry = 8;
+   }
+}
+
+static int get_one_patchid(unsigned char *next_sha1, unsigned char *result,
+  struct strbuf *line_buf, int stable)
+{
+   int patchlen = 0, found_next = 0, hunks = 0;
int before = -1, after = -1;
+   git_SHA_CTX ctx, header_ctx;
+
+   git_SHA1_Init(ctx);
+   hashclr(result);
 
while (strbuf_getwholeline(line_buf, stdin, '\n') != EOF) {
char *line = line_buf-buf;
@@ -98,7 +116,19 @@ static int get_one_patchid(unsigned char *next_sha1, 
git_SHA_CTX *ctx, struct st
if (before == 0  after == 0) {
if (!memcmp(line, @@ -, 4)) {
/* Parse next hunk, but ignore line numbers.  */
+   if (stable) {
+   /* Hash the file-level headers together 
with each hunk. */
+   if (hunks) {
+   flush_one_hunk(result, ctx);
+   /* Prepend saved header ctx for 
next hunk.  */
+   memcpy(ctx, header_ctx, 
sizeof(ctx));
+   } else {
+   /* Save header ctx for next 
hunk.  */
+   memcpy(header_ctx, ctx, 
sizeof(ctx));
+   }
+   }
scan_hunk_header(line, before, after);
+   hunks++;
continue;
}
 
@@ -107,7 +137,10 @@ static int get_one_patchid(unsigned char *next_sha1, 
git_SHA_CTX *ctx, struct st
break;
 
/* Else we're parsing another header.  */
+   if (stable  hunks)
+   flush_one_hunk(result, ctx);
before = after = -1;
+   hunks = 0;
}
 
/* If we get here, we're inside a hunk.  */
@@ -119,39 +152,63 @@ static int get_one_patchid(unsigned char *next_sha1, 
git_SHA_CTX *ctx, struct st
/* Compute the sha without whitespace */
len = remove_space(line);
patchlen += len;
-   

Re: [PATCH v5 4/9] patch-id: make it stable against hunk reordering

2014-04-24 Thread Jonathan Nieder
Hi,

Michael S. Tsirkin wrote:

 Patch id changes if users
 1. reorder file diffs that make up a patch
 or
 2. split a patch up to multiple diffs that touch the same path
 (keeping hunks within a single diff ordered to make patch valid).

 As the result is functionally equivalent, a different patch id is
 surprising to many users.

Hm.

If the goal is that functionally equivalent patches are guaranteed to
produce the same patch-id, I wonder if we should be doing something
like the following:

 1. apply the patch in memory
 2. generate a new diff
 3. use that new diff to produce a patch-id

Otherwise issues like --diff-algorithm=patience versus =myers will
create trouble too.  I don't think that avoiding false negatives for
patch comparison without doing something like that is really possible.

On the other hand if someone reorders file diffs within a patch, that
is a potentially very common thing to do and something worth fixing.
In other words, while your (1) makes perfect sense to me, case (2)
seems less convincing.

The downside of allowing reordering hunks is that it can potentially
make different patches to be treated the same (for example if they
were making similar changes to different functions) when the ordering
previously caused them to be distinguished.  But that wasn't something
people could count on anyway, so I don't mind.

Should the internal patch-id computation used by commands like 'git
cherry' (see diff.c::diff_get_patch_id) get the same change?  (Not a
rhetorical question --- I don't know what the right choice would be
there.)

[...]
 The new behaviour is enabled
 - when patchid.stable is true
 - when --stable flag is present

 Using a new flag --unstable or setting patchid.stable to false force
 the historical behaviour.

Which is the default?

[...]
  builtin/patch-id.c | 89 
 --
  1 file changed, 73 insertions(+), 16 deletions(-)

Documentation?  Tests?

Thanks,
Jonathan
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 4/9] patch-id: make it stable against hunk reordering

2014-04-24 Thread Junio C Hamano
Jonathan Nieder jrnie...@gmail.com writes:

 Should the internal patch-id computation used by commands like 'git
 cherry' (see diff.c::diff_get_patch_id) get the same change?  (Not a
 rhetorical question --- I don't know what the right choice would be
 there.)

I thought about it but I did not think of a reason why.  If we do
not store the patch-id (it would be a misnomer especially after this
series, it is mor like patch signature), and we generate the patch
to be hashed internally without getting affected by any user input
given per-invocation, then nothing is externally observable even if
we used two completely different definition of patch id computation,
and I think these preconditions do hold.

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 4/9] patch-id: make it stable against hunk reordering

2014-04-24 Thread Michael S. Tsirkin
On Thu, Apr 24, 2014 at 10:30:44AM -0700, Jonathan Nieder wrote:
 Hi,
 
 Michael S. Tsirkin wrote:
 
  Patch id changes if users
  1. reorder file diffs that make up a patch
  or
  2. split a patch up to multiple diffs that touch the same path
  (keeping hunks within a single diff ordered to make patch valid).
 
  As the result is functionally equivalent, a different patch id is
  surprising to many users.
 
 Hm.
 
 If the goal is that functionally equivalent patches are guaranteed to
 produce the same patch-id, I wonder if we should be doing something
 like the following:
 
  1. apply the patch in memory
  2. generate a new diff
  3. use that new diff to produce a patch-id
 
 Otherwise issues like --diff-algorithm=patience versus =myers will
 create trouble too.  I don't think that avoiding false negatives for
 patch comparison without doing something like that is really possible.
 
 On the other hand if someone reorders file diffs within a patch, that
 is a potentially very common thing to do and something worth fixing.
 In other words, while your (1) makes perfect sense to me, case (2)
 seems less convincing.

I agree it's less convincing: one would have to edit patch
by hand (which I used to sometimes do to make important parts more prominent,
but stopped doing in favor of splitting a patch).
I'm not 100% sure whether it's worth supporting or not.


 The downside of allowing reordering hunks is that it can potentially
 make different patches to be treated the same (for example if they
 were making similar changes to different functions) when the ordering
 previously caused them to be distinguished.  But that wasn't something
 people could count on anyway, so I don't mind.

I think this example convinces me. I'll drop this support in the next version.

 Should the internal patch-id computation used by commands like 'git
 cherry' (see diff.c::diff_get_patch_id) get the same change?  (Not a
 rhetorical question --- I don't know what the right choice would be
 there.)
 
 [...]
  The new behaviour is enabled
  - when patchid.stable is true
  - when --stable flag is present
 
  Using a new flag --unstable or setting patchid.stable to false force
  the historical behaviour.
 
 Which is the default?
 
 [...]
   builtin/patch-id.c | 89 
  --
   1 file changed, 73 insertions(+), 16 deletions(-)
 
 Documentation?  Tests?
 
 Thanks,
 Jonathan
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html