Re: [BUG] add_again() off-by-one error in custom format

2017-06-22 Thread Jeff King
On Thu, Jun 22, 2017 at 08:19:40PM +0200, René Scharfe wrote:

> > I'd be OK with keeping it if we could reduce the number of magic
> > numbers. E.g., rather than 32 elsewhere use:
> > 
> >(sizeof(*loose_objects_subdir_bitmap) * CHAR_BIT)
> 
> We have a bitsizeof macro for that.
> 
> > and similarly rather than 8 here use
> > 
> >256 / sizeof(*loose_objects_subdir_bitmap) / CHAR_BIT
> 
> If we're pretending not to know the number of bits in a byte then we
> need to round up, and we have DIV_ROUND_UP for that. :)

Thanks, I meant to mention the rounding thing but forgot. I didn't know
about either of those macros, though.

> There is another example of a bitmap in shallow.c (just search for
> "32").  It would benefit from the macros mentioned above.  That
> might make it easier to switch to the native word size (unsigned int)
> instead of using uint32_t everywhere.
> 
> But perhaps this one is actually a candidate for using EWAH, depending
> on the number of refs the code is supposed to handle.

I thought at first you meant real EWAH bitmaps. But those aren't random
access (so you have to read them into an uncompressed bitmap in memory,
which is why ewah/bitmap.c exists). But if you just mean reusing those
bitmaps, then yeah, possibly.

It looks like it's using a commit slab, though. If you know the number
of bits you need up front, then the commit slab can do it without any
waste. I didn't dig enough to see if that's what it's doing or not.
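
(For illustration: a minimal sketch of the commit-slab shape described above,
assuming the define_commit_slab()/init_*_with_stride() macros from
commit-slab.h -- this is not the shallow.c code itself.)

    #include "cache.h"
    #include "commit.h"
    #include "commit-slab.h"

    /* one group of bitmap words per commit; stride chosen up front, so no waste */
    define_commit_slab(ref_bits, uint32_t);
    static struct ref_bits ref_bits;

    static void ref_bits_setup(unsigned int nr_refs)
    {
            /* each commit gets exactly enough uint32_t words for nr_refs bits */
            init_ref_bits_with_stride(&ref_bits,
                                      DIV_ROUND_UP(nr_refs, bitsizeof(uint32_t)));
    }

    static void set_ref_bit(struct commit *c, unsigned int ref_nr)
    {
            uint32_t *w = ref_bits_at(&ref_bits, c);
            w[ref_nr / bitsizeof(uint32_t)] |= 1u << (ref_nr % bitsizeof(uint32_t));
    }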

-Peff


Re: [BUG] add_again() off-by-one error in custom format

2017-06-22 Thread René Scharfe

On 18.06.2017 at 15:56, Jeff King wrote:

On Sun, Jun 18, 2017 at 02:59:04PM +0200, René Scharfe wrote:


@@ -1586,6 +1587,9 @@ extern struct alternate_object_database {
struct strbuf scratch;
size_t base_len;
+   uint32_t loose_objects_subdir_bitmap[8];


Is it worth the complexity of having an actual bitmap and not just an
array of char? I guess it's not _that_ complex to access the bits, but
there are a lot of magic numbers involved.


That would be 224 bytes more per alternate_object_database, and we'd
gain simpler code.  Hmm.  We could add some bitmap helper macros, but
you're probably right that the first version should use the simplest
form for representing small bitmaps that we currently have.


I'd be OK with keeping it if we could reduce the number of magic
numbers. E.g., rather than 32 elsewhere use:

   (sizeof(*loose_objects_subdir_bitmap) * CHAR_BIT)


We have a bitsizeof macro for that.


and similarly rather than 8 here use

   256 / sizeof(*loose_objects_subdir_bitmap) / CHAR_BIT


If we're pretending not to know the number of bits in a byte then we
need to round up, and we have DIV_ROUND_UP for that. :)
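
(For illustration, a sketch of what the sizing and bit access look like once
both macros are used -- assuming bitsizeof() and DIV_ROUND_UP() from
git-compat-util.h; this is not code from the patch below:)

    #include "git-compat-util.h"

    /* 256 object subdirectories, one bit each; no literal 8 or 32 anywhere */
    static uint32_t subdir_bitmap[DIV_ROUND_UP(256, bitsizeof(uint32_t))];

    static inline void set_subdir_seen(unsigned int subdir_nr)
    {
            subdir_bitmap[subdir_nr / bitsizeof(*subdir_bitmap)] |=
                    1u << (subdir_nr % bitsizeof(*subdir_bitmap));
    }

    static inline int subdir_seen(unsigned int subdir_nr)
    {
            return !!(subdir_bitmap[subdir_nr / bitsizeof(*subdir_bitmap)] &
                      (1u << (subdir_nr % bitsizeof(*subdir_bitmap))));
    }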


There's also already a bitmap data structure in ewah/bitmap.c. It's a
little bit overkill, though, because it mallocs and will grow the bitmap
as needed.


Yes, I feel that's too big a hammer for this purpose.

There is another example of a bitmap in shallow.c (just search for
"32").  It would benefit from the macros mentioned above.  That
might make it easier to switch to the native word size (unsigned int)
instead of using uint32_t everywhere.

But perhaps this one is actually a candidate for using EWAH, depending
on the number of refs the code is supposed to handle.


+static void read_loose_object_subdir(struct alternate_object_database *alt,
+int subdir_nr)


I think it's nice to pull this out into a helper function. I do wonder
if it should just be reusing for_each_loose_file_in_objdir(). You'd just
need to expose the in_obj_subdir() variant publicly.

It does do slightly more than we need (for the callbacks it actually
builds the filename), but I doubt memcpy()ing a few bytes would be
measurable.


Good point.  The function also copies the common first two hex digits
for each entry.  But all that extra work is certainly dwarfed by the
readdir calls.


Yes. You're welcome to micro-optimize that implementation if you want. ;)


My knee-jerk reaction was that this would lead to ugliness, but on
second look it might actually be doable.  Will check, but I don't
expect much of a speedup.

René
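
(A rough sketch of the reuse discussed above, with hypothetical glue: it
assumes the per-subdir iterator in sha1_file.c gets exposed and that its
object callback receives a parsed object_id -- the actual prototypes would
need checking against for_each_loose_file_in_objdir() and its helpers.)

    /* hypothetical callback: append every loose object seen to the cache */
    static int append_loose_oid(const struct object_id *oid,
                                const char *path, void *data)
    {
            struct oid_array *cache = data;
            oid_array_append(cache, oid);
            return 0;
    }

    static void read_loose_object_subdir(struct alternate_object_database *alt,
                                         int subdir_nr)
    {
            struct strbuf *buf = alt_scratch_buf(alt);
            /* assumed signature; the in_obj_subdir variant is not public yet */
            for_each_file_in_obj_subdir(subdir_nr, buf, append_loose_oid,
                                        NULL, NULL, &alt->loose_objects_cache);
    }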


Re: [BUG] add_again() off-by-one error in custom format

2017-06-18 Thread Junio C Hamano
Jeff King  writes:

> On Sun, Jun 18, 2017 at 12:58:49PM +0200, René Scharfe wrote:
>
>> Anyway, here's a patch for stat-based invalidation, on top of the other
>> one.  Array removal is really slow (hope I didn't sneak a bug in there,
>> but my confidence in this code isn't very high).  No locking is done;
>> parallel threads removing and adding entries could make a mess, but
>> that's not an issue for log.
>> 
>> Timings for "time git log --pretty=%h >/dev/null" in my git repository
>> with 5094 loose objects on Debian:
>> 
>>         master    first patch  this patch
>> real    0m1.065s  0m0.581s     0m0.633s
>> user    0m0.648s  0m0.564s     0m0.580s
>> sys     0m0.412s  0m0.016s     0m0.052s
>> 
>> 
>> And on mingw with 227 loose objects:
>> 
>>         master    first patch  this patch
>> real    0m1.756s  0m0.546s     0m1.659s
>> user    0m0.000s  0m0.000s     0m0.000s
>> sys     0m0.000s  0m0.000s     0m0.000s
>> 
>> So at least for Windows it would be really nice if we could avoid
> > calling stat().
>
> Thanks for doing the timings. Given those numbers and the earlier
> discussion, I'd be inclined to skip the mtime check.

Yeah, thanks for these experiments.  With or without invalidation,
we already accept that racing with other processes will make the
result inaccurate, so I am also inclined to say that it would be
best to take the first one alone.



Re: [BUG] add_again() off-by-one error in custom format

2017-06-18 Thread Jeff King
On Sun, Jun 18, 2017 at 02:59:04PM +0200, René Scharfe wrote:

> > > @@ -1586,6 +1587,9 @@ extern struct alternate_object_database {
> > >   struct strbuf scratch;
> > >   size_t base_len;
> > > + uint32_t loose_objects_subdir_bitmap[8];
> > 
> > Is it worth the complexity of having an actual bitmap and not just an
> > array of char? I guess it's not _that_ complex to access the bits, but
> > there are a lot of magic numbers involved.
> 
> That would be 224 bytes more per alternate_object_database, and we'd
> gain simpler code.  Hmm.  We could add some bitmap helper macros, but
> you're probably right that the first version should use the simplest
> form for representing small bitmaps that we currently have.

I'd be OK with keeping it if we could reduce the number of magic
numbers. E.g., rather than 32 elsewhere use:

  (sizeof(*loose_objects_subdir_bitmap) * CHAR_BIT)

and similarly rather than 8 here use

  256 / sizeof(*loose_objects_subdir_bitmap) / CHAR_BIT

There's also already a bitmap data structure in ewah/bitmap.c. It's a
little bit overkill, though, because it mallocs and will grow the bitmap
as needed.

> > We should probably insert a comment here warning that the cache may go
> > out of date during the process run, and should only be used when that's
> > an acceptable tradeoff.
> 
> Then we need to offer a way for opting out.  Through a global variable?
> (I'd rather reduce their number, but don't see how to allow programs to
> nicely toggle this cache otherwise.)

Sorry, I didn't mean disabling it for a particular run of a program. I
just meant that we know this is an OK tradeoff for short-sha1 lookups,
but has_sha1_file(), for example, would never want to use it. So we are
warning programmers away from using it in more code, not giving users a way
to turn it off run-to-run.
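
(Something along these lines next to the new field would capture that; the
wording is a sketch, not taken from the patch:)

    /*
     * All loose objects read so far from this object directory, keyed off
     * loose_objects_subdir_bitmap.  Never invalidated, so it can go stale
     * while the process runs; only use it where a racy answer is acceptable
     * (e.g. short-SHA1 disambiguation), not for existence checks like
     * has_sha1_file().
     */
    struct oid_array loose_objects_cache;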

> > > +static void read_loose_object_subdir(struct alternate_object_database *alt,
> > > +  int subdir_nr)
> > 
> > I think it's nice to pull this out into a helper function. I do wonder
> > if it should just be reusing for_each_loose_file_in_objdir(). You'd just
> > need to expose the in_obj_subdir() variant publicly.
> > 
> > It does do slightly more than we need (for the callbacks it actually
> > builds the filename), but I doubt memcpy()ing a few bytes would be
> > measurable.
> 
> Good point.  The function also copies the common first two hex digits
> for each entry.  But all that extra work is certainly dwarfed by the
> readdir calls.

Yes. You're welcome to micro-optimize that implementation if you want. ;)

-Peff


Re: [BUG] add_again() off-by-one error in custom format

2017-06-18 Thread René Scharfe

On 18.06.2017 at 13:49, Jeff King wrote:

On Sun, Jun 18, 2017 at 12:58:37PM +0200, René Scharfe wrote:


An oid_array spends ca. 30 bytes per entry (20 bytes for a hash times
a factor of 1.5 from alloc_nr).  How many loose objects do we have to
expect?  30 MB for a million of them sounds not too bad, but can there
realistically be orders of magnitude more?


Good point. We gc by default at 6000 loose objects, and lots of things
will suffer poor performance by the time you hit a million. So a little
extra space is probably not worth worrying about.


So here's a patch for caching loose object hashes in an oid_array
without a way to invalidate or release the cache.  It uses a single
oid_array for simplicity, but reads each subdirectory individually and
remembers that fact using a bitmap.


I like the direction of this very much.


@@ -1586,6 +1587,9 @@ extern struct alternate_object_database {
struct strbuf scratch;
size_t base_len;
  
+	uint32_t loose_objects_subdir_bitmap[8];


Is it worth the complexity of having an actual bitmap and not just an
array of char? I guess it's not _that_ complex to access the bits, but
there are a lot of magic numbers involved.


That would be 224 bytes more per alternate_object_database, and we'd
gain simpler code.  Hmm.  We could add some bitmap helper macros, but
you're probably right that the first version should use the simplest
form for representing small bitmaps that we currently have.


+   struct oid_array loose_objects_cache;


We should probably insert a comment here warning that the cache may go
out of date during the process run, and should only be used when that's
an acceptable tradeoff.


Then we need to offer a way for opting out.  Through a global variable?
(I'd rather reduce their number, but don't see how to allow programs to
nicely toggle this cache otherwise.)


+static void read_loose_object_subdir(struct alternate_object_database *alt,
+int subdir_nr)


I think it's nice to pull this out into a helper function. I do wonder
if it should just be reusing for_each_loose_file_in_objdir(). You'd just
need to expose the in_obj_subdir() variant publicly.

It does do slightly more than we need (for the callbacks it actually
builds the filename), but I doubt memcpy()ing a few bytes would be
measurable.


Good point.  The function also copies the common first two hex digits
for each entry.  But all that extra work is certainly dwarfed by the
readdir calls.

René


Re: [BUG] add_again() off-by-one error in custom format

2017-06-18 Thread Jeff King
On Sun, Jun 18, 2017 at 12:58:49PM +0200, René Scharfe wrote:

> Anyway, here's a patch for stat-based invalidation, on top of the other
> one.  Array removal is really slow (hope I didn't sneak a bug in there,
> but my confidence in this code isn't very high).  No locking is done;
> parallel threads removing and adding entries could make a mess, but
> that's not an issue for log.
> 
> Timings for "time git log --pretty=%h >/dev/null" in my git repository
> with 5094 loose objects on Debian:
> 
>         master    first patch  this patch
> real    0m1.065s  0m0.581s     0m0.633s
> user    0m0.648s  0m0.564s     0m0.580s
> sys     0m0.412s  0m0.016s     0m0.052s
> 
> 
> And on mingw with 227 loose objects:
> 
>         master    first patch  this patch
> real    0m1.756s  0m0.546s     0m1.659s
> user    0m0.000s  0m0.000s     0m0.000s
> sys     0m0.000s  0m0.000s     0m0.000s
> 
> So at least for Windows it would be really nice if we could avoid
> calling stat().

Thanks for doing the timings. Given those numbers and the earlier
discussion, I'd be inclined to skip the mtime check.

-Peff


Re: [BUG] add_again() off-by-one error in custom format

2017-06-18 Thread Jeff King
On Sun, Jun 18, 2017 at 12:58:37PM +0200, René Scharfe wrote:

> An oid_array spends ca. 30 bytes per entry (20 bytes for a hash times
> a factor of 1.5 from alloc_nr).  How many loose objects do we have to
> expect?  30 MB for a million of them sounds not too bad, but can there
> realistically be orders of magnitude more?

Good point. We gc by default at 6000 loose objects, and lots of things
will suffer poor performance by the time you hit a million. So a little
extra space is probably not worth worrying about.

> So here's a patch for caching loose object hashes in an oid_array
> without a way to invalidate or release the cache.  It uses a single
> oid_array for simplicity, but reads each subdirectory individually and
> remembers that fact using a bitmap.

I like the direction of this very much.

> @@ -1586,6 +1587,9 @@ extern struct alternate_object_database {
>   struct strbuf scratch;
>   size_t base_len;
>  
> + uint32_t loose_objects_subdir_bitmap[8];

Is it worth the complexity of having an actual bitmap and not just an
array of char? I guess it's not _that_ complex to access the bits, but
there are a lot of magic numbers involved.

> + struct oid_array loose_objects_cache;

We should probably insert a comment here warning that the cache may go
out of date during the process run, and should only be used when that's
an acceptable tradeoff.

> +static void read_loose_object_subdir(struct alternate_object_database *alt,
> +  int subdir_nr)

I think it's nice to pull this out into a helper function. I do wonder
if it should just be reusing for_each_loose_file_in_objdir(). You'd just
need to expose the in_obj_subdir() variant publicly.

It does do slightly more than we need (for the callbacks it actually
builds the filename), but I doubt memcpy()ing a few bytes would be
measurable.

> + pos = oid_array_lookup(&alt->loose_objects_cache, &ds->bin_pfx);
> + if (pos < 0)
> + pos = -1 - pos;
> + while (!ds->ambiguous && pos < alt->loose_objects_cache.nr) {
> + const struct object_id *oid;
> + oid = alt->loose_objects_cache.oid + pos;
> + if (!match_sha(ds->len, ds->bin_pfx.hash, oid->hash))
> + break;
> + update_candidates(ds, oid);
> + pos++;
>   }

This part looks quite nice and straightforward.

-Peff


Re: [BUG] add_again() off-by-one error in custom format

2017-06-18 Thread René Scharfe
On 15.06.2017 at 15:25, Jeff King wrote:
> On Thu, Jun 15, 2017 at 01:33:34PM +0200, René Scharfe wrote:
>> Can we trust object directory time stamps for cache invalidation?
> 
> I think it would work on POSIX-ish systems, since loose object changes
> always involve new files, not modifications of existing ones. I don't
> know if there are systems that don't update directory mtimes even for
> newly added files.
> 
> That would still be a stat() per request. I'd be curious how the timing
> compares to the opendir/readdir that happens now.

Modification times wouldn't be exact, as you mentioned above: An object
could be added just after checking the time stamp.  So perhaps we don't
need that, or a time-based invalidation suffices, e.g. we discard the
cache after five minutes or so?

Anyway, here's a patch for stat-based invalidation, on top of the other
one.  Array removal is really slow (hope I didn't sneak a bug in there,
but my confidence in this code isn't very high).  No locking is done;
parallel threads removing and adding entries could make a mess, but
that's not an issue for log.

Timings for "time git log --pretty=%h >/dev/null" in my git repository
with 5094 loose objects on Debian:

        master    first patch  this patch
real    0m1.065s  0m0.581s     0m0.633s
user    0m0.648s  0m0.564s     0m0.580s
sys     0m0.412s  0m0.016s     0m0.052s


And on mingw with 227 loose objects:

        master    first patch  this patch
real    0m1.756s  0m0.546s     0m1.659s
user    0m0.000s  0m0.000s     0m0.000s
sys     0m0.000s  0m0.000s     0m0.000s

So at least for Windows it would be really nice if we could avoid
calling stat().
---
 cache.h |  1 +
 sha1_name.c | 32 
 2 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/cache.h b/cache.h
index 8c6748907b..386b09710d 100644
--- a/cache.h
+++ b/cache.h
@@ -1589,6 +1589,7 @@ extern struct alternate_object_database {
 
uint32_t loose_objects_subdir_bitmap[8];
struct oid_array loose_objects_cache;
+   time_t loose_objects_subdir_mtime[256];
 
char path[FLEX_ARRAY];
 } *alt_odb_list;
diff --git a/sha1_name.c b/sha1_name.c
index ae6a5c3abe..baecb10454 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -77,6 +77,24 @@ static void update_candidates(struct disambiguate_state *ds, const struct object
/* otherwise, current can be discarded and candidate is still good */
 }
 
+static void remove_subdir_entries(struct oid_array *array, int subdir_nr)
+{
+   struct object_id oid;
+   int start, end;
+
+   memset(&oid, 0, sizeof(oid));
+   oid.hash[0] = subdir_nr;
+   start = oid_array_lookup(array, &oid);
+   if (start < 0)
+   start = -1 - start;
+   end = start;
+   while (end < array->nr && array->oid[end].hash[0] == subdir_nr)
+   end++;
+   memmove(array->oid + start, array->oid + end,
+   (array->nr - end) * sizeof(array->oid[0]));
+   array->nr -= end - start;
+}
+
 static void read_loose_object_subdir(struct alternate_object_database *alt,
 int subdir_nr)
 {
@@ -86,15 +104,21 @@ static void read_loose_object_subdir(struct alternate_object_database *alt,
struct dirent *de;
size_t bitmap_index = subdir_nr / 32;
uint32_t bitmap_mask = 1 << (subdir_nr % 32);
-
-   if (alt->loose_objects_subdir_bitmap[bitmap_index] & bitmap_mask)
-   return;
-   alt->loose_objects_subdir_bitmap[bitmap_index] |= bitmap_mask;
+   struct stat st;
 
buf = alt_scratch_buf(alt);
strbuf_addf(buf, "%02x/", subdir_nr);
xsnprintf(hex, sizeof(hex), "%02x", subdir_nr);
 
+   stat(buf->buf, &st);
+   if (alt->loose_objects_subdir_bitmap[bitmap_index] & bitmap_mask) {
+   if (alt->loose_objects_subdir_mtime[subdir_nr] == st.st_mtime)
+   return;
+   remove_subdir_entries(&alt->loose_objects_cache, subdir_nr);
+   }
+   alt->loose_objects_subdir_mtime[subdir_nr] = st.st_mtime;
+   alt->loose_objects_subdir_bitmap[bitmap_index] |= bitmap_mask;
+
dir = opendir(buf->buf);
if (!dir)
return;
-- 
2.13.0


Re: [BUG] add_again() off-by-one error in custom format

2017-06-18 Thread René Scharfe
On 15.06.2017 at 15:25, Jeff King wrote:
> On Thu, Jun 15, 2017 at 01:33:34PM +0200, René Scharfe wrote:
>>> I'm not really sure how, though, short of caching the directory
>>> contents. That opens up questions of whether and when to invalidate the
>>> cache. If the cache were _just_ about short hashes, it might be OK to
>>> just assume that it remains valid through the length of the program (so
>>> worst case, a simultaneous write might mean that we generate a sha1
>>> which just became ambiguous, but that's generally going to be racy
>>> anyway).
>>>
>>> The other downside of course is that we'd spend RAM on it. We could
>>> bound the size of the cache, I suppose.
>>
>> You mean like an in-memory pack index for loose objects?  In order to
>> avoid the readdir call in sha1_name.c::find_short_object_filename()?
>> We'd only need to keep the hashes of found objects.  An oid_array
>> would be quite compact.
> 
> Yes, that's what I was thinking of.

An oid_array spends ca. 30 bytes per entry (20 bytes for a hash times
a factor of 1.5 from alloc_nr).  How many loose objects do we have to
expect?  30 MB for a million of them sounds not too bad, but can there
realistically be orders of magnitude more?

>> Non-racy writes inside a process should not be ignored (write, read
>> later) -- e.g. checkout does something like that.
> 
> Ideally, yes. Though thinking on this more, it's racy even today,
> because we rely on the in-memory packed_git list. We usually re-check the
> directory only when we look for an object and it's missing. So any new
> packs which have been added while the process runs will be missed when
> doing short-sha1 lookups (and actually, find_short_packed_object does
> not even seem to do the usual retry-on-miss).
> 
> A normal process like "checkout" isn't writing new packs, but a
> simultaneous repack could be migrating its new objects to a pack behind
> its back. (It also _can_ write packs, but only for large blobs).
> 
> Given that we are talking about finding "unique" abbreviations here, and
> that those abbreviations can become invalidated immediately anyway as
> more objects are added (even by the same process), I'm not sure we need
> to strive for absolute accuracy.

Agreed.  And processes that add objects and use them later probably use
full hashes (at least checkout does).

So here's a patch for caching loose object hashes in an oid_array
without a way to invalidate or release the cache.  It uses a single
oid_array for simplicity, but reads each subdirectory individually and
remembers that fact using a bitmap.
---
 cache.h |  4 
 sha1_name.c | 69 +++--
 2 files changed, 53 insertions(+), 20 deletions(-)

diff --git a/cache.h b/cache.h
index 4d92aae0e8..8c6748907b 100644
--- a/cache.h
+++ b/cache.h
@@ -11,6 +11,7 @@
 #include "string-list.h"
 #include "pack-revindex.h"
 #include "hash.h"
+#include "sha1-array.h"
 
 #ifndef platform_SHA_CTX
 /*
@@ -1586,6 +1587,9 @@ extern struct alternate_object_database {
struct strbuf scratch;
size_t base_len;
 
+   uint32_t loose_objects_subdir_bitmap[8];
+   struct oid_array loose_objects_cache;
+
char path[FLEX_ARRAY];
 } *alt_odb_list;
 extern void prepare_alt_odb(void);
diff --git a/sha1_name.c b/sha1_name.c
index 5126853bb5..ae6a5c3abe 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -77,10 +77,46 @@ static void update_candidates(struct disambiguate_state *ds, const struct object
/* otherwise, current can be discarded and candidate is still good */
 }
 
+static void read_loose_object_subdir(struct alternate_object_database *alt,
+int subdir_nr)
+{
+   struct strbuf *buf;
+   char hex[GIT_MAX_HEXSZ];
+   DIR *dir;
+   struct dirent *de;
+   size_t bitmap_index = subdir_nr / 32;
+   uint32_t bitmap_mask = 1 << (subdir_nr % 32);
+
+   if (alt->loose_objects_subdir_bitmap[bitmap_index] & bitmap_mask)
+   return;
+   alt->loose_objects_subdir_bitmap[bitmap_index] |= bitmap_mask;
+
+   buf = alt_scratch_buf(alt);
+   strbuf_addf(buf, "%02x/", subdir_nr);
+   xsnprintf(hex, sizeof(hex), "%02x", subdir_nr);
+
+   dir = opendir(buf->buf);
+   if (!dir)
+   return;
+
+   while ((de = readdir(dir)) != NULL) {
+   struct object_id oid;
+
+   if (strlen(de->d_name) != GIT_SHA1_HEXSZ - 2)
+   continue;
+   memcpy(hex + 2, de->d_name, GIT_SHA1_HEXSZ - 2);
+   if (!get_oid_hex(hex, &oid))
+   oid_array_append(&alt->loose_objects_cache, &oid);
+   }
+   closedir(dir);
+}
+
+static int match_sha(unsigned, const unsigned char *, const unsigned char *);
+
 static void find_short_object_filename(struct disambiguate_state *ds)
 {
+   int subdir_nr = ds->bin_pfx.hash[0];
struct alternate_object_database *alt;
-   char hex[GIT_MAX_HEXSZ];
   

Re: [BUG] add_again() off-by-one error in custom format

2017-06-15 Thread Junio C Hamano
René Scharfe  writes:

> On 13.06.2017 at 23:20, Junio C Hamano wrote:
>
>> I think the real question is how likely people use more than one
>> occurrence of the same thing in their custom format, and how deeply
>> they care that --format='%h %h' costs more than --format='%h'.  The
>> cost won't of course be double (because the main traversal costs
>> without any output), but it would be rather unreasonable to expect
>> --format='%h %h %h %h %h' to cost the same as --format='%h';
>> after all, Git is doing more for them ;-)
>
> The answer to the first half is obviously "very likely" -- otherwise
> this bug wouldn't have been found, right? :)

Not really.  There was only one (this one) after all these years.
The question we are asking is not "very rarely this is used and we
can afford to leave it broken?"  It is "very rarely this is used
and we can afford not to optimize for that rare use case?".

> Regarding the question of how bad a 50% slowdown for a second %h
> would be: No idea.  If run interactively it may not even be noticeable
> because the user can read the first few lines in less while the rest
> is prepared in the background.  We don't have a perf test for formats
> with duplicate short hashes, so we don't promise anything, right? :)

OK.

> -- >8 --
> Subject: [PATCH] pretty: recalculate duplicate short hashes
>
> b9c6232138 (--format=pretty: avoid calculating expensive expansions
> twice) optimized adding short hashes multiple times by using the
> fact that the output strbuf was only ever simply appended to and
> copying the added string from the previous run.  That prerequisite
> is no longer given; we now have modifiers like %< and %+ that can
> cause the cache to lose track of the correct offsets.  Remove it.
>
> Reported-by: Michael Giuffrida 
> Signed-off-by: Rene Scharfe 
> ---
> I'm sending this out in the hope that there might be a simple way
> to fix it after all, like Gábor's patch does for %+.  %< and %>
> seem to be the only other problematic modifiers for now -- I'm
> actually surprised that %w seems to be OK.

Thanks, this looks like a sensible first step.  Will queue.


Re: [BUG] add_again() off-by-one error in custom format

2017-06-15 Thread Jeff King
On Thu, Jun 15, 2017 at 01:33:34PM +0200, René Scharfe wrote:

> > The difference is that in the "before" case, we actually opened each
> > directory and ran getdents(). But after gc, the directories are gone
> > totally and open() fails. We also have to do a linear walk through the
> > objects in each directory, since the contents are sorted.
> 
> Do you mean "unsorted"?

Er yeah, sorry, I meant to say "aren't".

> > I'm not really sure how, though, short of caching the directory
> > contents. That opens up questions of whether and when to invalidate the
> > cache. If the cache were _just_ about short hashes, it might be OK to
> > just assume that it remains valid through the length of the program (so
> > worst case, a simultaneous write might mean that we generate a sha1
> > which just became ambiguous, but that's generally going to be racy
> > anyway).
> > 
> > The other downside of course is that we'd spend RAM on it. We could
> > bound the size of the cache, I suppose.
> 
> You mean like an in-memory pack index for loose objects?  In order to
> avoid the readdir call in sha1_name.c::find_short_object_filename()?
> We'd only need to keep the hashes of found objects.  An oid_array
> would be quite compact.

Yes, that's what I was thinking of.

> Non-racy writes inside a process should not be ignored (write, read
> later) -- e.g. checkout does something like that.

Ideally, yes. Though thinking on this more, it's racy even today,
because we rely on the in-memory packed_git list. We usually re-check the
directory only when we look for an object and it's missing. So any new
packs which have been added while the process runs will be missed when
doing short-sha1 lookups (and actually, find_short_packed_object does
not even seem to do the usual retry-on-miss).

A normal process like "checkout" isn't writing new packs, but a
simultaneous repack could be migrating its new objects to a pack behind
its back. (It also _can_ write packs, but only for large blobs).

Given that we are talking about finding "unique" abbreviations here, and
that those abbreviations can become invalidated immediately anyway as
more objects are added (even by the same process), I'm not sure we need
to strive for absolute accuracy.

> Can we trust object directory time stamps for cache invalidation?

I think it would work on POSIX-ish systems, since loose object changes
always involve new files, not modifications of existing ones. I don't
know if there are systems that don't update directory mtimes even for
newly added files.

That would still be a stat() per request. I'd be curious how the timing
compares to the opendir/readdir that happens now.

-Peff


Re: [BUG] add_again() off-by-one error in custom format

2017-06-15 Thread René Scharfe

On 15.06.2017 at 07:56, Jeff King wrote:

One interesting thing is that the cost of finding short hashes very much
depends on your loose object setup. I timed:

   git log --format=%H >/dev/null

versus

   git log --format=%h >/dev/null

on git.git. It went from about 400ms to about 800ms. But then I noticed
I had a lot of loose object directories, and ran "git gc --prune=now".
Afterwards, my timings were more like 380ms and 460ms.

The difference is that in the "before" case, we actually opened each
directory and ran getdents(). But after gc, the directories are gone
totally and open() fails. We also have to do a linear walk through the
objects in each directory, since the contents are sorted.


Do you mean "unsorted"?


So I wonder if it is worth trying to optimize the short-sha1 computation
in the first place. Double-%h aside, that would make _everything_
faster, including --oneline.


Right.


I'm not really sure how, though, short of caching the directory
contents. That opens up questions of whether and when to invalidate the
cache. If the cache were _just_ about short hashes, it might be OK to
just assume that it remains valid through the length of the program (so
worst case, a simultaneous write might mean that we generate a sha1
which just became ambiguous, but that's generally going to be racy
anyway).

The other downside of course is that we'd spend RAM on it. We could
bound the size of the cache, I suppose.


You mean like an in-memory pack index for loose objects?  In order to
avoid the readdir call in sha1_name.c::find_short_object_filename()?
We'd only need to keep the hashes of found objects.  An oid_array
would be quite compact.

Non-racy writes inside a process should not be ignored (write, read
later) -- e.g. checkout does something like that.

Can we trust object directory time stamps for cache invalidation?

René


Re: [BUG] add_again() off-by-one error in custom format

2017-06-14 Thread Jeff King
On Wed, Jun 14, 2017 at 08:24:25PM +0200, René Scharfe wrote:

> > I think the real question is how likely people use more than one
> > occurrence of the same thing in their custom format, and how deeply
> > they care that --format='%h %h' costs more than --format='%h'.  The
> > cost won't of course be double (because the main traversal costs
> > without any output), but it would be rather unreasonable to expect
> > --format='%h %h %h %h %h' to cost the same as --format='%h';
> > after all, Git is doing more for them ;-)
> 
> The answer to the first half is obviously "very likely" -- otherwise
> this bug wouldn't have been found, right? :)
> 
> Regarding the question of how bad a 50% slowdown for a second %h
> would be: No idea.  If run interactively it may not even be noticeable
> because the user can read the first few lines in less while the rest
> is prepared in the background.  We don't have a perf test for formats
> with duplicate short hashes, so we don't promise anything, right? :)

One interesting thing is that the cost of finding short hashes very much
depends on your loose object setup. I timed:

  git log --format=%H >/dev/null

versus

  git log --format=%h >/dev/null

on git.git. It went from about 400ms to about 800ms. But then I noticed
I had a lot of loose object directories, and ran "git gc --prune=now".
Afterwards, my timings were more like 380ms and 460ms.

The difference is that in the "before" case, we actually opened each
directory and ran getdents(). But after gc, the directories are gone
totally and open() fails. We also have to do a linear walk through the
objects in each directory, since the contents are sorted.

So I wonder if it is worth trying to optimize the short-sha1 computation
in the first place. Double-%h aside, that would make _everything_
faster, including --oneline.

I'm not really sure how, though, short of caching the directory
contents. That opens up questions of whether and when to invalidate the
cache. If the cache were _just_ about short hashes, it might be OK to
just assume that it remains valid through the length of the program (so
worst case, a simultaneous write might mean that we generate a sha1
which just became ambiguous, but that's generally going to be racy
anyway).

The other downside of course is that we'd spend RAM on it. We could
bound the size of the cache, I suppose.

-Peff


Re: [BUG] add_again() off-by-one error in custom format

2017-06-14 Thread René Scharfe
On 13.06.2017 at 23:20, Junio C Hamano wrote:
> René Scharfe  writes:
> 
>> The difference is about the same as the one between:
>>
>>  $ time git log --format="" >/dev/null
>>
>>  real    0m0.463s
>>  user    0m0.448s
>>  sys     0m0.012s
>>
>> and:
>>
>>  $ time git log --format="%h" >/dev/null
>>
>>  real    0m1.062s
>>  user    0m0.636s
>>  sys     0m0.416s
>>
>> With caching duplicates are basically free and without it short
>> hashes have to be looked up again.  Other placeholders may reduce
>> the relative slowdown, depending on how expensive they are.
> 
> I think the real question is how likely people use more than one
> occurrence of the same thing in their custom format, and how deeply
> they care that --format='%h %h' costs more than --format='%h'.  The
> cost won't of course be double (because the main traversal costs
> without any output), but it would be rather unreasonable to expect
> --format='%h %h %h %h %h' to cost the same as --format='%h';
> after all, Git is doing more for them ;-)

The answer to the first half is obviously "very likely" -- otherwise
this bug wouldn't have been found, right? :)

Regarding the question of how bad a 50% slowdown for a second %h
would be: No idea.  If run interactively it may not even be noticeable
because the user can read the first few lines in less while the rest
is prepared in the background.  We don't have a perf test for formats
with duplicate short hashes, so we don't promise anything, right? :)

René

-- >8 --
Subject: [PATCH] pretty: recalculate duplicate short hashes

b9c6232138 (--format=pretty: avoid calculating expensive expansions
twice) optimized adding short hashes multiple times by using the
fact that the output strbuf was only ever simply appended to and
copying the added string from the previous run.  That prerequisite
is no longer given; we now have modifiers like %< and %+ that can
cause the cache to lose track of the correct offsets.  Remove it.

Reported-by: Michael Giuffrida 
Signed-off-by: Rene Scharfe 
---
I'm sending this out in the hope that there might be a simple way
to fix it after all, like Gábor's patch does for %+.  %< and %>
seem to be the only other problematic modifiers for now -- I'm
actually surprised that %w seems to be OK.

 pretty.c | 32 
 strbuf.c |  7 ---
 strbuf.h |  6 --
 3 files changed, 45 deletions(-)

diff --git a/pretty.c b/pretty.c
index 09701bd2ff..cc099dfdd1 100644
--- a/pretty.c
+++ b/pretty.c
@@ -783,29 +783,9 @@ struct format_commit_context {
size_t body_off;
 
/* The following ones are relative to the result struct strbuf. */
-   struct chunk abbrev_commit_hash;
-   struct chunk abbrev_tree_hash;
-   struct chunk abbrev_parent_hashes;
size_t wrap_start;
 };
 
-static int add_again(struct strbuf *sb, struct chunk *chunk)
-{
-   if (chunk->len) {
-   strbuf_adddup(sb, chunk->off, chunk->len);
-   return 1;
-   }
-
-   /*
-* We haven't seen this chunk before.  Our caller is surely
-* going to add it the hard way now.  Remember the most likely
-* start of the to-be-added chunk: the current end of the
-* struct strbuf.
-*/
-   chunk->off = sb->len;
-   return 0;
-}
-
 static void parse_commit_header(struct format_commit_context *context)
 {
const char *msg = context->message;
@@ -1147,24 +1127,16 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
return 1;
case 'h':   /* abbreviated commit hash */
strbuf_addstr(sb, diff_get_color(c->auto_color, DIFF_COMMIT));
-   if (add_again(sb, &c->abbrev_commit_hash)) {
-   strbuf_addstr(sb, diff_get_color(c->auto_color, DIFF_RESET));
-   return 1;
-   }
strbuf_add_unique_abbrev(sb, commit->object.oid.hash,
 c->pretty_ctx->abbrev);
strbuf_addstr(sb, diff_get_color(c->auto_color, DIFF_RESET));
-   c->abbrev_commit_hash.len = sb->len - c->abbrev_commit_hash.off;
return 1;
case 'T':   /* tree hash */
strbuf_addstr(sb, oid_to_hex(&commit->tree->object.oid));
return 1;
case 't':   /* abbreviated tree hash */
-   if (add_again(sb, &c->abbrev_tree_hash))
-   return 1;
strbuf_add_unique_abbrev(sb, commit->tree->object.oid.hash,
 c->pretty_ctx->abbrev);
-   c->abbrev_tree_hash.len = sb->len - c->abbrev_tree_hash.off;
return 1;
case 'P':   /* parent hashes */
for (p = commit->parents; p; p = p->next) {
@@ -1174,16 +1146,12 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
}
 

Re: [BUG] add_again() off-by-one error in custom format

2017-06-14 Thread René Scharfe

On 14.06.2017 at 00:24, SZEDER Gábor wrote:

[sorry for double post, forgot the mailing list...]

To throw in a fourth option, this one adjusts the expansions' cached
offsets when the magic makes it necessary.  It's not necessary for
'%-', because it only makes a difference when the expansion is empty,
and in that case

   - add_again() doesn't consider it cached,
   - and even if it did, the offset of a zero-length string wouldn't
 matter.

  pretty.c | 32 +---
  1 file changed, 21 insertions(+), 11 deletions(-)


There are other modifiers, e.g. try the format '%h%>(20)%h' -- its
output is obviously wrong (contains control characters), even with your
patch.  We'd have to update the offsets for each placeholder that
changes the output of others.  I don't think that approach scales, alas.

René


Re: [BUG] add_again() off-by-one error in custom format

2017-06-13 Thread SZEDER Gábor
[sorry for double post, forgot the mailing list...]

To throw in a fourth option, this one adjusts the expansions' cached
offsets when the magic makes it necessary.  It's not necessary for
'%-', because it only makes a difference when the expansion is empty,
and in that case

  - add_again() doesn't consider it cached,
  - and even if it did, the offset of a zero-length string wouldn't
matter.

 pretty.c | 32 +---
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/pretty.c b/pretty.c
index d0f86f5d8..69c4f2a21 100644
--- a/pretty.c
+++ b/pretty.c
@@ -1066,9 +1066,17 @@ static size_t parse_padding_placeholder(struct strbuf *sb,
return 0;
 }
 
+enum format_commit_item_magic {
+   NO_MAGIC,
+   ADD_LF_BEFORE_NON_EMPTY,
+   DEL_LF_BEFORE_EMPTY,
+   ADD_SP_BEFORE_NON_EMPTY
+};
+
 static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
const char *placeholder,
-   void *context)
+   void *context,
+   enum format_commit_item_magic magic)
 {
struct format_commit_context *c = context;
const struct commit *commit = c->commit;
@@ -1155,6 +1163,8 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
 c->pretty_ctx->abbrev);
strbuf_addstr(sb, diff_get_color(c->auto_color, DIFF_RESET));
c->abbrev_commit_hash.len = sb->len - c->abbrev_commit_hash.off;
+   if (magic == ADD_LF_BEFORE_NON_EMPTY || magic == ADD_SP_BEFORE_NON_EMPTY)
+   c->abbrev_commit_hash.off++;
return 1;
case 'T':   /* tree hash */
strbuf_addstr(sb, oid_to_hex(&commit->tree->object.oid));
@@ -1165,6 +1175,8 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
strbuf_add_unique_abbrev(sb, commit->tree->object.oid.hash,
 c->pretty_ctx->abbrev);
c->abbrev_tree_hash.len = sb->len - c->abbrev_tree_hash.off;
+   if (magic == ADD_LF_BEFORE_NON_EMPTY || magic == ADD_SP_BEFORE_NON_EMPTY)
+   c->abbrev_tree_hash.off++;
return 1;
case 'P':   /* parent hashes */
for (p = commit->parents; p; p = p->next) {
@@ -1184,6 +1196,8 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
}
c->abbrev_parent_hashes.len = sb->len -
  c->abbrev_parent_hashes.off;
+   if (magic == ADD_LF_BEFORE_NON_EMPTY || magic == ADD_SP_BEFORE_NON_EMPTY)
+   c->abbrev_parent_hashes.off++;
return 1;
case 'm':   /* left/right/bottom */
strbuf_addstr(sb, get_revision_mark(NULL, commit));
@@ -1314,7 +1328,8 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
 
 static size_t format_and_pad_commit(struct strbuf *sb, /* in UTF-8 */
const char *placeholder,
-   struct format_commit_context *c)
+   struct format_commit_context *c,
+   enum format_commit_item_magic magic)
 {
struct strbuf local_sb = STRBUF_INIT;
int total_consumed = 0, len, padding = c->padding;
@@ -1329,7 +1344,7 @@ static size_t format_and_pad_commit(struct strbuf *sb, /* in UTF-8 */
}
while (1) {
int modifier = *placeholder == 'C';
-   int consumed = format_commit_one(&local_sb, placeholder, c);
+   int consumed = format_commit_one(&local_sb, placeholder, c, magic);
total_consumed += consumed;
 
if (!modifier)
@@ -1420,12 +1435,7 @@ static size_t format_commit_item(struct strbuf *sb, /* in UTF-8 */
 {
int consumed;
size_t orig_len;
-   enum {
-   NO_MAGIC,
-   ADD_LF_BEFORE_NON_EMPTY,
-   DEL_LF_BEFORE_EMPTY,
-   ADD_SP_BEFORE_NON_EMPTY
-   } magic = NO_MAGIC;
+   enum format_commit_item_magic magic = NO_MAGIC;
 
switch (placeholder[0]) {
case '-':
@@ -1445,9 +1455,9 @@ static size_t format_commit_item(struct strbuf *sb, /* in UTF-8 */
 
orig_len = sb->len;
if (((struct format_commit_context *)context)->flush_type != no_flush)
-   consumed = format_and_pad_commit(sb, placeholder, context);
+   consumed = format_and_pad_commit(sb, placeholder, context, magic);
else
-   consumed = format_commit_one(sb, placeholder, context);
+   consumed = format_commit_one(sb, placeholder, context, magic);
if (magic == NO_MAGIC)
return consumed;
 
-- 
2.13.1.438.gc638325b1



Re: [BUG] add_again() off-by-one error in custom format

2017-06-13 Thread Junio C Hamano
René Scharfe  writes:

> The difference is about the same as the one between:
>
>   $ time git log --format="" >/dev/null
>
>   real    0m0.463s
>   user    0m0.448s
>   sys     0m0.012s
>
> and:
>
>   $ time git log --format="%h" >/dev/null
>
>   real    0m1.062s
>   user    0m0.636s
>   sys     0m0.416s
>
> With caching duplicates are basically free and without it short
> hashes have to be looked up again.  Other placeholders may reduce
> the relative slowdown, depending on how expensive they are.

I think the real question is how likely people use more than one
occurrence of the same thing in their custom format, and how deeply
they care that --format='%h %h' costs more than --format='%h'.  The
cost won't of course be double (because the main traversal costs
without any output), but it would be rather unreasonable to expect
--format='%h %h %h %h %h' to cost the same as --format='%h';
after all, Git is doing more for them ;-)

So in that sense, I am actually OK if we decide to remove the caching.

> Forgot a third option, probably because it's not a particularly good
> idea: Replacing the caching in pretty.c with a short static cache in
> find_unique_abbrev_r().

Indeed.


Re: [BUG] add_again() off-by-one error in custom format

2017-06-13 Thread René Scharfe

On 13.06.2017 at 20:29, Junio C Hamano wrote:

René Scharfe  writes:


Indeed, a very nice bug report!


I think the call to format_commit_one() needs to be taught to pass
'magic' through, and then the add_again() mechanism needs to be told not
to cache when magic is in effect, or something like that.

Perhaps something along this line...?

   pretty.c | 64 ++--
   1 file changed, 38 insertions(+), 26 deletions(-)


That looks quite big.  You even invent a way to classify magic.


Hmph, by "a way to classify", do you mean the enum?  That thing was
there from before, just moved.


Oh, yes, sorry.  I didn't even get that far into the patch.  (I'll
better go to bed after hitting send..)


Also I think we do not have to change add_again() at all, because
chunk->off tolerates being given a garbage value as long as
chunk->len stays 0, and add_again() does not change chunk->len at
all.

Which cuts the diffstat down to

  pretty.c | 40 +---
  1 file changed, 25 insertions(+), 15 deletions(-)


Does the caching feature justify the added complexity?


That's a good question.  I thought about your second alternative
while adding the "don't cache" thing, but if we can measure and find
out that we are not gaining much from caching, certainly that sounds
much simpler.


The difference is about the same as the one between:

$ time git log --format="" >/dev/null

real    0m0.463s
user    0m0.448s
sys     0m0.012s

and:

$ time git log --format="%h" >/dev/null

real    0m1.062s
user    0m0.636s
sys     0m0.416s

With caching duplicates are basically free and without it short
hashes have to be looked up again.  Other placeholders may reduce
the relative slowdown, depending on how expensive they are.

Forgot a third option, probably because it's not a particularly good
idea: Replacing the caching in pretty.c with a short static cache in
find_unique_abbrev_r().

René


Re: [BUG] add_again() off-by-one error in custom format

2017-06-13 Thread Junio C Hamano
René Scharfe  writes:

> Indeed, a very nice bug report!
>
>> I think the call to format_commit_one() needs to be taught to pass
>> 'magic' through, and then the add_again() mechanism needs to be told not
>> to cache when magic is in effect, or something like that.
>>
>> Perhaps something along this line...?
>>
>>   pretty.c | 64 ++--
>>   1 file changed, 38 insertions(+), 26 deletions(-)
>
> That looks quite big.  You even invent a way to classify magic. 

Hmph, by "a way to classify", do you mean the enum?  That thing was
there from before, just moved.

Also I think we do not have to change add_again() at all, because
chunk->off tolerates being given a garbage value as long as
chunk->len stays 0, and add_again() does not change chunk->len at
all.

Which cuts the diffstat down to

 pretty.c | 40 +---
 1 file changed, 25 insertions(+), 15 deletions(-)

> Does the caching feature justify the added complexity?

That's a good question.  I thought about your second alternative
while adding the "don't cache" thing, but if we can measure and find
out that we are not gaining much from caching, certainly that sounds
much simpler.

>
> - Don't cache anymore, now that we have placeholders that change output
>   on top of the original append-only ones.  Yields a net code reduction.
>   Duplicated %h, %t and %p will have to be resolved at each occurrence.
>
> - Provide some space for the cache instead of reusing the output, like
>   a strbuf for each placeholder.  Adds a memory allocation to each
>   first occurrence of %h, %t and %p.  Such a cache could be reused for
>   multiple commits by truncating it instead of releasing; not sure how
>   to pass it in nicely for it to survive a sequence of calls, though.
>
> René


Re: [BUG] add_again() off-by-one error in custom format

2017-06-13 Thread René Scharfe

On 13.06.2017 at 00:49, Junio C Hamano wrote:

Michael Giuffrida  writes:


For the abbreviated commit hash placeholder ('h'), pretty.c uses
add_again() to cache the result of the expansion, and then uses that
result in future expansions. This causes problems when the expansion
includes whitespace:

 $ git log -n 1 --pretty='format:newline:%+h%nno_newline:%h'
 newline:
 a0b1c2d
 no_newline:
 a0b1c2

The second expansion of the hash added an unwanted newline and removed
the final character. It seemingly used the cached expansion *starting
from the inserted newline*.

The expected output is:

 $ git log -n 1 --pretty='format:newline:%+h%nno_newline:%h'
 newline:
 a0b1c2d
 no_newline:a0b1c2d


Nicely explained.  The add_again() mechanism caches an earlier
result by remembering the offset and the length in the strbuf the
formatted string is being accumulated to, but because %+x (and
probably %-x) magic placeholders futz with the result of
format_commit_one() in place, the cache goes out of sync, of course.


Indeed, a very nice bug report!


I think the call to format_commit_one() needs to be taught to pass
'magic' through, and then the add_again() mechanism needs to be told not
to cache when magic is in effect, or something like that.

Perhaps something along this line...?

  pretty.c | 64 ++--
  1 file changed, 38 insertions(+), 26 deletions(-)


That looks quite big.  You even invent a way to classify magic. Does the
caching feature justify the added complexity?  Alternatives:

- Don't cache anymore, now that we have placeholders that change output
  on top of the original append-only ones.  Yields a net code reduction.
  Duplicated %h, %t and %p will have to be resolved at each occurrence.

- Provide some space for the cache instead of reusing the output, like
  a strbuf for each placeholder.  Adds a memory allocation to each
  first occurrence of %h, %t and %p.  Such a cache could be reused for
  multiple commits by truncating it instead of releasing; not sure how
  to pass it in nicely for it to survive a sequence of calls, though.

René
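
(A rough sketch of the second alternative above, with hypothetical names:
each placeholder caches its expansion in a private strbuf instead of pointing
back into the output, so later %< / %+ rewrites of the output cannot
desynchronize it.  The strbuf would start as STRBUF_INIT and the valid flag
would be cleared between commits.)

    struct cached_expansion {
            unsigned valid:1;
            struct strbuf buf;              /* initialize with STRBUF_INIT */
    };

    static void add_cached(struct strbuf *out, struct cached_expansion *c,
                           void (*expand)(struct strbuf *, const struct commit *),
                           const struct commit *commit)
    {
            if (!c->valid) {
                    strbuf_reset(&c->buf);  /* reuse the allocation across commits */
                    expand(&c->buf, commit);
                    c->valid = 1;
            }
            strbuf_addbuf(out, &c->buf);    /* always append a private copy */
    }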


Re: [BUG] add_again() off-by-one error in custom format

2017-06-12 Thread Junio C Hamano
Michael Giuffrida  writes:

> For the abbreviated commit hash placeholder ('h'), pretty.c uses
> add_again() to cache the result of the expansion, and then uses that
> result in future expansions. This causes problems when the expansion
> includes whitespace:
>
> $ git log -n 1 --pretty='format:newline:%+h%nno_newline:%h'
> newline:
> a0b1c2d
> no_newline:
> a0b1c2
>
> The second expansion of the hash added an unwanted newline and removed
> the final character. It seemingly used the cached expansion *starting
> from the inserted newline*.
>
> The expected output is:
>
> $ git log -n 1 --pretty='format:newline:%+h%nno_newline:%h'
> newline:
> a0b1c2d
> no_newline:a0b1c2d

Nicely explained.  The add_again() mechanism caches an earlier
result by remembering the offset and the length in the strbuf the
formatted string is being accumulated to, but because %+x (and
probably %-x) magic placeholders futz with the result of
format_commit_one() in place, the cache goes out of sync, of course.

I think the call to format_commit_one() needs to be taught to pass
'magic' through, and then the add_again() mechanism needs to be told not
to cache when magic is in effect, or something like that.

Perhaps something along this line...?

 pretty.c | 64 ++--
 1 file changed, 38 insertions(+), 26 deletions(-)

diff --git a/pretty.c b/pretty.c
index 09701bd2ff..be22b12986 100644
--- a/pretty.c
+++ b/pretty.c
@@ -789,20 +789,22 @@ struct format_commit_context {
size_t wrap_start;
 };
 
-static int add_again(struct strbuf *sb, struct chunk *chunk)
+static int add_again(struct strbuf *sb, struct chunk *chunk, int no_cache)
 {
if (chunk->len) {
strbuf_adddup(sb, chunk->off, chunk->len);
return 1;
}
 
-   /*
-* We haven't seen this chunk before.  Our caller is surely
-* going to add it the hard way now.  Remember the most likely
-* start of the to-be-added chunk: the current end of the
-* struct strbuf.
-*/
-   chunk->off = sb->len;
+   if (!no_cache) {
+   /*
+* We haven't seen this chunk before.  Our caller is surely
+* going to add it the hard way now.  Remember the most likely
+* start of the to-be-added chunk: the current end of the
+* struct strbuf.
+*/
+   chunk->off = sb->len;
+   }
return 0;
 }
 
@@ -1066,15 +1068,24 @@ static size_t parse_padding_placeholder(struct strbuf *sb,
return 0;
 }
 
+enum format_commit_item_magic {
+   NO_MAGIC,
+   ADD_LF_BEFORE_NON_EMPTY,
+   DEL_LF_BEFORE_EMPTY,
+   ADD_SP_BEFORE_NON_EMPTY
+};
+
 static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
const char *placeholder,
-   void *context)
+   void *context,
+   enum format_commit_item_magic magic)
 {
struct format_commit_context *c = context;
const struct commit *commit = c->commit;
const char *msg = c->message;
struct commit_list *p;
int ch;
+   int no_cache;
 
/* these are independent of the commit */
switch (placeholder[0]) {
@@ -1139,6 +1150,8 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
if (!commit->object.parsed)
parse_object(&commit->object.oid);
 
+   no_cache = magic != NO_MAGIC;
+
switch (placeholder[0]) {
case 'H':   /* commit hash */
strbuf_addstr(sb, diff_get_color(c->auto_color, DIFF_COMMIT));
@@ -1147,24 +1160,26 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
return 1;
case 'h':   /* abbreviated commit hash */
strbuf_addstr(sb, diff_get_color(c->auto_color, DIFF_COMMIT));
-   if (add_again(sb, &c->abbrev_commit_hash)) {
+   if (add_again(sb, &c->abbrev_commit_hash, no_cache)) {
strbuf_addstr(sb, diff_get_color(c->auto_color, DIFF_RESET));
return 1;
}
strbuf_add_unique_abbrev(sb, commit->object.oid.hash,
 c->pretty_ctx->abbrev);
strbuf_addstr(sb, diff_get_color(c->auto_color, DIFF_RESET));
-   c->abbrev_commit_hash.len = sb->len - c->abbrev_commit_hash.off;
+   if (!no_cache)
+   c->abbrev_commit_hash.len = sb->len - c->abbrev_commit_hash.off;
return 1;
case 'T':   /* tree hash */
strbuf_addstr(sb, oid_to_hex(&commit->tree->object.oid));
return 1;
case 't':   /* abbreviated tree hash */
-   if (add_again(sb, &c->abbrev_tree_hash))
+ 

[BUG] add_again() off-by-one error in custom format

2017-06-11 Thread Michael Giuffrida
In a custom pretty format, using the '+' or ' ' combinators to prefix
a non-empty expansion with whitespace will erroneously truncate
subsequent expansions of the same type.

Normally '%+X' inserts a newline before the expansion of X, IFF that
expansion is non-empty:

$ git log -n 1 --pretty='format:newline:%+an'
newline:
MyName

For the abbreviated commit hash placeholder ('h'), pretty.c uses
add_again() to cache the result of the expansion, and then uses that
result in future expansions. This causes problems when the expansion
includes whitespace:

$ git log -n 1 --pretty='format:newline:%+h%nno_newline:%h'
newline:
a0b1c2d
no_newline:
a0b1c2

The second expansion of the hash added an unwanted newline and removed
the final character. It seemingly used the cached expansion *starting
from the inserted newline*.

The expected output is:

$ git log -n 1 --pretty='format:newline:%+h%nno_newline:%h'
newline:
a0b1c2d
no_newline:a0b1c2d


/CC René from commit b9c62321 and Junio from 9fa708d