Re: [PATCH v3 5/6] split-index: don't compare stat data of entries already marked for split index

2018-09-29 Thread SZEDER Gábor
On Sat, Sep 29, 2018 at 11:14:29AM +0200, SZEDER Gábor wrote:
> > > Note that comparing the cached data in copied and original entries in
> > 
> > s/cached data/cached stat data/ ? I was confused for a bit.
> 
> No, it's indeed cached data, but now that you mention it, the subject
> line does need a s/stat //.

Or rather s/stat/cached/



Re: [PATCH v3 5/6] split-index: don't compare stat data of entries already marked for split index

2018-09-29 Thread SZEDER Gábor
On Sat, Sep 29, 2018 at 07:36:08AM +0200, Duy Nguyen wrote:
> On Fri, Sep 28, 2018 at 06:24:58PM +0200, SZEDER Gábor wrote:
> > When unpack_trees() constructs a new index, it copies cache entries
> > from the original index [1].  prepare_to_write_split_index() has to
> > deal with this, and it has a dedicated code path for copied entries
> > that are present in the shared index, where it compares the cached
> > data in the corresponding copied and original entries.  If the cached
> > data matches, then they are considered the same; if it differs, then
> > the copied entry will be marked for inclusion as a replacement entry
> > in the just about to be written split index by setting the
> > CE_UPDATE_IN_BASE flag.
> > 
> > However, a cache entry already has its CE_UPDATE_IN_BASE flag set upon
> > reading the split index, if the entry already has a replacement entry
> > there, or upon refreshing the cached stat data, if the corresponding
> > file was modified.  The state of this flag is then preserved when
> > unpack_trees() copies a cache entry from the shared index.
> > 
> > So modify prepare_to_write_split_index() to check the copied cache
> > entries' CE_UPDATE_IN_BASE flag first, and skip the thorough
> > comparison of cached data if the flag is already set.
> 
> OK so this is an optimization, not a bug fix. Right?

Well, a microoptimization at most: with all what's going on in
unpack_trees() I seriously doubt that it's effect is measurable.

> > Note that comparing the cached data in copied and original entries in
> 
> s/cached data/cached stat data/ ? I was confused for a bit.

No, it's indeed cached data, but now that you mention it, the subject
line does need a s/stat //.

The comparison is done with this call:

  ret = memcmp(>ce_stat_data, >ce_stat_data,
   offsetof(struct cache_entry, name) -
   offsetof(struct cache_entry, ce_stat_data));

i.e. it starts at the stat data and ends just before the cache entry's
name, and 'struct cache_entry' has several other fields between these
two, including e.g. the cached oid:

  struct cache_entry {
  struct hashmap_entry ent;
  struct stat_data ce_stat_data;
  unsigned int ce_mode;
  unsigned int ce_flags;
  unsigned int mem_pool_allocated;
  unsigned int ce_namelen;
  unsigned int index; /* for link extension */
  struct object_id oid;
  char name[FLEX_ARRAY]; /* more */
  };

However, to me it's mostly about clarity of the code, and about
documenting that CE_UPDATE_IN_BASE might have already been set at that
point and why, so the next dev diving in to debug the split index
doesn't have to figure this out himself.

> > the shared index might actually be entirely unnecessary.  In theory
> > all code paths refreshing the cached stat data of an entry in the
> > shared index should set the CE_UPDATE_IN_BASE flag in that entry, and
> > unpack_trees() should preserve this flag when copying cache entries.
> > This means that the cached data is only ever changed if the
> > CE_UPDATE_IN_BASE flag is set as well.  Our test suite seems to
> > confirm this: instrumenting the conditions in question and running the
> > test suite repeatedly with 'GIT_TEST_SPLIT_INDEX=yes' showed that the
> > cached data in a copied entry differs from the data in the shared
> > entry only if its CE_UPDATE_IN_BASE flag is indeed set.
> 
> Yes I was probably just being paranoid (or sticking to simpler
> checks). I was told that split index is computation expensive and not
> doing unnecesary/expensive checks may help. But let's leave it for
> later.
> 
> > +   } else {
> > +   /*
> > +* Thoroughly compare the cached data to see
> > +* whether it should be marked for inclusion
> > +* in the split index.
> > +*
> > +* This comparison might be unnecessary, as
> > +* code paths modifying the cached data do
> > +* set CE_UPDATE_IN_BASE as well.
> > +*/
> > +   const unsigned int ondisk_flags =
> > +   CE_STAGEMASK | CE_VALID |
> > +   CE_EXTENDED_FLAGS;
> > +   unsigned int ce_flags, base_flags, ret;
> > +   ce_flags = ce->ce_flags;
> > +   base_flags = base->ce_flags;
> > +   /* only on-disk flags matter */
> > +   ce->ce_flags   &= ondisk_flags;
> > +   base->ce_flags &= ondisk_flags;
> > +   ret = memcmp(>ce_stat_data, 
> > >ce_stat_data,
> > +offsetof(struct cache_entry, name) 
> > -
> > +offsetof(struct 

Re: [PATCH v3 5/6] split-index: don't compare stat data of entries already marked for split index

2018-09-28 Thread Duy Nguyen
On Fri, Sep 28, 2018 at 06:24:58PM +0200, SZEDER Gábor wrote:
> When unpack_trees() constructs a new index, it copies cache entries
> from the original index [1].  prepare_to_write_split_index() has to
> deal with this, and it has a dedicated code path for copied entries
> that are present in the shared index, where it compares the cached
> data in the corresponding copied and original entries.  If the cached
> data matches, then they are considered the same; if it differs, then
> the copied entry will be marked for inclusion as a replacement entry
> in the just about to be written split index by setting the
> CE_UPDATE_IN_BASE flag.
> 
> However, a cache entry already has its CE_UPDATE_IN_BASE flag set upon
> reading the split index, if the entry already has a replacement entry
> there, or upon refreshing the cached stat data, if the corresponding
> file was modified.  The state of this flag is then preserved when
> unpack_trees() copies a cache entry from the shared index.
> 
> So modify prepare_to_write_split_index() to check the copied cache
> entries' CE_UPDATE_IN_BASE flag first, and skip the thorough
> comparison of cached data if the flag is already set.

OK so this is an optimization, not a bug fix. Right?

> Note that comparing the cached data in copied and original entries in

s/cached data/cached stat data/ ? I was confused for a bit.

> the shared index might actually be entirely unnecessary.  In theory
> all code paths refreshing the cached stat data of an entry in the
> shared index should set the CE_UPDATE_IN_BASE flag in that entry, and
> unpack_trees() should preserve this flag when copying cache entries.
> This means that the cached data is only ever changed if the
> CE_UPDATE_IN_BASE flag is set as well.  Our test suite seems to
> confirm this: instrumenting the conditions in question and running the
> test suite repeatedly with 'GIT_TEST_SPLIT_INDEX=yes' showed that the
> cached data in a copied entry differs from the data in the shared
> entry only if its CE_UPDATE_IN_BASE flag is indeed set.

Yes I was probably just being paranoid (or sticking to simpler
checks). I was told that split index is computation expensive and not
doing unnecesary/expensive checks may help. But let's leave it for
later.

> + } else {
> + /*
> +  * Thoroughly compare the cached data to see
> +  * whether it should be marked for inclusion
> +  * in the split index.
> +  *
> +  * This comparison might be unnecessary, as
> +  * code paths modifying the cached data do
> +  * set CE_UPDATE_IN_BASE as well.
> +  */
> + const unsigned int ondisk_flags =
> + CE_STAGEMASK | CE_VALID |
> + CE_EXTENDED_FLAGS;
> + unsigned int ce_flags, base_flags, ret;
> + ce_flags = ce->ce_flags;
> + base_flags = base->ce_flags;
> + /* only on-disk flags matter */
> + ce->ce_flags   &= ondisk_flags;
> + base->ce_flags &= ondisk_flags;
> + ret = memcmp(>ce_stat_data, 
> >ce_stat_data,
> +  offsetof(struct cache_entry, name) 
> -
> +  offsetof(struct cache_entry, 
> ce_stat_data));
> + ce->ce_flags = ce_flags;
> + base->ce_flags = base_flags;

Maybe make this block a separate function (compare_ce_content or
something). The amount of indentation is getting too high.

> + if (ret)
> + ce->ce_flags |= CE_UPDATE_IN_BASE;
> + }
>   discard_cache_entry(base);
>   si->base->cache[ce->index - 1] = ce;
>   }
> -- 
> 2.19.0.361.gafc87ffe72
> 


[PATCH v3 5/6] split-index: don't compare stat data of entries already marked for split index

2018-09-28 Thread SZEDER Gábor
When unpack_trees() constructs a new index, it copies cache entries
from the original index [1].  prepare_to_write_split_index() has to
deal with this, and it has a dedicated code path for copied entries
that are present in the shared index, where it compares the cached
data in the corresponding copied and original entries.  If the cached
data matches, then they are considered the same; if it differs, then
the copied entry will be marked for inclusion as a replacement entry
in the just about to be written split index by setting the
CE_UPDATE_IN_BASE flag.

However, a cache entry already has its CE_UPDATE_IN_BASE flag set upon
reading the split index, if the entry already has a replacement entry
there, or upon refreshing the cached stat data, if the corresponding
file was modified.  The state of this flag is then preserved when
unpack_trees() copies a cache entry from the shared index.

So modify prepare_to_write_split_index() to check the copied cache
entries' CE_UPDATE_IN_BASE flag first, and skip the thorough
comparison of cached data if the flag is already set.

Note that comparing the cached data in copied and original entries in
the shared index might actually be entirely unnecessary.  In theory
all code paths refreshing the cached stat data of an entry in the
shared index should set the CE_UPDATE_IN_BASE flag in that entry, and
unpack_trees() should preserve this flag when copying cache entries.
This means that the cached data is only ever changed if the
CE_UPDATE_IN_BASE flag is set as well.  Our test suite seems to
confirm this: instrumenting the conditions in question and running the
test suite repeatedly with 'GIT_TEST_SPLIT_INDEX=yes' showed that the
cached data in a copied entry differs from the data in the shared
entry only if its CE_UPDATE_IN_BASE flag is indeed set.

In practice, however, our test suite doesn't have 100% coverage,
GIT_TEST_SPLIT_INDEX is inherently random, and I certainly can't claim
to possess complete understanding of what goes on in unpack_trees()...
Therefore I kept the comparison of the cached data when
CE_UPDATE_IN_BASE is not set, just in case that an unnoticed or future
code path were to accidentally miss setting this flag upon refreshing
the cached stat data or unpack_trees() were to drop this flag while
copying a cache entry.

[1] Note that when unpack_trees() constructs the new index and decides
that a cache entry should now refer to different content than what
was recorded in the original index (e.g. 'git read-tree -m
HEAD^'), then that can't really be considered a copy of the
original, but rather the creation of a new entry.  Notably and
pertinent to the split index feature, such a new entry doesn't
have a reference to the original's shared index entry anymore,
i.e. its 'index' field is set to 0.  Consequently, such an entry
is treated by prepare_to_write_split_index() as an entry not
present in the shared index and it will be added to the new split
index, while the original entry will be marked as deleted, and
neither the above discussion nor the changes in this patch apply
to them.

Signed-off-by: SZEDER Gábor 
---
 split-index.c | 79 ---
 1 file changed, 62 insertions(+), 17 deletions(-)

diff --git a/split-index.c b/split-index.c
index 548272ec33..7d8799f6b7 100644
--- a/split-index.c
+++ b/split-index.c
@@ -207,13 +207,28 @@ void prepare_to_write_split_index(struct index_state 
*istate)
 */
for (i = 0; i < istate->cache_nr; i++) {
struct cache_entry *base;
-   /* namelen is checked separately */
-   const unsigned int ondisk_flags =
-   CE_STAGEMASK | CE_VALID | CE_EXTENDED_FLAGS;
-   unsigned int ce_flags, base_flags, ret;
ce = istate->cache[i];
-   if (!ce->index)
+   if (!ce->index) {
+   /*
+* During simple update index operations this
+* is a cache entry that is not present in
+* the shared index.  It will be added to the
+* split index.
+*
+* However, it might also represent a file
+* that already has a cache entry in the
+* shared index, but a new index has just
+* been constructed by unpack_trees(), and
+* this entry now refers to different content
+* than what was recorded in the original
+* index, e.g. during 'read-tree -m HEAD^' or
+* 'checkout HEAD^'.  In this case the
+*