Re: [RFC] fast-import: invalidate pack_id references after loosening

2016-05-27 Thread Eric Wong
Jeff King  wrote:
> On Thu, May 26, 2016 at 08:02:36AM +, Eric Wong wrote:
> 
> > > That's probably OK in practice, as I think the actual pack-write already
> > > does a linear walk of the object table to generate the pack index. So
> > > presumably nobody checkpoints often enough for it to be a problem. And
> > > the solution would be to keep a separate list of pointers to objects for
> > > the current pack-id, which would trivially fix both this case and the
> > > one in create_index().  So we can punt on it until somebody actually
> > > runs into it, I think.
> > 
> > I might checkpoint that much and run into the problem soon :)
> > Too tired now; maybe in a day or two I'll be able to make sense
> > of C again :x

OK, I guess I was too tired and not thinking clearly.

I now figure it's not worth it to checkpoint frequently and have
a very-long-lived fast-import process.  The object table and
marks set will grow indefinitely, so forking off another gfi
with a new set of marks is better for my case[1], anyways.


Junio: I think my RFC can go into pu as-is and replace
Jeff's object_table discarding patch.


[1] inotify watching a Maildir + fast-import

> In theory the list of objects in the allocation pool is contiguous for
> a particular pack. So you could break out of the double-loop at the
> first non-matching object you see:
> 
> diff --git a/fast-import.c b/fast-import.c
> index 83558dc..f214bd3 100644
> --- a/fast-import.c
> +++ b/fast-import.c
> @@ -900,10 +900,13 @@ static const char *create_index(void)
>   c = idx;
>   for (o = blocks; o; o = o->next_pool)
>   for (e = o->next_free; e-- != o->entries;)
>   if (pack_id == e->pack_id)
>   *c++ = &e->idx;
> + else
> + goto done;
> +done:
>   last = idx + object_count;
>   if (c != last)
>   die("internal consistency error creating the index");
>  
>   tmpfile = write_idx_file(NULL, idx, object_count, &pack_idx_opts, pack_data->sha1);
> 
> But that seems to break some of the tests, so I think there is some case
> which does not match my theory. :)

Yes, it's probably because duplicate objects from the object
store create entries with e->pack_id == MAX_PACK_ID.

> Another strategy is to note that we cross-check against object_count
> here. So you could simply break out of the loop when we have found
> object_count matches.

I'm worried that could fail to detect bugs which should've led
us to die with the internal consistency error.


What we could probably do more safely is limit the scan to recent
object pools: only those created since the previous create_index
call (plus the last pool scanned, which may have filled up since).

This passes tests, but I could be missing something:

--- a/fast-import.c
+++ b/fast-import.c
@@ -926,14 +926,16 @@ static const char *create_index(void)
struct pack_idx_entry **idx, **c, **last;
struct object_entry *e;
struct object_entry_pool *o;
+   static struct object_entry_pool *last_seen;
 
/* Build the table of object IDs. */
ALLOC_ARRAY(idx, object_count);
c = idx;
-   for (o = blocks; o; o = o->next_pool)
+   for (o = blocks; o; o = (o == last_seen) ? NULL : o->next_pool)
for (e = o->next_free; e-- != o->entries;)
if (pack_id == e->pack_id)
*c++ = &e->idx;
+   last_seen = blocks;
last = idx + object_count;
if (c != last)
die("internal consistency error creating the index");
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] fast-import: invalidate pack_id references after loosening

2016-05-26 Thread Jeff King
On Thu, May 26, 2016 at 08:02:36AM +, Eric Wong wrote:

> > That's probably OK in practice, as I think the actual pack-write already
> > does a linear walk of the object table to generate the pack index. So
> > presumably nobody checkpoints often enough for it to be a problem. And
> > the solution would be to keep a separate list of pointers to objects for
> > the current pack-id, which would trivially fix both this case and the
> > one in create_index().  So we can punt on it until somebody actually
> > runs into it, I think.
> 
> I might checkpoint that much and run into the problem soon :)
> Too tired now; maybe in a day or two I'll be able to make sense
> of C again :x

In theory the list of objects in the allocation pool is contiguous for
a particular pack. So you could break out of the double-loop at the
first non-matching object you see:

diff --git a/fast-import.c b/fast-import.c
index 83558dc..f214bd3 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -900,10 +900,13 @@ static const char *create_index(void)
c = idx;
for (o = blocks; o; o = o->next_pool)
for (e = o->next_free; e-- != o->entries;)
if (pack_id == e->pack_id)
*c++ = &e->idx;
+   else
+   goto done;
+done:
last = idx + object_count;
if (c != last)
die("internal consistency error creating the index");
 
	tmpfile = write_idx_file(NULL, idx, object_count, &pack_idx_opts, pack_data->sha1);

But that seems to break some of the tests, so I think there is some case
which does not match my theory. :)

Another strategy is to note that we cross-check against object_count
here. So you could simply break out of the loop when we have found
object_count matches.

We don't keep similar counters for branches/tags (which have a similar
quadratic problem, though presumably much smaller "n"), but it would be
easy to keep an offset in start_packfile() and just start the iteration
there.

-Peff


Re: [RFC] fast-import: invalidate pack_id references after loosening

2016-05-26 Thread Eric Wong
Jeff King  wrote:
> On Wed, May 25, 2016 at 10:54:02PM +, Eric Wong wrote:
> > +   for (h = 0; h < ARRAY_SIZE(object_table); h++) {
> > +   struct object_entry *e;
> > +
> > +   for (e = object_table[h]; e; e = e->next)
> > +   if (e->pack_id == id)
> > +   e->pack_id = MAX_PACK_ID;
> > +   }



> This looks pretty straightforward. I do notice that we never shrink the
> number of items in the object table when checkpointing, and so our
> linear walk will grow ever larger. So if one were to checkpoint every
> k-th object, it makes the whole operation quadratic in the number of
> objects (actually, O(n^2/k) but k is a constant).

Good point, I'll work on a separate patch to fix it.

> That's probably OK in practice, as I think the actual pack-write already
> does a linear walk of the object table to generate the pack index. So
> presumably nobody checkpoints often enough for it to be a problem. And
> the solution would be to keep a separate list of pointers to objects for
> the current pack-id, which would trivially fix both this case and the
> one in create_index().  So we can punt on it until somebody actually
> runs into it, I think.

I might checkpoint that much and run into the problem soon :)
Too tired now; maybe in a day or two I'll be able to make sense
of C again :x


Re: [RFC] fast-import: invalidate pack_id references after loosening

2016-05-25 Thread Jeff King
On Wed, May 25, 2016 at 10:54:02PM +, Eric Wong wrote:

> When loosening a pack, the current pack_id gets reused when
> checkpointing and the import does not terminate.  This causes
> problems after checkpointing as the object table, branch, and
> > tag lists still contain pre-checkpoint references to the
> recycled pack_id.
> 
> Merely clearing the object_table as suggested by Jeff King in
> http://mid.gmane.org/20160517121330.ga7...@sigill.intra.peff.net
> is insufficient as the marks set still contains references
> to object entries.
> 
> Wrong pack_id references in the branch and tag lists do not cause
> errors, but can lead to misleading crash reports and core dumps,
> so they are also invalidated.

Nicely explained.

> +static void invalidate_pack_id(unsigned int id)
> +{
> + unsigned int h;
> + unsigned long lu;
> + struct tag *t;
> +
> + for (h = 0; h < ARRAY_SIZE(object_table); h++) {
> + struct object_entry *e;
> +
> + for (e = object_table[h]; e; e = e->next)
> + if (e->pack_id == id)
> + e->pack_id = MAX_PACK_ID;
> + }
> +
> + for (lu = 0; lu < branch_table_sz; lu++) {
> + struct branch *b;
> +
> + for (b = branch_table[lu]; b; b = b->table_next_branch)
> + if (b->pack_id == id)
> + b->pack_id = MAX_PACK_ID;
> + }
> +
> + for (t = first_tag; t; t = t->next_tag)
> + if (t->pack_id == id)
> + t->pack_id = MAX_PACK_ID;
> +}

This looks pretty straightforward. I do notice that we never shrink the
number of items in the object table when checkpointing, and so our
linear walk will grow ever larger. So if one were to checkpoint every
k-th object, it makes the whole operation quadratic in the number of
objects (actually, O(n^2/k) but k is a constant).

That's probably OK in practice, as I think the actual pack-write already
does a linear walk of the object table to generate the pack index. So
presumably nobody checkpoints often enough for it to be a problem. And
the solution would be to keep a separate list of pointers to objects for
the current pack-id, which would trivially fix both this case and the
one in create_index().  So we can punt on it until somebody actually
runs into it, I think.

-Peff


[RFC] fast-import: invalidate pack_id references after loosening

2016-05-25 Thread Eric Wong
When loosening a pack, the current pack_id gets reused when
checkpointing and the import does not terminate.  This causes
problems after checkpointing as the object table, branch, and
tag lists still contain pre-checkpoint references to the
recycled pack_id.

Merely clearing the object_table as suggested by Jeff King in
http://mid.gmane.org/20160517121330.ga7...@sigill.intra.peff.net
is insufficient as the marks set still contains references
to object entries.

Wrong pack_id references in the branch and tag lists do not cause
errors, but can lead to misleading crash reports and core dumps,
so they are also invalidated.

Signed-off-by: Eric Wong 
---
 I started writing a standalone test case; but testing with
 original failing cases would be greatly appreciated.

 Still learning my way around the fast-import code...
 Thanks.

 fast-import.c   | 31 +++-
 t/t9302-fast-import-unpack-limit.sh | 57 +
 2 files changed, 87 insertions(+), 1 deletion(-)

diff --git a/fast-import.c b/fast-import.c
index 0e8bc6a..b9db4b6 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -597,6 +597,33 @@ static struct object_entry *insert_object(unsigned char *sha1)
return e;
 }
 
+static void invalidate_pack_id(unsigned int id)
+{
+   unsigned int h;
+   unsigned long lu;
+   struct tag *t;
+
+   for (h = 0; h < ARRAY_SIZE(object_table); h++) {
+   struct object_entry *e;
+
+   for (e = object_table[h]; e; e = e->next)
+   if (e->pack_id == id)
+   e->pack_id = MAX_PACK_ID;
+   }
+
+   for (lu = 0; lu < branch_table_sz; lu++) {
+   struct branch *b;
+
+   for (b = branch_table[lu]; b; b = b->table_next_branch)
+   if (b->pack_id == id)
+   b->pack_id = MAX_PACK_ID;
+   }
+
+   for (t = first_tag; t; t = t->next_tag)
+   if (t->pack_id == id)
+   t->pack_id = MAX_PACK_ID;
+}
+
 static unsigned int hc_str(const char *s, size_t len)
 {
unsigned int r = 0;
@@ -993,8 +1020,10 @@ static void end_packfile(void)
cur_pack_sha1, pack_size);
 
if (object_count <= unpack_limit) {
-   if (!loosen_small_pack(pack_data))
+   if (!loosen_small_pack(pack_data)) {
+   invalidate_pack_id(pack_id);
goto discard_pack;
+   }
}
 
close(pack_data->pack_fd);
diff --git a/t/t9302-fast-import-unpack-limit.sh b/t/t9302-fast-import-unpack-limit.sh
index 0f686d2..a04de14 100755
--- a/t/t9302-fast-import-unpack-limit.sh
+++ b/t/t9302-fast-import-unpack-limit.sh
@@ -45,4 +45,61 @@ test_expect_success 'bigger packs are preserved' '
test $(find .git/objects/pack -type f | wc -l) -eq 2
 '
 
+test_expect_success 'lookups after checkpoint works' '
+   hello_id=$(echo hello | git hash-object --stdin -t blob) &&
+   id="$GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE" &&
+   before=$(git rev-parse refs/heads/master^0) &&
+   (
+   cat <<-INPUT_END &&
+   blob
+   mark :1
+   data 6
+   hello
+
+   commit refs/heads/master
+   mark :2
+   committer $id
+   data <<COMMIT
+   checkpoint after this
+   COMMIT
+   from refs/heads/master^0
+   M 100644 :1 hello
+
+   checkpoint
+   INPUT_END
+
+   n=0 &&
+   from=$before &&
+   while test x"$from" = x"$before"
+   do
+   if test $n -gt 30
+   then
+   echo >&2 "checkpoint did not update branch"
+   exit 1
+   else
+   n=$(($n + 1))
+   fi &&
+   sleep 1 &&
+   from=$(git rev-parse refs/heads/master^0)
+   done &&
+   cat <<-INPUT_END &&
+   commit refs/heads/master
+   committer $id
+   data