Re: What's cooking in git.git (Nov 2013, #05; Thu, 21)

2013-11-22 Thread Vicent Marti
On Fri, Nov 22, 2013 at 6:26 PM, Jeff King p...@peff.net wrote:
 Granted, the way I verified this was checking whether you renamed
 rlw_xor_run_bit() to something more fitting, so perhaps you just forgot
 that one thing but did all the rest.

 I didn't touch that. Vicent, did you have a comment on the name (it
 really does look like it is a negation, and the only caller is
 ewah_not).

Yes, the name was ported straight from the original library and kept
as-is to make the translation more straightforward. These sources are
--again-- a translation, so I tried to remain as close to the original
Java implementation as possible.

I agree the name is not ideal, but it does make quite a bit of sense.
It effectively inverts the word based on the run bit, which is the
equivalent of xoring it with the bit if it's one.

Love,
vmg
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 10/19] pack-bitmap: add support for bitmap indexes

2013-10-30 Thread Vicent Marti
On Wed, Oct 30, 2013 at 9:10 AM, Jeff King p...@peff.net wrote:

 In fact, I'm not quite sure that even a partial reuse up to an offset is
 100% safe. In a newly packed git repo it is, because we always put bases
 before deltas (and OFS_DELTA objects need this). But if you had a bitmap
 generated from a fixed thin pack, we would have REF_DELTA objects early
 on that depend on bases appended to the end of the pack. So I really
 wonder if we should scrap this partial reuse and either just have full
 reuse, or go through the regular object_entry construction.

 Vicent, you've thought about the reuse code a lot more than I have. Any
 thoughts?

Yes, our pack writing and bitmap code takes enough precautions to
arrange the objects in the packfile in a way that can be partially
reused, so for any given bitmap file written from Git, I'd say we're
safe to always reuse the leader of the pack if this is possible.

For bitmaps generated from JGit, however, we cannot make this
assumption. I mean, we can right now (from my understanding of the
current implementation for pack-objects on JGit), but they are free to
change this in the future.

Obviously I intend to keep the pack reuse on production because the
CPU savings are noticeable, but we can drop it from the public
patchset. Ideally, we'd have full pack reuse like JGit, but we cannot
reasonably do that in GitHub because splitting a pack for the network
root would double our disk usage for all the forks.

luv,
vmg
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 11/19] pack-objects: use bitmaps when packing objects

2013-10-30 Thread Vicent Marti
On Wed, Oct 30, 2013 at 11:38 AM, Shawn Pearce spea...@spearce.org wrote:
 Since (1) is relatively rare, I think we are using this as a proxy for
 (2), so that we can do a regular walk rather than looking around for
 bitmaps that don't exist. But I may be misremembering the reasoning.
 Vicent?

 Ah. I am not sure if we do this in JGit. I think JGit's approach is to
 look if the have appears in a pack with bitmaps, this is a simple
 lookup in the .idx file and does not require expanding any data from
 the .bitmap file.

No, you don't do this in JGit right now. This is a small optimization
we implemented to prevent *loading* the bitmap file (and hence
building the reverse index, which can be expensive) unless we're sure
we can at least *attempt* a bitmap walk.

 But it wasn't my question. :-)

 Client sends want B ; have E. What if E appears in the bitmapped
 pack, but does not itself have a bitmap? Do you walk backwards from B
 and switch to the bitmap algorithm when you find a commit that has a
 bitmap and that bitmap contains E?

This is correct, we use the same heuristics as JGit. Once we have
loaded a bitmap file we know we can attempt a bitmap walk; if E is on
the bitmapped pack, we'll build a temporary bitmap using an extended
index (with bits for commits that are not in the packfile) as we walk
backwards until E. Once we find a commit with a bitmap, we'll OR that
with the temporary bitmap to skip the full walk.

 In JGit we write the to_pack list first, then the reuse pack. Our
 rationale was the to_pack list is recent objects that are newer and
 would appear first in a traditional traversal, so they should go at
 the front of the stream. This does mean if they delta compress against
 an object in that reuse_packfile slice they have to use REF_DELTA
 instead of OFS_DELTA.

 That's a good point. In our case I think we do not delta against the
 reused packfile objects at all, as we simply send out the whole slice of
 packfile without making an entry for each object.

 JGit also doesn't use the reused packfile as delta bases, because
 there are no object entries to shove through the delta window. So
 there is never any risk of a reference to a base later in the file. It
 also means that thin pack at the front of the stream is less
 optimally compressed. At Google we side step that by doing GC at the
 server very often, to try and keep the number of objects in that pack
 low.

 It might make sense to use a commit that covers the majority of the
 reused pack as the edge base candidate case during delta compression
 here, as though the client had sent us a have for that commit. I
 don't think we have tried this in JGit. It would make deltas use
 REF_DELTA, but the delta size has to be smaller than the REF_DELTA
 header to be used in the stream so its still a smaller overall
 transfer.

 Is this series running on github.com/torvalds/linux? Last Saturday I
 ran a live demo clone comparing github.com/torvalds/linux to a JGit
 bitmap clone and some guy heckled me because GitHub was only a few
 seconds slower. :-)

 It is. Use kernel.org if you want to make fun of someone. :)

 Hah. OK, so GitHub was only a few seconds slower only because my
 desktop is better connected to our data centers than to yours. Nicely
 done, this patch series really works. :)

Thanks Shawn, your feedback was very helpful.

Re. cloning speeds: we are actually bottlenecked by our routing layer.
The CPUs in our new fileservers can clone `torvalds/linux` to
/dev/null in 20s, but we're slowing down when routing the actual
traffic back to the client. We're in the process of rewriting our Git
reverse proxys so let's see what the future looks like.

Love,
vmg
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/19] documentation: add documentation for the bitmap format

2013-10-30 Thread Vicent Marti
On Wed, Oct 30, 2013 at 11:23 AM, Shawn Pearce spea...@spearce.org wrote:
 On Wed, Oct 30, 2013 at 7:50 AM, Jeff King p...@peff.net wrote:
 On Fri, Oct 25, 2013 at 01:47:06PM +, Shawn O. Pearce wrote:

 I think Colby and I talked about having additional optional sections
 in this file, but Colby didn't want to overcomplicate the format early
 on. So v1 is probably not very extensible and we may have to go to v2
 to safely create an extension with the name hash cache used in this
 series.

 Given that the JGit v1 bitmap format has been shipping since JGit 3.0
 and in Gerrit Code Review 2.6, its in use in the wild. So we aren't
 going to go back and redefine v1.

 I don't think either course of action affects how JGit in the wild will
 react. If we add a new flag to v1, existing JGit barfs. If we move to
 v2, existing JGit barfs.  In either case, the simplest fix for JGit is
 to ignore the new section.

 Fair point. Then we can use v1 with the flag for now, JGit will barf and...

Shawn, I'm proposing the following patch to JGit (actually Kevin is,
because I don't have the CLA, but whatevs):

https://git.eclipse.org/r/#/c/17894/

It's a very small change (using an and to check for the flags on the
bitmap instead of a switch), but I think it's very clean. In the
spirit of Be conservative in what you send, be liberal in what you
accept, this patch lets JGit read V1 bitmaps emited from git.git
*even* if they have the extended Name Cache extension, and has no
effect on any JGit bitmap that has been generated up to date, or any
JGit bitmap generated from git.git without name caches.

I don't think it *changes* the semantics of the bitmap V1 format,
because with only one bit value set so far, those semantics weren't
really there, and it'll allow newer versions of JGit to never barf.

Love,
vmg
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/19] documentation: add documentation for the bitmap format

2013-10-30 Thread Vicent Marti
On Wed, Oct 30, 2013 at 11:23 AM, Shawn Pearce spea...@spearce.org wrote:
 The name-hash cache is probably important, but it would be nice to
 have a variable or flag we can use to disable the name-cache
 generation and thus permit Git to create JGit style v1 indexes, and
 also use JGit v1 indexes if the name-cache is not available.

Yes, we'll reroll the patches with name-hash cache writing disabled by
default. I still think it'd be nice for JGit to be able to ignore the
flag in the cases where it's there.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 01/16] list-objects: mark tree as unparsed when we free its buffer

2013-06-24 Thread Vicent Marti
From: Jeff King p...@peff.net

We free the tree buffer during traversal to save memory.
However, we do not reset the parsed flag, which leaves a
landmine for the next person to use the tree. When they call
parse_tree it will do nothing, and they will segfault when
they try to access the buffer.

This hasn't mattered until now because most rev-list
traversals would exit the program immediately afterwards,
but the bitmap writer wants to access the trees twice.
---
 list-objects.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/list-objects.c b/list-objects.c
index 3dd4a96..1251180 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -125,6 +125,7 @@ static void process_tree(struct rev_info *revs,
strbuf_setlen(base, baselen);
free(tree-buffer);
tree-buffer = NULL;
+   tree-object.parsed = 0;
 }
 
 static void mark_edge_parents_uninteresting(struct commit *commit,
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 00/16] Speed up Counting Objects with bitmap data

2013-06-24 Thread Vicent Marti
Hello friends and enemies from the lovevely Git Mailing list.

I bring to you a patch series that implement a quite interesting performance
optimization: the removal of the Counting Objects phase during `pack-objects`
by using a pre-computed bitmap to find the reachable objects in the packfile.

As you probably know, Shawn Pearce designed this approach a few months ago and
implemented it for JGit, with very exciting results.

This is a not-so-straightforward port of his original design: The general 
approach
is the same, but unfortunately we were not able to re-use JGit's original 
on-disk
format for the `.bitmap` files.

There is a full technical spec for the new format (v2) in patch 09, including
benchmarks and rationale for the new design. The gist of it is that JGit's
original format is not `mmap`able (JGit tends to not mmap anything), and that
becomes very costly in practice with `upload-pack`, which spawns a new process
for every upload.

The header and metadata for both formats are however compatible, so it should be
trivial to update JGit to read/write this format too. I intend to do this on the
coming weeks, and I also hope that the v2 implementation will be slightly faster
than the actual, even with the shortcomings of the JVM.

The patch series, although massive, is rather straightforward.

Most of the patches are isolated refactorings that enable access to a few 
functions
that were previously hidden (re. packfile data). These functions are needed for
reading and writing the bitmap indexes.

Patch 03 is worth noting because it implements a performance optimization for
`pack-objects` which isn't particularly good in normal invocations (~10% speed 
up)
but that will show great benefits in later patches when it comes to writing the
bitmap indexes.

Patch 10 is the core of the series, implementing the actual loading of bitmap 
indexes
and optimizing the Counting Objects phase of `pack-objects`. Like with every 
other
patch that offers performance improvements, sample benchmarks are provided 
(spoiler:
they are pretty fucking cool).

Patch 11 and 16 are samples of using the new Bitmap traversal API to speed up 
other
parts of Git (`rev-list --objects` and `rev-list --count`, respectively).

Patch 12, 13 and 15 implement the actual writing of bitmap indexes. Like JGit, 
patch
12 enables writing a bitmap index as part of the `pack-objects` process (and 
hence
as part of a normal `gc` run). On top of that, I implemented a new plumbing 
command
in patch 15 that allows to write bitmap indexes for already-existing packfiles.

I'd love your feedback on the design and implementation of this feature. I deem 
it
rather stable, as we've been testing it on production on the world's largest Git
host (Git Hub Dot Com The Web Site) with good results, so I'd love it to have it
upstreamed on Core Git.

Strawberry kisses,
vmg

Jeff King (1):
  list-objects: mark tree as unparsed when we free its buffer

Vicent Marti (15):
  sha1_file: refactor into `find_pack_object_pos`
  pack-objects: use a faster hash table
  pack-objects: make `pack_name_hash` global
  revision: allow setting custom limiter function
  sha1_file: export `git_open_noatime`
  compat: add endinanness helpers
  ewah: compressed bitmap implementation
  documentation: add documentation for the bitmap format
  pack-objects: use bitmaps when packing objects
  rev-list: add bitmap mode to speed up lists
  pack-objects: implement bitmap writing
  repack: consider bitmaps when performing repacks
  sha1_file: implement `nth_packed_object_info`
  write-bitmap: implement new git command to write bitmaps
  rev-list: Optimize --count using bitmaps too

 Documentation/technical/bitmap-format.txt |  235 
 Makefile  |   11 +
 builtin.h |1 +
 builtin/pack-objects.c|  362 +++-
 builtin/pack-objects.h|   33 ++
 builtin/rev-list.c|   35 +-
 builtin/write-bitmap.c|  256 +
 cache.h   |5 +
 ewah/bitmap.c |  229 
 ewah/ewah_bitmap.c|  703 
 ewah/ewah_io.c|  199 +++
 ewah/ewah_rlw.c   |  124 +
 ewah/ewok.h   |  194 +++
 ewah/ewok_rlw.h   |  114 
 git-compat-util.h |   28 +
 git-repack.sh |   10 +-
 git.c |1 +
 khash.h   |  329 +++
 list-objects.c|1 +
 pack-bitmap-write.c   |  520 ++
 pack-bitmap.c |  855 +
 pack-bitmap.h |   64 +++
 pack-write.c  |2

[PATCH 09/16] documentation: add documentation for the bitmap format

2013-06-24 Thread Vicent Marti
This is the technical documentation and design rationale for the new
Bitmap v2 on-disk format.
---
 Documentation/technical/bitmap-format.txt |  235 +
 1 file changed, 235 insertions(+)
 create mode 100644 Documentation/technical/bitmap-format.txt

diff --git a/Documentation/technical/bitmap-format.txt 
b/Documentation/technical/bitmap-format.txt
new file mode 100644
index 000..5400082
--- /dev/null
+++ b/Documentation/technical/bitmap-format.txt
@@ -0,0 +1,235 @@
+GIT bitmap v2 format  rationale
+
+
+   - A header appears at the beginning, using the same format
+   as JGit's original bitmap indexes.
+
+   4-byte signature: {'B', 'I', 'T', 'M'}
+
+   2-byte version number (network byte order)
+   The current implementation only supports version 2
+   of the bitmap index. The rationale for this is explained
+   in this document.
+
+   2-byte flags (network byte order)
+
+   The folowing flags are supported:
+
+   - BITMAP_OPT_FULL_DAG (0x1) REQUIRED
+   This flag must always be present. It implies that the 
bitmap
+   index has been generated for a packfile with full 
closure
+   (i.e. where every single object in the packfile can find
+its parent links inside the same packfile). This is a
+   requirement for the bitmap index format, also present 
in JGit,
+   that greatly reduces the complexity of the 
implementation.
+
+   - BITMAP_OPT_LE_BITMAPS (0x2)
+   If present, this implies that that the EWAH bitmaps in 
this
+   index has been serialized to disk in little-endian byte 
order.
+   Note that this only applies to the actual bitmaps, not 
to the
+   Git data structures in the index, which are always in 
Network
+   Byte order as it's costumary.
+
+   - BITMAP_OPT_BE_BITMAPS (0x4)
+   If present, this implies that the EWAH bitmaps have 
been serialized
+   using big-endian byte order (NWO). If the flag is 
missing, **the
+   default is to assume that the bitmaps are in 
big-endian**.
+
+   - BITMAP_OPT_HASH_CACHE (0x8)
+   If present, a hash cache for finding delta bases will 
be available
+   right after the header block in this index. See the 
following
+   section for details.
+
+   4-byte entry count (network byte order)
+
+   The total count of entries (bitmapped commits) in this 
bitmap index.
+
+   20-byte checksum
+
+   The SHA1 checksum of the pack this bitmap index belongs 
to.
+
+   - An OPTIONAL delta cache follows the header.
+
+   The cache is formed by `n` 4-byte hashes in a row, where `n` is
+   the amount of objects in the indexed packfile. Note that this 
amount
+   is the **total number of objects** and is not related to the
+   number of commits that have been selected and indexed in the
+   bitmap index.
+
+   The hashes are stored in Network Byte Order and they are the 
same
+   values generated by a normal revision walk during the 
`pack-objects`
+   phase.
+
+   The `n`nth hash in the cache is the name hash for the `n`th 
object
+   in the index for the indexed packfile.
+
+   [RATIONALE]:
+
+   The bitmap index allows us to skip the Counting Objects phase
+   during `pack-objects` and yield all the OIDs that would be 
reachable
+   (WANTS) when generating the pack.
+
+   This optimization, however, means that we're adding objects to 
the
+   packfile straight from the packfile index, and hence we are 
lacking
+   path information for the objects that would normally be 
generated
+   during the Counting Objects phase.
+
+   This path information for each object is hashed and used as a 
very
+   effective way to find good delta bases when compressing the 
packfile;
+   without these hashes, the resulting packfiles are much less 
optimal.
+
+   By storing all the hashes in a cache together with the bitmapsin
+   the bitmap index, we can yield not only the SHA1 of all the 
reachable
+   objects, but also their hashes, and allow Git to be much 
smarter when
+   finding delta bases for packing.
+
+   If the delta cache is not available, the bitmap index will 
obviously
+   be smaller in disk, but 

[PATCH 16/16] rev-list: Optimize --count using bitmaps too

2013-06-24 Thread Vicent Marti
If bitmap indexes are available, the process of counting reachable
commits with `git rev-list --count` can be greatly sped up. Instead of
having to use callbacks that yield each object in the revision list, we
can build the reachable bitmap for the list and then use an efficient
popcount to find the number of bits set in the bitmap.

This commit implements a `count_bitmap_commit_list` that can be used
after `prepare_bitmap_walk` has returned successfully to return the
number of commits, trees, blobs or tags that have been found to be
reachable during the walk.

`git rev-list` is taught to use this function call when bitmaps are
enabled instead of going through the old rev-list machinery. Do note,
however, that counts with `left_right` and `cherry_mark` are not
optimized by this patch.

Here are some sample timings of different ways to count commits in
`torvalds/linux`:

$ time ../git/git rev-list master | wc -l
376549

real0m6.973s
user0m3.216s
sys 0m5.316s

$ time ../git/git rev-list --count master
376549

real0m1.933s
user0m1.744s
sys 0m0.188s

$ time ../git/git rev-list --use-bitmaps --count master
376549

real0m0.005s
user0m0.000s
sys 0m0.004s

Note that the time in the `--use-bitmaps` invocation is basically noise.
In my machine it ranges from 2ms to 6ms.
---
 builtin/rev-list.c |   11 +--
 pack-bitmap.c  |   37 +
 pack-bitmap.h  |2 ++
 3 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 905ed08..097adb8 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -354,8 +354,15 @@ int cmd_rev_list(int argc, const char **argv, const char 
*prefix)
revs.limited = 1;
 
if (use_bitmaps  !prepare_bitmap_walk(revs, NULL)) {
-   traverse_bitmap_commit_list(show_object_fast);
-   return 0;
+   if (revs.count  !revs.left_right  !revs.cherry_mark) {
+   uint32_t commit_count;
+   count_bitmap_commit_list(commit_count, NULL, NULL, 
NULL);
+   printf(%d\n, commit_count);
+   return 0;
+   } else {
+   traverse_bitmap_commit_list(show_object_fast);
+   return 0;
+   }
}
 
if (prepare_revision_walk(revs))
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 090db15..65fdce7 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -720,6 +720,43 @@ void traverse_bitmap_commit_list(show_reachable_fn 
show_reachable)
bitmap_git.result = NULL;
 }
 
+static uint32_t count_object_type(
+   struct bitmap *objects,
+   struct ewah_bitmap *type_filter)
+{
+   size_t i = 0, count = 0;
+   struct ewah_iterator it;
+   eword_t filter;
+
+   ewah_iterator_init(it, type_filter);
+
+   while (i  objects-word_alloc  ewah_iterator_next(filter, it)) {
+   eword_t word = objects-words[i++]  filter;
+   count += __builtin_popcountll(word);
+   }
+
+   return count;
+}
+
+void count_bitmap_commit_list(
+   uint32_t *commits, uint32_t *trees, uint32_t *blobs, uint32_t *tags)
+{
+   if (!bitmap_git.result)
+   die(Tried to count bitmap without setting it up first);
+
+   if (commits)
+   *commits = count_object_type(bitmap_git.result, 
bitmap_git.commits);
+
+   if (trees)
+   *trees = count_object_type(bitmap_git.result, bitmap_git.trees);
+
+   if (blobs)
+   *blobs = count_object_type(bitmap_git.result, bitmap_git.blobs);
+
+   if (tags)
+   *tags = count_object_type(bitmap_git.result, bitmap_git.tags);
+}
+
 struct bitmap_test_data {
struct bitmap *base;
struct progress *prg;
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 8e7e3dc..816da6d 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -47,6 +47,8 @@ typedef int (*show_reachable_fn)(
struct packed_git *found_pack,
off_t found_offset);
 
+void count_bitmap_commit_list(
+   uint32_t *commits, uint32_t *trees, uint32_t *blobs, uint32_t *tags);
 void traverse_bitmap_commit_list(show_reachable_fn show_reachable);
 int prepare_bitmap_walk(struct rev_info *revs, uint32_t *result_size);
 void test_bitmap_walk(struct rev_info *revs);
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 12/16] pack-objects: implement bitmap writing

2013-06-24 Thread Vicent Marti
This commit extends more the functionality of `pack-objects` by allowing
it to write out a `.bitmap` index next to any written packs, together
with the `.idx` index that currently gets written.

If bitmaps are enabled for a given repository (either by calling
`pack-objects` with the `--use-bitmaps` flag or by having
`pack.usebitmaps` set to `true` in the config) and pack-objects is
writing a packfile that would normally be indexed (i.e. not piping to
stdout), we will attempt to write the corresponding bitmap index for the
packfile.

Bitmap index writing happens after the packfile and its index has been
successfully written to disk (`finish_tmp_packfile`). The process is
performed in several steps:

1. `bitmap_writer_build_type_index`: this call uses the array of
`struct object_entry`es that has just been sorted when writing out
the actual packfile index to disk to generate 4 type-index bitmaps
(one for each object type).

These bitmaps have their nth bit set if the given object is of the
bitmap's type. E.g. the nth bit of the Commits bitmap will be 1 if
the nth object in the packfile index is a commit.

This is a very cheap operation because the bitmap writing code has
access to the metadata stored in the `struct object_entry` array,
and hence the real type for each object in the packfile.

2. `bitmap_writer_select_commits`: if bitmap writing is enabled for
a given `pack-objects` run, the sequence of commits generated during
the Counting Objects phase will be stored in an array.

We then use that array to build up the list of selected commits.
Writing a bitmap in the index for each object in the repository
would be cost-prohibitive, so we use a simple heuristic to pick the
commits that will be indexed with bitmaps.

The current heuristics are a simplified version of JGit's original
implementation. We select a higher density of commits depending on
their age: the 100 most recent commits are always selected, after
that we pick 1 commit of each 100, and the gap increases as the
commits grow older. On top of that, we make sure that every single
branch that has not been merged (all the tips that would be required
from a clone) gets their own bitmap, and when selecting commits
between a gap, we tend to prioritize the commit with the most
parents.

Do note that there is no right/wrong way to perform commit selection;
different selection algorithms will result in different commits
being selected, but there's no such thing as missing a commit. The
bitmap walker algorithm implemented in `prepare_bitmap_walk` is able
to adapt to missing bitmaps by performing manual walks that complete
the bitmap: the ideal selection algorithm, however, would select
the commits that are more likely to be used as roots for a walk in
the future (e.g. the tips of each branch, and so on) to ensure a
bitmap for them is always available.

3. `bitmap_writer_build`: this is the computationally expensive part
of bitmap generation. Based on the list of commits that were
selected in the previous step, we perform several incremental walks
to generate the bitmap for each commit.

The walks begin from the oldest commit, and are built up
incrementally for each branch. E.g. consider this dag where A, B, C,
D, E, F are the selected commits, and a, b, c, e are a chunk of
simplified history that will not receive bitmaps.

A---a---B--b--C--c--D
 \
  E--e--F

We start by building the bitmap for A, using A as the root for a
revision walk and marking all the objects that are reachable until
the walk is over. Once this bitmap is stored, we reuse the bitmap
walker to perform the walk for B, assuming that once we reach A
again, the walk will be terminated because A has already been SEEN
on the previous walk.

This process is repeated for C, and D, but when we try to generate
the bitmaps for E, we cannot reuse neither the current walk nor the
bitmap we have generated so far.

What we do now is resetting both the walk and clearing the bitmap,
and performing the walk from scratch using E as the origin. This new
walk, however, does not need to be completed. Once we hit B, we can
lookup the bitmap we have already stored for that commit and OR it
with the existing bitmap we've composed so far, allowing us to limit
the walk early.

After all the bitmaps have been generated, another iteration through
the list of commits is performed to find the best XOR offsets for
compression before writing them to disk. Because of the 

[PATCH 10/16] pack-objects: use bitmaps when packing objects

2013-06-24 Thread Vicent Marti
A bitmap index is used, if available, to speed up the Counting Objects
phase during `pack-objects`.

The bitmap index is a `.bitmap` file that can be found inside
`$GIT_DIR/objects/pack/`, next to its corresponding packfile, and
contains precalculated reachability information for selected commits.
The full specification of the format for these bitmap indexes can be found
in `Documentation/technical/bitmap-format.txt`.

For a given commit SHA1, if it happens to be available in the bitmap
index, its bitmap will represent every single object that is reachable
from the commit itself. The nth bit in the bitmap is the nth object in
the index of the packfile; if it's set to 1, the object is reachable.

By using the bitmaps available in the index, this commit implements a new
pair of functions:

- `prepare_bitmap_walk`
- `traverse_bitmap_commit_list`

This first function tries to build a bitmap of all the objects that can be
reached from the commit roots of a given `rev_info` struct by using
the following algorithm:

- If all the interesting commits for a revision walk are available in
the index, the resulting reachability bitmap is the bitwise OR of all
the individual bitmaps.

- When the full set of WANTs is not available in the index, we perform a
partial revision walk using the commits that don't have bitmaps as
roots, and limiting the revision walk as soon as we reach a commit that
has a corresponding bitmap. The earlier OR'ed bitmap with all the
indexed commits can now be completed as this walk progresses, so the end
result is the full reachability list.

- For revision walks with a HAVEs set (a set of commits that are deemed
uninteresting), first we perform the same method as for the WANTs, but
using our HAVEs as roots, in order to obtain a full reachability bitmap
of all the uninteresting commits. This bitmap then can be used to:

a) limit the subsequent walk when building the WANTs bitmap
b) finding the final set of interesting commits by performing an
   AND-NOT of the WANTs and the HAVEs.

If `prepare_bitmap_walk` runs successfully, the resulting bitmap is
stored and the equivalent of a `traverse_commit_list` call can be
performed by using `traverse_bitmap_commit_list`; the bitmap version
of this call yields the objects straight from the packfile index
(without having to look them up or parse them) and hence is several
orders of magnitude faster.

If the `prepare_bitmap_walk` call fails (e.g. because no bitmap files
are available), the `rev_info` struct is left untouched, and can be used
to perform a manual rev-walk using `traverse_commit_list`.

Hence, this new pair of functions are a generic API that allows to
perform the equivalent of

git rev-list --objects [roots...] [^uninteresting...]

for any set of commits, even if they don't have specific bitmaps
generated for them.

In this specific commit, we use the API to perform the
`Counting Objects` phase in `builtin/pack-objects.c`, although it could
be used to speed up other parts of Git that use the same mechanism.

If the pack-objects invocation is being piped to `stdout` (like a normal
`pack-objects` from `upload-pack` would be used) and bitmaps are
enabled, the new `bitmap_walk` API will be used instead of
`traverse_commit_list`.

There are two ways to enable bitmaps for pack-objecs:

- Pass the `--use-bitmaps` flag when calling `pack-objects`
- Set `pack.usebitmaps` to `true` in the git config for the
repository.

Of course, simply enabling the bitmaps is not enought to perform the
optimization: a bitmap index must be available on disk. If no bitmap
index can be found, we'll silently fall back to the slow counting
objects phase.

The point of speeding up the Counting Objects phase of `pack-objects` is
to reduce fetch and clone times for big repositories, which right now
are definitely dominated by the rev-walk algorithm during the Counting
Objects phase.

Here are some sample timings from a full pack of `torvalds/linux` (i.e.
something very similar to what would be generated for a clone of the
repository):

$ time ../git/git pack-objects --all --stdout
Counting objects: 3053537, done.
Compressing objects: 100% (495706/495706), done.
Total 3053537 (delta 2529614), reused 3053537 (delta 2529614)

real0m36.686s
user0m34.440s
sys 0m2.184s

$ time ../git/git pack-objects --all --stdout
Counting objects: 3053537, done.
Compressing objects: 100% (495706/495706), done.
Total 3053537 (delta 2529614), reused 3053537 (delta 2529614)

real0m7.255s
user0m6.892s
sys 0m0.444s

From a hotspot profiling run, we can see how the counting
objects phase has been reduced to about 400ms (down from 28s).
The remaining time is spent finding deltas and writing the packfile, the
optimization of which is out of the scope of this topic.
---
 Makefile   |

[PATCH 15/16] write-bitmap: implement new git command to write bitmaps

2013-06-24 Thread Vicent Marti
The `pack-objects` builtin is capable of writing out bitmap indexes
(.bitmap) next to the their corresponding packfile, as part of the
process of actually generating the packfile.

This is a very efficient operation because all the required data for
writing the bitmap index (commit traversal list, list of all objects in
a packfile, sorted index for the packfile, and types for all objects in
the packfile) is readily available in memory as part of the process of
building the packfile itself.

There are however cases when we want to generate a bitmap index for a
packfile that already exists on disk (i.e. one we're not writing from
scratch). This new git builtin implements the bitmap index equivalent of
`git index-pack`: it writes a `.bitmap` file given a pair of existing
`.pack` and `.idx` files.

NOTE that `write-bitmap` requires the packfile to have been indexed
beforehand. If the packfile doesn't have its corresponding `.idx`
file, `git index-pack` must be called before `write-bitmap` can
work.

The process of generating bitmaps for an existing packfile is as
follows:

1. Load the existing pack index in memory. The `.idx` for the
packfile is loaded into a hash table in mememory so it can be
efficiently queried. As part of this loading process, the real type
for each object in the packfile is resolved (this implies resolving
deltas, which can make this process rather expensive).

2. Find the full closure for the packfile. All the objects from the
packfile that have been loaded in memory are iterated, looking for
commits. These commits are parsed, and their parents are marked as
such to ensure that

a) there is a full closure in the packfile, and no commit has a
dangling parent pointer

b) we can find the set of tips for the packfile, i.e., the set
of commits that don't have any commits pointing to them

3. The tips of the packfile are then used as the roots to perform a
normal revision walk. The result of this revision walk is the list
of commits that will be used by `bitmap_writer_select_commits` when
selecting which commits are going to be bitmapped.

4. We build and write the bitmap index in the same way that
`pack-objects` does, given that we have all the required metadata:

- an array of all the objects in the packfile, in index order
- the types of all these objects
- an array with a walk-ordering of all the commits in the
packfile, which will be used for selection
- a hash table to efficiently look up objects in the index

See the previous patch pack-objects: implement bitmap writing for
details on how the bitmap computation happens.
---
 Makefile   |1 +
 builtin.h  |1 +
 builtin/write-bitmap.c |  256 
 git.c  |1 +
 4 files changed, 259 insertions(+)
 create mode 100644 builtin/write-bitmap.c

diff --git a/Makefile b/Makefile
index 599aa59..4a0a7dd 100644
--- a/Makefile
+++ b/Makefile
@@ -1000,6 +1000,7 @@ BUILTIN_OBJS += builtin/upload-archive.o
 BUILTIN_OBJS += builtin/var.o
 BUILTIN_OBJS += builtin/verify-pack.o
 BUILTIN_OBJS += builtin/verify-tag.o
+BUILTIN_OBJS += builtin/write-bitmap.o
 BUILTIN_OBJS += builtin/write-tree.o
 
 GITLIBS = $(LIB_FILE) $(XDIFF_LIB)
diff --git a/builtin.h b/builtin.h
index 64bab6b..e39685f 100644
--- a/builtin.h
+++ b/builtin.h
@@ -144,6 +144,7 @@ extern int cmd_var(int argc, const char **argv, const char 
*prefix);
 extern int cmd_verify_tag(int argc, const char **argv, const char *prefix);
 extern int cmd_version(int argc, const char **argv, const char *prefix);
 extern int cmd_whatchanged(int argc, const char **argv, const char *prefix);
+extern int cmd_write_bitmap(int argc, const char **argv, const char *prefix);
 extern int cmd_write_tree(int argc, const char **argv, const char *prefix);
 extern int cmd_verify_pack(int argc, const char **argv, const char *prefix);
 extern int cmd_show_ref(int argc, const char **argv, const char *prefix);
diff --git a/builtin/write-bitmap.c b/builtin/write-bitmap.c
new file mode 100644
index 000..0cc1c9e
--- /dev/null
+++ b/builtin/write-bitmap.c
@@ -0,0 +1,256 @@
+#include stdlib.h
+
+#include cache.h
+#include commit.h
+#include tag.h
+#include diff.h
+#include revision.h
+#include progress.h
+#include list-objects.h
+#include pack.h
+#include refs.h
+#include pack-bitmap.h
+
+#include builtin/pack-objects.h
+
+static int progress = 1;
+static struct progress *progress_state;
+static int write_hash_cache;
+
+static struct object_entry **objects;
+static uint32_t nr_objects;
+
+static struct commit **walked_commits;
+static uint32_t nr_commits;
+
+static khash_sha1 *packed_objects;
+
+static struct object_entry 

[PATCH 14/16] sha1_file: implement `nth_packed_object_info`

2013-06-24 Thread Vicent Marti
A new helper function allows to efficiently query the size and real type
of an object in a packfile based on its position on the packfile index.

This is particularly useful when trying to parse all the information of
an index in memory.
---
 cache.h |1 +
 sha1_file.c |6 ++
 2 files changed, 7 insertions(+)

diff --git a/cache.h b/cache.h
index bbe5e2a..26e4567 100644
--- a/cache.h
+++ b/cache.h
@@ -1104,6 +1104,7 @@ extern void clear_delta_base_cache(void);
 extern struct packed_git *add_packed_git(const char *, int, int);
 extern const unsigned char *nth_packed_object_sha1(struct packed_git *, 
uint32_t);
 extern off_t nth_packed_object_offset(const struct packed_git *, uint32_t);
+extern int nth_packed_object_info(struct packed_git *p, uint32_t n, unsigned 
long *sizep);
 extern int find_pack_entry_pos(const unsigned char *sha1, struct packed_git 
*p);
 extern off_t find_pack_entry_one(const unsigned char *, struct packed_git *);
 extern int is_pack_valid(struct packed_git *);
diff --git a/sha1_file.c b/sha1_file.c
index 018a847..fd5bd01 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2223,6 +2223,12 @@ off_t nth_packed_object_offset(const struct packed_git 
*p, uint32_t n)
}
 }
 
+int nth_packed_object_info(struct packed_git *p, uint32_t n, unsigned long 
*sizep)
+{
+   off_t offset = nth_packed_object_offset(p, n);
+   return packed_object_info(p, offset, sizep, NULL);
+}
+
 int find_pack_entry_pos(const unsigned char *sha1, struct packed_git *p)
 {
const uint32_t *level1_ofs = p-index_data;
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 05/16] revision: allow setting custom limiter function

2013-06-24 Thread Vicent Marti
This commit enables users of `struct rev_info` to peform custom limiting
during a revision walk (i.e. `get_revision`).

If the field `include_check` has been set to a callback, this callback
will be issued once for each commit before it is added to the pending
list of the revwalk. If the include check returns 0, the commit will be
marked as added but won't be pushed to the pending list, effectively
limiting the walk.
---
 revision.c |5 +
 revision.h |2 ++
 2 files changed, 7 insertions(+)

diff --git a/revision.c b/revision.c
index f1bb731..fa78c65 100644
--- a/revision.c
+++ b/revision.c
@@ -777,8 +777,13 @@ static int add_parents_to_list(struct rev_info *revs, 
struct commit *commit,
 
if (commit-object.flags  ADDED)
return 0;
+
commit-object.flags |= ADDED;
 
+   if (revs-include_check 
+   !revs-include_check(commit, revs-include_check_data))
+   return 0;
+
/*
 * If the commit is uninteresting, don't try to
 * prune parents - we want the maximal uninteresting
diff --git a/revision.h b/revision.h
index eeea6fb..997a093 100644
--- a/revision.h
+++ b/revision.h
@@ -162,6 +162,8 @@ struct rev_info {
unsigned long min_age;
int min_parents;
int max_parents;
+   int (*include_check)(struct commit *, void *);
+   void *include_check_data;
 
/* diff info for patches and for paths limiting */
struct diff_options diffopt;
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 08/16] ewah: compressed bitmap implementation

2013-06-24 Thread Vicent Marti
EWAH is a word-aligned compressed variant of a bitset (i.e. a data
structure that acts as a 0-indexed boolean array for many entries).

It uses a 64-bit run-length encoding (RLE) compression scheme,
trading some compression for better processing speed.

The goal of this word-aligned implementation is not to achieve
the best compression, but rather to improve query processing time.
As it stands right now, this EWAH implementation will always be more
efficient storage-wise than its uncompressed alternative.

EWAH arrays will be used as the on-disk format to store reachability
bitmaps for all objects in a repository while keeping reasonable sizes,
in the same way that JGit does.

This EWAH implementation is a mostly straightforward port of the
original `javaewah` library that JGit currently uses. The library is
self-contained and has been embedded whole (4 files) inside the `ewah`
folder to ease redistribution.

The library is re-licensed under the GPLv2 with the permission of Daniel
Lemire, the original author. The source code for the C version can
be found on GitHub:

https://github.com/vmg/libewok

The original Java implementation can also be found on GitHub:

https://github.com/lemire/javaewah
---
 Makefile   |6 +
 ewah/bitmap.c  |  229 +
 ewah/ewah_bitmap.c |  703 
 ewah/ewah_io.c |  199 +++
 ewah/ewah_rlw.c|  124 +
 ewah/ewok.h|  194 +++
 ewah/ewok_rlw.h|  114 +
 7 files changed, 1569 insertions(+)
 create mode 100644 ewah/bitmap.c
 create mode 100644 ewah/ewah_bitmap.c
 create mode 100644 ewah/ewah_io.c
 create mode 100644 ewah/ewah_rlw.c
 create mode 100644 ewah/ewok.h
 create mode 100644 ewah/ewok_rlw.h

diff --git a/Makefile b/Makefile
index e01506d..e03c773 100644
--- a/Makefile
+++ b/Makefile
@@ -672,6 +672,8 @@ LIB_H += diff.h
 LIB_H += diffcore.h
 LIB_H += dir.h
 LIB_H += exec_cmd.h
+LIB_H += ewah/ewok.h
+LIB_H += ewah/ewok_rlw.h
 LIB_H += fetch-pack.h
 LIB_H += fmt-merge-msg.h
 LIB_H += fsck.h
@@ -802,6 +804,10 @@ LIB_OBJS += dir.o
 LIB_OBJS += editor.o
 LIB_OBJS += entry.o
 LIB_OBJS += environment.o
+LIB_OBJS += ewah/bitmap.o
+LIB_OBJS += ewah/ewah_bitmap.o
+LIB_OBJS += ewah/ewah_io.o
+LIB_OBJS += ewah/ewah_rlw.o
 LIB_OBJS += exec_cmd.o
 LIB_OBJS += fetch-pack.o
 LIB_OBJS += fsck.o
diff --git a/ewah/bitmap.c b/ewah/bitmap.c
new file mode 100644
index 000..75ca8fd
--- /dev/null
+++ b/ewah/bitmap.c
@@ -0,0 +1,229 @@
+/**
+ * Copyright 2013, GitHub, Inc
+ * Copyright 2009-2013, Daniel Lemire, Cliff Moon,
+ * David McIntosh, Robert Becho, Google Inc. and Veronika Zenz
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, 
USA.
+ */
+#include assert.h
+#include stdlib.h
+#include string.h
+
+#include ewok.h
+
+#define MASK(x) ((eword_t)1  (x % BITS_IN_WORD))
+#define BLOCK(x) (x / BITS_IN_WORD)
+
+struct bitmap *bitmap_new(void)
+{
+   struct bitmap *bitmap = ewah_malloc(sizeof(struct bitmap));
+   bitmap-words = ewah_calloc(32, sizeof(eword_t));
+   bitmap-word_alloc = 32;
+   return bitmap;
+}
+
+void bitmap_set(struct bitmap *self, size_t pos)
+{
+   size_t block = BLOCK(pos);
+
+   if (block = self-word_alloc) {
+   size_t old_size = self-word_alloc;
+   self-word_alloc = block * 2;
+   self-words = ewah_realloc(self-words, self-word_alloc * 
sizeof(eword_t));
+
+   memset(self-words + old_size, 0x0,
+   (self-word_alloc - old_size) * sizeof(eword_t));
+   }
+
+   self-words[block] |= MASK(pos);
+}
+
+void bitmap_clear(struct bitmap *self, size_t pos)
+{
+   size_t block = BLOCK(pos);
+
+   if (block  self-word_alloc)
+   self-words[block] = ~MASK(pos);
+}
+
+bool bitmap_get(struct bitmap *self, size_t pos)
+{
+   size_t block = BLOCK(pos);
+   return block  self-word_alloc  (self-words[block]  MASK(pos)) != 
0;
+}
+
+extern size_t ewah_add_empty_words(struct ewah_bitmap *self, bool v, size_t 
number);
+extern size_t ewah_add(struct ewah_bitmap *self, eword_t word);
+
+struct ewah_bitmap *bitmap_to_ewah(struct bitmap *bitmap)
+{
+   struct ewah_bitmap *ewah = ewah_new();
+   size_t i, running_empty_words = 0;
+   eword_t 

[PATCH 11/16] rev-list: add bitmap mode to speed up lists

2013-06-24 Thread Vicent Marti
The bitmap reachability index used to speed up the counting objects
phase during `pack-objects` can also be used to optimize a normal
rev-list if the only thing required are the SHA1s of the objects during
the list.

Calling `git rev-list --use-bitmaps [committish]` is the equivalent
of `git rev-list --objects`, but the rev list is performed based on
a bitmap result instead of using a manual counting objects phase.

These are some example timings for `torvalds/linux`:

$ time ../git/git rev-list --objects master  /dev/null

real0m25.567s
user0m25.148s
sys 0m0.384s

$ time ../git/git rev-list --use-bitmaps master  /dev/null

real0m0.393s
user0m0.356s
sys 0m0.036s

Additionally, a `--test-bitmap` flag has been added that will perform
the same rev-list manually (i.e. using a normal revwalk) and using
bitmaps, and verify that the results are the same.
---
 builtin/rev-list.c |   28 +++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 67701be..905ed08 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -3,6 +3,8 @@
 #include diff.h
 #include revision.h
 #include list-objects.h
+#include pack.h
+#include pack-bitmap.h
 #include builtin.h
 #include log-tree.h
 #include graph.h
@@ -256,6 +258,17 @@ static int show_bisect_vars(struct rev_list_info *info, 
int reaches, int all)
return 0;
 }
 
+static int show_object_fast(
+   const unsigned char *sha1,
+   enum object_type type,
+   uint32_t hash, int exclude,
+   struct packed_git *found_pack,
+   off_t found_offset)
+{
+   fprintf(stdout, %ss\n, sha1_to_hex(sha1));
+   return 1;
+}
+
 int cmd_rev_list(int argc, const char **argv, const char *prefix)
 {
struct rev_info revs;
@@ -264,6 +277,7 @@ int cmd_rev_list(int argc, const char **argv, const char 
*prefix)
int bisect_list = 0;
int bisect_show_vars = 0;
int bisect_find_all = 0;
+   int use_bitmaps = 0;
 
git_config(git_default_config, NULL);
init_revisions(revs, prefix);
@@ -305,8 +319,15 @@ int cmd_rev_list(int argc, const char **argv, const char 
*prefix)
bisect_show_vars = 1;
continue;
}
+   if (!strcmp(arg, --use-bitmaps)) {
+   use_bitmaps = 1;
+   continue;
+   }
+   if (!strcmp(arg, --test-bitmap)) {
+   test_bitmap_walk(revs);
+   return 0;
+   }
usage(rev_list_usage);
-
}
if (revs.commit_format != CMIT_FMT_UNSPECIFIED) {
/* The command line has a --pretty  */
@@ -332,6 +353,11 @@ int cmd_rev_list(int argc, const char **argv, const char 
*prefix)
if (bisect_list)
revs.limited = 1;
 
+   if (use_bitmaps  !prepare_bitmap_walk(revs, NULL)) {
+   traverse_bitmap_commit_list(show_object_fast);
+   return 0;
+   }
+
if (prepare_revision_walk(revs))
die(revision walk setup failed);
if (revs.tree_objects)
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 04/16] pack-objects: make `pack_name_hash` global

2013-06-24 Thread Vicent Marti
The hash function used by `builtin/pack-objects.c` to efficiently find
delta bases when packing can be of interest for other parts of Git that
also have to deal with delta bases.
---
 builtin/pack-objects.c |   24 ++--
 cache.h|2 ++
 sha1_file.c|   20 
 3 files changed, 24 insertions(+), 22 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index fc12df8..b7cab18 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -854,26 +854,6 @@ static struct object_entry *locate_object_entry(const 
unsigned char *sha1)
return NULL;
 }
 
-static unsigned name_hash(const char *name)
-{
-   unsigned c, hash = 0;
-
-   if (!name)
-   return 0;
-
-   /*
-* This effectively just creates a sortable number from the
-* last sixteen non-whitespace characters. Last characters
-* count most, so things that end in .c sort together.
-*/
-   while ((c = *name++) != 0) {
-   if (isspace(c))
-   continue;
-   hash = (hash  2) + (c  24);
-   }
-   return hash;
-}
-
 static void setup_delta_attr_check(struct git_attr_check *check)
 {
static struct git_attr *attr_delta;
@@ -977,7 +957,7 @@ static int add_object_entry_1(const unsigned char *sha1, 
enum object_type type,
 static int add_object_entry(const unsigned char *sha1, enum object_type type,
const char *name, int exclude)
 {
-   if (add_object_entry_1(sha1, type, name_hash(name), exclude, NULL, 0)) {
+   if (add_object_entry_1(sha1, type, pack_name_hash(name), exclude, NULL, 
0)) {
struct object_entry *entry = objects[nr_objects - 1];
 
if (name  no_try_delta(name))
@@ -1186,7 +1166,7 @@ static void add_preferred_base_object(const char *name)
 {
struct pbase_tree *it;
int cmplen;
-   unsigned hash = name_hash(name);
+   unsigned hash = pack_name_hash(name);
 
if (!num_preferred_base || check_pbase_path(hash))
return;
diff --git a/cache.h b/cache.h
index a29645e..95ef14d 100644
--- a/cache.h
+++ b/cache.h
@@ -653,6 +653,8 @@ extern char *sha1_pack_index_name(const unsigned char 
*sha1);
 extern const char *find_unique_abbrev(const unsigned char *sha1, int);
 extern const unsigned char null_sha1[20];
 
+extern uint32_t pack_name_hash(const char *name);
+
 static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
 {
int i;
diff --git a/sha1_file.c b/sha1_file.c
index 371e295..44c7bca 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -60,6 +60,26 @@ static struct cached_object empty_tree = {
0
 };
 
+uint32_t pack_name_hash(const char *name)
+{
+   unsigned c, hash = 0;
+
+   if (!name)
+   return 0;
+
+   /*
+* This effectively just creates a sortable number from the
+* last sixteen non-whitespace characters. Last characters
+* count most, so things that end in .c sort together.
+*/
+   while ((c = *name++) != 0) {
+   if (isspace(c))
+   continue;
+   hash = (hash  2) + (c  24);
+   }
+   return hash;
+}
+
 static struct packed_git *last_found_pack;
 
 static struct cached_object *find_cached_object(const unsigned char *sha1)
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 13/16] repack: consider bitmaps when performing repacks

2013-06-24 Thread Vicent Marti
Since `pack-objects` will write a `.bitmap` file next to the `.pack` and
`.idx` files, this commit teaches `git-repack` to consider the new
bitmap indexes (if they exist) when performing repack operations.

This implies moving old bitmap indexes out of the way if we are
repacking a repository that already has them, and moving the newly
generated bitmap indexes into the `objects/pack` directory, next to
their corresponding packfiles.

Since `git repack` is now capable of handling these `.bitmap` files,
a normal `git gc` run on a repository that has `pack.usebitmaps` set
to true in its config file will generate bitmap indexes as part of the
garbage collection process.
---
 git-repack.sh |   10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/git-repack.sh b/git-repack.sh
index 7579331..d5355ae 100755
--- a/git-repack.sh
+++ b/git-repack.sh
@@ -108,7 +108,7 @@ rollback=
 failed=
 for name in $names
 do
-   for sfx in pack idx
+   for sfx in pack idx bitmap
do
file=pack-$name.$sfx
test -f $PACKDIR/$file || continue
@@ -156,6 +156,11 @@ do
fullbases=$fullbases pack-$name
chmod a-w $PACKTMP-$name.pack
chmod a-w $PACKTMP-$name.idx
+
+   test -f $PACKTMP-$name.bitmap 
+   chmod a-w $PACKTMP-$name.bitmap 
+   mv -f $PACKTMP-$name.bitmap $PACKDIR/pack-$name.bitmap
+
mv -f $PACKTMP-$name.pack $PACKDIR/pack-$name.pack 
mv -f $PACKTMP-$name.idx  $PACKDIR/pack-$name.idx ||
exit
@@ -166,6 +171,7 @@ for name in $names
 do
rm -f $PACKDIR/old-pack-$name.idx
rm -f $PACKDIR/old-pack-$name.pack
+   rm -f $PACKDIR/old-pack-$name.bitmap
 done
 
 # End of pack replacement.
@@ -180,7 +186,7 @@ then
  do
case  $fullbases  in
* $e *) ;;
-   *)  rm -f $e.pack $e.idx $e.keep ;;
+   *)  rm -f $e.pack $e.idx $e.keep $e.bitmap 
;;
esac
  done
)
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 06/16] sha1_file: export `git_open_noatime`

2013-06-24 Thread Vicent Marti
The `git_open_noatime` helper can be of general interest for other
consumers of git's different on-disk formats.
---
 cache.h |1 +
 sha1_file.c |4 +---
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/cache.h b/cache.h
index 95ef14d..bbe5e2a 100644
--- a/cache.h
+++ b/cache.h
@@ -769,6 +769,7 @@ extern int hash_sha1_file(const void *buf, unsigned long 
len, const char *type,
 extern int write_sha1_file(const void *buf, unsigned long len, const char 
*type, unsigned char *return_sha1);
 extern int pretend_sha1_file(void *, unsigned long, enum object_type, unsigned 
char *);
 extern int force_object_loose(const unsigned char *sha1, time_t mtime);
+extern int git_open_noatime(const char *name);
 extern void *map_sha1_file(const unsigned char *sha1, unsigned long *size);
 extern int unpack_sha1_header(git_zstream *stream, unsigned char *map, 
unsigned long mapsize, void *buffer, unsigned long bufsiz);
 extern int parse_sha1_header(const char *hdr, unsigned long *sizep);
diff --git a/sha1_file.c b/sha1_file.c
index 44c7bca..018a847 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -259,8 +259,6 @@ char *sha1_pack_index_name(const unsigned char *sha1)
 struct alternate_object_database *alt_odb_list;
 static struct alternate_object_database **alt_odb_tail;
 
-static int git_open_noatime(const char *name);
-
 /*
  * Prepare alternate object database registry.
  *
@@ -1307,7 +1305,7 @@ int check_sha1_signature(const unsigned char *sha1, void 
*map,
return hashcmp(sha1, real_sha1) ? -1 : 0;
 }
 
-static int git_open_noatime(const char *name)
+int git_open_noatime(const char *name)
 {
static int sha1_file_open_flag = O_NOATIME;
 
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 02/16] sha1_file: refactor into `find_pack_object_pos`

2013-06-24 Thread Vicent Marti
Looking up the offset in the packfile for a given SHA1 involves the
following:

- Finding the position in the index for the given SHA1
- Accessing the offset cache in the index for the found position

There are cases however where we'd like to find the position of a SHA1
in the index without looking up the packfile offset (e.g. when accessing
information that has been indexed based on index offsets).

This refactoring implements `find_pack_object_pos`, returning the
position in the index, and re-implements `find_pack_entry_one`(returning
the actual offset in the packfile) to use the new function.
---
 cache.h |1 +
 sha1_file.c |   27 +--
 2 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/cache.h b/cache.h
index ec8240f..a29645e 100644
--- a/cache.h
+++ b/cache.h
@@ -1101,6 +1101,7 @@ extern void clear_delta_base_cache(void);
 extern struct packed_git *add_packed_git(const char *, int, int);
 extern const unsigned char *nth_packed_object_sha1(struct packed_git *, 
uint32_t);
 extern off_t nth_packed_object_offset(const struct packed_git *, uint32_t);
+extern int find_pack_entry_pos(const unsigned char *sha1, struct packed_git 
*p);
 extern off_t find_pack_entry_one(const unsigned char *, struct packed_git *);
 extern int is_pack_valid(struct packed_git *);
 extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, 
unsigned long *);
diff --git a/sha1_file.c b/sha1_file.c
index 0af19c0..371e295 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2205,8 +2205,7 @@ off_t nth_packed_object_offset(const struct packed_git 
*p, uint32_t n)
}
 }
 
-off_t find_pack_entry_one(const unsigned char *sha1,
- struct packed_git *p)
+int find_pack_entry_pos(const unsigned char *sha1, struct packed_git *p)
 {
const uint32_t *level1_ofs = p-index_data;
const unsigned char *index = p-index_data;
@@ -2219,7 +2218,7 @@ off_t find_pack_entry_one(const unsigned char *sha1,
 
if (!index) {
if (open_pack_index(p))
-   return 0;
+   return -1;
level1_ofs = p-index_data;
index = p-index_data;
}
@@ -2243,12 +2242,9 @@ off_t find_pack_entry_one(const unsigned char *sha1,
 
if (use_lookup  0)
use_lookup = !!getenv(GIT_USE_LOOKUP);
+
if (use_lookup) {
-   int pos = sha1_entry_pos(index, stride, 0,
-lo, hi, p-num_objects, sha1);
-   if (pos  0)
-   return 0;
-   return nth_packed_object_offset(p, pos);
+   return sha1_entry_pos(index, stride, 0, lo, hi, p-num_objects, 
sha1);
}
 
do {
@@ -2259,13 +2255,24 @@ off_t find_pack_entry_one(const unsigned char *sha1,
printf(lo %u hi %u rg %u mi %u\n,
   lo, hi, hi - lo, mi);
if (!cmp)
-   return nth_packed_object_offset(p, mi);
+   return mi;
if (cmp  0)
hi = mi;
else
lo = mi+1;
} while (lo  hi);
-   return 0;
+
+   return -1;
+}
+
+off_t find_pack_entry_one(const unsigned char *sha1, struct packed_git *p)
+{
+   int pos;
+
+   if ((pos = find_pack_entry_pos(sha1, p))  0)
+   return 0;
+
+   return nth_packed_object_offset(p, (uint32_t)pos);
 }
 
 int is_pack_valid(struct packed_git *p)
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 07/16] compat: add endinanness helpers

2013-06-24 Thread Vicent Marti
The POSIX standard doesn't currently define a `nothll`/`htonll`
function pair to perform network-to-host and host-to-network
swaps of 64-bit data. These 64-bit swaps are necessary for the on-disk
storage of EWAH bitmaps if they are not in native byte order.
---
 git-compat-util.h |   28 
 1 file changed, 28 insertions(+)

diff --git a/git-compat-util.h b/git-compat-util.h
index ff193f4..bc9b591 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -710,4 +710,32 @@ void warn_on_inaccessible(const char *path);
 /* Get the passwd entry for the UID of the current process. */
 struct passwd *xgetpwuid_self(void);
 
+#include endian.h
+
+#ifndef __BYTE_ORDER
+# if defined(BYTE_ORDER)  defined(LITTLE_ENDIAN)  defined(BIG_ENDIAN)
+#  define __BYTE_ORDER BYTE_ORDER
+#  define __LITTLE_ENDIAN LITTLE_ENDIAN
+#  define __BIG_ENDIAN BIG_ENDIAN
+# else
+#  error Cannot determine endianness
+# endif
+#endif
+
+#if __BYTE_ORDER == __BIG_ENDIAN
+# define ntohll(n) (n)
+# define htonll(n) (n)
+#elif __BYTE_ORDER == __LITTLE_ENDIAN
+# if defined(__GNUC__)  defined(__GLIBC__)
+#  include byteswap.h
+#  define ntohll(n) bswap_64(n)
+#  define htonll(n) bswap_64(n)
+# else /* GNUC  GLIBC */
+#  define ntohll(n) ( (((unsigned long long)ntohl(n))  32) + ntohl(n  32) )
+#  define htonll(n) ( (((unsigned long long)htonl(n))  32) + htonl(n  32) )
+# endif /* GNUC  GLIBC */
+#else /* __BYTE_ORDER */
+# error Can't define htonll or ntohll!
+#endif
+
 #endif
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: libgit2 status

2012-08-25 Thread Vicent Marti
On Sat, Aug 25, 2012 at 2:56 AM, Andreas Ericsson a...@op5.se wrote:
 Politically, I'm not sure how keen the git community is on handing
 over control to the core stuff of git to a commercial entity,

The development of libgit2 happens 100% in the open. I don't know what
commercial entity are you talking about, but there are several
companies and independent contributors working on the Library at the
moment.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html