Re: [PATCH] pack-bitmap: do not use gcc packed attribute

2014-08-05 Thread Vicent Martí
On Mon, Aug 4, 2014 at 9:19 PM, Karsten Blees karsten.bl...@gmail.com wrote:
 This raises the question why we read via mmap at all

The first version of the pack bitmap format I wrote for GitHub was 50%
faster to load than this one because it was designed to be mmapable.
Eventually we moved to the JGit-compatible bitmap format (because I
get paid a lot of money to do as I'm told -- not because of some
inherent benefit of the JGit format), which needs to be read
sequentially, but I never bothered to change the mmap reading code.

I believe your patch makes a lot of sense -- at this point we could as
well remove the mmaping altogether and read the file sequentially.

Cheers,
vmg
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Git in GSoC 2014

2014-02-26 Thread Vicent Martí
On Wed, Feb 26, 2014 at 11:41 AM, Michael Haggerty mhag...@alum.mit.edu wrote:
 Since time is short, I already started on this.  I wrote a first draft
 of an introduction for the students.  I also started looking for
 microprojects.  I started going through our source files alphabetically,
 and have already found six suggestions by bundle.c, so I don't think
 there will be a problem finding enough tiny things to do.

Note that for projects that are either libgit2-centric or mix Core Git
and libgit2, it would be worth it to the students to submit a pull
request on libgit2 (or ideally, on both projects, although that's
going to be hard) in order to make themselves familiar with the code
base.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Git in GSoC 2014

2014-02-26 Thread Vicent Martí
On Wed, Feb 26, 2014 at 12:16 PM, Duy Nguyen pclo...@gmail.com wrote:
 On Tue, Feb 25, 2014 at 10:41 PM, Jeff King p...@peff.net wrote:
 I'm pleased to announce that Git has been accepted to this year's Google
 Summer of Code.

 Student proposals will start coming in on March 22. In the meantime
 students will be reading our Ideas page[1] and enquiring about the
 program on the mailing list and on irc. There are many ways that
 existing git developers can help:

   - continue to add to and polish the Ideas page

 One thing I noticed after tg/index-v4-format is both libgit2 and jgit
 do not seem to support index v4. So we could add index v4 support on
 libgit2 to the idea page. It's a relatively small task though once
 you get a hang on index format.

That sounds like a nice task for the Summer of Code too, specially
with the current effort to make Index v4 more visible in Core Git.

I wonder if anybody from JGit would also be interested on mentoring
for the equivalent task (index v4 on JGit). I've CC'ed Shawn Pearce.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Git GSoC 2014

2014-02-14 Thread Vicent Martí
On Thu, Feb 13, 2014 at 10:45 PM, Thomas Rast t...@thomasrast.ch wrote:
 Replace object loading/writing layer by libgit2

 Git reads objects from storage (loose and packed) through functions in
 sha1_file.c.  Most commands only require very simple, opaque read and
 write access to the object storage.  As a weatherballoon, show that it
 is feasible to use libgit2 git_odb_* routines for these simple callers.

 Aim for passing the git test suite using git_odb_* object storage
 access, except for tests that verify behavior in the face of storage
 corruption, replacement objects, alternate storage locations, and
 similar quirks.  Of course it is even better if you pass the test suite
 without exception.

 Language: C
 Difficulty: hard
 Possible mentors: Thomas Rast and fill in libgit2 expert

Note that we have several people in the libgit2 team that are willing
(and excited) to mentor or co-mentor this project or any of the other
libgit2 related projects that have been proposed.

Prospective list is

- Vicent Marti
- Russell Belfer
- Ed Thomson
- Carlos Martin (past GSoC student)

So there shouldn't be any mentor shortage.

Cheers,
vmg
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] ewah: support platforms that require aligned reads

2014-01-23 Thread Vicent Martí
On Fri, Jan 24, 2014 at 12:44 AM, Jonathan Nieder jrnie...@gmail.com wrote:
 --- a/ewah/ewah_io.c
 +++ b/ewah/ewah_io.c
 @@ -112,23 +112,38 @@ int ewah_serialize(struct ewah_bitmap *self, int fd)
 [...]
 +#if __BYTE_ORDER != __BIG_ENDIAN

 Is this portable?

We explicitly set the __BYTE_ORDER macros in `compat/bswap.h`. In
fact, this preprocessor conditional is the same one that we use when
choosing what version of the `ntohl` macro to define, so that's why I
decided to use it here.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] pack-objects: turn off bitmaps when skipping objects

2014-01-23 Thread Vicent Martí
On Fri, Jan 24, 2014 at 12:45 AM, Siddharth Agarwal s...@fb.com wrote:
 Yes, we'd prefer to do that too. How do you actually do this, though? I
 don't see a way to pass `--honor-pack-keep` (shouldn't I pass in its
 inverse?) down to `git-pack-objects`.

We run with this patch in production, it may be of use to you:
https://gist.github.com/vmg/8589317

In fact, it may be worth upstreaming too. I'll kindly ask peff to do
it when he has a moment.

Apologies for not attaching the patch inline, the GMail web UI doesn't
mix well with patch workflow.

Cheers,
vmg
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] Git v1.9-rc0

2014-01-22 Thread Vicent Martí
On Wed, Jan 22, 2014 at 5:11 PM, Junio C Hamano gits...@pobox.com wrote:
 Stefan Näwe stefan.na...@atlas-elektronik.com writes:

 Am 22.01.2014 13:53, schrieb Javier Domingo Cansino:
 Will there be any change on how tarballs are distributed, taking into
 account that Google will be shutting down Google Code Downloads
 section[1][2]?


 Am I missing something or what's wrong with this:

   https://github.com/gitster/git/archive/v1.9-rc0.tar.gz

 or any

   https://github.com/gitster/git/archive/$TAG.tar.gz

 ??

 Do these consume CPU every time somebody asks for a tarball?  That
 might be considered wrong depending on the view.

No, our infrastructure caches frequently requested tarballs so they
don't have to be regenerated on the fly. If you would prefer to
distribute a different version of the tarball for the release (e.g.
one with a different filename or folder structure), you can upload it
directly to the release page of the tag:

https://github.com/gitster/git/releases/tag/v1.9-rc0

We'll automatically mirror your release to S3 and serve it from there.
You can also go ahead and edit the release page with the changelog
you've posted in this email thread to make it more user friendly.

WE WILL SERVE YOUR RELEASES, JUNIO

BECAUSE WE LOVE YOU

-vmg
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What's cooking in git.git (Nov 2013, #05; Thu, 21)

2013-11-22 Thread Vicent Martí
On Fri, Nov 22, 2013 at 8:36 PM, Junio C Hamano gits...@pobox.com wrote:
 Do we want to try this in 'next' post-1.8.5, or should I try to prod an
 area expert like Shawn into doing another round of review?

 Yes ;-).

 I recall starting to read the series over and then got sidetracked
 in the middle and never finishing.  I'll try to make time sometime
 this weekend (we are still buried in boxes after the move, though,
 so no promises) myself.

 How close is this what you guys are running in production these
 days, by the way?

We are running a slightly older version of the patchset, because we're
still on 1.8.4 and the current reroll doesn't apply cleanly there.

If this could make it to `next` some time next week, that would work
out great for us, because we may start considering using `next` as a
partial or full deployment on our production machines

This also means that we could exercise the patchset and everything
else that is queued up in next release... You must remember all the
corner cases and bugs peff brings to the list every time we deploy a
new Git to production. Wouldn't it be nice to have a thorough checking
of the current iteration *before* the release, and not after? :)

*hint hint*
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/5] fix up 'jk/pack-bitmap' branch

2013-11-07 Thread Vicent Martí
Thank you Ramsay, all the patches look OK to me.

On Thu, Nov 7, 2013 at 11:19 PM, Jeff King p...@peff.net wrote:
 On Thu, Nov 07, 2013 at 09:58:02PM +, Ramsay Jones wrote:

 These patches fix various errors/warnings on the cygwin, MinGW and
 msvc builds, provoked by the jk/pack-bitmap branch.

 Thanks. Your timing is impeccable, as I was just sitting down to
 finalize the next re-roll. I'll add these in.

 -Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What's cooking in git.git (Oct 2013, #07; Mon, 28)

2013-10-30 Thread Vicent Martí
On Wed, Oct 30, 2013 at 5:51 PM, Torsten Bögershausen tbo...@web.de wrote:
 There is a name clash under cygwin 1.7 (1.5 is OK)
 The following first aid hot fix works for me:
 /Torsten

If Cygwin declares its own bswap_64, wouldn't it be better to use it
instead of overwriting it with our own?
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 07/19] compat: add endianness helpers

2013-10-30 Thread Vicent Martí
On Oct 30, 2013, at 9:25 AM, Jeff King p...@peff.net wrote:

 On Sat, Oct 26, 2013 at 09:55:36AM +0200, Thomas Rast wrote:
 
 The POSIX standard doesn't currently define a `nothll`/`htonll`
 
 typo: ntohll
 
 Thanks.
 
 function pair to perform network-to-host and host-to-network
 swaps of 64-bit data. These 64-bit swaps are necessary for the on-disk
 storage of EWAH bitmaps if they are not in native byte order.
 [...]
 +#  include byteswap.h
 
 Do we need a hack on top similar to what ntoh_l and hton_l do, for
 platforms that do not support unaligned access?
 
 Ugh, probably. I didn't even know about those. But we do use them when
 reading the ewah bitmaps, which I believe can be at random offsets. We
 should be able to use the same ntoh_l solution.
 
 -Peff

You are right, when reading mmaps we cannot ensure the alignment of the 8 byte 
sections. For writing the bitmaps to memory, however, we can make that 
assumption because the writes will happen to either malloc'ed segments, or the 
stack.

A proposed fix follows:

From 0eed934805543390195f8aee231f8a1da33dad0b Mon Sep 17 00:00:00 2001
From: Vicent Marti tan...@gmail.com
Date: Wed, 30 Oct 2013 15:27:20 +0100
Subject: [PATCH] ewah: support non-aligned reads

When reading straight from a mmaped file, some platforms do not support
atomically loading 8 bytes from random positions on memory to perform a
byte swap. To work around this, we `memcpy` the whole area that needs
to be byteswapped to its final destination, and then perform the
byteswap on each 8-byte chunk: the destionation chunks *will* be aligned
because the destination area is a malloc'ed chunk of memory.

Signed-off-by: Vicent Marti tan...@gmail.com
---
 ewah/ewah_io.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/ewah/ewah_io.c b/ewah/ewah_io.c
index fca93f9..41f6d3d 100644
--- a/ewah/ewah_io.c
+++ b/ewah/ewah_io.c
@@ -76,6 +76,10 @@ int ewah_serialize_to(struct ewah_bitmap *self,
buffer = self-buffer;
words_left = self-buffer_size;
 
+   /* buffer is a malloc'ed buffer, which we can assume to
+* be at the very least 8-byte aligned */
+   assert((intptr_t)buffer % 8 == 0);
+
while (words_left = words_per_dump) {
for (i = 0; i  words_per_dump; ++i, ++buffer)
dump[i] = htonll(*buffer);
@@ -117,7 +121,6 @@ int ewah_serialize(struct ewah_bitmap *self, int fd)
 int ewah_read_mmap(struct ewah_bitmap *self, void *map, size_t len)
 {
uint32_t *read32 = map;
-   eword_t *read64;
size_t i;
 
self-bit_size = ntohl(*read32++);
@@ -128,10 +131,19 @@ int ewah_read_mmap(struct ewah_bitmap *self, void *map, 
size_t len)
if (!self-buffer)
return -1;
 
-   for (i = 0, read64 = (void *)read32; i  self-buffer_size; ++i)
-   self-buffer[i] = ntohll(*read64++);
+#ifdef NEEDS_ALIGNED_ACCESS
+   memcpy(self-buffer, read32, self-buffer_size * sizeof(eword_t));
+   for (i = 0; i  self-buffer_size; ++i)
+   self-buffer[i] = ntohll(self-buffer[i]);
+#else
+   {
+   eword_t *read64 = (void *)read32;
+   for (i = 0; i  self-buffer_size; ++i)
+   self-buffer[i] = ntohll(read64[i]);
+   }
+#endif
 
-   read32 = (void *)read64;
+   read32 += self-buffer_size * sizeof(eword_t) / sizeof(uint32_t);
self-rlw = self-buffer + ntohl(*read32++);
 
return (3 * 4) + (self-buffer_size * 8);
@@ -175,7 +187,7 @@ int ewah_deserialize(struct ewah_bitmap *self, int fd)
return -1;
 
for (i = 0; i  words_per_dump; ++i, ++buffer)
-   *buffer = ntohll(dump[i]);
+   *buffer = ntohll(dump[i]); /* aligned on stack */
 
words_left -= words_per_dump;
}
@@ -185,7 +197,7 @@ int ewah_deserialize(struct ewah_bitmap *self, int fd)
return -1;
 
for (i = 0; i  words_left; ++i, ++buffer)
-   *buffer = ntohll(dump[i]);
+   *buffer = ntohll(dump[i]); /* aligned on stack */
}
 
/** 32 bit -- position for the RLW */
-- 
1.8.3.2


--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What's cooking in git.git (Oct 2013, #06; Fri, 25)

2013-10-28 Thread Vicent Martí
On Mon, Oct 28, 2013 at 4:48 PM, Junio C Hamano gits...@pobox.com wrote:
 jk/pack-bitmap adds khash.h, which from a first glance looks like yet
 another hash table implementation. I was just wondering if kb's new
 hash tables can cover the need of pack-bitmap.c too so we can remove
 khash.h later..

 Good thinking ;-).

We use the khash tables to map:

- sha1 (const char *) to (void *)
- sha1 (const char *) to int

The new `hashmap.c` covers the first case quite well (albeit slightly
more verbosely than I'd like), but in the second case it doesn't quite
work. Since the new hash needs to embed the struct hashmap_entry on
all its values (to allow for separate chaining), having it map to
`int` keys requires a struct like this:

struct sha1_position {
struct hashmap_entry {
struct hashmap_entry *next;
unsigned int hash;
};
int position;
}

khash on the other hand is capable of storing the position values as
part of the hash table itself (i.e. `int **buckets`), and saves us
from thousands of bytes of allocations + indirection.

I am not sure whether the consistency of having a single hash map
warrants the performance and memory hits when operating on the
extended index.

Please advice.

luv,
vmg
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What's cooking in git.git (Oct 2013, #06; Fri, 25)

2013-10-28 Thread Vicent Martí
On Mon, Oct 28, 2013 at 8:45 PM, Karsten Blees karsten.bl...@gmail.com wrote:
 The new `hashmap.c` covers the first case quite well (albeit slightly
 more verbosely than I'd like), but in the second case it doesn't quite
 work. Since the new hash needs to embed the struct hashmap_entry on
 all its values (to allow for separate chaining), having it map to
 `int` keys requires a struct like this:

 struct sha1_position {
 struct hashmap_entry {
 struct hashmap_entry *next;
 unsigned int hash;
 };
 int position;
 }


 Hmm...isn't that position rather an index into two separately maintained 
 arrays? So you'd rather have:

 struct sha1_position {
 struct hashmap_entry {
 struct hashmap_entry *next;
 unsigned int hash;
 };
 uint32_t pack_name_hash;
 struct object *object;
  }

No, this is not quite right. We use the separate arrays because the
normal operation mode is to index by position (e.g. we need the nth
object in the extended index); the hash table is an auxiliary
structure to reverse that indexing (e.g. what position does this SHA1
have on the extended index). The information which is always required
to construct bitmaps is position, so we need to store the indexes in a
map.

 khash on the other hand is capable of storing the position values as
 part of the hash table itself (i.e. `int **buckets`), and saves us
 from thousands of bytes of allocations + indirection.


 However, khash keeps separate arrays for flags, keys and values, all of them 
 overallocated by 1 / load factor (so there's lots of unused space). 
 ext_index.objects and ext_index.hashes are also overallocated by the usual 
 alloc_nr() factor 1.5.

FWIW The flags for khash are compacted, so they are stored much more
tightly than pointers, even when overallocated.


 Regarding memory consumption, I think both implementations will be pretty 
 similar. Hashmap allocates many small regions while khash re-allocates a few 
 big ones...I really don't know which is better, ideally entries would be 
 allocated in chunks to minimize both heap overhead and memcpy disadvantes.

I agree, both implementations probably have very similar memory
characteristics, probably enough not to matter.

 Regarding performance, khash uses open addressing, which requires more key 
 comparisons (O(1/(1-load_factor))) than chaining (O(1+load_factor)). However, 
 any measurable differences will most likely be dwarfed by IO costs in this 
 particular use case.

I don't think this is true. If you actually run a couple insertion and
lookup benchmarks comparing the two implementations, you'll find khash
to be around ~30% faster for most workloads (venturing a guess from
past experience). I am obviously not the author of khash, but I've
found that the theoretical increase in key comparisons is completely
dwarfed by the benefit of increased cache locality during the probing
fase. I still haven't found a faster C hash table implementation than
khash for the general case, that's why I normally use it despite the
worrisome preprocessor crash-party going on in that header file.

 Btw., pack-objects.c::rehash_objects() in patch 03 unnecessarily checks for 
 duplicates. That's probably the reason for the high hashcmp times you found 
 in the first round of the patch series.

Patch 03 is a refactoring -- the duplicate checking code has been in
pack-objects.c for years. I am not sure duplicate checking is
superfluous at all, there are many cases where you could be
double-inserting objects in a pack-objects run, and you really don't
want to generate packfiles with dupe objects.

Thanks for the feedback!

luv,
vmg
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/16] documentation: add documentation for the bitmap format

2013-06-26 Thread Vicent Martí
That was a very rude reply. :(

Please refrain from interacting with me in the ML in the future. I'l
do accordingly.

Thanks!
vmg

On Thu, Jun 27, 2013 at 3:11 AM, Shawn Pearce spea...@spearce.org wrote:
 On Tue, Jun 25, 2013 at 4:08 PM, Vicent Martí tan...@gmail.com wrote:
 On Tue, Jun 25, 2013 at 11:17 PM, Junio C Hamano gits...@pobox.com wrote:
 What case are you talking about?

 The n-th object must be one of these four types and can never be of
 more than one type at the same time, so a natural expectation from
 the reader is If you OR them together, you will get the same set.
 If you say If you XOR them, that forces the reader to wonder when
 these bitmaps ever can overlap at the same bit position.

 I guess this is just wording. I don't particularly care about the
 distinction, but I'll change it to OR.

 Hmm, OK. If you think XOR and OR are the same operation, I also have a
 bridge to sell you. Its in Brooklyn. Its a great value.

 The correct operation is OR. Not XOR. OR. Drop the X.

 It cannot be mmapped not particularly because of endianness issues,
 but because the original format is not indexed and requires a full
 parse of the whole index before it can be accessed programatically.
 The wrong endianness just increases the parse time.

 Wrong endianness has nothing to do with the parse time. Modern CPUs
 can flip a word around very quickly. In JGit we chose to parse the
 file at load time because its simpler than having an additional index
 segment, and we do what you did which is to toss the object SHA-1s
 into a hashtable for fast lookup. By the time we look for the SHA-1s
 and toss them into a hashtable we can stride through the file and find
 the bitmap regions. Simple.

 In other words, the least complex solution possible that still
 provides good performance. I'd say we have pretty good performance.

 and I'm going to try to make it run fast enough in that
 encoding.

 Hmph.  Is it an option to start from what JGit does, so that people
 can use both JGit and your code on the same repository?

 I'm afraid I agree here with Junio. The JGit format is already
 shipping in JGit 3.0, Gerrit Code Review 2.6, and in heavy production
 use for almost a year on android.googlesource.com, and Google's own
 internal Git trees.

 I would prefer to see a series adding bitmap support to C Git start
 with the existing format, make it run, taking advantage of the
 optimizations JGit uses (many of which you ignored and tried to fix
 in other ways), and then look at improving the file format itself if
 load time is still the largest low hanging fruit in upload-pack. I'm
 guessing its not. You futzed around with the object table, but JGit
 sped itself up considerably by simply not using the object table when
 the bitmap is used. I think there are several such optimizations you
 missed in your rush to redefine the file format.

  And then if
 you do not succeed, after trying to optimize in-core processing
 using that on-disk format to make it fast enough, start thinking
 about tweaking the on-disk format?

 I'm afraid this is not an option. I have an old patchset that
 implements JGit v1 bitmap loading (and in fact that's how I initially
 developed these series -- by loading the bitmaps from JGit for
 debugging), but I discarded it because it simply doesn't pan out in
 production. ~3 seconds time to spawn `upload-pack` is not an option
 for us. I did not develop a tweaked on-disk format out of boredom.

 I think your code or experiments are bogus. Even on our systems with
 JGit a cold start for the Linux kernel doesn't take 3s. And this is
 JGit where Java is slow because Jesus it has a lot of factories, and
 without mmap'ing the file into the server's address space. Hell the
 file has to come over the network from a remote disk array.

 I could dig up the patch if you're particularly interested in
 backwards compatibility, but since it was several times slower than
 the current iteration, I have no interest (time, actually) to maintain
 it, brush it up, and so on. I have already offered myself to port the
 v2 format to JGit as soon as it's settled. It sounds like a better
 investment of all our times.

 Actually, I think the format you propose here is inferior to the JGit
 format. In particular the idx-ordering means the EWAH code is useless.
 You might as well not use the EWAH format and just store 2.6M bits per
 commit. The idx-ordering also makes *much* harder to emit a pack file
 a reasonable order for the client. Colby and I tried idx-ordering and
 discarded it when it didn't perform as well as the pack-ordering that
 JGit uses.

 Following up on Shawn's comments, I removed the little-endian support
 from the on-disk format and implemented lazy loading of the bitmaps to
 make up for it. The result is decent (slowed down from 250ms to 300ms)
 and it lets us keep the whole format as NWO on disk. I think it's a
 good tradeback.

 The maintenance burden of two endian formats in a single file is too
 high

Re: [PATCH 07/16] compat: add endinanness helpers

2013-06-25 Thread Vicent Martí
On Tue, Jun 25, 2013 at 3:08 PM, Peter Krefting pe...@softwolves.pp.se wrote:
 endian(3) claims that glibc 2.9+ define be64toh() and htobe64() which should
 do what you are looking for. The manual page does mention them being named
 differently across OSes, though, so you may need to be careful with that.

I'm aware of that, but Git needs to build with glibc 2.7+ (or was it
2.6?), hence the need for this compat layer.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/16] documentation: add documentation for the bitmap format

2013-06-25 Thread Vicent Martí
On Tue, Jun 25, 2013 at 7:42 AM, Shawn Pearce spea...@spearce.org wrote:
 I very much hate seeing a file format that is supposed to be portable
 that supports both big-endian and little-endian encoding.

Well, the bitmap index is not supposed to be portable, as it doesn't
get sent over the wire in any situation. Regardless, the format is
portable because it supports both encodings and clearly defines which
one the current file is using. I think that's a good tradeoff!

 Such a specification forces everyone to implement two code paths to handle
 reading data from the file, on the off-chance they are on the wrong
 platform.

Extra code paths have never been an issue in the
JGitAbstractFactoryGeneratorOfGit, har har har. Ah. I'm such a funny
guy when it comes to Java.

Anyway, I designed this keeping JGit in mind. In this specific case,
it doesn't force you to add any new code paths. The endianness changes
only affect the serialization format of the bitmaps, which is not part
of Git or JGit itself but of the Javaewah/libewok library. The
interface for reading on that library has already been wisely
abstracted on JGit
(https://github.com/eclipse/jgit/blob/master/org.eclipse.jgit/src/org/eclipse/jgit/internal/storage/file/PackBitmapIndexV1.java#L133),
so changing the byte order simply means changing the SimpleDataInput
to a LE one.

I agree this is not ideal, or elegant, but I'm having a hard time
making an argument for sacrificing objective speed for the sake of
subjective simplicity.

 What is wrong with picking one encoding and sticking to it?

It prevents us from making this optimally fast on the machines where
it needs to be.

Regardless, I must admit I haven't generated numbers for this in a
while (the BE-LE switch is one of the first optimizations I did). I'm
going to try to re-implement full NWO loading and see how much slower
it is, before I continue arguing for/against it.

If I can get it within a reasonable margin (say, 15%) of the current
implementation, I'd definitely be in favor of sticking to only NWO on
the whole file. If it's slower than that, well, Git has never
compromised on speed, and I don't think there's a point to be made for
starting to do that now.

 +   - BITMAP_OPT_HASH_CACHE (0x8)
 +   If present, a hash cache for finding delta bases 
 will be available
 +   right after the header block in this index. See the 
 following
 +   section for details.
 +
 +   4-byte entry count (network byte order)
 +
 +   The total count of entries (bitmapped commits) in 
 this bitmap index.
 +
 +   20-byte checksum
 +
 +   The SHA1 checksum of the pack this bitmap index 
 belongs to.
 +
 +   - An OPTIONAL delta cache follows the header.

 Some may find the name delta cache confusing as it does not cache
 deltas of objects. May I suggest path hash cache as an alternative
 name?

Definitely, this is a typo.

 +   The cache is formed by `n` 4-byte hashes in a row, where `n` 
 is
 +   the amount of objects in the indexed packfile. Note that 
 this amount
 +   is the **total number of objects** and is not related to the
 +   number of commits that have been selected and indexed in the
 +   bitmap index.
 +
 +   The hashes are stored in Network Byte Order and they are the 
 same
 +   values generated by a normal revision walk during the 
 `pack-objects`
 +   phase.

 I find it interesting this is network byte order and not big-endian or
 little-endian based on the flag in the header.

As stated before, the flag in the header only affects the
Javaewah/libewok interface. Everything Git-related in the bitmap index
is in NWO, like it is customary in Git.


 +   The `n`nth hash in the cache is the name hash for the `n`th 
 object
 +   in the index for the indexed packfile.
 +
 +   [RATIONALE]:
 +
 +   The bitmap index allows us to skip the Counting Objects phase
 +   during `pack-objects` and yield all the OIDs that would be 
 reachable
 +   (WANTS) when generating the pack.
 +
 +   This optimization, however, means that we're adding objects 
 to the
 +   packfile straight from the packfile index, and hence we are 
 lacking
 +   path information for the objects that would normally be 
 generated
 +   during the Counting Objects phase.
 +
 +   This path information for each object is hashed and used as 
 a very
 +   effective way to find good delta bases when compressing the 
 packfile;
 +   without these hashes, the resulting packfiles are much less 
 optimal.
 +
 +   By storing all the hashes in a cache together with the 
 bitmapsin
 +   the bitmap index, we can yield not only the SHA1 of all 

Re: [PATCH 09/16] documentation: add documentation for the bitmap format

2013-06-25 Thread Vicent Martí
On Tue, Jun 25, 2013 at 11:17 PM, Junio C Hamano gits...@pobox.com wrote:
 What case are you talking about?

 The n-th object must be one of these four types and can never be of
 more than one type at the same time, so a natural expectation from
 the reader is If you OR them together, you will get the same set.
 If you say If you XOR them, that forces the reader to wonder when
 these bitmaps ever can overlap at the same bit position.

I guess this is just wording. I don't particularly care about the
distinction, but I'll change it to OR.


 To sum it up: I'd like to see this format be strictly in Network Byte
 Order,

 Good.

 I've been wondering what you meant by cannot be mmap-ed from the
 very beginning.  We mmapped the index for a long time, and it is
 defined in terms of network byte order.  Of course, pack .idx files
 are in network byte order, too, and we mmap them without problems.
 It seems that it primarily came from your fear that using network
 byte order may be unnecessarily hard to perform well, and it would
 be a good thing to do to try to do so first instead of punting from
 the beginning.

It cannot be mmapped not particularly because of endianness issues,
but because the original format is not indexed and requires a full
parse of the whole index before it can be accessed programatically.
The wrong endianness just increases the parse time.


 and I'm going to try to make it run fast enough in that
 encoding.

 Hmph.  Is it an option to start from what JGit does, so that people
 can use both JGit and your code on the same repository?  And then if
 you do not succeed, after trying to optimize in-core processing
 using that on-disk format to make it fast enough, start thinking
 about tweaking the on-disk format?

I'm afraid this is not an option. I have an old patchset that
implements JGit v1 bitmap loading (and in fact that's how I initially
developed these series -- by loading the bitmaps from JGit for
debugging), but I discarded it because it simply doesn't pan out in
production. ~3 seconds time to spawn `upload-pack` is not an option
for us. I did not develop a tweaked on-disk format out of boredom.

I could dig up the patch if you're particularly interested in
backwards compatibility, but since it was several times slower than
the current iteration, I have no interest (time, actually) to maintain
it, brush it up, and so on. I have already offered myself to port the
v2 format to JGit as soon as it's settled. It sounds like a better
investment of all our times.

Following up on Shawn's comments, I removed the little-endian support
from the on-disk format and implemented lazy loading of the bitmaps to
make up for it. The result is decent (slowed down from 250ms to 300ms)
and it lets us keep the whole format as NWO on disk. I think it's a
good tradeback.

The relevant commits are available on my fork of Git (I'll be sending
v2 of the patchset once I finish tackling the other reviews):

https://github.com/vmg/git/commit/d6cdd4329a547580bbc0143764c726c48b887271
https://github.com/vmg/git/commit/d8ec342fee87425e05c0db1e1630db8424612c71

As it stands right now, the only two changes from v1 of the on-disk format are:

- There is an index at the end. This is a good idea.
- The bitmaps are sorted in packfile-index order, not in packfile
order. This is a good idea.

As always, all your feedback is appreciated, but please keep in mind I
have strict performance concerns.

German kisses,
vmg
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/16] documentation: add documentation for the bitmap format

2013-06-25 Thread Vicent Martí
On Tue, Jun 25, 2013 at 5:58 PM, Thomas Rast tr...@inf.ethz.ch wrote:

 This is the technical documentation and design rationale for the new
 Bitmap v2 on-disk format.

 Hrmpf, that's what I get for reading the series in order...

 + The folowing flags are supported:
   ^^

 typos marked by ^

 + By storing all the hashes in a cache together with the 
 bitmapsin
  
 ^^

 + The obvious consequence is that the XOR of all 4 bitmaps will 
 result
 + in a full set (all bits sets), and the AND of all 4 bitmaps 
 will
^

 + - 1-byte XOR-offset
 + The xor offset used to compress this bitmap. For an 
 entry
 + in position `x`, a XOR offset of `y` means that the 
 actual
 + bitmap representing for this commit is composed by 
 XORing the
 + bitmap for this entry with the bitmap in entry `x-y` 
 (i.e.
 + the bitmap `y` entries before this one).
 +
 + Note that this compression can be recursive. In order 
 to
 + XOR this entry with a previous one, the previous entry 
 needs
 + to be decompressed first, and so on.
 +
 + The hard-limit for this offset is 160 (an entry can 
 only be
 + xor'ed against one of the 160 entries preceding it). 
 This
 + number is always positivea, and hence entries are 
 always xor'ed
  ^

 + with **previous** bitmaps, not bitmaps that will come 
 afterwards
 + in the index.

 Clever.  Why 160 though?

JGit implementation detail. It's the equivalent of the delta-window in
`pack-objects` for example.

HINT HINT: in practice, JGit only looks 16 positions behind to find
deltas, and we do the same. So the practical limit is 16. harhar


 + - 2 bytes of RESERVED data (used right now for better packing).

 What do they mean?

 +  With an index at the end of the file, we can load only this index in 
 memory,
 +  allowing for very efficient access to all the available bitmaps lazily (we
 +  have their offsets in the mmaped file).

 Is there anything preventing you from mmap()ing the index also?

Yeah, this format allows you to easily do a SHA1 bsearch with custom
step to lookup entries on the bitmap index, except for the fact that
the index is not sorted by SHA1, so you'd need a linear search
instead. :)

I decided against it because during most complex invocations of
`pack-objects`, we perform a couple thousand commit lookups to see if
they have a bitmap in the index, so it makes a lot of sense to load
the index tightly in a hash table before hand (which takes very little
time, to be fair). We more-than-make up for the loading time by having
much much faster lookups. I felt it was the right tradeoff (JGit does
the same, but in their case, because they cannot mmap. :p)

 +== Appendix A: Serialization format for an EWAH bitmap
 +
 +Ewah bitmaps are serialized in the protocol as the JAVAEWAH
 +library, making them backwards compatible with the JGit
 +implementation:
 +
 + - 4-byte number of bits of the resulting UNCOMPRESSED bitmap
 +
 + - 4-byte number of words of the COMPRESSED bitmap, when stored
 +
 + - N x 8-byte words, as specified by the previous field
 +
 + This is the actual content of the compressed bitmap.
 +
 + - 4-byte position of the current RLW for the compressed
 + bitmap
 +
 +Note that the byte order for this serialization is not defined by
 +default. The byte order for all the content in a serialized EWAH
 +bitmap can be known by the byte order flags in the header of the
 +bitmap index file.

 Please document the RLW format here.

Har har. I was going to comment on your review of the Ewah patchset,
but might as well do it here: the only thing I know about Ewah bitmaps
is that they work. And I know this because I did extensive fuzz
testing of my C port. Unfortunately, the original Java code I ported
from has 0 comments, so any documentation here would have to be
reverse-engineered.

Personally, I'd lean towards considering Ewah an external dependency
(black box); the headers for the library are commented accordingly,
clearly explaining the interfaces while hiding implementation details.
Of course, you're welcome to help me reverse engineer the
implementation, but I'm not sure this would be of much value. It'd be
better to make sure it passes the extensive test suite of the Java
version, and assume that Mr Lemire designed a sound format for the
bitmaps.

Swiss kisses,
vmg
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

Re: [PATCH 03/16] pack-objects: use a faster hash table

2013-06-25 Thread Vicent Martí
On Wed, Jun 26, 2013 at 12:48 AM, Junio C Hamano gits...@pobox.com wrote:
 After this, the function returns.  The original did not add to the
 table the object name we are looking at, but the new code first adds
 it to the table with the unconditional kh_put_sha1() above.  Is a
 call to kh_del_sha1() missing here ...

No, this is not the case. That's the return case for when *the object
was found because it already existed in the hash table* (hence we
access it if we're excluding it, to tag it as excluded). We don't want
to remove it from the hash table because we're not the ones we
inserted it.

We only call `kh_del_sha1` in the cases where:

1. The object wasn't found.
2. We inserted its key on the hash table.
3. We later learnt that we don't really want to pack this object.


 @@ -921,38 +916,42 @@ static int add_object_entry(const unsigned char *sha1, 
 enum object_type type,
   return 0;
   }

 - if (!exclude  local  has_loose_object_nonlocal(sha1))
 + if (!exclude  local  has_loose_object_nonlocal(sha1)) {
 + kh_del_sha1(packed_objects, ix);
   return 0;

 ... like this one, which seems to compensate for ahh, after all we
 realize we do not want to add this one to the table?

 @@ -966,19 +965,30 @@ static int add_object_entry(const unsigned char *sha1, 
 enum object_type type,
   entry-in_pack_offset = found_offset;
   }

 - if (object_ix_hashsz * 3 = nr_objects * 4)
 - rehash_objects();
 - else
 - object_ix[-1 - ix] = nr_objects;
 + kh_value(packed_objects, ix) = entry;
 + kh_key(packed_objects, ix) = entry-idx.sha1;
 + objects[nr_objects++] = entry;

   display_progress(progress_state, nr_objects);

 - if (name  no_try_delta(name))
 - entry-no_try_delta = 1;
 -
   return 1;
  }

 +static int add_object_entry(const unsigned char *sha1, enum object_type 
 type,
 + const char *name, int exclude)
 +{
 + if (add_object_entry_1(sha1, type, name_hash(name), exclude, NULL, 0)) 
 {
 + struct object_entry *entry = objects[nr_objects - 1];
 +
 + if (name  no_try_delta(name))
 + entry-no_try_delta = 1;
 +
 + return 1;
 + }
 +
 + return 0;
 +}

 It is somewhat unclear what we are getting from the split of the
 main part of this function into *_1(), other than the *_1() function
 now has a very deep indentation inside if (!found_pack), which is
 always true because the caller always passes NULL to found_pack.
 Perhaps this is an unrelated refactoring that is needed for later
 steps and does not have anything to do with the use of new hash
 function?

Yes, apologies for not making this clear. By refactoring into `_1`,
you can see how `traverse_bitmap_commit_list` can use the `_1` version
directly as a callback, to insert objects straight into the packing
list without looking them up. This is very efficient because we can
pass the whole API straight from the bitmap code:

1. The SHA1: we find it by simply looking up the `nth` sha1 on the
pack index (if we are yielding bit `n`)
2. The object type: we find it because we have type indexes that let
us know the type of any given bit in the bitmap by and-ing it with the
index.
3. The hash for its name: we can look it up from the name hash cache
in the new bitmap format.
4. Exclude flag: we never exclude when working with bitmaps
5. found_pack: all the bitmapped objects come from the same pack!
6. found_offset: we find it by simply looking up the `nth` offset on
the pack index (if we are yielding bit `n`)

Boom! We filled the callback just from the data in a bitmap. Ain't that nice?

Let me amend the commit message.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 10/16] pack-objects: use bitmaps when packing objects

2013-06-25 Thread Vicent Martí
On Wed, Jun 26, 2013 at 1:06 AM, Junio C Hamano gits...@pobox.com wrote:
 @@ -83,6 +84,9 @@ static struct progress *progress_state;
  static int pack_compression_level = Z_DEFAULT_COMPRESSION;
  static int pack_compression_seen;

 +static int bitmap_support;
 +static int use_bitmap_index;

 OK.

 @@ -2131,6 +2135,10 @@ static int git_pack_config(const char *k, const char 
 *v, void *cb)
   cache_max_small_delta_size = git_config_int(k, v);
   return 0;
   }
 + if (!strcmp(k, pack.usebitmaps)) {
 + bitmap_support = git_config_bool(k, v);
 + return 0;
 + }

 Hmph, so bitmap_support, not use_bitmap_index, keeps track of the
 user request?  Somewhat confusing.

   if (!strcmp(k, pack.threads)) {
   delta_search_threads = git_config_int(k, v);
   if (delta_search_threads  0)
 @@ -2366,8 +2374,24 @@ static void get_object_list(int ac, const char **av)
   die(bad revision '%s', line);
   }

 + if (use_bitmap_index) {
 + uint32_t size_hint;
 +
 + if (!prepare_bitmap_walk(revs, size_hint)) {
 + khint_t new_hash_size = (size_hint * (1.0 / 
 __ac_HASH_UPPER)) + 0.5;

 What is __ac_HASH_UPPER?  That is a very unusual name for a variable
 or a constant.  Also it is mildly annoying to see unnecessary use of
 float like this.

See the updated patch at:

https://github.com/vmg/git/blob/vmg/bitmaps-master/builtin/pack-objects.c#L2422


 + kh_resize_sha1(packed_objects, new_hash_size);
 +
 + nr_alloc = (size_hint + 63)  ~63;
 + objects = xrealloc(objects, nr_alloc * sizeof(struct 
 object_entry *));
 +
 + traverse_bitmap_commit_list(add_object_entry_1);
 + return;
 + }
 + }
 +
   if (prepare_revision_walk(revs))
   die(revision walk setup failed);
 +
   mark_edges_uninteresting(revs.commits, revs, show_edge);
   traverse_commit_list(revs, show_commit, show_object, NULL);

 @@ -2495,6 +2519,8 @@ int cmd_pack_objects(int argc, const char **argv, 
 const char *prefix)
   N_(pack compression level)),
   OPT_SET_INT(0, keep-true-parents, grafts_replace_parents,
   N_(do not hide commits by grafts), 0),
 + OPT_BOOL(0, bitmaps, bitmap_support,
 +  N_(enable support for bitmap optimizations)),

 Please match this with the name of configuration variable, i.e. --use-bitmaps

   OPT_END(),
   };

 @@ -2561,6 +2587,11 @@ int cmd_pack_objects(int argc, const char **argv, 
 const char *prefix)
   if (keep_unreachable  unpack_unreachable)
   die(--keep-unreachable and --unpack-unreachable are 
 incompatible.);

 + if (bitmap_support) {
 + if (use_internal_rev_list  pack_to_stdout)
 + use_bitmap_index = 1;

 OK, so only when some internal condition is met, the user request to
 use bitmap is honored and the deision is kept in use_bitmap_index.

 It may be easier to read if you get rid of bitmap_support, set
 user_bitmap_index directly from the command line and config, and did
 this here instead:

 if (!(use_internal_rev_list  pack_to_stdout))
 use_bitmap_index = 0;

Yeah, I'm not particularly happy with the way these flags are
implemented. I'll update this.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 13/16] repack: consider bitmaps when performing repacks

2013-06-25 Thread Vicent Martí
On Wed, Jun 26, 2013 at 1:00 AM, Junio C Hamano gits...@pobox.com wrote:
 @@ -156,6 +156,11 @@ do
   fullbases=$fullbases pack-$name
   chmod a-w $PACKTMP-$name.pack
   chmod a-w $PACKTMP-$name.idx
 +
 + test -f $PACKTMP-$name.bitmap 
 + chmod a-w $PACKTMP-$name.bitmap 
 + mv -f $PACKTMP-$name.bitmap $PACKDIR/pack-$name.bitmap

 If we see a temporary bitmap but somehow failed to move it to the
 final name, should we _ignore_ that error, or should we die, like
 the next two lines do?

I obviously decided against dying (as you can see on the patch, har
har), because the bitmap is not required for the proper operation of
the Git repository, unlike the packfile and the index.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 11/16] rev-list: add bitmap mode to speed up lists

2013-06-25 Thread Vicent Martí
I'm afraid I cannot reproduce the segfault locally (assuming you're
performing the rev-list on the git/git repository). Could you please
send me more information, and a core dump if possible?

On Tue, Jun 25, 2013 at 6:22 PM, Thomas Rast tr...@inf.ethz.ch wrote:
 Vicent Marti tan...@gmail.com writes:

 Calling `git rev-list --use-bitmaps [committish]` is the equivalent
 of `git rev-list --objects`, but the rev list is performed based on
 a bitmap result instead of using a manual counting objects phase.

 Why would we ever want to not --use-bitmaps, once it actually works?
 I.e., shouldn't this be the default if pack.usebitmaps is set (or
 possibly even core.usebitmaps for these things)?

 These are some example timings for `torvalds/linux`:

   $ time ../git/git rev-list --objects master  /dev/null

   real0m25.567s
   user0m25.148s
   sys 0m0.384s

   $ time ../git/git rev-list --use-bitmaps master  /dev/null

   real0m0.393s
   user0m0.356s
   sys 0m0.036s

 I see your badass numbers, and raise you a critical issue:

   $ time git rev-list --use-bitmaps --count --left-right 
 origin/pu...origin/next
   Segmentation fault

   real0m0.408s
   user0m0.383s
   sys 0m0.022s

 It actually seems to be related solely to having negated commits in the
 walk:

   thomas@linux-k42r:~/g(next u+65)$ time git rev-list --use-bitmaps --count 
 origin/pu
   32315

   real0m0.041s
   user0m0.034s
   sys 0m0.006s
   thomas@linux-k42r:~/g(next u+65)$ time git rev-list --use-bitmaps --count 
 origin/pu ^origin/next
   Segmentation fault

   real0m0.460s
   user0m0.214s
   sys 0m0.244s

 I also can't help noticing that the time spent generating the segfault
 would have sufficed to generate the answer the old way as well:

   $ time git rev-list --count --left-right origin/pu...origin/next
   189 125

   real0m0.409s
   user0m0.386s
   sys 0m0.022s

 Can we use the same trick to speed up merge base computation and then
 --left-right?  The latter is a component of __git_ps1 and can get
 somewhat slow in some cases, so it would be nice to make it really fast,
 too.

 --
 Thomas Rast
 trast@{inf,student}.ethz.ch
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html