Re: [PATCH] pack-bitmap: do not use gcc packed attribute
On Mon, Aug 4, 2014 at 9:19 PM, Karsten Blees <karsten.bl...@gmail.com> wrote:
> This raises the question why we read via mmap at all

The first version of the pack bitmap format I wrote for GitHub was 50% faster to load than this one, because it was designed to be mmapable. Eventually we moved to the JGit-compatible bitmap format (because I get paid a lot of money to do as I'm told -- not because of some inherent benefit of the JGit format), which needs to be read sequentially, but I never bothered to change the mmap reading code.

I believe your patch makes a lot of sense -- at this point we might as well remove the mmapping altogether and read the file sequentially.

Cheers,
vmg
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Git in GSoC 2014
On Wed, Feb 26, 2014 at 11:41 AM, Michael Haggerty <mhag...@alum.mit.edu> wrote:
> Since time is short, I already started on this. I wrote a first draft of an introduction for the students. I also started looking for microprojects. I started going through our source files alphabetically, and have already found six suggestions by bundle.c, so I don't think there will be a problem finding enough tiny things to do.

Note that for projects that are either libgit2-centric or mix Core Git and libgit2, it would be worthwhile for the students to submit a pull request to libgit2 (or ideally, to both projects, although that's going to be hard) in order to familiarize themselves with the code base.
Re: Git in GSoC 2014
On Wed, Feb 26, 2014 at 12:16 PM, Duy Nguyen <pclo...@gmail.com> wrote:
> On Tue, Feb 25, 2014 at 10:41 PM, Jeff King <p...@peff.net> wrote:
>> I'm pleased to announce that Git has been accepted to this year's Google Summer of Code. Student proposals will start coming in on March 22. In the meantime students will be reading our Ideas page[1] and enquiring about the program on the mailing list and on irc. There are many ways that existing git developers can help:
>>
>>   - continue to add to and polish the Ideas page
>
> One thing I noticed after tg/index-v4-format is that both libgit2 and JGit do not seem to support index v4. So we could add index v4 support in libgit2 to the ideas page. It's a relatively small task, though, once you get the hang of the index format.

That sounds like a nice task for the Summer of Code too, especially with the current effort to make index v4 more visible in Core Git. I wonder if anybody from JGit would also be interested in mentoring the equivalent task (index v4 in JGit). I've CC'd Shawn Pearce.
Re: Git GSoC 2014
On Thu, Feb 13, 2014 at 10:45 PM, Thomas Rast <t...@thomasrast.ch> wrote:
> Replace object loading/writing layer by libgit2
>
> Git reads objects from storage (loose and packed) through functions in sha1_file.c. Most commands only require very simple, opaque read and write access to the object storage. As a weather balloon, show that it is feasible to use the libgit2 git_odb_* routines for these simple callers. Aim for passing the git test suite using git_odb_* object storage access, except for tests that verify behavior in the face of storage corruption, replacement objects, alternate storage locations, and similar quirks. Of course it is even better if you pass the test suite without exception.
>
> Language: C
> Difficulty: hard
> Possible mentors: Thomas Rast and fill in libgit2 expert

Note that we have several people on the libgit2 team who are willing (and excited) to mentor or co-mentor this project, or any of the other libgit2-related projects that have been proposed. The prospective list is:

  - Vicent Marti
  - Russell Belfer
  - Ed Thomson
  - Carlos Martin (past GSoC student)

So there shouldn't be any mentor shortage.

Cheers,
vmg
Re: [PATCH 3/3] ewah: support platforms that require aligned reads
On Fri, Jan 24, 2014 at 12:44 AM, Jonathan Nieder <jrnie...@gmail.com> wrote:
>> --- a/ewah/ewah_io.c
>> +++ b/ewah/ewah_io.c
>> @@ -112,23 +112,38 @@ int ewah_serialize(struct ewah_bitmap *self, int fd)
>> [...]
>> +#if __BYTE_ORDER != __BIG_ENDIAN
>
> Is this portable?

We explicitly set the __BYTE_ORDER macros in `compat/bswap.h`. In fact, this preprocessor conditional is the same one we use when choosing which version of the `ntohl` macro to define, which is why I decided to use it here.
Re: [PATCH] pack-objects: turn off bitmaps when skipping objects
On Fri, Jan 24, 2014 at 12:45 AM, Siddharth Agarwal <s...@fb.com> wrote:
> Yes, we'd prefer to do that too. How do you actually do this, though? I don't see a way to pass `--honor-pack-keep` (shouldn't I pass its inverse?) down to `git-pack-objects`.

We run with this patch in production; it may be of use to you:

    https://gist.github.com/vmg/8589317

In fact, it may be worth upstreaming too. I'll kindly ask peff to do it when he has a moment. Apologies for not attaching the patch inline; the GMail web UI doesn't mix well with a patch workflow.

Cheers,
vmg
Re: [ANNOUNCE] Git v1.9-rc0
On Wed, Jan 22, 2014 at 5:11 PM, Junio C Hamano <gits...@pobox.com> wrote:
> Stefan Näwe <stefan.na...@atlas-elektronik.com> writes:
>> On 22.01.2014 13:53, Javier Domingo Cansino wrote:
>>> Will there be any change in how tarballs are distributed, given that Google will be shutting down the Google Code Downloads section[1][2]?
>>
>> Am I missing something, or what's wrong with this:
>>
>>     https://github.com/gitster/git/archive/v1.9-rc0.tar.gz
>>
>> or any https://github.com/gitster/git/archive/$TAG.tar.gz ??
>
> Do these consume CPU every time somebody asks for a tarball? That might be considered wrong, depending on the view.

No, our infrastructure caches frequently requested tarballs so they don't have to be regenerated on the fly. If you would prefer to distribute a different version of the tarball for the release (e.g. one with a different filename or folder structure), you can upload it directly to the release page for the tag:

    https://github.com/gitster/git/releases/tag/v1.9-rc0

We'll automatically mirror your release to S3 and serve it from there. You can also go ahead and edit the release page with the changelog you've posted in this email thread to make it more user-friendly.

WE WILL SERVE YOUR RELEASES, JUNIO, BECAUSE WE LOVE YOU

-vmg
Re: What's cooking in git.git (Nov 2013, #05; Thu, 21)
On Fri, Nov 22, 2013 at 8:36 PM, Junio C Hamano <gits...@pobox.com> wrote:
> Do we want to try this in 'next' post-1.8.5, or should I try to prod an area expert like Shawn into doing another round of review?

Yes ;-).

> I recall starting to read the series over and then got sidetracked in the middle and never finishing. I'll try to make time sometime this weekend (we are still buried in boxes after the move, though, so no promises) myself. How close is this to what you guys are running in production these days, by the way?

We are running a slightly older version of the patchset, because we're still on 1.8.4 and the current reroll doesn't apply cleanly there. If this could make it to `next` some time next week, that would work out great for us, because we may start considering using `next` as a partial or full deployment on our production machines. This also means that we could exercise the patchset, and everything else that is queued up in the next release... You must remember all the corner cases and bugs peff brings to the list every time we deploy a new Git to production. Wouldn't it be nice to have a thorough checking of the current iteration *before* the release, and not after? :) *hint hint*
Re: [PATCH 0/5] fix up 'jk/pack-bitmap' branch
Thank you Ramsay, all the patches look OK to me.

On Thu, Nov 7, 2013 at 11:19 PM, Jeff King <p...@peff.net> wrote:
> On Thu, Nov 07, 2013 at 09:58:02PM +0000, Ramsay Jones wrote:
>> These patches fix various errors/warnings in the cygwin, MinGW and msvc builds, provoked by the jk/pack-bitmap branch.
>
> Thanks. Your timing is impeccable, as I was just sitting down to finalize the next re-roll. I'll add these in.
>
> -Peff
Re: What's cooking in git.git (Oct 2013, #07; Mon, 28)
On Wed, Oct 30, 2013 at 5:51 PM, Torsten Bögershausen <tbo...@web.de> wrote:
> There is a name clash under cygwin 1.7 (1.5 is OK). The following first-aid hot fix works for me:
> [...]
> /Torsten

If Cygwin declares its own bswap_64, wouldn't it be better to use it instead of overwriting it with our own?
Re: [PATCH 07/19] compat: add endianness helpers
On Oct 30, 2013, at 9:25 AM, Jeff King <p...@peff.net> wrote:
> On Sat, Oct 26, 2013 at 09:55:36AM +0200, Thomas Rast wrote:
>>> The POSIX standard doesn't currently define a `nothll`/`htonll`
>>
>> typo: ntohll
>
> Thanks.
>
>>> function pair to perform network-to-host and host-to-network
>>> swaps of 64-bit data. These 64-bit swaps are necessary for the
>>> on-disk storage of EWAH bitmaps if they are not in native byte
>>> order.
>>> [...]
>>> +# include <byteswap.h>
>>
>> Do we need a hack on top, similar to what ntoh_l and hton_l do,
>> for platforms that do not support unaligned access?
>
> Ugh, probably. I didn't even know about those. But we do use them
> when reading the ewah bitmaps, which I believe can be at random
> offsets. We should be able to use the same ntoh_l solution.
>
> -Peff

You are right: when reading from mmaps we cannot ensure the alignment of the 8-byte sections. For writing the bitmaps to memory, however, we can make that assumption, because the writes will happen to either malloc'ed segments or the stack. A proposed fix follows:

From 0eed934805543390195f8aee231f8a1da33dad0b Mon Sep 17 00:00:00 2001
From: Vicent Marti <tan...@gmail.com>
Date: Wed, 30 Oct 2013 15:27:20 +0100
Subject: [PATCH] ewah: support non-aligned reads

When reading straight from a mmaped file, some platforms do not
support atomically loading 8 bytes from arbitrary positions in memory
to perform a byte swap. To work around this, we `memcpy` the whole
area that needs to be byteswapped to its final destination, and then
perform the byteswap on each 8-byte chunk: the destination chunks
*will* be aligned, because the destination area is a malloc'ed chunk
of memory.

Signed-off-by: Vicent Marti <tan...@gmail.com>
---
 ewah/ewah_io.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/ewah/ewah_io.c b/ewah/ewah_io.c
index fca93f9..41f6d3d 100644
--- a/ewah/ewah_io.c
+++ b/ewah/ewah_io.c
@@ -76,6 +76,10 @@ int ewah_serialize_to(struct ewah_bitmap *self,
 	buffer = self->buffer;
 	words_left = self->buffer_size;
 
+	/* buffer is a malloc'ed buffer, which we can assume to
+	 * be at the very least 8-byte aligned */
+	assert((intptr_t)buffer % 8 == 0);
+
 	while (words_left >= words_per_dump) {
 		for (i = 0; i < words_per_dump; ++i, ++buffer)
 			dump[i] = htonll(*buffer);
@@ -117,7 +121,6 @@ int ewah_serialize(struct ewah_bitmap *self, int fd)
 int ewah_read_mmap(struct ewah_bitmap *self, void *map, size_t len)
 {
 	uint32_t *read32 = map;
-	eword_t *read64;
 	size_t i;
 
 	self->bit_size = ntohl(*read32++);
@@ -128,10 +131,19 @@ int ewah_read_mmap(struct ewah_bitmap *self, void *map, size_t len)
 	if (!self->buffer)
 		return -1;
 
-	for (i = 0, read64 = (void *)read32; i < self->buffer_size; ++i)
-		self->buffer[i] = ntohll(*read64++);
+#ifdef NEEDS_ALIGNED_ACCESS
+	memcpy(self->buffer, read32, self->buffer_size * sizeof(eword_t));
+	for (i = 0; i < self->buffer_size; ++i)
+		self->buffer[i] = ntohll(self->buffer[i]);
+#else
+	{
+		eword_t *read64 = (void *)read32;
+		for (i = 0; i < self->buffer_size; ++i)
+			self->buffer[i] = ntohll(read64[i]);
+	}
+#endif
 
-	read32 = (void *)read64;
+	read32 += self->buffer_size * sizeof(eword_t) / sizeof(uint32_t);
 	self->rlw = self->buffer + ntohl(*read32++);
 
 	return (3 * 4) + (self->buffer_size * 8);
@@ -175,7 +187,7 @@ int ewah_deserialize(struct ewah_bitmap *self, int fd)
 			return -1;
 
 		for (i = 0; i < words_per_dump; ++i, ++buffer)
-			*buffer = ntohll(dump[i]);
+			*buffer = ntohll(dump[i]); /* aligned on stack */
 
 		words_left -= words_per_dump;
 	}
@@ -185,7 +197,7 @@ int ewah_deserialize(struct ewah_bitmap *self, int fd)
 			return -1;
 
 		for (i = 0; i < words_left; ++i, ++buffer)
-			*buffer = ntohll(dump[i]); /* aligned on stack */
 	}
 
 	/** 32 bit -- position for the RLW */
--
1.8.3.2
Re: What's cooking in git.git (Oct 2013, #06; Fri, 25)
On Mon, Oct 28, 2013 at 4:48 PM, Junio C Hamano <gits...@pobox.com> wrote:
> jk/pack-bitmap adds khash.h, which from a first glance looks like yet another hash table implementation. I was just wondering if kb's new hash tables can cover the need of pack-bitmap.c too, so we can remove khash.h later...

Good thinking ;-). We use the khash tables to map:

  - sha1 (const char *) to (void *)
  - sha1 (const char *) to int

The new `hashmap.c` covers the first case quite well (albeit slightly more verbosely than I'd like), but the second case doesn't quite work. Since the new hashmap needs to embed a struct hashmap_entry in all its values (to allow for separate chaining), having it map to `int` values requires a struct like this:

    struct sha1_position {
        struct hashmap_entry ent; /* { struct hashmap_entry *next; unsigned int hash; } */
        int position;
    };

khash, on the other hand, is capable of storing the position values as part of the hash table itself (i.e. `int **buckets`), and saves us from thousands of bytes of allocations plus a pointer indirection per lookup.

I am not sure whether the consistency of having a single hash map implementation warrants the performance and memory hits when operating on the extended index. Please advise.

luv,
vmg
Re: What's cooking in git.git (Oct 2013, #06; Fri, 25)
On Mon, Oct 28, 2013 at 8:45 PM, Karsten Blees <karsten.bl...@gmail.com> wrote:
>> The new `hashmap.c` covers the first case quite well (albeit slightly more verbosely than I'd like), but in the second case it doesn't quite work. Since the new hash needs to embed the struct hashmap_entry in all its values (to allow for separate chaining), having it map to `int` keys requires a struct like this:
>>
>>     struct sha1_position {
>>         struct hashmap_entry ent; /* { struct hashmap_entry *next; unsigned int hash; } */
>>         int position;
>>     };
>
> Hmm... isn't that position rather an index into two separately maintained arrays? So you'd rather have:
>
>     struct sha1_position {
>         struct hashmap_entry ent;
>         uint32_t pack_name_hash;
>         struct object *object;
>     };

No, this is not quite right. We use the separate arrays because the normal operation mode is to index by position (e.g. we need the nth object in the extended index); the hash table is an auxiliary structure to reverse that indexing (e.g. what position does this SHA1 have in the extended index?). The information that is always required to construct bitmaps is the position, so we need to store the indexes in a map.

>> khash, on the other hand, is capable of storing the position values as part of the hash table itself (i.e. `int **buckets`), and saves us from thousands of bytes of allocations + indirection.
>
> However, khash keeps separate arrays for flags, keys and values, all of them overallocated by 1 / load factor (so there's lots of unused space). ext_index.objects and ext_index.hashes are also overallocated by the usual alloc_nr() factor 1.5.

FWIW, the flags for khash are compacted, so they are stored much more tightly than pointers, even when overallocated. Regarding memory consumption, I think both implementations will be pretty similar.

> Hashmap allocates many small regions while khash re-allocates a few big ones... I really don't know which is better; ideally, entries would be allocated in chunks to minimize both heap overhead and memcpy disadvantages.

I agree, both implementations probably have very similar memory characteristics -- probably close enough not to matter.

> Regarding performance, khash uses open addressing, which requires more key comparisons (O(1/(1-load_factor))) than chaining (O(1+load_factor)). However, any measurable differences will most likely be dwarfed by IO costs in this particular use case.

I don't think this is true. If you actually run a couple of insertion and lookup benchmarks comparing the two implementations, you'll find khash to be around ~30% faster for most workloads (venturing a guess from past experience). I am obviously not the author of khash, but I've found that the theoretical increase in key comparisons is completely dwarfed by the benefit of increased cache locality during the probing phase. I still haven't found a faster C hash table implementation than khash for the general case; that's why I normally use it despite the worrisome preprocessor crash-party going on in that header file.

> Btw., pack-objects.c::rehash_objects() in patch 03 unnecessarily checks for duplicates. That's probably the reason for the high hashcmp times you found in the first round of the patch series.

Patch 03 is a refactoring -- the duplicate-checking code has been in pack-objects.c for years. I am not sure the duplicate checking is superfluous at all: there are many cases where you could be double-inserting objects in a pack-objects run, and you really don't want to generate packfiles with duplicate objects.

Thanks for the feedback!

luv,
vmg
Re: [PATCH 09/16] documentation: add documentation for the bitmap format
That was a very rude reply. :( Please refrain from interacting with me on the ML in the future; I'll do likewise.

Thanks!
vmg

On Thu, Jun 27, 2013 at 3:11 AM, Shawn Pearce <spea...@spearce.org> wrote:
> On Tue, Jun 25, 2013 at 4:08 PM, Vicent Martí <tan...@gmail.com> wrote:
>> On Tue, Jun 25, 2013 at 11:17 PM, Junio C Hamano <gits...@pobox.com> wrote:
>>> What case are you talking about? The n-th object must be one of these four types and can never be of more than one type at the same time, so a natural expectation from the reader is "If you OR them together, you will get the same set." If you say "If you XOR them," that forces the reader to wonder when these bitmaps can ever overlap at the same bit position.
>>
>> I guess this is just wording. I don't particularly care about the distinction, but I'll change it to OR.
>
> Hmm, OK. If you think XOR and OR are the same operation, I also have a bridge to sell you. It's in Brooklyn. It's a great value. The correct operation is OR. Not XOR. OR. Drop the X.
>
>> It cannot be mmapped not particularly because of endianness issues, but because the original format is not indexed and requires a full parse of the whole index before it can be accessed programmatically. The wrong endianness just increases the parse time.
>
> Wrong endianness has nothing to do with the parse time. Modern CPUs can flip a word around very quickly. In JGit we chose to parse the file at load time because it's simpler than having an additional index segment, and we do what you did, which is to toss the object SHA-1s into a hashtable for fast lookup. By the time we look for the SHA-1s and toss them into a hashtable, we can stride through the file and find the bitmap regions. Simple. In other words, the least complex solution possible that still provides good performance. I'd say we have pretty good performance.
>
>>> Is it an option to start from what JGit does, so that people can use both JGit and your code on the same repository?
>
> I'm afraid I agree here with Junio. The JGit format is already shipping in JGit 3.0, Gerrit Code Review 2.6, and in heavy production use for almost a year on android.googlesource.com and Google's own internal Git trees. I would prefer to see a series adding bitmap support to C Git start with the existing format, make it run, taking advantage of the optimizations JGit uses (many of which you ignored and tried to fix in other ways), and then look at improving the file format itself if load time is still the largest low-hanging fruit in upload-pack. I'm guessing it's not. You futzed around with the object table, but JGit sped itself up considerably by simply not using the object table when the bitmap is used. I think there are several such optimizations you missed in your rush to redefine the file format.
>
>>> And then, if you do not succeed after trying to optimize in-core processing using that on-disk format to make it fast enough, start thinking about tweaking the on-disk format?
>>
>> I'm afraid this is not an option. I have an old patchset that implements JGit v1 bitmap loading (and in fact that's how I initially developed this series -- by loading the bitmaps from JGit for debugging), but I discarded it because it simply doesn't pan out in production. ~3 seconds to spawn `upload-pack` is not an option for us. I did not develop a tweaked on-disk format out of boredom.
>
> I think your code or experiments are bogus. Even on our systems with JGit, a cold start for the Linux kernel doesn't take 3s. And this is JGit, where Java is slow because Jesus it has a lot of factories, and without mmap'ing the file into the server's address space. Hell, the file has to come over the network from a remote disk array.
>
>> I could dig up the patch if you're particularly interested in backwards compatibility, but since it was several times slower than the current iteration, I have no interest (time, actually) to maintain it, brush it up, and so on. I have already offered to port the v2 format to JGit as soon as it's settled. It sounds like a better investment of all our time.
>
> Actually, I think the format you propose here is inferior to the JGit format. In particular, the idx-ordering means the EWAH code is useless. You might as well not use the EWAH format and just store 2.6M bits per commit. The idx-ordering also makes it *much* harder to emit a pack file in a reasonable order for the client. Colby and I tried idx-ordering and discarded it when it didn't perform as well as the pack-ordering that JGit uses.

Following up on Shawn's comments, I removed the little-endian support from the on-disk format and implemented lazy loading of the bitmaps to make up for it. The result is decent (loading slowed down from 250ms to 300ms), and it lets us keep the whole format as NWO on disk. I think it's a good trade-off. The maintenance burden of two endian formats in a single file is too high.
Re: [PATCH 07/16] compat: add endinanness helpers
On Tue, Jun 25, 2013 at 3:08 PM, Peter Krefting <pe...@softwolves.pp.se> wrote:
> endian(3) claims that glibc 2.9+ defines be64toh() and htobe64(), which should do what you are looking for. The manual page does mention them being named differently across OSes, though, so you may need to be careful with that.

I'm aware of that, but Git needs to build with glibc 2.7+ (or was it 2.6?), hence the need for this compat layer.
Re: [PATCH 09/16] documentation: add documentation for the bitmap format
On Tue, Jun 25, 2013 at 7:42 AM, Shawn Pearce <spea...@spearce.org> wrote:
> I very much hate seeing a file format that is supposed to be portable that supports both big-endian and little-endian encoding.

Well, the bitmap index is not supposed to be portable, as it doesn't get sent over the wire in any situation. Regardless, the format *is* portable, because it supports both encodings and clearly defines which one the current file is using. I think that's a good tradeoff!

> Such a specification forces everyone to implement two code paths to handle reading data from the file, on the off-chance they are on the wrong platform.

Extra code paths have never been an issue in the JGitAbstractFactoryGeneratorOfGit, har har har. Ah, I'm such a funny guy when it comes to Java.

Anyway, I designed this keeping JGit in mind. In this specific case, it doesn't force you to add any new code paths. The endianness changes only affect the serialization format of the bitmaps, which is not part of Git or JGit itself but of the JavaEWAH/libewok library. The interface for reading in that library has already been wisely abstracted in JGit (https://github.com/eclipse/jgit/blob/master/org.eclipse.jgit/src/org/eclipse/jgit/internal/storage/file/PackBitmapIndexV1.java#L133), so changing the byte order simply means changing the SimpleDataInput to a little-endian one.

I agree this is not ideal, or elegant, but I'm having a hard time making an argument for sacrificing objective speed for the sake of subjective simplicity.

> What is wrong with picking one encoding and sticking to it?

It prevents us from making this optimally fast on the machines where it needs to be. Regardless, I must admit I haven't generated numbers for this in a while (the BE-LE switch is one of the first optimizations I did). I'm going to try to re-implement full NWO loading and see how much slower it is before I continue arguing for/against it. If I can get it within a reasonable margin (say, 15%) of the current implementation, I'd definitely be in favor of sticking to only NWO for the whole file. If it's slower than that -- well, Git has never compromised on speed, and I don't think there's a point to be made for starting to do that now.

>> +  - BITMAP_OPT_HASH_CACHE (0x8)
>> +    If present, a hash cache for finding delta bases will be available
>> +    right after the header block in this index. See the following
>> +    section for details.
>> +
>> +  4-byte entry count (network byte order)
>> +
>> +    The total count of entries (bitmapped commits) in this bitmap index.
>> +
>> +  20-byte checksum
>> +
>> +    The SHA1 checksum of the pack this bitmap index belongs to.
>> +
>> +  - An OPTIONAL delta cache follows the header.
>
> Some may find the name "delta cache" confusing, as it does not cache deltas of objects. May I suggest "path hash cache" as an alternative name?

Definitely, this is a typo.

>> +    The cache is formed by `n` 4-byte hashes in a row, where `n` is
>> +    the amount of objects in the indexed packfile. Note that this amount
>> +    is the **total number of objects** and is not related to the
>> +    number of commits that have been selected and indexed in the
>> +    bitmap index.
>> +
>> +    The hashes are stored in Network Byte Order and they are the same
>> +    values generated by a normal revision walk during the `pack-objects`
>> +    phase.
>
> I find it interesting this is network byte order and not big-endian or little-endian based on the flag in the header.

As stated before, the flag in the header only affects the JavaEWAH/libewok interface. Everything Git-related in the bitmap index is in NWO, as is customary in Git.

>> +    The `n`nth hash in the cache is the name hash for the `n`th object
>> +    in the index for the indexed packfile.
>> +
>> +    [RATIONALE]:
>> +
>> +    The bitmap index allows us to skip the Counting Objects phase
>> +    during `pack-objects` and yield all the OIDs that would be reachable
>> +    (WANTS) when generating the pack.
>> +
>> +    This optimization, however, means that we're adding objects to the
>> +    packfile straight from the packfile index, and hence we are lacking
>> +    path information for the objects that would normally be generated
>> +    during the Counting Objects phase.
>> +
>> +    This path information for each object is hashed and used as a very
>> +    effective way to find good delta bases when compressing the packfile;
>> +    without these hashes, the resulting packfiles are much less optimal.
>> +
>> +    By storing all the hashes in a cache together with the bitmapsin
>> +    the bitmap index, we can yield not only the SHA1 of all
Re: [PATCH 09/16] documentation: add documentation for the bitmap format
On Tue, Jun 25, 2013 at 11:17 PM, Junio C Hamano <gits...@pobox.com> wrote:
> What case are you talking about? The n-th object must be one of these four types and can never be of more than one type at the same time, so a natural expectation from the reader is "If you OR them together, you will get the same set." If you say "If you XOR them," that forces the reader to wonder when these bitmaps can ever overlap at the same bit position.

I guess this is just wording. I don't particularly care about the distinction, but I'll change it to OR.

> To sum it up: I'd like to see this format be strictly in Network Byte Order,

Good.

> I've been wondering what you meant by "cannot be mmap-ed" from the very beginning. We mmapped the index for a long time, and it is defined in terms of network byte order. Of course, pack .idx files are in network byte order, too, and we mmap them without problems. It seems that it primarily came from your fear that using network byte order may be unnecessarily hard to perform well, and it would be a good thing to try to do so first instead of punting from the beginning.

It cannot be mmapped not particularly because of endianness issues, but because the original format is not indexed and requires a full parse of the whole index before it can be accessed programmatically. The wrong endianness just increases the parse time.

>> and I'm going to try to make it run fast enough in that encoding.
>
> Hmph. Is it an option to start from what JGit does, so that people can use both JGit and your code on the same repository? And then, if you do not succeed after trying to optimize in-core processing using that on-disk format to make it fast enough, start thinking about tweaking the on-disk format?

I'm afraid this is not an option. I have an old patchset that implements JGit v1 bitmap loading (and in fact that's how I initially developed this series -- by loading the bitmaps from JGit for debugging), but I discarded it because it simply doesn't pan out in production. ~3 seconds to spawn `upload-pack` is not an option for us. I did not develop a tweaked on-disk format out of boredom.

I could dig up the patch if you're particularly interested in backwards compatibility, but since it was several times slower than the current iteration, I have no interest (time, actually) to maintain it, brush it up, and so on. I have already offered to port the v2 format to JGit as soon as it's settled. It sounds like a better investment of all our time.

Following up on Shawn's comments, I removed the little-endian support from the on-disk format and implemented lazy loading of the bitmaps to make up for it. The result is decent (loading slowed down from 250ms to 300ms), and it lets us keep the whole format as NWO on disk. I think it's a good trade-off. The relevant commits are available on my fork of Git (I'll be sending v2 of the patchset once I finish tackling the other reviews):

    https://github.com/vmg/git/commit/d6cdd4329a547580bbc0143764c726c48b887271
    https://github.com/vmg/git/commit/d8ec342fee87425e05c0db1e1630db8424612c71

As it stands right now, the only two changes from v1 of the on-disk format are:

  - There is an index at the end. This is a good idea.
  - The bitmaps are sorted in packfile-index order, not in packfile order. This is a good idea.

As always, all your feedback is appreciated, but please keep in mind I have strict performance concerns.

German kisses,
vmg
Re: [PATCH 09/16] documentation: add documentation for the bitmap format
On Tue, Jun 25, 2013 at 5:58 PM, Thomas Rast tr...@inf.ethz.ch wrote:

This is the technical documentation and design rationale for the new Bitmap v2 on-disk format.

Hrmpf, that's what I get for reading the series in order...

+ The folowing flags are supported:
      ^^ typos marked by ^

+ By storing all the hashes in a cache together with the bitmapsin
                                                               ^^

+ The obvious consequence is that the XOR of all 4 bitmaps will result
+ in a full set (all bits sets), and the AND of all 4 bitmaps will
                          ^

+ - 1-byte XOR-offset
+	The xor offset used to compress this bitmap. For an entry
+	in position `x`, a XOR offset of `y` means that the actual
+	bitmap representing for this commit is composed by XORing the
+	bitmap for this entry with the bitmap in entry `x-y` (i.e.
+	the bitmap `y` entries before this one).
+
+	Note that this compression can be recursive. In order to
+	XOR this entry with a previous one, the previous entry needs
+	to be decompressed first, and so on.
+
+	The hard-limit for this offset is 160 (an entry can only be
+	xor'ed against one of the 160 entries preceding it). This
+	number is always positivea, and hence entries are always xor'ed
	                         ^
+	with **previous** bitmaps, not bitmaps that will come afterwards
+	in the index.

Clever. Why 160 though?

JGit implementation detail. It's the equivalent of the delta-window in `pack-objects`, for example. HINT HINT: in practice, JGit only looks 16 positions behind to find deltas, and we do the same. So the practical limit is 16.

harhar

+ - 2 bytes of RESERVED data (used right now for better packing).

What do they mean?

+ With an index at the end of the file, we can load only this index in memory,
+ allowing for very efficient access to all the available bitmaps lazily (we
+ have their offsets in the mmaped file).

Is there anything preventing you from mmap()ing the index also?
Yeah, this format allows you to easily do a SHA1 bsearch with a custom step to look up entries in the bitmap index, except for the fact that the index is not sorted by SHA1, so you'd need a linear search instead. :)

I decided against it because during most complex invocations of `pack-objects`, we perform a couple thousand commit lookups to see if they have a bitmap in the index, so it makes a lot of sense to load the index tightly into a hash table beforehand (which takes very little time, to be fair). We more than make up for the loading time by having much, much faster lookups. I felt it was the right tradeoff (JGit does the same, but in their case because they cannot mmap. :p)

+== Appendix A: Serialization format for an EWAH bitmap
+
+Ewah bitmaps are serialized in the protocol as the JAVAEWAH
+library, making them backwards compatible with the JGit
+implementation:
+
+	- 4-byte number of bits of the resulting UNCOMPRESSED bitmap
+
+	- 4-byte number of words of the COMPRESSED bitmap, when stored
+
+	- N x 8-byte words, as specified by the previous field
+
+	  This is the actual content of the compressed bitmap.
+
+	- 4-byte position of the current RLW for the compressed
+	  bitmap
+
+Note that the byte order for this serialization is not defined by
+default. The byte order for all the content in a serialized EWAH
+bitmap can be known by the byte order flags in the header of the
+bitmap index file.

Please document the RLW format here.

Har har. I was going to comment on your review of the Ewah patchset, but might as well do it here: the only thing I know about Ewah bitmaps is that they work. And I know this because I did extensive fuzz testing of my C port. Unfortunately, the original Java code I ported from has 0 comments, so any documentation here would have to be reverse-engineered.
Personally, I'd lean towards considering Ewah an external dependency (a black box); the headers for the library are commented accordingly, clearly explaining the interfaces while hiding implementation details. Of course, you're welcome to help me reverse-engineer the implementation, but I'm not sure this would be of much value. It'd be better to make sure it passes the extensive test suite of the Java version, and assume that Mr Lemire designed a sound format for the bitmaps.

Swiss kisses, vmg
Re: [PATCH 03/16] pack-objects: use a faster hash table
On Wed, Jun 26, 2013 at 12:48 AM, Junio C Hamano gits...@pobox.com wrote:

After this, the function returns. The original did not add to the table the object name we are looking at, but the new code first adds it to the table with the unconditional kh_put_sha1() above. Is a call to kh_del_sha1() missing here ...

No, this is not the case. That's the return case for when *the object was found because it already existed in the hash table* (hence we access it if we're excluding it, to tag it as excluded). We don't want to remove it from the hash table, because we're not the ones who inserted it. We only call `kh_del_sha1` in the cases where:

1. The object wasn't found.
2. We inserted its key into the hash table.
3. We later learnt that we don't really want to pack this object.

@@ -921,38 +916,42 @@ static int add_object_entry(const unsigned char *sha1, enum object_type type,
 		return 0;
 	}

-	if (!exclude && local && has_loose_object_nonlocal(sha1))
+	if (!exclude && local && has_loose_object_nonlocal(sha1)) {
+		kh_del_sha1(packed_objects, ix);
 		return 0;

... like this one, which seems to compensate for "ahh, after all we realize we do not want to add this one to the table"?
@@ -966,19 +965,30 @@ static int add_object_entry(const unsigned char *sha1, enum object_type type,
 		entry->in_pack_offset = found_offset;
 	}

-	if (object_ix_hashsz * 3 <= nr_objects * 4)
-		rehash_objects();
-	else
-		object_ix[-1 - ix] = nr_objects;
+	kh_value(packed_objects, ix) = entry;
+	kh_key(packed_objects, ix) = entry->idx.sha1;
+	objects[nr_objects++] = entry;
 	display_progress(progress_state, nr_objects);

-	if (name && no_try_delta(name))
-		entry->no_try_delta = 1;
-
 	return 1;
 }

+static int add_object_entry(const unsigned char *sha1, enum object_type type,
+			    const char *name, int exclude)
+{
+	if (add_object_entry_1(sha1, type, name_hash(name), exclude, NULL, 0)) {
+		struct object_entry *entry = objects[nr_objects - 1];
+
+		if (name && no_try_delta(name))
+			entry->no_try_delta = 1;
+
+		return 1;
+	}
+
+	return 0;
+}

It is somewhat unclear what we are getting from the split of the main part of this function into *_1(), other than that the *_1() function now has a very deep indentation inside "if (!found_pack)", which is always true because the caller always passes NULL for found_pack. Perhaps this is an unrelated refactoring that is needed for later steps and does not have anything to do with the use of the new hash function?

Yes, apologies for not making this clear. By refactoring into `_1`, you can see how `traverse_bitmap_commit_list` can use the `_1` version directly as a callback, to insert objects straight into the packing list without looking them up. This is very efficient because we can pass the whole API straight from the bitmap code:

1. The SHA1: we find it by simply looking up the `nth` SHA1 in the pack index (if we are yielding bit `n`).
2. The object type: we know it because we have type indexes that let us find the type of any given bit in the bitmap by AND-ing it with the index.
3. The hash for its name: we can look it up from the name-hash cache in the new bitmap format.
4. Exclude flag: we never exclude when working with bitmaps.
5. found_pack: all the bitmapped objects come from the same pack!
6. found_offset: we find it by simply looking up the `nth` offset in the pack index (if we are yielding bit `n`).

Boom! We filled the callback just from the data in a bitmap. Ain't that nice? Let me amend the commit message.
Re: [PATCH 10/16] pack-objects: use bitmaps when packing objects
On Wed, Jun 26, 2013 at 1:06 AM, Junio C Hamano gits...@pobox.com wrote:

@@ -83,6 +84,9 @@ static struct progress *progress_state;
 static int pack_compression_level = Z_DEFAULT_COMPRESSION;
 static int pack_compression_seen;

+static int bitmap_support;
+static int use_bitmap_index;

OK.

@@ -2131,6 +2135,10 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 		cache_max_small_delta_size = git_config_int(k, v);
 		return 0;
 	}
+	if (!strcmp(k, "pack.usebitmaps")) {
+		bitmap_support = git_config_bool(k, v);
+		return 0;
+	}

Hmph, so bitmap_support, not use_bitmap_index, keeps track of the user request? Somewhat confusing.

 	if (!strcmp(k, "pack.threads")) {
 		delta_search_threads = git_config_int(k, v);
 		if (delta_search_threads < 0)

@@ -2366,8 +2374,24 @@ static void get_object_list(int ac, const char **av)
 			die("bad revision '%s'", line);
 	}

+	if (use_bitmap_index) {
+		uint32_t size_hint;
+
+		if (!prepare_bitmap_walk(&revs, &size_hint)) {
+			khint_t new_hash_size = (size_hint * (1.0 / __ac_HASH_UPPER)) + 0.5;

What is __ac_HASH_UPPER? That is a very unusual name for a variable or a constant. Also, it is mildly annoying to see unnecessary use of float like this.
See the updated patch at: https://github.com/vmg/git/blob/vmg/bitmaps-master/builtin/pack-objects.c#L2422

+			kh_resize_sha1(packed_objects, new_hash_size);
+
+			nr_alloc = (size_hint + 63) & ~63;
+			objects = xrealloc(objects, nr_alloc * sizeof(struct object_entry *));
+
+			traverse_bitmap_commit_list(add_object_entry_1);
+			return;
+		}
+	}
+
 	if (prepare_revision_walk(&revs))
 		die("revision walk setup failed");
 	mark_edges_uninteresting(revs.commits, &revs, show_edge);
 	traverse_commit_list(&revs, show_commit, show_object, NULL);

@@ -2495,6 +2519,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			 N_("pack compression level")),
 		OPT_SET_INT(0, "keep-true-parents", &grafts_replace_parents,
 			 N_("do not hide commits by grafts"), 0),
+		OPT_BOOL(0, "bitmaps", &bitmap_support,
+			 N_("enable support for bitmap optimizations")),

Please match this with the name of the configuration variable, i.e. --use-bitmaps

 		OPT_END(),
 	};

@@ -2561,6 +2587,11 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (keep_unreachable && unpack_unreachable)
 		die("--keep-unreachable and --unpack-unreachable are incompatible.");

+	if (bitmap_support) {
+		if (use_internal_rev_list && pack_to_stdout)
+			use_bitmap_index = 1;

OK, so only when some internal condition is met, the user request to use bitmaps is honored, and the decision is kept in use_bitmap_index. It may be easier to read if you got rid of bitmap_support, set use_bitmap_index directly from the command line and config, and did this here instead:

	if (!(use_internal_rev_list && pack_to_stdout))
		use_bitmap_index = 0;

Yeah, I'm not particularly happy with the way these flags are implemented. I'll update this.
Re: [PATCH 13/16] repack: consider bitmaps when performing repacks
On Wed, Jun 26, 2013 at 1:00 AM, Junio C Hamano gits...@pobox.com wrote:

@@ -156,6 +156,11 @@
 do
 	fullbases="$fullbases pack-$name"
 	chmod a-w "$PACKTMP-$name.pack"
 	chmod a-w "$PACKTMP-$name.idx"
+
+	test -f "$PACKTMP-$name.bitmap" &&
+	chmod a-w "$PACKTMP-$name.bitmap" &&
+	mv -f "$PACKTMP-$name.bitmap" "$PACKDIR/pack-$name.bitmap"

If we see a temporary bitmap but somehow fail to move it to its final name, should we _ignore_ that error, or should we die, like the next two lines do?

I obviously decided against dying (as you can see in the patch, har har), because the bitmap is not required for the proper operation of the Git repository, unlike the packfile and the index.
Re: [PATCH 11/16] rev-list: add bitmap mode to speed up lists
I'm afraid I cannot reproduce the segfault locally (assuming you're performing the rev-list on the git/git repository). Could you please send me more information, and a core dump if possible?

On Tue, Jun 25, 2013 at 6:22 PM, Thomas Rast tr...@inf.ethz.ch wrote:

Vicent Marti tan...@gmail.com writes:

Calling `git rev-list --use-bitmaps [committish]` is the equivalent of `git rev-list --objects`, but the rev list is performed based on a bitmap result instead of using a manual counting-objects phase.

Why would we ever want to not --use-bitmaps, once it actually works? I.e., shouldn't this be the default if pack.usebitmaps is set (or possibly even core.usebitmaps for these things)?

These are some example timings for `torvalds/linux`:

$ time ../git/git rev-list --objects master >/dev/null

real	0m25.567s
user	0m25.148s
sys	0m0.384s

$ time ../git/git rev-list --use-bitmaps master >/dev/null

real	0m0.393s
user	0m0.356s
sys	0m0.036s

I see your badass numbers, and raise you a critical issue:

$ time git rev-list --use-bitmaps --count --left-right origin/pu...origin/next
Segmentation fault

real	0m0.408s
user	0m0.383s
sys	0m0.022s

It actually seems to be related solely to having negated commits in the walk:

thomas@linux-k42r:~/g(next u+65)$ time git rev-list --use-bitmaps --count origin/pu
32315

real	0m0.041s
user	0m0.034s
sys	0m0.006s

thomas@linux-k42r:~/g(next u+65)$ time git rev-list --use-bitmaps --count origin/pu ^origin/next
Segmentation fault

real	0m0.460s
user	0m0.214s
sys	0m0.244s

I also can't help noticing that the time spent generating the segfault would have sufficed to generate the answer the old way as well:

$ time git rev-list --count --left-right origin/pu...origin/next
189	125

real	0m0.409s
user	0m0.386s
sys	0m0.022s

Can we use the same trick to speed up merge-base computation and then --left-right? The latter is a component of __git_ps1 and can get somewhat slow in some cases, so it would be nice to make it really fast, too.
-- Thomas Rast trast@{inf,student}.ethz.ch