On Mon, Jul 8, 2013 at 12:57 AM, Jeff King <p...@peff.net> wrote:
> On Sun, Jul 07, 2013 at 04:52:23PM -0700, Shawn O. Pearce wrote:
>
>> On Sun, Jul 7, 2013 at 3:14 AM, Jeff King <p...@peff.net> wrote:
>> > The pack revindex stores the offsets of the objects in the
>> > pack in sorted order, allowing us to easily find the on-disk
>> > size of each object. To compute it, we populate an array
>> > with the offsets from the sha1-sorted idx file, and then use
>> > qsort to order it by offsets.
>> >
>> > That does O(n log n) offset comparisons, and profiling shows
>> > that we spend most of our time in cmp_offset. However, since
>> > we are sorting on a simple off_t, we can use numeric sorts
>> > that perform better. A radix sort can run in O(k*n), where k
>> > is the number of "digits" in our number. For a 64-bit off_t,
>> > using 16-bit "digits" gives us k=4.
>>
>> Did you try the simple bucket sort Colby now uses in JGit?
>>
>> The sort is pretty simple:
>>
>>   bucket_size = pack_length / object_count;
>>   buckets = malloc(object_count * sizeof(int));
>>
>>   foreach obj in idx:
>>     push_chain(buckets[obj.offset / bucket_size], obj.idx_nth);
>>
>>   foreach bucket:
>>     insertion sort by offset
>
> I did do something similar (though I flattened my buckets into a single
> list afterwards), but I ended up closer to 700ms (down from 830ms, but
> with the radix sort around 200ms). It's entirely possible I screwed up
> something in the implementation (the bucket insertion can be done in a
> lot of different ways, many of which are terrible), but I didn't keep a
> copy of that attempt. If you try it and have better numbers, I'd be
> happy to see them.
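For concreteness, here is a minimal sketch of the LSD radix sort described in the quoted patch text: 64-bit offsets sorted with 16-bit "digits", i.e. k=4 counting-sort passes. This is not the patch's actual code; the name sort_offsets and the flat uint64_t array are illustrative only (the real revindex carries index positions alongside the offsets).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch: LSD radix sort of 64-bit offsets using
 * 16-bit digits, so exactly 4 counting-sort passes (k = 4). */
static void sort_offsets(uint64_t *a, size_t n)
{
	uint64_t *tmp = malloc(n * sizeof(*tmp));
	size_t counts[65536];
	uint64_t *from = a, *to = tmp;
	int shift;

	for (shift = 0; shift < 64; shift += 16) {
		size_t i, pos, c;

		/* count how many values have each 16-bit digit */
		memset(counts, 0, sizeof(counts));
		for (i = 0; i < n; i++)
			counts[(from[i] >> shift) & 0xffff]++;

		/* prefix-sum the counts into output positions */
		for (i = 0, pos = 0; i < 65536; i++) {
			c = counts[i];
			counts[i] = pos;
			pos += c;
		}

		/* stable scatter into the other buffer */
		for (i = 0; i < n; i++)
			to[counts[(from[i] >> shift) & 0xffff]++] = from[i];

		/* ping-pong buffers between passes */
		{
			uint64_t *swap = from;
			from = to;
			to = swap;
		}
	}
	/* 4 passes is an even number of swaps, so the sorted
	 * result has landed back in 'a' */
	free(tmp);
}
```

Because each pass is a stable counting sort, after the high-order pass the array is fully ordered; the even pass count is what lets the result end up back in the caller's buffer without a final copy.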
Colby's sort in Java comes in around 450ms for linux.git, so it sounds
like your implementation was doing something suboptimal. But as I
thought about it this morning, a radix sort for most pack files should
run with k=2 and take only O(2*N) time. It is a very efficient sort for
this data. Colby and I didn't even try a radix sort, and I suspect it
would out-perform the bucket sort we do now.

>> We observed on linux.git that most buckets have an average number of
>> objects. IIRC the bucket_size was ~201 bytes and most buckets had very
>> few objects each. For lookups we keep the bucket_size parameter and a
>> bucket index table. This arrangement uses 8 bytes per object in the
>> reverse index, making it very memory efficient. Searches are typically
>> below O(log N) time because each bucket has <log N entries.
>
> I didn't measure lookups at all; I was focused on time to build the
> index. So if there were benefits there that make up for a longer setup
> time, I wouldn't have measured them (of course, we also care about the
> case with few lookups, so it would be a tradeoff).

We didn't measure lookup times either. Colby did compute a histogram of
bucket sizes and showed that nearly all buckets were significantly
smaller than log N, so lookups take under log N time even though they
are a simple linear scan through a bucket's elements. Colby considered
doing binary search within a bucket but didn't bother given how small
the buckets are. So our lookup-time benefit is theoretical.

The way JGit implements clones, we tend not to perform N lookups in the
reverse index; it is usually under 1000 lookups. That makes it harder to
see any noticeable benefit from decreased lookup time.

> You could also leave
> each bucket unsorted and only lazily sort it when a lookup hits the
> bucket, which might help that case (I didn't look to see if you do that
> in JGit).

We didn't do that in JGit; the sort is done at initialization.
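To make the comparison concrete, here is a rough C rendering of the bucket sort pseudocode quoted earlier (the JGit version is in Java and differs in detail). The struct name revidx_entry, the field names, and the counting-scatter layout are my own illustration; the idea from the quoted sketch is the same: distribute entries into object_count buckets of width pack_length / object_count, then insertion-sort each (typically tiny) bucket by offset.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical entry type: offset in the pack plus the object's
 * position in the sha1-sorted idx. */
struct revidx_entry {
	uint64_t offset;   /* offset of the object in the pack */
	uint32_t idx_nth;  /* position in the sha1-sorted idx */
};

/* Illustrative bucket sort: n buckets, counting-sort style
 * distribution, then insertion sort inside each bucket. */
static void bucket_sort(struct revidx_entry *e, size_t n, uint64_t pack_len)
{
	/* +1 guarantees offset / bucket_size < n for offset < pack_len */
	uint64_t bucket_size = pack_len / n + 1;
	size_t *start = calloc(n + 1, sizeof(*start));
	size_t *cur = malloc(n * sizeof(*cur));
	struct revidx_entry *out = malloc(n * sizeof(*out));
	size_t b, i;

	/* count entries per bucket */
	for (i = 0; i < n; i++)
		start[e[i].offset / bucket_size + 1]++;
	/* prefix-sum counts into bucket start positions */
	for (b = 1; b <= n; b++)
		start[b] += start[b - 1];
	for (b = 0; b < n; b++)
		cur[b] = start[b];
	/* scatter entries into their buckets */
	for (i = 0; i < n; i++)
		out[cur[e[i].offset / bucket_size]++] = e[i];

	/* insertion sort within each (usually very small) bucket */
	for (b = 0; b < n; b++) {
		for (i = start[b] + 1; i < start[b + 1]; i++) {
			struct revidx_entry tmp = out[i];
			size_t j = i;
			while (j > start[b] && out[j - 1].offset > tmp.offset) {
				out[j] = out[j - 1];
				j--;
			}
			out[j] = tmp;
		}
	}

	memcpy(e, out, n * sizeof(*e));
	free(out);
	free(cur);
	free(start);
}
```

Keeping the start[] array around after the sort is what gives the 8-bytes-per-object lookup structure mentioned above: a lookup divides the query offset by bucket_size and scans only that bucket's range.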
But given the remark I just made about clones doing only a few lookups,
we may want to defer the sort. IIRC the sort is about half of our
initialization cost.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html