On Sun, Jul 7, 2013 at 3:14 AM, Jeff King <p...@peff.net> wrote:
> The pack revindex stores the offsets of the objects in the
> pack in sorted order, allowing us to easily find the on-disk
> size of each object. To compute it, we populate an array
> with the offsets from the sha1-sorted idx file, and then use
> qsort to order it by offsets.
> That does O(n log n) offset comparisons, and profiling shows
> that we spend most of our time in cmp_offset. However, since
> we are sorting on a simple off_t, we can use numeric sorts
> that perform better. A radix sort can run in O(k*n), where k
> is the number of "digits" in our number. For a 64-bit off_t,
> using 16-bit "digits" gives us k=4.
Did you try the simple bucket sort Colby now uses in JGit?
The sort is pretty simple:
bucket_size = pack_length / object_count;
buckets = malloc(object_count * sizeof(int));
foreach obj in idx:
push_chain(buckets[obj.offset / bucket_size], obj.idx_nth);
insertion sort by offset
We observed on linux.git that most buckets have an average number of
objects. IIRC the bucket_size was ~201 bytes and most buckets had very
few objects each. For lookups we keep the bucket_size parameter and a
bucket index table. This arrangement uses 8 bytes per object in the
reverse index, making it very memory efficient. Searches are typically
below O(log N) time because each bucket has <log N entries.
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html