On Mon, Oct 28, 2013 at 8:45 PM, Karsten Blees <karsten.bl...@gmail.com> wrote:
>> The new `hashmap.c` covers the first case quite well (albeit slightly
>> more verbosely than I'd like), but in the second case it doesn't quite
>> work. Since the new hash needs to embed the "struct hashmap_entry" on
>> all its values (to allow for separate chaining), having it map to
>> `int` keys requires a struct like this:
>>     struct sha1_position {
>>         struct hashmap_entry ent; /* embedded: next pointer + hash */
>>         int position;
>>     };
> Hmm...isn't that position rather an index into two separately maintained 
> arrays? So you'd rather have:
>     struct sha1_position {
>         struct hashmap_entry ent; /* embedded: next pointer + hash */
>         uint32_t pack_name_hash;
>         struct object *object;
>     };

No, this is not quite right. We use the separate arrays because the
normal operation mode is to index by position (e.g. we need the nth
object in the extended index); the hash table is an auxiliary
structure that reverses that indexing (e.g. what position does this
SHA1 have in the extended index?). The information that is always
required to construct bitmaps is the position, so the positions are
what we need to store in the hash table.

>> khash on the other hand is capable of storing the position values as
>> part of the hash table itself (i.e. `int **buckets`), and saves us
>> from thousands of bytes of allocations + indirection.
> However, khash keeps separate arrays for flags, keys and values, all of them 
> overallocated by 1 / load factor (so there's lots of unused space). 
> ext_index.objects and ext_index.hashes are also overallocated by the usual 
> alloc_nr() factor 1.5.

FWIW, the flags in khash are bit-packed, so even when overallocated
they take far less space than an array of pointers would.

> Regarding memory consumption, I think both implementations will be pretty 
> similar. Hashmap allocates many small regions while khash re-allocates a few 
> big ones...I really don't know which is better, ideally entries would be 
> allocated in chunks to minimize both heap overhead and memcpy disadvantages.

I agree: both implementations likely have similar enough memory
characteristics that the difference doesn't matter.

> Regarding performance, khash uses open addressing, which requires more key 
> comparisons (O(1/(1-load_factor))) than chaining (O(1+load_factor)). However, 
> any measurable differences will most likely be dwarfed by IO costs in this 
> particular use case.

I don't think this is true. If you actually run a few insertion and
lookup benchmarks comparing the two implementations, you'll find khash
to be around 30% faster for most workloads (venturing a guess from
past experience). I am obviously not the author of khash, but I've
found that the theoretical increase in key comparisons is completely
dwarfed by the benefit of increased cache locality during the probing
phase. I still haven't found a faster general-purpose C hash table
implementation than khash, which is why I normally use it despite the
worrisome preprocessor crash-party going on in that header file.

> Btw., pack-objects.c::rehash_objects() in patch 03 unnecessarily checks for 
> duplicates. That's probably the reason for the high hashcmp times you found 
> in the first round of the patch series.

Patch 03 is a refactoring -- the duplicate-checking code has been in
pack-objects.c for years. I am not sure duplicate checking is
superfluous at all: there are many cases where objects could be
inserted twice during a pack-objects run, and you really don't want to
generate packfiles with duplicate objects.

Thanks for the feedback!
