On 13/08/18 04:33, Christian Couder wrote:
> On Mon, Aug 13, 2018 at 3:14 AM, Ramsay Jones
[snip]
>> And neither the sha1 or str hash-maps are destroyed here.
>> (That is not necessarily a problem, of course! ;-) )
> 
> The instances are declared as static:
> 
>   static khash_sha1 *island_marks;
> 
>   static kh_str_t *remote_islands;
> 
> so it maybe ok.

Yes, that's fine.

> 
>>> +struct island_bitmap {
>>> +     uint32_t refcount;
>>> +     uint32_t bits[];
>>
>> Use FLEX_ARRAY here? We are slowly moving toward requiring
>> certain C99 features, but I can't remember a flex array
>> weather-balloon patch.
> 
> This was already discussed by Junio and Peff there:
> 
> https://public-inbox.org/git/20180727130229.gb18...@sigill.intra.peff.net/
> 

That is a fine discussion, without a firm conclusion, but I don't
think you can simply do nothing here:

  $ cat -n junk.c
       1        #include <stdint.h>
       2        
       3        struct island_bitmap {
       4                uint32_t refcount;
       5                uint32_t bits[];
       6        };
       7        
  $ gcc --std=c89 --pedantic -c junk.c
  junk.c:5:11: warning: ISO C90 does not support flexible array members 
[-Wpedantic]
    uint32_t bits[];
             ^~~~
  $ gcc --std=c99 --pedantic -c junk.c
  $ 
  
>>> +};
> 
>>> +int in_same_island(const struct object_id *trg_oid, const struct object_id 
>>> *src_oid)
>>
>> Hmm, what does the trg_ prefix stand for?
>>
>>> +{
>>> +     khiter_t trg_pos, src_pos;
>>> +
>>> +     /* If we aren't using islands, assume everything goes together. */
>>> +     if (!island_marks)
>>> +             return 1;
>>> +
>>> +     /*
>>> +      * If we don't have a bitmap for the target, we can delta it
>>
>> ... Ah, OK, trg_ => target.
> 
> I am ok to replace "trg" with "target" (or maybe "dst"? or something
> else) and "src" with "source" if you think it would make things
> clearer.

If it had been dst_ (or target), I would not have had a 'huh?'
moment, but it is not all that important.

> 
>>> +static void add_ref_to_island(const char *island_name, const struct 
>>> object_id *oid)
>>> +{
>>> +     uint64_t sha_core;
>>> +     struct remote_island *rl = NULL;
>>> +
>>> +     int hash_ret;
>>> +     khiter_t pos = kh_put_str(remote_islands, island_name, &hash_ret);
>>> +
>>> +     if (hash_ret) {
>>> +             kh_key(remote_islands, pos) = xstrdup(island_name);
>>> +             kh_value(remote_islands, pos) = xcalloc(1, sizeof(struct 
>>> remote_island));
>>> +     }
>>> +
>>> +     rl = kh_value(remote_islands, pos);
>>> +     oid_array_append(&rl->oids, oid);
>>> +
>>> +     memcpy(&sha_core, oid->hash, sizeof(uint64_t));
>>> +     rl->hash += sha_core;
>>
>> Hmm, so the first 64-bits of the oid of each ref that is part of
>> this island is added together as a 'hash' for the island. And this
>> is used to de-duplicate the islands? Any false positives? (does it
>> matter - it would only affect performance, not correctness, right?)
> 
> I would think that a false positive from pure chance is very unlikely.
> We would need to approach billions of delta islands (as 2 to the power
> 64/2 is in the order of billions) for the probability to be
> significant. GitHub has less than 50 millions users and it is very
> unlikely that a significant proportion of these users will fork the
> same repo.
> 
> Now if there is a false positive because two forks have exactly the
> same refs, then it is not a problem if they are considered the same,
> because they are actually the same.

Yep, good point.

ATB,
Ramsay Jones

Reply via email to