Re: [PATCH 4/3] ewah: adjust callers of ewah_read_mmap()

2018-06-15 Thread Jeff King
On Fri, Jun 15, 2018 at 02:23:44PM -0400, Derrick Stolee wrote:

> If we are considering changing the reachability bitmap, then I'm very
> intrigued. I think the number one thing to consider is to use the
> multi-pack-index as a reference point (with a stable object order) so the
> objects can be repacked independently from the reachability bitmap
> computation. If we are changing the model at that level, then it is worth
> thinking about other questions, like how we index the file or how we
> compress the bitmaps.

I'm open to a new format if it provides significant improvements over
the existing one. I think the existing bitmaps have served us well for
several years, but they do have a few weaknesses. Some of which I
mentioned before, but the most obvious one is that being very
pack-oriented they require repacking to update (and don't handle
cross-pack reachability at all). I know that doesn't fly for
Windows-sized repos at all, but it would also be nice if we could do
incremental updates more cheaply (e.g., after every push instead of just
once a day).

The Roaring stuff looks really interesting. I'm curious about the stable
object order you guys use. Because EWAH is basically run-length-encoding,
it benefits hugely from having the bitmaps in pack order (where there's
a enormous locality with respect to reachability) as opposed to sha1
order (where it's essentially random).

Is your stable object order based on traversing the commit graph? Or
does Roaring do a sufficiently better job of compressing the jumbled
sha1 order?

-Peff


Re: [PATCH 4/3] ewah: adjust callers of ewah_read_mmap()

2018-06-15 Thread Derrick Stolee

On 6/15/2018 1:31 PM, Jeff King wrote:

On Fri, Jun 15, 2018 at 09:41:46AM -0700, Junio C Hamano wrote:


Derrick Stolee  writes:


On 6/14/2018 11:44 PM, Jeff King wrote:

The return value of ewah_read_mmap() is now an ssize_t,
since we could (in theory) process up to 32GB of data. This
would never happen in practice, but a corrupt or malicious
.bitmap or index file could convince us to do so.

Aside: Why a 32GB limit? Usually I see 32-bit issues limiting by 2
(signed) or 4 GB (unsigned). Is there something specifically in the
bitmap format that stops at this size?

The proposed log message for 1/3 has this bit

   - check the size for integer overflow (this should be
 impossible on 64-bit, as the size is given as a 32-bit
 count of 8-byte words, but is possible on a 32-bit
 system)

4 Giga 8-byte words makes 32 Giga bytes, I'd guess.

Yes, exactly. It's definitely an odd size. I think using the same
varints we use in the packfile would be more efficient (and they're
already variable-length records), though we tend to have few enough of
these that it probably doesn't matter.

I think a more useful change for the bitmap format would be some kind of
index. IIRC, right now readers have to linearly scan the whole file in
order to use a single bitmap.

With Stolee's commit-metadata files, we could potentially move to
storing reachability bitmaps as a data element there. But there are two
complications:

  - the bitmaps themselves are dependent on having an ordered list of
objects (one per bit) to talk about. And that's why they're
integrated with packfiles, since that provides a constrained universe
with a set ordering.

  - the existing storage isn't entirely independent between bitmaps. Some
of them are basically "deltas" off of nearby bitmaps.


The VSTS reachability bitmap is different from the Git bitmap in a few ways.

1. It uses Roaring+Run bitmaps [1] instead of EWAH bitmap. This format 
is simpler (in my opinion) in several ways, especially in how it splits 
the bitmap into 16-bit address ranges and compresses each on its own. 
This makes set containment queries much faster, as we can navigate 
directly to the region that is important (and then binary search if that 
chunk is an "array" or "run" chunk). I say this is simpler because I can 
explain the entire compression format to you in five minutes and a 
whiteboard. The paper [2] is a simple enough read (start at the "Roaring 
Bitmap" section at the end of page 4).


2. Our file uses a chunk-based approach similar to the commit-graph 
file: one chunk is simply a list of the commits that have bitmaps. 
Another chunk is the raw binary data of all bitmaps concatenated 
together. The last chunk connects the other two chunks by a) providing 
an offset into the binary data for the bitmap at that commit, and b) the 
commit that provides a delta base for the bitmap.


If we are considering changing the reachability bitmap, then I'm very 
intrigued. I think the number one thing to consider is to use the 
multi-pack-index as a reference point (with a stable object order) so 
the objects can be repacked independently from the reachability bitmap 
computation. If we are changing the model at that level, then it is 
worth thinking about other questions, like how we index the file or how 
we compress the bitmaps.


Thanks,
-Stolee

[1] https://roaringbitmap.org/about/

[2] https://arxiv.org/abs/1603.06549
    Consistently faster and smaller compressed bitmaps with Roaring


Re: [PATCH 4/3] ewah: adjust callers of ewah_read_mmap()

2018-06-15 Thread Jeff King
On Fri, Jun 15, 2018 at 09:41:46AM -0700, Junio C Hamano wrote:

> Derrick Stolee  writes:
> 
> > On 6/14/2018 11:44 PM, Jeff King wrote:
> >> The return value of ewah_read_mmap() is now an ssize_t,
> >> since we could (in theory) process up to 32GB of data. This
> >> would never happen in practice, but a corrupt or malicious
> >> .bitmap or index file could convince us to do so.
> >
> > Aside: Why a 32GB limit? Usually I see 32-bit issues limiting by 2
> > (signed) or 4 GB (unsigned). Is there something specifically in the
> > bitmap format that stops at this size?
> 
> The proposed log message for 1/3 has this bit
> 
>   - check the size for integer overflow (this should be
> impossible on 64-bit, as the size is given as a 32-bit
> count of 8-byte words, but is possible on a 32-bit
> system)
> 
> 4 Giga 8-byte words makes 32 Giga bytes, I'd guess.

Yes, exactly. It's definitely an odd size. I think using the same
varints we use in the packfile would be more efficient (and they're
already variable-length records), though we tend to have few enough of
these that it probably doesn't matter.

I think a more useful change for the bitmap format would be some kind of
index. IIRC, right now readers have to linearly scan the whole file in
order to use a single bitmap.

With Stolee's commit-metadata files, we could potentially move to
storing reachability bitmaps as a data element there. But there are two
complications:

 - the bitmaps themselves are dependent on having an ordered list of
   objects (one per bit) to talk about. And that's why they're
   integrated with packfiles, since that provides a constrained universe
   with a set ordering.

 - the existing storage isn't entirely independent between bitmaps. Some
   of them are basically "deltas" off of nearby bitmaps.

-Peff


Re: [PATCH 4/3] ewah: adjust callers of ewah_read_mmap()

2018-06-15 Thread Junio C Hamano
Jeff King  writes:

> On Thu, Jun 14, 2018 at 11:28:50PM -0400, Jeff King wrote:
>
>> Yep. We also fail to check if we even have enough bytes to read the
>> buffer_size in the first place.
>> 
>> Here are some patches. The first one fixes the problem you found. The
>> second one drops some dead code that has a related problem. And the
>> third just drops some dead code that I noticed in the same file. :)
>> 
>>   [1/3]: ewah_read_mmap: bounds-check mmap reads
>>   [2/3]: ewah: drop ewah_deserialize function
>>   [3/3]: ewah: drop ewah_serialize_native function
>
> Actually, we'd want this one on top. Arguably it could be squashed into
> patch 1.
>
> -- >8 --
> Subject: ewah: adjust callers of ewah_read_mmap()
>
> The return value of ewah_read_mmap() is now an ssize_t,
> since we could (in theory) process up to 32GB of data. This
> would never happen in practice, but a corrupt or malicious
> .bitmap or index file could convince us to do so.
>
> Let's make sure that we don't stuff the value into an int,
> which would cause us to incorrectly move our pointer
> forward.  We'd always move too little, since negative values
> are used for reporting errors. So the worst case is just
> that we end up reporting a corrupt file, not an
> out-of-bounds read.
>
> Signed-off-by: Jeff King 
> ---

Makes sense.

>  dir.c | 3 ++-
>  pack-bitmap.c | 2 +-
>  2 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/dir.c b/dir.c
> index 61b513a078..d5185660f1 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -2725,7 +2725,8 @@ struct untracked_cache *read_untracked_extension(const 
> void *data, unsigned long
>   struct read_data rd;
>   const unsigned char *next = data, *end = (const unsigned char *)data + 
> sz;
>   const char *ident;
> - int ident_len, len;
> + int ident_len;
> + ssize_t len;
>   const char *exclude_per_dir;
>  
>   if (sz <= 1 || end[-1] != '\0')
> diff --git a/pack-bitmap.c b/pack-bitmap.c
> index 369bf69d75..2f27b10e35 100644
> --- a/pack-bitmap.c
> +++ b/pack-bitmap.c
> @@ -125,7 +125,7 @@ static struct ewah_bitmap *read_bitmap_1(struct 
> bitmap_index *index)
>  {
>   struct ewah_bitmap *b = ewah_pool_new();
>  
> - int bitmap_size = ewah_read_mmap(b,
> + ssize_t bitmap_size = ewah_read_mmap(b,
>   index->map + index->map_pos,
>   index->map_size - index->map_pos);


Re: [PATCH 4/3] ewah: adjust callers of ewah_read_mmap()

2018-06-15 Thread Junio C Hamano
Derrick Stolee  writes:

> On 6/14/2018 11:44 PM, Jeff King wrote:
>> The return value of ewah_read_mmap() is now an ssize_t,
>> since we could (in theory) process up to 32GB of data. This
>> would never happen in practice, but a corrupt or malicious
>> .bitmap or index file could convince us to do so.
>
> Aside: Why a 32GB limit? Usually I see 32-bit issues limiting by 2
> (signed) or 4 GB (unsigned). Is there something specifically in the
> bitmap format that stops at this size?

The proposed log message for 1/3 has this bit

  - check the size for integer overflow (this should be
impossible on 64-bit, as the size is given as a 32-bit
count of 8-byte words, but is possible on a 32-bit
system)

4 Giga 8-byte words makes 32 Giga bytes, I'd guess.


Re: [PATCH 4/3] ewah: adjust callers of ewah_read_mmap()

2018-06-15 Thread Derrick Stolee

On 6/14/2018 11:44 PM, Jeff King wrote:

The return value of ewah_read_mmap() is now an ssize_t,
since we could (in theory) process up to 32GB of data. This
would never happen in practice, but a corrupt or malicious
.bitmap or index file could convince us to do so.


Aside: Why a 32GB limit? Usually I see 32-bit issues limiting by 2 
(signed) or 4 GB (unsigned). Is there something specifically in the 
bitmap format that stops at this size?


Thanks,
-Stolee