Re: [PATCH v3 2/4] eoie: add End of Index Entry (EOIE) extension

2018-09-08 Thread Martin Ågren
On Sat, 8 Sep 2018 at 16:04, Ben Peart  wrote:
> On 9/8/2018 2:29 AM, Martin Ågren wrote:
> > Maybe it all works out, e.g., so that when someone (brian) merges a
> > NewHash and runs the testsuite, this will fail consistently and in a
> > safe way. But I wonder if it would be too hard to avoid the hardcoded 24
> > already now.
>
> I can certainly change this to be:
>
> #define EOIE_SIZE (4 + GIT_SHA1_RAWSZ)
>
> which should (hopefully) make it easier to find this hard coded hash
> length in the sea of hard coded "20" and "160" (bits) littered through
> the codebase.

Yeah, that seems more grep-friendly.

Martin


Re: [PATCH v3 2/4] eoie: add End of Index Entry (EOIE) extension

2018-09-08 Thread Ben Peart




On 9/8/2018 2:29 AM, Martin Ågren wrote:

On Fri, 7 Sep 2018 at 22:24, Ben Peart  wrote:

Ben Peart  writes:



- 160-bit SHA-1 over the extension types and their sizes (but not
their contents).  E.g. if we have "TREE" extension that is N-bytes
long, "REUC" extension that is M-bytes long, followed by "EOIE",
then the hash would be:



The purpose of the SHA isn't to detect disk corruption (we already have
a SHA for the entire index that can serve that purpose) but to help
ensure that this was actually a valid EOIE extension and not a lucky
random set of bytes. [...]



+#define EOIE_SIZE 24 /* <4-byte offset> + <20-byte hash> */



+the_hash_algo->init_fn();
+while (src_offset < mmap_size - the_hash_algo->rawsz - 
EOIE_SIZE_WITH_HEADER) {

[...]

+the_hash_algo->final_fn(hash, );
+if (hashcmp(hash, (unsigned char *)index))
+return 0;
+
+/* Validate that the extension offsets returned us back to the eoie 
extension. */
+if (src_offset != mmap_size - the_hash_algo->rawsz - EOIE_SIZE_WITH_HEADER)
+return 0;


Besides the issue you and Junio discussed with "should we document this
as being SHA-1 or NewHash" (or "the hash algo"), it seems to me that
this implementation is living somewhere between using SHA-1 and "the
hash algo". The hashing uses `the_hash_algo`, but the hash size is
hardcoded at 20 bytes.

Maybe it all works out, e.g., so that when someone (brian) merges a
NewHash and runs the testsuite, this will fail consistently and in a
safe way. But I wonder if it would be too hard to avoid the hardcoded 24
already now.

Martin



I can certainly change this to be:

#define EOIE_SIZE (4 + GIT_SHA1_RAWSZ)

which should (hopefully) make it easier to find this hard coded hash 
length in the sea of hard coded "20" and "160" (bits) littered through 
the codebase.


Re: [PATCH v3 2/4] eoie: add End of Index Entry (EOIE) extension

2018-09-08 Thread Martin Ågren
On Fri, 7 Sep 2018 at 22:24, Ben Peart  wrote:
> > Ben Peart  writes:

> >> - 160-bit SHA-1 over the extension types and their sizes (but not
> >> their contents).  E.g. if we have "TREE" extension that is N-bytes
> >> long, "REUC" extension that is M-bytes long, followed by "EOIE",
> >> then the hash would be:

> The purpose of the SHA isn't to detect disk corruption (we already have
> a SHA for the entire index that can serve that purpose) but to help
> ensure that this was actually a valid EOIE extension and not a lucky
> random set of bytes. [...]

> >> +#define EOIE_SIZE 24 /* <4-byte offset> + <20-byte hash> */

> >> +the_hash_algo->init_fn();
> >> +while (src_offset < mmap_size - the_hash_algo->rawsz - 
> >> EOIE_SIZE_WITH_HEADER) {
[...]
> >> +the_hash_algo->final_fn(hash, );
> >> +if (hashcmp(hash, (unsigned char *)index))
> >> +return 0;
> >> +
> >> +/* Validate that the extension offsets returned us back to the eoie 
> >> extension. */
> >> +if (src_offset != mmap_size - the_hash_algo->rawsz - 
> >> EOIE_SIZE_WITH_HEADER)
> >> +return 0;

Besides the issue you and Junio discussed with "should we document this
as being SHA-1 or NewHash" (or "the hash algo"), it seems to me that
this implementation is living somewhere between using SHA-1 and "the
hash algo". The hashing uses `the_hash_algo`, but the hash size is
hardcoded at 20 bytes.

Maybe it all works out, e.g., so that when someone (brian) merges a
NewHash and runs the testsuite, this will fail consistently and in a
safe way. But I wonder if it would be too hard to avoid the hardcoded 24
already now.

Martin


Re: [PATCH v3 2/4] eoie: add End of Index Entry (EOIE) extension

2018-09-07 Thread Ben Peart




On 9/7/2018 1:55 PM, Junio C Hamano wrote:

Ben Peart  writes:


The extension consists of:

- 32-bit offset to the end of the index entries

- 160-bit SHA-1 over the extension types and their sizes (but not
their contents).  E.g. if we have "TREE" extension that is N-bytes
long, "REUC" extension that is M-bytes long, followed by "EOIE",
then the hash would be:

SHA-1("TREE" +  +
"REUC" + )


I didn't look at the documentation patch in the larger context, but
please make sure that it is clear to the readers that these fixed
width integers "binary representations" use network byte order.



At the top of the documentation it says "All binary numbers are in 
network byte order" and that is not repeated for any of the other 
sections that are documenting the file format.



I briefly wondered if the above should include

 + "EOIE" + 

as it is pretty much common file format design to include the header
part of the checksum record (with checksum values padded out with NUL
bytes) when you define a record to hold the checksum of the entire
file.  Since this does not protect the contents of each section from
bit-flipping, adding the data for EOIE itself in the sum does not
give us much (iow, what I am adding above is a constant that does
not contribute any entropy).

How big is a typical TREE extension in _your_ work repository
housing Windows sources?  I am guessing that replacing SHA-1 with
something faster (as this is not about security but is about disk
corruption) and instead hash also the contents of these sections
would NOT help all that much in the performance department, as
having to page them in to read through would already consume
non-trivial amount of time, and that is why you are not hashing the
contents.



The purpose of the SHA isn't to detect disk corruption (we already have 
a SHA for the entire index that can serve that purpose) but to help 
ensure that this was actually a valid EOIE extension and not a lucky 
random set of bytes.  I had used leading and trailing signature bytes 
along with the length and version bytes to validate it was an actual 
EOIE extension but you suggested [1] that I use a SHA of the 4 byte 
extension type + 4 byte extension length instead so I rewrote it that way.


[1] 
https://public-inbox.org/git/xmqq1sl017dw@gitster.mtv.corp.google.com/



+   /*
+* CACHE_EXT_ENDOFINDEXENTRIES must be written as the last entry before 
the SHA1


s/SHA1/trailing checksum/ or something so that we can withstand
NewHash world order?



I thought about this but in the document elsewhere it refers to it as 
"160-bit SHA-1 over the content of the index file before this checksum." 
and there are at least a dozen other references to "SHA-1" so I figured 
we can fix them all at the same time when we have a new/better name. :-)



+* so that it can be found and processed before all the index entries 
are
+* read.
+*/
+   if (!strip_extensions && offset && 
!git_env_bool("GIT_TEST_DISABLE_EOIE", 0)) {
+   struct strbuf sb = STRBUF_INIT;
+
+   write_eoie_extension(, _c, offset);
+   err = write_index_ext_header(, NULL, newfd, 
CACHE_EXT_ENDOFINDEXENTRIES, sb.len) < 0
|| ce_write(, newfd, sb.buf, sb.len) < 0;
strbuf_release();
if (err)


OK.


+#define EOIE_SIZE 24 /* <4-byte offset> + <20-byte hash> */
+#define EOIE_SIZE_WITH_HEADER (4 + 4 + EOIE_SIZE) /* <4-byte signature> + <4-byte 
length> + EOIE_SIZE */
+
+#ifndef NO_PTHREADS
+static unsigned long read_eoie_extension(void *mmap, size_t mmap_size)
+{
+   /*
+* The end of index entries (EOIE) extension is guaranteed to be last
+* so that it can be found by scanning backwards from the EOF.
+*
+* "EOIE"
+* <4-byte length>
+* <4-byte offset>
+* <20-byte hash>
+*/
+   const char *index, *eoie = (const char *)mmap + mmap_size - 
GIT_SHA1_RAWSZ - EOIE_SIZE_WITH_HEADER;
+   uint32_t extsize;
+   unsigned long offset, src_offset;
+   unsigned char hash[GIT_MAX_RAWSZ];
+   git_hash_ctx c;
+
+   /* validate the extension signature */
+   index = eoie;
+   if (CACHE_EXT(index) != CACHE_EXT_ENDOFINDEXENTRIES)
+   return 0;
+   index += sizeof(uint32_t);
+
+   /* validate the extension size */
+   extsize = get_be32(index);
+   if (extsize != EOIE_SIZE)
+   return 0;
+   index += sizeof(uint32_t);


Do we know we have at least 8-byte to consume to perform the above
two checks, or is that something we need to verify at the beginning
of this function?  Better yet, as we know that a correct EOIE with
its own header is 28-byte long, we probably should abort if
mmap_size is smaller than that.



I'll add that additional test.


+   /*
+* Validate the offset we're going to look for the first extension
+* signature is after the index header and 

Re: [PATCH v3 2/4] eoie: add End of Index Entry (EOIE) extension

2018-09-07 Thread Junio C Hamano
Ben Peart  writes:

> The extension consists of:
>
> - 32-bit offset to the end of the index entries
>
> - 160-bit SHA-1 over the extension types and their sizes (but not
> their contents).  E.g. if we have "TREE" extension that is N-bytes
> long, "REUC" extension that is M-bytes long, followed by "EOIE",
> then the hash would be:
>
> SHA-1("TREE" +  +
>   "REUC" + )

I didn't look at the documentation patch in the larger context, but
please make sure that it is clear to the readers that these fixed
width integers "binary representations" use network byte order.

I briefly wondered if the above should include

+ "EOIE" + 

as it is pretty much common file format design to include the header
part of the checksum record (with checksum values padded out with NUL
bytes) when you define a record to hold the checksum of the entire
file.  Since this does not protect the contents of each section from
bit-flipping, adding the data for EOIE itself in the sum does not
give us much (iow, what I am adding above is a constant that does
not contribute any entropy).

How big is a typical TREE extension in _your_ work repository
housing Windows sources?  I am guessing that replacing SHA-1 with
something faster (as this is not about security but is about disk
corruption) and instead hash also the contents of these sections
would NOT help all that much in the performance department, as
having to page them in to read through would already consume
non-trivial amount of time, and that is why you are not hashing the
contents.

> + /*
> +  * CACHE_EXT_ENDOFINDEXENTRIES must be written as the last entry before 
> the SHA1

s/SHA1/trailing checksum/ or something so that we can withstand
NewHash world order?

> +  * so that it can be found and processed before all the index entries 
> are
> +  * read.
> +  */
> + if (!strip_extensions && offset && 
> !git_env_bool("GIT_TEST_DISABLE_EOIE", 0)) {
> + struct strbuf sb = STRBUF_INIT;
> +
> + write_eoie_extension(, _c, offset);
> + err = write_index_ext_header(, NULL, newfd, 
> CACHE_EXT_ENDOFINDEXENTRIES, sb.len) < 0
>   || ce_write(, newfd, sb.buf, sb.len) < 0;
>   strbuf_release();
>   if (err)

OK.

> +#define EOIE_SIZE 24 /* <4-byte offset> + <20-byte hash> */
> +#define EOIE_SIZE_WITH_HEADER (4 + 4 + EOIE_SIZE) /* <4-byte signature> + 
> <4-byte length> + EOIE_SIZE */
> +
> +#ifndef NO_PTHREADS
> +static unsigned long read_eoie_extension(void *mmap, size_t mmap_size)
> +{
> + /*
> +  * The end of index entries (EOIE) extension is guaranteed to be last
> +  * so that it can be found by scanning backwards from the EOF.
> +  *
> +  * "EOIE"
> +  * <4-byte length>
> +  * <4-byte offset>
> +  * <20-byte hash>
> +  */
> + const char *index, *eoie = (const char *)mmap + mmap_size - 
> GIT_SHA1_RAWSZ - EOIE_SIZE_WITH_HEADER;
> + uint32_t extsize;
> + unsigned long offset, src_offset;
> + unsigned char hash[GIT_MAX_RAWSZ];
> + git_hash_ctx c;
> +
> + /* validate the extension signature */
> + index = eoie;
> + if (CACHE_EXT(index) != CACHE_EXT_ENDOFINDEXENTRIES)
> + return 0;
> + index += sizeof(uint32_t);
> +
> + /* validate the extension size */
> + extsize = get_be32(index);
> + if (extsize != EOIE_SIZE)
> + return 0;
> + index += sizeof(uint32_t);

Do we know we have at least 8-byte to consume to perform the above
two checks, or is that something we need to verify at the beginning
of this function?  Better yet, as we know that a correct EOIE with
its own header is 28-byte long, we probably should abort if
mmap_size is smaller than that.

> + /*
> +  * Validate the offset we're going to look for the first extension
> +  * signature is after the index header and before the eoie extension.
> +  */
> + offset = get_be32(index);
> + if ((const char *)mmap + offset < (const char *)mmap + sizeof(struct 
> cache_header))
> + return 0;

Claims that the end is before the beginning, which we reject as bogus.  Good.

> + if ((const char *)mmap + offset >= eoie)
> + return 0;

Claims that the end is beyond the EOIE marker we should have placed
after it, which we reject as bogus.  Good.

> + index += sizeof(uint32_t);
> +
> + /*
> +  * The hash is computed over extension types and their sizes (but not
> +  * their contents).  E.g. if we have "TREE" extension that is N-bytes
> +  * long, "REUC" extension that is M-bytes long, followed by "EOIE",
> +  * then the hash would be:
> +  *
> +  * SHA-1("TREE" +  +
> +  *   "REUC" + )
> +  */
> + src_offset = offset;
> + the_hash_algo->init_fn();
> + while (src_offset < mmap_size - the_hash_algo->rawsz - 
> EOIE_SIZE_WITH_HEADER) {
> + /* After an array of active_nr index entries,
(Style nit).
> +  *