Re: RFC v3: Another proposed hash function transition plan

2017-09-13 Thread Junio C Hamano
Jonathan Nieder  writes:

> In other words, a long lifetime for the hash absolutely is a design
> goal.  Coping well with an unexpectedly short lifetime for the hash is
> also a design goal.
>
> If the hash function lasts 10 years then I am happy.

Absolutely.  When two functions have similar expected remaining life
and are equally widely supported, then faster is better than slower.
Otherwise our primary goal when picking the function from candidates
should be to optimize for its remaining life and wider availability.

Thanks.


Re: RFC v3: Another proposed hash function transition plan

2017-09-13 Thread Junio C Hamano
Jonathan Nieder  writes:

> The NewHash-based signature is included in the SHA-1 content as well,
> for the sake of round-tripping.  It is not stripped out.

Ah, OK, that allays my worries.  We rely on the fact that unknown
object headers from the future are ignored.  We use something other
than "gpgsig" header (say, "gpgsigN") to store NewHash based
signature on a commit object created in the NewHash world, so that
SHA-1 clients will ignore it but still include in the signature
computation---is that the idea?

Existing versions of Git that live in the SHA-1 world may still need
to learn to ignore/drop "gpgsigN" while amending a commit that
originally was created in the NewHash world.  Or to force upgrade we
may freeze the SHA-1 only versions of Git and stop updating them
altogether.  I dunno.

We also need to use something other than "mergetag" when carrying
over the contents of a tag being merged in the NewHash world, but
I'd imagine that you've thought about this already.

Thanks.




Re: [PATCH v2] config: avoid "write_in_full(fd, buf, len) < len" pattern

2017-09-13 Thread Ramsay Jones


On 13/09/17 23:43, Ramsay Jones wrote:
> 
> 
> On 13/09/17 19:58, Jeff King wrote:
>> On Wed, Sep 13, 2017 at 11:24:31AM -0700, Jonathan Nieder wrote:
>>
>>> For what it's worth,
>>> Reviewed-by: Jonathan Nieder 
>>
>> Thanks, and thank you for questioning the ptrdiff_t bits that didn't
>> make sense. I _thought_ they felt like nonsense, but couldn't quite
>> puzzle it out.
>>
>>> Compilers' signed/unsigned comparison warning can be noisy, but I'm
>>> starting to feel it's worth the suppression noise to turn it on when
>>> DEVELOPER=1 anyway.  What do you think?  Is there a way to turn it on
>>> selectively for certain functions on the LHS (like read() and write()
>>> style functions)?
>>
>> Obviously we couldn't turn them on for -Werror at this point. Let me see
>> how bad "-Wsign-compare -Wno-error=sign-compare" is.
>>
>>   $ make 2>&1 | grep -c warning:
>>   1391
>>
>> Ouch. I'm afraid that's probably not going to be very helpful. Even if I
>> were introducing a new problem, I'm not likely to see it amidst the mass
>> of existing complaints.
> 
> Hmm, about three or four years ago, I spent two or three evenings
> getting git to compile -Wextra clean. I remember the signed/unsigned
> issue was the cause of a large portion of the warnings issued by
> the compiler. I was surprised that it took such a relatively short
> time to do. However, it affected quite a large portion of the code, so
> I didn't think Junio would even consider applying it. Also, I only used
> gcc and was anticipating having additional warnings on clang (but I
> didn't get around to trying).
> 
> Maybe I should give it another go. :-D

For example, I remember the patch given below reduced the number
of warnings quite a bit (because it's an inline function in a
header file).

I just tried it again tonight; the current master branch has 3532
warnings when compiled with -Wextra, 1409 of which are -Wsign-compare
warnings. After applying the patch below, those numbers are reduced
by 344 to 3188/1065.

Still quite a bit to go ... ;-)

ATB,
Ramsay Jones

-- >8 --
Subject: [PATCH] git-compat-util: avoid -Wsign-compare warnings from xsize_t()

Signed-off-by: Ramsay Jones 
---
 git-compat-util.h | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/git-compat-util.h b/git-compat-util.h
index 6678b488c..2f3cf0883 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -898,9 +898,11 @@ static inline char *xstrdup_or_null(const char *str)
 
 static inline size_t xsize_t(off_t len)
 {
-   if (len > (size_t) len)
+   size_t size = (size_t) len;
+
+   if (len != (off_t) size)
die("Cannot handle files this big");
-   return (size_t)len;
+   return size;
 }
 
 __attribute__((format (printf, 3, 4)))
-- 
2.14.0


Re: RFC v3: Another proposed hash function transition plan

2017-09-13 Thread Linus Torvalds
On Wed, Sep 13, 2017 at 6:43 AM, demerphq  wrote:
>
> SHA3 however uses a completely different design where it mixes a 1088
> bit block into a 1600 bit state, for a leverage of 2:3, and the excess
> is *preserved between each block*.

Yes. And considering that the SHA1 attack was actually predicated on
the fact that each block was independent (no extra state between), I
do think SHA3 is a better model.

So I'd rather see SHA3-256 than SHA256.

  Linus


Investment

2017-09-13 Thread Michael Stewart
Dear Friend

I write to seek if I can have your professional assistance in this important 
matter which I bring to you. Hence I believe you have the right experience 
needed to accomplish this.
There are fund available and ready for investment which we will need your 
assistance to proceed. Please do get back to me as soon as you can.

We will give you more details as we are interested to work with you by having 
you investing and managing the fund accordingly

I wait to hear from you
Yours sincerely,
Michael Stewart 


Re: [PATCH 20/20] packed-backend.c: rename a bunch of things and update comments

2017-09-13 Thread Stefan Beller
On Wed, Sep 13, 2017 at 10:16 AM, Michael Haggerty  wrote:
> We've made huge changes to this file, and some of the old names and
> comments are no longer very fitting. So rename a bunch of things:
>
> * `struct packed_ref_cache` → `struct snapshot`
> * `acquire_packed_ref_cache()` → `acquire_snapshot()`
> * `release_packed_ref_buffer()` → `clear_snapshot_buffer()`
> * `release_packed_ref_cache()` → `release_snapshot()`
> * `clear_packed_ref_cache()` → `clear_snapshot()`
> * `struct packed_ref_entry` → `struct snapshot_record`
> * `cmp_packed_ref_entries()` → `cmp_packed_ref_records()`
> * `cmp_entry_to_refname()` → `cmp_record_to_refname()`
> * `sort_packed_refs()` → `sort_snapshot()`
> * `read_packed_refs()` → `create_snapshot()`
> * `validate_packed_ref_cache()` → `validate_snapshot()`
> * `get_packed_ref_cache()` → `get_snapshot()`
> * Renamed local variables and struct members accordingly.
>
> Also update a bunch of comments to reflect the renaming and the
> accumulated changes that the code has undergone.
>
> Signed-off-by: Michael Haggerty 

I have skimmed this series and it looks good.
Thanks,
Stefan


Re: RFC v3: Another proposed hash function transition plan

2017-09-13 Thread Jonathan Nieder
Hi,

Yves wrote:
> On 13 September 2017 at 14:05, Johannes Schindelin

>> For example, I am still in favor of SHA-256 over SHA3-256, after learning
>> some background details from in-house cryptographers: it provides
>> essentially the same level of security, according to my sources, while
>> hardware support seems to be coming to SHA-256 a lot sooner than to
>> SHA3-256.
>
> FWIW, and I know it is not worth much, as far as I can tell there is
> at least some security/math basis to prefer SHA3-256 to SHA-256.

Thanks for spelling this out.  From my (very cursory) understanding of
the math, what you are saying makes sense.  I think there were some
hints of this topic on-list before, but not made so explicit before.

Here's my summary of the discussion of other aspects of the choice of
hash functions so far:

My understanding from asking cryptographers matches what Dscho said.
One of the lessons of the history of hash functions is that some kinds
of attempts to improve the security margin of a hash function do not
help as much as expected once a function is broken.

In practice, what we are looking for is

- is the algorithm broken, or likely to be broken soon
- do the algorithm's guarantees match the application
- is the algorithm fast enough
- are high quality implementations widely available

On that first question, every well informed person I have asked has
assured me that SHA-256, SHA-512, SHA-512/256, SHA-256x16, SHA3-256,
K12, BLAKE2bp-256, etc are equally likely to be broken in the next 10
years.  The main difference for the longevity question is that some of
those algorithms have had more scrutiny than others, but all have had
significant scrutiny.  See [1] and the surrounding thread for more
discussion on that.

On the second question, SHA-256 is vulnerable to length extension
attacks, which means it would not be usable as a MAC directly (instead
of using the HMAC construction).  Fortunately Git doesn't use its hash
function that way.

On the third question, SHA-256 is one of the slower ones, even with
hardware accelaration, but it should be fast enough.

On the fourth question, SHA-256 shines.  See [2].  That is where I had
thought the conversation ended up.

For what it's worth, I'm pretty happy both with the level of scrutiny
we've given to this question and SHA-256 as an answer.  Luckily even
if at the last minute we learn something that changes the choice of
hash function, that would not significantly affect the transition
plan, so we have a chance to learn more.

See also [3].

Thanks,
Jonathan

[1] 
https://public-inbox.org/git/CAL9PXLzhPyE+geUdcLmd=pidt5p8efebbsgx_ds88knz2q_...@mail.gmail.com/#t
[2] https://public-inbox.org/git/xmqq37azy7ru@gitster.mtv.corp.google.com/
[3] https://www.imperialviolet.org/2017/05/31/skipsha3.html,
https://news.ycombinator.com/item?id=14453622


Re: [PATCH v2] config: avoid "write_in_full(fd, buf, len) < len" pattern

2017-09-13 Thread Ramsay Jones


On 13/09/17 19:58, Jeff King wrote:
> On Wed, Sep 13, 2017 at 11:24:31AM -0700, Jonathan Nieder wrote:
> 
>> For what it's worth,
>> Reviewed-by: Jonathan Nieder 
> 
> Thanks, and thank you for questioning the ptrdiff_t bits that didn't
> make sense. I _thought_ they felt like nonsense, but couldn't quite
> puzzle it out.
> 
>> Compilers' signed/unsigned comparison warning can be noisy, but I'm
>> starting to feel it's worth the suppression noise to turn it on when
>> DEVELOPER=1 anyway.  What do you think?  Is there a way to turn it on
>> selectively for certain functions on the LHS (like read() and write()
>> style functions)?
> 
> Obviously we couldn't turn them on for -Werror at this point. Let me see
> how bad "-Wsign-compare -Wno-error=sign-compare" is.
> 
>   $ make 2>&1 | grep -c warning:
>   1391
> 
> Ouch. I'm afraid that's probably not going to be very helpful. Even if I
> were introducing a new problem, I'm not likely to see it amidst the mass
> of existing complaints.

Hmm, about three or four years ago, I spent two or three evenings
getting git to compile -Wextra clean. I remember the signed/unsigned
issue was the cause of a large portion of the warnings issued by
the compiler. I was surprised that it took such a relatively short
time to do. However, it affected quite a large portion of the code, so
I didn't think Junio would even consider applying it. Also, I only used
gcc and was anticipating having additional warnings on clang (but I
didn't get around to trying).

Maybe I should give it another go. :-D

ATB,
Ramsay Jones



Re: [PATCH 3/8] daemon: recognize hidden request arguments

2017-09-13 Thread Stefan Beller
On Wed, Sep 13, 2017 at 2:54 PM, Brandon Williams  wrote:
> A normal request to git-daemon is structured as
> "command path/to/repo\0host=..\0" and due to a bug in an old version of
> git-daemon 73bb33a94 (daemon: Strictly parse the "extra arg" part of the
> command, 2009-06-04) we aren't able to place any extra args (separated
> by NULs) besides the host.
>
> In order to get around this limitation teach git-daemon to recognize
> additional request arguments hidden behind a second NUL byte.  Requests
> can then be structured like:
> "command path/to/repo\0host=..\0\0version=1\0key=value\0".  git-daemon
> can then parse out the extra arguments and set 'GIT_PROTOCOL'
> accordingly.
>
> Signed-off-by: Brandon Williams 
> ---
>  daemon.c | 71 
> +++-
>  1 file changed, 61 insertions(+), 10 deletions(-)
>
> diff --git a/daemon.c b/daemon.c
> index 30747075f..250dbf82c 100644
> --- a/daemon.c
> +++ b/daemon.c
> @@ -282,7 +282,7 @@ static const char *path_ok(const char *directory, struct 
> hostinfo *hi)
> return NULL;/* Fallthrough. Deny by default */
>  }
>
> -typedef int (*daemon_service_fn)(void);
> +typedef int (*daemon_service_fn)(const struct argv_array *env);
>  struct daemon_service {
> const char *name;
> const char *config_name;
> @@ -363,7 +363,7 @@ static int run_access_hook(struct daemon_service 
> *service, const char *dir,
>  }
>
>  static int run_service(const char *dir, struct daemon_service *service,
> -  struct hostinfo *hi)
> +  struct hostinfo *hi, const struct argv_array *env)
>  {
> const char *path;
> int enabled = service->enabled;
> @@ -422,7 +422,7 @@ static int run_service(const char *dir, struct 
> daemon_service *service,
>  */
> signal(SIGTERM, SIG_IGN);
>
> -   return service->fn();
> +   return service->fn(env);
>  }
>
>  static void copy_to_log(int fd)
> @@ -462,25 +462,34 @@ static int run_service_command(struct child_process 
> *cld)
> return finish_command(cld);
>  }
>
> -static int upload_pack(void)
> +static int upload_pack(const struct argv_array *env)
>  {
> struct child_process cld = CHILD_PROCESS_INIT;
> argv_array_pushl(, "upload-pack", "--strict", NULL);
> argv_array_pushf(, "--timeout=%u", timeout);
> +
> +   argv_array_pushv(_array, env->argv);
> +
> return run_service_command();
>  }
>
> -static int upload_archive(void)
> +static int upload_archive(const struct argv_array *env)
>  {
> struct child_process cld = CHILD_PROCESS_INIT;
> argv_array_push(, "upload-archive");
> +
> +   argv_array_pushv(_array, env->argv);
> +
> return run_service_command();
>  }
>
> -static int receive_pack(void)
> +static int receive_pack(const struct argv_array *env)
>  {
> struct child_process cld = CHILD_PROCESS_INIT;
> argv_array_push(, "receive-pack");
> +
> +   argv_array_pushv(_array, env->argv);
> +
> return run_service_command();
>  }
>
> @@ -574,7 +583,7 @@ static void canonicalize_client(struct strbuf *out, const 
> char *in)
>  /*
>   * Read the host as supplied by the client connection.
>   */
> -static void parse_host_arg(struct hostinfo *hi, char *extra_args, int buflen)
> +static char *parse_host_arg(struct hostinfo *hi, char *extra_args, int 
> buflen)
>  {
> char *val;
> int vallen;
> @@ -602,6 +611,39 @@ static void parse_host_arg(struct hostinfo *hi, char 
> *extra_args, int buflen)
> if (extra_args < end && *extra_args)
> die("Invalid request");
> }
> +
> +   return extra_args;
> +}
> +
> +static void parse_extra_args(struct argv_array *env, const char *extra_args,
> +int buflen)
> +{
> +   const char *end = extra_args + buflen;
> +   struct strbuf git_protocol = STRBUF_INIT;
> +
> +   for (; extra_args < end; extra_args += strlen(extra_args) + 1) {
> +   const char *arg = extra_args;
> +
> +   /*
> +* Parse the extra arguments, adding most to 'git_protocol'
> +* which will be used to set the 'GIT_PROTOCOL' envvar in the
> +* service that will be run.
> +*
> +* If there ends up being a particular arg in the future that
> +* git-daemon needs to parse specificly (like the 'host' arg)
> +* then it can be parsed here and not added to 'git_protocol'.
> +*/
> +   if (*arg) {
> +   if (git_protocol.len > 0)
> +   strbuf_addch(_protocol, ':');
> +   strbuf_addstr(_protocol, arg);
> +   }
> +   }
> +
> +   if (git_protocol.len > 0)
> +   argv_array_pushf(env, GIT_PROTOCOL_ENVIRONMENT "=%s",
> + 

Re: RFC v3: Another proposed hash function transition plan

2017-09-13 Thread Jonathan Nieder
Junio C Hamano wrote:

> In the proposed transition plan, the treatment of various signatures
> (deliberately) makes the conversion not quite roundtrip.

That's not precisely true.  Details below.

> When existing SHA-1 history in individual clones are converted to
> NewHash, we obviously cannot re-sign the corresponding NewHash
> contents with the same PGP key, so these converted objects will
> carry only signature on SHA-1 contents.  They can still be validated
> when they are exported back to SHA-1 world via the fetch/push
> protocol, and can be validated locally by converting them back to
> SHA-1 contents and then passing the result to gpgv.

Correct.

> The plan also states, if I remember what I read correctly, that
> newly created and signed objects (this includes signed commits and
> signed tags; mergetags merely carry over what the tag object that
> was merged was signed with, so we do not have to worry about them
> unless the resulting commit that has mergetag is signed itself, but
> that is already covered by how we handle signed commits) would be
> signed both for NewHash contents and its corresponding SHA-1
> contents (after internally convering it to SHA-1 contents).

Also correct.

> would allow us to strip the signature over NewHash contents and
> derive the SHA-1 contents to be shown to the outside world while
> migration is going on and I'd imagine it would be a good practice;
> it would allow us to sign something that allows everybody to verify,
> when some participants of the project are not yet NewHash capable.

The NewHash-based signature is included in the SHA-1 content as well,
for the sake of round-tripping.  It is not stripped out.

> But the signing over SHA-1 contents has to stop at some point, when
> everybody's Git becomes completely unaware of SHA-1.  We may want to
> have a guideline in the transition plan to (1) encourage signing for
> both for quite some time, and (2) the criteria for us to decide when
> to stop.

Yes, spelling out a rough schedule is a good idea.  I'll add that.

A version of Git that is aware of NewHash should be able to verify
NewHash signatures even for users that are using SHA-1 locally for the
sake of faster fetches and pushes to SHA-1 based peers.

In addition to a new enough Git, this requires the translation table
to translate to NewHash to be present.

So the criterion (2) is largely based on how up-to-date the Git used
by users wanting to verify signatures is and whether they are willing
to tolerate the performance implications of having a translation
table.  My hope is that when communicating with peers using the same
hash function, the translation table will not add too much performance
overhead.

Thank you,
Jonathan


Re: [PATCH 2/8] protocol: introduce protocol extention mechanisms

2017-09-13 Thread Stefan Beller
On Wed, Sep 13, 2017 at 2:54 PM, Brandon Williams  wrote:
> Create protocol.{c,h} and provide functions which future servers and
> clients can use to determine which protocol to use or is being used.
>
> Also introduce the 'GIT_PROTOCOL' environment variable which will be
> used to communicate a colon separated list of keys with optional values
> to a server.  Unknown keys and values must be tolerated.  This mechanism
> is used to communicate which version of the wire protocol a client would
> like to use with a server.
>
> Signed-off-by: Brandon Williams 
> ---
>  Documentation/config.txt | 16 +++
>  Documentation/git.txt|  5 
>  Makefile |  1 +
>  cache.h  |  7 +
>  protocol.c   | 72 
> 
>  protocol.h   | 15 ++
>  6 files changed, 116 insertions(+)
>  create mode 100644 protocol.c
>  create mode 100644 protocol.h
>
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index dc4e3f58a..d5b28a32c 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -2517,6 +2517,22 @@ The protocol names currently used by git are:
>  `hg` to allow the `git-remote-hg` helper)
>  --
>
> +protocol.version::

It would be cool to set a set of versions that are good. (I am not sure
if that can be deferred to a later patch.)

  Consider we'd have versions 0,1,2,3,4 in the future:
  In an ideal world the client and server would talk using v4
  as it is the most advanced protocol, right?
  Maybe a security/performance issue is found on the server side
  with say protocol v3. Still no big deal as we are speaking v4.
  But then consider an issue is found on the client side with v4.
  Then the client would happily talk 0..3 while the server would
  love to talk using 0,1,2,4.

The way I think about protocol version negotiation is that
both parties involved have a set of versions that they tolerate
to talk (they might understand more than the tolerated set, but the
user forbade some), and the goal of the negotiation is to find
the highest version number that is part of both the server set
and client set. So quite naturally with this line of thinking the
configuration is to configure a set of versions, which is what
I propose here. Maybe even in the wire format, separated
with colons?

> +   If set, clients will attempt to communicate with a server using
> +   the specified protocol version.  If unset, no attempt will be
> +   made by the client to communicate using a particular protocol
> +   version, this results in protocol version 0 being used.

This sounds as if we're going to be really shy at first and only
users that care will try out new versions at their own risk.
>From a users POV this may be frustrating as I would imagine that
people want to run

  git config --global protocol.version 2

to try it out and then realize that some of their hosts do not speak
2, so they have to actually configure it per repo/remote.

> +   Supported versions:

> +* `0` - the original wire protocol.

In the future this may be misleading as it doesn't specify the date of
when it was original. e.g. are capabilities already supported in "original"?

Maybe phrase it as "wire protocol as of v2.14" ? (Though this sounds
as if new capabilities added in the future are not allowed)


> +
> +extern enum protocol_version parse_protocol_version(const char *value);
> +extern enum protocol_version get_protocol_version_config(void);
> +extern enum protocol_version determine_protocol_version_server(void);
> +extern enum protocol_version determine_protocol_version_client(const char 
> *server_response);

Here is a good place to have some comments.


Re: RFC v3: Another proposed hash function transition plan

2017-09-13 Thread Jonathan Nieder
Hi,

Stefan Beller wrote:
> On Wed, Sep 13, 2017 at 2:52 PM, Junio C Hamano  wrote:

>> This is a tangent, but it may be fine for a shallow clone to treat
>> the cut-off points in the history as if they are root commits and
>> compute generation numbers locally, just like everybody else does.
[...]
> Locally it helps for some operations such as correct walks.
> For the network case however, it doesn't really help either.
>
> If we had global generation numbers, one could imagine that they
> are used in the pack negotiation (server advertises the maximum
> generation number or even gen number per branch; client
> could binary search in there for the fork point)
>
> I wonder if locally generated generation numbers (for the shallow
> case) could be used somehow to still improve network operations.

I have a different concern about locally generated generation numbers in
a shallow clone.  My concern is that it is slow to recompute them when
deepening the shallow clone.

However:

 1. That only affects performance and for some use cases could be
mitigated e.g. by introducing some laziness, and, more
convincingly,

 2. With a small protocol change, the server could communicate the
generation numbers for commit objects at the edge of a shallow
clone, avoiding this trouble.

So I am not too concerned.

More generally, unless there is a very very compelling reason to, I
don't want to couple other changes into the hash function transition.
If they're worthwhile enough to do, they're worthwhile enough to do
whether we're transitioning to a new hash function or not: I have not
heard a convincing example yet of a "while at it" that is worth the
complexity of such coupling.

(That said, if two format changes are worth doing and happen to be
implemented at the same time, then we can save users the trouble of
experiencing two format change transitions.  That is a kind of
coupling from the end user's point of view.  But from the perspective
of someone writing the code, there is no need to count on that, and it
is not likely to happen anyway.)

> If we'd get the transition somewhat right, the next transition will
> be easier than the current transition, such that I am not that concerned
> about longevity. I am rather concerned about the complexity that is added
> to the code base (whilst accumulating technical debt instead of clearer
> abstraction layers)

During the transition, users have to suffer reencoding overhead, so it
is not good for such transitions to need to happen very often.  If the
new hash function breaks early, then we have to cope with it and as
you say, having the framework in place means we'd be ready for that.
But I still don't want the chosen hash function to break early.

In other words, a long lifetime for the hash absolutely is a design
goal.  Coping well with an unexpectedly short lifetime for the hash is
also a design goal.

If the hash function lasts 10 years then I am happy.

Thanks,
Jonathan


Re: [PATCH 1/1] reset: fix reset when using the sparse-checkout feature.

2017-09-13 Thread Junio C Hamano
Jacob Keller  writes:

> By definition if you do a sparse checkout, you're telling git "I only
> care about the files in this sparse checkout, and do not concern me
> with anything else"... So the proposed fix is "since git cleared the
> skip-worktree flag, we should actually also copy the file out again."
> but I think the real problem is that we're clearing skip worktree to
> begin with?

As this 100% agree with what I've been saying, I do not have
anything to add to the discussion at the moment, so I'll stop
speaking now but will continue to monitor the thread so that I may
hear a new argument and datapoint that would make me change my mind.

Thanks for a healthy discussion.


Re: RFC v3: Another proposed hash function transition plan

2017-09-13 Thread Junio C Hamano
Junio C Hamano  writes:

> Jonathan Nieder  writes:
>
>> Treating generation numbers as derived data (as in Jeff King's
>> preferred design, if I have understood his replies correctly) would
>> also be possible but it does not interact well with shallow clone or
>> narrow clone.
>
> Just like we have skewed committer timestamps, there is no reason to
> believe that generation numbers embedded in objects are trustable,
> and there is no way for narrow clients to even verify their correctness.
>
> So I agree with Peff that having generation numbers in object is
> pointless; I agree any other derivables like corresponding sha-1
> name is also pointless to have.
>
> This is a tangent, but it may be fine for a shallow clone to treat
> the cut-off points in the history as if they are root commits and
> compute generation numbers locally, just like everybody else does.
> As generation numbers won't have to be global (because we will not
> be embedding them in objects), nobody gets hurt if they do not match
> across repositories---just like often-mentioned rename detection
> cache, it can be kept as a mere local performance aid and does not
> have to participate in the object model.
>
>> All that said, for simplicity I still lean against including
>> generation numbers as part of a hash function transition.
>
> Good.

In the proposed transition plan, the treatment of various signatures
(deliberately) makes the conversion not quite roundtrip.

When existing SHA-1 history in individual clones are converted to
NewHash, we obviously cannot re-sign the corresponding NewHash
contents with the same PGP key, so these converted objects will
carry only signature on SHA-1 contents.  They can still be validated
when they are exported back to SHA-1 world via the fetch/push
protocol, and can be validated locally by converting them back to
SHA-1 contents and then passing the result to gpgv.

The plan also states, if I remember what I read correctly, that
newly created and signed objects (this includes signed commits and
signed tags; mergetags merely carry over what the tag object that
was merged was signed with, so we do not have to worry about them
unless the resulting commit that has mergetag is signed itself, but
that is already covered by how we handle signed commits) would be
signed both for NewHash contents and its corresponding SHA-1
contents (after internally convering it to SHA-1 contents).  That
would allow us to strip the signature over NewHash contents and
derive the SHA-1 contents to be shown to the outside world while
migration is going on and I'd imagine it would be a good practice;
it would allow us to sign something that allows everybody to verify,
when some participants of the project are not yet NewHash capable.

But the signing over SHA-1 contents has to stop at some point, when
everybody's Git becomes completely unaware of SHA-1.  We may want to
have a guideline in the transition plan to (1) encourage signing for
both for quite some time, and (2) the criteria for us to decide when
to stop.

Thanks.


Re: RFC v3: Another proposed hash function transition plan

2017-09-13 Thread Stefan Beller
On Wed, Sep 13, 2017 at 2:52 PM, Junio C Hamano  wrote:
> Jonathan Nieder  writes:
>
>> Treating generation numbers as derived data (as in Jeff King's
>> preferred design, if I have understood his replies correctly) would
>> also be possible but it does not interact well with shallow clone or
>> narrow clone.
>
> Just like we have skewed committer timestamps, there is no reason to
> believe that generation numbers embedded in objects are trustable,
> and there is no way for narrow clients to even verify their correctness.
>
> So I agree with Peff that having generation numbers in object is
> pointless; I agree any other derivables like corresponding sha-1
> name is also pointless to have.
>
> This is a tangent, but it may be fine for a shallow clone to treat
> the cut-off points in the history as if they are root commits and
> compute generation numbers locally, just like everybody else does.
> As generation numbers won't have to be global (because we will not
> be embedding them in objects), nobody gets hurt if they do not match
> across repositories---just like often-mentioned rename detection
> cache, it can be kept as a mere local performance aid and does not
> have to participate in the object model.

Locally it helps for some operations such as correct walks.
For the network case however, it doesn't really help either.

If we had global generation numbers, one could imagine that they
are used in the pack negotiation (server advertises the maximum
generation number or even gen number per branch; client
could binary search in there for the fork point)

I wonder if locally generated generation numbers (for the shallow
case) could be used somehow to still improve network operations.



>> My assumption based on previous conversations (and other external
>> conversations like [1]) is that we are going to use SHA2-256 and have
>> a pretty strong consensus for that.  Don't worry!
>
> Hmph, I actually re-read the thread recently, and my impression was
> that we didn't quite have a consensus but were leaning towards
> SHA3-256.
>
> I do not personally have a strong preference myself and I would say
> that anything will do as long as it is with good longevity and
> availability.  SHA2 family would be a fine choice due to its age on
> both counts, being scrutinized longer and having a chance to be
> implemented in many places, even though its age itself may have to
> be subtracted from the longevity factor.

If we'd get the transition somewhat right, the next transition will
be easier than the current transition, such that I am not that concerned
about longevity. I am rather concerned about the complexity that is added
to the code base (whilst accumulating technical debt instead of clearer
abstraction layers)


[PATCH 8/8] i5700: add interop test for protocol transition

2017-09-13 Thread Brandon Williams
Signed-off-by: Brandon Williams 
---
 t/interop/i5700-protocol-transition.sh | 68 ++
 1 file changed, 68 insertions(+)
 create mode 100755 t/interop/i5700-protocol-transition.sh

diff --git a/t/interop/i5700-protocol-transition.sh 
b/t/interop/i5700-protocol-transition.sh
new file mode 100755
index 0..9e83428a8
--- /dev/null
+++ b/t/interop/i5700-protocol-transition.sh
@@ -0,0 +1,68 @@
+#!/bin/sh
+
+VERSION_A=.
+VERSION_B=v2.0.0
+
+: ${LIB_GIT_DAEMON_PORT:=5600}
+LIB_GIT_DAEMON_COMMAND='git.b daemon'
+
+test_description='clone and fetch by client who is trying to use a new 
protocol'
+. ./interop-lib.sh
+. "$TEST_DIRECTORY"/lib-git-daemon.sh
+
+start_git_daemon --export-all
+
+repo=$GIT_DAEMON_DOCUMENT_ROOT_PATH/repo
+
+test_expect_success "create repo served by $VERSION_B" '
+   git.b init "$repo" &&
+   git.b -C "$repo" commit --allow-empty -m one
+'
+
+test_expect_success "git:// clone with $VERSION_A and protocol v1" '
+   GIT_TRACE_PACKET=1 git.a -c protocol.version=1 clone 
"$GIT_DAEMON_URL/repo" child 2>log &&
+   git.a -C child log -1 --format=%s >actual &&
+   git.b -C "$repo" log -1 --format=%s >expect &&
+   test_cmp expect actual &&
+   grep "version=1" log
+'
+
+test_expect_success "git:// fetch with $VERSION_A and protocol v1" '
+   git.b -C "$repo" commit --allow-empty -m two &&
+   git.b -C "$repo" log -1 --format=%s >expect &&
+
+   GIT_TRACE_PACKET=1 git.a -C child -c protocol.version=1 fetch 2>log &&
+   git.a -C child log -1 --format=%s FETCH_HEAD >actual &&
+
+   test_cmp expect actual &&
+   grep "version=1" log &&
+   ! grep "version 1" log
+'
+
+stop_git_daemon
+
+test_expect_success "create repo served by $VERSION_B" '
+   git.b init parent &&
+   git.b -C parent commit --allow-empty -m one
+'
+
+test_expect_success "file:// clone with $VERSION_A and protocol v1" '
+   GIT_TRACE_PACKET=1 git.a -c protocol.version=1 clone 
--upload-pack="git.b upload-pack" parent child2 2>log &&
+   git.a -C child2 log -1 --format=%s >actual &&
+   git.b -C parent log -1 --format=%s >expect &&
+   test_cmp expect actual &&
+   ! grep "version 1" log
+'
+
+test_expect_success "file:// fetch with $VERSION_A and protocol v1" '
+   git.b -C parent commit --allow-empty -m two &&
+   git.b -C parent log -1 --format=%s >expect &&
+
+   GIT_TRACE_PACKET=1 git.a -C child2 -c protocol.version=1 fetch 
--upload-pack="git.b upload-pack" 2>log &&
+   git.a -C child2 log -1 --format=%s FETCH_HEAD >actual &&
+
+   test_cmp expect actual &&
+   ! grep "version 1" log
+'
+
+test_done
-- 
2.14.1.690.gbb1197296e-goog



[PATCH 7/8] http: tell server that the client understands v1

2017-09-13 Thread Brandon Williams
Tell a server that protocol v1 can be used by sending the http header
'Git-Protocol' indicating this.

Also teach the apache http server to pass through the 'Git-Protocol'
header as an environment variable 'GIT_PROTOCOL'.

Signed-off-by: Brandon Williams 
---
 cache.h |  2 ++
 http.c  | 18 +
 t/lib-httpd/apache.conf |  7 +
 t/t5700-protocol-v1.sh  | 69 +
 4 files changed, 96 insertions(+)

diff --git a/cache.h b/cache.h
index 8839b1ed4..82a643968 100644
--- a/cache.h
+++ b/cache.h
@@ -450,6 +450,8 @@ static inline enum object_type object_type(unsigned int 
mode)
  * 'key[=value]'.  Presence of unknown keys must be tolerated.
  */
 #define GIT_PROTOCOL_ENVIRONMENT "GIT_PROTOCOL"
+/* HTTP header used to handshake the wire protocol */
+#define GIT_PROTOCOL_HEADER "Git-Protocol"
 
 /*
  * This environment variable is expected to contain a boolean indicating
diff --git a/http.c b/http.c
index 9e40a465f..ffb719216 100644
--- a/http.c
+++ b/http.c
@@ -12,6 +12,7 @@
 #include "gettext.h"
 #include "transport.h"
 #include "packfile.h"
+#include "protocol.h"
 
 static struct trace_key trace_curl = TRACE_KEY_INIT(CURL);
 #if LIBCURL_VERSION_NUM >= 0x070a08
@@ -897,6 +898,21 @@ static void set_from_env(const char **var, const char 
*envname)
*var = val;
 }
 
+static void protocol_http_header(void)
+{
+   if (get_protocol_version_config() > 0) {
+   struct strbuf protocol_header = STRBUF_INIT;
+
+   strbuf_addf(_header, GIT_PROTOCOL_HEADER ": 
version=%d",
+   get_protocol_version_config());
+
+
+   extra_http_headers = curl_slist_append(extra_http_headers,
+  protocol_header.buf);
+   strbuf_release(_header);
+   }
+}
+
 void http_init(struct remote *remote, const char *url, int proactive_auth)
 {
char *low_speed_limit;
@@ -927,6 +943,8 @@ void http_init(struct remote *remote, const char *url, int 
proactive_auth)
if (remote)
var_override(_proxy_authmethod, 
remote->http_proxy_authmethod);
 
+   protocol_http_header();
+
pragma_header = curl_slist_append(http_copy_default_headers(),
"Pragma: no-cache");
no_pragma_header = curl_slist_append(http_copy_default_headers(),
diff --git a/t/lib-httpd/apache.conf b/t/lib-httpd/apache.conf
index 0642ae7e6..df1943631 100644
--- a/t/lib-httpd/apache.conf
+++ b/t/lib-httpd/apache.conf
@@ -67,6 +67,9 @@ LockFile accept.lock
 
LoadModule unixd_module modules/mod_unixd.so
 
+
+   LoadModule setenvif_module modules/mod_setenvif.so
+
 
 
 PassEnv GIT_VALGRIND
@@ -76,6 +79,10 @@ PassEnv ASAN_OPTIONS
 PassEnv GIT_TRACE
 PassEnv GIT_CONFIG_NOSYSTEM
 
+= 2.4>
+   SetEnvIf Git-Protocol ".*" GIT_PROTOCOL=$0
+
+
 Alias /dumb/ www/
 Alias /auth/dumb/ www/auth/dumb/
 
diff --git a/t/t5700-protocol-v1.sh b/t/t5700-protocol-v1.sh
index 1988bbce6..65127 100755
--- a/t/t5700-protocol-v1.sh
+++ b/t/t5700-protocol-v1.sh
@@ -220,4 +220,73 @@ test_expect_success 'push with ssh:// using protocol v1' '
grep "push< version 1" log
 '
 
+# Test protocol v1 with 'http://' transport
+#
+. "$TEST_DIRECTORY"/lib-httpd.sh
+start_httpd
+
+test_expect_success 'create repo to be served by http:// transport' '
+   git init "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" &&
+   git -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" config http.receivepack 
true &&
+   test_commit -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" one
+'
+
+test_expect_success 'clone with http:// using protocol v1' '
+   GIT_TRACE_PACKET=1 GIT_TRACE_CURL=1 git -c protocol.version=1 \
+   clone "$HTTPD_URL/smart/http_parent" http_child 2>log &&
+
+   git -C http_child log -1 --format=%s >actual &&
+   git -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" log -1 --format=%s 
>expect &&
+   test_cmp expect actual &&
+
+   # Client requested to use protocol v1
+   grep "Git-Protocol: version=1" log &&
+   # Server responded using protocol v1
+   grep "git< version 1" log
+'
+
+test_expect_success 'fetch with http:// using protocol v1' '
+   test_commit -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" two &&
+
+   GIT_TRACE_PACKET=1 git -C http_child -c protocol.version=1 \
+   fetch 2>log &&
+
+   git -C http_child log -1 --format=%s FETCH_HEAD >actual &&
+   git -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" log -1 --format=%s 
>expect &&
+   test_cmp expect actual &&
+
+   # Server responded using protocol v1
+   grep "git< version 1" log
+'
+
+test_expect_success 'pull with http:// using protocol v1' '
+   GIT_TRACE_PACKET=1 git -C http_child -c protocol.version=1 \
+   pull 2>log &&
+
+   git -C http_child log -1 --format=%s >actual &&
+   git -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" log -1 --format=%s 

[PATCH 6/8] connect: tell server that the client understands v1

2017-09-13 Thread Brandon Williams
Teach the connection logic to tell a serve that it understands protocol
v1.  This is done in 2 different ways for the built in protocols.

1. git://
   A normal request is structured as "command path/to/repo\0host=..\0"
   and due to a bug in an old version of git-daemon 73bb33a94 (daemon:
   Strictly parse the "extra arg" part of the command, 2009-06-04) we
   aren't able to place any extra args (separated by NULs) besides the
   host.  In order to get around this limitation put protocol version
   information after a second NUL byte so the request is structured
   like: "command path/to/repo\0host=..\0\0version=1\0".  git-daemon can
   then parse out the version number and set GIT_PROTOCOL.

2. ssh://, file://
   Set GIT_PROTOCOL envvar with the desired protocol version.  The
   envvar can be sent across ssh by using '-o SendEnv=GIT_PROTOCOL' and
   having the server whitelist this envvar.

Signed-off-by: Brandon Williams 
---
 connect.c  |  37 ++--
 t/t5700-protocol-v1.sh | 223 +
 2 files changed, 255 insertions(+), 5 deletions(-)
 create mode 100755 t/t5700-protocol-v1.sh

diff --git a/connect.c b/connect.c
index 2702e1f2e..40b388acd 100644
--- a/connect.c
+++ b/connect.c
@@ -815,6 +815,7 @@ struct child_process *git_connect(int fd[2], const char 
*url,
printf("Diag: path=%s\n", path ? path : "NULL");
conn = NULL;
} else if (protocol == PROTO_GIT) {
+   struct strbuf request = STRBUF_INIT;
/*
 * Set up virtual host information based on where we will
 * connect, unless the user has overridden us in
@@ -842,12 +843,24 @@ struct child_process *git_connect(int fd[2], const char 
*url,
 * Note: Do not add any other headers here!  Doing so
 * will cause older git-daemon servers to crash.
 */
-   packet_write_fmt(fd[1],
-"%s %s%chost=%s%c",
-prog, path, 0,
-target_host, 0);
+   strbuf_addf(,
+   "%s %s%chost=%s%c",
+   prog, path, 0,
+   target_host, 0);
+
+   /* If using a new version put that stuff here after a second 
null byte */
+   if (get_protocol_version_config() > 0) {
+   strbuf_addch(, '\0');
+   strbuf_addf(, "version=%d%c",
+   get_protocol_version_config(), '\0');
+   }
+
+   packet_write(fd[1], request.buf, request.len);
+
free(target_host);
+   strbuf_release();
} else {
+   const char *const *var;
conn = xmalloc(sizeof(*conn));
child_process_init(conn);
 
@@ -859,7 +872,9 @@ struct child_process *git_connect(int fd[2], const char 
*url,
sq_quote_buf(, path);
 
/* remove repo-local variables from the environment */
-   conn->env = local_repo_env;
+   for (var = local_repo_env; *var; var++)
+   argv_array_push(>env_array, *var);
+
conn->use_shell = 1;
conn->in = conn->out = -1;
if (protocol == PROTO_SSH) {
@@ -912,6 +927,14 @@ struct child_process *git_connect(int fd[2], const char 
*url,
}
 
argv_array_push(>args, ssh);
+
+   if (get_protocol_version_config() > 0) {
+   argv_array_push(>args, "-o");
+   argv_array_push(>args, "SendEnv=" 
GIT_PROTOCOL_ENVIRONMENT);
+   argv_array_pushf(>env_array, 
GIT_PROTOCOL_ENVIRONMENT "=version=%d",
+get_protocol_version_config());
+   }
+
if (flags & CONNECT_IPV4)
argv_array_push(>args, "-4");
else if (flags & CONNECT_IPV6)
@@ -926,6 +949,10 @@ struct child_process *git_connect(int fd[2], const char 
*url,
argv_array_push(>args, ssh_host);
} else {
transport_check_allowed("file");
+   if (get_protocol_version_config() > 0) {
+   argv_array_pushf(>env_array, 
GIT_PROTOCOL_ENVIRONMENT "=version=%d",
+get_protocol_version_config());
+   }
}
argv_array_push(>args, cmd.buf);
 
diff --git a/t/t5700-protocol-v1.sh b/t/t5700-protocol-v1.sh
new file mode 100755
index 0..1988bbce6
--- /dev/null
+++ b/t/t5700-protocol-v1.sh
@@ -0,0 +1,223 @@
+#!/bin/sh
+
+test_description='test git wire-protocol transition'
+
+TEST_NO_CREATE_REPO=1
+

[PATCH 1/8] pkt-line: add packet_write function

2017-09-13 Thread Brandon Williams
Add a function which can be used to write the contents of an arbitrary
buffer.  This makes it easy to build up data in a buffer before writing
the packet instead of formatting the entire contents of the packet using
'packet_write_fmt()'.

Signed-off-by: Brandon Williams 
---
 pkt-line.c | 6 ++
 pkt-line.h | 1 +
 2 files changed, 7 insertions(+)

diff --git a/pkt-line.c b/pkt-line.c
index 7db911957..cf98f371b 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -188,6 +188,12 @@ static int packet_write_gently(const int fd_out, const 
char *buf, size_t size)
return error("packet write failed");
 }
 
+void packet_write(const int fd_out, const char *buf, size_t size)
+{
+   if (packet_write_gently(fd_out, buf, size))
+   die_errno("packet write failed");
+}
+
 void packet_buf_write(struct strbuf *buf, const char *fmt, ...)
 {
va_list args;
diff --git a/pkt-line.h b/pkt-line.h
index 66ef610fc..d9e9783b1 100644
--- a/pkt-line.h
+++ b/pkt-line.h
@@ -22,6 +22,7 @@
 void packet_flush(int fd);
 void packet_write_fmt(int fd, const char *fmt, ...) __attribute__((format 
(printf, 2, 3)));
 void packet_buf_flush(struct strbuf *buf);
+void packet_write(const int fd_out, const char *buf, size_t size);
 void packet_buf_write(struct strbuf *buf, const char *fmt, ...) 
__attribute__((format (printf, 2, 3)));
 int packet_flush_gently(int fd);
 int packet_write_fmt_gently(int fd, const char *fmt, ...) 
__attribute__((format (printf, 2, 3)));
-- 
2.14.1.690.gbb1197296e-goog



[PATCH 5/8] connect: teach client to recognize v1 server response

2017-09-13 Thread Brandon Williams
Teach a client to recognize that a server understands protocol v1 by
looking at the first pkt-line the server sends in response.  This is
done by looking for the response "version 1" send by upload-pack or
receive-pack.

Signed-off-by: Brandon Williams 
---
 connect.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/connect.c b/connect.c
index 49b28b83b..2702e1f2e 100644
--- a/connect.c
+++ b/connect.c
@@ -11,6 +11,7 @@
 #include "string-list.h"
 #include "sha1-array.h"
 #include "transport.h"
+#include "protocol.h"
 
 static char *server_capabilities;
 static const char *parse_feature_value(const char *, const char *, int *);
@@ -142,6 +143,27 @@ struct ref **get_remote_heads(int in, char *src_buf, 
size_t src_len,
if (len < 0)
die_initial_contact(saw_response);
 
+   /* Only check for version information on first response */
+   if (!saw_response) {
+   switch (determine_protocol_version_client(buffer)) {
+   case protocol_v1:
+   /*
+* First pkt-line contained the version string.
+* Continue on to process the ref advertisement.
+*/
+   continue;
+   case protocol_v0:
+   /*
+* Server is speaking protocol v0 and sent a
+* ref so we need to process it.
+*/
+   break;
+   default:
+   die("server is speaking an unknown protocol");
+   break;
+   }
+   }
+
if (!len)
break;
 
-- 
2.14.1.690.gbb1197296e-goog



[PATCH 4/8] upload-pack, receive-pack: introduce protocol version 1

2017-09-13 Thread Brandon Williams
Teach upload-pack and receive-pack to understand and respond using
protocol version 1, if requested.

Protocol version 1 is simply the original and current protocol (what I'm
calling version 0) with the addition of a single packet line, which
precedes the ref advertisement, indicating the protocol version being
spoken.

Signed-off-by: Brandon Williams 
---
 builtin/receive-pack.c | 14 ++
 upload-pack.c  | 17 -
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c
index 52c63ebfd..aebe77cc3 100644
--- a/builtin/receive-pack.c
+++ b/builtin/receive-pack.c
@@ -24,6 +24,7 @@
 #include "tmp-objdir.h"
 #include "oidset.h"
 #include "packfile.h"
+#include "protocol.h"
 
 static const char * const receive_pack_usage[] = {
N_("git receive-pack "),
@@ -1963,6 +1964,19 @@ int cmd_receive_pack(int argc, const char **argv, const 
char *prefix)
else if (0 <= receive_unpack_limit)
unpack_limit = receive_unpack_limit;
 
+   switch (determine_protocol_version_server()) {
+   case protocol_v1:
+   if (advertise_refs || !stateless_rpc)
+   packet_write_fmt(1, "version 1\n");
+   /*
+* v1 is just the original protocol with a version string,
+* so just fall through after writing the version string.
+*/
+   case protocol_v0:
+   default:
+   break;
+   }
+
if (advertise_refs || !stateless_rpc) {
write_head_info();
}
diff --git a/upload-pack.c b/upload-pack.c
index 7efff2fbf..5cab39819 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -18,6 +18,7 @@
 #include "parse-options.h"
 #include "argv-array.h"
 #include "prio-queue.h"
+#include "protocol.h"
 
 static const char * const upload_pack_usage[] = {
N_("git upload-pack [] "),
@@ -1067,6 +1068,20 @@ int cmd_main(int argc, const char **argv)
die("'%s' does not appear to be a git repository", dir);
 
git_config(upload_pack_config, NULL);
-   upload_pack();
+
+   switch (determine_protocol_version_server()) {
+   case protocol_v1:
+   if (advertise_refs || !stateless_rpc)
+   packet_write_fmt(1, "version 1\n");
+   /*
+* v1 is just the original protocol with a version string,
+* so just fall through after writing the version string.
+*/
+   case protocol_v0:
+   default:
+   upload_pack();
+   break;
+   }
+
return 0;
 }
-- 
2.14.1.690.gbb1197296e-goog



[PATCH 0/8] protocol transition

2017-09-13 Thread Brandon Williams
Here is the non-RFC version of of my proposed protocol transition plan which
can be found here:
https://public-inbox.org/git/20170824225328.8174-1-bmw...@google.com/

The main take away from the comments on the RFC were that the first transition
shouldn't be drastic, so this patch set introduces protocol v1 which is simply
protocol v0 (which is what I'm calling the current git wire protocol) with a
single pkt-line containing a version string before the ref advertisement.

I have included tests for protocol version 1 to verify that it works with the
following transports: git://, file://, ssh://, and http://.  I have also
included an interop test to ensure that sending the version request out of band
doesn't cause issues with older servers.

Any and all comments and feedback are welcome, thanks!

Brandon Williams (8):
  pkt-line: add packet_write function
  protocol: introduce protocol extention mechanisms
  daemon: recognize hidden request arguments
  upload-pack, receive-pack: introduce protocol version 1
  connect: teach client to recognize v1 server response
  connect: tell server that the client understands v1
  http: tell server that the client understands v1
  i5700: add interop test for protocol transition

 Documentation/config.txt   |  16 ++
 Documentation/git.txt  |   5 +
 Makefile   |   1 +
 builtin/receive-pack.c |  14 ++
 cache.h|   9 +
 connect.c  |  59 ++-
 daemon.c   |  71 ++--
 http.c |  18 ++
 pkt-line.c |   6 +
 pkt-line.h |   1 +
 protocol.c |  72 
 protocol.h |  15 ++
 t/interop/i5700-protocol-transition.sh |  68 
 t/lib-httpd/apache.conf|   7 +
 t/t5700-protocol-v1.sh | 292 +
 upload-pack.c  |  17 +-
 16 files changed, 655 insertions(+), 16 deletions(-)
 create mode 100644 protocol.c
 create mode 100644 protocol.h
 create mode 100755 t/interop/i5700-protocol-transition.sh
 create mode 100755 t/t5700-protocol-v1.sh

-- 
2.14.1.690.gbb1197296e-goog



[PATCH 3/8] daemon: recognize hidden request arguments

2017-09-13 Thread Brandon Williams
A normal request to git-daemon is structured as
"command path/to/repo\0host=..\0" and due to a bug in an old version of
git-daemon 73bb33a94 (daemon: Strictly parse the "extra arg" part of the
command, 2009-06-04) we aren't able to place any extra args (separated
by NULs) besides the host.

In order to get around this limitation teach git-daemon to recognize
additional request arguments hidden behind a second NUL byte.  Requests
can then be structured like:
"command path/to/repo\0host=..\0\0version=1\0key=value\0".  git-daemon
can then parse out the extra arguments and set 'GIT_PROTOCOL'
accordingly.

Signed-off-by: Brandon Williams 
---
 daemon.c | 71 +++-
 1 file changed, 61 insertions(+), 10 deletions(-)

diff --git a/daemon.c b/daemon.c
index 30747075f..250dbf82c 100644
--- a/daemon.c
+++ b/daemon.c
@@ -282,7 +282,7 @@ static const char *path_ok(const char *directory, struct 
hostinfo *hi)
return NULL;/* Fallthrough. Deny by default */
 }
 
-typedef int (*daemon_service_fn)(void);
+typedef int (*daemon_service_fn)(const struct argv_array *env);
 struct daemon_service {
const char *name;
const char *config_name;
@@ -363,7 +363,7 @@ static int run_access_hook(struct daemon_service *service, 
const char *dir,
 }
 
 static int run_service(const char *dir, struct daemon_service *service,
-  struct hostinfo *hi)
+  struct hostinfo *hi, const struct argv_array *env)
 {
const char *path;
int enabled = service->enabled;
@@ -422,7 +422,7 @@ static int run_service(const char *dir, struct 
daemon_service *service,
 */
signal(SIGTERM, SIG_IGN);
 
-   return service->fn();
+   return service->fn(env);
 }
 
 static void copy_to_log(int fd)
@@ -462,25 +462,34 @@ static int run_service_command(struct child_process *cld)
return finish_command(cld);
 }
 
-static int upload_pack(void)
+static int upload_pack(const struct argv_array *env)
 {
struct child_process cld = CHILD_PROCESS_INIT;
argv_array_pushl(, "upload-pack", "--strict", NULL);
argv_array_pushf(, "--timeout=%u", timeout);
+
+   argv_array_pushv(_array, env->argv);
+
return run_service_command();
 }
 
-static int upload_archive(void)
+static int upload_archive(const struct argv_array *env)
 {
struct child_process cld = CHILD_PROCESS_INIT;
argv_array_push(, "upload-archive");
+
+   argv_array_pushv(_array, env->argv);
+
return run_service_command();
 }
 
-static int receive_pack(void)
+static int receive_pack(const struct argv_array *env)
 {
struct child_process cld = CHILD_PROCESS_INIT;
argv_array_push(, "receive-pack");
+
+   argv_array_pushv(_array, env->argv);
+
return run_service_command();
 }
 
@@ -574,7 +583,7 @@ static void canonicalize_client(struct strbuf *out, const 
char *in)
 /*
  * Read the host as supplied by the client connection.
  */
-static void parse_host_arg(struct hostinfo *hi, char *extra_args, int buflen)
+static char *parse_host_arg(struct hostinfo *hi, char *extra_args, int buflen)
 {
char *val;
int vallen;
@@ -602,6 +611,39 @@ static void parse_host_arg(struct hostinfo *hi, char 
*extra_args, int buflen)
if (extra_args < end && *extra_args)
die("Invalid request");
}
+
+   return extra_args;
+}
+
+static void parse_extra_args(struct argv_array *env, const char *extra_args,
+int buflen)
+{
+   const char *end = extra_args + buflen;
+   struct strbuf git_protocol = STRBUF_INIT;
+
+   for (; extra_args < end; extra_args += strlen(extra_args) + 1) {
+   const char *arg = extra_args;
+
+   /*
+* Parse the extra arguments, adding most to 'git_protocol'
+* which will be used to set the 'GIT_PROTOCOL' envvar in the
+* service that will be run.
+*
+* If there ends up being a particular arg in the future that
+* git-daemon needs to parse specificly (like the 'host' arg)
+* then it can be parsed here and not added to 'git_protocol'.
+*/
+   if (*arg) {
+   if (git_protocol.len > 0)
+   strbuf_addch(_protocol, ':');
+   strbuf_addstr(_protocol, arg);
+   }
+   }
+
+   if (git_protocol.len > 0)
+   argv_array_pushf(env, GIT_PROTOCOL_ENVIRONMENT "=%s",
+git_protocol.buf);
+   strbuf_release(_protocol);
 }
 
 /*
@@ -695,6 +737,7 @@ static int execute(void)
int pktlen, len, i;
char *addr = getenv("REMOTE_ADDR"), *port = getenv("REMOTE_PORT");
struct hostinfo hi;
+   struct argv_array env = ARGV_ARRAY_INIT;
 
hostinfo_init();
 

[PATCH 2/8] protocol: introduce protocol extention mechanisms

2017-09-13 Thread Brandon Williams
Create protocol.{c,h} and provide functions which future servers and
clients can use to determine which protocol to use or is being used.

Also introduce the 'GIT_PROTOCOL' environment variable which will be
used to communicate a colon separated list of keys with optional values
to a server.  Unknown keys and values must be tolerated.  This mechanism
is used to communicate which version of the wire protocol a client would
like to use with a server.

Signed-off-by: Brandon Williams 
---
 Documentation/config.txt | 16 +++
 Documentation/git.txt|  5 
 Makefile |  1 +
 cache.h  |  7 +
 protocol.c   | 72 
 protocol.h   | 15 ++
 6 files changed, 116 insertions(+)
 create mode 100644 protocol.c
 create mode 100644 protocol.h

diff --git a/Documentation/config.txt b/Documentation/config.txt
index dc4e3f58a..d5b28a32c 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -2517,6 +2517,22 @@ The protocol names currently used by git are:
 `hg` to allow the `git-remote-hg` helper)
 --
 
+protocol.version::
+   If set, clients will attempt to communicate with a server using
+   the specified protocol version.  If unset, no attempt will be
+   made by the client to communicate using a particular protocol
+   version, this results in protocol version 0 being used.
+   Supported versions:
++
+--
+
+* `0` - the original wire protocol.
+
+* `1` - the original wire protocol with the addition of a version string
+  in the initial respose from the server.
+
+--
+
 pull.ff::
By default, Git does not create an extra merge commit when merging
a commit that is a descendant of the current commit. Instead, the
diff --git a/Documentation/git.txt b/Documentation/git.txt
index 6e3a6767e..299f75c7b 100644
--- a/Documentation/git.txt
+++ b/Documentation/git.txt
@@ -697,6 +697,11 @@ of clones and fetches.
which feed potentially-untrusted URLS to git commands.  See
linkgit:git-config[1] for more details.
 
+`GIT_PROTOCOL`::
+   For internal use only.  Used in handshaking the wire protocol.
+   Contains a colon ':' separated list of keys with optional values
+   'key[=value]'.  Presence of unknown keys must be tolerated.
+
 Discussion[[Discussion]]
 
 
diff --git a/Makefile b/Makefile
index f2bb7f2f6..1f300bd6b 100644
--- a/Makefile
+++ b/Makefile
@@ -837,6 +837,7 @@ LIB_OBJS += pretty.o
 LIB_OBJS += prio-queue.o
 LIB_OBJS += progress.o
 LIB_OBJS += prompt.o
+LIB_OBJS += protocol.o
 LIB_OBJS += quote.o
 LIB_OBJS += reachable.o
 LIB_OBJS += read-cache.o
diff --git a/cache.h b/cache.h
index a916bc79e..8839b1ed4 100644
--- a/cache.h
+++ b/cache.h
@@ -444,6 +444,13 @@ static inline enum object_type object_type(unsigned int 
mode)
 #define GIT_ICASE_PATHSPECS_ENVIRONMENT "GIT_ICASE_PATHSPECS"
 #define GIT_QUARANTINE_ENVIRONMENT "GIT_QUARANTINE_PATH"
 
+/*
+ * Environment variable used in handshaking the wire protocol.
+ * Contains a colon ':' separated list of keys with optional values
+ * 'key[=value]'.  Presence of unknown keys must be tolerated.
+ */
+#define GIT_PROTOCOL_ENVIRONMENT "GIT_PROTOCOL"
+
 /*
  * This environment variable is expected to contain a boolean indicating
  * whether we should or should not treat:
diff --git a/protocol.c b/protocol.c
new file mode 100644
index 0..1b16c7b9a
--- /dev/null
+++ b/protocol.c
@@ -0,0 +1,72 @@
+#include "cache.h"
+#include "config.h"
+#include "protocol.h"
+
+enum protocol_version parse_protocol_version(const char *value)
+{
+   if (!strcmp(value, "0"))
+   return protocol_v0;
+   else if (!strcmp(value, "1"))
+   return protocol_v1;
+   else
+   return protocol_unknown_version;
+}
+
+enum protocol_version get_protocol_version_config(void)
+{
+   const char *value;
+   if (!git_config_get_string_const("protocol.version", )) {
+   enum protocol_version version = parse_protocol_version(value);
+
+   if (version == protocol_unknown_version)
+   die("unknown value for config 'protocol.version': %s",
+   value);
+
+   return version;
+   }
+
+   return protocol_v0;
+}
+
+enum protocol_version determine_protocol_version_server(void)
+{
+   const char *git_protocol = getenv(GIT_PROTOCOL_ENVIRONMENT);
+   enum protocol_version version = protocol_v0;
+
+   if (git_protocol) {
+   struct string_list list = STRING_LIST_INIT_DUP;
+   const struct string_list_item *item;
+   string_list_split(, git_protocol, ':', -1);
+
+   for_each_string_list_item(item, ) {
+   const char *value;
+   enum protocol_version v;
+
+   if (skip_prefix(item->string, "version=", )) {
+ 

Re: RFC v3: Another proposed hash function transition plan

2017-09-13 Thread Junio C Hamano
Jonathan Nieder  writes:

> Treating generation numbers as derived data (as in Jeff King's
> preferred design, if I have understood his replies correctly) would
> also be possible but it does not interact well with shallow clone or
> narrow clone.

Just like we have skewed committer timestamps, there is no reason to
believe that generation numbers embedded in objects are trustable,
and there is no way for narrow clients to even verify their correctness.

So I agree with Peff that having generation numbers in object is
pointless; I agree any other derivables like corresponding sha-1
name is also pointless to have.

This is a tangent, but it may be fine for a shallow clone to treat
the cut-off points in the history as if they are root commits and
compute generation numbers locally, just like everybody else does.
As generation numbers won't have to be global (because we will not
be embedding them in objects), nobody gets hurt if they do not match
across repositories---just like often-mentioned rename detection
cache, it can be kept as a mere local performance aid and does not
have to participate in the object model.

> All that said, for simplicity I still lean against including
> generation numbers as part of a hash function transition.

Good.

> This is unrelated to Brandon's message, except for his use of SHA3 as
> a placeholder for "the next hash function".
>
> My assumption based on previous conversations (and other external
> conversations like [1]) is that we are going to use SHA2-256 and have
> a pretty strong consensus for that.  Don't worry!

Hmph, I actually re-read the thread recently, and my impression was
that we didn't quite have a consensus but were leaning towards
SHA3-256.

I do not personally have a strong preference myself and I would say
that anything will do as long as it is with good longevity and
availability.  SHA2 family would be a fine choice due to its age on
both counts, being scrutinized longer and having a chance to be
implemented in many places, even though its age itself may have to
be subtracted from the longevity factor.

Thanks.


Re: [PATCH 00/20] Read `packed-refs` using mmap()

2017-09-13 Thread Junio C Hamano
Michael Haggerty  writes:

> Following lots of work to extract the `packed_ref_store` into a
> separate module and decouple it from the `files_ref_store`, it is now
> possible to fundamentally change how the `packed-refs` file is read.
>
> * `mmap()` the whole file rather than `read()`ing it.
>
> * Instead of parsing the complete file at once into a `ref_cache`,
>   parse the references out of the file contents on demand.
>
> * Use a binary search to find, very quickly, the reference or group of
>   references that needs to be read. Parse *only* those references out
>   of the file contents, without creating in-memory data structures at
>   all.

Oooh, juicy and very exciting.

> ...
> This branch applies on top of mh/packed-ref-transactions. It can also
> be obtained from my git fork [1] as branch `mmap-packed-refs`.
>
> Michael
>
> [1] https://github.com/mhagger/git

Thanks for working on this. I expect that it will take more than a
few hours for me to pick it up, but I am looking forward to it.



Re: [PATCH v2] config: avoid "write_in_full(fd, buf, len) < len" pattern

2017-09-13 Thread Junio C Hamano
Jonathan Nieder  writes:

> Jeff King wrote:
>
>> What I missed is that copy_begin and copy_end here are actually size_t
>> variables, not the pointers. Sorry for the confusion, and here's an
>> updated version of the patch with this paragraph amended (the patch
>> itself is identical):
>
> Subtle.  The world makes more sense now.  Thanks for figuring it out.
> ...
> For what it's worth,
> Reviewed-by: Jonathan Nieder 

Thanks, both. It seems that I missed all the fun ;-)


Re: [PATCH 7/7] config: flip return value of store_write_*()

2017-09-13 Thread Jonathan Nieder
Hi,

Jeff King wrote:

> The store_write_section() and store_write_pairs() functions
> are basically high-level wrappers around write(). But their
> return values are flipped from our usual convention, using
> "1" for success and "0" for failure.
>
> Let's flip them to follow the usual write() conventions and
> update all callers. As these are local to config.c, it's
> unlikely that we'd have new callers in any topics in flight
> (which would be silently broken by our change). But just to
> be on the safe side, let's rename them to just
> write_section() and write_pairs().  That also accentuates
> their relationship with write().
>
> Signed-off-by: Jeff King 

The caller only cares if it succeeded, right?  Could this return
the customary 0 vs -1 instead of the number of bytes written?

That would look like the following (as a patch to squash in):

With or without that tweak,
Reviewed-by: Jonathan Nieder 

diff --git i/config.c w/config.c
index 272a32aac0..8f92d452bf 100644
--- i/config.c
+++ w/config.c
@@ -2297,11 +2297,10 @@ static int write_error(const char *filename)
return 4;
 }
 
-static ssize_t write_section(int fd, const char *key)
+static int write_section(int fd, const char *key)
 {
const char *dot;
-   int i;
-   ssize_t ret;
+   int i, ret;
struct strbuf sb = STRBUF_INIT;
 
dot = memchr(key, '.', store.baselen);
@@ -2317,16 +2316,15 @@ static ssize_t write_section(int fd, const char *key)
strbuf_addf(, "[%.*s]\n", store.baselen, key);
}
 
-   ret = write_in_full(fd, sb.buf, sb.len);
+   ret = write_in_full(fd, sb.buf, sb.len) < 0 ? -1 : 0;
strbuf_release();
 
return ret;
 }
 
-static ssize_t write_pair(int fd, const char *key, const char *value)
+static int write_pair(int fd, const char *key, const char *value)
 {
-   int i;
-   ssize_t ret;
+   int i, ret;
int length = strlen(key + store.baselen + 1);
const char *quote = "";
struct strbuf sb = STRBUF_INIT;
@@ -2366,7 +2364,7 @@ static ssize_t write_pair(int fd, const char *key, const 
char *value)
}
strbuf_addf(, "%s\n", quote);
 
-   ret = write_in_full(fd, sb.buf, sb.len);
+   ret = write_in_full(fd, sb.buf, sb.len) < 0 ? -1 : 0;
strbuf_release();
 
return ret;
@@ -2499,8 +2497,7 @@ int git_config_set_multivar_in_file_gently(const char 
*config_filename,
}
 
store.key = (char *)key;
-   if (write_section(fd, key) < 0 ||
-   write_pair(fd, key, value) < 0)
+   if (write_section(fd, key) || write_pair(fd, key, value))
goto write_err_out;
} else {
struct stat st;
@@ -2623,10 +2620,10 @@ int git_config_set_multivar_in_file_gently(const char 
*config_filename,
/* write the pair (value == NULL means unset) */
if (value != NULL) {
if (store.state == START) {
-   if (write_section(fd, key) < 0)
+   if (write_section(fd, key))
goto write_err_out;
}
-   if (write_pair(fd, key, value) < 0)
+   if (write_pair(fd, key, value))
goto write_err_out;
}
 
@@ -2820,7 +2817,7 @@ int git_config_rename_section_in_file(const char 
*config_filename,
continue;
}
store.baselen = strlen(new_name);
-   if (write_section(out_fd, new_name) < 0) {
+   if (write_section(out_fd, new_name)) {
ret = 
write_error(get_lock_file_path(lock));
goto out;
}


Re: [PATCH 6/7] notes-merge: use ssize_t for write_in_full() return value

2017-09-13 Thread Jonathan Nieder
Jeff King wrote:

> We store the return value of write_in_full() in a long,
> though the return is actually an ssize_t. This probably
> doesn't matter much in practice (since the buffer size is
> alredy an unsigned long), but it might if the size if
> between what can be represented in "long" and "unsigned
> long", and if your size_t is larger than a "long" (as it is
> on 64-bit Windows).
>
> Signed-off-by: Jeff King 
> ---
>  notes-merge.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Yikes.  Good catch.

With or without the tweak below,
Reviewed-by: Jonathan Nieder 

> --- a/notes-merge.c
> +++ b/notes-merge.c
> @@ -302,7 +302,7 @@ static void write_buf_to_worktree(const struct object_id 
> *obj,
>   fd = xopen(path, O_WRONLY | O_EXCL | O_CREAT, 0666);
>  
>   while (size > 0) {
> - long ret = write_in_full(fd, buf, size);
> + ssize_t ret = write_in_full(fd, buf, size);
>   if (ret < 0) {
>   /* Ignore epipe */
>   if (errno == EPIPE)
>   break;
>   die_errno("notes-merge");
>   } else if (!ret) {
>   die("notes-merge: disk full?");
>   }

These three lines are dead code.  How about the following, e.g. for
squashing in?

diff --git i/notes-merge.c w/notes-merge.c
index 597d43f65c..4352c34a6e 100644
--- i/notes-merge.c
+++ w/notes-merge.c
@@ -308,8 +308,6 @@ static void write_buf_to_worktree(const struct object_id 
*obj,
if (errno == EPIPE)
break;
die_errno("notes-merge");
-   } else if (!ret) {
-   die("notes-merge: disk full?");
}
size -= ret;
buf += ret;


Re: [PATCH 5/7] pkt-line: check write_in_full() errors against "< 0"

2017-09-13 Thread Jonathan Nieder
Jeff King wrote:

> As with the previous two commits, we prefer to check
> write_in_full()'s return value to see if it is negative,
> rather than comparing it to the input length.
>
> These cases actually flip the logic to check for success,
> making conversion a little different than in other cases. We
> could of course write:
>
>   if (write_in_full(...) >= 0)
>   return 0;
>  return error(...);
>
> But our usual method of spelling write() error checks is
> just "< 0". So let's flip the logic for each of these
> conditionals to our usual style.
>
> Signed-off-by: Jeff King 
> ---
>  pkt-line.c | 29 ++---
>  1 file changed, 14 insertions(+), 15 deletions(-)

Reviewed-by: Jonathan Nieder 


Re: [PATCH 4/7] convert less-trivial versions of "write_in_full() != len"

2017-09-13 Thread Jonathan Nieder
Jeff King wrote:

> The prior commit converted many sites to check the return
> value of write_in_full() for negativity, rather than a
> mismatch with the input length. This patch covers similar
> cases, but where the return value is stored in an
> intermediate variable. These should get the same treatment,
> but they need to be reviewed more carefully since it would
> be a bug if the return value is stored in an unsigned type
> (which indeed, it is in one of the cases).
>
> Signed-off-by: Jeff King 
> ---
>  entry.c  | 5 +++--
>  refs/files-backend.c | 2 +-
>  streaming.c  | 2 +-
>  3 files changed, 5 insertions(+), 4 deletions(-)

Reviewed-by: Jonathan Nieder 


Re: [PATCH 3/7] avoid "write_in_full(fd, buf, len) != len" pattern

2017-09-13 Thread Jonathan Nieder
Jeff King wrote:

> The return value of write_in_full() is either "-1", or the
> requested number of bytes[1]. If we make a partial write
> before seeing an error, we still return -1, not a partial
> value. This goes back to f6aa66cb95 (write_in_full: really
> write in full or return error on disk full., 2007-01-11).
>
> So checking anything except "was the return value negative"
> is pointless. And there are a couple of reasons not to do
> so:
>
>   1. It can do a funny signed/unsigned comparison. If your
[...]
>   2. Checking for a negative value is shorter to type,
>  especially when the length is an expression.
>
>   3. Linus says so. In d34cf19b89 (Clean up write_in_full()
>  users, 2007-01-11), right after the write_in_full()
>  semantics were changed, he wrote:
>
>I really wish every "write_in_full()" user would just
>check against "<0" now, but this fixes the nasty and
>stupid ones.

Ok, you convinced me.

Should we add a comment to cache.h as well encouraging this?

[...]
> [1] A careful reader may notice there is one way that
> write_in_full() can return a different value. If we ask
> write() to write N bytes and get a return value that is
> _larger_ than N, we could return a larger total. But
> besides the fact that this would imply a totally broken
> version of write(), it would already invoke undefined
> behavior. Our internal remaining counter is an unsigned
> size_t, which means that subtracting too many byte will
> wrap it around to a very large number. So we'll instantly
> begin reading off the end of the buffer, trying to write
> gigabytes (or petabytes) of data.

This footnote just leaves me more confused, since as you mention,
write() never would return a value greater than N.  Are you saying we
need to defend against a broken platform where that isn't true?

> Signed-off-by: Jeff King 
> ---
>  builtin/receive-pack.c | 2 +-
>  builtin/rerere.c   | 2 +-
>  builtin/unpack-file.c  | 2 +-
>  config.c   | 4 ++--
>  diff.c | 2 +-
>  fast-import.c  | 2 +-
>  http-backend.c | 4 ++--
>  ll-merge.c | 2 +-
>  read-cache.c   | 6 +++---
>  refs.c | 2 +-
>  refs/files-backend.c   | 8 
>  rerere.c   | 2 +-
>  shallow.c  | 6 +++---
>  t/helper/test-delta.c  | 2 +-
>  transport-helper.c | 5 ++---
>  wrapper.c  | 2 +-
>  16 files changed, 26 insertions(+), 27 deletions(-)

All of these look correctly done.

Reviewed-by: Jonathan Nieder 


Re: [PATCH 2/7] get-tar-commit-id: check write_in_full() return against 0

2017-09-13 Thread Jonathan Nieder
Hi,

Jeff King wrote:

> We ask to write 41 bytes and make sure that the return value
> is at least 41. This is the same "dangerous" pattern that
> was fixed in the prior commit (wherein a negative return
> value is promoted to unsigned), though it is not dangerous
> here because our "41" is a constant, not an unsigned
> variable.
>
> But we should convert it anyway to avoid modeling a
> dangerous construct.
>
> Signed-off-by: Jeff King 
> ---
>  builtin/get-tar-commit-id.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)

I kind of disagree with calling this dangerous (and I think that is
what you alluded to above by putting it in quotes), but I like the
postimage more than the preimage.

The variable 'n' could be eliminated to simplify this further.  I
realize that would go against the spirit of this patch, but (1) it's
on-topic for the patch, since it is another ssize_t vs constant
comparison and (2) as mentioned elsewhere in this thread, it's a very
common idiom with read_in_full.  If we want to eliminate it then we
could introduce a separate helper to distinguish between
read_this_much_i_mean_it and read_this_much_or_to_eof.

Anyway, I think the below is an improvement.

With or without this tweak,
Reviewed-by: Jonathan Nieder 

Thanks.

diff --git i/builtin/get-tar-commit-id.c w/builtin/get-tar-commit-id.c
index 6d9a79f9b3..03ef7c5ba4 100644
--- i/builtin/get-tar-commit-id.c
+++ w/builtin/get-tar-commit-id.c
@@ -20,13 +20,11 @@ int cmd_get_tar_commit_id(int argc, const char **argv, 
const char *prefix)
struct ustar_header *header = (struct ustar_header *)buffer;
char *content = buffer + RECORDSIZE;
const char *comment;
-   ssize_t n;
 
if (argc != 1)
usage(builtin_get_tar_commit_id_usage);
 
-   n = read_in_full(0, buffer, HEADERSIZE);
-   if (n < HEADERSIZE)
+   if (read_in_full(0, buffer, HEADERSIZE) < HEADERSIZE)
die("git get-tar-commit-id: read error");
if (header->typeflag[0] != 'g')
return 1;


Re: [PATCH v2] config: avoid "write_in_full(fd, buf, len) < len" pattern

2017-09-13 Thread Jonathan Nieder
Jeff King wrote:

> AFAIK there's no way to turn it on for specific functions, but I'm far
> from a gcc-warning guru. Even if you could, though, the error may be far
> from the function (e.g., if we store the result in an ssize_t and then
> compare that with a size_t).

It turns out that yes, we have two of those.  Both handle errors
separately already, so they should be safe.

While investigating the second, I noticed that read_in_full can
overflow its return value.  malloc doesn't typically allow allocating
a buffer with size greater than SSIZE_MAX so this should be safe, but
it would be confusing to static analyzers.

Combining these observations yields the following (just for
illustration):

diff --git i/bulk-checkin.c w/bulk-checkin.c
index 9a1f6c49ab..f8e3060041 100644
--- i/bulk-checkin.c
+++ w/bulk-checkin.c
@@ -114,7 +114,7 @@ static int stream_to_pack(struct bulk_checkin_state *state,
unsigned char ibuf[16384];
 
if (size && !s.avail_in) {
-   ssize_t rsize = size < sizeof(ibuf) ? size : 
sizeof(ibuf);
+   size_t rsize = size < sizeof(ibuf) ? size : 
sizeof(ibuf);
if (read_in_full(fd, ibuf, rsize) != rsize)
die("failed to read %d bytes from '%s'",
(int)rsize, path);
diff --git i/combine-diff.c w/combine-diff.c
index 9e163d5ada..e966d4f393 100644
--- i/combine-diff.c
+++ w/combine-diff.c
@@ -1026,7 +1026,7 @@ static void show_patch_diff(struct combine_diff_path 
*elem, int num_parent,
result_size = fill_textconv(textconv, df, );
free_filespec(df);
} else if (0 <= (fd = open(elem->path, O_RDONLY))) {
-   size_t len = xsize_t(st.st_size);
+   ssize_t len = xssize_t(st.st_size);
ssize_t done;
int is_file, i;
 
@@ -1040,6 +1040,8 @@ static void show_patch_diff(struct combine_diff_path 
*elem, int num_parent,
if (!is_file)
elem->mode = canon_mode(S_IFLNK);
 
+   if (len > ULONG_MAX)
+   die("cannot handle files this big");
result_size = len;
result = xmallocz(len);
 
diff --git i/git-compat-util.h w/git-compat-util.h
index 6678b488cc..20fea81589 100644
--- i/git-compat-util.h
+++ w/git-compat-util.h
@@ -903,6 +903,13 @@ static inline size_t xsize_t(off_t len)
return (size_t)len;
 }
 
+static inline ssize_t xssize_t(off_t len)
+{
+   if (len > SSIZE_MAX)
+   die("cannot handle files this big");
+   return (ssize_t)len;
+}
+
 __attribute__((format (printf, 3, 4)))
 extern int xsnprintf(char *dst, size_t max, const char *fmt, ...);
 
diff --git i/wrapper.c w/wrapper.c
index 36630e5d18..2b52b7367d 100644
--- i/wrapper.c
+++ w/wrapper.c
@@ -314,6 +314,9 @@ ssize_t read_in_full(int fd, void *buf, size_t count)
char *p = buf;
ssize_t total = 0;
 
+   if (count > SSIZE_MAX)
+   BUG("read_in_full called with absurdly high count %"PRIuMAX,
+   (uintmax_t) count);
while (count > 0) {
ssize_t loaded = xread(fd, p, count);
if (loaded < 0)


Re: [PATCH] test-lib: don't use ulimit in test prerequisites on cygwin

2017-09-13 Thread Jonathan Nieder
Ramsay Jones wrote:

> On cygwin (and MinGW), the 'ulimit' built-in bash command does not have
> the desired effect of limiting the resources of new processes, at least
> for the stack and file descriptors. However, it always returns success
> and leads to several test prerequisites being erroneously set to true.
>
> Add a check for cygwin and MinGW to the prerequisite expressions, using
> 'uname -s', and return false instead of (indirectly) using 'ulimit'.
> This affects the prerequisite expressions for the ULIMIT_STACK_SIZE,
> CMDLINE_LIMIT and ULIMIT_FILE_DESCRIPTORS prerequisites.
>
> Signed-off-by: Ramsay Jones 
> ---
>  t/t1400-update-ref.sh | 11 ++-
>  t/t6120-describe.sh   |  1 -
>  t/t7004-tag.sh|  1 -
>  t/test-lib.sh | 22 --
>  4 files changed, 30 insertions(+), 5 deletions(-)

Reviewed-by: Jonathan Nieder 

Thanks.

An alternative would be to do some more explicit feature-based test
like using "ulimit" to set a limit and then reading back the limit in
a separate process, but that feels like overkill.

Sincerely,
Jonathan


Re: [PATCH v2] config: avoid "write_in_full(fd, buf, len) < len" pattern

2017-09-13 Thread Jonathan Nieder
Jeff King wrote:
> On Wed, Sep 13, 2017 at 11:24:31AM -0700, Jonathan Nieder wrote:

>> Compilers' signed/unsigned comparison warning can be noisy, but I'm
>> starting to feel it's worth the suppression noise to turn it on when
>> DEVELOPER=1 anyway.  What do you think?  Is there a way to turn it on
>> selectively for certain functions on the LHS (like read() and write()
>> style functions)?
>
> Obviously we couldn't turn them on for -Werror at this point. Let me see
> how bad "-Wsign-compare -Wno-error=sign-compare" is.
>
>   $ make 2>&1 | grep -c warning:
>   1391
>
> Ouch. I'm afraid that's probably not going to be very helpful. Even if I
> were introducing a new problem, I'm not likely to see it amidst the mass
> of existing complaints.

Yup.

Something like http://coccinelle.lip6.fr/rules/find_unsigned.html (but
not that one precisely, since it's not smart enough to follow typedefs
like size_t!) can do this for specific functions.  Then "make
coccicheck" would report new problematic usages.  Let me know if this
seems worth pursuing, and I'll send a patch based on the spatch I used
to review your patch 8.

Using coccinelle for this is a kind of oversized hammer.  I wonder if
e.g. sparse would be able to do a better job.

> AFAIK there's no way to turn it on for specific functions, but I'm far
> from a gcc-warning guru. Even if you could, though, the error may be far
> from the function (e.g., if we store the result in an ssize_t and then
> compare that with a size_t).

If we have a check that catches the obvious cases then I'm already
pretty happy.

Though checking ssize_t vs size_t comparisons in general might also
not be a bad idea, since it would be narrower than the general
-Wsign-compare check.

Thanks,
Jonathan


Re: [PATCH 8/7] read_pack_header: handle signed/unsigned comparison in read result

2017-09-13 Thread Jonathan Nieder
Jeff King wrote:

> Subject: [PATCH] read_pack_header: handle signed/unsigned comparison in read  
> result
>
> The result of read_in_full() may be -1 if we saw an error.
> But in comparing it to a sizeof() result, that "-1" will be
> promoted to size_t. In fact, the largest possible size_t
> which is much bigger than our struct size. This means that
> our "< sizeof(header)" error check won't trigger.
>
> In practice, we'd go on to read uninitialized memory and
> compare it to the PACK signature, which is likely to fail.
> But we shouldn't get there.
>
> We can fix this by making a direct "!=" comparison to the
> requested size, rather than "<". This means that errors get
> lumped in with short reads, but that's sufficient for our
> purposes here. There's no PH_ERROR tp represent our case.
> And anyway, this function reads from pipes and network
> sockets. A network error may racily appear as EOF to us
> anyway if there's data left in the socket buffers.
>
> Signed-off-by: Jeff King 
> ---
>  sha1_file.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

This patch 8 (but not patches 2-7) is
Reviewed-by: Jonathan Nieder 

Thanks.

An alternative would be to use something like

< (int) sizeof(*header)

to force it to be signed, but I think this is clearer.

Using the following semantic patch, I find this and the example from
patch 1 and no others:

  @@
  expression fd;
  expression buf;
  expression len;
  size_t rhs;
  @@
  -read_in_full(fd, buf, len) < rhs
  +ERROR()

  @@
  expression fd;
  expression buf;
  expression len;
  size_t rhs;
  @@
  -write_in_full(fd, buf, len) < rhs
  +ERROR()

Thanks,
Jonathan

> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -1850,7 +1850,7 @@ int index_path(struct object_id *oid, const char *path, 
> struct stat *st, unsigne
>  
>  int read_pack_header(int fd, struct pack_header *header)
>  {
> - if (read_in_full(fd, header, sizeof(*header)) < sizeof(*header))
> + if (read_in_full(fd, header, sizeof(*header)) != sizeof(*header))
>   /* "eof before pack header was fully read" */
>   return PH_ERROR_EOF;
>  


[PATCH] test-lib: don't use ulimit in test prerequisites on cygwin

2017-09-13 Thread Ramsay Jones

On cygwin (and MinGW), the 'ulimit' built-in bash command does not have
the desired effect of limiting the resources of new processes, at least
for the stack and file descriptors. However, it always returns success
and leads to several test prerequisites being erroneously set to true.

Add a check for cygwin and MinGW to the prerequisite expressions, using
'uname -s', and return false instead of (indirectly) using 'ulimit'.
This affects the prerequisite expressions for the ULIMIT_STACK_SIZE,
CMDLINE_LIMIT and ULIMIT_FILE_DESCRIPTORS prerequisites.

Signed-off-by: Ramsay Jones 
---

Hi Michael,

So, I finally got around to testing ulimit with open files and
found that it does not limit them on cygwin (it actually fails
on the hard limit of 3200). Having seen Johannes email earlier,
I took a chance and added MinGW to this patch as well. (Hopefully
this won't cause any screams!) ;-)

So, this patch was developed on top of the 'pu' branch, but it
also applies cleanly to your 'mg/name-rev-tests-with-short-stack'
branch (ie. on top of commit 31625b34c0).

If you are going to re-roll your branch, could you please add this
on top (after squashing the two comment deletions into your earlier
patches, maybe).

Thanks!

ATB,
Ramsay Jones

 t/t1400-update-ref.sh | 11 ++-
 t/t6120-describe.sh   |  1 -
 t/t7004-tag.sh|  1 -
 t/test-lib.sh | 22 --
 4 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/t/t1400-update-ref.sh b/t/t1400-update-ref.sh
index dc98b4bc6..49415074f 100755
--- a/t/t1400-update-ref.sh
+++ b/t/t1400-update-ref.sh
@@ -1253,7 +1253,16 @@ run_with_limited_open_files () {
(ulimit -n 32 && "$@")
 }
 
-test_lazy_prereq ULIMIT_FILE_DESCRIPTORS 'run_with_limited_open_files true'
+test_lazy_prereq ULIMIT_FILE_DESCRIPTORS '
+   case $(uname -s) in
+   CYGWIN*|MINGW*)
+   false
+   ;;
+   *)
+   run_with_limited_open_files true
+   ;;
+   esac
+'
 
 test_expect_success ULIMIT_FILE_DESCRIPTORS 'large transaction creating 
branches does not burst open file limit' '
 (
diff --git a/t/t6120-describe.sh b/t/t6120-describe.sh
index dd6dd9df9..3d45dc295 100755
--- a/t/t6120-describe.sh
+++ b/t/t6120-describe.sh
@@ -279,7 +279,6 @@ test_expect_success 'describe ignoring a borken submodule' '
grep broken out
 '
 
-# we require ulimit, this excludes Windows
 test_expect_failure ULIMIT_STACK_SIZE 'name-rev works in a deep repo' '
i=1 &&
while test $i -lt 8000
diff --git a/t/t7004-tag.sh b/t/t7004-tag.sh
index 5bf5ace56..b545c33f8 100755
--- a/t/t7004-tag.sh
+++ b/t/t7004-tag.sh
@@ -1863,7 +1863,6 @@ test_expect_success 'version sort with very long 
prerelease suffix' '
git tag -l --sort=version:refname
 '
 
-# we require ulimit, this excludes Windows
 test_expect_success ULIMIT_STACK_SIZE '--contains and --no-contains work in a 
deep repo' '
>expect &&
i=1 &&
diff --git a/t/test-lib.sh b/t/test-lib.sh
index 83f5d3dd2..250ba5e9b 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -1170,13 +1170,31 @@ run_with_limited_cmdline () {
(ulimit -s 128 && "$@")
 }
 
-test_lazy_prereq CMDLINE_LIMIT 'run_with_limited_cmdline true'
+test_lazy_prereq CMDLINE_LIMIT '
+   case $(uname -s) in
+   CYGWIN*|MINGW*)
+   false
+   ;;
+   *)
+   run_with_limited_cmdline true
+   ;;
+   esac
+'
 
 run_with_limited_stack () {
(ulimit -s 128 && "$@")
 }
 
-test_lazy_prereq ULIMIT_STACK_SIZE 'run_with_limited_stack true'
+test_lazy_prereq ULIMIT_STACK_SIZE '
+   case $(uname -s) in
+   CYGWIN*|MINGW*)
+   false
+   ;;
+   *)
+   run_with_limited_stack true
+   ;;
+   esac
+'
 
 build_option () {
git version --build-options |
-- 
2.14.0


Re: [PATCH v2] config: avoid "write_in_full(fd, buf, len) < len" pattern

2017-09-13 Thread Jeff King
On Wed, Sep 13, 2017 at 11:24:31AM -0700, Jonathan Nieder wrote:

> For what it's worth,
> Reviewed-by: Jonathan Nieder 

Thanks, and thank you for questioning the ptrdiff_t bits that didn't
make sense. I _thought_ they felt like nonsense, but couldn't quite
puzzle it out.

> Compilers' signed/unsigned comparison warning can be noisy, but I'm
> starting to feel it's worth the suppression noise to turn it on when
> DEVELOPER=1 anyway.  What do you think?  Is there a way to turn it on
> selectively for certain functions on the LHS (like read() and write()
> style functions)?

Obviously we couldn't turn them on for -Werror at this point. Let me see
how bad "-Wsign-compare -Wno-error=sign-compare" is.

  $ make 2>&1 | grep -c warning:
  1391

Ouch. I'm afraid that's probably not going to be very helpful. Even if I
were introducing a new problem, I'm not likely to see it amidst the mass
of existing complaints.

AFAIK there's no way to turn it on for specific functions, but I'm far
from a gcc-warning guru. Even if you could, though, the error may be far
from the function (e.g., if we store the result in an ssize_t and then
compare that with a size_t).

Interestingly in the write_in_full() case the code was actually
unreachable! But I don't think the compiler is smart enough to know
that, because it would have to realize that write_in_full() can only
return either -1 or the original value (and not a positive value in
between). Coverity _might_ be smart enough, but figuring that out does
take some loop analysis (including the assumption that write() never
returns more bytes than we asked it to write).

-Peff


BUSINESS PROPOSAL

2017-09-13 Thread LING LUNG
Please I   like you to keep this proposal as a top secret and delete it if you 
are not interested and get back to 
me if you are interested for  details as regards to the transfer of $24,500,000 
to you. This money initially 
belongs to a Libyan client who died in the libya crisis  and had no next of kin 
in his account-opening 
package in my bank here in Hong kong where I am a bank director. In other to 
achieve this, I shall 
require your full name, and telephone number to reach you.Most importantly,  a 
confirmation of 
acceptance from you will be needed after which I shall furnish you with the 
full details of this transaction.

Kindly reply to linglung0...@gmail.com

Respectfully,

Ling Lung


Re: [PATCH 2/7] get-tar-commit-id: check write_in_full() return against 0

2017-09-13 Thread Jeff King
On Wed, Sep 13, 2017 at 02:02:32PM -0400, Jeff King wrote:

> > Does read_in_full need a similar treatment?
> 
> It might actually return fewer than the requested number of bytes, so it
> can't just use "< 0" in the same way (nor be adapted to return 0 on
> success).  But I think it's still a bug to do:
> 
>   char buf[20];
>   size_t len = sizeof(buf);
>   if (read_in_full(fd, buf, len) < len)
>   die(...);
> 
> since that will promote the -1 to a size_t. So it's probably worth
> auditing.

I looked it over and found one potentially buggy case. In
read-pack_header(), we do:

  if (read_in_full(fd, header, sizeof(*header)) < sizeof(*header))
  /* "eof before pack header was fully read" */
  return PH_ERROR_EOF;

which will treat a read error as a silent success. I don't think it's
harmful in practice because the next line checks that the header
matches the PACK signature (which yes, means we're reading uninitialized
bytes, but the chances are 1 in 2^32 that they match the signature.
Unless the buffer had an old PACK signature in it, of course ;) ).

There's one other harmless "< len" check. There are a bunch of "!="
checks which are OK as-is. Converting them to something equally short is
hard, though we could introduce a helper similar to your write_fully(),
like:

  int read_exactly(int fd, char *buf, size_t len)
  {
ssize_t ret = read_in_full(fd, buf, len);
if (ret < 0)
return -1; /* real error */
else if (ret < len)
return -1; /* early eof */
return 0;
  }

But the trouble is that a _good_ caller actually handles those cases
separately to provide better error messages (by doing that same
if-cascade themselves). If those if's were turned into error() or die()
calls, then we'd actually be improving the callsites. But of course not
all of them actually print an error or die. And when they do, they
usually say something specific about _what_ they were trying to read.

So it may not be worth trying to do a mass-cleanup in the same way here.

-Peff


Re: [PATCH v2] config: avoid "write_in_full(fd, buf, len) < len" pattern

2017-09-13 Thread Jonathan Nieder
Jeff King wrote:

> What I missed is that copy_begin and copy_end here are actually size_t
> variables, not the pointers. Sorry for the confusion, and here's an
> updated version of the patch with this paragraph amended (the patch
> itself is identical):

Subtle.  The world makes more sense now.  Thanks for figuring it out.

> -- >8 --
> Subject: [PATCH] config: avoid "write_in_full(fd, buf, len) < len" pattern
> 
> The return type of write_in_full() is a signed ssize_t,
> because we may return "-1" on failure (even if we succeeded
> in writing some bytes). But "len" itself is may be an
> unsigned type (the function takes a size_t, but of course we
> may have something else in the calling function). So while
> it seems like:
>
>   if (write_in_full(fd, buf, len) < len)
>   die_errno("write error");
>
> would trigger on error, it won't if "len" is unsigned.  The
> compiler sees a signed/unsigned comparison and promotes the
> signed value, resulting in (size_t)-1, the highest possible
> size_t (or again, whatever type the caller has). This cannot
> possibly be smaller than "len", and so the conditional can
> never trigger.
>
> I scoured the code base for cases of this, but it turns out
> that these two in git_config_set_multivar_in_file_gently()
> are the only ones. Here our "len" is the difference between
> two size_t variables, making the result an unsigned size_t.
> We can fix this by just checking for a negative return value
> directly, as write_in_full() will never return any value
> except -1 or the full count.
[...]
> Reported-by: demerphq 
> Signed-off-by: Jeff King 
> ---

For what it's worth,
Reviewed-by: Jonathan Nieder 

Thank you.

Compilers' signed/unsigned comparison warning can be noisy, but I'm
starting to feel it's worth the suppression noise to turn it on when
DEVELOPER=1 anyway.  What do you think?  Is there a way to turn it on
selectively for certain functions on the LHS (like read() and write()
style functions)?

Sincerely,
Jonathan


[PATCH v2] config: avoid "write_in_full(fd, buf, len) < len" pattern

2017-09-13 Thread Jeff King
On Wed, Sep 13, 2017 at 01:11:04PM -0400, Jeff King wrote:

> I scoured the code base for cases of this, but it turns out
> that these two in git_config_set_multivar_in_file_gently()
> are the only ones. This case is actually quite interesting:
> we don't have a size_t, but rather use the subtraction of
> two pointers. Which you might think would be a signed
> ptrdiff_t, but clearly both gcc and clang treat it as
> unsigned (possibly because the conditional just above
> guarantees that the result is greater than zero).

Sorry, this is totally wrong. As Jonathan pointed out (and as I was
puzzled over here), the result of a pointer subtraction _is_ a signed
type, and works fine.

What I missed is that copy_begin and copy_end here are actually size_t
variables, not the pointers. Sorry for the confusion, and here's an
updated version of the patch with this paragraph amended (the patch
itself is identical):

-- >8 --
Subject: [PATCH] config: avoid "write_in_full(fd, buf, len) < len" pattern

The return type of write_in_full() is a signed ssize_t,
because we may return "-1" on failure (even if we succeeded
in writing some bytes). But "len" itself is may be an
unsigned type (the function takes a size_t, but of course we
may have something else in the calling function). So while
it seems like:

  if (write_in_full(fd, buf, len) < len)
die_errno("write error");

would trigger on error, it won't if "len" is unsigned.  The
compiler sees a signed/unsigned comparison and promotes the
signed value, resulting in (size_t)-1, the highest possible
size_t (or again, whatever type the caller has). This cannot
possibly be smaller than "len", and so the conditional can
never trigger.

I scoured the code base for cases of this, but it turns out
that these two in git_config_set_multivar_in_file_gently()
are the only ones. Here our "len" is the difference between
two size_t variables, making the result an unsigned size_t.
We can fix this by just checking for a negative return value
directly, as write_in_full() will never return any value
except -1 or the full count.

There's no addition to the test suite here, since you need
to convince write() to fail in order to see the problem. The
simplest reproduction recipe I came up with is to trigger
ENOSPC:

  # make a limited-size filesystem
  dd if=/dev/zero of=small.disk bs=1M count=1
  mke2fs small.disk
  mkdir mnt
  sudo mount -o loop small.disk mnt
  cd mnt
  sudo chown $USER:$USER .

  # make a config file with some content
  git config --file=config one.key value
  git config --file=config two.key value

  # now fill up the disk
  dd if=/dev/zero of=fill

  # and try to delete a key, which requires copying the rest
  # of the file to config.lock, and will fail on write()
  git config --file=config --unset two.key

That final command should (and does after this patch)
produce an error message due to the failed write, and leave
the file intact. Instead, it silently ignores the failure
and renames config.lock into place, leaving you with a
totally empty config file!

Reported-by: demerphq 
Signed-off-by: Jeff King 
---
 config.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/config.c b/config.c
index d0d8ce823a..eee4ac0355 100644
--- a/config.c
+++ b/config.c
@@ -2608,8 +2608,7 @@ int git_config_set_multivar_in_file_gently(const char 
*config_filename,
/* write the first part of the config */
if (copy_end > copy_begin) {
if (write_in_full(fd, contents + copy_begin,
- copy_end - copy_begin) <
-   copy_end - copy_begin)
+ copy_end - copy_begin) < 0)
goto write_err_out;
if (new_line &&
write_str_in_full(fd, "\n") != 1)
@@ -2631,8 +2630,7 @@ int git_config_set_multivar_in_file_gently(const char 
*config_filename,
/* write the rest of the config */
if (copy_begin < contents_sz)
if (write_in_full(fd, contents + copy_begin,
- contents_sz - copy_begin) <
-   contents_sz - copy_begin)
+ contents_sz - copy_begin) < 0)
goto write_err_out;
 
munmap(contents, contents_sz);
-- 
2.14.1.874.ge7b2e05270



Re: [PATCH 1/7] config: avoid "write_in_full(fd, buf, len) < len" pattern

2017-09-13 Thread Jeff King
On Wed, Sep 13, 2017 at 10:59:30AM -0700, Jonathan Nieder wrote:

> I can confidentally say the intent in C99 in that passage is to
> describe the type of the expression, not just the type of a variable
> that can hold it.

Doh. The problem is that I'm a moron. The copy_begin and copy_end values
are _not_ pointers, they're size_t. That's why we have to use:

  contents + copy_begin

as the buffer. So there is no ptrdiff_t involved at all, just a size_t.

-Peff


Re: [PATCH 2/7] get-tar-commit-id: check write_in_full() return against 0

2017-09-13 Thread Jeff King
On Wed, Sep 13, 2017 at 10:53:57AM -0700, Jonathan Nieder wrote:

> Jeff King wrote:
> 
> > We ask to write 41 bytes and make sure that the return value
> > is at least 41. This is the same "dangerous" pattern that
> > was fixed in the prior commit (wherein a negative return
> > value is promoted to unsigned), though it is not dangerous
> > here because our "41" is a constant, not an unsigned
> > variable.
> >
> > But we should convert it anyway to avoid modeling a
> > dangerous construct.
> 
> If the above logic is correct, then I suspect this series does not go
> far enough.  write_in_full() would be one of those APIs that is easy
> to misuse and difficult to use correctly, and if so we should fix that
> at the source instead of trying to teach callers not to hold it wrong.

Yes, this series is just removing bad examples. It doesn't do anything
to make write_in_full() less potentially dangerous.

On the other hand, it's no more or less dangerous than write(), which
has the same return-value semantics.

> E.g. what would you think of
> 
>  1. Introduce a write_fully (sorry, I am bad at names) function
> that returns 0 on success and a coccinelle semantic patch in
> contrib/coccinelle to migrate callers in "make coccicheck":

Yes, I considered that, though some callers really do care about
assigning the number of bytes written. The fact that write() has the
same problem, plus the fact that there were only 2 buggy instances
across the code base made me think there's not a huge gain to that extra
step.

> @@
> expression E;
> expression F;
> expression G;
> @@
> -write_in_full(E, F, G) < G
> +write_fully(E, F, G)
> 
>  2. Run "make coccicheck" and apply the result.
>  3. Remove the write_in_full function.

There's a step between those where you have to update all of the
write_in_full() callers that store the result. Some of them would be
trivial conversions, but some of them actually care about the length
E.g., the one in imap-send.c, which is the only one I didn't convert
away from "!= len" because it's half of an #ifdef with SSL_write()
(which uses an "int" as the return value!).

> Does read_in_full need a similar treatment?

It might actually return fewer than the requested number of bytes, so it
can't just use "< 0" in the same way (nor be adapted to return 0 on
success).  But I think it's still a bug to do:

  char buf[20];
  size_t len = sizeof(buf);
  if (read_in_full(fd, buf, len) < len)
  die(...);

since that will promote the -1 to a size_t. So it's probably worth
auditing.

-Peff


Re: [PATCH 1/7] config: avoid "write_in_full(fd, buf, len) < len" pattern

2017-09-13 Thread Jonathan Nieder
Jeff King wrote:
> On Wed, Sep 13, 2017 at 10:47:28AM -0700, Jonathan Nieder wrote:
>> Jeff King wrote:

>>> I scoured the code base for cases of this, but it turns out
>>> that these two in git_config_set_multivar_in_file_gently()
>>> are the only ones. This case is actually quite interesting:
>>> we don't have a size_t, but rather use the subtraction of
>>> two pointers. Which you might think would be a signed
>>> ptrdiff_t, but clearly both gcc and clang treat it as
>>> unsigned (possibly because the conditional just above
>>> guarantees that the result is greater than zero).
>>
>> Do you have more detail about this?  I get worried when I read
>> something like this that sounds like a compiler bug.
>>
>> C99 sayeth:
>>
>>  When two pointers are subtracted, both shall point to elements
>>  of the same array object, or one past the last element of the
>>  array object; the result is the difference of the subscripts
>>  of the two array elements. The size of the result is
>>  implementation-defined, and its type (a signed integer type)
>>  is ptrdiff_t defined in the  header.
>
> I'm not sure if it's a compiler bug or not. I read the bits about
> ptrdiff_t, and it wasn't entirely clear to me if a pointer difference
> _is_ an actual ptrdiff_t, or if it can generally be stored in one. Right
> below that text it also says:
>
>   If the result is not representable in an object of that type, the
>   behavior is undefined.

I can confidentally say the intent in C99 in that passage is to
describe the type of the expression, not just the type of a variable
that can hold it.

> That said, I might be wrong that unsigned promotion is the culprit. I
> didn't look at the generated assembly. But I also can't see what else
> would be causing the problem here. We're clearly returning "-1" and the
> condition doesn't trigger.
>
>> How can I reproduce the problem?
>
> I gave a recipe in the commit message, which is the best I came up with.
> You could probably use a fault-injection library to convince write() to
> fail. Or just tweak the source code to have write_in_full() return -1.

I wonder if a new test helper in t/helper/ would be able to do it (since
then it could e.g. control the filename that write_in_full writes to).

>>> There's no addition to the test suite here, since you need
>>> to convince write() to fail in order to see the problem. The
>>> simplest reproduction recipe I came up with is to trigger
>>> ENOSPC (this only works on Linux, obviously):
>>
>> Does /dev/full make it simpler to reproduce?
>
> I don't think so, because the write() failure is to the lockfile, which
> is created with O_EXCL. So even if you could convince "config.lock" to
> be the right device type, the open() would fail.

Hm, you're convincing me that it would indeed be worth hooking into a
fault injection framework (that e.g. uses LD_PRELOAD), but that's a
topic for another day.

Thanks,
Jonathan


Re: [PATCH 2/7] get-tar-commit-id: check write_in_full() return against 0

2017-09-13 Thread Jonathan Nieder
Jeff King wrote:

> We ask to write 41 bytes and make sure that the return value
> is at least 41. This is the same "dangerous" pattern that
> was fixed in the prior commit (wherein a negative return
> value is promoted to unsigned), though it is not dangerous
> here because our "41" is a constant, not an unsigned
> variable.
>
> But we should convert it anyway to avoid modeling a
> dangerous construct.

If the above logic is correct, then I suspect this series does not go
far enough.  write_in_full() would be one of those APIs that is easy
to misuse and difficult to use correctly, and if so we should fix that
at the source instead of trying to teach callers not to hold it wrong.

E.g. what would you think of

 1. Introduce a write_fully (sorry, I am bad at names) function
that returns 0 on success and a coccinelle semantic patch in
contrib/coccinelle to migrate callers in "make coccicheck":

@@
expression E;
expression F;
expression G;
@@
-write_in_full(E, F, G) < G
+write_fully(E, F, G)

 2. Run "make coccicheck" and apply the result.
 3. Remove the write_in_full function.

Does read_in_full need a similar treatment?

Thanks,
Jonathan


Re: [PATCH 1/7] config: avoid "write_in_full(fd, buf, len) < len" pattern

2017-09-13 Thread Jeff King
On Wed, Sep 13, 2017 at 10:47:28AM -0700, Jonathan Nieder wrote:

> Jeff King wrote:
> 
> > I scoured the code base for cases of this, but it turns out
> > that these two in git_config_set_multivar_in_file_gently()
> > are the only ones. This case is actually quite interesting:
> > we don't have a size_t, but rather use the subtraction of
> > two pointers. Which you might think would be a signed
> > ptrdiff_t, but clearly both gcc and clang treat it as
> > unsigned (possibly because the conditional just above
> > guarantees that the result is greater than zero).
> 
> Do you have more detail about this?  I get worried when I read
> something like this that sounds like a compiler bug.
> 
> C99 sayeth:
> 
>   When two pointers are subtracted, both shall point to elements
>   of the same array object, or one past the last element of the
>   array object; the result is the difference of the subscripts
>   of the two array elements. The size of the result is
>   implementation-defined, and its type (a signed integer type)
>   is ptrdiff_t defined in the  header.

I'm not sure if it's a compiler bug or not. I read the bits about
ptrdiff_t, and it wasn't entirely clear to me if a pointer difference
_is_ an actual ptrdiff_t, or if it can generally be stored in one. Right
below that text it also says:

  If the result is not representable in an object of that type, the
  behavior is undefined.

That said, I might be wrong that unsigned promotion is the culprit. I
didn't look at the generated assembly. But I also can't see what else
would be causing the problem here. We're clearly returning "-1" and the
condition doesn't trigger.

> How can I reproduce the problem?

I gave a recipe in the commit message, which is the best I came up with.
You could probably use a fault-injection library to convince write() to
fail. Or just tweak the source code to have write_in_full() return -1.

> > There's no addition to the test suite here, since you need
> > to convince write() to fail in order to see the problem. The
> > simplest reproduction recipe I came up with is to trigger
> > ENOSPC (this only works on Linux, obviously):
> 
> Does /dev/full make it simpler to reproduce?

I don't think so, because the write() failure is to the lockfile, which
is created with O_EXCL. So even if you could convince "config.lock" to
be the right device type, the open() would fail.

-Peff


Re: [PATCH 1/7] config: avoid "write_in_full(fd, buf, len) < len" pattern

2017-09-13 Thread Jonathan Nieder
Hi,

Jeff King wrote:

> I scoured the code base for cases of this, but it turns out
> that these two in git_config_set_multivar_in_file_gently()
> are the only ones. This case is actually quite interesting:
> we don't have a size_t, but rather use the subtraction of
> two pointers. Which you might think would be a signed
> ptrdiff_t, but clearly both gcc and clang treat it as
> unsigned (possibly because the conditional just above
> guarantees that the result is greater than zero).

Do you have more detail about this?  I get worried when I read
something like this that sounds like a compiler bug.

C99 sayeth:

When two pointers are subtracted, both shall point to elements
of the same array object, or one past the last element of the
array object; the result is the difference of the subscripts
of the two array elements. The size of the result is
implementation-defined, and its type (a signed integer type)
is ptrdiff_t defined in the  header.

How can I reproduce the problem?

> We can avoid the whole question by just checking for a
> negative return value directly, as write_in_full() will
> never return any value except -1 or the full count.
>
> There's no addition to the test suite here, since you need
> to convince write() to fail in order to see the problem. The
> simplest reproduction recipe I came up with is to trigger
> ENOSPC (this only works on Linux, obviously):

Does /dev/full make it simpler to reproduce?

Thanks,
Jonathan


[PATCH 7/7] config: flip return value of store_write_*()

2017-09-13 Thread Jeff King
The store_write_section() and store_write_pairs() functions
are basically high-level wrappers around write(). But their
return values are flipped from our usual convention, using
"1" for success and "0" for failure.

Let's flip them to follow the usual write() conventions and
update all callers. As these are local to config.c, it's
unlikely that we'd have new callers in any topics in flight
(which would be silently broken by our change). But just to
be on the safe side, let's rename them to just
write_section() and write_pairs().  That also accentuates
their relationship with write().

Signed-off-by: Jeff King 
---
 config.c | 28 +++-
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/config.c b/config.c
index daf093db45..f8b7b81445 100644
--- a/config.c
+++ b/config.c
@@ -2297,10 +2297,11 @@ static int write_error(const char *filename)
return 4;
 }
 
-static int store_write_section(int fd, const char *key)
+static ssize_t write_section(int fd, const char *key)
 {
const char *dot;
-   int i, success;
+   int i;
+   ssize_t ret;
struct strbuf sb = STRBUF_INIT;
 
dot = memchr(key, '.', store.baselen);
@@ -2316,15 +2317,16 @@ static int store_write_section(int fd, const char *key)
strbuf_addf(, "[%.*s]\n", store.baselen, key);
}
 
-   success = write_in_full(fd, sb.buf, sb.len) == sb.len;
+   ret = write_in_full(fd, sb.buf, sb.len);
strbuf_release();
 
-   return success;
+   return ret;
 }
 
-static int store_write_pair(int fd, const char *key, const char *value)
+static ssize_t write_pair(int fd, const char *key, const char *value)
 {
-   int i, success;
+   int i;
+   ssize_t ret;
int length = strlen(key + store.baselen + 1);
const char *quote = "";
struct strbuf sb = STRBUF_INIT;
@@ -2364,10 +2366,10 @@ static int store_write_pair(int fd, const char *key, 
const char *value)
}
strbuf_addf(, "%s\n", quote);
 
-   success = write_in_full(fd, sb.buf, sb.len) == sb.len;
+   ret = write_in_full(fd, sb.buf, sb.len);
strbuf_release();
 
-   return success;
+   return ret;
 }
 
 static ssize_t find_beginning_of_line(const char *contents, size_t size,
@@ -2497,8 +2499,8 @@ int git_config_set_multivar_in_file_gently(const char 
*config_filename,
}
 
store.key = (char *)key;
-   if (!store_write_section(fd, key) ||
-   !store_write_pair(fd, key, value))
+   if (write_section(fd, key) < 0 ||
+   write_pair(fd, key, value) < 0)
goto write_err_out;
} else {
struct stat st;
@@ -2620,10 +2622,10 @@ int git_config_set_multivar_in_file_gently(const char 
*config_filename,
/* write the pair (value == NULL means unset) */
if (value != NULL) {
if (store.state == START) {
-   if (!store_write_section(fd, key))
+   if (write_section(fd, key) < 0)
goto write_err_out;
}
-   if (!store_write_pair(fd, key, value))
+   if (write_pair(fd, key, value) < 0)
goto write_err_out;
}
 
@@ -2816,7 +2818,7 @@ int git_config_rename_section_in_file(const char 
*config_filename,
continue;
}
store.baselen = strlen(new_name);
-   if (!store_write_section(out_fd, new_name)) {
+   if (write_section(out_fd, new_name) < 0) {
ret = 
write_error(get_lock_file_path(lock));
goto out;
}
-- 
2.14.1.874.ge7b2e05270


[PATCH 5/7] pkt-line: check write_in_full() errors against "< 0"

2017-09-13 Thread Jeff King
As with the previous two commits, we prefer to check
write_in_full()'s return value to see if it is negative,
rather than comparing it to the input length.

These cases actually flip the logic to check for success,
making conversion a little different than in other cases. We
could of course write:

  if (write_in_full(...) >= 0)
  return 0;
  return error(...);

But our usual method of spelling write() error checks is
just "< 0". So let's flip the logic for each of these
conditionals to our usual style.

Signed-off-by: Jeff King 
---
This will have a minor textual conflict with ma/pkt-line-leakfix which
hasn't quite made it into master yet. The resolution should be
straight-forward, though.

 pkt-line.c | 29 ++---
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/pkt-line.c b/pkt-line.c
index 7db9119573..4823d3bb9d 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -94,9 +94,9 @@ void packet_flush(int fd)
 int packet_flush_gently(int fd)
 {
packet_trace("", 4, 1);
-   if (write_in_full(fd, "", 4) == 4)
-   return 0;
-   return error("flush packet write failed");
+   if (write_in_full(fd, "", 4) < 0)
+   return error("flush packet write failed");
+   return 0;
 }
 
 void packet_buf_flush(struct strbuf *buf)
@@ -137,18 +137,17 @@ static int packet_write_fmt_1(int fd, int gently,
  const char *fmt, va_list args)
 {
struct strbuf buf = STRBUF_INIT;
-   ssize_t count;
 
format_packet(, fmt, args);
-   count = write_in_full(fd, buf.buf, buf.len);
-   if (count == buf.len)
-   return 0;
-
-   if (!gently) {
-   check_pipe(errno);
-   die_errno("packet write with format failed");
+   if (write_in_full(fd, buf.buf, buf.len) < 0) {
+   if (!gently) {
+   check_pipe(errno);
+   die_errno("packet write with format failed");
+   }
+   return error("packet write with format failed");
}
-   return error("packet write with format failed");
+
+   return 0;
 }
 
 void packet_write_fmt(int fd, const char *fmt, ...)
@@ -183,9 +182,9 @@ static int packet_write_gently(const int fd_out, const char 
*buf, size_t size)
packet_size = size + 4;
set_packet_header(packet_write_buffer, packet_size);
memcpy(packet_write_buffer + 4, buf, size);
-   if (write_in_full(fd_out, packet_write_buffer, packet_size) == 
packet_size)
-   return 0;
-   return error("packet write failed");
+   if (write_in_full(fd_out, packet_write_buffer, packet_size) < 0)
+   return error("packet write failed");
+   return 0;
 }
 
 void packet_buf_write(struct strbuf *buf, const char *fmt, ...)
-- 
2.14.1.874.ge7b2e05270



[PATCH 6/7] notes-merge: use ssize_t for write_in_full() return value

2017-09-13 Thread Jeff King
We store the return value of write_in_full() in a long,
though the return is actually an ssize_t. This probably
doesn't matter much in practice (since the buffer size is
alredy an unsigned long), but it might if the size if
between what can be represented in "long" and "unsigned
long", and if your size_t is larger than a "long" (as it is
on 64-bit Windows).

Signed-off-by: Jeff King 
---
 notes-merge.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/notes-merge.c b/notes-merge.c
index b04d2f2131..597d43f65c 100644
--- a/notes-merge.c
+++ b/notes-merge.c
@@ -302,7 +302,7 @@ static void write_buf_to_worktree(const struct object_id 
*obj,
fd = xopen(path, O_WRONLY | O_EXCL | O_CREAT, 0666);
 
while (size > 0) {
-   long ret = write_in_full(fd, buf, size);
+   ssize_t ret = write_in_full(fd, buf, size);
if (ret < 0) {
/* Ignore epipe */
if (errno == EPIPE)
-- 
2.14.1.874.ge7b2e05270



[PATCH 16/20] ref_store: implement `refs_peel_ref()` generically

2017-09-13 Thread Michael Haggerty
We're about to stop storing packed refs in a `ref_cache`. That means
that the only way we have left to optimize `peel_ref()` is by checking
whether the reference being peeled is the one currently being iterated
over (in `current_ref_iter`), and if so, using `ref_iterator_peel()`.
But this can be done generically; it doesn't have to be implemented
per-backend.

So implement `refs_peel_ref()` in `refs.c` and remove the `peel_ref()`
method from the refs API.

This removes the last callers of a couple of functions, so delete
them. More cleanup to come...

Signed-off-by: Michael Haggerty 
---
 refs.c| 18 +-
 refs/files-backend.c  | 38 --
 refs/packed-backend.c | 36 
 refs/refs-internal.h  |  3 ---
 4 files changed, 17 insertions(+), 78 deletions(-)

diff --git a/refs.c b/refs.c
index 101c107ee8..c5e6f79c77 100644
--- a/refs.c
+++ b/refs.c
@@ -1735,7 +1735,23 @@ int refs_pack_refs(struct ref_store *refs, unsigned int 
flags)
 int refs_peel_ref(struct ref_store *refs, const char *refname,
  unsigned char *sha1)
 {
-   return refs->be->peel_ref(refs, refname, sha1);
+   int flag;
+   unsigned char base[20];
+
+   if (current_ref_iter && current_ref_iter->refname == refname) {
+   struct object_id peeled;
+
+   if (ref_iterator_peel(current_ref_iter, ))
+   return -1;
+   hashcpy(sha1, peeled.hash);
+   return 0;
+   }
+
+   if (refs_read_ref_full(refs, refname,
+  RESOLVE_REF_READING, base, ))
+   return -1;
+
+   return peel_object(base, sha1);
 }
 
 int peel_ref(const char *refname, unsigned char *sha1)
diff --git a/refs/files-backend.c b/refs/files-backend.c
index 35648c89fc..7d12de88d0 100644
--- a/refs/files-backend.c
+++ b/refs/files-backend.c
@@ -655,43 +655,6 @@ static int lock_raw_ref(struct files_ref_store *refs,
return ret;
 }
 
-static int files_peel_ref(struct ref_store *ref_store,
- const char *refname, unsigned char *sha1)
-{
-   struct files_ref_store *refs =
-   files_downcast(ref_store, REF_STORE_READ | REF_STORE_ODB,
-  "peel_ref");
-   int flag;
-   unsigned char base[20];
-
-   if (current_ref_iter && current_ref_iter->refname == refname) {
-   struct object_id peeled;
-
-   if (ref_iterator_peel(current_ref_iter, ))
-   return -1;
-   hashcpy(sha1, peeled.hash);
-   return 0;
-   }
-
-   if (refs_read_ref_full(ref_store, refname,
-  RESOLVE_REF_READING, base, ))
-   return -1;
-
-   /*
-* If the reference is packed, read its ref_entry from the
-* cache in the hope that we already know its peeled value.
-* We only try this optimization on packed references because
-* (a) forcing the filling of the loose reference cache could
-* be expensive and (b) loose references anyway usually do not
-* have REF_KNOWS_PEELED.
-*/
-   if (flag & REF_ISPACKED &&
-   !refs_peel_ref(refs->packed_ref_store, refname, sha1))
-   return 0;
-
-   return peel_object(base, sha1);
-}
-
 struct files_ref_iterator {
struct ref_iterator base;
 
@@ -3012,7 +2975,6 @@ struct ref_storage_be refs_be_files = {
files_initial_transaction_commit,
 
files_pack_refs,
-   files_peel_ref,
files_create_symref,
files_delete_refs,
files_rename_ref,
diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index 9b10532eb3..88242a49fe 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -780,26 +780,6 @@ static struct packed_ref_cache 
*get_packed_ref_cache(struct packed_ref_store *re
return refs->cache;
 }
 
-static struct ref_dir *get_packed_ref_dir(struct packed_ref_cache 
*packed_ref_cache)
-{
-   return get_ref_dir(packed_ref_cache->cache->root);
-}
-
-static struct ref_dir *get_packed_refs(struct packed_ref_store *refs)
-{
-   return get_packed_ref_dir(get_packed_ref_cache(refs));
-}
-
-/*
- * Return the ref_entry for the given refname from the packed
- * references.  If it does not exist, return NULL.
- */
-static struct ref_entry *get_packed_ref(struct packed_ref_store *refs,
-   const char *refname)
-{
-   return find_ref_entry(get_packed_refs(refs), refname);
-}
-
 static int packed_read_raw_ref(struct ref_store *ref_store,
   const char *refname, unsigned char *sha1,
   struct strbuf *referent, unsigned int *type)
@@ -826,21 +806,6 @@ static int packed_read_raw_ref(struct ref_store *ref_store,
return 0;
 }
 
-static int packed_peel_ref(struct ref_store *ref_store,
-

[PATCH 13/20] read_packed_refs(): ensure that references are ordered when read

2017-09-13 Thread Michael Haggerty
It doesn't actually matter now, because the references are only
iterated over to fill the associated `ref_cache`, which itself puts
them in the correct order. But we want to get rid of the `ref_cache`,
so we want to be able to iterate directly over the `packed-refs`
buffer, and then the iteration will need to be ordered correctly.

In fact, we already write the `packed-refs` file sorted, but it is
possible that other Git clients don't get it right. So let's not
assume that a `packed-refs` file is sorted unless it is explicitly
declared to be so via a `sorted` trait in its header line.

If it is *not* declared to be sorted, then scan quickly through the
file to check. If it is found to be out of order, then sort the
records into a new memory-only copy rather than using the mmapped file
contents. This checking and sorting is done quickly, without parsing
the full file contents. However, it needs a little bit of care to
avoid reading past the end of the buffer even if the `packed-refs`
file is corrupt.

If `packed-refs` is sorted (i.e., its file header contains the
`sorted` trait or our check shows that it is), then use the mmapped
file contents directly.

Since we write the file correctly sorted, include that trait when we
write or rewrite a `packed-refs` file.

Signed-off-by: Michael Haggerty 
---
 refs/packed-backend.c | 222 ++
 1 file changed, 206 insertions(+), 16 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index d209c8e5a0..2a36dd9afb 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -25,9 +25,12 @@ struct packed_ref_cache {
int fd;
 
/*
-* The range of memory to which the `packed-refs` file is
-* mmapped. If there were no contents (e.g., because the file
-* didn't exist), `buf` and `eof` are both NULL.
+* The contents of the `packed-refs` file. If the file was
+* already sorted, this points at the mmapped contents of the
+* file. If not, this points at heap-allocated memory
+* containing the contents, sorted. If there were no contents
+* (e.g., because the file didn't exist), `buf` and `eof` are
+* both NULL.
 */
char *buf, *eof;
 
@@ -93,22 +96,24 @@ static void acquire_packed_ref_cache(struct 
packed_ref_cache *packed_refs)
 }
 
 /*
- * If the buffer in `packed_refs` is active, munmap the memory, close the
- * file, and set the buffer pointers to NULL.
+ * If the buffer in `packed_refs` is active, then either munmap the
+ * memory and close the file, or free the memory. Then set the buffer
+ * pointers to NULL.
  */
 static void release_packed_ref_buffer(struct packed_ref_cache *packed_refs)
 {
-   if (packed_refs->buf) {
+   if (packed_refs->fd >= 0) {
if (munmap(packed_refs->buf,
   packed_refs->eof - packed_refs->buf))
die_errno("error ummapping packed-refs file %s",
  packed_refs->refs->path);
-   packed_refs->buf = packed_refs->eof = NULL;
-   packed_refs->header_len = 0;
-   }
-
-   if (packed_refs->fd >= 0)
close(packed_refs->fd);
+   packed_refs->fd = -1;
+   } else {
+   free(packed_refs->buf);
+   }
+   packed_refs->buf = packed_refs->eof = NULL;
+   packed_refs->header_len = 0;
 }
 
 /*
@@ -329,7 +334,7 @@ struct ref_iterator *mmapped_ref_iterator_begin(
if (!packed_refs->buf)
return empty_ref_iterator_begin();
 
-   base_ref_iterator_init(ref_iterator, _ref_iterator_vtable, 0);
+   base_ref_iterator_init(ref_iterator, _ref_iterator_vtable, 1);
 
iter->packed_refs = packed_refs;
acquire_packed_ref_cache(iter->packed_refs);
@@ -342,6 +347,170 @@ struct ref_iterator *mmapped_ref_iterator_begin(
return ref_iterator;
 }
 
+struct packed_ref_entry {
+   const char *start;
+   size_t len;
+};
+
+static int cmp_packed_ref_entries(const void *v1, const void *v2)
+{
+   const struct packed_ref_entry *e1 = v1, *e2 = v2;
+   const char *r1 = e1->start + GIT_SHA1_HEXSZ + 1;
+   const char *r2 = e2->start + GIT_SHA1_HEXSZ + 1;
+
+   while (1) {
+   if (*r1 == '\n')
+   return *r2 == '\n' ? 0 : -1;
+   if (*r1 != *r2) {
+   if (*r2 == '\n')
+   return 1;
+   else
+   return (unsigned char)*r1 < (unsigned char)*r2 
? -1 : +1;
+   }
+   r1++;
+   r2++;
+   }
+}
+
+/*
+ * `packed_refs->buf` is not known to be sorted. Check whether it is,
+ * and if not, sort it into new memory and munmap/free the old
+ * storage.
+ */
+static void sort_packed_refs(struct packed_ref_cache *packed_refs)
+{
+   struct packed_ref_entry *entries = NULL;
+   

[PATCH 08/20] read_packed_refs(): read references with minimal copying

2017-09-13 Thread Michael Haggerty
Instead of copying data from the `packed-refs` file one line at time
and then processing it, process the data in place as much as possible.

Also, instead of processing one line per iteration of the main loop,
process a reference line plus its corresponding peeled line (if
present) together.

Note that this change slightly tightens up the parsing of the
`parse-ref` file. Previously, the parser would have accepted multiple
"peeled" lines for a single reference (ignoring all but the last one).
Now it would reject that.

Signed-off-by: Michael Haggerty 
---
 refs/packed-backend.c | 101 --
 1 file changed, 40 insertions(+), 61 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index a45e3ff92f..2b80f244c8 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -134,33 +134,6 @@ static void clear_packed_ref_cache(struct packed_ref_store 
*refs)
}
 }
 
-/* The length of a peeled reference line in packed-refs, including EOL: */
-#define PEELED_LINE_LENGTH 42
-
-/*
- * Parse one line from a packed-refs file.  Write the SHA1 to sha1.
- * Return a pointer to the refname within the line (null-terminated),
- * or NULL if there was a problem.
- */
-static const char *parse_ref_line(struct strbuf *line, struct object_id *oid)
-{
-   const char *ref;
-
-   if (parse_oid_hex(line->buf, oid, ) < 0)
-   return NULL;
-   if (!isspace(*ref++))
-   return NULL;
-
-   if (isspace(*ref))
-   return NULL;
-
-   if (line->buf[line->len - 1] != '\n')
-   return NULL;
-   line->buf[--line->len] = 0;
-
-   return ref;
-}
-
 static NORETURN void die_unterminated_line(const char *path,
   const char *p, size_t len)
 {
@@ -221,8 +194,7 @@ static struct packed_ref_cache *read_packed_refs(struct 
packed_ref_store *refs)
size_t size;
char *buf;
const char *pos, *eol, *eof;
-   struct ref_entry *last = NULL;
-   struct strbuf line = STRBUF_INIT;
+   struct strbuf tmp = STRBUF_INIT;
enum { PEELED_NONE, PEELED_TAGS, PEELED_FULLY } peeled = PEELED_NONE;
struct ref_dir *dir;
 
@@ -264,9 +236,9 @@ static struct packed_ref_cache *read_packed_refs(struct 
packed_ref_store *refs)
if (!eol)
die_unterminated_line(refs->path, pos, eof - pos);
 
-   strbuf_add(, pos, eol - pos);
+   strbuf_add(, pos, eol - pos);
 
-   if (!skip_prefix(line.buf, "# pack-refs with:", (const char 
**)))
+   if (!skip_prefix(tmp.buf, "# pack-refs with:", (const char 
**)))
die_invalid_line(refs->path, pos, eof - pos);
 
string_list_split_in_place(, p, ' ', -1);
@@ -281,61 +253,68 @@ static struct packed_ref_cache *read_packed_refs(struct 
packed_ref_store *refs)
pos = eol + 1;
 
string_list_clear(, 0);
-   strbuf_reset();
+   strbuf_reset();
}
 
dir = get_ref_dir(packed_refs->cache->root);
while (pos < eof) {
+   const char *p = pos;
struct object_id oid;
const char *refname;
+   int flag = REF_ISPACKED;
+   struct ref_entry *entry = NULL;
 
-   eol = memchr(pos, '\n', eof - pos);
+   if (eof - pos < GIT_SHA1_HEXSZ + 2 ||
+   parse_oid_hex(p, , ) ||
+   !isspace(*p++))
+   die_invalid_line(refs->path, pos, eof - pos);
+
+   eol = memchr(p, '\n', eof - p);
if (!eol)
die_unterminated_line(refs->path, pos, eof - pos);
 
-   strbuf_add(, pos, eol + 1 - pos);
+   strbuf_add(, p, eol - p);
+   refname = tmp.buf;
+
+   if (check_refname_format(refname, REFNAME_ALLOW_ONELEVEL)) {
+   if (!refname_is_safe(refname))
+   die("packed refname is dangerous: %s", refname);
+   oidclr();
+   flag |= REF_BAD_NAME | REF_ISBROKEN;
+   }
+   if (peeled == PEELED_FULLY ||
+   (peeled == PEELED_TAGS && starts_with(refname, 
"refs/tags/")))
+   flag |= REF_KNOWS_PEELED;
+   entry = create_ref_entry(refname, , flag);
+   add_ref_entry(dir, entry);
 
-   refname = parse_ref_line(, );
-   if (refname) {
-   int flag = REF_ISPACKED;
+   pos = eol + 1;
+
+   if (pos < eof && *pos == '^') {
+   p = pos + 1;
+   if (eof - p < GIT_SHA1_HEXSZ + 1 ||
+   parse_oid_hex(p, >u.value.peeled, ) ||
+   *p++ != '\n')
+   

[PATCH 04/20] die_unterminated_line(), die_invalid_line(): new functions

2017-09-13 Thread Michael Haggerty
Extract some helper functions for reporting errors. While we're at it,
prevent them from spewing unlimited output to the terminal. These
functions will soon have more callers.

These functions accept the problematic line as a `(ptr, len)` pair
rather than a NUL-terminated string, and `die_invalid_line()` checks
for an EOL itself, because these calling conventions will be
convenient for future callers. (Efficiency is not a concern here
because these functions are only ever called if the `packed-refs` file
is corrupt.)

Signed-off-by: Michael Haggerty 
---
 refs/packed-backend.c | 28 +---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index a3d9210cb0..5c50c223ef 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -161,6 +161,29 @@ static const char *parse_ref_line(struct strbuf *line, 
struct object_id *oid)
return ref;
 }
 
+static NORETURN void die_unterminated_line(const char *path,
+  const char *p, size_t len)
+{
+   if (len < 80)
+   die("unterminated line in %s: %.*s", path, (int)len, p);
+   else
+   die("unterminated line in %s: %.75s...", path, p);
+}
+
+static NORETURN void die_invalid_line(const char *path,
+ const char *p, size_t len)
+{
+   const char *eol = memchr(p, '\n', len);
+
+   if (!eol)
+   die_unterminated_line(path, p, len);
+   else if (eol - p < 80)
+   die("unexpected line in %s: %.*s", path, (int)(eol - p), p);
+   else
+   die("unexpected line in %s: %.75s...", path, p);
+
+}
+
 /*
  * Read from the `packed-refs` file into a newly-allocated
  * `packed_ref_cache` and return it. The return value will already
@@ -227,7 +250,7 @@ static struct packed_ref_cache *read_packed_refs(struct 
packed_ref_store *refs)
const char *traits;
 
if (!line.len || line.buf[line.len - 1] != '\n')
-   die("unterminated line in %s: %s", refs->path, 
line.buf);
+   die_unterminated_line(refs->path, line.buf, line.len);
 
if (skip_prefix(line.buf, "# pack-refs with:", )) {
if (strstr(traits, " fully-peeled "))
@@ -266,8 +289,7 @@ static struct packed_ref_cache *read_packed_refs(struct 
packed_ref_store *refs)
 */
last->flag |= REF_KNOWS_PEELED;
} else {
-   strbuf_setlen(, line.len - 1);
-   die("unexpected line in %s: %s", refs->path, line.buf);
+   die_invalid_line(refs->path, line.buf, line.len);
}
}
 
-- 
2.14.1



[PATCH 10/20] mmapped_ref_iterator: add iterator over a packed-refs file

2017-09-13 Thread Michael Haggerty
Add a new `mmapped_ref_iterator`, which can iterate over the
references in an mmapped `packed-refs` file directly. Use this
iterator from `read_packed_refs()` to fill the packed refs cache.

Note that we are not yet willing to promise that the new iterator
generates its output in order. That doesn't matter for now, because
the packed refs cache doesn't care what order it is filled.

This change adds a lot of boilerplate without providing any obvious
benefits. The benefits will come soon, when we get rid of the
`ref_cache` for packed references altogether.

Signed-off-by: Michael Haggerty 
---
 refs/packed-backend.c | 207 --
 1 file changed, 152 insertions(+), 55 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index ae276f3445..312116a99d 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -163,6 +163,141 @@ static NORETURN void die_invalid_line(const char *path,
 
 }
 
+/*
+ * An iterator over a packed-refs file that is currently mmapped.
+ */
+struct mmapped_ref_iterator {
+   struct ref_iterator base;
+
+   struct packed_ref_cache *packed_refs;
+
+   /* The current position in the mmapped file: */
+   const char *pos;
+
+   /* The end of the mmapped file: */
+   const char *eof;
+
+   struct object_id oid, peeled;
+
+   struct strbuf refname_buf;
+};
+
+static int mmapped_ref_iterator_advance(struct ref_iterator *ref_iterator)
+{
+   struct mmapped_ref_iterator *iter =
+   (struct mmapped_ref_iterator *)ref_iterator;
+   const char *p = iter->pos, *eol;
+
+   strbuf_reset(>refname_buf);
+
+   if (iter->pos == iter->eof)
+   return ref_iterator_abort(ref_iterator);
+
+   iter->base.flags = REF_ISPACKED;
+
+   if (iter->eof - p < GIT_SHA1_HEXSZ + 2 ||
+   parse_oid_hex(p, >oid, ) ||
+   !isspace(*p++))
+   die_invalid_line(iter->packed_refs->refs->path,
+iter->pos, iter->eof - iter->pos);
+
+   eol = memchr(p, '\n', iter->eof - p);
+   if (!eol)
+   die_unterminated_line(iter->packed_refs->refs->path,
+ iter->pos, iter->eof - iter->pos);
+
+   strbuf_add(>refname_buf, p, eol - p);
+   iter->base.refname = iter->refname_buf.buf;
+
+   if (check_refname_format(iter->base.refname, REFNAME_ALLOW_ONELEVEL)) {
+   if (!refname_is_safe(iter->base.refname))
+   die("packed refname is dangerous: %s",
+   iter->base.refname);
+   oidclr(>oid);
+   iter->base.flags |= REF_BAD_NAME | REF_ISBROKEN;
+   }
+   if (iter->packed_refs->peeled == PEELED_FULLY ||
+   (iter->packed_refs->peeled == PEELED_TAGS &&
+starts_with(iter->base.refname, "refs/tags/")))
+   iter->base.flags |= REF_KNOWS_PEELED;
+
+   iter->pos = eol + 1;
+
+   if (iter->pos < iter->eof && *iter->pos == '^') {
+   p = iter->pos + 1;
+   if (iter->eof - p < GIT_SHA1_HEXSZ + 1 ||
+   parse_oid_hex(p, >peeled, ) ||
+   *p++ != '\n')
+   die_invalid_line(iter->packed_refs->refs->path,
+iter->pos, iter->eof - iter->pos);
+   iter->pos = p;
+
+   /*
+* Regardless of what the file header said, we
+* definitely know the value of *this* reference:
+*/
+   iter->base.flags |= REF_KNOWS_PEELED;
+   } else {
+   oidclr(>peeled);
+   }
+
+   return ITER_OK;
+}
+
+static int mmapped_ref_iterator_peel(struct ref_iterator *ref_iterator,
+   struct object_id *peeled)
+{
+   struct mmapped_ref_iterator *iter =
+   (struct mmapped_ref_iterator *)ref_iterator;
+
+   if ((iter->base.flags & REF_KNOWS_PEELED)) {
+   oidcpy(peeled, >peeled);
+   return is_null_oid(>peeled) ? -1 : 0;
+   } else if ((iter->base.flags & (REF_ISBROKEN | REF_ISSYMREF))) {
+   return -1;
+   } else {
+   return !!peel_object(iter->oid.hash, peeled->hash);
+   }
+}
+
+static int mmapped_ref_iterator_abort(struct ref_iterator *ref_iterator)
+{
+   struct mmapped_ref_iterator *iter =
+   (struct mmapped_ref_iterator *)ref_iterator;
+
+   release_packed_ref_cache(iter->packed_refs);
+   strbuf_release(>refname_buf);
+   base_ref_iterator_free(ref_iterator);
+   return ITER_DONE;
+}
+
+static struct ref_iterator_vtable mmapped_ref_iterator_vtable = {
+   mmapped_ref_iterator_advance,
+   mmapped_ref_iterator_peel,
+   mmapped_ref_iterator_abort
+};
+
+struct ref_iterator *mmapped_ref_iterator_begin(
+   const char *packed_refs_file,
+   struct packed_ref_cache 

[PATCH 15/20] packed_read_raw_ref(): read the reference from the mmapped buffer

2017-09-13 Thread Michael Haggerty
Instead of reading the reference from the `ref_cache`, read it
directly from the mmapped buffer.

Signed-off-by: Michael Haggerty 
---
 refs/packed-backend.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index a018a6c8ad..9b10532eb3 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -806,18 +806,22 @@ static int packed_read_raw_ref(struct ref_store 
*ref_store,
 {
struct packed_ref_store *refs =
packed_downcast(ref_store, REF_STORE_READ, "read_raw_ref");
-
-   struct ref_entry *entry;
+   struct packed_ref_cache *packed_refs = get_packed_ref_cache(refs);
+   const char *rec;
 
*type = 0;
 
-   entry = get_packed_ref(refs, refname);
-   if (!entry) {
+   rec = find_reference_location(packed_refs, refname, 1);
+
+   if (!rec) {
+   /* refname is not a packed reference. */
errno = ENOENT;
return -1;
}
 
-   hashcpy(sha1, entry->u.value.oid.hash);
+   if (get_sha1_hex(rec, sha1))
+   die_invalid_line(refs->path, rec, packed_refs->eof - rec);
+
*type = REF_ISPACKED;
return 0;
 }
-- 
2.14.1



[PATCH 18/20] ref_cache: remove support for storing peeled values

2017-09-13 Thread Michael Haggerty
Now that the `packed-refs` backend doesn't use `ref_cache`, there is
nobody left who might want to store peeled values of references in
`ref_cache`. So remove that feature.

Signed-off-by: Michael Haggerty 
---
 refs/packed-backend.c |  9 -
 refs/ref-cache.c  | 42 +-
 refs/ref-cache.h  | 32 ++--
 3 files changed, 11 insertions(+), 72 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index 9868a5c198..d76051c7f5 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -2,7 +2,6 @@
 #include "../config.h"
 #include "../refs.h"
 #include "refs-internal.h"
-#include "ref-cache.h"
 #include "packed-backend.h"
 #include "../iterator.h"
 #include "../lockfile.h"
@@ -201,6 +200,14 @@ static NORETURN void die_invalid_line(const char *path,
 
 }
 
+/*
+ * This value is set in `base.flags` if the peeled value of the
+ * current reference is known. In that case, `peeled` contains the
+ * correct peeled value for the reference, which might be `null_sha1`
+ * if the reference is not a tag or if it is broken.
+ */
+#define REF_KNOWS_PEELED 0x40
+
 /*
  * An iterator over a packed-refs file that is currently mmapped.
  */
diff --git a/refs/ref-cache.c b/refs/ref-cache.c
index 54dfb5218c..4f850e1b5c 100644
--- a/refs/ref-cache.c
+++ b/refs/ref-cache.c
@@ -38,7 +38,6 @@ struct ref_entry *create_ref_entry(const char *refname,
 
FLEX_ALLOC_STR(ref, name, refname);
oidcpy(>u.value.oid, oid);
-   oidclr(>u.value.peeled);
ref->flag = flag;
return ref;
 }
@@ -491,49 +490,10 @@ static int cache_ref_iterator_advance(struct ref_iterator 
*ref_iterator)
}
 }
 
-enum peel_status peel_entry(struct ref_entry *entry, int repeel)
-{
-   enum peel_status status;
-
-   if (entry->flag & REF_KNOWS_PEELED) {
-   if (repeel) {
-   entry->flag &= ~REF_KNOWS_PEELED;
-   oidclr(>u.value.peeled);
-   } else {
-   return is_null_oid(>u.value.peeled) ?
-   PEEL_NON_TAG : PEEL_PEELED;
-   }
-   }
-   if (entry->flag & REF_ISBROKEN)
-   return PEEL_BROKEN;
-   if (entry->flag & REF_ISSYMREF)
-   return PEEL_IS_SYMREF;
-
-   status = peel_object(entry->u.value.oid.hash, 
entry->u.value.peeled.hash);
-   if (status == PEEL_PEELED || status == PEEL_NON_TAG)
-   entry->flag |= REF_KNOWS_PEELED;
-   return status;
-}
-
 static int cache_ref_iterator_peel(struct ref_iterator *ref_iterator,
   struct object_id *peeled)
 {
-   struct cache_ref_iterator *iter =
-   (struct cache_ref_iterator *)ref_iterator;
-   struct cache_ref_iterator_level *level;
-   struct ref_entry *entry;
-
-   level = >levels[iter->levels_nr - 1];
-
-   if (level->index == -1)
-   die("BUG: peel called before advance for cache iterator");
-
-   entry = level->dir->entries[level->index];
-
-   if (peel_entry(entry, 0))
-   return -1;
-   oidcpy(peeled, >u.value.peeled);
-   return 0;
+   return peel_object(ref_iterator->oid->hash, peeled->hash);
 }
 
 static int cache_ref_iterator_abort(struct ref_iterator *ref_iterator)
diff --git a/refs/ref-cache.h b/refs/ref-cache.h
index a082bfb06c..eda65e73ed 100644
--- a/refs/ref-cache.h
+++ b/refs/ref-cache.h
@@ -38,14 +38,6 @@ struct ref_value {
 * referred to by the last reference in the symlink chain.
 */
struct object_id oid;
-
-   /*
-* If REF_KNOWS_PEELED, then this field holds the peeled value
-* of this reference, or null if the reference is known not to
-* be peelable.  See the documentation for peel_ref() for an
-* exact definition of "peelable".
-*/
-   struct object_id peeled;
 };
 
 /*
@@ -97,21 +89,14 @@ struct ref_dir {
  * public values; see refs.h.
  */
 
-/*
- * The field ref_entry->u.value.peeled of this value entry contains
- * the correct peeled value for the reference, which might be
- * null_sha1 if the reference is not a tag or if it is broken.
- */
-#define REF_KNOWS_PEELED 0x10
-
 /* ref_entry represents a directory of references */
-#define REF_DIR 0x20
+#define REF_DIR 0x10
 
 /*
  * Entry has not yet been read from disk (used only for REF_DIR
  * entries representing loose references)
  */
-#define REF_INCOMPLETE 0x40
+#define REF_INCOMPLETE 0x20
 
 /*
  * A ref_entry represents either a reference or a "subdirectory" of
@@ -252,17 +237,4 @@ struct ref_iterator *cache_ref_iterator_begin(struct 
ref_cache *cache,
  const char *prefix,
  int prime_dir);
 
-/*
- * Peel the entry (if possible) and return its new peel_status.  If
- * repeel is true, re-peel the entry even if there is an old 

[PATCH 19/20] mmapped_ref_iterator: inline into `packed_ref_iterator`

2017-09-13 Thread Michael Haggerty
Since `packed_ref_iterator` is now delegating to
`mmapped_ref_iterator` rather than `cache_ref_iterator` to do the
heavy lifting, there is no need to keep the two iterators separate. So
"inline" `mmapped_ref_iterator` into `packed_ref_iterator`. This
removes a bunch of boilerplate.

Signed-off-by: Michael Haggerty 
---
 refs/packed-backend.c | 284 --
 1 file changed, 114 insertions(+), 170 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index d76051c7f5..a3f8a19b9b 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -200,157 +200,6 @@ static NORETURN void die_invalid_line(const char *path,
 
 }
 
-/*
- * This value is set in `base.flags` if the peeled value of the
- * current reference is known. In that case, `peeled` contains the
- * correct peeled value for the reference, which might be `null_sha1`
- * if the reference is not a tag or if it is broken.
- */
-#define REF_KNOWS_PEELED 0x40
-
-/*
- * An iterator over a packed-refs file that is currently mmapped.
- */
-struct mmapped_ref_iterator {
-   struct ref_iterator base;
-
-   struct packed_ref_cache *packed_refs;
-
-   /* The current position in the mmapped file: */
-   const char *pos;
-
-   /* The end of the mmapped file: */
-   const char *eof;
-
-   struct object_id oid, peeled;
-
-   struct strbuf refname_buf;
-};
-
-static int mmapped_ref_iterator_advance(struct ref_iterator *ref_iterator)
-{
-   struct mmapped_ref_iterator *iter =
-   (struct mmapped_ref_iterator *)ref_iterator;
-   const char *p = iter->pos, *eol;
-
-   strbuf_reset(>refname_buf);
-
-   if (iter->pos == iter->eof)
-   return ref_iterator_abort(ref_iterator);
-
-   iter->base.flags = REF_ISPACKED;
-
-   if (iter->eof - p < GIT_SHA1_HEXSZ + 2 ||
-   parse_oid_hex(p, >oid, ) ||
-   !isspace(*p++))
-   die_invalid_line(iter->packed_refs->refs->path,
-iter->pos, iter->eof - iter->pos);
-
-   eol = memchr(p, '\n', iter->eof - p);
-   if (!eol)
-   die_unterminated_line(iter->packed_refs->refs->path,
- iter->pos, iter->eof - iter->pos);
-
-   strbuf_add(>refname_buf, p, eol - p);
-   iter->base.refname = iter->refname_buf.buf;
-
-   if (check_refname_format(iter->base.refname, REFNAME_ALLOW_ONELEVEL)) {
-   if (!refname_is_safe(iter->base.refname))
-   die("packed refname is dangerous: %s",
-   iter->base.refname);
-   oidclr(>oid);
-   iter->base.flags |= REF_BAD_NAME | REF_ISBROKEN;
-   }
-   if (iter->packed_refs->peeled == PEELED_FULLY ||
-   (iter->packed_refs->peeled == PEELED_TAGS &&
-starts_with(iter->base.refname, "refs/tags/")))
-   iter->base.flags |= REF_KNOWS_PEELED;
-
-   iter->pos = eol + 1;
-
-   if (iter->pos < iter->eof && *iter->pos == '^') {
-   p = iter->pos + 1;
-   if (iter->eof - p < GIT_SHA1_HEXSZ + 1 ||
-   parse_oid_hex(p, >peeled, ) ||
-   *p++ != '\n')
-   die_invalid_line(iter->packed_refs->refs->path,
-iter->pos, iter->eof - iter->pos);
-   iter->pos = p;
-
-   /*
-* Regardless of what the file header said, we
-* definitely know the value of *this* reference. But
-* we suppress it if the reference is broken:
-*/
-   if ((iter->base.flags & REF_ISBROKEN)) {
-   oidclr(>peeled);
-   iter->base.flags &= ~REF_KNOWS_PEELED;
-   } else {
-   iter->base.flags |= REF_KNOWS_PEELED;
-   }
-   } else {
-   oidclr(>peeled);
-   }
-
-   return ITER_OK;
-}
-
-static int mmapped_ref_iterator_peel(struct ref_iterator *ref_iterator,
-   struct object_id *peeled)
-{
-   struct mmapped_ref_iterator *iter =
-   (struct mmapped_ref_iterator *)ref_iterator;
-
-   if ((iter->base.flags & REF_KNOWS_PEELED)) {
-   oidcpy(peeled, >peeled);
-   return is_null_oid(>peeled) ? -1 : 0;
-   } else if ((iter->base.flags & (REF_ISBROKEN | REF_ISSYMREF))) {
-   return -1;
-   } else {
-   return !!peel_object(iter->oid.hash, peeled->hash);
-   }
-}
-
-static int mmapped_ref_iterator_abort(struct ref_iterator *ref_iterator)
-{
-   struct mmapped_ref_iterator *iter =
-   (struct mmapped_ref_iterator *)ref_iterator;
-
-   release_packed_ref_cache(iter->packed_refs);
-   strbuf_release(>refname_buf);
-   base_ref_iterator_free(ref_iterator);
-   return ITER_DONE;
-}
-
-static struct 

[PATCH 14/20] packed_ref_iterator_begin(): iterate using `mmapped_ref_iterator`

2017-09-13 Thread Michael Haggerty
Now that we have an efficient way to iterate, in order, over the
mmapped contents of the `packed-refs` file, we can use that directly
to implement reference iteration for the `packed_ref_store`, rather
than iterating over the `ref_cache`. This is the next step towards
getting rid of the `ref_cache` entirely.

Signed-off-by: Michael Haggerty 
---
 refs/packed-backend.c | 109 --
 1 file changed, 106 insertions(+), 3 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index 2a36dd9afb..a018a6c8ad 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -372,6 +372,27 @@ static int cmp_packed_ref_entries(const void *v1, const 
void *v2)
}
 }
 
+/*
+ * Compare a packed-refs record pointed to by `rec` to the specified
+ * NUL-terminated refname.
+ */
+static int cmp_entry_to_refname(const char *rec, const char *refname)
+{
+   const char *r1 = rec + GIT_SHA1_HEXSZ + 1;
+   const char *r2 = refname;
+
+   while (1) {
+   if (*r1 == '\n')
+   return *r2 ? -1 : 0;
+   if (!*r2)
+   return 1;
+   if (*r1 != *r2)
+   return (unsigned char)*r1 < (unsigned char)*r2 ? -1 : 
+1;
+   r1++;
+   r2++;
+   }
+}
+
 /*
  * `packed_refs->buf` is not known to be sorted. Check whether it is,
  * and if not, sort it into new memory and munmap/free the old
@@ -478,6 +499,17 @@ static const char *find_start_of_record(const char *buf, 
const char *p)
return p;
 }
 
+/*
+ * Return a pointer to the start of the record following the record
+ * that contains `*p`. If none is found before `end`, return `end`.
+ */
+static const char *find_end_of_record(const char *p, const char *end)
+{
+   while (++p < end && (p[-1] != '\n' || p[0] == '^'))
+   ;
+   return p;
+}
+
 /*
  * We want to be able to compare mmapped reference records quickly,
  * without totally parsing them. We can do so because the records are
@@ -511,6 +543,65 @@ static void verify_buffer_safe(struct packed_ref_cache 
*packed_refs)
 last_line, eof - last_line);
 }
 
+/*
+ * Find the place in `cache->buf` where the start of the record for
+ * `refname` starts. If `mustexist` is true and the reference doesn't
+ * exist, then return NULL. If `mustexist` is false and the reference
+ * doesn't exist, then return the point where that reference would be
+ * inserted. In the latter mode, `refname` doesn't have to be a proper
+ * reference name; for example, one could search for "refs/replace/"
+ * to find the start of any replace references.
+ *
+ * The record is sought using a binary search, so `cache->buf` must be
+ * sorted.
+ */
+static const char *find_reference_location(struct packed_ref_cache *cache,
+  const char *refname, int mustexist)
+{
+   /*
+* This is not *quite* a garden-variety binary search, because
+* the data we're searching is made up of records, and we
+* always need to find the beginning of a record to do a
+* comparison. A "record" here is one line for the reference
+* itself and zero or one peel lines that start with '^'. Our
+* loop invariant is described in the next two comments.
+*/
+
+   /*
+* A pointer to the character at the start of a record whose
+* preceding records all have reference names that come
+* *before* `refname`.
+*/
+   const char *lo = cache->buf + cache->header_len;
+
+   /*
+* A pointer to a the first character of a record whose
+* reference name comes *after* `refname`.
+*/
+   const char *hi = cache->eof;
+
+   while (lo < hi) {
+   const char *mid, *rec;
+   int cmp;
+
+   mid = lo + (hi - lo) / 2;
+   rec = find_start_of_record(lo, mid);
+   cmp = cmp_entry_to_refname(rec, refname);
+   if (cmp < 0) {
+   lo = find_end_of_record(mid, hi);
+   } else if (cmp > 0) {
+   hi = rec;
+   } else {
+   return rec;
+   }
+   }
+
+   if (mustexist)
+   return NULL;
+   else
+   return lo;
+}
+
 /*
  * Read from the `packed-refs` file into a newly-allocated
  * `packed_ref_cache` and return it. The return value will already
@@ -818,6 +909,8 @@ static struct ref_iterator *packed_ref_iterator_begin(
const char *prefix, unsigned int flags)
 {
struct packed_ref_store *refs;
+   struct packed_ref_cache *packed_refs;
+   const char *start;
struct packed_ref_iterator *iter;
struct ref_iterator *ref_iterator;
unsigned int required_flags = REF_STORE_READ;
@@ -835,13 +928,23 @@ static struct ref_iterator 

[PATCH 17/20] packed_ref_store: get rid of the `ref_cache` entirely

2017-09-13 Thread Michael Haggerty
Now that everything has been changed to read what it needs directly
out of the `packed-refs` file, `packed_ref_store` doesn't need to
maintain a `ref_cache` at all. So get rid of it.

First of all, this will save a lot of memory and lots of little
allocations. Instead of needing to store complicated parsed data
structures in memory, we just mmap the file (potentially sharing
memory with other processes) and parse only what we need.

Moreover, since the mmapped access to the file reads only the parts of
the file that it needs, this might save reading all of the data from
disk at all (at least if the file starts out sorted).

Signed-off-by: Michael Haggerty 
---
 refs/packed-backend.c | 29 ++---
 1 file changed, 2 insertions(+), 27 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index 88242a49fe..9868a5c198 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -16,8 +16,6 @@ struct packed_ref_cache {
 */
struct packed_ref_store *refs;
 
-   struct ref_cache *cache;
-
/*
 * The file descriptor of the `packed-refs` file (open in
 * read-only mode), or -1 if it is not open.
@@ -123,7 +121,6 @@ static void release_packed_ref_buffer(struct 
packed_ref_cache *packed_refs)
 static int release_packed_ref_cache(struct packed_ref_cache *packed_refs)
 {
if (!--packed_refs->referrers) {
-   free_ref_cache(packed_refs->cache);
stat_validity_clear(_refs->validity);
release_packed_ref_buffer(packed_refs);
free(packed_refs);
@@ -640,15 +637,10 @@ static struct packed_ref_cache *read_packed_refs(struct 
packed_ref_store *refs)
struct packed_ref_cache *packed_refs = xcalloc(1, sizeof(*packed_refs));
struct stat st;
size_t size;
-   struct ref_dir *dir;
-   struct ref_iterator *iter;
int sorted = 0;
-   int ok;
 
packed_refs->refs = refs;
acquire_packed_ref_cache(packed_refs);
-   packed_refs->cache = create_ref_cache(NULL, NULL);
-   packed_refs->cache->root->flag &= ~REF_INCOMPLETE;
packed_refs->peeled = PEELED_NONE;
 
packed_refs->fd = open(refs->path, O_RDONLY);
@@ -730,23 +722,6 @@ static struct packed_ref_cache *read_packed_refs(struct 
packed_ref_store *refs)
verify_buffer_safe(packed_refs);
}
 
-   dir = get_ref_dir(packed_refs->cache->root);
-   iter = mmapped_ref_iterator_begin(
-   packed_refs,
-   packed_refs->buf + packed_refs->header_len,
-   packed_refs->eof);
-   while ((ok = ref_iterator_advance(iter)) == ITER_OK) {
-   struct ref_entry *entry =
-   create_ref_entry(iter->refname, iter->oid, iter->flags);
-
-   if ((iter->flags & REF_KNOWS_PEELED))
-   ref_iterator_peel(iter, >u.value.peeled);
-   add_ref_entry(dir, entry);
-   }
-
-   if (ok != ITER_DONE)
-   die("error reading packed-refs file %s", refs->path);
-
return packed_refs;
 }
 
@@ -905,8 +880,8 @@ static struct ref_iterator *packed_ref_iterator_begin(
else
start = packed_refs->buf + packed_refs->header_len;
 
-   iter->iter0 = mmapped_ref_iterator_begin(
-   packed_refs, start, packed_refs->eof);
+   iter->iter0 = mmapped_ref_iterator_begin(packed_refs,
+start, packed_refs->eof);
 
iter->flags = flags;
 
-- 
2.14.1



[PATCH 20/20] packed-backend.c: rename a bunch of things and update comments

2017-09-13 Thread Michael Haggerty
We've made huge changes to this file, and some of the old names and
comments are no longer very fitting. So rename a bunch of things:

* `struct packed_ref_cache` → `struct snapshot`
* `acquire_packed_ref_cache()` → `acquire_snapshot()`
* `release_packed_ref_buffer()` → `clear_snapshot_buffer()`
* `release_packed_ref_cache()` → `release_snapshot()`
* `clear_packed_ref_cache()` → `clear_snapshot()`
* `struct packed_ref_entry` → `struct snapshot_record`
* `cmp_packed_ref_entries()` → `cmp_packed_ref_records()`
* `cmp_entry_to_refname()` → `cmp_record_to_refname()`
* `sort_packed_refs()` → `sort_snapshot()`
* `read_packed_refs()` → `create_snapshot()`
* `validate_packed_ref_cache()` → `validate_snapshot()`
* `get_packed_ref_cache()` → `get_snapshot()`
* Renamed local variables and struct members accordingly.

Also update a bunch of comments to reflect the renaming and the
accumulated changes that the code has undergone.

Signed-off-by: Michael Haggerty 
---
 refs/packed-backend.c | 392 --
 1 file changed, 217 insertions(+), 175 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index a3f8a19b9b..8235ac8506 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -8,10 +8,30 @@
 
 struct packed_ref_store;
 
-struct packed_ref_cache {
+/*
+ * A `snapshot` represents one snapshot of a `packed-refs` file.
+ *
+ * Normally, this will be a mmapped view of the contents of the
+ * `packed-refs` file at the time the snapshot was created. However,
+ * if the `packed-refs` file was not sorted, this might point at heap
+ * memory holding the contents of the `packed-refs` file with its
+ * records sorted by refname.
+ *
+ * `snapshot` instances are reference counted (via
+ * `acquire_snapshot()` and `release_snapshot()`). This is to prevent
+ * an instance from disappearing while an iterator is still iterating
+ * over it. Instances are garbage collected when their `referrers`
+ * count goes to zero.
+ *
+ * The most recent `snapshot`, if available, is referenced by the
+ * `packed_ref_store`. Its freshness is checked whenever
+ * `get_snapshot()` is called; if the existing snapshot is obsolete, a
+ * new snapshot is taken.
+ */
+struct snapshot {
/*
 * A back-pointer to the packed_ref_store with which this
-* cache is associated:
+* snapshot is associated:
 */
struct packed_ref_store *refs;
 
@@ -35,26 +55,42 @@ struct packed_ref_cache {
size_t header_len;
 
/*
-* What is the peeled state of this cache? (This is usually
-* determined from the header of the "packed-refs" file.)
+* What is the peeled state of the `packed-refs` file that
+* this snapshot represents? (This is usually determined from
+* the file's header.)
 */
enum { PEELED_NONE, PEELED_TAGS, PEELED_FULLY } peeled;
 
/*
-* Count of references to the data structure in this instance,
-* including the pointer from files_ref_store::packed if any.
-* The data will not be freed as long as the reference count
-* is nonzero.
+* Count of references to this instance, including the pointer
+* from `packed_ref_store::snapshot`, if any. The instance
+* will not be freed as long as the reference count is
+* nonzero.
 */
unsigned int referrers;
 
-   /* The metadata from when this packed-refs cache was read */
+   /*
+* The metadata of the `packed-refs` file from which this
+* snapshot was created, used to tell if the file has been
+* replaced since we read it.
+*/
struct stat_validity validity;
 };
 
 /*
- * A container for `packed-refs`-related data. It is not (yet) a
- * `ref_store`.
+ * A `ref_store` representing references stored in a `packed-refs`
+ * file. It implements the `ref_store` interface, though it has some
+ * limitations:
+ *
+ * - It cannot store symbolic references.
+ *
+ * - It cannot store reflogs.
+ *
+ * - It does not support reference renaming (though it could).
+ *
+ * On the other hand, it can be locked outside of a reference
+ * transaction. In that case, it remains locked even after the
+ * transaction is done and the new `packed-refs` file is activated.
  */
 struct packed_ref_store {
struct ref_store base;
@@ -65,10 +101,10 @@ struct packed_ref_store {
char *path;
 
/*
-* A cache of the values read from the `packed-refs` file, if
-* it might still be current; otherwise, NULL.
+* A snapshot of the values read from the `packed-refs` file,
+* if it might still be current; otherwise, NULL.
 */
-   struct packed_ref_cache *cache;
+   struct snapshot *snapshot;
 
/*
 * Lock used for the "packed-refs" file. Note that this (and
@@ -85,44 +121,43 @@ struct packed_ref_store {
 };
 
 /*
- * Increment the reference 

[PATCH 09/20] packed_ref_cache: remember the file-wide peeling state

2017-09-13 Thread Michael Haggerty
Rather than store the peeling state (i.e., the one defined by traits
in the `packed-refs` file header line) in a local variable in
`read_packed_refs()`, store it permanently in `packed_ref_cache`. This
will be needed when we stop reading all packed refs at once.

Signed-off-by: Michael Haggerty 
---
 refs/packed-backend.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index 2b80f244c8..ae276f3445 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -18,6 +18,12 @@ struct packed_ref_cache {
 
struct ref_cache *cache;
 
+   /*
+* What is the peeled state of this cache? (This is usually
+* determined from the header of the "packed-refs" file.)
+*/
+   enum { PEELED_NONE, PEELED_TAGS, PEELED_FULLY } peeled;
+
/*
 * Count of references to the data structure in this instance,
 * including the pointer from files_ref_store::packed if any.
@@ -195,13 +201,13 @@ static struct packed_ref_cache *read_packed_refs(struct 
packed_ref_store *refs)
char *buf;
const char *pos, *eol, *eof;
struct strbuf tmp = STRBUF_INIT;
-   enum { PEELED_NONE, PEELED_TAGS, PEELED_FULLY } peeled = PEELED_NONE;
struct ref_dir *dir;
 
packed_refs->refs = refs;
acquire_packed_ref_cache(packed_refs);
packed_refs->cache = create_ref_cache(NULL, NULL);
packed_refs->cache->root->flag &= ~REF_INCOMPLETE;
+   packed_refs->peeled = PEELED_NONE;
 
fd = open(refs->path, O_RDONLY);
if (fd < 0) {
@@ -244,9 +250,9 @@ static struct packed_ref_cache *read_packed_refs(struct 
packed_ref_store *refs)
string_list_split_in_place(, p, ' ', -1);
 
if (unsorted_string_list_has_string(, "fully-peeled"))
-   peeled = PEELED_FULLY;
+   packed_refs->peeled = PEELED_FULLY;
else if (unsorted_string_list_has_string(, "peeled"))
-   peeled = PEELED_TAGS;
+   packed_refs->peeled = PEELED_TAGS;
/* perhaps other traits later as well */
 
/* The "+ 1" is for the LF character. */
@@ -282,8 +288,9 @@ static struct packed_ref_cache *read_packed_refs(struct 
packed_ref_store *refs)
oidclr();
flag |= REF_BAD_NAME | REF_ISBROKEN;
}
-   if (peeled == PEELED_FULLY ||
-   (peeled == PEELED_TAGS && starts_with(refname, 
"refs/tags/")))
+   if (packed_refs->peeled == PEELED_FULLY ||
+   (packed_refs->peeled == PEELED_TAGS &&
+starts_with(refname, "refs/tags/")))
flag |= REF_KNOWS_PEELED;
entry = create_ref_entry(refname, , flag);
add_ref_entry(dir, entry);
-- 
2.14.1



[PATCH 12/20] packed_ref_cache: keep the `packed-refs` file open and mmapped

2017-09-13 Thread Michael Haggerty
As long as a `packed_ref_cache` object is in use, keep the
corresponding `packed-refs` file open and mmapped.

This is still pointless, because we only read the file immediately
after instantiating the `packed_ref_cache`. But that will soon change.

Signed-off-by: Michael Haggerty 
---
 refs/packed-backend.c | 137 --
 1 file changed, 89 insertions(+), 48 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index 724c88631d..d209c8e5a0 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -18,6 +18,22 @@ struct packed_ref_cache {
 
struct ref_cache *cache;
 
+   /*
+* The file descriptor of the `packed-refs` file (open in
+* read-only mode), or -1 if it is not open.
+*/
+   int fd;
+
+   /*
+* The range of memory to which the `packed-refs` file is
+* mmapped. If there were no contents (e.g., because the file
+* didn't exist), `buf` and `eof` are both NULL.
+*/
+   char *buf, *eof;
+
+   /* The size of the header line, if any; otherwise, 0: */
+   size_t header_len;
+
/*
 * What is the peeled state of this cache? (This is usually
 * determined from the header of the "packed-refs" file.)
@@ -36,30 +52,6 @@ struct packed_ref_cache {
struct stat_validity validity;
 };
 
-/*
- * Increment the reference count of *packed_refs.
- */
-static void acquire_packed_ref_cache(struct packed_ref_cache *packed_refs)
-{
-   packed_refs->referrers++;
-}
-
-/*
- * Decrease the reference count of *packed_refs.  If it goes to zero,
- * free *packed_refs and return true; otherwise return false.
- */
-static int release_packed_ref_cache(struct packed_ref_cache *packed_refs)
-{
-   if (!--packed_refs->referrers) {
-   free_ref_cache(packed_refs->cache);
-   stat_validity_clear(_refs->validity);
-   free(packed_refs);
-   return 1;
-   } else {
-   return 0;
-   }
-}
-
 /*
  * A container for `packed-refs`-related data. It is not (yet) a
  * `ref_store`.
@@ -92,6 +84,50 @@ struct packed_ref_store {
struct tempfile tempfile;
 };
 
+/*
+ * Increment the reference count of *packed_refs.
+ */
+static void acquire_packed_ref_cache(struct packed_ref_cache *packed_refs)
+{
+   packed_refs->referrers++;
+}
+
+/*
+ * If the buffer in `packed_refs` is active, munmap the memory, close the
+ * file, and set the buffer pointers to NULL.
+ */
+static void release_packed_ref_buffer(struct packed_ref_cache *packed_refs)
+{
+   if (packed_refs->buf) {
+   if (munmap(packed_refs->buf,
+  packed_refs->eof - packed_refs->buf))
+   die_errno("error ummapping packed-refs file %s",
+ packed_refs->refs->path);
+   packed_refs->buf = packed_refs->eof = NULL;
+   packed_refs->header_len = 0;
+   }
+
+   if (packed_refs->fd >= 0)
+   close(packed_refs->fd);
+}
+
+/*
+ * Decrease the reference count of *packed_refs.  If it goes to zero,
+ * free *packed_refs and return true; otherwise return false.
+ */
+static int release_packed_ref_cache(struct packed_ref_cache *packed_refs)
+{
+   if (!--packed_refs->referrers) {
+   free_ref_cache(packed_refs->cache);
+   stat_validity_clear(_refs->validity);
+   release_packed_ref_buffer(packed_refs);
+   free(packed_refs);
+   return 1;
+   } else {
+   return 0;
+   }
+}
+
 struct ref_store *packed_ref_store_create(const char *path,
  unsigned int store_flags)
 {
@@ -284,13 +320,15 @@ static struct ref_iterator_vtable 
mmapped_ref_iterator_vtable = {
 };
 
 struct ref_iterator *mmapped_ref_iterator_begin(
-   const char *packed_refs_file,
struct packed_ref_cache *packed_refs,
const char *pos, const char *eof)
 {
struct mmapped_ref_iterator *iter = xcalloc(1, sizeof(*iter));
struct ref_iterator *ref_iterator = >base;
 
+   if (!packed_refs->buf)
+   return empty_ref_iterator_begin();
+
base_ref_iterator_init(ref_iterator, _ref_iterator_vtable, 0);
 
iter->packed_refs = packed_refs;
@@ -336,11 +374,8 @@ struct ref_iterator *mmapped_ref_iterator_begin(
 static struct packed_ref_cache *read_packed_refs(struct packed_ref_store *refs)
 {
struct packed_ref_cache *packed_refs = xcalloc(1, sizeof(*packed_refs));
-   int fd;
struct stat st;
size_t size;
-   char *buf;
-   const char *pos, *eof;
struct ref_dir *dir;
struct ref_iterator *iter;
int ok;
@@ -351,13 +386,15 @@ static struct packed_ref_cache *read_packed_refs(struct 
packed_ref_store *refs)
packed_refs->cache->root->flag &= ~REF_INCOMPLETE;

[PATCH 05/20] read_packed_refs(): use mmap to read the `packed-refs` file

2017-09-13 Thread Michael Haggerty
It's still done in a pretty stupid way, involving more data copying
than necessary. That will improve in future commits.

Signed-off-by: Michael Haggerty 
---
 refs/packed-backend.c | 42 --
 1 file changed, 32 insertions(+), 10 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index 5c50c223ef..154abbd83a 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -215,8 +215,12 @@ static NORETURN void die_invalid_line(const char *path,
  */
 static struct packed_ref_cache *read_packed_refs(struct packed_ref_store *refs)
 {
-   FILE *f;
struct packed_ref_cache *packed_refs = xcalloc(1, sizeof(*packed_refs));
+   int fd;
+   struct stat st;
+   size_t size;
+   char *buf;
+   const char *pos, *eol, *eof;
struct ref_entry *last = NULL;
struct strbuf line = STRBUF_INIT;
enum { PEELED_NONE, PEELED_TAGS, PEELED_FULLY } peeled = PEELED_NONE;
@@ -227,8 +231,8 @@ static struct packed_ref_cache *read_packed_refs(struct 
packed_ref_store *refs)
packed_refs->cache = create_ref_cache(NULL, NULL);
packed_refs->cache->root->flag &= ~REF_INCOMPLETE;
 
-   f = fopen(refs->path, "r");
-   if (!f) {
+   fd = open(refs->path, O_RDONLY);
+   if (fd < 0) {
if (errno == ENOENT) {
/*
 * This is OK; it just means that no
@@ -241,16 +245,27 @@ static struct packed_ref_cache *read_packed_refs(struct 
packed_ref_store *refs)
}
}
 
-   stat_validity_update(_refs->validity, fileno(f));
+   stat_validity_update(_refs->validity, fd);
+
+   if (fstat(fd, ) < 0)
+   die_errno("couldn't stat %s", refs->path);
+
+   size = xsize_t(st.st_size);
+   buf = xmmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
+   pos = buf;
+   eof = buf + size;
 
dir = get_ref_dir(packed_refs->cache->root);
-   while (strbuf_getwholeline(, f, '\n') != EOF) {
+   while (pos < eof) {
struct object_id oid;
const char *refname;
const char *traits;
 
-   if (!line.len || line.buf[line.len - 1] != '\n')
-   die_unterminated_line(refs->path, line.buf, line.len);
+   eol = memchr(pos, '\n', eof - pos);
+   if (!eol)
+   die_unterminated_line(refs->path, pos, eof - pos);
+
+   strbuf_add(, pos, eol + 1 - pos);
 
if (skip_prefix(line.buf, "# pack-refs with:", )) {
if (strstr(traits, " fully-peeled "))
@@ -258,7 +273,7 @@ static struct packed_ref_cache *read_packed_refs(struct 
packed_ref_store *refs)
else if (strstr(traits, " peeled "))
peeled = PEELED_TAGS;
/* perhaps other traits later as well */
-   continue;
+   goto next_line;
}
 
refname = parse_ref_line(, );
@@ -291,11 +306,18 @@ static struct packed_ref_cache *read_packed_refs(struct 
packed_ref_store *refs)
} else {
die_invalid_line(refs->path, line.buf, line.len);
}
+
+   next_line:
+   /* The "+ 1" is for the LF character. */
+   pos = eol + 1;
+   strbuf_reset();
}
 
-   fclose(f);
-   strbuf_release();
+   if (munmap(buf, size))
+   die_errno("error ummapping packed-refs file");
+   close(fd);
 
+   strbuf_release();
return packed_refs;
 }
 
-- 
2.14.1



[PATCH 03/20] packed_ref_cache: add a backlink to the associated `packed_ref_store`

2017-09-13 Thread Michael Haggerty
It will prove convenient in upcoming patches.

Signed-off-by: Michael Haggerty 
---
 refs/packed-backend.c | 23 ---
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index e411501871..a3d9210cb0 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -7,7 +7,15 @@
 #include "../iterator.h"
 #include "../lockfile.h"
 
+struct packed_ref_store;
+
 struct packed_ref_cache {
+   /*
+* A back-pointer to the packed_ref_store with which this
+* cache is associated:
+*/
+   struct packed_ref_store *refs;
+
struct ref_cache *cache;
 
/*
@@ -154,7 +162,7 @@ static const char *parse_ref_line(struct strbuf *line, 
struct object_id *oid)
 }
 
 /*
- * Read from `packed_refs_file` into a newly-allocated
+ * Read from the `packed-refs` file into a newly-allocated
  * `packed_ref_cache` and return it. The return value will already
  * have its reference count incremented.
  *
@@ -182,7 +190,7 @@ static const char *parse_ref_line(struct strbuf *line, 
struct object_id *oid)
  *  compatibility with older clients, but we do not require it
  *  (i.e., "peeled" is a no-op if "fully-peeled" is set).
  */
-static struct packed_ref_cache *read_packed_refs(const char *packed_refs_file)
+static struct packed_ref_cache *read_packed_refs(struct packed_ref_store *refs)
 {
FILE *f;
struct packed_ref_cache *packed_refs = xcalloc(1, sizeof(*packed_refs));
@@ -191,11 +199,12 @@ static struct packed_ref_cache *read_packed_refs(const 
char *packed_refs_file)
enum { PEELED_NONE, PEELED_TAGS, PEELED_FULLY } peeled = PEELED_NONE;
struct ref_dir *dir;
 
+   packed_refs->refs = refs;
acquire_packed_ref_cache(packed_refs);
packed_refs->cache = create_ref_cache(NULL, NULL);
packed_refs->cache->root->flag &= ~REF_INCOMPLETE;
 
-   f = fopen(packed_refs_file, "r");
+   f = fopen(refs->path, "r");
if (!f) {
if (errno == ENOENT) {
/*
@@ -205,7 +214,7 @@ static struct packed_ref_cache *read_packed_refs(const char 
*packed_refs_file)
 */
return packed_refs;
} else {
-   die_errno("couldn't read %s", packed_refs_file);
+   die_errno("couldn't read %s", refs->path);
}
}
 
@@ -218,7 +227,7 @@ static struct packed_ref_cache *read_packed_refs(const char 
*packed_refs_file)
const char *traits;
 
if (!line.len || line.buf[line.len - 1] != '\n')
-   die("unterminated line in %s: %s", packed_refs_file, 
line.buf);
+   die("unterminated line in %s: %s", refs->path, 
line.buf);
 
if (skip_prefix(line.buf, "# pack-refs with:", )) {
if (strstr(traits, " fully-peeled "))
@@ -258,7 +267,7 @@ static struct packed_ref_cache *read_packed_refs(const char 
*packed_refs_file)
last->flag |= REF_KNOWS_PEELED;
} else {
strbuf_setlen(, line.len - 1);
-   die("unexpected line in %s: %s", packed_refs_file, 
line.buf);
+   die("unexpected line in %s: %s", refs->path, line.buf);
}
}
 
@@ -293,7 +302,7 @@ static struct packed_ref_cache *get_packed_ref_cache(struct 
packed_ref_store *re
validate_packed_ref_cache(refs);
 
if (!refs->cache)
-   refs->cache = read_packed_refs(refs->path);
+   refs->cache = read_packed_refs(refs);
 
return refs->cache;
 }
-- 
2.14.1



[PATCH 07/20] read_packed_refs(): make parsing of the header line more robust

2017-09-13 Thread Michael Haggerty
The old code parsed the traits in the `packed-refs` header by looking
for the string " trait " (i.e., the name of the trait with a space on
either side) in the header line. This is fragile, because if any other
implementation of Git forgets to write the trailing space, the last
trait would silently be ignored (and the error might never be
noticed).

So instead, use `string_list_split_in_place()` to split the traits
into tokens then use `unsorted_string_list_has_string()` to look for
the tokens we are interested in. This means that we can read the
traits correctly even if the header line is missing a trailing
space (or indeed, if it is missing the space after the colon, or if it
has multiple spaces somewhere).

However, older Git clients (and perhaps other Git implementations)
still require the surrounding spaces, so we still have to output the
header with a trailing space.

Signed-off-by: Michael Haggerty 
---
 refs/packed-backend.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index 141f02b9c8..a45e3ff92f 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -257,25 +257,30 @@ static struct packed_ref_cache *read_packed_refs(struct 
packed_ref_store *refs)
 
/* If the file has a header line, process it: */
if (pos < eof && *pos == '#') {
-   const char *traits;
+   char *p;
+   struct string_list traits = STRING_LIST_INIT_NODUP;
 
eol = memchr(pos, '\n', eof - pos);
if (!eol)
die_unterminated_line(refs->path, pos, eof - pos);
 
-   strbuf_add(, pos, eol + 1 - pos);
+   strbuf_add(, pos, eol - pos);
 
-   if (!skip_prefix(line.buf, "# pack-refs with:", ))
+   if (!skip_prefix(line.buf, "# pack-refs with:", (const char 
**)))
die_invalid_line(refs->path, pos, eof - pos);
 
-   if (strstr(traits, " fully-peeled "))
+   string_list_split_in_place(, p, ' ', -1);
+
+   if (unsorted_string_list_has_string(, "fully-peeled"))
peeled = PEELED_FULLY;
-   else if (strstr(traits, " peeled "))
+   else if (unsorted_string_list_has_string(, "peeled"))
peeled = PEELED_TAGS;
/* perhaps other traits later as well */
 
/* The "+ 1" is for the LF character. */
pos = eol + 1;
+
+   string_list_clear(, 0);
strbuf_reset();
}
 
@@ -610,7 +615,11 @@ int packed_refs_is_locked(struct ref_store *ref_store)
 
 /*
  * The packed-refs header line that we write out.  Perhaps other
- * traits will be added later.  The trailing space is required.
+ * traits will be added later.
+ *
+ * Note that earlier versions of Git used to parse these traits by
+ * looking for " trait " in the line. For this reason, the space after
+ * the colon and the trailing space are required.
  */
 static const char PACKED_REFS_HEADER[] =
"# pack-refs with: peeled fully-peeled \n";
-- 
2.14.1



[PATCH 06/20] read_packed_refs(): only check for a header at the top of the file

2017-09-13 Thread Michael Haggerty
This tightens up the parsing a bit; previously, stray header-looking
lines would have been processed.

Signed-off-by: Michael Haggerty 
---
 refs/packed-backend.c | 35 ---
 1 file changed, 24 insertions(+), 11 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index 154abbd83a..141f02b9c8 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -255,11 +255,34 @@ static struct packed_ref_cache *read_packed_refs(struct 
packed_ref_store *refs)
pos = buf;
eof = buf + size;
 
+   /* If the file has a header line, process it: */
+   if (pos < eof && *pos == '#') {
+   const char *traits;
+
+   eol = memchr(pos, '\n', eof - pos);
+   if (!eol)
+   die_unterminated_line(refs->path, pos, eof - pos);
+
+   strbuf_add(, pos, eol + 1 - pos);
+
+   if (!skip_prefix(line.buf, "# pack-refs with:", ))
+   die_invalid_line(refs->path, pos, eof - pos);
+
+   if (strstr(traits, " fully-peeled "))
+   peeled = PEELED_FULLY;
+   else if (strstr(traits, " peeled "))
+   peeled = PEELED_TAGS;
+   /* perhaps other traits later as well */
+
+   /* The "+ 1" is for the LF character. */
+   pos = eol + 1;
+   strbuf_reset();
+   }
+
dir = get_ref_dir(packed_refs->cache->root);
while (pos < eof) {
struct object_id oid;
const char *refname;
-   const char *traits;
 
eol = memchr(pos, '\n', eof - pos);
if (!eol)
@@ -267,15 +290,6 @@ static struct packed_ref_cache *read_packed_refs(struct 
packed_ref_store *refs)
 
strbuf_add(, pos, eol + 1 - pos);
 
-   if (skip_prefix(line.buf, "# pack-refs with:", )) {
-   if (strstr(traits, " fully-peeled "))
-   peeled = PEELED_FULLY;
-   else if (strstr(traits, " peeled "))
-   peeled = PEELED_TAGS;
-   /* perhaps other traits later as well */
-   goto next_line;
-   }
-
refname = parse_ref_line(, );
if (refname) {
int flag = REF_ISPACKED;
@@ -307,7 +321,6 @@ static struct packed_ref_cache *read_packed_refs(struct 
packed_ref_store *refs)
die_invalid_line(refs->path, line.buf, line.len);
}
 
-   next_line:
/* The "+ 1" is for the LF character. */
pos = eol + 1;
strbuf_reset();
-- 
2.14.1



[PATCH 4/7] convert less-trivial versions of "write_in_full() != len"

2017-09-13 Thread Jeff King
The prior commit converted many sites to check the return
value of write_in_full() for negativity, rather than a
mismatch with the input length. This patch covers similar
cases, but where the return value is stored in an
intermediate variable. These should get the same treatment,
but they need to be reviewed more carefully since it would
be a bug if the return value is stored in an unsigned type
(which indeed, it is in one of the cases).

Signed-off-by: Jeff King 
---
 entry.c  | 5 +++--
 refs/files-backend.c | 2 +-
 streaming.c  | 2 +-
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/entry.c b/entry.c
index cb291aa88b..1c7e3c11d5 100644
--- a/entry.c
+++ b/entry.c
@@ -257,7 +257,8 @@ static int write_entry(struct cache_entry *ce,
char *new;
struct strbuf buf = STRBUF_INIT;
unsigned long size;
-   size_t wrote, newsize = 0;
+   ssize_t wrote;
+   size_t newsize = 0;
struct stat st;
const struct submodule *sub;
 
@@ -332,7 +333,7 @@ static int write_entry(struct cache_entry *ce,
fstat_done = fstat_output(fd, state, );
close(fd);
free(new);
-   if (wrote != size)
+   if (wrote < 0)
return error("unable to write file %s", path);
break;
case S_IFGITLINK:
diff --git a/refs/files-backend.c b/refs/files-backend.c
index f8b91fff3f..489471bbcf 100644
--- a/refs/files-backend.c
+++ b/refs/files-backend.c
@@ -1549,7 +1549,7 @@ static int log_ref_write_fd(int fd, const struct 
object_id *old_oid,
 
written = len <= maxlen ? write_in_full(fd, logrec, len) : -1;
free(logrec);
-   if (written != len)
+   if (written < 0)
return -1;
 
return 0;
diff --git a/streaming.c b/streaming.c
index 6f1c60f12b..5892b50bd8 100644
--- a/streaming.c
+++ b/streaming.c
@@ -540,7 +540,7 @@ int stream_blob_to_fd(int fd, const struct object_id *oid, 
struct stream_filter
kept = 0;
wrote = write_in_full(fd, buf, readlen);
 
-   if (wrote != readlen)
+   if (wrote < 0)
goto close_and_exit;
}
if (kept && (lseek(fd, kept - 1, SEEK_CUR) == (off_t) -1 ||
-- 
2.14.1.874.ge7b2e05270



[PATCH 11/20] mmapped_ref_iterator_advance(): no peeled value for broken refs

2017-09-13 Thread Michael Haggerty
If a reference is broken, suppress its peeled value.

Signed-off-by: Michael Haggerty 
---
 refs/packed-backend.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index 312116a99d..724c88631d 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -234,9 +234,15 @@ static int mmapped_ref_iterator_advance(struct 
ref_iterator *ref_iterator)
 
/*
 * Regardless of what the file header said, we
-* definitely know the value of *this* reference:
+* definitely know the value of *this* reference. But
+* we suppress it if the reference is broken:
 */
-   iter->base.flags |= REF_KNOWS_PEELED;
+   if ((iter->base.flags & REF_ISBROKEN)) {
+   oidclr(>peeled);
+   iter->base.flags &= ~REF_KNOWS_PEELED;
+   } else {
+   iter->base.flags |= REF_KNOWS_PEELED;
+   }
} else {
oidclr(>peeled);
}
-- 
2.14.1



[PATCH 01/20] ref_iterator: keep track of whether the iterator output is ordered

2017-09-13 Thread Michael Haggerty
References are iterated over in order by refname, but reflogs are not.
Some consumers of reference iteration care about the difference. Teach
each `ref_iterator` to keep track of whether its output is ordered.

`overlay_ref_iterator` is one of the picky consumers. Add a sanity
check in `overlay_ref_iterator_begin()` to verify that its inputs are
ordered.

Signed-off-by: Michael Haggerty 
---
 refs.c|  4 
 refs/files-backend.c  | 16 +---
 refs/iterator.c   | 15 ++-
 refs/packed-backend.c |  2 +-
 refs/ref-cache.c  |  2 +-
 refs/ref-cache.h  |  3 ++-
 refs/refs-internal.h  | 23 +++
 7 files changed, 46 insertions(+), 19 deletions(-)

diff --git a/refs.c b/refs.c
index b0106b8162..101c107ee8 100644
--- a/refs.c
+++ b/refs.c
@@ -1309,6 +1309,10 @@ struct ref_iterator *refs_ref_iterator_begin(
if (trim)
iter = prefix_ref_iterator_begin(iter, "", trim);
 
+   /* Sanity check for subclasses: */
+   if (!iter->ordered)
+   BUG("reference iterator is not ordered");
+
return iter;
 }
 
diff --git a/refs/files-backend.c b/refs/files-backend.c
index 961424a4ea..35648c89fc 100644
--- a/refs/files-backend.c
+++ b/refs/files-backend.c
@@ -762,7 +762,7 @@ static struct ref_iterator *files_ref_iterator_begin(
const char *prefix, unsigned int flags)
 {
struct files_ref_store *refs;
-   struct ref_iterator *loose_iter, *packed_iter;
+   struct ref_iterator *loose_iter, *packed_iter, *overlay_iter;
struct files_ref_iterator *iter;
struct ref_iterator *ref_iterator;
unsigned int required_flags = REF_STORE_READ;
@@ -772,10 +772,6 @@ static struct ref_iterator *files_ref_iterator_begin(
 
refs = files_downcast(ref_store, required_flags, "ref_iterator_begin");
 
-   iter = xcalloc(1, sizeof(*iter));
-   ref_iterator = >base;
-   base_ref_iterator_init(ref_iterator, _ref_iterator_vtable);
-
/*
 * We must make sure that all loose refs are read before
 * accessing the packed-refs file; this avoids a race
@@ -811,7 +807,13 @@ static struct ref_iterator *files_ref_iterator_begin(
refs->packed_ref_store, prefix, 0,
DO_FOR_EACH_INCLUDE_BROKEN);
 
-   iter->iter0 = overlay_ref_iterator_begin(loose_iter, packed_iter);
+   overlay_iter = overlay_ref_iterator_begin(loose_iter, packed_iter);
+
+   iter = xcalloc(1, sizeof(*iter));
+   ref_iterator = >base;
+   base_ref_iterator_init(ref_iterator, _ref_iterator_vtable,
+  overlay_iter->ordered);
+   iter->iter0 = overlay_iter;
iter->flags = flags;
 
return ref_iterator;
@@ -2084,7 +2086,7 @@ static struct ref_iterator 
*files_reflog_iterator_begin(struct ref_store *ref_st
struct ref_iterator *ref_iterator = >base;
struct strbuf sb = STRBUF_INIT;
 
-   base_ref_iterator_init(ref_iterator, _reflog_iterator_vtable);
+   base_ref_iterator_init(ref_iterator, _reflog_iterator_vtable, 0);
files_reflog_path(refs, , NULL);
iter->dir_iterator = dir_iterator_begin(sb.buf);
iter->ref_store = ref_store;
diff --git a/refs/iterator.c b/refs/iterator.c
index 4cf449ef66..c475360f0a 100644
--- a/refs/iterator.c
+++ b/refs/iterator.c
@@ -25,9 +25,11 @@ int ref_iterator_abort(struct ref_iterator *ref_iterator)
 }
 
 void base_ref_iterator_init(struct ref_iterator *iter,
-   struct ref_iterator_vtable *vtable)
+   struct ref_iterator_vtable *vtable,
+   int ordered)
 {
iter->vtable = vtable;
+   iter->ordered = !!ordered;
iter->refname = NULL;
iter->oid = NULL;
iter->flags = 0;
@@ -72,7 +74,7 @@ struct ref_iterator *empty_ref_iterator_begin(void)
struct empty_ref_iterator *iter = xcalloc(1, sizeof(*iter));
struct ref_iterator *ref_iterator = >base;
 
-   base_ref_iterator_init(ref_iterator, _ref_iterator_vtable);
+   base_ref_iterator_init(ref_iterator, _ref_iterator_vtable, 1);
return ref_iterator;
 }
 
@@ -205,6 +207,7 @@ static struct ref_iterator_vtable merge_ref_iterator_vtable 
= {
 };
 
 struct ref_iterator *merge_ref_iterator_begin(
+   int ordered,
struct ref_iterator *iter0, struct ref_iterator *iter1,
ref_iterator_select_fn *select, void *cb_data)
 {
@@ -219,7 +222,7 @@ struct ref_iterator *merge_ref_iterator_begin(
 * references through only if they exist in both iterators.
 */
 
-   base_ref_iterator_init(ref_iterator, _ref_iterator_vtable);
+   base_ref_iterator_init(ref_iterator, _ref_iterator_vtable, 
ordered);
iter->iter0 = iter0;
iter->iter1 = iter1;
iter->select = select;
@@ -268,9 +271,11 @@ struct ref_iterator *overlay_ref_iterator_begin(
} else if 

[PATCH 02/20] prefix_ref_iterator: break when we leave the prefix

2017-09-13 Thread Michael Haggerty
From: Jeff King 

If the underlying iterator is ordered, then `prefix_ref_iterator` can
stop as soon as it sees a refname that comes after the prefix. This
will rarely make a big difference now, because `ref_cache_iterator`
only iterates over the directory containing the prefix (and usually
the prefix will span a whole directory anyway). But if *hint, hint* a
future reference backend doesn't itself know where to stop the
iteration, then this optimization will be a big win.

Note that there is no guarantee that the underlying iterator doesn't
include output preceding the prefix, so we have to skip over any
unwanted references before we get to the ones that we want.

Signed-off-by: Jeff King 
Signed-off-by: Michael Haggerty 
---
Note that the implementation of `compare_prefix()` here is a bit
different than Peff's original version.

 refs/iterator.c | 32 +++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/refs/iterator.c b/refs/iterator.c
index c475360f0a..bd35da4e62 100644
--- a/refs/iterator.c
+++ b/refs/iterator.c
@@ -287,6 +287,20 @@ struct prefix_ref_iterator {
int trim;
 };
 
+/* Return -1, 0, 1 if refname is before, inside, or after the prefix. */
+static int compare_prefix(const char *refname, const char *prefix)
+{
+   while (*prefix) {
+   if (*refname != *prefix)
+   return ((unsigned char)*refname < (unsigned 
char)*prefix) ? -1 : +1;
+
+   refname++;
+   prefix++;
+   }
+
+   return 0;
+}
+
 static int prefix_ref_iterator_advance(struct ref_iterator *ref_iterator)
 {
struct prefix_ref_iterator *iter =
@@ -294,9 +308,25 @@ static int prefix_ref_iterator_advance(struct ref_iterator 
*ref_iterator)
int ok;
 
while ((ok = ref_iterator_advance(iter->iter0)) == ITER_OK) {
-   if (!starts_with(iter->iter0->refname, iter->prefix))
+   int cmp = compare_prefix(iter->iter0->refname, iter->prefix);
+
+   if (cmp < 0)
continue;
 
+   if (cmp > 0) {
+   /*
+* If the source iterator is ordered, then we
+* can stop the iteration as soon as we see a
+* refname that comes after the prefix:
+*/
+   if (iter->iter0->ordered) {
+   ok = ref_iterator_abort(iter->iter0);
+   break;
+   } else {
+   continue;
+   }
+   }
+
if (iter->trim) {
/*
 * It is nonsense to trim off characters that
-- 
2.14.1



[PATCH 00/20] Read `packed-refs` using mmap()

2017-09-13 Thread Michael Haggerty
Previously, any time we wanted to read even a single reference from
the `packed-refs` file, we parsed the whole file and stored it in an
elaborate structure in memory called a `ref_cache`. Subsequent
reference lookups or iterations over some or all of the references
could be done by reading from the `ref_cache`.

But for large `packed-refs` files, the time needed to parse the file,
and the memory needed to cache its contents, can be quite significant.
This is partly because building the cache costs lots of little memory
allocations. (And lest you think that most Git commands can be
executed by reading at most a couple of loose references, remember
that almost any command that reads objects has to look for replace
references (unless they are turned off in the config), and *that*
necessarily entails reading packed references.)

Following lots of work to extract the `packed_ref_store` into a
separate module and decouple it from the `files_ref_store`, it is now
possible to fundamentally change how the `packed-refs` file is read.

* `mmap()` the whole file rather than `read()`ing it.

* Instead of parsing the complete file at once into a `ref_cache`,
  parse the references out of the file contents on demand.

* Use a binary search to find, very quickly, the reference or group of
  references that needs to be read. Parse *only* those references out
  of the file contents, without creating in-memory data structures at
  all.

In rare cases this change might force parts of the `packed-refs` file
to be read multiple times, but that cost is far outweighed by the fact
that usually most of the `packed-refs` file doesn't have to be read
*at all*.

Note that the binary search optimization requires the `packed-refs`
file to be sorted by reference name. We have always written them
sorted, but just in case there are clients that don't, we assume the
file is unsorted unless its header lists a `sorted` trait. From now
on, we write the file with that trait. If the file is not sorted, it
is sorted on the fly in memory.

For a repository with only a couple thousand references and a warm
disk cache, this change doesn't make a very significant difference.
But for repositories with very large numbers of references, the
difference start to be significant:

A repository with 54k references (warm cache):

  git 2.13.1 this branch
git for-each-ref  464 ms  452 ms
git for-each-ref (no output)   66 ms   47 ms
git for-each-ref (0.6% of refs)47 ms9 ms
git for-each-ref (0.6%, no output) 41 ms2 ms
git rev-parse  32 ms2 ms

A repository (admittedly insane, but a real-life example) with 60M
references (warm cache):

  git 2.13.1 this branch
git for-each-ref (no output)84000 ms61000 ms
git rev-parse   4 ms2 ms

This branch applies on top of mh/packed-ref-transactions. It can also
be obtained from my git fork [1] as branch `mmap-packed-refs`.

Michael

[1] https://github.com/mhagger/git

Jeff King (1):
  prefix_ref_iterator: break when we leave the prefix

Michael Haggerty (19):
  ref_iterator: keep track of whether the iterator output is ordered
  packed_ref_cache: add a backlink to the associated `packed_ref_store`
  die_unterminated_line(), die_invalid_line(): new functions
  read_packed_refs(): use mmap to read the `packed-refs` file
  read_packed_refs(): only check for a header at the top of the file
  read_packed_refs(): make parsing of the header line more robust
  read_packed_refs(): read references with minimal copying
  packed_ref_cache: remember the file-wide peeling state
  mmapped_ref_iterator: add iterator over a packed-refs file
  mmapped_ref_iterator_advance(): no peeled value for broken refs
  packed_ref_cache: keep the `packed-refs` file open and mmapped
  read_packed_refs(): ensure that references are ordered when read
  packed_ref_iterator_begin(): iterate using `mmapped_ref_iterator`
  packed_read_raw_ref(): read the reference from the mmapped buffer
  ref_store: implement `refs_peel_ref()` generically
  packed_ref_store: get rid of the `ref_cache` entirely
  ref_cache: remove support for storing peeled values
  mmapped_ref_iterator: inline into `packed_ref_iterator`
  packed-backend.c: rename a bunch of things and update comments

 refs.c|  22 +-
 refs/files-backend.c  |  54 +--
 refs/iterator.c   |  47 ++-
 refs/packed-backend.c | 896 +-
 refs/ref-cache.c  |  44 +--
 refs/ref-cache.h  |  35 +-
 refs/refs-internal.h  |  26 +-
 7 files changed, 761 insertions(+), 363 deletions(-)

-- 
2.14.1



[PATCH 3/7] avoid "write_in_full(fd, buf, len) != len" pattern

2017-09-13 Thread Jeff King
The return value of write_in_full() is either "-1", or the
requested number of bytes[1]. If we make a partial write
before seeing an error, we still return -1, not a partial
value. This goes back to f6aa66cb95 (write_in_full: really
write in full or return error on disk full., 2007-01-11).

So checking anything except "was the return value negative"
is pointless. And there are a couple of reasons not to do
so:

  1. It can do a funny signed/unsigned comparison. If your
 "len" is signed (e.g., a size_t) then the compiler will
 promote the "-1" to its unsigned variant.

 This works out for "!= len" (unless you really were
 trying to write the maximum size_t bytes), but is a
 bug if you check "< len" (an example of which was fixed
 recently in config.c).

 We should avoid promoting the mental model that you
 need to check the length at all, so that new sites are
 not tempted to copy us.

  2. Checking for a negative value is shorter to type,
 especially when the length is an expression.

  3. Linus says so. In d34cf19b89 (Clean up write_in_full()
 users, 2007-01-11), right after the write_in_full()
 semantics were changed, he wrote:

   I really wish every "write_in_full()" user would just
   check against "<0" now, but this fixes the nasty and
   stupid ones.

 Appeals to authority aside, this makes it clear that
 writing it this way does not have an intentional
 benefit. It's a historical curiosity that we never
 bothered to clean up (and which was undoubtedly
 cargo-culted into new sites).

So let's convert these obviously-correct cases (this
includes write_str_in_full(), which is just a wrapper for
write_in_full()).

[1] A careful reader may notice there is one way that
write_in_full() can return a different value. If we ask
write() to write N bytes and get a return value that is
_larger_ than N, we could return a larger total. But
besides the fact that this would imply a totally broken
version of write(), it would already invoke undefined
behavior. Our internal remaining counter is an unsigned
size_t, which means that subtracting too many byte will
wrap it around to a very large number. So we'll instantly
begin reading off the end of the buffer, trying to write
gigabytes (or petabytes) of data.

Signed-off-by: Jeff King 
---
 builtin/receive-pack.c | 2 +-
 builtin/rerere.c   | 2 +-
 builtin/unpack-file.c  | 2 +-
 config.c   | 4 ++--
 diff.c | 2 +-
 fast-import.c  | 2 +-
 http-backend.c | 4 ++--
 ll-merge.c | 2 +-
 read-cache.c   | 6 +++---
 refs.c | 2 +-
 refs/files-backend.c   | 8 
 rerere.c   | 2 +-
 shallow.c  | 6 +++---
 t/helper/test-delta.c  | 2 +-
 transport-helper.c | 5 ++---
 wrapper.c  | 2 +-
 16 files changed, 26 insertions(+), 27 deletions(-)

diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c
index 52c63ebfdc..dd06b3fb4f 100644
--- a/builtin/receive-pack.c
+++ b/builtin/receive-pack.c
@@ -743,7 +743,7 @@ static int run_and_feed_hook(const char *hook_name, feed_fn 
feed,
size_t n;
if (feed(feed_state, , ))
break;
-   if (write_in_full(proc.in, buf, n) != n)
+   if (write_in_full(proc.in, buf, n) < 0)
break;
}
close(proc.in);
diff --git a/builtin/rerere.c b/builtin/rerere.c
index ffb66e2907..0bc40298c2 100644
--- a/builtin/rerere.c
+++ b/builtin/rerere.c
@@ -18,7 +18,7 @@ static int outf(void *dummy, mmbuffer_t *ptr, int nbuf)
 {
int i;
for (i = 0; i < nbuf; i++)
-   if (write_in_full(1, ptr[i].ptr, ptr[i].size) != ptr[i].size)
+   if (write_in_full(1, ptr[i].ptr, ptr[i].size) < 0)
return -1;
return 0;
 }
diff --git a/builtin/unpack-file.c b/builtin/unpack-file.c
index 281ca1db6c..32e0155577 100644
--- a/builtin/unpack-file.c
+++ b/builtin/unpack-file.c
@@ -15,7 +15,7 @@ static char *create_temp_file(struct object_id *oid)
 
xsnprintf(path, sizeof(path), ".merge_file_XX");
fd = xmkstemp(path);
-   if (write_in_full(fd, buf, size) != size)
+   if (write_in_full(fd, buf, size) < 0)
die_errno("unable to write temp-file");
close(fd);
return path;
diff --git a/config.c b/config.c
index eee4ac0355..daf093db45 100644
--- a/config.c
+++ b/config.c
@@ -2611,7 +2611,7 @@ int git_config_set_multivar_in_file_gently(const char 
*config_filename,
  copy_end - copy_begin) < 0)
goto write_err_out;
if (new_line &&
-   write_str_in_full(fd, "\n") != 1)
+   write_str_in_full(fd, "\n") < 0)
   

[PATCH 2/7] get-tar-commit-id: check write_in_full() return against 0

2017-09-13 Thread Jeff King
We ask to write 41 bytes and make sure that the return value
is at least 41. This is the same "dangerous" pattern that
was fixed in the prior commit (wherein a negative return
value is promoted to unsigned), though it is not dangerous
here because our "41" is a constant, not an unsigned
variable.

But we should convert it anyway to avoid modeling a
dangerous construct.

Signed-off-by: Jeff King 
---
 builtin/get-tar-commit-id.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/builtin/get-tar-commit-id.c b/builtin/get-tar-commit-id.c
index e21c5416cd..6d9a79f9b3 100644
--- a/builtin/get-tar-commit-id.c
+++ b/builtin/get-tar-commit-id.c
@@ -33,8 +33,7 @@ int cmd_get_tar_commit_id(int argc, const char **argv, const 
char *prefix)
if (!skip_prefix(content, "52 comment=", ))
return 1;
 
-   n = write_in_full(1, comment, 41);
-   if (n < 41)
+   if (write_in_full(1, comment, 41) < 0)
die_errno("git get-tar-commit-id: write error");
 
return 0;
-- 
2.14.1.874.ge7b2e05270



[PATCH 1/7] config: avoid "write_in_full(fd, buf, len) < len" pattern

2017-09-13 Thread Jeff King
The return type of write_in_full() is a signed ssize_t,
because we may return "-1" on failure (even if we succeeded
in writing some bytes). But "len" itself may be an
unsigned type (the function takes a size_t, but of course we
may have something else in the calling function). So while
it seems like:

  if (write_in_full(fd, buf, len) < len)
die_errno("write error");

would notice an error, it won't if "len" is unsigned.  The
compiler sees a signed/unsigned comparison and promotes the
signed value, resulting in (size_t)-1, the highest possible
size_t (or again, whatever type the caller has). This cannot
possibly be smaller than "len", and so the conditional can
never trigger.

I scoured the code base for cases of this, but it turns out
that these two in git_config_set_multivar_in_file_gently()
are the only ones. This case is actually quite interesting:
we don't have a size_t, but rather use the subtraction of
two pointers. Which you might think would be a signed
ptrdiff_t, but clearly both gcc and clang treat it as
unsigned (possibly because the conditional just above
guarantees that the result is greater than zero).

We can avoid the whole question by just checking for a
negative return value directly, as write_in_full() will
never return any value except -1 or the full count.

There's no addition to the test suite here, since you need
to convince write() to fail in order to see the problem. The
simplest reproduction recipe I came up with is to trigger
ENOSPC (this only works on Linux, obviously):

  # make a limited-size filesystem
  dd if=/dev/zero of=small.disk bs=1M count=1
  mke2fs small.disk
  mkdir mnt
  sudo mount -o loop small.disk mnt
  cd mnt
  sudo chown $USER:$USER .

  # make a config file with some content
  git config --file=config one.key value
  git config --file=config two.key value

  # now fill up the disk
  dd if=/dev/zero of=fill

  # and try to delete a key, which requires copying the rest
  # of the file to config.lock, and will fail on write()
  git config --file=config --unset two.key

That final command should (and does after this patch)
produce an error message due to the failed write, and leave
the file intact. With the current code it silently ignores
the failure and renames config.lock into place, leaving you
with a totally empty config file!

Reported-by: demerphq 
Signed-off-by: Jeff King 
---
 config.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/config.c b/config.c
index d0d8ce823a..eee4ac0355 100644
--- a/config.c
+++ b/config.c
@@ -2608,8 +2608,7 @@ int git_config_set_multivar_in_file_gently(const char 
*config_filename,
/* write the first part of the config */
if (copy_end > copy_begin) {
if (write_in_full(fd, contents + copy_begin,
- copy_end - copy_begin) <
-   copy_end - copy_begin)
+ copy_end - copy_begin) < 0)
goto write_err_out;
if (new_line &&
write_str_in_full(fd, "\n") != 1)
@@ -2631,8 +2630,7 @@ int git_config_set_multivar_in_file_gently(const char 
*config_filename,
/* write the rest of the config */
if (copy_begin < contents_sz)
if (write_in_full(fd, contents + copy_begin,
- contents_sz - copy_begin) <
-   contents_sz - copy_begin)
+ contents_sz - copy_begin) < 0)
goto write_err_out;
 
munmap(contents, contents_sz);
-- 
2.14.1.874.ge7b2e05270



RE: [PATCH 1/1] reset: fix reset when using the sparse-checkout feature.

2017-09-13 Thread Kevin Willford
> From: Jacob Keller [mailto:jacob.kel...@gmail.com]
> Sent: Tuesday, September 12, 2017 7:39 PM
> 
> On Tue, Sep 12, 2017 at 4:30 PM, Kevin Willford  wrote:
> >
> > Sorry.  It was not in the sparse-checkout file.
> > $ git config core.sparsecheckout true
> > $ echo /i > .git/info/sparse-checkout
> > $ git checkout -f
> > was ran in the modified file case as well and "i" was the only file in the
> > working directory before reset.
> >
> 
> 
> I'm assuming then that you mean that some file "m" is modified by the
> commit, and do not mean to say that it has local modifications in the
> working tree? That is not what I had inferred from earlier, so I was
> very much confused.
> 

Yes

> In this example, the only file actually checked out is "i", as
> everything else will have the skip-worktree bit set..?
> 

Yes

> So doing git reset HEAD~1 will reset the branch back one commit, and
> now because of this reset is clearing the skip worktree flag, and
> since you never had a copy of it checked out it is notified as
> deleted, because it's no longer marked as skip-worktree?
> 

Correct

> 
> I think the real fix is to stop having reset clear skip-worktree, no?
> 
> By definition if you do a sparse checkout, you're telling git "I only
> care about the files in this sparse checkout, and do not concern me
> with anything else"... So the proposed fix is "since git cleared the
> skip-worktree flag, we should actually also copy the file out again."
> but I think the real problem is that we're clearing skip worktree to
> begin with?

This certainly is an option but I would have some questions with this
approach.  Does this statement, "I only care about the files in this
sparse checkout, and do not concern me with anything else", mean
that git should not change files outside the sparse-checkout whether
they are in the working directory or the index?  Or does that only
apply to the working directory and the index version of files can
change to whatever git decides?  So in the example above would
"m" be the HEAD~1 version of the file in the index or the HEAD
version before the reset?  Does this apply to all git commands,
merge, rebase, cherry-pick, etc?  What about when there is a merge
conflict with a file that is outside the sparse checkout?

Seems to me it is a lot more complex than only caring about the files
in the sparse checkout and no concern for anything else.  Personally
I would like to error on the side of letting the user decide what they
want to do, even if that means turning off the skip-worktree bit and
putting the working directory into an expected state.




[PATCH 0/7] config.c may fail to notice some write() failures

2017-09-13 Thread Jeff King
On Wed, Sep 13, 2017 at 01:59:17PM +0200, demerphq wrote:

> After being away for a while I saw the following message in one of my git 
> repos:
> 
> $ git status
> On branch yves/xxx
> Your branch is based on 'origin/yves/xxx', but the upstream is gone.
>   (use "git branch --unset-upstream" to fixup)
> 
> nothing to commit, working tree clean
> $ git branch --unset-upstream
> fatal: could not unset 'branch.yves/simple_projection.merge'
> 
> At this point my .git/config file was empty, and all of my config was lost.

Here's the fix, along with some related cleanups. The actual bugfix is
in the first patch, which I think should go to "maint". The rest are not
as important, so could go to master (but would also be fine for maint,
as there should be no user-visible changes).

  [1/7]: config: avoid "write_in_full(fd, buf, len) < len" pattern
  [2/7]: get-tar-commit-id: check write_in_full() return against 0
  [3/7]: avoid "write_in_full(fd, buf, len) != len" pattern
  [4/7]: convert less-trivial versions of "write_in_full() != len"
  [5/7]: pkt-line: check write_in_full() errors against "< 0"
  [6/7]: notes-merge: use ssize_t for write_in_full() return value
  [7/7]: config: flip return value of store_write_*()

 builtin/get-tar-commit-id.c |  3 +--
 builtin/receive-pack.c  |  2 +-
 builtin/rerere.c|  2 +-
 builtin/unpack-file.c   |  2 +-
 config.c| 38 +++---
 diff.c  |  2 +-
 entry.c |  5 +++--
 fast-import.c   |  2 +-
 http-backend.c  |  4 ++--
 ll-merge.c  |  2 +-
 notes-merge.c   |  2 +-
 pkt-line.c  | 29 ++---
 read-cache.c|  6 +++---
 refs.c  |  2 +-
 refs/files-backend.c| 10 +-
 rerere.c|  2 +-
 shallow.c   |  6 +++---
 streaming.c |  2 +-
 t/helper/test-delta.c   |  2 +-
 transport-helper.c  |  5 ++---
 wrapper.c   |  2 +-
 21 files changed, 64 insertions(+), 66 deletions(-)

-Peff


Re: RFC v3: Another proposed hash function transition plan

2017-09-13 Thread Jonathan Nieder
Hi Dscho,

Johannes Schindelin wrote:

> So even if the code to generate a bidirectional old <-> new hash mapping
> might be with us forever, it *definitely* should be optional ("optional"
> at least as in "config setting"), allowing developers who only work with
> new-hash repositories to save the time and electrons.

Agreed.  This is a good reason not to store the sha1 inside the
sha256-encoded objects.  I think that is exactly what Brandon was saying
in response to Junio --- did you read it differently?

[...]
> ... or Git would simply handle the absence of the generation number header
> gracefully, so that sha1-content == sha3-content...

Part of the sha1-content is references to other objects using their
sha1-name, so it is not possible to have sha1-content == sha3-content.

That said, I am also leaning against including generation numbers as
part of this design.

There is an argument for including generation numbers.  It is much
simpler to have generation numbers in *all* commit objects than only in
some, since it means the slop-based heuristics for faking generation
numbers using commit timestamp can be completely avoided for a
repository using such a format.  Including generation numbers in all
commit objects is a painless thing to do during a format change, since
it can happen without harming round-tripping.

Treating generation numbers as derived data (as in Jeff King's
preferred design, if I have understood his replies correctly) would
also be possible but it does not interact well with shallow clone or
narrow clone.

All that said, for simplicity I still lean against including
generation numbers as part of a hash function transition.  Nothing
stops us from having another format change later.

This is a particularly hard decision because I don't have a strong
preference.  That leads me to err on the side of simplicity.

I will make sure to discuss this issue in my patch to
Documentation/technical/, so we don't have to repeat the same
conversations again and again.

[...]
> Taking a step back, though, it may be a good idea to leave the generation
> number business for later, as much fun as it is to get side tracked and
> focus on relatively trivial stuff instead of the far more difficult and
> complex task to get the transition plan to a new hash ironed out.
>
> For example, I am still in favor of SHA-256 over SHA3-256, after learning
> some background details from in-house cryptographers: it provides
> essentially the same level of security, according to my sources, while
> hardware support seems to be coming to SHA-256 a lot sooner than to
> SHA3-256.
>
> Which hash algorithm to choose is a tough question to answer, and
> discussing generation numbers will sadly not help us answer it any quicker.

This is unrelated to Brandon's message, except for his use of SHA3 as
a placeholder for "the next hash function".

My assumption based on previous conversations (and other external
conversations like [1]) is that we are going to use SHA2-256 and have
a pretty strong consensus for that.  Don't worry!

As a side note, I am probably misreading, but I found this set of
paragraphs a bit condescending.  It sounds to me like you are saying
"You are making the wrong choice of hash function and everything else
you are describing is irrelevant when compared to that monumental
mistake.  Please stop working on things I don't consider important".
With that reading it is quite demotivating to read.

An alternative reading is that you are saying that the transition plan
described in this thread is not ironed out.  Can you spell that out
more?  What particular aspect of the transition plan (which is of
course orthogonal to the choice of hash function) are you discontent
with?

Thanks and hope that helps,
Jonathan

[1] https://www.imperialviolet.org/2017/05/31/skipsha3.html


Re: Bug: git branch --unset-upstream command can nuke config when disk is full.

2017-09-13 Thread demerphq
On 13 September 2017 at 17:22, Jeff King  wrote:
> On Wed, Sep 13, 2017 at 05:18:56PM +0200, demerphq wrote:
>
>> > Hmph. That is very disturbing. But with that information I should be
>> > able to track down the culprit. Thanks for digging.
>>
>> FWIW, I see that git_config_set_multivar_in_file_gently() uses
>> write_in_full() which in turn uses xwrite(), but the latter has the
>> following comment on it:
>>
>> /*
>>  * xwrite() is the same a write(), but it automatically restarts write()
>>  * operations with a recoverable error (EAGAIN and EINTR). xwrite() DOES NOT
>>  * GUARANTEE that "len" bytes is written even if the operation is successful.
>>  */
>>
>> I suspect that at this point I am not adding much value here, so I
>> will leave it at this.
>
> No, the problem is in this line:
>
>  if (write_in_full(fd, contents + copy_begin,
>copy_end - copy_begin) <
>  copy_end - copy_begin)
>   goto write_err_out;
>
> write_in_full() returns -1 on error (_not_ how many bytes were actually
> written). So its return is a signed ssize_t. But the result of the
> pointer subtraction "copy_end - copy_begin" is an unsigned ptrdiff_t.
> The compiler promotes the signed to an unsigned, so the condition can
> never be true (the "-1" becomes the highest possible value).

Bah. Good eye. I missed that entirely.

> I have the fix, but I'm searching the code base for other instances of
> the same error.

Yeah, I think there are few just in that file.

Thanks for fixing this!

cheers,
Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"


Re: Bug: git branch --unset-upstream command can nuke config when disk is full.

2017-09-13 Thread Jeff King
On Wed, Sep 13, 2017 at 05:18:56PM +0200, demerphq wrote:

> > Hmph. That is very disturbing. But with that information I should be
> > able to track down the culprit. Thanks for digging.
> 
> FWIW, I see that git_config_set_multivar_in_file_gently() uses
> write_in_full() which in turn uses xwrite(), but the latter has the
> following comment on it:
> 
> /*
>  * xwrite() is the same a write(), but it automatically restarts write()
>  * operations with a recoverable error (EAGAIN and EINTR). xwrite() DOES NOT
>  * GUARANTEE that "len" bytes is written even if the operation is successful.
>  */
> 
> I suspect that at this point I am not adding much value here, so I
> will leave it at this.

No, the problem is in this line:

 if (write_in_full(fd, contents + copy_begin,
   copy_end - copy_begin) <
 copy_end - copy_begin)
  goto write_err_out;

write_in_full() returns -1 on error (_not_ how many bytes were actually
written). So its return is a signed ssize_t. But the result of the
pointer subtraction "copy_end - copy_begin" is an unsigned ptrdiff_t.
The compiler promotes the signed to an unsigned, so the condition can
never be true (the "-1" becomes the highest possible value).

I have the fix, but I'm searching the code base for other instances of
the same error.

-Peff


Re: Bug: git branch --unset-upstream command can nuke config when disk is full.

2017-09-13 Thread demerphq
On 13 September 2017 at 16:51, Jeff King  wrote:
> On Wed, Sep 13, 2017 at 04:49:45PM +0200, demerphq wrote:
>
>> On 13 September 2017 at 16:17, Jeff King  wrote:
>> > You're welcome to read over the function to double-check, but I just
>> > looked it over and couldn't find any unchecked writes.
>>
>> I can look, but I doubt I would notice something you did not.
>>
>> On the other hand the strace output does show that this is a case
>> where the writes failed, but we still renamed the empty config.lock
>> file into place:
>>
>>
>> write(3, "[core]\n\tsharedRepository = true\n"..., 288) = -1 ENOSPC
>> (No space left on device)
>> write(3, "merge = refs/heads/yves/"..., 51) = -1 ENOSPC (No
>> space left on device)
>> munmap(0x7f48d9b8c000, 363) = 0
>> close(3)= 0
>> rename("/usr/local/git_tree/main/.git/config.lock",
>> "/usr/local/git_tree/main/.git/config") = 0
>
> Hmph. That is very disturbing. But with that information I should be
> able to track down the culprit. Thanks for digging.

FWIW, I see that git_config_set_multivar_in_file_gently() uses
write_in_full() which in turn uses xwrite(), but the latter has the
following comment on it:

/*
 * xwrite() is the same a write(), but it automatically restarts write()
 * operations with a recoverable error (EAGAIN and EINTR). xwrite() DOES NOT
 * GUARANTEE that "len" bytes is written even if the operation is successful.
 */

I suspect that at this point I am not adding much value here, so I
will leave it at this.

>> I freed up space and things worked, so I somehow doubt the filesystem
>> is at fault. When I then filled up the disk and retried the error was
>> repeatable.
>
> Yeah, agreed. This really does look like a bug.

FWIW, where it bit me turned out to be harmless. So while no doubt
this could be a real PITA for someone it wasn't for me.

Yves


-- 
perl -Mre=debug -e "/just|another|perl|hacker/"


RE: merge-base not working as expected when base is ahead

2017-09-13 Thread Ekelhart Jakob
Dear Git,

git merge-base --fork-point "master" not working if master is already newer 
then my current branch.
Very oddly it seems to work whenever you had the expected commit checked out 
previously - what made it very tricky to detect this problem.

Example:
 - Clone "https://github.com/jekelhart/GitInfoTry;
 - Switch to branch "v1.0.0"
 - git merge-base --fork-point "master"
   - or: git merge-base --fork-point "origin/master" 
 - expected result: fork point "fbb1db34c6317a6e8b319c1ec261e97ca1672c22"
 - but result is empty

In the repo where we created this example tree in the first place the command 
returned the expected fork point. If you clone it new and fresh it does not 
return any result anymore.

Works, however, on branch "v2.0.0". Assumption: because "master" is older.(?)
I think it works locally because the command uses the reflog in addition(!), 
however, it should work without the local reflog as well. (since the history 
was not modified in any way)

BR, Jakob


Re: [PATCH 4/4] archive: queue directories for all types of pathspecs

2017-09-13 Thread René Scharfe
Am 13.09.2017 um 14:53 schrieb Jeff King:
> On Wed, Sep 13, 2017 at 12:43:57AM +0200, René Scharfe wrote:
>> Yet another way is have a few levels of nested subdirectories (e.g.
>> d1/d2/d3/file1) and ignoring the entries at the leaved (e.g. file1).
> 
> s/leaved/leaves/ ?

Indeed.

René


Re: Bug: git branch --unset-upstream command can nuke config when disk is full.

2017-09-13 Thread Jeff King
On Wed, Sep 13, 2017 at 04:49:45PM +0200, demerphq wrote:

> On 13 September 2017 at 16:17, Jeff King  wrote:
> > You're welcome to read over the function to double-check, but I just
> > looked it over and couldn't find any unchecked writes.
> 
> I can look, but I doubt I would notice something you did not.
> 
> On the other hand the strace output does show that this is a case
> where the writes failed, but we still renamed the empty config.lock
> file into place:
> 
> 
> write(3, "[core]\n\tsharedRepository = true\n"..., 288) = -1 ENOSPC
> (No space left on device)
> write(3, "merge = refs/heads/yves/"..., 51) = -1 ENOSPC (No
> space left on device)
> munmap(0x7f48d9b8c000, 363) = 0
> close(3)= 0
> rename("/usr/local/git_tree/main/.git/config.lock",
> "/usr/local/git_tree/main/.git/config") = 0

Hmph. That is very disturbing. But with that information I should be
able to track down the culprit. Thanks for digging.

> I freed up space and things worked, so I somehow doubt the filesystem
> is at fault. When I then filled up the disk and retried the error was
> repeatable.

Yeah, agreed. This really does look like a bug.

-Peff


Re: Bug: git branch --unset-upstream command can nuke config when disk is full.

2017-09-13 Thread demerphq
On 13 September 2017 at 16:17, Jeff King  wrote:
> You're welcome to read over the function to double-check, but I just
> looked it over and couldn't find any unchecked writes.

I can look, but I doubt I would notice something you did not.

On the other hand the strace output does show that this is a case
where the writes failed, but we still renamed the empty config.lock
file into place:


write(3, "[core]\n\tsharedRepository = true\n"..., 288) = -1 ENOSPC
(No space left on device)
write(3, "merge = refs/heads/yves/"..., 51) = -1 ENOSPC (No
space left on device)
munmap(0x7f48d9b8c000, 363) = 0
close(3)= 0
rename("/usr/local/git_tree/main/.git/config.lock",
"/usr/local/git_tree/main/.git/config") = 0

Full strace is below:

>> > Given that your output is consistent with it failing to find the key,
>> > and that the result is an empty file, it sounds like somehow the mmap'd
>> > input appeared empty (but neither open nor fstat nor mmap returned an
>> > error). You're not on any kind of exotic filesystem, are you?
>>
>> I don't think so, but I don't know. Is there a command I can run to check?

I freed up space and things worked, so I somehow doubt the filesystem
is at fault. When I then filled up the disk and retried the error was
repeatable.

>> BTW, with a bit of faffing I can probably recreate this problem.
>> Should I try? Is there something I could do during recreation that
>> would help?
>
> If you think you can reproduce, the output of "strace" on a failing
> invocation would be very interesting.

I can reproduce, see below. Preceded and suffixed by ls on the .git/config file.

I have munged the branch name for privacy reasons, hope that doesn't
invalidate the strace utility.

cheers,
yves

$ ls -la .git/config
-rw-rw-r-- 1 root users 363 Sep 13 16:36 .git/config
$ strace git branch --unset-upstream
execve("/usr/bin/git", ["git", "branch", "--unset-upstream"], [/* 39
vars */]) = 0
brk(0)  = 0x222c000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x7f48d9b94000
access("/etc/ld.so.preload", R_OK)  = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)  = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=50300, ...}) = 0
mmap(NULL, 50300, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f48d9b87000
close(3)= 0
open("/lib64/libpcre.so.0", O_RDONLY)   = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@\25\0\0\0\0\0\0"...,
832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=183080, ...}) = 0
mmap(NULL, 2278264, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
0) = 0x7f48d9748000
mprotect(0x7f48d9774000, 2097152, PROT_NONE) = 0
mmap(0x7f48d9974000, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2c000) = 0x7f48d9974000
close(3)= 0
open("/lib64/libz.so.1", O_RDONLY)  = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0
!\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=88600, ...}) = 0
mmap(NULL, 2183696, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
0) = 0x7f48d9532000
mprotect(0x7f48d9547000, 2093056, PROT_NONE) = 0
mmap(0x7f48d9746000, 8192, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x14000) = 0x7f48d9746000
close(3)= 0
open("/lib64/libpthread.so.0", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\^\0\0\0\0\0\0"...,
832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=143280, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x7f48d9b86000
mmap(NULL, 2212848, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
0) = 0x7f48d9315000
mprotect(0x7f48d932c000, 2097152, PROT_NONE) = 0
mmap(0x7f48d952c000, 8192, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17000) = 0x7f48d952c000
mmap(0x7f48d952e000, 13296, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f48d952e000
close(3)= 0
open("/lib64/librt.so.1", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240!\0\0\0\0\0\0"...,
832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=44472, ...}) = 0
mmap(NULL, 2128816, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
0) = 0x7f48d910d000
mprotect(0x7f48d9114000, 2093056, PROT_NONE) = 0
mmap(0x7f48d9313000, 8192, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6000) = 0x7f48d9313000
close(3)= 0
open("/lib64/libc.so.6", O_RDONLY)  = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\\356\1\0\0\0\0\0"...,
832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1924768, ...}) = 0
mmap(NULL, 3750184, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
0) = 0x7f48d8d79000
mprotect(0x7f48d8f03000, 2097152, PROT_NONE) = 0
mmap(0x7f48d9103000, 24576, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 

Re: Bug: git branch --unset-upstream command can nuke config when disk is full.

2017-09-13 Thread Jeff King
On Wed, Sep 13, 2017 at 03:38:52PM +0200, demerphq wrote:

> I just double checked the terminal history and this is all i saw:
> 
> $ git status
> On branch yves/xxx
> Your branch is based on 'origin/yves/xxx', but the upstream is gone.
>   (use "git branch --unset-upstream" to fixup)
> 
> nothing to commit, working tree clean
> $ git branch --unset-upstream
> fatal: could not unset 'branch.yves/xxx.merge'
> $ git status
> On branch yves/xxx
> nothing to commit, working tree clean
> $ git fetch
> fatal: No remote repository specified.  Please, specify either a URL or a
> remote name from which new revisions should be fetched.

As a side note, I'm surprised that commands work at all when your
.git/config is empty. I'd expect check_respository_format() to complain
that you are not in a repository.  Looks like it is due to this block:

  452/*
  453 * For historical use of check_repository_format() in git-init,
  454 * we treat a missing config as a silent "ok", even when 
nongit_ok
  455 * is unset.
  456 */
  457   if (candidate.version < 0)
  458   return 0;

> > No, it writes the new content to "config.lock" and then renames it into
> > place.
> > All of the write() calls to the temporary file are checked.
> 
> I was going to say that perhaps the write was not checked... But if
> you are confident they are then...

You're welcome to read over the function to double-check, but I just
looked it over and couldn't find any unchecked writes.

> > Given that your output is consistent with it failing to find the key,
> > and that the result is an empty file, it sounds like somehow the mmap'd
> > input appeared empty (but neither open nor fstat nor mmap returned an
> > error). You're not on any kind of exotic filesystem, are you?
> 
> I don't think so, but I don't know. Is there a command I can run to check?
> 
> BTW, with a bit of faffing I can probably recreate this problem.
> Should I try? Is there something I could do during recreation that
> would help?

If you think you can reproduce, the output of "strace" on a failing
invocation would be very interesting.

-Peff


Re: [PATCH 4/4] packed refs: pass .git dir instead of packed-refs path to init_fn

2017-09-13 Thread Richard Maw
On Wed, Sep 13, 2017 at 10:45:45AM +0200, Michael Haggerty wrote:
> First of all there is a superficial naming inconsistency. This method is
> called `init` in the vtable, but the functions implementing it are
> called `*_ref_store_create()`. It actually initializes the data
> structures for dealing with the reference store, as opposed to
> initializing the reference store on disk (*that* virtual function is
> called `init_db`). It should maybe be called `open` instead.
> 
> But more fundamentally, what is a `ref_store`? Originally it always
> represented a *logical* place to store all of the references for a
> repository. In that sense, it made sense to have an "open" method that
> takes a `gitdir` as argument.
> 
> But its role is currently getting stretched. Now it sometimes represents
> a *physical* place to store references, which might be combined with
> other physical `ref_store`s to make a logical `ref_store`. There is not
> necessarily a 1:1 correspondence between gitdirs and physical
> `ref_store`s; more to the point you might want to initialize a physical
> refstore that doesn't correspond to `$GITDIR`. So requiring every
> `ref_store` to have a factory function with a gitdir argument is
> probably incorrect.
>
> For example, a subtree has a single logical reference store, but it is
> composed out of three physical reference stores: the loose references in
> the subtree's gitdir plus loose and packed references in the common gitdir.

One way to implement a namespacing backend would include
a loose references backend of the subtree of refs for the namespace
overlayed with a namespace translating backend backed onto packed references,
so there's benefits to not being strict about it being a gitdir.

> It seems to me that we need two separate concepts:
> 
> 1. Create an object representing a gitdir's *logical* `ref_store`. This
> "class method" could always have a single gitdir as parameter. This
> would be the method by which the repository initialization code
> instantiates its logical `ref_store`, for example by reading the type of
> the reference store from the repo's config, then looking up the class by
> its type, then calling its "open" method to create an instance.
> 
> 2. Create an object representing a physical `ref_store`. Different
> reference store classes might have different "constructor" arguments.
> For example, if it represents a namespace within a larger reference tree
> contained in a shared repository, its arguments might be
> `(shared_gitdir, namespace)`. (CC to Richard Maw for this angle.)

The distinction confused me while attempting to add the namespaced ref backend,
since I was unfamiliar with how the ref stores were meant to be defined.
In the end I just omitted the init function and create them separately.

I'm not sure drawing the line at "logical ref stores init" and
"physical ref stores are constructed differently" is the right division,
since the namespaced ref store is more of a local ref store
and its current interface is an existing ref store and a namespace.

Perhaps a better distinction would be frontend vs implementation ref stores
where frontend ones are registered to be found with find_ref_storage_backend,
while implementation ref stores don't use that
and hence don't need the next, name, init or init_db.

Then the physical vs logical distinction is whether they access refs directly
or only via another backend.

It wouldn't be implausible for the namespaced ref store backend
to be a frontend logical ref store that constructs the files ref store itself;
I just didn't have enough of a feel of how things went together to spot that.


Re: RFC v3: Another proposed hash function transition plan

2017-09-13 Thread demerphq
On 13 September 2017 at 14:05, Johannes Schindelin
 wrote:
> For example, I am still in favor of SHA-256 over SHA3-256, after learning
> some background details from in-house cryptographers: it provides
> essentially the same level of security, according to my sources, while
> hardware support seems to be coming to SHA-256 a lot sooner than to
> SHA3-256.

FWIW, and I know it is not worth much, as far as I can tell there is
at least some security/math basis to prefer SHA3-256 to SHA-256.

The SHA1 and SHA-256 hash functions, (iirc along with their older
cousins MD5 and MD2) all have a common design feature where they mix a
relatively large block size into a much smaller state *each block*. So
for instance SHA-256 mixes a 512 bit block into a 256 bit state with a
2:1 "leverage" between the block being read and the state. In SHA1
this was worse, mixing a 512 bit block into a 160 bit state, closer to
3:1 leverage.

SHA3 however uses a completely different design where it mixes a 1088
bit block into a 1600 bit state, for a leverage of 2:3, and the excess
is *preserved between each block*.

Assuming everything else is equal between SHA-256 and SHA3 this
difference alone would seem to justify choosing SHA3 over SHA-256. We
know that there MUST be collisions when compressing a 512 bit block
into a 256 bit space, however one cannot say the same about mixing
1088 bits into a 1600 bit state. The excess state which is not
directly modified by the input block makes a big difference when
reading the next block.

Of course in both cases we end up compressing the entire source
document down to the same number of bits, however SHA3 does that
*once*, in finalization only, whereas SHA-256 does it *every* block
read. So it seems to me that the opportunity for collisions is *much*
higher in SHA-256 than it is in SHA3-256. (Even if they should be
vanishingly rare regardless.)

For this reason if I had a vote I would definitely vote SHA3-256, or
even for SHA3-512. The latter has an impressive 1:2 leverage between
block and state, and much better theoretical security levels.

cheers,
Yves
Note: I am not a cryptographer, although I am probably pretty well
informed as far hobby-hash-function-enthusiasts go.
-- 
perl -Mre=debug -e "/just|another|perl|hacker/"


Re: Bug: git branch --unset-upstream command can nuke config when disk is full.

2017-09-13 Thread demerphq
On 13 September 2017 at 14:34, Jeff King  wrote:
> On Wed, Sep 13, 2017 at 01:59:17PM +0200, demerphq wrote:
>
>> After being away for a while I saw the following message in one of my git 
>> repos:
>>
>> $ git status
>> On branch yves/xxx
>> Your branch is based on 'origin/yves/xxx', but the upstream is gone.
>>   (use "git branch --unset-upstream" to fixup)
>>
>> nothing to commit, working tree clean
>> $ git branch --unset-upstream
>> fatal: could not unset 'branch.yves/simple_projection.merge'
>
> Hrm. I wonder what caused this failure. The error would be in
> git_config_set_multivar_in_file_gently(). Most errors there produce
> another error message before hitting the die(). In fact, the only case I
> see where it would not produce another message is if it found nothing to
> unset (but in that case, "branch" would never have called the function
> in the first place).

I just double checked the terminal history and this is all i saw:

$ git status
On branch yves/xxx
Your branch is based on 'origin/yves/xxx', but the upstream is gone.
  (use "git branch --unset-upstream" to fixup)

nothing to commit, working tree clean
$ git branch --unset-upstream
fatal: could not unset 'branch.yves/xxx.merge'
$ git status
On branch yves/xxx
nothing to commit, working tree clean
$ git fetch
fatal: No remote repository specified.  Please, specify either a URL or a
remote name from which new revisions should be fetched.

>> At this point my .git/config file was empty, and all of my config was lost.
>>
>> I assume that things that rewrite .git/config do not check for a
>> successful write before deleting the old version of the file.
>
> No, it writes the new content to "config.lock" and then renames it into
> place.
> All of the write() calls to the temporary file are checked.

I was going to say that perhaps the write was not checked... But if
you are confident they are then...

>The
> old data is copied over after having been ready by mmap (which is also
> error-checked).
>
> Given that your output is consistent with it failing to find the key,
> and that the result is an empty file, it sounds like somehow the mmap'd
> input appeared empty (but neither open nor fstat nor mmap returned an
> error). You're not on any kind of exotic filesystem, are you?

I don't think so, but I don't know. Is there a command I can run to check?

BTW, with a bit of faffing I can probably recreate this problem.
Should I try? Is there something I could do during recreation that
would help?

cheers,
Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"


Re: Unexpected pass for t6120-describe.sh on cygwin

2017-09-13 Thread Johannes Schindelin
Hi Michael,

On Wed, 13 Sep 2017, Michael J Gruber wrote:

> Could you please try and report on the following (cygwin, MinGW):
> 
> ulimit -s
> ulimit -s 4096 # anything lower than the value from above
> ulimit -s
> bash -c "ulimit -s"

Git Bash (MINGW, well, not precisely [*1*]):

me@work MINGW64 ~
$ ulimit -s
2032

me@work MINGW64 ~
$ ulimit -s 4096 # anything lower than the value from above

me@work MINGW64 ~
$ ulimit -s
4096

me@work MINGW64 ~
$ bash -c "ulimit -s"
2032

Judging by your comment, 4096 should be replaced. So here goes again:

me@work MINGW64 ~
$ ulimit -s 1024

me@work MINGW64 ~
$ ulimit -s
1024

me@work MINGW64 ~
$ bash -c "ulimit -s"
2032

And here is the same output of my 64-bit Cygwin installation (just updated
to the current [*2*] one):

me@work  ~
$ ulimit -s
2032

me@work  ~
$ ulimit -s 1024

me@work  ~
$ ulimit -s
1024

me@work  ~
$ bash -c "ulimit -s"
2032

Ciao,
Dscho

Footnote *1*: I know it is confusing for Linux folks, there are two very
different classes of executables in Git for Windows: MSYS2 ones and MINGW
ones. The former implicitly link against the MSYS2 runtime, and therefore
can make use of its POSIX emulation layer, the latter do not, and
therefore they can use "only" what the Win32 API provides. For details,
see
https://github.com/git-for-windows/git/wiki/The-difference-between-MINGW-and-MSYS2


[PATCH v2] commit-template: change a message to be more intuitive

2017-09-13 Thread Kaartic Sivaraam
It's not good to use the phrase 'do not touch' to convey the information
that the cut-line should not be modified or removed as it could possibly
be mis-interpreted by a person who doesn't know that the word 'touch' has
the meaning of 'tamper with'. Further, it could make translations a little
difficult as it might not have the intended meaning in a few languages when
translated as such.

So, use more intuitive terms in the sentence.

Signed-off-by: Kaartic Sivaraam 
---
 wt-status.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/wt-status.c b/wt-status.c
index 77c27c51134d2..be53579760ee7 100644
--- a/wt-status.c
+++ b/wt-status.c
@@ -934,7 +934,7 @@ size_t wt_status_locate_end(const char *s, size_t len)
 
 void wt_status_add_cut_line(FILE *fp)
 {
-   const char *explanation = _("Do not touch the line above.\nEverything 
below will be removed.");
+   const char *explanation = _("Do not modify or remove the line 
above.\nEverything below will be removed.");
struct strbuf buf = STRBUF_INIT;
 
fprintf(fp, "%c %s", comment_line_char, cut_line);

--
https://github.com/git/git/pull/401


[PATCH v2] commit-template: change a message to be more intuitive

2017-09-13 Thread Kaartic Sivaraam

Changes in v2:

   - Changed the message as suggested by Jeff

   - Made the commit message to be even more clear


---

Kaartic



Re: [PATCH] format-patch: use raw format for notes

2017-09-13 Thread Jeff King
On Mon, Sep 11, 2017 at 02:27:38PM +1000, Sam Bobroff wrote:

> >  - It is very much intended to allow The "(foo)" after the "Notes"
> >label to show which notes ref the note comes from, because there
> >can be more than one notes refs that annotate the same commit.
> 
> Right, that makes perfect sense to me when it's being output locally.
> 
> But the ref names are local to my git repo and there is no reaason why
> they should be meaningful or even known to the recipients of the patch
> email.

I can see how your notes names might not be of interest to others. But I
can also see how they _could_ be. For instance, if you kept test result
annotations in a notes ref, you would want to mark them as such.

The idea of the current output is that you'd put general text into
"refs/notes/commits" (which shows up only as "Notes:"). And if you are
putting notes in another ref, you have some reason to do so, which
implies that it's worth showing that they're not in the default ref.

I grant that there are reasons to do so which might not be worth showing
(e.g., you might be pushing and fetching refs, and keep some hierarchy).
But I don't think "are we emailing them" is a robust determiner of "are
they worth showing". So this probably needs to be a separate option,
rather than tied to the output format.

Or possibly there should be a naming convention (e.g., that everything
that ends in "/commits" doesn't have its name shown, which would allow
multiple hierarchies). It's hard to say without knowing the reason you
chose a non-default refname in the first place.

-Peff


Re: [PATCH 06/34] clone: release strbuf after use in remove_junk()

2017-09-13 Thread Jeff King
On Mon, Sep 11, 2017 at 11:40:05PM +0200, René Scharfe wrote:

> > Yes, but I didn't want to touch each code site that creates a file,
> > which is a lot more invasive. I expect expanding to 4096 (or PATH_MAX)
> > would be sufficient in practice.
> 
> Perhaps it is in most cases.  Having certainty would be better.  We can
> leave unpack_trees() untouched and instead traverse the tree beforehand,
> in order to find the longest path.  Perhaps we can do something similar
> for init_db().

I wonder if it would be possible to just wrap open() in a transparent
way.

> > I'd also note that the current code isn't _remotely_ async safe and
> > nobody's complained. So I'm perfectly happy doing nothing, too. I care
> > a bit more about the tempfile.c interface because it's invoked a lot
> > more frequently.
> 
> I guess clones are not interrupted very often during checkout; same
> with worktree creation.  So perhaps nasty things could happen with a
> low probability, but nobody tried hard enough to hit that case?

Right, that's my guess. And "nasty" is quite likely to be "deadlocks on
a malloc or stdio lock". Which is not the end of the world.

> > No, I think that's how you'd have to do it. The implementation isn't all
> > that bad, but hitting every file-creation would be invasive. I'd
> > almost rather bail to exec-ing "rm -rf $dir", but we probably can't do
> > _that_ in a signal-handler either. :)
> 
> Well, fork(2), execve(2), and waitpid(2) are on the list, so actually
> you can.
> 
> mingw_spawnvpe(), which is used by start_command() on Windows,
> allocates memory, though.  Preparing environment and argument list
> in advance and just calling CreateProcess() in the signal handler
> might work.

Yeah, I imagine it's do-able with enough advance effort.

Given the lack of reports and the rapidly expanding complexity, I'm not
planning on looking further into this for now.

-Peff


Re: [PATCH 4/4] archive: queue directories for all types of pathspecs

2017-09-13 Thread Jeff King
On Wed, Sep 13, 2017 at 12:43:57AM +0200, René Scharfe wrote:

> -- >8 --
> Subject: [PATCH] archive: don't add empty directories to archives
> 
> While git doesn't track empty directories, git archive can be tricked
> into putting some into archives.  One way is to construct an empty tree
> object, as t5004 does.  While that is supported by the object database,
> it can't be represented in the index and thus it's unlikely to occur in
> the wild.
> 
> Another way is using the literal name of a directory in an exclude
> pathspec -- its contents are are excluded, but the directory stub is
> included.  That's inconsistent: exclude pathspecs containing wildcards
> don't leave empty directories in the archive.
> 
> Yet another way is have a few levels of nested subdirectories (e.g.
> d1/d2/d3/file1) and ignoring the entries at the leaved (e.g. file1).

s/leaved/leaves/ ?

> The directories with the ignored content are ignored as well (e.g. d3),
> but their empty parents are included (e.g. d2).
> 
> As empty directories are not supported by git, they should also not be
> written into archives.  If an empty directory is really needed then it
> can be tracked and archived by placing an empty .gitignore file in it.
> 
> There already is a mechanism in place for suppressing empty directories.
> When read_tree_recursive() encounters a directory excluded by a pathspec
> then it enters it anyway because it might contain included entries.  It
> calls the callback function before it is able to decide if the directory
> is actually needed.  For that reason git archive adds directories to a
> queue and writes entries for them only when it encounters the first
> child item -- but currently only if pathspecs with wildcards are used.
> 
> Queue *all* directories, no matter if there even are pathspecs present.
> This prevents git archive from writing entries for empty directories in
> all cases.

Nicely explained, and this seems like the right level to be handling it.
Simple, and it will catch the cases we know about _and_ and any new ones
which pop up.

> ---
>  archive.c   | 19 ++-
>  t/t5001-archive-attr.sh |  2 +-
>  t/t5002-archive-attr-pattern.sh |  2 +-
>  t/t5004-archive-corner-cases.sh |  4 ++--
>  4 files changed, 6 insertions(+), 21 deletions(-)

I'm not too familiar with this part of the archive code, but it seemed
pretty easy to follow. The patch looks good to me.

-Peff


Re: Bug: git branch --unset-upstream command can nuke config when disk is full.

2017-09-13 Thread Jeff King
On Wed, Sep 13, 2017 at 01:59:17PM +0200, demerphq wrote:

> After being away for a while I saw the following message in one of my git 
> repos:
> 
> $ git status
> On branch yves/xxx
> Your branch is based on 'origin/yves/xxx', but the upstream is gone.
>   (use "git branch --unset-upstream" to fixup)
> 
> nothing to commit, working tree clean
> $ git branch --unset-upstream
> fatal: could not unset 'branch.yves/simple_projection.merge'

Hrm. I wonder what caused this failure. The error would be in
git_config_set_multivar_in_file_gently(). Most errors there produce
another error message before hitting the die(). In fact, the only case I
see where it would not produce another message is if it found nothing to
unset (but in that case, "branch" would never have called the function
in the first place).

> At this point my .git/config file was empty, and all of my config was lost.
> 
> I assume that things that rewrite .git/config do not check for a
> successful write before deleting the old version of the file.

No, it writes the new content to "config.lock" and then renames it into
place. All of the write() calls to the temporary file are checked. The
old data is copied over after having been ready by mmap (which is also
error-checked).

Given that your output is consistent with it failing to find the key,
and that the result is an empty file, it sounds like somehow the mmap'd
input appeared empty (but neither open nor fstat nor mmap returned an
error). You're not on any kind of exotic filesystem, are you?

-Peff


Re: [PATCH] commit-template: change a message to be more intuitive

2017-09-13 Thread Kaartic Sivaraam

On Wednesday 13 September 2017 04:50 PM, Jeff King wrote:

I agree with both of your points. It is very clear to me as a native
speaker, but I can see how it might not be for everyone.

Interestingly, the change here:


-   const char *explanation = _("Do not touch the line above.\nEverything below 
will be removed.");
+   const char *explanation = _("Do not edit the line above.\nEverything below 
will be removed.");

actually seems less clear to me. I think of "edit" as "modify". But
obviously it also should not be removed. Perhaps

   Do not modify or remove the line above.

would be the most clear. Or perhaps it is overly verbose.

Seems to be a valid point. Further, it doesn't seem to be too verbose at 
least to me. So, I'll change the

sentence and revert back to the one I use if someone finds it to be verbose.

---
Kaartic


Re: [PATCH] commit-template: change a message to be more intuitive

2017-09-13 Thread Kaartic Sivaraam

On Wednesday 13 September 2017 03:59 PM, Kevin Daudt wrote:

Touching something can also mean to disturb or change something, which
is the meaning being used here, so it is not an incorrect use of the
word.


I do get your point but I should have been clearer in my commit message 
about the

fact that it's a word that could be misinterpreted.


So, make the sentence more intuitive.

I can understand however that it might be not so clear for people with
less fluency in English.


Not just that, it could also make the translations a little unclear. 
i.e., the when the words that convey
clear meaning in a particular context in english speaker get translated 
"as such" to another language,
they might turn out to become unclear or might even convey a totally 
different meaning. I'm not sure
if is currently happening as I am not familiar with any languages for 
which translations are currently
available but I'm pretty sure that the sentence "Do not touch the line 
above." would not give a clear
meaning when it gets translated "as such" to my native language (Tamil). 
Of course, it's only a problem
if the sentences get translated "as such" but I'm not sure if 
translators (human/non-human) try do
meaningful translations or try to preserve a one-to-one relationship 
with the words in the source

language, either.

---
Kaartic


  1   2   >