git log --shortstat doesn't respect -c (but --stat does)

2018-07-10 Thread David Turner
This seems inconsistent:

$ git log --oneline --stat 91ccfb85176 -c
91ccfb8517 Merge branch 'sb/diff-color-move'

 diff.c | 28 +++-
 t/t4015-diff-whitespace.sh |  9 +
 2 files changed, 24 insertions(+), 13 deletions(-)
d1114d87c7 Merge branch 'js/rebase-i-final'

 sequencer.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
...

$ git log --oneline --shortstat 91ccfb85176 -c
91ccfb8517 Merge branch 'sb/diff-color-move'

d1114d87c7 Merge branch 'js/rebase-i-final'

4339c9f2df Merge branch 'jc/doc-checkout'
...

Especially since the manual for git log --shortstat says:

--shortstat
   Output only the last line of the --stat format containing total
   number of modified files, as well as number of added and deleted
   lines.
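
Until that's fixed, a rough workaround (sketch only; the awk patterns
just match git's oneline headers and the default stat summary wording)
is to use --stat, which does respect -c, and filter its output down to
what --shortstat would have printed:

$ git log --oneline --stat -c 91ccfb85176 |
  awk '/^[0-9a-f]+ / || /^ [0-9]+ files? changed/'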



git push requires a worktree?

2018-05-30 Thread David Turner
I am doing a funny thing where I do git -C .git/modules/morx push
fleem:fleem.  This is failing in the case where I have a sparse
checkout and the worktree directory "morx" (which is where
.git/modules/morx/config's core.worktree points) doesn't exist.  

I don't know why git push cares about the worktree -- it'll happily
work in a bare repo with no worktree at all, or if the worktree is an
unrelated git repo or whatever.

I can work around it, but if there's a bug, I think we should fix it.
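
For reference, one plausible workaround (a sketch using the made-up
names from above; --work-tree on the command line overrides
core.worktree, so the missing "morx" directory is never consulted):

$ git -C .git/modules/morx --work-tree=. push fleem:fleem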


Re: Implementing reftable in Git

2018-05-11 Thread David Turner
On Fri, 2018-05-11 at 11:31 +0200, Michael Haggerty wrote:
> On Wed, May 9, 2018 at 4:33 PM, Christian Couder
> <christian.cou...@gmail.com> wrote:
> > I might start working on implementing reftable in Git soon.
> > [...]
> 
> Nice. It'll be great to have a reftable implementation in git core
> (and ideally libgit2, as well). It seems to me that it could someday
> become the new default reference storage method. The file format is
> considerably more complicated than the current loose/packed scheme,
> which is definitely a disadvantage (for example, for other Git
> implementations). But implementing it *with good performance and
> without races* might be no more complicated than the current scheme.

I am somewhat concerned about perf, because as I recall, we have a
bunch of code which effectively loads all refs, and that will be more
expensive with reftable than with packed-refs (though maybe cheaper
than loose refs).  But maybe we have eliminated this code or can work
around it.

> Testing will be important. There are already many tests specifically
> about testing loose/packed reference storage. These will always have
> to run against repositories that are forced to use that reference
> scheme. And there will need to be new tests specifically about the
> reftable scheme. Both classes of tests should be run every time. That
> much is pretty obvious.
> 
> But currently, there are a lot of tests that assume the loose/packed
> reference format on disk even though the tests are not really related
> to references at all. ISTM that these should be converted to work at a
> higher level, for example using `for-each-ref`, `rev-parse`, etc. to
> examine references rather than reading reference files directly. That
> way the tests should run correctly regardless of which scheme is in
> use.

I agree with that, and I think some of my patches from years ago
attempted to do that.  I probably should have broken those out into a
separate series so that they could have been applied separately.
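
For example, a test that peeks at the on-disk format directly could be
rewritten along these lines (illustrative fragment, not from an actual
test):

# Instead of assuming the loose/packed on-disk layout ...
test "$(cat .git/refs/heads/side)" = "$oid"
# ... ask git itself, which works with any ref storage backend:
test "$(git rev-parse refs/heads/side)" = "$oid"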

> And since it's too expensive to run the whole test suite with both
> reference storage schemes, it seems to me that the reference storage
> scheme that is used while running the scheme-neutral tests should be
> easy to choose at runtime.

I ran the whole suite with both schemes during my testing, and I think
it was quite valuable in flushing out bugs.
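
Something as simple as an environment variable honored by the test
setup would probably suffice, e.g. (the variable name here is entirely
hypothetical):

# Run the scheme-neutral tests once per ref backend:
GIT_TEST_REF_BACKEND=files make test
GIT_TEST_REF_BACKEND=reftable make test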

> David Turner did some analogous work for wiring up and testing his
> proposed LMDB ref storage backend that might be useful [1]. I'm CCing
> him, since he might have thoughts on this topic.

Inline, above.


Re: [PATCH 2/5] refs.c: do not die if locking fails in `write_pseudoref()`

2018-05-07 Thread David Turner


On May 6, 2018 9:56:31 AM MDT, "Martin Ågren" <martin.ag...@gmail.com> wrote:
>On 6 May 2018 at 17:48, David Turner <nova...@novalis.org> wrote:
>> On Sun, 2018-05-06 at 16:10 +0200, Martin Ågren wrote:
>>> While at it, make the lock non-static.
>
>> Re making the lock static, I wonder about the following case:
>>
>>   if (read_ref(pseudoref, &actual_old_oid))
>>         die("could not read ref '%s'", pseudoref);
>>
>> I think this calls exit(), and then atexit tries to clean up the lock
>> files.  But since lock is no longer static, the stack may have been
>> destroyed (I don't actually know whether this is true, so maybe
>someone
>> else does).
>
>Right. After commit 076aa2cbda (tempfile: auto-allocate tempfiles on
>heap, 2017-09-05) this is safe though. Quite a few locks have already
>been moved to the stack, e.g., in 14bca6c63c (sequencer: make lockfiles
>non-static, 2018-02-27) and 02ae242fdd (checkout-index: simplify locking
>logic, 2017-10-05).  I could add a note to the commit message to make
>this clear, like "After 076aa2cbda, locks no longer need to be static."

I am going to reply now to keep the thread moving, but I am on my phone with 
bad connectivity (few cell towers in Bears Ears), so I can't really check the 
code. Feel free to disregard if I am still wrong.

I saw that patch, but I thought the new logic required that cleanup functions be 
called before the lock goes out of scope.

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH 3/5] refs.c: drop dead code checking lock status in `delete_pseudoref()`

2018-05-06 Thread David Turner
Same concern here about staticness.

On Sun, 2018-05-06 at 16:10 +0200, Martin Ågren wrote:
> After taking the lock we check whether we got it and die otherwise. But
> since we take the lock using `LOCK_DIE_ON_ERROR`, we would already have
> died.
> 
> Unlike in the previous patch, this function is not prepared for
> indicating errors via a `strbuf err`, so let's just drop the dead code.
> Any improved error-handling can be added later.
> 
> While at it, make the lock non-static and reduce its scope.
> 
> Signed-off-by: Martin Ågren 
> ---
>  refs.c | 7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/refs.c b/refs.c
> index 8c50b8b139..7abd30dfe1 100644
> --- a/refs.c
> +++ b/refs.c
> @@ -689,20 +689,17 @@ static int write_pseudoref(const char *pseudoref, const struct object_id *oid,
>  
>  static int delete_pseudoref(const char *pseudoref, const struct object_id *old_oid)
>  {
> - static struct lock_file lock;
>   const char *filename;
>  
>   filename = git_path("%s", pseudoref);
>  
>   if (old_oid && !is_null_oid(old_oid)) {
> - int fd;
> + struct lock_file lock = LOCK_INIT;
>   struct object_id actual_old_oid;
>  
> - fd = hold_lock_file_for_update_timeout(
> + hold_lock_file_for_update_timeout(
>   &lock, filename, LOCK_DIE_ON_ERROR,
>   get_files_ref_lock_timeout_ms());
> - if (fd < 0)
> - die_errno(_("Could not open '%s' for writing"), filename);
>   if (read_ref(pseudoref, &actual_old_oid))
>   die("could not read ref '%s'", pseudoref);
>   if (oidcmp(&actual_old_oid, old_oid)) {


Re: [PATCH 2/5] refs.c: do not die if locking fails in `write_pseudoref()`

2018-05-06 Thread David Turner
Re making the lock static, I wonder about the following case:
  
  if (read_ref(pseudoref, &actual_old_oid))
        die("could not read ref '%s'", pseudoref);

I think this calls exit(), and then atexit tries to clean up the lock
files.  But since lock is no longer static, the stack may have been
destroyed (I don't actually know whether this is true, so maybe someone
else does).

On Sun, 2018-05-06 at 16:10 +0200, Martin Ågren wrote:
> If we could not take the lock, we add an error to the `strbuf err` and
> return. However, this code is dead. The reason is that we take the lock
> using `LOCK_DIE_ON_ERROR`. Drop the flag to allow our more gentle
> error-handling to actually kick in.
> 
> We could instead just drop the dead code and die here. But everything is
> prepared for gently propagating the error, so let's do that instead.
> 
> There is similar dead code in `delete_pseudoref()`, but let's save that
> for the next patch.
> 
> While at it, make the lock non-static.
> 
> Signed-off-by: Martin Ågren 
> ---
>  refs.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/refs.c b/refs.c
> index 8b7a77fe5e..8c50b8b139 100644
> --- a/refs.c
> +++ b/refs.c
> @@ -644,7 +644,7 @@ static int write_pseudoref(const char *pseudoref, const struct object_id *oid,
>  {
>   const char *filename;
>   int fd;
> - static struct lock_file lock;
> + struct lock_file lock = LOCK_INIT;
>   struct strbuf buf = STRBUF_INIT;
>   int ret = -1;
>  
> @@ -654,8 +654,7 @@ static int write_pseudoref(const char *pseudoref, const struct object_id *oid,
>   strbuf_addf(&buf, "%s\n", oid_to_hex(oid));
>  
>   filename = git_path("%s", pseudoref);
> - fd = hold_lock_file_for_update_timeout(&lock, filename,
> -LOCK_DIE_ON_ERROR,
> + fd = hold_lock_file_for_update_timeout(&lock, filename, 0,
>  get_files_ref_lock_timeout_ms());
>   if (fd < 0) {
>   strbuf_addf(err, "could not open '%s' for writing: %s",


Re: [PATCH] refs: handle null-oid for pseudorefs

2018-05-06 Thread David Turner
LGTM.  

(This is the current best address to reach me, but do not expect fast
responses over the next few days as I'm out of town)



On Sun, 2018-05-06 at 15:35 +0200, Martin Ågren wrote:
> According to the documentation on `git update-ref`, it is possible to
> "specify 40 '0' or an empty string as <oldvalue> to make sure that the
> ref you are creating does not exist." But in the code for pseudorefs, we
> do not implement this. If we fail to read the old ref, we immediately
> die. A failure to read would actually be a good thing if we have been
> given the null-oid.
> 
> With the null-oid, allow -- and even require -- the ref-reading to fail.
> This implements the "make sure that the ref ... does not exist" part of
> the documentation.
> 
> Since we have a `strbuf err` for collecting errors, let's use it and
> signal an error to the caller instead of dying hard.
> 
> Reported-by: Rafael Ascensão 
> Helped-by: Rafael Ascensão 
> Signed-off-by: Martin Ågren 
> ---
> (David's twopensource-address bounced, so I'm trying instead the one he
> most recently posted from.)
> 
>  t/t1400-update-ref.sh |  7 +++++++
>  refs.c                | 19 +++++++++++++++----
>  2 files changed, 22 insertions(+), 4 deletions(-)
> 
> diff --git a/t/t1400-update-ref.sh b/t/t1400-update-ref.sh
> index 664a3a4e4e..bd41f86f22 100755
> --- a/t/t1400-update-ref.sh
> +++ b/t/t1400-update-ref.sh
> @@ -457,6 +457,13 @@ test_expect_success 'git cat-file blob master@{2005-05-26 23:42}:F (expect OTHER
>   test OTHER = $(git cat-file blob "master@{2005-05-26 23:42}:F")
>  '
>  
> +test_expect_success 'create pseudoref with old oid null, but do not overwrite' '
> + git update-ref PSEUDOREF $A $Z &&
> + test_when_finished "git update-ref -d PSEUDOREF" &&
> + test $A = $(cat .git/PSEUDOREF) &&
> + test_must_fail git update-ref PSEUDOREF $A $Z
> +'
> +
>  a=refs/heads/a
>  b=refs/heads/b
>  c=refs/heads/c
> diff --git a/refs.c b/refs.c
> index 8b7a77fe5e..3669190499 100644
> --- a/refs.c
> +++ b/refs.c
> @@ -666,10 +666,21 @@ static int write_pseudoref(const char *pseudoref, const struct object_id *oid,
>   if (old_oid) {
>   struct object_id actual_old_oid;
>  
> - if (read_ref(pseudoref, &actual_old_oid))
> - die("could not read ref '%s'", pseudoref);
> - if (oidcmp(&actual_old_oid, old_oid)) {
> - strbuf_addf(err, "unexpected sha1 when writing '%s'", pseudoref);
> + if (read_ref(pseudoref, &actual_old_oid)) {
> + if (!is_null_oid(old_oid)) {
> + strbuf_addf(err, "could not read ref '%s'",
> + pseudoref);
> + rollback_lock_file(&lock);
> + goto done;
> + }
> + } else if (is_null_oid(old_oid)) {
> + strbuf_addf(err, "ref '%s' already exists",
> + pseudoref);
> + rollback_lock_file(&lock);
> + goto done;
> + } else if (oidcmp(&actual_old_oid, old_oid)) {
> + strbuf_addf(err, "unexpected sha1 when writing '%s'",
> + pseudoref);
>   rollback_lock_file(&lock);
>   goto done;
>   }


Re: [PATCH v1] name-hash: properly fold directory names in adjust_dirname_case()

2018-02-07 Thread David Turner
On Wed, 2018-02-07 at 19:41 -0500, Ben Peart wrote:
> Correct the pointer arithmetic in adjust_dirname_case() so that it calls
> find_dir_entry() with the correct string length.  Previously passing in
> "dir1/foo" would pass a length of 6 instead of the correct 4.  This
> resulted in find_dir_entry() never finding the entry and so the
> subsequent memcpy that would fold the name to the version with the
> correct case never executed.
> 
> Add a test to validate the corrected behavior with name folding of
> directories.

Looks good to me.


Re: submodule modify/delete wrong message

2017-12-11 Thread David Turner
On Mon, 2017-12-11 at 12:27 -0800, Stefan Beller wrote:
> On Mon, Dec 4, 2017 at 1:39 PM, David Turner <nova...@novalis.org> wrote:
> > When merging with a submodule modify/delete conflict (i.e. I've deleted
> > the submodule, and I'm merging in a branch that modified it), git lies
> > about what it is doing:
> > 
> > "CONFLICT (modify/delete): submodule deleted in HEAD and modified in
> > submodules.
> 
> Up to here the error message sounds correct, still?

Yep!

> > Version submodules of submodule left in tree at submodule~submodules.
> > Automatic merge failed; fix conflicts and then commit the result."
> 
> This sounds as if the code assumed to handle only files.

This assumption is unfortunately very common -- I just filed a PR to
fix an instance of this in libgit2.

> > In fact, the working tree does not contain anything named
> > 'submodule~submodules'.
> > 
> > In addition, I would ordinarily resolve a conflict like this by using
> > 'git rm'. Here, this gives a warning:
> > 
> > $ git rm submodule
> > submodule: needs merge
> 
> (Regarding submodule merges in general:)
> 
> Uh. We cannot add merge markers to a submodule or such.
> More importantly we'd have to ask the question if the merge conflict is
> on the superproject level (Choose one of the commits of the submodule)
> or on the submodule level (perform a merge in the submodule between the
> two commits) or some hybrid approach thereof.

Yeah, that's a tricky thing in general.  In this case, tho, the
submodule is being removed, so git rm should work.

> > rm 'submodule'
> > warning: Could not find section in .gitmodules where path=submodule
> 
> The deletion of the submodule removed the .gitmodules entry, and the
> merge of the .gitmodules file presumably went fine. :/

Indeed.

> I assume we need a special merge driver for the .gitmodules file to keep
> the submodule around when it is in at least one side.
> 
> > Git's behavior here is significantly better than libgit2's (which tries
> > to check out 'submodule' as if it were a blob, and fails to do so), but
> > it's still confusing.
> > 
> > It's not clear to me what the correct behavior is here.  Maybe it's
> > sufficient to just fix the message?
> 
> I think the first step is to fix the message to reflect reality.
> 
> As alluded to above, I don't know what the correct merge
> behavior is (and where to put 'conflict markers').

The only case we need 'conflict markers' is in the
{add,modify}/{add,modify} case (with different versions on each side). 
The delete/* and */delete case can be handled more simply.  We might
not want to do this until we can handle all cases -- but personally, I
think a half-solution is better than a non-solution.



submodule modify/delete wrong message

2017-12-04 Thread David Turner
When merging with a submodule modify/delete conflict (i.e. I've deleted
the submodule, and I'm merging in a branch that modified it), git lies
about what it is doing:

"CONFLICT (modify/delete): submodule deleted in HEAD and modified in
submodules. Version submodules of submodule left in tree at
submodule~submodules.
Automatic merge failed; fix conflicts and then commit the result."

In fact, the working tree does not contain anything named
'submodule~submodules'.  

In addition, I would ordinarily resolve a conflict like this by using
'git rm'. Here, this gives a warning:

$ git rm submodule
submodule: needs merge
rm 'submodule'
warning: Could not find section in .gitmodules where path=submodule

Git's behavior here is significantly better than libgit2's (which tries
to check out 'submodule' as if it were a blob, and fails to do so), but
it's still confusing.

It's not clear to me what the correct behavior is here.  Maybe it's
sufficient to just fix the message?
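
For anyone who wants to poke at this, a minimal reproduction looks
roughly like this (sketch; branch and path names made up, starting
from a repo that already contains the submodule):

$ git checkout -b advance-sub
$ git -C submodule commit --allow-empty -m 'new submodule commit'
$ git add submodule && git commit -m 'modify submodule'
$ git checkout -
$ git rm submodule && git commit -m 'delete submodule'
$ git merge advance-sub   # reports the CONFLICT message quoted above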


Re: git reset of addition of a submodule?

2017-11-30 Thread David Turner
On Thu, 2017-11-30 at 12:05 -0500, David Turner wrote:
> git submodule add https://my-git-repo blort
> git commit -m 'add a submodule'
> git reset HEAD^ blort
> 
> The reset deletes the gitlink, but does not delete the entry in
> .gitmodules.  On one hand, this is exactly what the user asked for --
> they wanted the path 'blort' to be changed in the index, and that's
> what they got.  On the other hand, the behavior differs from git rm,
> and seems confusing: most folks don't want an entry in .gitmodules
> which doesn't correspond to a gitlink.  
> 
> If reset isn't the right thing for me to do when I want to say "oops"
> about adding a submodule, then what is?  I could do:
> git reset HEAD^ blort .gitmodules
> but what if I added two submodules and only wanted to undo the
> addition of one?


Also, resetting the deletion of a submodule has an even worse issue --
you end up with a gitlink but no entry in .gitmodules.



git reset of addition of a submodule?

2017-11-30 Thread David Turner
git submodule add https://my-git-repo blort
git commit -m 'add a submodule'
git reset HEAD^ blort

The reset deletes the gitlink, but does not delete the entry in
.gitmodules.  On one hand, this is exactly what the user asked for --
they wanted the path 'blort' to be changed in the index, and that's
what they got.  On the other hand, the behavior differs from git rm,
and seems confusing: most folks don't want an entry in .gitmodules
which doesn't correspond to a gitlink.  

If reset isn't the right thing for me to do when I want to say "oops"
about adding a submodule, then what is?  I could do:
git reset HEAD^ blort .gitmodules
but what if I added two submodules and only wanted to undo the addition
of one?




RE: [PATCH v7 06/12] ls-files: Add support in ls-files to display the fsmonitor valid bit

2017-09-19 Thread David Turner

> -Original Message-
> From: Ben Peart [mailto:peart...@gmail.com]
> Sent: Tuesday, September 19, 2017 4:45 PM
> To: David Turner <david.tur...@twosigma.com>; 'Ben Peart'
> <benpe...@microsoft.com>
> Cc: ava...@gmail.com; christian.cou...@gmail.com; git@vger.kernel.org;
> gits...@pobox.com; johannes.schinde...@gmx.de; pclo...@gmail.com;
> p...@peff.net
> Subject: Re: [PATCH v7 06/12] ls-files: Add support in ls-files to display the
> fsmonitor valid bit
> 
> 
> 
> On 9/19/2017 3:46 PM, David Turner wrote:
> >> -Original Message-
> >> From: Ben Peart [mailto:benpe...@microsoft.com]
> >> Sent: Tuesday, September 19, 2017 3:28 PM
> >> To: benpe...@microsoft.com
> >> Cc: David Turner <david.tur...@twosigma.com>; ava...@gmail.com;
> >> christian.cou...@gmail.com; git@vger.kernel.org; gits...@pobox.com;
> >> johannes.schinde...@gmx.de; pclo...@gmail.com; p...@peff.net
> >> Subject: [PATCH v7 06/12] ls-files: Add support in ls-files to
> >> display the fsmonitor valid bit
> >>
> >> Add a new command line option (-f) to ls-files to have it use
> >> lowercase letters for 'fsmonitor valid' files
> >>
> >> Signed-off-by: Ben Peart <benpe...@microsoft.com>
> >> ---
> >>   builtin/ls-files.c | 8 ++++++--
> >>   1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > This is still missing the corresponding documentation patch.
> 
> Sorry for the confusion.

Thanks for following up.

> > 10/12 (no reply, haven't checked whether same issue but I assume same
> > issue since the new case I mentioned isn't added)
> 
> It wasn't a bug so I didn't "fix" it.  I just sent an explanation and patch
> demonstrating why. You can find it here:

I was concerned about the case of an untracked file inside a directory 
that contains no tracked files.  Your patch in this mail treats dir3 just 
like dir1 and dir2.  I think you ought to test the case of a dir with no 
tracked files.

After more careful checking, it looks like this case does work, but it's 
still worth testing.
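
Concretely, I had in mind something along these lines (sketch):

# Extra case: a directory whose only content is untracked.
mkdir dir3 &&
: >dir3/untracked &&
git status --porcelain >actual &&
grep "dir3/" actual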



RE: [PATCH v7 06/12] ls-files: Add support in ls-files to display the fsmonitor valid bit

2017-09-19 Thread David Turner
> -Original Message-
> From: Ben Peart [mailto:benpe...@microsoft.com]
> Sent: Tuesday, September 19, 2017 3:28 PM
> To: benpe...@microsoft.com
> Cc: David Turner <david.tur...@twosigma.com>; ava...@gmail.com;
> christian.cou...@gmail.com; git@vger.kernel.org; gits...@pobox.com;
> johannes.schinde...@gmx.de; pclo...@gmail.com; p...@peff.net
> Subject: [PATCH v7 06/12] ls-files: Add support in ls-files to display the
> fsmonitor valid bit
> 
> Add a new command line option (-f) to ls-files to have it use lowercase
> letters for 'fsmonitor valid' files
> 
> Signed-off-by: Ben Peart <benpe...@microsoft.com>
> ---
>  builtin/ls-files.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)

This is still missing the corresponding documentation patch.  

I can see from replies that at least some of my messages got through.  In 
total, I sent messages about:
04/12 (I see replies)
05/12 (I see replies)
06/12 (no reply, issue not fixed)
10/12 (no reply, haven't checked whether same issue but I assume same issue 
since the new case I mentioned isn't added)
12/12 (no reply, typo fixed -- no reply required)



RE: [PATCH v6 10/12] fsmonitor: add test cases for fsmonitor extension

2017-09-19 Thread David Turner
I think my comment here might have gotten lost, and I don't want it to because 
it's something I'm really worried about:

> -Original Message-
> From: David Turner
> Sent: Friday, September 15, 2017 6:00 PM
> To: 'Ben Peart' <benpe...@microsoft.com>
> Cc: ava...@gmail.com; christian.cou...@gmail.com; git@vger.kernel.org;
> gits...@pobox.com; johannes.schinde...@gmx.de; pclo...@gmail.com;
> p...@peff.net
> Subject: RE: [PATCH v6 10/12] fsmonitor: add test cases for fsmonitor
> extension
> 
> > -Original Message-
> > +dirty_repo () {
> > +   : >untracked &&
> > +   : >dir1/untracked &&
> > +   : >dir2/untracked &&
> > +   echo 1 >modified &&
> > +   echo 2 >dir1/modified &&
> > +   echo 3 >dir2/modified &&
> > +   echo 4 >new &&
> > +   echo 5 >dir1/new &&
> > +   echo 6 >dir2/new
> 
> If I add an untracked file named dir3/untracked to dirty_repo  (and
> write_integration_script), then "status doesn't detect unreported
> modifications", below, fails.  Did I do something wrong, or does this turn up 
> a
> bug?




RE: [PATCH v6 04/12] fsmonitor: teach git to optionally utilize a file system monitor to speed up detecting new or changed files.

2017-09-18 Thread David Turner
> -Original Message-
> From: Ben Peart [mailto:peart...@gmail.com]
> Sent: Monday, September 18, 2017 9:07 AM
> To: David Turner <david.tur...@twosigma.com>; 'Ben Peart'
> <benpe...@microsoft.com>
> Cc: ava...@gmail.com; christian.cou...@gmail.com; git@vger.kernel.org;
> gits...@pobox.com; johannes.schinde...@gmx.de; pclo...@gmail.com;
> p...@peff.net
> Subject: Re: [PATCH v6 04/12] fsmonitor: teach git to optionally utilize a 
> file
> system monitor to speed up detecting new or changed files.
> 
> Thanks for taking the time to review/provide feedback!
> 
> On 9/15/2017 5:35 PM, David Turner wrote:
> >> -Original Message-
> >> From: Ben Peart [mailto:benpe...@microsoft.com]
> >> Sent: Friday, September 15, 2017 3:21 PM
> >> To: benpe...@microsoft.com
> >> Cc: David Turner <david.tur...@twosigma.com>; ava...@gmail.com;
> >> christian.cou...@gmail.com; git@vger.kernel.org; gits...@pobox.com;
> >> johannes.schinde...@gmx.de; pclo...@gmail.com; p...@peff.net
> >> Subject: [PATCH v6 04/12] fsmonitor: teach git to optionally utilize
> >> a file system monitor to speed up detecting new or changed files.
> >
> >> +int git_config_get_fsmonitor(void)
> >> +{
> >> +  if (git_config_get_pathname("core.fsmonitor", &core_fsmonitor))
> >> +  core_fsmonitor = getenv("GIT_FSMONITOR_TEST");
> >> +
> >> +  if (core_fsmonitor && !*core_fsmonitor)
> >> +  core_fsmonitor = NULL;
> >> +
> >> +  if (core_fsmonitor)
> >> +  return 1;
> >> +
> >> +  return 0;
> >> +}
> >
> > This function's return values are backwards relative to the rest of
> > the git_config_* functions.
> 
> I'm confused.  If core.fsmonitor is configured, it returns 1. If it is not
> configured, it returns 0. I don't make use of the -1 /* default value */
> option as I didn't see any use/value in this case. What is backwards?

The other git_config_* functions return 1 for error and 0 for success.

> > [snip]
> >
> > +>  /*
> > +>   * With fsmonitor, we can trust the untracked cache's valid field.
> > +>   */
> >
> 
> Did you intend to make a comment here?

Sorry.  I was going to make a comment that I didn't see how that could work 
since we weren't touching the untracked cache here, but then I saw the bit 
further down.   I'm still not sure it works (see comment on 10/12), but at
least it could in theory work.
 



RE: [PATCH v6 10/12] fsmonitor: add test cases for fsmonitor extension

2017-09-15 Thread David Turner
> -Original Message-
> +dirty_repo () {
> + : >untracked &&
> + : >dir1/untracked &&
> + : >dir2/untracked &&
> + echo 1 >modified &&
> + echo 2 >dir1/modified &&
> + echo 3 >dir2/modified &&
> + echo 4 >new &&
> + echo 5 >dir1/new &&
> + echo 6 >dir2/new

If I add an untracked file named dir3/untracked to dirty_repo
 (and write_integration_script), then "status doesn't detect 
unreported modifications", below, fails.  Did I do something 
wrong, or does this turn up a bug?

> + test_expect_success "setup preloadIndex to $preload_val" '
> + git config core.preloadIndex $preload_val &&
> + if [ $preload_val -eq true ]

"-eq" is for numeric equality in POSIX shell.  So this works if your 
/bin/sh is bash but not if it's e.g. dash.  This happens twice more 
below.  Use "=" instead.
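
I.e. something like this (sketch):

# POSIX-portable: '=' compares strings; -eq is only defined for
# integers, so dash errors out on '-eq true'.
if test "$preload_val" = true
then
	echo preloadIndex is on
fi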




RE: [PATCH v6 12/12] fsmonitor: add a performance test

2017-09-15 Thread David Turner

> -Original Message-
> + # Choose integration script based on existance of Watchman.

Spelling: existence




RE: [PATCH v6 04/12] fsmonitor: teach git to optionally utilize a file system monitor to speed up detecting new or changed files.

2017-09-15 Thread David Turner
> -Original Message-
> From: Ben Peart [mailto:benpe...@microsoft.com]
> Sent: Friday, September 15, 2017 3:21 PM
> To: benpe...@microsoft.com
> Cc: David Turner <david.tur...@twosigma.com>; ava...@gmail.com;
> christian.cou...@gmail.com; git@vger.kernel.org; gits...@pobox.com;
> johannes.schinde...@gmx.de; pclo...@gmail.com; p...@peff.net
> Subject: [PATCH v6 04/12] fsmonitor: teach git to optionally utilize a file 
> system
> monitor to speed up detecting new or changed files.
 
> +int git_config_get_fsmonitor(void)
> +{
> + if (git_config_get_pathname("core.fsmonitor", &core_fsmonitor))
> + core_fsmonitor = getenv("GIT_FSMONITOR_TEST");
> +
> + if (core_fsmonitor && !*core_fsmonitor)
> + core_fsmonitor = NULL;
> +
> + if (core_fsmonitor)
> + return 1;
> +
> + return 0;
> +}

This function's return values are backwards relative to the rest of the 
git_config_* functions.

[snip]

+>  /*
+>   * With fsmonitor, we can trust the untracked cache's valid field.
+>   */

[snip]

> +int read_fsmonitor_extension(struct index_state *istate, const void *data,
> + unsigned long sz)
> +{

If git_config_get_fsmonitor returns 0, fsmonitor_dirty will leak.

[snip]

> + /* a fsmonitor process can return '*' to indicate all entries are invalid */

That's not documented in your documentation.  Also, I'm not sure I like it:
what if I have a file whose name starts with '*'?  Yeah, that would be
silly, but this indicates the need for the protocol to have some sort of
signaling mechanism that's out-of-band.  Maybe have some key\0value\0
pairs and then \0\0 and then the list of files?  Or, if you want to keep
it really simple, allow an entry of '/' (which is an invalid filename) to
mean 'all'.

> +void add_fsmonitor(struct index_state *istate) {
> + int i;
> +
> + if (!istate->fsmonitor_last_update) {
[snip]
> + /* reset the untracked cache */

Is this really necessary?  Shouldn't the untracked cache be in a correct state 
already? 

> +/*
> + * Clear the given cache entries CE_FSMONITOR_VALID bit and invalidate

Nit: "s/entries/entry's/".
 



RE: [PATCH v6 06/12] ls-files: Add support in ls-files to display the fsmonitor valid bit

2017-09-15 Thread David Turner
> -Original Message-
> From: Ben Peart [mailto:benpe...@microsoft.com]
> Sent: Friday, September 15, 2017 3:21 PM
> To: benpe...@microsoft.com
> Cc: David Turner <david.tur...@twosigma.com>; ava...@gmail.com;
> christian.cou...@gmail.com; git@vger.kernel.org; gits...@pobox.com;
> johannes.schinde...@gmx.de; pclo...@gmail.com; p...@peff.net
> Subject: [PATCH v6 06/12] ls-files: Add support in ls-files to display the 
> fsmonitor
> valid bit
> 
> Add a new command line option (-f) to ls-files to have it use lowercase 
> letters
> for 'fsmonitor valid' files

Document in man page, please.


RE: [PATCH v6 05/12] fsmonitor: add documentation for the fsmonitor extension.

2017-09-15 Thread David Turner


> -Original Message-
> From: Ben Peart [mailto:benpe...@microsoft.com]
> Sent: Friday, September 15, 2017 3:21 PM
> To: benpe...@microsoft.com
> Cc: David Turner <david.tur...@twosigma.com>; ava...@gmail.com;
> christian.cou...@gmail.com; git@vger.kernel.org; gits...@pobox.com;
> johannes.schinde...@gmx.de; pclo...@gmail.com; p...@peff.net
> Subject: [PATCH v6 05/12] fsmonitor: add documentation for the fsmonitor
> extension.
> 
> This includes the core.fsmonitor setting, the query-fsmonitor hook, and the
> fsmonitor index extension.
> 
> Signed-off-by: Ben Peart <benpe...@microsoft.com>
> ---
>  Documentation/config.txt                 |  6 ++++++
>  Documentation/githooks.txt               | 23 +++++++++++++++++++++++
>  Documentation/technical/index-format.txt | 19 +++++++++++++++++++
>  3 files changed, 48 insertions(+)
> 
> diff --git a/Documentation/config.txt b/Documentation/config.txt index
> dc4e3f58a2..c196007a27 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -413,6 +413,12 @@ core.protectNTFS::
>   8.3 "short" names.
>   Defaults to `true` on Windows, and `false` elsewhere.
> 
> +core.fsmonitor::
> + If set, the value of this variable is used as a command which
> + will identify all files that may have changed since the
> + requested date/time. This information is used to speed up git by
> + avoiding unnecessary processing of files that have not changed.

I'm confused here.  You have a file called "fsmonitor-watchman", which seems to 
discuss the protocol for core.fsmonitor scripts in general, and you have this 
documentation, which does not link to that file.  Can you clarify this? 



> +The hook should output to stdout the list of all files in the working
> +directory that may have changed since the requested time.  The logic
> +should be inclusive so that it does not miss any potential changes.

+"It is OK to include files which have not actually changed.  Newly-created and 
deleted files should also be included.  When files are renamed, both the old 
and the new name should be included."

Also, please discuss case sensitivity issues (e.g. on OS X).  

> +The paths should be relative to the root of the working directory and
> +be separated by a single NUL.
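
As a degenerate example of a hook conforming to that description
(sketch only -- correct but useless for performance; -printf '%P\0'
assumes GNU find):

#!/bin/sh
# Claim that every file may have changed since the requested time;
# paths relative to the worktree root, NUL-separated.
find . -name .git -prune -o -type f -printf '%P\0'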



> +  - 32-bit version number: the current supported version is 1.
> +
> +  - 64-bit time: the extension data reflects all changes through the given
> + time which is stored as the nanoseconds elapsed since midnight,
> + January 1, 1970.

Nit: Please specify signed or unsigned for these.  (I expect to be getting
out of cryosleep around 2262, and I want to know if my old git repos will
keep working...)

> +  - 32-bit bitmap size: the size of the CE_FSMONITOR_VALID bitmap.
> +
> +  - An ewah bitmap, the n-th bit indicates whether the n-th index entry
> +is not CE_FSMONITOR_VALID.



RE: reftable [v5]: new ref storage format

2017-08-14 Thread David Turner
> -Original Message-
> From: Howard Chu [mailto:h...@symas.com]
> Sent: Monday, August 14, 2017 8:31 AM
> To: spea...@spearce.org
> Cc: David Turner <david.tur...@twosigma.com>; ava...@gmail.com;
> ben.a...@acegi.com.au; dborow...@google.com; git@vger.kernel.org;
> gits...@pobox.com; mhag...@alum.mit.edu; p...@peff.net;
> sbel...@google.com; sto...@gmail.com
> Subject: Re: reftable [v5]: new ref storage format
> 
> Howard Chu wrote:
> > The primary issue with using LMDB over NFS is with performance. All
> > reads are performed thru accesses of mapped memory, and in general,
> > NFS implementations don't cache mmap'd pages. I believe this is a
> > consequence of the fact that they also can't guarantee cache
> > coherence, so the only way for an NFS client to see a write from
> > another NFS client is by always refetching pages whenever they're accessed.
> 
> > LMDB's read lock management also wouldn't perform well over NFS; it
> > also uses an mmap'd file. On a local filesystem LMDB read locks are
> > zero cost since they just atomically update a word in the mmap. Over
> > NFS, each update to the mmap would also require an msync() to
> > propagate the change back to the server. This would seriously limit
> > the speed with which read transactions may be opened and closed.
> > (Ordinarily opening and closing a read txn can be done with zero
> > system calls.)
> 
> All that aside, we could simply add an EXCLUSIVE open-flag to LMDB, and
> prevent multiple processes from using the DB concurrently. In that case,
> maintaining coherence with other NFS clients is a non-issue. It strikes me
> that git doesn't require concurrent multi-process access anyway, and any
> particular process would only use the DB for a short time before closing
> it and going away.

Git, in general, does require concurrent multi-process access, depending
on what that means.

For example, a post-receive hook might call some git command which opens the 
ref database.  This means that git receive-pack would have to close and 
re-open the ref database.  More generally, a fair number of git commands are
implemented in terms of other git commands, and might need the same treatment.
We could, in general, close and re-open the database around fork/exec, but I am
not sure that this solves the general problem -- by mere happenstance, one might
be e.g. pushing in one terminal while running git checkout in another.  This is 
especially true with git worktrees, which share one ref database across multiple
working directories.



RE: reftable [v5]: new ref storage format

2017-08-07 Thread David Turner
> -Original Message-
> From: Shawn Pearce [mailto:spea...@spearce.org]
> In git-core, I'm worried about the caveats related to locking. Git tries
> to work nicely on NFS, and it seems LMDB wouldn't. Git also runs fine on
> a read-only filesystem, and LMDB gets a little weird about that. Finally,
> Git doesn't have nearly the risks LMDB has about a crashed reader or
> writer locking out future operations until the locks have been resolved.
> This is especially true with shared user repositories, where another user
> might setup and own the semaphore.

FWIW, git has problems with stale lock files in the event of a crash
(refs/foo.lock might still exist, and git does nothing to clean it up).

In my testing (which involved a *lot* of crashing), I never once had to
clean up a stale LMDB lock.  That said, I didn't test on a RO filesystem.


git-fast-export doesn't support mergetag header

2017-08-01 Thread David Turner
It looks to me like git-fast-export doesn't export the mergetag header on a 
commit.  Is this intentional, or an oversight?
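
For anyone wanting to check quickly, something like this shows the
header being dropped (sketch; assumes HEAD is a merge of a signed tag):

$ git cat-file commit HEAD | grep '^mergetag'   # header is present
$ git fast-export HEAD^..HEAD | grep mergetag || echo 'not exported'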


RE: [PATCH v5 7/7] fsmonitor: add a performance test

2017-07-07 Thread David Turner
> -Original Message-
> From: Junio C Hamano [mailto:jch2...@gmail.com] On Behalf Of Junio C
> Hamano
> Sent: Friday, July 7, 2017 2:35 PM
> To: Ben Peart <peart...@gmail.com>
> Cc: git@vger.kernel.org; benpe...@microsoft.com; pclo...@gmail.com;
> johannes.schinde...@gmx.de; David Turner <david.tur...@twosigma.com>;
> p...@peff.net; christian.cou...@gmail.com; ava...@gmail.com
> Subject: Re: [PATCH v5 7/7] fsmonitor: add a performance test
> 
> Ben Peart <peart...@gmail.com> writes:
> 
> > On 6/14/2017 2:36 PM, Junio C Hamano wrote:
> >> Ben Peart <peart...@gmail.com> writes:
> >>
> >>>> Having said all that, I think you are using this ONLY on windows;
> >>>> perhaps it is better to drop #ifdef GIT_WINDOWS_NATIVE from all of
> >>>> the above and arrange Makefile to build test-drop-cache only on
> >>>> that platform, or something?
> >>>
> >>> I didn't find any other examples of Windows only tools.  I'll update
> >>> the #ifdef to properly dump the file system cache on Linux as well
> >>> and only error out on other platforms.
> >>
> >> If this will become Windows-only, then I have no problem with
> >> platform specfic typedef ;-) I have no problem with CamelCase,
> >> either, as that follows the local convention on the platform (similar
> >> to those in compat/* that are only for Windows).
> >>
> >> Having said all that.
> >>
> >> Another approach is to build this helper on all platforms, ...
> 
> ... and having said all that, I think it is perfectly fine to do such a
> clean-up long after the series gets more exposure to wider audiences as
> a follow-up patch.  Let's get the primary part that affects people's
> everyday use of Git right and then worry about the test details later.
> 
> A quick show of hands to the list audiences.  How many of you guys
> actually tried this series on 'pu' and checked to see its performance
> (and correctness ;-) characteristics?
> 
> Do you folks like it?  Rather not have such complexity in the core part
> of the system?  A good first step to start adding more performance
> improvements?  No opinion?

I have not had the chance to test the latest version out yet.  The idea, 
broadly, seems sound, but as Ben notes in a later mail, the details are 
important.  Since he's going to re-roll with more interesting invalidation 
logic, I'll wait to try it again until a new version is available.


bug? fetching from empty unnamed remote repos

2017-06-12 Thread David Turner
$ git init empty-1
$ git init empty-2 
$ cd empty-1
$ git fetch ../empty-2
fatal: Couldn't find remote ref HEAD
fatal: The remote end hung up unexpectedly

But:
$ git init empty-1
$ git init empty-2 
$ cd empty-1
$ git remote add other ../empty-2
$ git fetch other
# this works

I haven't spent a lot of time looking into the code, because we can use the 
second option.  But it does seem weird.


RE: [WIP/PATCH 7/6] perf: add a performance test for core.fsmonitor

2017-06-02 Thread David Turner
BTW, a medium-sized (~250k files across 40k dirs) synthetic repo is available 
over bittorrent at:
http://bitmover.com/2015-04-03-1M-git-bare.tar.bz2.torrent

I tried Ævar's perf test with that (on a beefy laptop with SSD), and got 
significantly slower results with bp/fsmonitor:
Test  origin/master bp/fsmonitor   
---
7519.2: status (first)0.32(0.23+0.39)   0.32(0.26+0.36) +0.0%  
7519.3: status (subsequent)   0.18(0.14+0.34)   0.32(0.24+0.37) +77.8% 
7519.4: status -uno   0.11(0.08+0.32)   0.24(0.18+0.34) +118.2%
7519.5: status -uall  0.49(0.22+0.56)   0.62(0.36+0.55) +26.5% 

I have not yet looked into why this is.

> -Original Message-
> From: Ævar Arnfjörð Bjarmason [mailto:ava...@gmail.com]
> Sent: Friday, June 2, 2017 6:29 AM
> To: git@vger.kernel.org
> Cc: Junio C Hamano <gits...@pobox.com>; Ben Peart
> <peart...@gmail.com>; Nguyễn Thái Ngọc Duy <pclo...@gmail.com>;
> Johannes Schindelin <johannes.schinde...@gmx.de>; David Turner
> <david.tur...@twosigma.com>; Jeff King <p...@peff.net>; Christian
> Couder <christian.cou...@gmail.com>; Ævar Arnfjörð Bjarmason
> <ava...@gmail.com>
> Subject: [WIP/PATCH 7/6] perf: add a performance test for core.fsmonitor
> 
> Add a performance test for the new core.fsmonitor facility using the sample
> query-fsmonitor hook.
> 
> This is WIP code for the reasons explained in the setup comments,
> unfortunately the perf code doesn't easily allow you to run different setup
> code for different versions you're testing. This test will stop working if the
> fsmonitor is merged into the master branch.
> 
> Output against linux.git:
> 
> $ GIT_PERF_REPEAT_COUNT=10 GIT_PERF_LARGE_REPO=~/g/linux
> GIT_PERF_MAKE_OPTS='-j8' ./run origin/master avar/fsmonitor ./p7519-
> fsmonitor.sh
> [...]
> Test  origin/master avar/fsmonitor
> ---
> 7519.2: status (first)0.08(0.04+0.09)   0.12(0.07+0.10) +50.0%
> 7519.3: status (subsequent)   0.08(0.04+0.09)   0.12(0.06+0.11) +50.0%
> 7519.4: status -uno   0.02(0.02+0.05)   0.06(0.05+0.06) +200.0%
> 7519.5: status -uall  0.08(0.06+0.07)   0.12(0.07+0.10) +50.0%
> 
> And against a larger in-house monorepo I have here, with the same options
> (except the repo path):
> 
> Test  origin/master avar/fsmonitor
> ---
> 7519.2: status (first)0.20(0.11+0.18)   0.27(0.15+0.21) +35.0%
> 7519.3: status (subsequent)   0.20(0.11+0.18)   0.27(0.15+0.21) +35.0%
> 7519.4: status -uno   0.04(0.03+0.10)   0.22(0.08+0.12) +450.0%
> 7519.5: status -uall  0.20(0.13+0.16)   0.27(0.18+0.19) +35.0%
> 
> Against linux.git with a hack to flush the FS cache (on Linux) before running
> the first 'git status', only running one test so the result isn't discarded 
> as the
> slowest of N:
> 
> $ GIT_PERF_REPEAT_COUNT=1 GIT_PERF_LARGE_REPO=~/g/linux
> GIT_PERF_MAKE_COMMAND='sudo sync && echo 3 | sudo tee
> /proc/sys/vm/drop_caches >/dev/null && make -j8' ./run origin/master
> avar/fsmonitor ./p7519-fsmonitor.sh
> [...]
> Test  origin/master avar/fsmonitor
> 
> 7519.2: status (first)0.30(0.18+0.10)   8.26(0.22+0.10) +2653.3%
> 7519.3: status (subsequent)   0.08(0.04+0.08)   0.81(0.09+0.07) +912.5%
> 7519.4: status -uno   0.02(0.01+0.06)   0.08(0.04+0.07) +300.0%
> 7519.5: status -uall  0.08(0.06+0.07)   0.15(0.08+0.09) +87.5%
> 
> Now obviously due to 1 run that has a lot of noise, but I would expect that
> first invocation to be blindingly fast since watchman has info on what files
> were modified since the cache was flushed.
> 
> The same on the large monorepo noted above:
> 
> Test  origin/master avar/fsmonitor
> ---
> 7519.2: status (first)0.59(0.28+0.24)   0.93(0.35+0.19) +57.6%
> 7519.3: status (subsequent)   0.20(0.10+0.19)   0.28(0.16+0.20) +40.0%
> 7519.4: status -uno   0.04(0.04+0.09)   0.11(0.08+0.1

RE: [PATCH 0/1] Preserve the untracked cache across checkout, reset --hard, etc

2017-05-18 Thread David Turner
> -Original Message-
> From: Ben Peart [mailto:peart...@gmail.com]
> Sent: Thursday, May 18, 2017 12:58 PM
> To: Junio C Hamano <gits...@pobox.com>; David Turner
> <david.tur...@twosigma.com>
> Cc: 'Christian Couder' <christian.cou...@gmail.com>; Johannes Schindelin
> <johannes.schinde...@gmx.de>; git <git@vger.kernel.org>; Nguyễn Thái Ngọc
> Duy <pclo...@gmail.com>; Ben Peart <benpe...@microsoft.com>
> Subject: Re: [PATCH 0/1] Preserve the untracked cache across checkout, reset 
> --
> hard, etc
> 
> 
> 
> On 5/9/2017 8:51 AM, Ben Peart wrote:
> >
> > On 5/9/2017 1:02 AM, Junio C Hamano wrote:
> >> David Turner <david.tur...@twosigma.com> writes:
> >>
> >>> Can you actually keep the email address as my Twopensource one?  I
> >>> want to make sure that Twitter, my employer at the time, gets credit
> >>> for this work (just as I want to make sure that my current employer,
> >>> Two Sigma, gets credit for my current work).
> >>>
> >>> Please feel free to add Signed-off-by: David Turner
> >>> <dtur...@twosigma.com> in case that makes tracking easier.
> >>>
> >>> Thanks.
> >>>
> >>> WRT the actual patch, I want to note that past me did not do a great
> >>> job here.  The tests do not correctly check that the post-checkout
> >>> untracked cache is still valid after a checkout.
> >>> For example, let's say that previously, the directory foo was
> >>> entirely untracked (but it contained a file bar), but after the
> >>> checkout, there is a file foo/baz.  Does the untracked cache need to
> >>> get updated?
> >>>
> >>> Unfortunately, the untracked cache is very unlikely to make it to
> >>> the top of my priority list any time soon, so I won't be able to
> >>> correct this test (and, if necessary, correct the code).  But I
> >>> would strongly suggest that the test be improved before this code is
> >>> merged.
> >>>
> >>> Thanks for CCing me.
> >> I will try to find time to tweak what was sent to the list here to
> >> reflect your affiliations better, but marked with DONTMERGE waiting
> >> for the necessary updates you mentioned above, so that this change is
> >> not forgotten.  It may turn out to be that copying from src to dst
> >> like the patch does is all that is needed, or the cache may need
> >> further invalidation when the copying happens, and I haven't got a
> >> good feeling that anybody who are familiar with the codepath vetted
> >> the correctness from seeing the discussion from sidelines (yet).
> >>
> >> Thanks.
> >
> > I've been looking into similar issues with the cache associated with
> > using a file system monitor (aka Watchman)
> > (https://github.com/git-for-windows/git/compare/master...benpeart:fsmo
> > nitor) to speed updating the index to reflect changes in the working
> > directory.
> >
> > I can take a look and see if this patch results in valid results and
> > reply to the thread or resubmit as needed.
> >
> > Ben
> 
> TLDR: the patch looks good from my perspective but I'd like the experts
> to weigh in as well.

Thanks for looking into this.  I'm glad to learn that I got it right the first 
time, 
although I still wish I had been more assiduous about testing back then. 

> After digging into the untracked cache code and thinking about whether
> it is reasonable to copy the cache from the old index to the new index
> in unpack_trees() I believe the answer is "yes."  I'm not the expert in
> this code so I'll outline my reasoning here and hopefully the real
> experts can review it and see if I've missed something.
> 
> The interesting part of the untracked cache for this discussion is the
> list of untracked_cache_dir structures.  Because each directory cache
> entry contains stat_data (esp ctime and mtime) for that directory - the
> existing logic will detect if that directory has had any changes made in
> it since the cache entry was saved.  It doesn't really care when, why,
> or how the change was made, just if one has happened.
> 
> I then tried to think of ways that this logic could be broken (like
> David's example above) but was unsuccessful in coming up with any.  This
> makes sense because the untracked cache obviously has to correctly
> detect _any_ change so really doesn't care whether its cached state was
> initially saved before or after a call to unpack_trees().

It looks like un

RE: [PATCH v1 2/5] Teach git to optionally utilize a file system monitor to speed up detecting new or changed files.

2017-05-15 Thread David Turner

> -Original Message-
> From: Ben Peart [mailto:peart...@gmail.com]
> Sent: Monday, May 15, 2017 3:14 PM
> To: git@vger.kernel.org
> Cc: gits...@pobox.com; benpe...@microsoft.com; pclo...@gmail.com;
> johannes.schinde...@gmx.de; David Turner <david.tur...@twosigma.com>;
> p...@peff.net
> Subject: [PATCH v1 2/5] Teach git to optionally utilize a file system monitor 
> to
> speed up detecting new or changed files.

> @@ -342,6 +344,8 @@ struct index_state {
>   struct hashmap dir_hash;
>   unsigned char sha1[20];
>   struct untracked_cache *untracked;
> + time_t last_update;
> + struct ewah_bitmap *bitmap;

The name 'bitmap' doesn't tell the reader much about what it is used for.

> +static int update_istate(const char *name, void *is) {

Rename to mark_file_dirty?  Also why does it take a void pointer?  Or return 
int (rather than void)?

> +void refresh_by_fsmonitor(struct index_state *istate) {
> + static has_run_once = FALSE;
> + struct strbuf buffer = STRBUF_INIT;

Rename to query_result? Also I think you're leaking it.



RE: [PATCH v1 5/5] Add a sample query-fsmonitor hook script that integrates with the cross platform Watchman file watching service.

2017-05-15 Thread David Turner


> -Original Message-
> From: Ben Peart [mailto:peart...@gmail.com]
> Sent: Monday, May 15, 2017 3:14 PM
> To: git@vger.kernel.org
> Cc: gits...@pobox.com; benpe...@microsoft.com; pclo...@gmail.com;
> johannes.schinde...@gmx.de; David Turner <david.tur...@twosigma.com>;
> p...@peff.net
> Subject: [PATCH v1 5/5] Add a sample query-fsmonitor hook script that
> integrates with the cross platform Watchman file watching service.
> 
> To use the script:
> 
> Download and install Watchman from https://facebook.github.io/watchman/
> and instruct Watchman to watch your working directory for changes
> ('watchman watch-project /usr/src/git').
> 
> Rename the sample integration hook from query-fsmonitor.sample to query-
> fsmonitor.
> 
> Configure git to use the extension ('git config core.fsmonitor true') and
> optionally turn on the untracked cache for optimal performance ('git config
> core.untrackedcache true').
> 
> Signed-off-by: Ben Peart <benpe...@microsoft.com>
> Signed-off-by: Johannes Schindelin <johannes.schinde...@gmx.de>
> ---
>  templates/hooks--query-fsmonitor.sample | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
>  create mode 100644 templates/hooks--query-fsmonitor.sample
> 
> diff --git a/templates/hooks--query-fsmonitor.sample b/templates/hooks--
> query-fsmonitor.sample
> new file mode 100644
> index 00..4bd22f21d8
> --- /dev/null
> +++ b/templates/hooks--query-fsmonitor.sample
> @@ -0,0 +1,27 @@
> +#!/bin/sh
> +#
> +# An example hook script to integrate Watchman
> +# (https://facebook.github.io/watchman/) with git to provide fast
> +# git status.
> +#
> +# The hook is passed a time_t formatted as a string and outputs to
> +# stdout all files that have been modified since the given time.
> +# Paths must be relative to the root of the working tree and
> +# separated by a single NUL.
> +#
> +# To enable this hook, rename this file to "query-fsmonitor"
> +
> +# Convert unix style paths to escaped Windows style paths
> +case "$(uname -s)" in
> +MINGW*|MSYS_NT*)
> +  GIT_WORK_TREE="$(cygpath -aw "$PWD" | sed 's,\\,,g')"
> +  ;;
> +*)
> +  GIT_WORK_TREE="$PWD"
> +  ;;
> +esac
> +
> +# Query Watchman for all the changes since the requested time
> +echo "[\"query\", \"$GIT_WORK_TREE\", {\"since\": $1, \"fields\":[\"name\"]}]" | \
> +watchman -j | \
> +perl -e 'use JSON::PP; my $o = JSON::PP->new->utf8->decode(join("", <>)); die "Watchman: $o->{'error'}.\nFalling back to scanning...\n" if defined($o->{"error"}); print(join("\0", @{$o->{"files"}}));'

Last time I checked, the argument to 'since' was not a time_t -- it was a 
watchman clock spec.  Have you tested this?  Does it work?
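
For comparison, a since query built from a clock spec looks something
like this (sketch; the 'c:...' id is made up -- a real one comes from
the "clock" field of a previous watchman response):

$ echo '["query", "/path/to/worktree",
        {"since": "c:1494932555:21:3:7", "fields": ["name"]}]' |
  watchman -j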



RE: [PATCH 0/1] Preserve the untracked cache across checkout, reset --hard, etc

2017-05-08 Thread David Turner
Can you actually keep the email address as my Twopensource one?  I want to make 
sure that Twitter, my employer at the time, gets credit for this work (just as 
I want to make sure that my current employer, Two Sigma, gets credit for my 
current work).

Please feel free to add Signed-off-by: David Turner <dtur...@twosigma.com> in 
case that makes tracking easier.

Thanks.

WRT the actual patch, I want to note that past me did not do a great job here.  
The tests do not correctly check that the post-checkout untracked cache is 
still valid after a checkout.  For example, let's say that previously, the 
directory foo was entirely untracked (but it contained a file bar), but after 
the checkout, there is a file foo/baz.  Does the untracked cache need to get 
updated?  

Unfortunately, the untracked cache is very unlikely to make it to the top of my 
priority list any time soon, so I won't be able to correct this test (and, if 
necessary, correct the code).  But I would strongly suggest that the test be 
improved before this code is merged.

Thanks for CCing me.

> -Original Message-
> From: Christian Couder [mailto:christian.cou...@gmail.com]
> Sent: Monday, May 8, 2017 6:12 AM
> To: Johannes Schindelin <johannes.schinde...@gmx.de>
> Cc: git <git@vger.kernel.org>; Junio C Hamano <gits...@pobox.com>; Nguyễn
> Thái Ngọc Duy <pclo...@gmail.com>; Ben Peart <benpe...@microsoft.com>;
> David Turner <david.tur...@twosigma.com>
> Subject: Re: [PATCH 0/1] Preserve the untracked cache across checkout, reset 
> --
> hard, etc
> 
> (Adding Dave in Cc as it looks like he is involved.)
> 
> On Mon, May 8, 2017 at 11:41 AM, Johannes Schindelin
> <johannes.schinde...@gmx.de> wrote:
> > I recently sent out a request for assistance, after noticing that the
> > untracked cache is simply thrown away after operations such as `git
> > checkout` or `git reset --hard`:
> >
> > http://public-inbox.org/git/alpine.DEB.2.20.1705031202470.3480@virtualbox/
> >
> > Duy responded with some high-level reasoning that it should be
> > possible to simply reuse the untracked cache data structure in the new
> > index, as he had a gut feeling that "we do invalidation right".
> >
> > I did not have time to back that up by a thorough analysis of the
> > code, but it turns out that it is unnecessary: Ben Peart pointed me to
> > a patch of Dave Turner's that was submitted as part of the watchman
> > series, addressing the very issue about which I was concerned.
> >
> > And I trust Dave to have validated the idea that the untracked cache
> > invalidation "is done right" even when we simply move the pointer to a
> > different index_state struct than originally.
> >
> > Seeing as the untracked cache being dropped unceremoniously when it
> > should not be dropped, in a surprising number of operations, I think
> > it is a sensible change, and important, too, and independent enough
> > from the watchman patches to merit being separated out and applied
> > pretty soon.
> >
> > So what I did was simply to drop the two lines from this patch that
> > referred to index_state fields added by Dave's watchman patch series.
> >
> > Please do not mistake this for a sign that I am disinterested in
> > watchman support, far from it... stay tuned ;-)
> >
> > Oh, and I adjusted Dave's email address. Dave, is that okay?
> >
> > As we are in a feature freeze phase, I was debating whether to send
> > out this patch now or later.
> >
> > Having thought about it for quite a bit, I am now convinced that this
> > patch fixes a bug in the untracked cache feature that is so critical
> > as to render it useless: if you
> >
> > - have to switch between branches frequently, or
> > - rebase frequently (which calls `git reset --hard`), or
> > - stash frequently (which calls `git reset --hard`),
> >
> > it is as if you had not enabled the untracked cache at all. Even
> > worse, Git will do a ton of work to recreate the untracked cache and
> > to store it as an index extension, *just* to throw the untracked away
> > in the end.
> >
> >
> > David Turner (1):
> >   unpack-trees: preserve index extensions
> >
> >  cache.h                           |  1 +
> >  read-cache.c                      |  6 ++++++
> >  t/t7063-status-untracked-cache.sh | 22 ++++++++++++++++++++++
> >  unpack-trees.c                    |  1 +
> >  4 files changed, 30 insertions(+)
> >
> >
> > base-commit: 4fa66c85f11bc5a541462ca5ae3246aa0ce02e74
> > Published-As:
> > https://github.com/dscho/git/releases/tag/preserve-untracked-cache-v1
> > Fetch-It-Via: git fetch https://github.com/dscho/git
> > preserve-untracked-cache-v1
> >
> > --
> > 2.12.2.windows.2.800.gede8f145e06
> >


RE: [PATCH] Increase core.packedGitLimit

2017-04-20 Thread David Turner

> -Original Message-
> From: Johannes Schindelin [mailto:johannes.schinde...@gmx.de]
> Sent: Thursday, April 20, 2017 5:58 PM
> To: Jeff King <p...@peff.net>
> Cc: David Turner <david.tur...@twosigma.com>; git@vger.kernel.org
> Subject: Re: [PATCH] Increase core.packedGitLimit
> 
> Hi Peff,
> 
> On Thu, 20 Apr 2017, Jeff King wrote:
> 
> > On Thu, Apr 20, 2017 at 04:41:18PM -0400, David Turner wrote:
> >
> > > When core.packedGitLimit is exceeded, git will close packs.  If
> > > there is a repack operation going on in parallel with a fetch, the
> > > fetch might open a pack, and then be forced to close it due to
> > > packedGitLimit being hit.  The repack could then delete the pack out
> > > from under the fetch, causing the fetch to fail.
> > >
> > > Increase core.packedGitLimit's default value to prevent this.
> > >
> > > On current 64-bit x86_64 machines, 48 bits of address space are
> > > available.  It appears that 64-bit ARM machines have no standard
> > > amount of address space (that is, it varies by manufacturer), and
> > > IA64 and POWER machines have the full 64 bits.  So 48 bits is the
> > > only limit that we can reasonably care about.  We reserve a few bits
> > > of the 48-bit address space for the kernel's use (this is not
> > > strictly necessary, but it's better to be safe), and use up to the
> > > remaining 45.  No git repository will be anywhere near this large
> > > any time soon, so this should prevent the failure.
> >
> > Yep, I think this is a reasonable direction.
> >
> > > ---
> > >  git-compat-util.h | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > This probably needs an update to the core.packedGitLimit section of
> > Documentation/config.txt.
> >
> > > diff --git a/git-compat-util.h b/git-compat-util.h
> > > index 8a4a3f85e7..1c5de153a5 100644
> > > --- a/git-compat-util.h
> > > +++ b/git-compat-util.h
> > > @@ -616,7 +616,7 @@ extern int git_lstat(const char *, struct stat *);
> > >  #endif
> > >
> > >  #define DEFAULT_PACKED_GIT_LIMIT \
> > > - ((1024L * 1024L) * (size_t)(sizeof(void*) >= 8 ? 8192 : 256))
> > > + ((1024L * 1024L) * (size_t)(sizeof(void*) >= 8 ? (32 * 1024L * 1024L) : 256))
> >
> > I wondered if we would run afoul of integer sizes on 64-bit systems
> > where "long" is still only 32-bits (i.e., Windows). But I think it's
> > OK, because the values before we cast to size_t are in megabytes. So
> > your
> > 32*1024*1024 needs only 25 bits to store it. And then after we cast to
> > size_t, everything is in 64-bit.
> 
> Indeed, when I patch a local Git checkout accordingly, I see that
> packed_git_limit is set to 35184372088832.
> 
> The bigger problem in this regard is that users are allowed to override this via
> core.packedgitlimit but that value is parsed as an unsigned long.

We might want to think about replacing git_config_ulong with
git_config_size_t in nearly all cases.  "long" has ceased to be
useful: on 64-bit Windows it is still only 32 bits wide.  More modern
versions of C prefer uint64_t, but I think that we'll usually want
size_t because these values will be used as memory limits of various
sorts.
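A minimal sketch of what that could look like inside config.c, modeled
on the git_parse_ssize_t/git_config_ssize_t pair posted later in this
digest (the reuse of git_parse_unsigned and die_bad_number is an
assumption about config.c internals, not code from any posted patch):

static int git_parse_size_t(const char *value, size_t *ret)
{
	uintmax_t tmp;
	/* reuse config.c's existing bounded unsigned parser */
	if (!git_parse_unsigned(value, &tmp,
				maximum_unsigned_value_of_type(size_t)))
		return 0;
	*ret = tmp;
	return 1;
}

size_t git_config_size_t(const char *name, const char *value)
{
	size_t ret;
	if (!git_parse_size_t(value, &ret))
		die_bad_number(name, value);
	return ret;
}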



[PATCH] Increase core.packedGitLimit

2017-04-20 Thread David Turner
When core.packedGitLimit is exceeded, git will close packs.  If there
is a repack operation going on in parallel with a fetch, the fetch
might open a pack, and then be forced to close it due to
packedGitLimit being hit.  The repack could then delete the pack
out from under the fetch, causing the fetch to fail.

Increase core.packedGitLimit's default value to prevent
this.

On current 64-bit x86_64 machines, 48 bits of address space are
available.  It appears that 64-bit ARM machines have no standard
amount of address space (that is, it varies by manufacturer), and IA64
and POWER machines have the full 64 bits.  So 48 bits is the only
limit that we can reasonably care about.  We reserve a few bits of the
48-bit address space for the kernel's use (this is not strictly
necessary, but it's better to be safe), and use up to the remaining
45.  No git repository will be anywhere near this large any time soon,
so this should prevent the failure.

Helped-by: Jeff King <p...@peff.net>
Signed-off-by: David Turner <dtur...@twosigma.com>
---
 git-compat-util.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/git-compat-util.h b/git-compat-util.h
index 8a4a3f85e7..1c5de153a5 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -616,7 +616,7 @@ extern int git_lstat(const char *, struct stat *);
 #endif
 
 #define DEFAULT_PACKED_GIT_LIMIT \
-   ((1024L * 1024L) * (size_t)(sizeof(void*) >= 8 ? 8192 : 256))
+   ((1024L * 1024L) * (size_t)(sizeof(void*) >= 8 ? (32 * 1024L * 1024L) : 256))
 
 #ifdef NO_PREAD
 #define pread git_pread
-- 
2.11.GIT



RE: [PATCH] repack: respect gc.pid lock

2017-04-20 Thread David Turner
> -Original Message-
> From: Jeff King [mailto:p...@peff.net]
> Sent: Tuesday, April 18, 2017 1:50 PM
> To: David Turner <david.tur...@twosigma.com>
> Cc: git@vger.kernel.org; christian.cou...@gmail.com; mf...@codeaurora.org;
> jacob.kel...@gmail.com
> Subject: Re: [PATCH] repack: respect gc.pid lock
> 
> On Tue, Apr 18, 2017 at 05:43:29PM +0000, David Turner wrote:
> 
> > > A lock can catch the racy cases where both run at the same time. But
> > > I think that
> > > even:
> > >
> > >   git -c repack.writeBitmaps=true repack -Ad
> > >   [...wait...]
> > >   git gc
> > >
> > > is questionable, because that gc will erase your bitmaps. How does
> > > git-gc know that it's doing a bad thing by repacking without
> > > bitmaps, and that you didn't simply change your configuration or want to
> > > get rid of them?
> >
> > Sorry, the gc in Gitlab does keep bitmaps.  The one I quoted in a
> > previous message doesn't, because the person typing the command was
> > just doing some manual testing and I guess didn't realize that
> > bitmaps were important.  Or perhaps he knew that repack.writeBitmaps was
> > already set in the config.
> 
> Sure, but I guess I'd just wonder what _else_ is different between the
> commands (and if nothing, why are both running).

Presumably, repack is faster, and they're not intended to run concurrently (but 
there's a Gitlab bug causing them to do so).  But you'll have to ask the Gitlab 
folks for more details.

So given that the lock will catch the races, might it be a good idea
(if implemented to avoid locking on repack -d)?
> 
> I'm mildly negative just because it increases complexity, and I don't think
> it's actually buying very much. It's not clear to me which invocations of
> repack would want to lock and which ones wouldn't.
> 
> Is "-a" or "-A" the key factor? Are there current callers who prefer the
> current behavior of "possibly duplicate some work, but never report failure"
> versus "do not duplicate work, but sometimes fail due to lock contention"?

One problem with failing is that it can leave a temp pack behind.

I think the correct fix is to change the default core.packedGitLimit on 64-bit
machines to 32 terabytes (2**45 bytes).  That's because on modern Intel
processors, there are 48 bits of address space actually available, but the
kernel is probably going to reserve a few bits.  My machine claims to have
2**46 bytes of virtual address space available.  It's also several times
bigger than any repo that I know of or can easily imagine.

Does that seem reasonable to you?
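For concreteness, here is the arithmetic behind that number as a tiny
standalone check (not from any posted patch):

#include <stdio.h>

int main(void)
{
	/* 32 * 1024 * 1024 MiB = 2**25 * 2**20 bytes = 2**45 bytes */
	unsigned long long limit = (1024ULL * 1024ULL) * (32ULL * 1024ULL * 1024ULL);

	printf("%llu\n", limit); /* prints 35184372088832, i.e. 32 TiB */
	return 0;
}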


RE: [PATCH v3 1/2] use HOST_NAME_MAX to size buffers for gethostname(2)

2017-04-19 Thread David Turner
> I had another look at this last night and cooked up the following patch.
> Might have gone overboard with it..
> 
> -- >8 --
> Subject: [PATCH] gc: support arbitrary hostnames and pids in lock_repo_for_gc()
> 
> git gc writes its pid and hostname into a pidfile to prevent concurrent garbage
> collection.  Repositories may be shared between systems with different limits
> for host name length and different pid ranges.  Use a strbuf to store the file
> contents to allow for arbitrarily long hostnames and pids to be shown to the
> user on early abort.

This is pretty paranoid, but maybe the remote host has a longer pid_t than we
do, so we should be using intmax_t when reading the pid, and only check its
size before passing it to kill?

(Personally, I think this whole patch is kind of overkill, but some folks
probably think the same about my original patches, so I'm happy to live and
let live.)
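To make the intmax_t idea concrete, a standalone sketch (the function
and its surroundings are invented for illustration; only the
scan-wide-then-narrow-before-kill pattern is the point):

#include <inttypes.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Probe the process recorded in a gc.pid-style file. */
static int probe_lock_pid(FILE *fp, const char *my_host, const char *locking_host)
{
	intmax_t pid;

	if (fscanf(fp, "%" SCNdMAX, &pid) != 1)
		return -1;		/* malformed pidfile */
	if (strcmp(locking_host, my_host))
		return 0;		/* remote host: don't signal */
	if (pid < 1 || pid != (intmax_t)(pid_t)pid)
		return -1;		/* doesn't fit in our pid_t */
	return kill((pid_t)pid, 0);	/* 0 if the process is alive */
}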


RE: [PATCH v3 2/2] xgethostname: handle long hostnames

2017-04-19 Thread David Turner
> -Original Message-
> From: Junio C Hamano [mailto:gits...@pobox.com]
> Sent: Tuesday, April 18, 2017 10:51 PM
> To: Jonathan Nieder <jrnie...@gmail.com>
> Cc: David Turner <david.tur...@twosigma.com>; git@vger.kernel.org;
> l@web.de
> Subject: Re: [PATCH v3 2/2] xgethostname: handle long hostnames
> 
> Jonathan Nieder <jrnie...@gmail.com> writes:
> 
> > Hi,
> >
> > David Turner wrote:
> >
> >> If the full hostname doesn't fit in the buffer supplied to
> >> gethostname, POSIX does not specify whether the buffer will be
> >> null-terminated, so to be safe, we should do it ourselves.  Introduce
> >> new function, xgethostname, which ensures that there is always a \0
> >> at the end of the buffer.
> >
> > I think we should detect the error instead of truncating the hostname.
> > That (on top of your patch) would look like the following.
> >
> > Thoughts?
> > Jonathan
> >
> > diff --git i/wrapper.c w/wrapper.c
> > index d837417709..e218bd3bef 100644
> > --- i/wrapper.c
> > +++ w/wrapper.c
> > @@ -660,11 +660,13 @@ int xgethostname(char *buf, size_t len)  {
> > /*
> >  * If the full hostname doesn't fit in buf, POSIX does not
> > -* specify whether the buffer will be null-terminated, so to
> > -* be safe, do it ourselves.
> > +* guarantee that an error will be returned. Check for ourselves
> > +* to be safe.
> >  */
> > int ret = gethostname(buf, len);
> > -   if (!ret)
> > -   buf[len - 1] = 0;
> > +   if (!ret && !memchr(buf, 0, len)) {
> > +   errno = ENAMETOOLONG;
> > +   return -1;
> > +   }
> 
> Hmmm.  "Does not specify if the buffer will be NUL-terminated"
> would mean that it is OK for the platform gethostname() to stuff
> sizeof(buf)-1 first bytes of the hostname in the buffer and then truncate by
> placing '\0' at the end of the buf, and we would not notice truncation with
> the above change on such a platform, no?

My read of the docs is that not only is that OK, but it is also permitted
for the platform to put sizeof(buf) bytes into the buffer and *not* 
put \0 at the end.

So in order to do a dynamic approach, we would have to allocate some
buffer, then run gethostname, then check if the penultimate element 
of the buffer was written to, and if so, allocate a larger buffer.  Yucky,
but possible.



[PATCH v3 2/2] xgethostname: handle long hostnames

2017-04-18 Thread David Turner
If the full hostname doesn't fit in the buffer supplied to
gethostname, POSIX does not specify whether the buffer will be
null-terminated, so to be safe, we should do it ourselves.  Introduce
new function, xgethostname, which ensures that there is always a \0
at the end of the buffer.

Signed-off-by: David Turner <dtur...@twosigma.com>
---
 builtin/gc.c   |  2 +-
 builtin/receive-pack.c |  2 +-
 fetch-pack.c   |  2 +-
 git-compat-util.h  |  2 ++
 ident.c|  2 +-
 wrapper.c  | 13 +
 6 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index 4c4a36e2b5..33a1edabbc 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -250,7 +250,7 @@ static const char *lock_repo_for_gc(int force, pid_t* ret_pid)
/* already locked */
return NULL;
 
-   if (gethostname(my_host, sizeof(my_host)))
+   if (xgethostname(my_host, sizeof(my_host)))
xsnprintf(my_host, sizeof(my_host), "unknown");
 
pidfile_path = git_pathdup("gc.pid");
diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c
index 2612efad3d..0ca423a711 100644
--- a/builtin/receive-pack.c
+++ b/builtin/receive-pack.c
@@ -1700,7 +1700,7 @@ static const char *unpack(int err_fd, struct shallow_info *si)
argv_array_pushl(, "index-pack",
 "--stdin", hdr_arg, NULL);
 
-   if (gethostname(hostname, sizeof(hostname)))
+   if (xgethostname(hostname, sizeof(hostname)))
xsnprintf(hostname, sizeof(hostname), "localhost");
argv_array_pushf(,
 "--keep=receive-pack %"PRIuMAX" on %s",
diff --git a/fetch-pack.c b/fetch-pack.c
index 055f568775..15d59a0440 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -803,7 +803,7 @@ static int get_pack(struct fetch_pack_args *args,
argv_array_push(, "--fix-thin");
if (args->lock_pack || unpack_limit) {
char hostname[HOST_NAME_MAX + 1];
-   if (gethostname(hostname, sizeof(hostname)))
+   if (xgethostname(hostname, sizeof(hostname)))
xsnprintf(hostname, sizeof(hostname), "localhost");
argv_array_pushf(,
"--keep=fetch-pack %"PRIuMAX " on %s",
diff --git a/git-compat-util.h b/git-compat-util.h
index 46f3abe401..bd04564a69 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -888,6 +888,8 @@ extern int xsnprintf(char *dst, size_t max, const char *fmt, ...);
 #define HOST_NAME_MAX 256
 #endif
 
+extern int xgethostname(char *buf, size_t len);
+
 /* in ctype.c, for kwset users */
 extern const unsigned char tolower_trans_tbl[256];
 
diff --git a/ident.c b/ident.c
index 556851cf94..bea871c8e0 100644
--- a/ident.c
+++ b/ident.c
@@ -122,7 +122,7 @@ static void add_domainname(struct strbuf *out, int *is_bogus)
 {
char buf[HOST_NAME_MAX + 1];
 
-   if (gethostname(buf, sizeof(buf))) {
+   if (xgethostname(buf, sizeof(buf))) {
warning_errno("cannot get host name");
strbuf_addstr(out, "(none)");
*is_bogus = 1;
diff --git a/wrapper.c b/wrapper.c
index 0542fc7582..d837417709 100644
--- a/wrapper.c
+++ b/wrapper.c
@@ -655,3 +655,16 @@ void sleep_millisec(int millisec)
 {
poll(NULL, 0, millisec);
 }
+
+int xgethostname(char *buf, size_t len)
+{
+   /*
+* If the full hostname doesn't fit in buf, POSIX does not
+* specify whether the buffer will be null-terminated, so to
+* be safe, do it ourselves.
+*/
+   int ret = gethostname(buf, len);
+   if (!ret)
+   buf[len - 1] = 0;
+   return ret;
+}
-- 
2.11.GIT



[PATCH v3 1/2] use HOST_NAME_MAX to size buffers for gethostname(2)

2017-04-18 Thread David Turner
From: René Scharfe <l@web.de>

POSIX limits the length of host names to HOST_NAME_MAX.  Export the
fallback definition from daemon.c and use this constant to make all
buffers used with gethostname(2) big enough for any possible result
and a terminating NUL.

Inspired-by: David Turner <dtur...@twosigma.com>
Signed-off-by: Rene Scharfe <l@web.de>
Signed-off-by: David Turner <dtur...@twosigma.com>
---
 builtin/gc.c   | 10 +++---
 builtin/receive-pack.c |  2 +-
 daemon.c   |  4 
 fetch-pack.c   |  2 +-
 git-compat-util.h  |  4 
 ident.c|  2 +-
 6 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index c2c61a57bb..4c4a36e2b5 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -238,7 +238,7 @@ static int need_to_gc(void)
 static const char *lock_repo_for_gc(int force, pid_t* ret_pid)
 {
static struct lock_file lock;
-   char my_host[128];
+   char my_host[HOST_NAME_MAX + 1];
struct strbuf sb = STRBUF_INIT;
struct stat st;
uintmax_t pid;
@@ -257,8 +257,12 @@ static const char *lock_repo_for_gc(int force, pid_t* ret_pid)
fd = hold_lock_file_for_update(&lock, pidfile_path,
   LOCK_DIE_ON_ERROR);
if (!force) {
-   static char locking_host[128];
+   static char locking_host[HOST_NAME_MAX + 1];
+   static char *scan_fmt;
int should_exit;
+
+   if (!scan_fmt)
+   scan_fmt = xstrfmt("%s %%%dc", "%"SCNuMAX, HOST_NAME_MAX);
fp = fopen(pidfile_path, "r");
memset(locking_host, 0, sizeof(locking_host));
should_exit =
@@ -274,7 +278,7 @@ static const char *lock_repo_for_gc(int force, pid_t* ret_pid)
 * running.
 */
time(NULL) - st.st_mtime <= 12 * 3600 &&
-   fscanf(fp, "%"SCNuMAX" %127c", &pid, locking_host) == 2 &&
+   fscanf(fp, scan_fmt, &pid, locking_host) == 2 &&
/* be gentle to concurrent "gc" on remote hosts */
(strcmp(locking_host, my_host) || !kill(pid, 0) || errno == EPERM);
if (fp != NULL)
diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c
index aca9c33d8d..2612efad3d 100644
--- a/builtin/receive-pack.c
+++ b/builtin/receive-pack.c
@@ -1695,7 +1695,7 @@ static const char *unpack(int err_fd, struct shallow_info *si)
if (status)
return "unpack-objects abnormal exit";
} else {
-   char hostname[256];
+   char hostname[HOST_NAME_MAX + 1];
 
argv_array_pushl(, "index-pack",
 "--stdin", hdr_arg, NULL);
diff --git a/daemon.c b/daemon.c
index 473e6b6b63..1503e1ed6f 100644
--- a/daemon.c
+++ b/daemon.c
@@ -4,10 +4,6 @@
 #include "strbuf.h"
 #include "string-list.h"
 
-#ifndef HOST_NAME_MAX
-#define HOST_NAME_MAX 256
-#endif
-
 #ifdef NO_INITGROUPS
 #define initgroups(x, y) (0) /* nothing */
 #endif
diff --git a/fetch-pack.c b/fetch-pack.c
index d07d85ce30..055f568775 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -802,7 +802,7 @@ static int get_pack(struct fetch_pack_args *args,
if (args->use_thin_pack)
argv_array_push(, "--fix-thin");
if (args->lock_pack || unpack_limit) {
-   char hostname[256];
+   char hostname[HOST_NAME_MAX + 1];
if (gethostname(hostname, sizeof(hostname)))
xsnprintf(hostname, sizeof(hostname), "localhost");
argv_array_pushf(,
diff --git a/git-compat-util.h b/git-compat-util.h
index 8a4a3f85e7..46f3abe401 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -884,6 +884,10 @@ static inline size_t xsize_t(off_t len)
 __attribute__((format (printf, 3, 4)))
 extern int xsnprintf(char *dst, size_t max, const char *fmt, ...);
 
+#ifndef HOST_NAME_MAX
+#define HOST_NAME_MAX 256
+#endif
+
 /* in ctype.c, for kwset users */
 extern const unsigned char tolower_trans_tbl[256];
 
diff --git a/ident.c b/ident.c
index c0364fe3a1..556851cf94 100644
--- a/ident.c
+++ b/ident.c
@@ -120,7 +120,7 @@ static int canonical_name(const char *host, struct strbuf *out)
 
 static void add_domainname(struct strbuf *out, int *is_bogus)
 {
-   char buf[1024];
+   char buf[HOST_NAME_MAX + 1];
 
if (gethostname(buf, sizeof(buf))) {
warning_errno("cannot get host name");
-- 
2.11.GIT



[PATCH v3 0/2] gethostbyname fixes

2017-04-18 Thread David Turner
This version includes Junio's fixup to René's patch, and then my patch
rebased on top of René's.  I thought it was easier to just send both
in one series, than to have Junio do a bunch of conflict resolution.
I think this still needs Junio's signoff on the first patch, since
I've added his code.

David Turner (1):
  xgethostname: handle long hostnames

René Scharfe (1):
  use HOST_NAME_MAX to size buffers for gethostname(2)

 builtin/gc.c   | 12 
 builtin/receive-pack.c |  4 ++--
 daemon.c   |  4 
 fetch-pack.c   |  4 ++--
 git-compat-util.h  |  6 ++
 ident.c|  4 ++--
 wrapper.c  | 13 +
 7 files changed, 33 insertions(+), 14 deletions(-)

-- 
2.11.GIT



RE: [PATCH v2] xgethostname: handle long hostnames

2017-04-18 Thread David Turner
> -Original Message-
> From: René Scharfe [mailto:l@web.de]
> Sent: Tuesday, April 18, 2017 12:08 PM
> To: Junio C Hamano <gits...@pobox.com>; David Turner
... 
> >> Of course, my_host is sized to HOST_NAME_MAX + 1 and we are comparing
> >> it with locking_host, so perhaps we'd need to take this version to
> >> size locking_host to also HOST_NAME_MAX + 1, and then scan with %255c
> >> (but then shouldn't we scan with %256c instead?  I am not sure where
> >> these +1 comes from).
> >
> > That is, something along this line...
> >
> >   builtin/gc.c | 6 +-
> >   1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/builtin/gc.c b/builtin/gc.c
> > index be75508292..4f85610d87 100644
> > --- a/builtin/gc.c
> > +++ b/builtin/gc.c
> > @@ -240,7 +240,11 @@ static const char *lock_repo_for_gc(int force, pid_t* ret_pid)
> >LOCK_DIE_ON_ERROR);
> > if (!force) {
> > static char locking_host[HOST_NAME_MAX + 1];
> > +   static char *scan_fmt;
> > int should_exit;
> > +
> > +   if (!scan_fmt)
> > +   scan_fmt = xstrfmt("%s %%%dc", "%"SCNuMAX, HOST_NAME_MAX);
> > fp = fopen(pidfile_path, "r");
> > memset(locking_host, 0, sizeof(locking_host));
> > should_exit =
> > @@ -256,7 +260,7 @@ static const char *lock_repo_for_gc(int force, pid_t* ret_pid)
> >  * running.
> >  */
> > time(NULL) - st.st_mtime <= 12 * 3600 &&
> > -   fscanf(fp, "%"SCNuMAX" %127c", &pid, locking_host) == 2 &&
> > +   fscanf(fp, scan_fmt, &pid, locking_host) == 2 &&
> > /* be gentle to concurrent "gc" on remote hosts */
> > (strcmp(locking_host, my_host) || !kill(pid, 0) || errno == EPERM);
> > if (fp != NULL)
> >
> 
> How important is it to scan the whole file in one call?  We could split it
> up like this and use a strbuf to handle host names of any length.  We need
> to be permissive here to allow machines with different values for
> HOST_NAME_MAX to work with the same file on a network file system, so this
> would have to be the first patch, right?

If the writer has the smaller HOST_NAME_MAX, this will work fine.  If the reader
has the smaller HOST_NAME_MAX, and the writer's actual value is too long,
then there's no way the strcmp would succeed anyway.  So I don't think we need
to worry about it.

> NB: That && cascade has enough meat for a whole function.

+1
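For illustration, extracting that cascade might look something like this
(a sketch with an invented name; scan_fmt is built as in the patch above,
and the logic mirrors the original && chain):

#include <errno.h>
#include <inttypes.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>

static int gc_lock_is_held(FILE *fp, const char *scan_fmt,
			   const char *my_host, char *locking_host)
{
	struct stat st;
	uintmax_t pid;

	if (!fp)
		return 0;
	if (fstat(fileno(fp), &st))
		return 0;
	/* a gc.pid older than 12 hours is considered stale */
	if (time(NULL) - st.st_mtime > 12 * 3600)
		return 0;
	if (fscanf(fp, scan_fmt, &pid, locking_host) != 2)
		return 0;
	/* be gentle to concurrent "gc" on remote hosts */
	if (!strcmp(locking_host, my_host) &&
	    kill((pid_t)pid, 0) && errno != EPERM)
		return 0;	/* same host, but the process is gone */
	return 1;
}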



RE: [PATCH] repack: respect gc.pid lock

2017-04-18 Thread David Turner


> -Original Message-
> From: Jeff King [mailto:p...@peff.net]
> Sent: Tuesday, April 18, 2017 1:20 PM
> To: David Turner <david.tur...@twosigma.com>
> Cc: git@vger.kernel.org; christian.cou...@gmail.com; mf...@codeaurora.org;
> jacob.kel...@gmail.com
> Subject: Re: [PATCH] repack: respect gc.pid lock
> 
> On Tue, Apr 18, 2017 at 05:16:52PM +, David Turner wrote:
> 
> > > -Original Message-
> > > From: Jeff King [mailto:p...@peff.net]
> > > Sent: Monday, April 17, 2017 11:42 PM
> > > To: David Turner <david.tur...@twosigma.com>
> > > Cc: git@vger.kernel.org; christian.cou...@gmail.com;
> > > mf...@codeaurora.org; jacob.kel...@gmail.com
> > > Subject: Re: [PATCH] repack: respect gc.pid lock
> > >
> > > On Mon, Apr 17, 2017 at 11:29:18PM +0000, David Turner wrote:
> > >
> > > > We saw this failure in the logs multiple times (with three
> > > > different shas, while a gc was running):
> > > > April 12, 2017 06:45 -> ERROR -> 'git -c repack.writeBitmaps=true
> > > > repack -A -d --pack-kept-objects' in [repo] failed:
> > > > fatal: packfile ./objects/pack/pack-[sha].pack cannot be accessed
> > > > Possibly some other repack was also running at the time as well.
> > > >
> > > > My colleague also saw it while manually doing gc (again while
> > > > repacks were likely to be running):
> > >
> > > This is sort of a side question, but...why are you running other
> > > repacks alongside git-gc? It seems like you ought to be doing one or
> > > the other.
> >
> > But actually, it would be kind of nice if git would help protect us from
> > doing this?
> 
> A lock can catch the racy cases where both run at the same time. But I
> think that even:
> 
>   git -c repack.writeBitmaps=true repack -Ad
>   [...wait...]
>   git gc
> 
> is questionable, because that gc will erase your bitmaps. How does git-gc know
> that it's doing a bad thing by repacking without bitmaps, and that you didn't
> simply change your configuration or want to get rid of them?

Sorry, the gc in Gitlab does keep bitmaps.  The one I quoted in a previous
message doesn't, because the person typing the command was just doing some
manual testing and I guess didn't realize that bitmaps were important.  Or
perhaps he knew that repack.writeBitmaps was already set in the config.

So given that the lock will catch the races, might it be a good idea (if
implemented to avoid locking on repack -d)?


RE: [PATCH] repack: respect gc.pid lock

2017-04-18 Thread David Turner
> -Original Message-
> From: Jeff King [mailto:p...@peff.net]
> Sent: Monday, April 17, 2017 11:42 PM
> To: David Turner <david.tur...@twosigma.com>
> Cc: git@vger.kernel.org; christian.cou...@gmail.com; mf...@codeaurora.org;
> jacob.kel...@gmail.com
> Subject: Re: [PATCH] repack: respect gc.pid lock
> 
> On Mon, Apr 17, 2017 at 11:29:18PM +0000, David Turner wrote:
> 
> > We saw this failure in the logs multiple times (with three different
> > shas, while a gc was running):
> > April 12, 2017 06:45 -> ERROR -> 'git -c repack.writeBitmaps=true repack
> > -A -d --pack-kept-objects' in [repo] failed:
> > fatal: packfile ./objects/pack/pack-[sha].pack cannot be accessed
> > Possibly some other repack was also running at the time as well.
> >
> > My colleague also saw it while manually doing gc (again while repacks
> > were likely to be running):
> 
> This is sort of a side question, but...why are you running other repacks
> alongside git-gc? It seems like you ought to be doing one or the other.

But actually, it would be kind of nice if git would help protect us from doing 
this?


RE: [PATCH] repack: respect gc.pid lock

2017-04-18 Thread David Turner
> -Original Message-
> From: Jeff King [mailto:p...@peff.net]
> Sent: Monday, April 17, 2017 11:42 PM
> To: David Turner <david.tur...@twosigma.com>
> Cc: git@vger.kernel.org; christian.cou...@gmail.com; mf...@codeaurora.org;
> jacob.kel...@gmail.com
> Subject: Re: [PATCH] repack: respect gc.pid lock
> 
> On Mon, Apr 17, 2017 at 11:29:18PM +0000, David Turner wrote:
> 
> > We saw this failure in the logs multiple times (with three different
> > shas, while a gc was running):
> > April 12, 2017 06:45 -> ERROR -> 'git -c repack.writeBitmaps=true repack
> > -A -d --pack-kept-objects' in [repo] failed:
> > fatal: packfile ./objects/pack/pack-[sha].pack cannot be accessed
> > Possibly some other repack was also running at the time as well.
> >
> > My colleague also saw it while manually doing gc (again while repacks
> > were likely to be running):
> 
> This is sort of a side question, but...why are you running other repacks
> alongside git-gc? It seems like you ought to be doing one or the other.
>
> I don't begrudge anybody with a complicated setup running their own set of gc
> commands, but I'd think you would want to do locking there, and disable auto-
> gc entirely. Otherwise you're going to get different results depending on who
> gc'd last.

That's what gitlab does, so you'll have to ask them why they do it that way.
From https://gitlab.com/gitlab-org/gitlab-ce/issues/30939#note_27487981
it looks like they may have intended to have a lock but not quite succeeded.

> > $ git gc --aggressive
> > Counting objects: 13800073, done.
> > Delta compression using up to 8 threads.
> > Compressing objects:  99% (11465846/11465971)
> > Compressing objects: 100% (11465971/11465971), done.
> > fatal: packfile [repo]/objects/pack/pack-[sha].pack cannot be accessed
> 
> OK, so this presumably happened during the writing phase. Which seems like the
> "a pack was closed, and we couldn't re-open it" problem we've seen before.
> 
> > We have a reasonable rlimit (64k soft limit), so that failure mode is
> > pretty unlikely.  I think we should have had 20 or so packs -- not tens
> > of thousands.
> > [...]
> > Do you have any idea why this would be happening other than the rlimit 
> > thing?
> 
> Yeah, that should be enough (you could double check the return of
> get_max_fd_limit() on your system if you wanted to be paranoid).
> 
> We also keep only a limited number of bytes mmap'd at one time. Normally we
> don't actually close packfiles when we release their mmap windows.
> But I think there is one path that might. When use_pack() maps a pack, if the
> entire pack fits in a single window, then we close it; this is due to 
> d131b7afe
> (sha1_file.c: Don't retain open fds on small packs, 2011-03-02).
> 
> But if we ever unmap that window, now we have no handle to the pack.
> Normally on a 64-bit system this wouldn't happen at all, since the default
> core.packedGitLimit is 8GB there.

Aha, I missed that limit while messing around with the code.  That must be it.

> So if you have a few small packs and one very large pack (over 8GB), I think
> this could trigger. We may do the small-pack thing for some of them, and then the
> large pack forces us to drop the mmaps for some of the others. When we go
> back to access the small pack, we find it's gone.
> 
> One solution would be to bump core.packedGitLimit to something much higher
> (it's an mmap, so we're really just chewing up address space; it's up to
> the OS to decide when to load pages from disk and when to drop them).
>
> The other alternative is to disable the small-pack closing from d131b7afe. It
> might need to be configurable, or perhaps auto-tuned based on the fd limit.
> Linux systems tend to have generous descriptor limits, but I'm not sure we can
> rely on that. OTOH, it seems like the code to close descriptors when needed
> would take care of things. So maybe we should just revert d131b7afe entirely.

I definitely remember running into fd limits when processing very large
numbers of packs at Twitter, but I don't recall the exact details.
Presumably, d131b7afe was supposed to help with this, but in fact, it did
not totally solve it.  Perhaps we were doing something funny.  Adjusting
the fd limits was the easy fix.

On 64-bit systems, I think core.packedGitLimit doesn't make a 
lot of sense. There is plenty of address space.  Why not use it?

For 32-bit systems, of course, address space is more precious.

I'll ask our git server administrator to adjust core.packedGitLimit
and turn repacks back on to see if that fixes the issue.
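(If it helps anyone following along: on a 64-bit host that is just a config
setting, using the 2**45 value discussed above -- this exact invocation is
an illustration, not a command from the thread.)

$ git config core.packedGitLimit 35184372088832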

> The final thing I'd ask is whether you might be on a networked filesystem that
> would f

RE: [PATCH] repack: respect gc.pid lock

2017-04-17 Thread David Turner

> -Original Message-
> From: Jeff King [mailto:p...@peff.net]
> Sent: Friday, April 14, 2017 3:34 PM
> To: David Turner <david.tur...@twosigma.com>
> Cc: git@vger.kernel.org; christian.cou...@gmail.com; mf...@codeaurora.org;
> jacob.kel...@gmail.com
> Subject: Re: [PATCH] repack: respect gc.pid lock
> 
> On Thu, Apr 13, 2017 at 04:27:12PM -0400, David Turner wrote:
> 
> > Git gc locks the repository (using a gc.pid file) so that other gcs
> > don't run concurrently. Make git repack respect this lock.
> >
> > Now repack, by default, will refuse to run at the same time as a gc.
> > This fixes a concurrency issue: a repack which deleted packs would
> > make a concurrent gc sad when its packs were deleted out from under
> > it.  The gc would fail with: "fatal: ./objects/pack/pack-$sha.pack
> > cannot be accessed".  Then it would die, probably leaving a large temp
> > pack hanging around.
> >
> > Git repack learns --no-lock, so that when run under git gc, it doesn't
> > attempt to manage the lock itself.
> 
> This also means that two repack invocations cannot run simultaneously, because
> they want to take the same lock.  But depending on the options, the two don't
> necessarily conflict. For example, two simultaneous incremental "git repack -d"
> invocations should be able to complete.
> 
> Do we know where the error message is coming from? I couldn't find the error
> message you've given above; grepping for "cannot be accessed"
> shows only error messages that would have "packfile" after the "fatal:".
> Is it a copy-paste error?

Yes, it is.  Sorry.

We saw this failure in the logs multiple  times (with three different
shas, while a gc was running):
April 12, 2017 06:45 -> ERROR -> 'git -c repack.writeBitmaps=true repack -A -d 
--pack-kept-objects' in [repo] failed:
fatal: packfile ./objects/pack/pack-[sha].pack cannot be accessed
Possibly some other repack was also running at the time as well.

My colleague also saw it while manually doing gc (again while 
repacks were likely to be running):
$ git gc --aggressive
Counting objects: 13800073, done.
Delta compression using up to 8 threads.
Compressing objects:  99% (11465846/11465971)   
Compressing objects: 100% (11465971/11465971), done.
fatal: packfile [repo]/objects/pack/pack-[sha].pack cannot be accessed

(yes, I know that --aggressive is usually not desirable)

> If that's the case, then it's the one in use_pack(). Do we know what
> program/operation is causing the error? Having a simultaneous gc delete a
> packfile is _supposed_ to work, through a combination of:
> 
>   1. Most sha1-access operations can re-scan the pack directory if they
>  find the packfile went away.
> 
>   2. The pack-objects run by a simultaneous repack is somewhat special
>  in that once it finds and commits to a copy of an object in a pack,
>  we need to use exactly that pack, because we record its offset,
>  delta representation, etc. Usually this works because we open and
>  mmap the packfile before making that commitment, and open packfiles
>  are only closed if you run out of file descriptors (which should
>  only happen when you have a huge number of packs).

We have a reasonable rlimit (64k soft limit), so that failure mode is pretty 
unlikely.  I think we should have had 20 or so packs -- not tens of thousands.

> So I'm worried that this repack lock is going to regress some other cases
> that run fine together. But I'm also worried that it's a band-aid over a
> more subtle problem. If pack-objects is not able to run alongside a gc, then
> you're also going to have problems serving fetches, and obviously you
> wouldn't want to take a lock there.

I see your point.  I don't know if it's pack-objects that's seeing this,
although maybe that's the only reasonable codepath.

I did some tracing through the code, and couldn't figure out how to trigger
that error message.  It appears in two places in the code, but only when the
pack is not initialized.  But the packs always seem to be set up by that
point in my test runs.  It's worth noting that I'm not testing on the gitlab
server; I'm testing on my laptop with a completely different repo.  But I've
tried various ways to repro this -- or even to get to a point where those
errors would have been thrown given a missing pack -- and I have not been
able to.

Do you have any idea why this would be happening other than the rlimit thing?



[PATCH v2] xgethostname: handle long hostnames

2017-04-17 Thread David Turner
If the full hostname doesn't fit in the buffer supplied to
gethostname, POSIX does not specify whether the buffer will be
null-terminated, so to be safe, we should do it ourselves.  Introduce
new function, xgethostname, which ensures that there is always a \0
at the end of the buffer.

Always use a consistent buffer size when calling xgethostname.  We use
HOST_NAME_MAX + 1.  When this is unavailable, we fall back to 256, a
decision we previously made in daemon.c, but which we now move to
git-compat-util.h so that it can be used everywhere.

Signed-off-by: David Turner <dtur...@twosigma.com>
---

This version addresses some comments by René Scharfe.  René, we're
still silently truncating, but as I noted in my message to Jonathan:
Looking at the users of this function, I think most would be happier
with a truncated buffer than an error:

* gc.c: used to see if we are the same machine as the machine that
locked the repo. it is unlikely that two machines have hostnames that
differ only in the 256th-or-above character, and it would be weird to
fail to gc just because our hostname is long.
* fetch-pack.c, receive-pack.c: similar to gc.c; the hostname is a note
in the .keep file for human consumption only
* ident.c: used to make up a fake email address. On my laptop,
gethostname returns "corey" (no domain part), so the email address is
not likely to be valid anyway.

 builtin/gc.c   |  6 +++---
 builtin/receive-pack.c |  4 ++--
 daemon.c   |  4 
 fetch-pack.c   |  4 ++--
 git-compat-util.h  |  6 ++
 ident.c|  4 ++--
 wrapper.c  | 13 +
 7 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index c2c61a57bb..5de0209c59 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -238,7 +238,7 @@ static int need_to_gc(void)
 static const char *lock_repo_for_gc(int force, pid_t* ret_pid)
 {
static struct lock_file lock;
-   char my_host[128];
+   char my_host[HOST_NAME_MAX + 1];
struct strbuf sb = STRBUF_INIT;
struct stat st;
uintmax_t pid;
@@ -250,14 +250,14 @@ static const char *lock_repo_for_gc(int force, pid_t* ret_pid)
/* already locked */
return NULL;
 
-   if (gethostname(my_host, sizeof(my_host)))
+   if (xgethostname(my_host, sizeof(my_host)))
xsnprintf(my_host, sizeof(my_host), "unknown");
 
pidfile_path = git_pathdup("gc.pid");
fd = hold_lock_file_for_update(&lock, pidfile_path,
   LOCK_DIE_ON_ERROR);
if (!force) {
-   static char locking_host[128];
+   static char locking_host[HOST_NAME_MAX + 1];
int should_exit;
fp = fopen(pidfile_path, "r");
memset(locking_host, 0, sizeof(locking_host));
diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c
index aca9c33d8d..0ca423a711 100644
--- a/builtin/receive-pack.c
+++ b/builtin/receive-pack.c
@@ -1695,12 +1695,12 @@ static const char *unpack(int err_fd, struct shallow_info *si)
if (status)
return "unpack-objects abnormal exit";
} else {
-   char hostname[256];
+   char hostname[HOST_NAME_MAX + 1];
 
argv_array_pushl(, "index-pack",
 "--stdin", hdr_arg, NULL);
 
-   if (gethostname(hostname, sizeof(hostname)))
+   if (xgethostname(hostname, sizeof(hostname)))
xsnprintf(hostname, sizeof(hostname), "localhost");
argv_array_pushf(,
 "--keep=receive-pack %"PRIuMAX" on %s",
diff --git a/daemon.c b/daemon.c
index 473e6b6b63..1503e1ed6f 100644
--- a/daemon.c
+++ b/daemon.c
@@ -4,10 +4,6 @@
 #include "strbuf.h"
 #include "string-list.h"
 
-#ifndef HOST_NAME_MAX
-#define HOST_NAME_MAX 256
-#endif
-
 #ifdef NO_INITGROUPS
 #define initgroups(x, y) (0) /* nothing */
 #endif
diff --git a/fetch-pack.c b/fetch-pack.c
index d07d85ce30..15d59a0440 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -802,8 +802,8 @@ static int get_pack(struct fetch_pack_args *args,
if (args->use_thin_pack)
argv_array_push(, "--fix-thin");
if (args->lock_pack || unpack_limit) {
-   char hostname[256];
-   if (gethostname(hostname, sizeof(hostname)))
+   char hostname[HOST_NAME_MAX + 1];
+   if (xgethostname(hostname, sizeof(hostname)))
xsnprintf(hostname, sizeof(hostname), "localhost");
argv_array_pushf(,
"--keep=fetch-pack %"PRIuMAX " on %s",
diff --git a/git-c

RE: [PATCH] xgethostname: handle long hostnames

2017-04-13 Thread David Turner
> -Original Message-
> From: Jonathan Nieder [mailto:jrnie...@gmail.com]
> Sent: Thursday, April 13, 2017 6:05 PM
> To: David Turner <david.tur...@twosigma.com>
> Cc: git@vger.kernel.org
> Subject: Re: [PATCH] xgethostname: handle long hostnames
> 
> Hi,
> 
> David Turner wrote:
> 
> > If the full hostname doesn't fit in the buffer supplied to
> > gethostname, POSIX does not specify whether the buffer will be
> > null-terminated, so to be safe, we should do it ourselves.
> [...]
> > +++ b/wrapper.c
> > @@ -655,3 +655,16 @@ void sleep_millisec(int millisec)
> > {
> > poll(NULL, 0, millisec);
> >  }
> > +
> > +int xgethostname(char *buf, size_t len)
> > +{
> > +   /*
> > +* If the full hostname doesn't fit in buf, POSIX does not
> > +* specify whether the buffer will be null-terminated, so to
> > +* be safe, do it ourselves.
> > +*/
> > +   int ret = gethostname(buf, len);
> > +   if (!ret)
> > +   buf[len - 1] = 0;
> > +   return ret;
> 
> I wonder if after null-terminating we would want to report this as an error,
> instead of silently using a truncated result.  I.e. something like
> 
> > +   if (!ret)
> > +   buf[len - 1] = 0;
> > +   if (strlen(buf) >= len - 1) {
> > +   errno = ENAMETOOLONG;
> > +   return -1;
> > +   }
>
> (or EINVAL --- either is equally descriptive).

Looking at the users of this function, I think most would be happier with a 
truncated buffer than an error:
gc.c: used to see if we are the same machine as the machine that locked the 
repo. Unlikely that two machines have hostnames that differ only in the 
256th-or-above character.
fetch-pack.c, receive-pack.c: similar to gc.c; the hostname is a note in the 
.keep file
ident.c: used to make up a fake email address. On my laptop, gethostname 
returns "corey" (no domain part), so the email address is not likely to be 
valid anyway.

> Also POSIX requires that hostnames are <= 255 bytes.  Maybe we can force the
> buffer to be large enough.

That is not how I read it.  I read the limit as HOST_NAME_MAX, which has a 
*minimum* value of 255, but which might be larger.

The existing hostname buffers are 128, 256, and 1024 bytes, so they're pretty 
arbitrary.  
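Since HOST_NAME_MAX is only a compile-time floor, the actual limit can also
be queried at runtime; a small illustrative sketch (not from the thread):

#include <limits.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* POSIX guarantees at least _POSIX_HOST_NAME_MAX (255);
	 * sysconf() reports the real limit, or -1 if indeterminate. */
	long max = sysconf(_SC_HOST_NAME_MAX);

	if (max < 0)
		max = _POSIX_HOST_NAME_MAX;
	printf("host name limit: %ld\n", max);
	return 0;
}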



[PATCH] repack: respect gc.pid lock

2017-04-13 Thread David Turner
Git gc locks the repository (using a gc.pid file) so that other gcs
don't run concurrently. Make git repack respect this lock.

Now repack, by default, will refuse to run at the same time as a gc.
This fixes a concurrency issue: a repack which deleted packs would
make a concurrent gc sad when its packs were deleted out from under
it.  The gc would fail with: "fatal: ./objects/pack/pack-$sha.pack
cannot be accessed".  Then it would die, probably leaving a large temp
pack hanging around.

Git repack learns --no-lock, so that when run under git gc, it doesn't
attempt to manage the lock itself.

Martin Fick suggested just moving the lock into git repack, but this
would leave parts of git gc (e.g. git prune) protected by only local
locks.  I worried that a prune (part of git gc) concurrent with a
repack could confuse the repack, so I decided to go with this
solution.

Signed-off-by: David Turner <dtur...@twosigma.com>
---
 Documentation/git-repack.txt |  5 +++
 Makefile |  1 +
 builtin/gc.c | 72 ++--
 builtin/repack.c | 13 +++
 repack.c | 88 
 repack.h |  8 
 t/t7700-repack.sh|  8 
 7 files changed, 127 insertions(+), 68 deletions(-)
 create mode 100644 repack.c
 create mode 100644 repack.h

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 26afe6ed54..b347ff5c62 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -143,6 +143,11 @@ other objects in that pack they already have locally.
being removed. In addition, any unreachable loose objects will
be packed (and their loose counterparts removed).
 
+--no-lock::
+   Do not lock the repository, and do not respect any existing lock.
+   Mostly useful for running repack within git gc.  Do not use this
+   unless you know what you are doing.
+
 Configuration
 -------------
 
diff --git a/Makefile b/Makefile
index 9b36068ac5..7095f03959 100644
--- a/Makefile
+++ b/Makefile
@@ -816,6 +816,7 @@ LIB_OBJS += refs/files-backend.o
 LIB_OBJS += refs/iterator.o
 LIB_OBJS += ref-filter.o
 LIB_OBJS += remote.o
+LIB_OBJS += repack.o
 LIB_OBJS += replace_object.o
 LIB_OBJS += rerere.o
 LIB_OBJS += resolve-undo.o
diff --git a/builtin/gc.c b/builtin/gc.c
index c2c61a57bb..9b9c27020b 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -18,6 +18,7 @@
 #include "sigchain.h"
 #include "argv-array.h"
 #include "commit.h"
+#include "repack.h"
 
 #define FAILED_RUN "failed to run %s"
 
@@ -45,7 +46,6 @@ static struct argv_array prune = ARGV_ARRAY_INIT;
 static struct argv_array prune_worktrees = ARGV_ARRAY_INIT;
 static struct argv_array rerere = ARGV_ARRAY_INIT;
 
-static struct tempfile pidfile;
 static struct lock_file log_lock;
 
 static struct string_list pack_garbage = STRING_LIST_INIT_DUP;
@@ -234,70 +234,6 @@ static int need_to_gc(void)
return 1;
 }
 
-/* return NULL on success, else hostname running the gc */
-static const char *lock_repo_for_gc(int force, pid_t* ret_pid)
-{
-   static struct lock_file lock;
-   char my_host[128];
-   struct strbuf sb = STRBUF_INIT;
-   struct stat st;
-   uintmax_t pid;
-   FILE *fp;
-   int fd;
-   char *pidfile_path;
-
-   if (is_tempfile_active())
-   /* already locked */
-   return NULL;
-
-   if (gethostname(my_host, sizeof(my_host)))
-   xsnprintf(my_host, sizeof(my_host), "unknown");
-
-   pidfile_path = git_pathdup("gc.pid");
-   fd = hold_lock_file_for_update(&lock, pidfile_path,
-  LOCK_DIE_ON_ERROR);
-   if (!force) {
-   static char locking_host[128];
-   int should_exit;
-   fp = fopen(pidfile_path, "r");
-   memset(locking_host, 0, sizeof(locking_host));
-   should_exit =
-   fp != NULL &&
-   !fstat(fileno(fp), &st) &&
-   /*
-* 12 hour limit is very generous as gc should
-* never take that long. On the other hand we
-* don't really need a strict limit here,
-* running gc --auto one day late is not a big
-* problem. --force can be used in manual gc
-* after the user verifies that no gc is
-* running.
-*/
-   time(NULL) - st.st_mtime <= 12 * 3600 &&
-   fscanf(fp, "%"SCNuMAX" %127c", &pid, locking_host) == 2 &&
-   /* be gentle to concurrent "gc" on remote hosts */
-   (strcmp

[PATCH] xgethostname: handle long hostnames

2017-04-13 Thread David Turner
If the full hostname doesn't fit in the buffer supplied to
gethostname, POSIX does not specify whether the buffer will be
null-terminated, so to be safe, we should do it ourselves.  Introduce
new function, xgethostname, which ensures that there is always a \0
at the end of the buffer.

Signed-off-by: David Turner <dtur...@twosigma.com>
---
 builtin/gc.c   |  2 +-
 builtin/receive-pack.c |  2 +-
 fetch-pack.c   |  2 +-
 git-compat-util.h  |  2 ++
 ident.c|  2 +-
 wrapper.c  | 13 +
 6 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index c2c61a57bb..5633483f56 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -250,7 +250,7 @@ static const char *lock_repo_for_gc(int force, pid_t* 
ret_pid)
/* already locked */
return NULL;
 
-   if (gethostname(my_host, sizeof(my_host)))
+   if (xgethostname(my_host, sizeof(my_host)))
xsnprintf(my_host, sizeof(my_host), "unknown");
 
pidfile_path = git_pathdup("gc.pid");
diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c
index aca9c33d8d..fb62a94bc3 100644
--- a/builtin/receive-pack.c
+++ b/builtin/receive-pack.c
@@ -1700,7 +1700,7 @@ static const char *unpack(int err_fd, struct shallow_info *si)
argv_array_pushl(, "index-pack",
 "--stdin", hdr_arg, NULL);
 
-   if (gethostname(hostname, sizeof(hostname)))
+   if (xgethostname(hostname, sizeof(hostname)))
xsnprintf(hostname, sizeof(hostname), "localhost");
argv_array_pushf(,
 "--keep=receive-pack %"PRIuMAX" on %s",
diff --git a/fetch-pack.c b/fetch-pack.c
index d07d85ce30..a899441c34 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -803,7 +803,7 @@ static int get_pack(struct fetch_pack_args *args,
argv_array_push(, "--fix-thin");
if (args->lock_pack || unpack_limit) {
char hostname[256];
-   if (gethostname(hostname, sizeof(hostname)))
+   if (xgethostname(hostname, sizeof(hostname)))
xsnprintf(hostname, sizeof(hostname), "localhost");
argv_array_pushf(,
"--keep=fetch-pack %"PRIuMAX " on %s",
diff --git a/git-compat-util.h b/git-compat-util.h
index 8a4a3f85e7..e49b65c235 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -884,6 +884,8 @@ static inline size_t xsize_t(off_t len)
 __attribute__((format (printf, 3, 4)))
 extern int xsnprintf(char *dst, size_t max, const char *fmt, ...);
 
+extern int xgethostname(char *buf, size_t len);
+
 /* in ctype.c, for kwset users */
 extern const unsigned char tolower_trans_tbl[256];
 
diff --git a/ident.c b/ident.c
index c0364fe3a1..7de9f47c41 100644
--- a/ident.c
+++ b/ident.c
@@ -122,7 +122,7 @@ static void add_domainname(struct strbuf *out, int *is_bogus)
 {
char buf[1024];
 
-   if (gethostname(buf, sizeof(buf))) {
+   if (xgethostname(buf, sizeof(buf))) {
warning_errno("cannot get host name");
strbuf_addstr(out, "(none)");
*is_bogus = 1;
diff --git a/wrapper.c b/wrapper.c
index 0542fc7582..d837417709 100644
--- a/wrapper.c
+++ b/wrapper.c
@@ -655,3 +655,16 @@ void sleep_millisec(int millisec)
 {
poll(NULL, 0, millisec);
 }
+
+int xgethostname(char *buf, size_t len)
+{
+   /*
+* If the full hostname doesn't fit in buf, POSIX does not
+* specify whether the buffer will be null-terminated, so to
+* be safe, do it ourselves.
+*/
+   int ret = gethostname(buf, len);
+   if (!ret)
+   buf[len - 1] = 0;
+   return ret;
+}
-- 
2.11.GIT



Re: Simultaneous gc and repack

2017-04-13 Thread David Turner
On Thu, 2017-04-13 at 12:36 -0600, Martin Fick wrote:
> On Thursday, April 13, 2017 02:28:07 PM David Turner wrote:
> > On Thu, 2017-04-13 at 12:08 -0600, Martin Fick wrote:
> > > On Thursday, April 13, 2017 11:03:14 AM Jacob Keller wrote:
> > > > On Thu, Apr 13, 2017 at 10:31 AM, David Turner
> > > > <nova...@novalis.org> wrote:
> > > > > Git gc locks the repository (using a gc.pid file) so
> > > > > that other gcs don't run concurrently. But git repack
> > > > > doesn't respect this lock, so it's possible to have a
> > > > > repack running at the same time as a gc.  This makes
> > > > > the gc sad when its packs are deleted out from under it
> > > > > with: "fatal: ./objects/pack/pack-$sha.pack cannot be
> > > > > accessed".  Then it dies, leaving a large temp file
> > > > > hanging around.
> > > > > 
> > > > > Does the following seem reasonable?
> > > > > 
> > > > > 1. Make git repack, by default, check for a gc.pid file
> > > > > (using the same logic as git gc itself does).
> > > > > 2. Provide a --force option to git repack to ignore said
> > > > > check.
> > > > > 3. Make git gc provide that --force option when
> > > > > it calls repack under its own lock.
> > > > 
> > > > What about just making the code that calls repack today
> > > > just call gc instead? I guess it's more work if you don't
> > > > strictly need it but..?
> > > 
> > > There are many scenarios where this does not achieve the
> > > same thing.  On the obvious side, gc does more than
> > > repacking, but on the other side, repacking has many
> > > switches that are not available via gc.
> > > 
> > > Would it make more sense to move the lock to repack
> > > instead of to gc?
> > 
> > Other gc operations might step on each other too (e.g.
> > packing refs). That would be less bad (and less common),
> > but it still seems worth avoiding.
> 
> Yes, but all of those operations need to be self protected 
> already, or they risk the same issue.

They are individually protected.  But I would rather fail fast.  And
I'm not sure what happens if git prune happens during a git repack -a;
might the repack fail if an object that it expects to pack is pruned
out from under it?




Re: Simultaneous gc and repack

2017-04-13 Thread David Turner
On Thu, 2017-04-13 at 12:08 -0600, Martin Fick wrote:
> On Thursday, April 13, 2017 11:03:14 AM Jacob Keller wrote:
> > On Thu, Apr 13, 2017 at 10:31 AM, David Turner
> > <nova...@novalis.org> wrote:
> > > Git gc locks the repository (using a gc.pid file) so
> > > that other gcs don't run concurrently. But git repack
> > > doesn't respect this lock, so it's possible to have a
> > > repack running at the same time as a gc.  This makes
> > > the gc sad when its packs are deleted out from under it
> > > with: "fatal: ./objects/pack/pack-$sha.pack cannot be
> > > accessed".  Then it dies, leaving a large temp file
> > > hanging around.
> > > 
> > > Does the following seem reasonable?
> > > 
> > > 1. Make git repack, by default, check for a gc.pid file
> > > (using the same logic as git gc itself does).
> > > 2. Provide a --force option to git repack to ignore said
> > > check.
> > > 3. Make git gc provide that --force option when
> > > it calls repack under its own lock.
> > 
> > What about just making the code that calls repack today
> > just call gc instead? I guess it's more work if you don't
> > strictly need it but..?
> 
> There are many scenarios where this does not achieve the 
> same thing.  On the obvious side, gc does more than 
> repacking, but on the other side, repacking has many 
> switches that are not available via gc.
> 
> Would it make more sense to move the lock to repack instead 
> of to gc?

Other gc operations might step on each other too (e.g. packing refs). 
That would be less bad (and less common), but it still seems worth
avoiding.


Simultaneous gc and repack

2017-04-13 Thread David Turner
Git gc locks the repository (using a gc.pid file) so that other gcs
don't run concurrently. But git repack doesn't respect this lock, so
it's possible to have a repack running at the same time as a gc.  This
makes the gc sad when its packs are deleted out from under it with:
"fatal: ./objects/pack/pack-$sha.pack cannot be accessed".  Then it
dies, leaving a large temp file hanging around.

Does the following seem reasonable?

1. Make git repack, by default, check for a gc.pid file (using the same
logic as git gc itself does).
2. Provide a --force option to git repack to ignore said check.
3. Make git gc provide that --force option when it calls repack under
its own lock.

This came up because Gitlab runs a repack after every N pushes and a gc
after every M commits, where M >> N.  Sometimes, when pushes come in
rapidly, the repack catches the still-running gc and the above badness
happens.  At least, that's my understanding: I don't run our Gitlab
servers, but I talked to the person who does and that's what he said.

Of course, Gitlab could do its own locking, but the general approach
seems like it would help other folks too.


RE: [PATCHv2.1] t6500: wait for detached auto gc at the end of the test script

2017-04-13 Thread David Turner
Thanks for fixing this!

> -Original Message-
> From: SZEDER Gábor [mailto:szeder@gmail.com]
> Sent: Thursday, April 13, 2017 6:32 AM
> To: Junio C Hamano <gits...@pobox.com>
> Cc: Jeff King <p...@peff.net>; Johannes Sixt <j...@kdbg.org>; David Turner
> <david.tur...@twosigma.com>; git@vger.kernel.org; SZEDER Gábor
> <szeder@gmail.com>
> Subject: [PATCHv2.1] t6500: wait for detached auto gc at the end of the test script
> 
> The last test in 't6500-gc', 'background auto gc does not run if gc.log is
> present and recent but does if it is old', added in
> a831c06a2 (gc: ignore old gc.log files, 2017-02-10), may sporadically
> trigger an error message from the test harness:
>
>   rm: cannot remove 'trash directory.t6500-gc/.git/objects': Directory not empty
>
> The test in question ends with executing an auto gc in the background, which
> occasionally takes so long that it's still running when 'test_done' is about
> to
> remove the trash directory.  This 'rm -rf $trash' in the foreground might race
> with the detached auto gc to create and delete files and directories, and gc
> might (re-)create a path that 'rm' already visited and removed, triggering the
> above error message when 'rm' attempts to remove its parent directory.
> 
> Commit bb05510e5 (t5510: run auto-gc in the foreground, 2016-05-01) fixed the
> same problem in a different test script by simply disallowing background gc.
> Unfortunately, what worked there is not applicable here, because the purpose
> of this test is to check the behavior of a detached auto gc.
> 
> Make sure that the test doesn't continue before the gc is finished in the
> background with a clever bit of shell trickery:
> 
>   - Open fd 9 in the shell, to be inherited by the background gc
> process, because our daemonize() only closes the standard fds 0,
> 1 and 2.
>   - Duplicate this fd 9 to stdout.
>   - Read 'git gc's stdout, and thus fd 9, through a command
> substitution.  We don't actually care about gc's output, but this
> construct has two useful properties:
>   - This read blocks until stdout or fd 9 are open.  While stdout is
> closed after the main gc process creates the background process
> and exits, fd 9 remains open until the backround process exits.
>   - The variable assignment from the command substitution gets its
> exit status from the command executed within the command
> substitution, i.e. a failing main gc process will cause the test
> to fail.
> 
> Note, that this fd trickery doesn't work on Windows, because due to MSYS
> limitations the git process only inherits the standard fds 0, 1 and 2 from 
> the shell.
> Luckily, it doesn't matter in this case, because on Windows daemonize() is
> basically a noop, thus 'git gc --auto' always runs in the foreground.
> 
> And since we can now continue the test reliably after the detached gc
> finished, check that there is only a single packfile left at the end, i.e.
> that the detached gc actually did what it was supposed to do.
> Also add a comment at the end of the test script to warn developers of future
> tests about this issue of long running detached gc processes.
> 
> Helped-by: Jeff King <p...@peff.net>
> Helped-by: Johannes Sixt <j...@kdbg.org>
> Signed-off-by: SZEDER Gábor <szeder@gmail.com>
> ---
> 
> Updated subject line, but otherwise the same as v2.
> 
>  t/t6500-gc.sh | 18 +-
>  1 file changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
> index 08de2e8ab..cc7acd101 100755
> --- a/t/t6500-gc.sh
> +++ b/t/t6500-gc.sh
> @@ -67,6 +67,16 @@ test_expect_success 'auto gc with too many loose objects does not attempt to cre
>   test_line_count = 2 new # There is one new pack and its .idx  '
> 
> +run_and_wait_for_auto_gc () {
> + # We read stdout from gc for the side effect of waiting until the
> + # background gc process exits, closing its fd 9.  Furthermore, the
> + # variable assignment from a command substitution preserves the
> + # exit status of the main gc process.
> + # Note: this fd trickery doesn't work on Windows, but there is no
> + # need to, because on Win the auto gc always runs in the foreground.
> + doesnt_matter=$(git gc --auto 9>&1)
> +}
> +
>  test_expect_success 'background auto gc does not run if gc.log is present and recent but does if it is old' '
>   test_commit foo &&
>   test_commit bar &&
> @@ -80,7 +90,13 @@ test_expect_success 'background auto gc does not run if gc.log is present and re
>   test-chmtime =-345600 .git/gc.log &&

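The pipe semantics this fd-9 trick relies on can be seen in a toy C program
(an illustration only, nothing here is from the patch): a read from a pipe
keeps blocking until every process holding a copy of the write end -- like
the detached gc holding fd 9 -- has closed it.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        char buf[16];

        if (pipe(fds) < 0)
            return 1;
        if (fork() == 0) {
            /* child: inherits fds[1], like the detached gc inherits
             * fd 9, and holds it open while it "works" */
            close(fds[0]);
            sleep(2);
            _exit(0);   /* finally closes fds[1] */
        }
        close(fds[1]);  /* parent gives up its own write end */
        while (read(fds[0], buf, sizeof(buf)) > 0)
            ;   /* returns 0 (EOF) only once the child exits */
        puts("child finished");
        return 0;
    }
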
[PATCH v6] http.postbuffer: allow full range of ssize_t values

2017-04-11 Thread David Turner
Unfortunately, in order to push some large repos where a server does
not support chunked encoding, the http postbuffer must sometimes
exceed two gigabytes.  On a 64-bit system, this is OK: we just malloc
a larger buffer.

This means that we need to use CURLOPT_POSTFIELDSIZE_LARGE to set the
buffer size.

Signed-off-by: David Turner <dtur...@twosigma.com>
---
 cache.h   |  1 +
 config.c  | 17 +
 http.c|  6 --
 http.h|  2 +-
 remote-curl.c | 12 +---
 5 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/cache.h b/cache.h
index fbdf7a815a..5e6747dbb4 100644
--- a/cache.h
+++ b/cache.h
@@ -1900,6 +1900,7 @@ extern int git_parse_maybe_bool(const char *);
 extern int git_config_int(const char *, const char *);
 extern int64_t git_config_int64(const char *, const char *);
 extern unsigned long git_config_ulong(const char *, const char *);
+extern ssize_t git_config_ssize_t(const char *, const char *);
 extern int git_config_bool_or_int(const char *, const char *, int *);
 extern int git_config_bool(const char *, const char *);
 extern int git_config_maybe_bool(const char *, const char *);
diff --git a/config.c b/config.c
index 1a4d85537b..aae6dcc34e 100644
--- a/config.c
+++ b/config.c
@@ -834,6 +834,15 @@ int git_parse_ulong(const char *value, unsigned long *ret)
return 1;
 }
 
+static int git_parse_ssize_t(const char *value, ssize_t *ret)
+{
+   intmax_t tmp;
+   if (!git_parse_signed(value, &tmp, maximum_signed_value_of_type(ssize_t)))
+   return 0;
+   *ret = tmp;
+   return 1;
+}
+
 NORETURN
 static void die_bad_number(const char *name, const char *value)
 {
@@ -892,6 +901,14 @@ unsigned long git_config_ulong(const char *name, const char *value)
return ret;
 }
 
+ssize_t git_config_ssize_t(const char *name, const char *value)
+{
+   ssize_t ret;
+   if (!git_parse_ssize_t(value, &ret))
+   die_bad_number(name, value);
+   return ret;
+}
+
 int git_parse_maybe_bool(const char *value)
 {
if (!value)
diff --git a/http.c b/http.c
index 96d84bbed3..7bccb36459 100644
--- a/http.c
+++ b/http.c
@@ -19,7 +19,7 @@ long int git_curl_ipresolve;
 #endif
 int active_requests;
 int http_is_verbose;
-size_t http_post_buffer = 16 * LARGE_PACKET_MAX;
+ssize_t http_post_buffer = 16 * LARGE_PACKET_MAX;
 
 #if LIBCURL_VERSION_NUM >= 0x070a06
 #define LIBCURL_CAN_HANDLE_AUTH_ANY
@@ -331,7 +331,9 @@ static int http_options(const char *var, const char *value, void *cb)
}
 
if (!strcmp("http.postbuffer", var)) {
-   http_post_buffer = git_config_int(var, value);
+   http_post_buffer = git_config_ssize_t(var, value);
+   if (http_post_buffer < 0)
+   warning(_("negative value for http.postbuffer; defaulting to %d"), LARGE_PACKET_MAX);
if (http_post_buffer < LARGE_PACKET_MAX)
http_post_buffer = LARGE_PACKET_MAX;
return 0;
diff --git a/http.h b/http.h
index 02bccb7b0c..f7bd3b26b0 100644
--- a/http.h
+++ b/http.h
@@ -111,7 +111,7 @@ extern struct curl_slist *http_copy_default_headers(void);
 extern long int git_curl_ipresolve;
 extern int active_requests;
 extern int http_is_verbose;
-extern size_t http_post_buffer;
+extern ssize_t http_post_buffer;
 extern struct credential http_auth;
 
 extern char curl_errorstr[CURL_ERROR_SIZE];
diff --git a/remote-curl.c b/remote-curl.c
index e953d06f66..cf171b1bc9 100644
--- a/remote-curl.c
+++ b/remote-curl.c
@@ -531,6 +531,12 @@ static int probe_rpc(struct rpc_state *rpc, struct slot_results *results)
return err;
 }
 
+static curl_off_t xcurl_off_t(ssize_t len) {
+   if (len > maximum_signed_value_of_type(curl_off_t))
+   die("cannot handle pushes this big");
+   return (curl_off_t) len;
+}
+
 static int post_rpc(struct rpc_state *rpc)
 {
struct active_request_slot *slot;
@@ -614,7 +620,7 @@ static int post_rpc(struct rpc_state *rpc)
 * and we just need to send it.
 */
curl_easy_setopt(slot->curl, CURLOPT_POSTFIELDS, gzip_body);
-   curl_easy_setopt(slot->curl, CURLOPT_POSTFIELDSIZE, gzip_size);
+   curl_easy_setopt(slot->curl, CURLOPT_POSTFIELDSIZE_LARGE, xcurl_off_t(gzip_size));
 
} else if (use_gzip && 1024 < rpc->len) {
/* The client backend isn't giving us compressed data so
@@ -645,7 +651,7 @@ static int post_rpc(struct rpc_state *rpc)
 
headers = curl_slist_append(headers, "Content-Encoding: gzip");
curl_easy_setopt(slot->curl, CURLOPT_POSTFIELDS, gzip_body);
-   curl_easy_setopt(slot->curl, CURLOPT_POSTFIELDSIZE, gzip_size);
+   curl_easy_setopt(slot->curl, CURLOPT_POSTFIELDSIZE_LARGE, xcurl_off_t(gzip_size));
 
if (options.verbosity >

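A note on why the _LARGE variant matters (a sketch; only the two curl option
names and their documented argument types are real, the helper itself is
hypothetical): CURLOPT_POSTFIELDSIZE takes a long, which is still 32 bits on
64-bit Windows, while CURLOPT_POSTFIELDSIZE_LARGE takes a 64-bit curl_off_t.

    #include <curl/curl.h>

    /* hypothetical wrapper; post_rpc() in the patch above does this inline */
    static void set_post_size(CURL *curl, curl_off_t size)
    {
        /* a plain long would truncate >2GB sizes on LLP64 platforms */
        curl_easy_setopt(curl, CURLOPT_POSTFIELDSIZE_LARGE, size);
    }
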
RE: [PATCH v3] http.postbuffer: allow full range of ssize_t values

2017-04-04 Thread David Turner
> -Original Message-
> From: Jeff King [mailto:p...@peff.net]
> Sent: Monday, April 3, 2017 10:02 PM
> To: David Turner <david.tur...@twosigma.com>
> Cc: git@vger.kernel.org
> Subject: Re: [PATCH v3] http.postbuffer: allow full range of ssize_t values
> 
> On Mon, Apr 03, 2017 at 05:03:49PM +, David Turner wrote:
> 
> > > > Unfortunately, in order to push some large repos, the http
> > > > postbuffer must sometimes exceed two gigabytes.  On a 64-bit system,
> this is OK:
> > > > we just malloc a larger buffer.
> > >
> > > I'm still not sure why a 2GB post-buffer is necessary. It sounds
> > > like something is broken in your setup. Large pushes should be sent
> chunked.
> > >
> > > I know broken setups are a fact of life, but this feels like a
> > > really hacky work-around.
> >
> > I'm not sure what other workaround I should use.  I guess I could do
> > multiple pushes, but only if individual objects are under the size
> > limit, and I'm not sure all of mine are (maybe I'll get lucky tho).  I
> > know that this is a configuration issue with gitlab:
> > https://gitlab.com/gitlab-org/gitlab-ce/issues/30315 but I don't know
> > when that will get fixed.  I could manually copy the repo to the
> > server and do a local push, but I don't know that I have the necessary
> > permissions to do that. Or I could do this, which would hopefully
> > actually solve the problem.
> 
> I didn't think we had gotten details on what was actually broken. Is it really
> that GitLab does not support chunked transfers? I know that's what the issue
> above says, but it sounds like it is just assuming that is the problem
> based on the recent messages to the list.

I agree that we don't know for sure what's actually broken.  I think probably
the GitLab bug tracker might be the better place to figure that out, unless
GitLab thinks they're hitting a bug in git.

> If that's really the issue, then OK. That's lame, but something the client
> has to work around. It seems like a pretty big gap, though (and one I'd
> think people would have hit before; the default post-buffer is only 1MB).
> Surely people routinely push more than that to GitLab servers?
> So I'm really wondering if there is something else going on here.
> 
> What does it look like when it fails? What does GIT_TRACE_CURL look like (or
> GIT_CURL_VERBOSE if your client is older, but remember to sanitize any auth
> lines)?

Unfortunately, we've already worked around the problem by pushing over SSH,
so I no longer have a failing case to examine. Last time I tried, I actually
did some hackery to create a push smaller than 2GB, but it still failed (this
time, with "502 Bad Gateway").  So, something is clearly weird in GitLab land.

I did see "Transfer-Encoding: chunked" in one of the responses from the
server, but not in the request (not sure if that's normal). The smaller push
had:
Content-Length: 1048908476

(For me to publish longer log traces requires a level of security review
which is probably too much of a hassle unless you think it will be really
useful).

> > > The ultimate fate of this number, though, is to be handed to:
> > >
> > >   curl_easy_setopt(slot->curl, CURLOPT_POSTFIELDSIZE, rpc->len);
> > >
> > > where the final argument is interpreted as a long. So I suspect that
> > > on 64-bit Windows, setting http.postbuffer to "3G" would cause some
> > > kind of weird error (either a truncated post or some internal curl
> > > error due to the negative size, depending on how curl handles it).
> >
> > Ah, so we would need to use CURLOPT_POSTFIELDSIZE_LARGE.  Will re-roll.
> 
> Ah, neat. I didn't even know about CURLOPT_POSTFIELDSIZE_LARGE, and
> thought we'd have to just limit 32-bit platforms. That's a much better
> solution.
> 
> > > I saw the earlier iteration used a size_t, but you switched it after
> > > the compiler
> > > (rightfully) complained about the signedness. But I'm not sure why
> > > we would want ssize_t here instead of just using git_parse_unsigned().
> >
> > It was originally signed.  I'm not sure why that was, but I figured it
> > would be simpler to save the extra bit just in case.
> 
> I think it was simply because git_config_int() is the generic "number"
> parser, and nobody ever thought about it.
> 
> In fact, passing a negative value to curl would be disastrous, as it would use
> strlen(). I don't think a negative value could ever get that far, though.
> It looks like the config code would silently turn a negative value into
> LARGE_PACKET_MAX.

I would stil

[PATCH v5] http.postbuffer: allow full range of ssize_t values

2017-04-03 Thread David Turner
Unfortunately, in order to push some large repos, the http postbuffer
must sometimes exceed two gigabytes.  On a 64-bit system, this is OK:
we just malloc a larger buffer.

This means that we need to use CURLOPT_POSTFIELDSIZE_LARGE to set the
buffer size.

Signed-off-by: David Turner <dtur...@twosigma.com>
---

V5 addresses comments from Torsten Boegershausen and Ramsay Jones.  Since
I don't have a 32-bit machine handy, it's difficult for me to check
for compiler warnings on 32-bit machines.  Hopefully my guess as
to the solution to Ramsay's issue will be correct.

 cache.h   |  1 +
 config.c  | 17 +
 http.c|  4 ++--
 http.h|  2 +-
 remote-curl.c | 12 +---
 5 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/cache.h b/cache.h
index fbdf7a815a..5e6747dbb4 100644
--- a/cache.h
+++ b/cache.h
@@ -1900,6 +1900,7 @@ extern int git_parse_maybe_bool(const char *);
 extern int git_config_int(const char *, const char *);
 extern int64_t git_config_int64(const char *, const char *);
 extern unsigned long git_config_ulong(const char *, const char *);
+extern ssize_t git_config_ssize_t(const char *, const char *);
 extern int git_config_bool_or_int(const char *, const char *, int *);
 extern int git_config_bool(const char *, const char *);
 extern int git_config_maybe_bool(const char *, const char *);
diff --git a/config.c b/config.c
index 1a4d85537b..aae6dcc34e 100644
--- a/config.c
+++ b/config.c
@@ -834,6 +834,15 @@ int git_parse_ulong(const char *value, unsigned long *ret)
return 1;
 }
 
+static int git_parse_ssize_t(const char *value, ssize_t *ret)
+{
+   intmax_t tmp;
+   if (!git_parse_signed(value, &tmp, maximum_signed_value_of_type(ssize_t)))
+   return 0;
+   *ret = tmp;
+   return 1;
+}
+
 NORETURN
 static void die_bad_number(const char *name, const char *value)
 {
@@ -892,6 +901,14 @@ unsigned long git_config_ulong(const char *name, const char *value)
return ret;
 }
 
+ssize_t git_config_ssize_t(const char *name, const char *value)
+{
+   ssize_t ret;
+   if (!git_parse_ssize_t(value, &ret))
+   die_bad_number(name, value);
+   return ret;
+}
+
 int git_parse_maybe_bool(const char *value)
 {
if (!value)
diff --git a/http.c b/http.c
index 96d84bbed3..22f8167ba2 100644
--- a/http.c
+++ b/http.c
@@ -19,7 +19,7 @@ long int git_curl_ipresolve;
 #endif
 int active_requests;
 int http_is_verbose;
-size_t http_post_buffer = 16 * LARGE_PACKET_MAX;
+ssize_t http_post_buffer = 16 * LARGE_PACKET_MAX;
 
 #if LIBCURL_VERSION_NUM >= 0x070a06
 #define LIBCURL_CAN_HANDLE_AUTH_ANY
@@ -331,7 +331,7 @@ static int http_options(const char *var, const char *value, void *cb)
}
 
if (!strcmp("http.postbuffer", var)) {
-   http_post_buffer = git_config_int(var, value);
+   http_post_buffer = git_config_ssize_t(var, value);
if (http_post_buffer < LARGE_PACKET_MAX)
http_post_buffer = LARGE_PACKET_MAX;
return 0;
diff --git a/http.h b/http.h
index 02bccb7b0c..f7bd3b26b0 100644
--- a/http.h
+++ b/http.h
@@ -111,7 +111,7 @@ extern struct curl_slist *http_copy_default_headers(void);
 extern long int git_curl_ipresolve;
 extern int active_requests;
 extern int http_is_verbose;
-extern size_t http_post_buffer;
+extern ssize_t http_post_buffer;
 extern struct credential http_auth;
 
 extern char curl_errorstr[CURL_ERROR_SIZE];
diff --git a/remote-curl.c b/remote-curl.c
index e953d06f66..b7b69e096a 100644
--- a/remote-curl.c
+++ b/remote-curl.c
@@ -531,6 +531,12 @@ static int probe_rpc(struct rpc_state *rpc, struct slot_results *results)
return err;
 }
 
+static curl_off_t xcurl_off_t(ssize_t len) {
+   if (len > (curl_off_t) len)
+   die("Cannot handle pushes this big");
+   return (curl_off_t) len;
+}
+
 static int post_rpc(struct rpc_state *rpc)
 {
struct active_request_slot *slot;
@@ -614,7 +620,7 @@ static int post_rpc(struct rpc_state *rpc)
 * and we just need to send it.
 */
curl_easy_setopt(slot->curl, CURLOPT_POSTFIELDS, gzip_body);
-   curl_easy_setopt(slot->curl, CURLOPT_POSTFIELDSIZE, gzip_size);
+   curl_easy_setopt(slot->curl, CURLOPT_POSTFIELDSIZE_LARGE, xcurl_off_t(gzip_size));
 
} else if (use_gzip && 1024 < rpc->len) {
/* The client backend isn't giving us compressed data so
@@ -645,7 +651,7 @@ static int post_rpc(struct rpc_state *rpc)
 
headers = curl_slist_append(headers, "Content-Encoding: gzip");
curl_easy_setopt(slot->curl, CURLOPT_POSTFIELDS, gzip_body);
-   curl_easy_setopt(slot->curl, CURLOPT_POSTFIELDSIZE, gzip_size);
+   curl_easy_setopt(slot->curl, CURLOPT_POSTFIELDSIZE_LARGE, xcurl_off_t(gzip_size));
 
if (op

RE: [PATCH v4] http.postbuffer: allow full range of ssize_t values

2017-04-03 Thread David Turner


> -Original Message-
> From: Ramsay Jones [mailto:ram...@ramsayjones.plus.com]
> Sent: Monday, April 3, 2017 6:24 PM
> To: David Turner <david.tur...@twosigma.com>; git@vger.kernel.org
> Subject: Re: [PATCH v4] http.postbuffer: allow full range of ssize_t values
> 
> 
> 
> On 03/04/17 18:30, David Turner wrote:
> > Unfortunately, in order to push some large repos, the http postbuffer
> > must sometimes exceed two gigabytes.  On a 64-bit system, this is OK:
> > we just malloc a larger buffer.
> >
> > This means that we need to use CURLOPT_POSTFIELDSIZE_LARGE to set
> the
> > buffer size.
> >
> > Signed-off-by: David Turner <dtur...@twosigma.com>
> > ---
> >  cache.h   |  1 +
> >  config.c  | 17 +
> >  http.c|  4 ++--
> >  http.h|  2 +-
> >  remote-curl.c |  6 +++---
> >  5 files changed, 24 insertions(+), 6 deletions(-)
> >
> > diff --git a/cache.h b/cache.h
> > index fbdf7a815a..5e6747dbb4 100644
> > --- a/cache.h
> > +++ b/cache.h
> > @@ -1900,6 +1900,7 @@ extern int git_parse_maybe_bool(const char *);
> >  extern int git_config_int(const char *, const char *);
> >  extern int64_t git_config_int64(const char *, const char *);
> >  extern unsigned long git_config_ulong(const char *, const char *);
> > +extern ssize_t git_config_ssize_t(const char *, const char *);
> >  extern int git_config_bool_or_int(const char *, const char *, int *);
> >  extern int git_config_bool(const char *, const char *);
> >  extern int git_config_maybe_bool(const char *, const char *);
> > diff --git a/config.c b/config.c
> > index 1a4d85537b..de5b155a4e 100644
> > --- a/config.c
> > +++ b/config.c
> > @@ -834,6 +834,15 @@ int git_parse_ulong(const char *value, unsigned long *ret)
> > return 1;
> >  }
> >
> > +static int git_parse_ssize_t(const char *value, ssize_t *ret)
> > +{
> > +   ssize_t tmp;
> > +   if (!git_parse_signed(value, &tmp, maximum_signed_value_of_type(ssize_t)))
> > +   return 0;
> > +   *ret = tmp;
> > +   return 1;
> > +}
> > +
> 
> The previous version of this patch causes gcc to complain on a Linux 32-bit
> build, like so:
> 
> CC config.o
> config.c: In function ‘git_parse_ssize_t’:
> config.c:840:31: warning: passing argument 2 of ‘git_parse_signed’ from
> incompatible pointer type [-Wincompatible-pointer-types]
>   if (!git_parse_signed(value, &tmp,
> maximum_signed_value_of_type(ssize_t)))
>^
> config.c:753:12: note: expected ‘intmax_t * {aka long long int *}’ but
> argument is of type ‘ssize_t * {aka int *}’
>  static int git_parse_signed(const char *value, intmax_t *ret, intmax_t max)
> ^
> 
> ... and I don't think this version would be any different (though I haven't
> tried).

So I guess that means tmp needs to be an intmax_t.

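For reference, the helper with tmp widened to intmax_t -- the shape it takes
in the v5 and v6 patches earlier in this archive -- looks like this:

    static int git_parse_ssize_t(const char *value, ssize_t *ret)
    {
        intmax_t tmp;
        if (!git_parse_signed(value, &tmp, maximum_signed_value_of_type(ssize_t)))
            return 0;
        *ret = tmp;
        return 1;
    }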

[PATCH v4] http.postbuffer: allow full range of ssize_t values

2017-04-03 Thread David Turner
Unfortunately, in order to push some large repos, the http postbuffer
must sometimes exceed two gigabytes.  On a 64-bit system, this is OK:
we just malloc a larger buffer.

This means that we need to use CURLOPT_POSTFIELDSIZE_LARGE to set the
buffer size.

Signed-off-by: David Turner <dtur...@twosigma.com>
---
 cache.h   |  1 +
 config.c  | 17 +
 http.c|  4 ++--
 http.h|  2 +-
 remote-curl.c |  6 +++---
 5 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/cache.h b/cache.h
index fbdf7a815a..5e6747dbb4 100644
--- a/cache.h
+++ b/cache.h
@@ -1900,6 +1900,7 @@ extern int git_parse_maybe_bool(const char *);
 extern int git_config_int(const char *, const char *);
 extern int64_t git_config_int64(const char *, const char *);
 extern unsigned long git_config_ulong(const char *, const char *);
+extern ssize_t git_config_ssize_t(const char *, const char *);
 extern int git_config_bool_or_int(const char *, const char *, int *);
 extern int git_config_bool(const char *, const char *);
 extern int git_config_maybe_bool(const char *, const char *);
diff --git a/config.c b/config.c
index 1a4d85537b..de5b155a4e 100644
--- a/config.c
+++ b/config.c
@@ -834,6 +834,15 @@ int git_parse_ulong(const char *value, unsigned long *ret)
return 1;
 }
 
+static int git_parse_ssize_t(const char *value, ssize_t *ret)
+{
+   ssize_t tmp;
+   if (!git_parse_signed(value, &tmp, maximum_signed_value_of_type(ssize_t)))
+   return 0;
+   *ret = tmp;
+   return 1;
+}
+
 NORETURN
 static void die_bad_number(const char *name, const char *value)
 {
@@ -892,6 +901,14 @@ unsigned long git_config_ulong(const char *name, const char *value)
return ret;
 }
 
+ssize_t git_config_ssize_t(const char *name, const char *value)
+{
+   ssize_t ret;
+   if (!git_parse_ssize_t(value, &ret))
+   die_bad_number(name, value);
+   return ret;
+}
+
 int git_parse_maybe_bool(const char *value)
 {
if (!value)
diff --git a/http.c b/http.c
index 96d84bbed3..22f8167ba2 100644
--- a/http.c
+++ b/http.c
@@ -19,7 +19,7 @@ long int git_curl_ipresolve;
 #endif
 int active_requests;
 int http_is_verbose;
-size_t http_post_buffer = 16 * LARGE_PACKET_MAX;
+ssize_t http_post_buffer = 16 * LARGE_PACKET_MAX;
 
 #if LIBCURL_VERSION_NUM >= 0x070a06
 #define LIBCURL_CAN_HANDLE_AUTH_ANY
@@ -331,7 +331,7 @@ static int http_options(const char *var, const char *value, void *cb)
}
 
if (!strcmp("http.postbuffer", var)) {
-   http_post_buffer = git_config_int(var, value);
+   http_post_buffer = git_config_ssize_t(var, value);
if (http_post_buffer < LARGE_PACKET_MAX)
http_post_buffer = LARGE_PACKET_MAX;
return 0;
diff --git a/http.h b/http.h
index 02bccb7b0c..f7bd3b26b0 100644
--- a/http.h
+++ b/http.h
@@ -111,7 +111,7 @@ extern struct curl_slist *http_copy_default_headers(void);
 extern long int git_curl_ipresolve;
 extern int active_requests;
 extern int http_is_verbose;
-extern size_t http_post_buffer;
+extern ssize_t http_post_buffer;
 extern struct credential http_auth;
 
 extern char curl_errorstr[CURL_ERROR_SIZE];
diff --git a/remote-curl.c b/remote-curl.c
index e953d06f66..69b4d71e4c 100644
--- a/remote-curl.c
+++ b/remote-curl.c
@@ -614,7 +614,7 @@ static int post_rpc(struct rpc_state *rpc)
 * and we just need to send it.
 */
curl_easy_setopt(slot->curl, CURLOPT_POSTFIELDS, gzip_body);
-   curl_easy_setopt(slot->curl, CURLOPT_POSTFIELDSIZE, gzip_size);
+   curl_easy_setopt(slot->curl, CURLOPT_POSTFIELDSIZE_LARGE, (curl_off_t) gzip_size);
 
} else if (use_gzip && 1024 < rpc->len) {
/* The client backend isn't giving us compressed data so
@@ -645,7 +645,7 @@ static int post_rpc(struct rpc_state *rpc)
 
headers = curl_slist_append(headers, "Content-Encoding: gzip");
curl_easy_setopt(slot->curl, CURLOPT_POSTFIELDS, gzip_body);
-   curl_easy_setopt(slot->curl, CURLOPT_POSTFIELDSIZE, gzip_size);
+   curl_easy_setopt(slot->curl, CURLOPT_POSTFIELDSIZE_LARGE, (curl_off_t) gzip_size);
 
if (options.verbosity > 1) {
fprintf(stderr, "POST %s (gzip %lu to %lu bytes)\n",
@@ -658,7 +658,7 @@ static int post_rpc(struct rpc_state *rpc)
 * more normal Content-Length approach.
 */
curl_easy_setopt(slot->curl, CURLOPT_POSTFIELDS, rpc->buf);
-   curl_easy_setopt(slot->curl, CURLOPT_POSTFIELDSIZE, rpc->len);
+   curl_easy_setopt(slot->curl, CURLOPT_POSTFIELDSIZE_LARGE, (curl_off_t) rpc->len);
if (options.verbosity > 1) {
fprintf(stderr, "POST %s (%lu bytes)\n",
rpc->service_name, (unsigned long)rpc->len);
-- 
2.11.GIT



RE: [PATCH v3] http.postbuffer: allow full range of ssize_t values

2017-04-03 Thread David Turner


> -Original Message-
> From: Jeff King [mailto:p...@peff.net]
> Sent: Saturday, April 1, 2017 2:01 AM
> To: David Turner <david.tur...@twosigma.com>
> Cc: git@vger.kernel.org
> Subject: Re: [PATCH v3] http.postbuffer: allow full range of ssize_t values
> 
> On Fri, Mar 31, 2017 at 01:26:31PM -0400, David Turner wrote:
> 
> > Unfortunately, in order to push some large repos, the http postbuffer
> > must sometimes exceed two gigabytes.  On a 64-bit system, this is OK:
> > we just malloc a larger buffer.
> 
> I'm still not sure why a 2GB post-buffer is necessary. It sounds like
> something is broken in your setup. Large pushes should be sent chunked.
> 
> I know broken setups are a fact of life, but this feels like a really hacky
> work-around.

I'm not sure what other workaround I should use.  I guess I could do multiple 
pushes, but only if individual objects are under the size limit, and I'm not 
sure all of mine are (maybe I'll get lucky tho).  I know that this is a 
configuration issue with gitlab: 
https://gitlab.com/gitlab-org/gitlab-ce/issues/30315 but I don't know when that 
will get fixed.  I could manually copy the repo to the server and do a local 
push, but I don't know that I have the necessary permissions to do that. Or I 
could do this, which would hopefully actually solve the problem.

> > diff --git a/cache.h b/cache.h
> > index fbdf7a815a..5e6747dbb4 100644
> > --- a/cache.h
> > +++ b/cache.h
> > @@ -1900,6 +1900,7 @@ extern int git_parse_maybe_bool(const char *);
> >  extern int git_config_int(const char *, const char *);
> >  extern int64_t git_config_int64(const char *, const char *);
> >  extern unsigned long git_config_ulong(const char *, const char *);
> > +extern ssize_t git_config_ssize_t(const char *, const char *);
> 
> For most of our other "big" values we use git_config_ulong(). E.g.,
> core.bigfilethreshold. I suspect that would be fine for your purposes here,
> though using size_t is more correct (on Windows "unsigned long" is still only
> 32 bits, even on 64-bit systems).
> 
> The ultimate fate of this number, though, is to be handed to:
> 
>   curl_easy_setopt(slot->curl, CURLOPT_POSTFIELDSIZE, rpc->len);
> 
> where the final argument is interpreted as a long. So I suspect that on 64-bit
> Windows, setting http.postbuffer to "3G" would cause some kind of weird
> error (either a truncated post or some internal curl error due to the negative
> size, depending on how curl handles it).

Ah, so we would need to use CURLOPT_POSTFIELDSIZE_LARGE.  Will re-roll.

> 
> I think a "git_config_long()" would probably do everything correctly.
> 
> > > +static int git_parse_ssize_t(const char *value, ssize_t *ret)
> > > +{
> > +   ssize_t tmp;
> > > +   if (!git_parse_signed(value, &tmp, maximum_signed_value_of_type(ssize_t)))
> > +   return 0;
> > +   *ret = tmp;
> > +   return 1;
> > +}
> 
> I saw the earlier iteration used a size_t, but you switched it after the
> compiler (rightfully) complained about the signedness. But I'm not sure why
> we would want ssize_t here instead of just using git_parse_unsigned().

It was originally signed.  I'm not sure why that was, but I figured it would be 
simpler to save the extra bit just in case.

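The LLP64 point above is easy to verify with a throwaway program (not from
the thread); on 64-bit Windows it prints 4 and 8, while on LP64 systems such
as 64-bit Linux it prints 8 and 8:

    #include <stdio.h>

    int main(void)
    {
        /* long stays 32-bit on LLP64; size_t follows the pointer size */
        printf("long: %zu, size_t: %zu\n", sizeof(long), sizeof(size_t));
        return 0;
    }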

[PATCH v3] http.postbuffer: allow full range of ssize_t values

2017-03-31 Thread David Turner
Unfortunately, in order to push some large repos, the http postbuffer
must sometimes exceed two gigabytes.  On a 64-bit system, this is OK:
we just malloc a larger buffer.

Signed-off-by: David Turner <dtur...@twosigma.com>
---

This version fixes the definition of git_parse_ssize_t to return int.

Sorry for the sloppiness.

 cache.h  |  1 +
 config.c | 17 +
 http.c   |  4 ++--
 http.h   |  2 +-
 4 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/cache.h b/cache.h
index fbdf7a815a..5e6747dbb4 100644
--- a/cache.h
+++ b/cache.h
@@ -1900,6 +1900,7 @@ extern int git_parse_maybe_bool(const char *);
 extern int git_config_int(const char *, const char *);
 extern int64_t git_config_int64(const char *, const char *);
 extern unsigned long git_config_ulong(const char *, const char *);
+extern ssize_t git_config_ssize_t(const char *, const char *);
 extern int git_config_bool_or_int(const char *, const char *, int *);
 extern int git_config_bool(const char *, const char *);
 extern int git_config_maybe_bool(const char *, const char *);
diff --git a/config.c b/config.c
index 1a4d85537b..de5b155a4e 100644
--- a/config.c
+++ b/config.c
@@ -834,6 +834,15 @@ int git_parse_ulong(const char *value, unsigned long *ret)
return 1;
 }
 
+static int git_parse_ssize_t(const char *value, ssize_t *ret)
+{
+   ssize_t tmp;
+   if (!git_parse_signed(value, &tmp, maximum_signed_value_of_type(ssize_t)))
+   return 0;
+   *ret = tmp;
+   return 1;
+}
+
 NORETURN
 static void die_bad_number(const char *name, const char *value)
 {
@@ -892,6 +901,14 @@ unsigned long git_config_ulong(const char *name, const char *value)
return ret;
 }
 
+ssize_t git_config_ssize_t(const char *name, const char *value)
+{
+   ssize_t ret;
+   if (!git_parse_ssize_t(value, &ret))
+   die_bad_number(name, value);
+   return ret;
+}
+
 int git_parse_maybe_bool(const char *value)
 {
if (!value)
diff --git a/http.c b/http.c
index 96d84bbed3..22f8167ba2 100644
--- a/http.c
+++ b/http.c
@@ -19,7 +19,7 @@ long int git_curl_ipresolve;
 #endif
 int active_requests;
 int http_is_verbose;
-size_t http_post_buffer = 16 * LARGE_PACKET_MAX;
+ssize_t http_post_buffer = 16 * LARGE_PACKET_MAX;
 
 #if LIBCURL_VERSION_NUM >= 0x070a06
 #define LIBCURL_CAN_HANDLE_AUTH_ANY
@@ -331,7 +331,7 @@ static int http_options(const char *var, const char *value, void *cb)
}
 
if (!strcmp("http.postbuffer", var)) {
-   http_post_buffer = git_config_int(var, value);
+   http_post_buffer = git_config_ssize_t(var, value);
if (http_post_buffer < LARGE_PACKET_MAX)
http_post_buffer = LARGE_PACKET_MAX;
return 0;
diff --git a/http.h b/http.h
index 02bccb7b0c..f7bd3b26b0 100644
--- a/http.h
+++ b/http.h
@@ -111,7 +111,7 @@ extern struct curl_slist *http_copy_default_headers(void);
 extern long int git_curl_ipresolve;
 extern int active_requests;
 extern int http_is_verbose;
-extern size_t http_post_buffer;
+extern ssize_t http_post_buffer;
 extern struct credential http_auth;
 
 extern char curl_errorstr[CURL_ERROR_SIZE];
-- 
2.11.GIT



RE: [PATCH] http.postbuffer: make a size_t

2017-03-31 Thread David Turner
> -Original Message-
> From: Torsten Bögershausen [mailto:tbo...@web.de]
> Sent: Friday, March 31, 2017 12:22 AM
> To: David Turner <david.tur...@twosigma.com>; git@vger.kernel.org
> Subject: Re: [PATCH] http.postbuffer: make a size_t
> 
> 
> 
> On 30/03/17 22:29, David Turner wrote:
> > Unfortunately, in order to push some large repos, the http postbuffer
> > must sometimes exceed two gigabytes.  On a 64-bit system, this is OK:
> > we just malloc a larger buffer.
> >
> > Signed-off-by: David Turner <dtur...@twosigma.com>
> > ---
> >  cache.h  |  1 +
> >  config.c | 17 +
> >  http.c   |  2 +-
> >  3 files changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/cache.h b/cache.h
> > index fbdf7a815a..a8c1b65db0 100644
> > --- a/cache.h
> > +++ b/cache.h
> > @@ -1900,6 +1900,7 @@ extern int git_parse_maybe_bool(const char *);
> >  extern int git_config_int(const char *, const char *);
> >  extern int64_t git_config_int64(const char *, const char *);
> >  extern unsigned long git_config_ulong(const char *, const char *);
> > +extern size_t git_config_size_t(const char *, const char *);
> >  extern int git_config_bool_or_int(const char *, const char *, int *);
> >  extern int git_config_bool(const char *, const char *);
> >  extern int git_config_maybe_bool(const char *, const char *);
> > diff --git a/config.c b/config.c
> > index 1a4d85537b..7b706cf27a 100644
> > --- a/config.c
> > +++ b/config.c
> > @@ -834,6 +834,15 @@ int git_parse_ulong(const char *value, unsigned long *ret)
> > return 1;
> >  }
> >
> > +static size_t git_parse_size_t(const char *value, unsigned long *ret)
> > +{
> > +   size_t tmp;
> > +   if (!git_parse_signed(value, &tmp, maximum_unsigned_value_of_type(size_t)))
> > +   return 0;
> > +   *ret = tmp;
> > +   return 1;
> > +}
> What is the return value here ?
> Isn't it a size_t we want ?

Yeah, that should return an int, since it's just "parsed" vs "unparsed".

> (There was a recent discussion about "unsigned long" vs "size_t", which
>   are the same on many systems, but not under Win64) Would the following
> work ?
> 
> static int git_parse_size_t(const char *value, size_t *ret) {
>   if (!git_parse_signed(value, ret, maximum_unsigned_value_of_type(size_t)))
>   return 0;
>   return 1;
> }
> 
> []
> > +size_t git_config_size_t(const char *name, const char *value) {
> > +   unsigned long ret;
> > +   if (!git_parse_size_t(value, &ret))
> > +   die_bad_number(name, value);
> > +   return ret;
> > +}
> > +
> Same here:
> size_t git_config_size_t(const char *name, const char *value) {
>   size_t ret;
>   if (!git_parse_size_t(value, &ret))
>   die_bad_number(name, value);
>   return ret;
> }

In v2 I decided to make this an ssize_t, since it was previously a signed int.
I know we're not using negative values right now, but since nobody has 2**63
bytes of RAM anyway, it's probably safe to keep it signed to keep our options
open.


[PATCH v2] http.postbuffer: allow full range of ssize_t values

2017-03-30 Thread David Turner
Unfortunately, in order to push some large repos, the http postbuffer
must sometimes exceed two gigabytes.  On a 64-bit system, this is OK:
we just malloc a larger buffer.

Signed-off-by: David Turner <dtur...@twosigma.com>
---
 cache.h  |  1 +
 config.c | 17 +
 http.c   |  4 ++--
 http.h   |  2 +-
 4 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/cache.h b/cache.h
index fbdf7a815a..5e6747dbb4 100644
--- a/cache.h
+++ b/cache.h
@@ -1900,6 +1900,7 @@ extern int git_parse_maybe_bool(const char *);
 extern int git_config_int(const char *, const char *);
 extern int64_t git_config_int64(const char *, const char *);
 extern unsigned long git_config_ulong(const char *, const char *);
+extern ssize_t git_config_ssize_t(const char *, const char *);
 extern int git_config_bool_or_int(const char *, const char *, int *);
 extern int git_config_bool(const char *, const char *);
 extern int git_config_maybe_bool(const char *, const char *);
diff --git a/config.c b/config.c
index 1a4d85537b..5ead880d5b 100644
--- a/config.c
+++ b/config.c
@@ -834,6 +834,15 @@ int git_parse_ulong(const char *value, unsigned long *ret)
return 1;
 }
 
+static ssize_t git_parse_ssize_t(const char *value, ssize_t *ret)
+{
+   ssize_t tmp;
+   if (!git_parse_signed(value, &tmp, maximum_signed_value_of_type(ssize_t)))
+   return 0;
+   *ret = tmp;
+   return 1;
+}
+
 NORETURN
 static void die_bad_number(const char *name, const char *value)
 {
@@ -892,6 +901,14 @@ unsigned long git_config_ulong(const char *name, const char *value)
return ret;
 }
 
+ssize_t git_config_ssize_t(const char *name, const char *value)
+{
+   ssize_t ret;
+   if (!git_parse_ssize_t(value, &ret))
+   die_bad_number(name, value);
+   return ret;
+}
+
 int git_parse_maybe_bool(const char *value)
 {
if (!value)
diff --git a/http.c b/http.c
index 96d84bbed3..22f8167ba2 100644
--- a/http.c
+++ b/http.c
@@ -19,7 +19,7 @@ long int git_curl_ipresolve;
 #endif
 int active_requests;
 int http_is_verbose;
-size_t http_post_buffer = 16 * LARGE_PACKET_MAX;
+ssize_t http_post_buffer = 16 * LARGE_PACKET_MAX;
 
 #if LIBCURL_VERSION_NUM >= 0x070a06
 #define LIBCURL_CAN_HANDLE_AUTH_ANY
@@ -331,7 +331,7 @@ static int http_options(const char *var, const char *value, void *cb)
}
 
if (!strcmp("http.postbuffer", var)) {
-   http_post_buffer = git_config_int(var, value);
+   http_post_buffer = git_config_ssize_t(var, value);
if (http_post_buffer < LARGE_PACKET_MAX)
http_post_buffer = LARGE_PACKET_MAX;
return 0;
diff --git a/http.h b/http.h
index 02bccb7b0c..f7bd3b26b0 100644
--- a/http.h
+++ b/http.h
@@ -111,7 +111,7 @@ extern struct curl_slist *http_copy_default_headers(void);
 extern long int git_curl_ipresolve;
 extern int active_requests;
 extern int http_is_verbose;
-extern size_t http_post_buffer;
+extern ssize_t http_post_buffer;
 extern struct credential http_auth;
 
 extern char curl_errorstr[CURL_ERROR_SIZE];
-- 
2.11.GIT



RE: [PATCH] http.postbuffer: make a size_t

2017-03-30 Thread David Turner
GitLab.  I can't speak to our particular configuration of it -- but if you have 
a specific question about what the config is, I can ask our gitlab admins.

> -Original Message-
> From: Shawn Pearce [mailto:spea...@spearce.org]
> Sent: Thursday, March 30, 2017 4:42 PM
> To: David Turner <david.tur...@twosigma.com>
> Cc: git <git@vger.kernel.org>
> Subject: Re: [PATCH] http.postbuffer: make a size_t
> 
> On Thu, Mar 30, 2017 at 1:29 PM, David Turner <dtur...@twosigma.com>
> wrote:
> > Unfortunately, in order to push some large repos, the http postbuffer
> > must sometimes exceed two gigabytes.  On a 64-bit system, this is OK:
> > we just malloc a larger buffer.
> 
> I'm slightly curious what server you are pushing to that needs the entire
> thing buffered to compute a Content-Length, rather than using
> Transfer-Encoding: chunked. Most Git-over-HTTP should be accepting
> Transfer-Encoding: chunked when the stream exceeds postBuffer.


[PATCH] http.postbuffer: make a size_t

2017-03-30 Thread David Turner
Unfortunately, in order to push some large repos, the http postbuffer
must sometimes exceed two gigabytes.  On a 64-bit system, this is OK:
we just malloc a larger buffer.

Signed-off-by: David Turner <dtur...@twosigma.com>
---
 cache.h  |  1 +
 config.c | 17 +
 http.c   |  2 +-
 3 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/cache.h b/cache.h
index fbdf7a815a..a8c1b65db0 100644
--- a/cache.h
+++ b/cache.h
@@ -1900,6 +1900,7 @@ extern int git_parse_maybe_bool(const char *);
 extern int git_config_int(const char *, const char *);
 extern int64_t git_config_int64(const char *, const char *);
 extern unsigned long git_config_ulong(const char *, const char *);
+extern size_t git_config_size_t(const char *, const char *);
 extern int git_config_bool_or_int(const char *, const char *, int *);
 extern int git_config_bool(const char *, const char *);
 extern int git_config_maybe_bool(const char *, const char *);
diff --git a/config.c b/config.c
index 1a4d85537b..7b706cf27a 100644
--- a/config.c
+++ b/config.c
@@ -834,6 +834,15 @@ int git_parse_ulong(const char *value, unsigned long *ret)
return 1;
 }
 
+static size_t git_parse_size_t(const char *value, unsigned long *ret)
+{
+   size_t tmp;
+   if (!git_parse_signed(value, &tmp, maximum_unsigned_value_of_type(size_t)))
+   return 0;
+   *ret = tmp;
+   return 1;
+}
+
 NORETURN
 static void die_bad_number(const char *name, const char *value)
 {
@@ -892,6 +901,14 @@ unsigned long git_config_ulong(const char *name, const char *value)
return ret;
 }
 
+size_t git_config_size_t(const char *name, const char *value)
+{
+   unsigned long ret;
+   if (!git_parse_size_t(value, &ret))
+   die_bad_number(name, value);
+   return ret;
+}
+
 int git_parse_maybe_bool(const char *value)
 {
if (!value)
diff --git a/http.c b/http.c
index 96d84bbed3..ab6080835f 100644
--- a/http.c
+++ b/http.c
@@ -331,7 +331,7 @@ static int http_options(const char *var, const char *value, void *cb)
}
 
if (!strcmp("http.postbuffer", var)) {
-   http_post_buffer = git_config_int(var, value);
+   http_post_buffer = git_config_size_t(var, value);
if (http_post_buffer < LARGE_PACKET_MAX)
http_post_buffer = LARGE_PACKET_MAX;
return 0;
-- 
2.11.GIT



Re: bug?: git reset --mixed ignores deinitialized submodules

2017-03-13 Thread David Turner
On Mon, 2017-03-13 at 14:19 -0700, Stefan Beller wrote:
> > > The change is not really lost, as you can get it via
> > > 
> > > git checkout HEAD@{1}
> > > git submodule update --init
> > 
> > Sure, the commit isn't lost entirely.  But a mixed reset is often
> > used
> > to mean "go back to before I committed", and here, that's not
> > precisely
> > what's happening.
> 
> Well, you ran the deinit after the commit, so I don't think
> we expect to undo everything up to "just before the commit".

Sure, but other workdir changes after the commit would have been blown
away; so why not this one?

> > > * lack of --recurse-submodules in git-reset? (and that not
> > >   being default, see prior point)
> > 
> > Or possibly this.
> 
> Well even in this case a reset recursing into submodules doesn't
> change the (de-)init status of a submodule. All it would alter is what
> the submodule HEAD points to, IMHO.

That's OK with me.  It's weird, but at least it's explicable. 

> > For me, the bug (if any) is the bad user experience of doing a
> > mixed
> > reset and expecting to be able to commit (possibly with some git-
> > add
> > operations) from there and get back something like the commit to
> > which
> > the user had git-reset.
> 
> Technically this is also doable,
> 
> git reset --mixed HEAD^ # as in the original email
> git update-index --add --cacheinfo \
> 160000,$(git -C .git/modules/sub1 rev-parse HEAD),sub1
> git add another_file
> git commit -m "recreate the commit"

Yeah, unless the user had done various other operations that messed
with .git/modules/sub1 (e.g. if they had first checked out another
branch with --recurse-submodules).

Also, this updates the index, which a mixed reset isn't supposed to
touch. 

> > That's why I have the question mark there -- it's not clear that
> > this
> > is a reasonable expectation.
> 
> Fuzzing around with gitlinks that are non-populated submodules is
> a messy business.

Agreed.

> So another way would be to populate the submodule manually
> 
> GIT_DIR=.git/modules/sub1 \
> GIT_WORKTREE=sub1 \
> git checkout -f # implies last HEAD
> 
> and then continue in the superproject.

(see above for why this is not a general solution)

> I am making up excuses for poor UX here, though.
> I do agree that the expectations may vary wildly because of poor UX
> in submodules.

I agree that it's not quite clear what to expect.  I can just say that
I was quite surprised when my colleague demoed this one for me.



Re: bug?: git reset --mixed ignores deinitialized submodules

2017-03-13 Thread David Turner
On Mon, 2017-03-13 at 10:51 -0700, Stefan Beller wrote:
> On Fri, Mar 10, 2017 at 1:06 PM, David Turner <nova...@novalis.org>
> wrote:
> > Git reset --mixed ignores submodules which are not
> > initialized.  I've
> > attached a demo script.
> > 
> > On one hand, this matches the documentation ("Resets the index but
> > not
> > the working tree").  But on the other hand, it kind of doesn't:
> > "(i.e.,
> > the changed files are preserved but not marked for commit)".
> > 
> > It's hard to figure out what a mixed reset should do.  It would be
> > weird for it to initialize the submodule.  Maybe it should just
> > refuse
> > to run?  Maybe there should be an option for it to initialize the
> > submodule for you?  Maybe it should drop a special-purpose file
> > that
> > git understands to be a submodule change?  For instance (and this
> > is
> > insane, but, like, maybe worth considering) it could use extended
> > filesystem attributes, where available.
> > 
> > #!/bin/bash
> > mkdir demo
> > cd demo
> > 
> > git init main
> > 
> > (
> > git init sub1 &&
> > cd sub1 &&
> > dd if=/dev/urandom of=f bs=40 count=1 &&
> 
> We prefer reproducibility in tests, so if we take this into
> a (regression) test later we may want to
> s/dd.../echo "determinism!" >f/

Yeah, that was leftover from some previous version of this script, I
think.   This wasn't intended to be a t/ test, since I don't know what
the right answer is -- just a demo in case my prose was unclear.

> > # commit that change on main, deinit the submodule and do a mixed reset
> > (
> > cd main &&
> > git add sub1 &&
> > git commit -m 'update sub1' &&
> > git submodule deinit sub1 &&
> > git reset --mixed HEAD^ &&
> 
> As of now most commands (including reset)
> are unaware of submodules to begin with.
> They are ignored in most cases, i.e. git-status
> has some tack-on to (pseudo-)report changes
> in submodules, but it's not really builtin.
> 
> A submodule that is not initialized
> (i.e. submodule.<name>.url is unset) ought
> to not be touched at all. This is spelled out in
> the man page for "submodule update" only at this
> point.
> 
> 
> > git status # change to sub1 is lost
> 
> The change is not really lost, as you can get it via
> 
> git checkout HEAD@{1}
> git submodule update --init

Sure, the commit isn't lost entirely.  But a mixed reset is often used
to mean "go back to before I committed", and here, that's not precisely
what's happening.  In other words, it's confusing to the user.

> This works most of the time, but it is unreliable as the
> submodule may have had some gc inbetween which
> threw away important objects.

Sure; that's a separate issue.

> Steping back a bit, rereading the subject line,
> what do you think is the bug here?
> 
> * git-status not reporting about uninitialized submodules?

Here, I think git-status is correctly reporting the state of the repo.

> * git reset --mixed not touching the submodule worktree

Yes, possibly.

> * lack of --recurse-submodules in git-reset? (and that not
>   being default, see prior point)

Or possibly this.

> * submodules being in detached HEAD all the time?

In this case, the submodule is not initialized, so it is not in any
state at all.


For me, the bug (if any) is the bad user experience of doing a mixed
reset and expecting to be able to commit (possibly with some git-add
operations) from there and get back something like the commit to which
the user had git-reset.  

That's why I have the question mark there -- it's not clear that this
is a reasonable expectation.



bug?: git reset --mixed ignores deinitialized submodules

2017-03-10 Thread David Turner
Git reset --mixed ignores submodules which are not initialized.  I've
attached a demo script.  

On one hand, this matches the documentation ("Resets the index but not
the working tree").  But on the other hand, it kind of doesn't: "(i.e.,
the changed files are preserved but not marked for commit)".

It's hard to figure out what a mixed reset should do.  It would be
weird for it to initialize the submodule.  Maybe it should just refuse
to run?  Maybe there should be an option for it to initialize the
submodule for you?  Maybe it should drop a special-purpose file that
git understands to be a submodule change?  For instance (and this is
insane, but, like, maybe worth considering) it could use extended
filesystem attributes, where available.

#!/bin/bash
mkdir demo
cd demo

git init main

(
git init sub1 &&
cd sub1 &&
dd if=/dev/urandom of=f bs=40 count=1 &&
git add f &&
git commit -m f
) &&

(
cd main &&
git submodule add ../sub1 sub1 &&
git commit -m 'add submodule' &&
git tag start
) &&

# add a commit on sub1
(
cd main/sub1 &&
echo morx > f &&
git add f &&
git commit -m 'a commit'
) &&

# commit that change on main, deinit the submodule and do a mixed reset
(
cd main &&
git add sub1 &&
git commit -m 'update sub1' &&
git submodule deinit sub1 &&
git reset --mixed HEAD^ &&
git status # change to sub1 is lost
)



RE: [PATCH 2/2] http: add an "auto" mode for http.emptyauth

2017-02-23 Thread David Turner
> -Original Message-
> From: Jeff King [mailto:p...@peff.net]
> Sent: Thursday, February 23, 2017 2:44 PM
> To: David Turner <david.tur...@twosigma.com>
> Cc: Junio C Hamano <gits...@pobox.com>; git@vger.kernel.org;
> sand...@crustytoothpaste.net; Johannes Schindelin
> <johannes.schinde...@gmx.de>; Eric Sunshine <sunsh...@sunshineco.com>
> Subject: Re: [PATCH 2/2] http: add an "auto" mode for http.emptyauth
> 
> On Thu, Feb 23, 2017 at 04:31:13PM +, David Turner wrote:
> 
> > > As somebody who is using non-Basic auth, can you apply these patches
> > > and show us the output of:
> > >
> > >GIT_TRACE_CURL=1 \
> > >git ls-remote https://your-server 2>&1 >/dev/null |
> > >egrep '(Send|Recv) header: (GET|HTTP|Auth)'
> > >
> > > (without http.emptyauth turned on, obviously).
> >
> > The results appear to be identical with and without the patch.  With
> > http.emptyauth turned off,
> > 16:27:28.208924 http.c:524  => Send header: GET /info/refs?service=git-upload-pack HTTP/1.1
> > 16:27:28.212872 http.c:524  <= Recv header: HTTP/1.1 401 Authorization Required
> > Username for 'http://git': [I just pressed enter]
> > Password for 'http://git': [ditto]
> > 16:27:29.928872 http.c:524  => Send header: GET /info/refs?service=git-upload-pack HTTP/1.1
> > 16:27:29.929787 http.c:524  <= Recv header: HTTP/1.1 401 Authorization Required
> 
> Just to be sure: did you remove http.emptyauth config completely from your
> config files, or did you turn it to "false"? Because the new behavior only
> kicks in when it isn't configured at all (probably we should respect "auto"
> as a user-provided name).

I turned it to false. With it completely removed, I get this, both times:

20:03:49.896797 http.c:524  => Send header: GET /info/refs?service=git-upload-pack HTTP/1.1
20:03:49.900776 http.c:524  <= Recv header: HTTP/1.1 401 Authorization Required
20:03:49.900929 http.c:524  => Send header: GET /info/refs?service=git-upload-pack HTTP/1.1
20:03:49.904754 http.c:524  <= Recv header: HTTP/1.1 401 Authorization Required
20:03:49.906649 http.c:524  => Send header: GET /info/refs?service=git-upload-pack HTTP/1.1
20:03:49.906654 http.c:524  => Send header: Authorization: Negotiate 
20:03:49.956753 http.c:524  <= Recv header: HTTP/1.1 200 OK - $gitservername

> > (if someone else wants to replicate this, delete the >/dev/null bit from
> > Jeff's shell snippet)
> 
> Hrm, you shouldn't need to. The stderr redirection comes first, so it should
> become the new stdout.

Weird.  It didn't appear to work earlier, but I must have screwed something up.
And I learned something about shell redirection.


RE: [PATCH 2/2] http: add an "auto" mode for http.emptyauth

2017-02-23 Thread David Turner


> -Original Message-
> From: Jeff King [mailto:p...@peff.net]
> Sent: Wednesday, February 22, 2017 8:38 PM
> To: David Turner <david.tur...@twosigma.com>
> Cc: Junio C Hamano <gits...@pobox.com>; git@vger.kernel.org;
> sand...@crustytoothpaste.net; Johannes Schindelin
> <johannes.schinde...@gmx.de>; Eric Sunshine <sunsh...@sunshineco.com>
> Subject: Re: [PATCH 2/2] http: add an "auto" mode for http.emptyauth
> 
> On Thu, Feb 23, 2017 at 01:16:33AM +, David Turner wrote:
> 
> > I don't know enough about how libcurl handles authentication to know
> > whether these patches are a good idea, but I have a minor comment
> anyway.
> 
> As somebody who is using non-Basic auth, can you apply these patches and
> show us the output of:
> 
>GIT_TRACE_CURL=1 \
>git ls-remote https://your-server 2>&1 >/dev/null |
>egrep '(Send|Recv) header: (GET|HTTP|Auth)'
> 
> (without http.emptyauth turned on, obviously).

The results appear to be identical with and without
the patch.  With http.emptyauth turned off,
16:27:28.208924 http.c:524  => Send header: GET 
/info/refs?service=git-upload-pack HTTP/1.1
16:27:28.212872 http.c:524  <= Recv header: HTTP/1.1 401 
Authorization Required
Username for 'http://git': [I just pressed enter]
Password for 'http://git': [ditto]
16:27:29.928872 http.c:524  => Send header: GET 
/info/refs?service=git-upload-pack HTTP/1.1
16:27:29.929787 http.c:524  <= Recv header: HTTP/1.1 401 
Authorization Required

(if someone else wants to replicate this, delete the >/dev/null bit from
Jeff's shell snippet)




RE: [PATCH 2/2] http: add an "auto" mode for http.emptyauth

2017-02-22 Thread David Turner
I don't know enough about how libcurl handles authentication to know whether 
these patches are a good idea, but I have a minor comment anyway.

> -Original Message-
> From: Jeff King [mailto:p...@peff.net]
> +static int curl_empty_auth_enabled(void) {
> + if (curl_empty_auth < 0) {
> +#ifdef LIBCURL_CAN_HANDLE_AUTH_ANY
> + /*
> +  * In the automatic case, kick in the empty-auth
> +  * hack as long as we would potentially try some
> +  * method more exotic than "Basic".
> +  *
> +  * But only do so when this is _not_ our initial
> +  * request, as we would not then yet know what
> +  * methods are available.
> +  */

Eliminate double-negative:

"But only do this when this is our second or subsequent request, 
as by then we know what methods are available."



RE: [PATCH] http(s): automatically try NTLM authentication first

2017-02-22 Thread David Turner
> -Original Message-
> From: brian m. carlson [mailto:sand...@crustytoothpaste.net]
> 
> This is SPNEGO.  It will work with NTLM as well as Kerberos.
> 
> Browsers usually disable this feature by default, as it basically will
> attempt to authenticate to any site that sends a 401.  For Kerberos against
> a malicious site, the user will either not have a valid ticket for that
> domain, or the user's Kerberos server will refuse to provide a ticket to
> pass to the server, so there's no security risk involved.
> 
> I'm unclear how SPNEGO works with NTLM, so I can't speak for the security
> of it.  From what I understand of NTLM and from RFC 4559, it consists of a
> shared secret.  I'm unsure what security measures are in place to not send
> that to an untrusted server.
> 
> As far as Kerberos, this is a desirable feature to have enabled, with little
> downside.  I just don't know about the security of the NTLM part, and I don't
> think we should take this patch unless we're sure we know the
> consequences of it.

NTLM on its own is bad:

https://msdn.microsoft.com/en-us/library/windows/desktop/aa378749(v=vs.85).aspx
says:

"
1. (Interactive authentication only) A user accesses a client computer and 
provides a domain name, user name, and password. The client computes a 
cryptographic hash of the password and discards the actual password.
2. The client sends the user name to the server (in plaintext).
3. The server generates a 16-byte random number, called a challenge or 
nonce, and sends it to the client.
4. The client encrypts this challenge with the hash of the user's password 
and returns the result to the server. This is called the response.
..."

Wait, what?  If I'm a malicious server, I can get access to an offline oracle
for whether I've correctly guessed the user's password?  That doesn't 
sound secure at all!  Skimming the SPNEGO RFCs, there appears to be no
mitigation for this.  
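
To make the concern concrete, here is a toy model of that offline oracle
(pure illustration: ntlm_hash() and make_response() are trivial placeholders,
not the real MD4/DES-based NTLM primitives):

    #include <stdio.h>

    /* placeholder standing in for the NT password hash */
    static unsigned long ntlm_hash(const char *password)
    {
        unsigned long h = 5381;
        while (*password)
            h = h * 33 + (unsigned char)*password++;
        return h;
    }

    /* placeholder standing in for encrypting the challenge */
    static unsigned long make_response(unsigned long hash, unsigned long challenge)
    {
        return hash ^ challenge;
    }

    int main(void)
    {
        unsigned long challenge = 0x1234abcdUL;
        /* the (challenge, response) pair a malicious server captured */
        unsigned long captured = make_response(ntlm_hash("hunter2"), challenge);
        const char *guesses[] = { "password", "letmein", "hunter2" };
        int i;

        /* no further network traffic needed: each guess is tested offline */
        for (i = 0; i < 3; i++)
            if (make_response(ntlm_hash(guesses[i]), challenge) == captured)
                printf("recovered the password: %s\n", guesses[i]);
        return 0;
    }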

So, I guess, this patch might be considered a security risk.  But on the
other hand, even *without* this patch, and without http.emptyAuth at all,
I think a config which simply uses an https:// url without the magic :@
would try SPNEGO.  As I understand it, the http.emptyAuth config just
makes the traditional :@ urls work.

Actually, though, I am not sure this is as bad as it seems, because gssapi
might protect us.  When I locally tried a fake server, git (libcurl) refused to 
send my Kerberos credentials because "Server not found in Kerberos 
database".  I don't have a machine set up with NTLM authentication 
(because, apparently, that would be insane), so I don't know how to 
confirm that gssapi would operate off of a whitelist for NTLM as well. 


RE: [PATCH] http(s): automatically try NTLM authentication first

2017-02-22 Thread David Turner
> -Original Message-
> From: Junio C Hamano [mailto:jch2...@gmail.com] On Behalf Of Junio C
> Hamano
> Sent: Wednesday, February 22, 2017 3:20 PM
> To: David Turner <david.tur...@twosigma.com>
> Cc: git@vger.kernel.org; sand...@crustytoothpaste.net; Johannes Schindelin
> <johannes.schinde...@gmx.de>; Eric Sunshine
> <sunsh...@sunshineco.com>; Jeff King <p...@peff.net>
> Subject: Re: [PATCH] http(s): automatically try NTLM authentication first
> 
> David Turner <dtur...@twosigma.com> writes:
> 
> > From: Johannes Schindelin <johannes.schinde...@gmx.de>
> >
> > It is common in corporate setups to have permissions managed via a
> > domain account. That means that the user does not really have to log
> > in when accessing a central repository via https://, but that the
> > login credentials are used to authenticate with that repository.
> >
> > The common way to do that used to require empty credentials, i.e.
> > hitting Enter twice when being asked for user name and password, or by
> > using the very funny notation https://:@server/repository
> >
> > A recent commit (5275c3081c (http: http.emptyauth should allow empty
> > (not just NULL) usernames, 2016-10-04)) broke that usage, though, all
> > of a sudden requiring users to set http.emptyAuth = true.
> >
> > Which brings us to the bigger question why http.emptyAuth defaults to
> > false, to begin with.
> 
> This is a valid question, and I do not see it explicitly asked in the
> thread:
> 
> https://public-inbox.org/git/CAPig+cSphEu3iRJrkdBA+BRhi9HnopLJnKOHVuGhUqavtV1RXg@mail.gmail.com/#t
> 
> even though there is a hint of it already there.
> 
> > It would be one thing if cURL would not let the user specify
> > credentials interactively after attempting NTLM authentication (i.e.
> > login credentials), but that is not the case.
> >
> > It would be another thing if attempting NTLM authentication was not
> > usually what users need to do when trying to authenticate via https://.
> > But that is also not the case.
> 
> Some other possible worries we may have had I can think of are:
> 
>  - With this enabled unconditionally, would we leak some information?

I think "NTLM" is actually a misnomer here (I just copied Johannes's 
commit message). The mechanism is actually SPNEGO, if I understand this 
correctly. It seems to me that this is probably secure, since it is apparently
widely implemented in browsers.

>  - With this enabled unconditionally, would we always incur an extra
>roundtrip for people who are not running NTLM at all?
>
> I do not think the former is the case, but what would I know (adding a few
> people involved in the original thread to CC: ;-)

Always, no.  For failed authentication (or authorization), apparently, yes.
I tested this by setting the variable to false and then true, and trying to
push to a github repository which I didn't have write access to, with
both an empty username (https://:@github.com/...) and no username
(http://github.com/...).  I ran this under GIT_CURL_VERBOSE=1 and
I saw two 401 responses in the "http.emptyauth=true" case and one
in the false case.  I also tried with a repo that I did have access to (first
configuring the necessary tokens for HTTPS push access), and saw two
401 responses in *both* cases.



[PATCH] http(s): automatically try NTLM authentication first

2017-02-22 Thread David Turner
From: Johannes Schindelin <johannes.schinde...@gmx.de>

It is common in corporate setups to have permissions managed via a
domain account. That means that the user does not really have to log in
when accessing a central repository via https://, but that the login
credentials are used to authenticate with that repository.

The common way to do that used to require empty credentials, i.e. hitting
Enter twice when being asked for user name and password, or by using the
very funny notation https://:@server/repository

A recent commit (5275c3081c (http: http.emptyauth should allow empty (not
just NULL) usernames, 2016-10-04)) broke that usage, though, all of a
sudden requiring users to set http.emptyAuth = true.

Which brings us to the bigger question why http.emptyAuth defaults to
false, to begin with.

It would be one thing if cURL would not let the user specify credentials
interactively after attempting NTLM authentication (i.e. login
credentials), but that is not the case.

It would be another thing if attempting NTLM authentication was not
usually what users need to do when trying to authenticate via https://.
But that is also not the case.

So let's just go ahead and change the default, and unbreak the NTLM
authentication. As a bonus, this also makes the "you need to hit Enter
twice" (which is hard to explain: why enter empty credentials when you
want to authenticate with your login credentials?) and the ":@" hack
(which is also pretty, pretty hard to explain to users) obsolete.

This fixes https://github.com/git-for-windows/git/issues/987

Signed-off-by: Johannes Schindelin <johannes.schinde...@gmx.de>
Signed-off-by: David Turner <dtur...@twosigma.com>
---
This has been in git for Windows for a few months (without the
config.txt change).  We've also been using it internally.  So I think
it's time to merge back to upstream git.

 Documentation/config.txt | 3 ++-
 http.c   | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index fc5a28a320..b0da64ed33 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1742,7 +1742,8 @@ http.emptyAuth::
Attempt authentication without seeking a username or password.  This
can be used to attempt GSS-Negotiate authentication without specifying
a username in the URL, as libcurl normally requires a username for
-   authentication.
+   authentication.  Default is true, since if this fails, git will fall
+   back to asking the user for their username/password.
 
 http.delegation::
Control GSSAPI credential delegation. The delegation is disabled
diff --git a/http.c b/http.c
index 90a1c0f113..943e630ea6 100644
--- a/http.c
+++ b/http.c
@@ -109,7 +109,7 @@ static int curl_save_cookies;
 struct credential http_auth = CREDENTIAL_INIT;
 static int http_proactive_auth;
 static const char *user_agent;
-static int curl_empty_auth;
+static int curl_empty_auth = 1;
 
 enum http_follow_config http_follow_config = HTTP_FOLLOW_INITIAL;
 
-- 
2.11.GIT



Re: [PATCH v2 0/9] Store submodules in a hash, not a linked list

2017-02-12 Thread David Turner


On Fri, 2017-02-10 at 12:16 +0100, Michael Haggerty wrote:
> This is v2 of the patch series, considerably reorganized but not that
> different codewise. Thanks to Stefan, Junio, and Peff for their
> feedback about v1 [1]. I think I have addressed all of your comments.

LGTM.



RE: [PATCH v5] gc: ignore old gc.log files

2017-02-10 Thread David Turner


> -Original Message-
> From: Jeff King [mailto:p...@peff.net]
> Sent: Friday, February 10, 2017 4:15 PM
> To: David Turner <david.tur...@twosigma.com>
> Cc: git@vger.kernel.org; pclo...@gmail.com; Junio C Hamano
> <gits...@pobox.com>
> Subject: Re: [PATCH v5] gc: ignore old gc.log files
> 
> > @@ -76,10 +78,30 @@ static void git_config_date_string(const char
> > *key, const char **output)  static void process_log_file(void)  {
> > struct stat st;
> > -   if (!fstat(get_lock_file_fd(&log_lock), &st) && st.st_size)
> > +   if (fstat(get_lock_file_fd(&log_lock), &st)) {
> > +   /*
> > +* Perhaps there was an i/o error or another
> > +* unlikely situation.  Try to make a note of
> > +* this in gc.log along with any existing
> > +* messages.
> > +*/
> > +   FILE *fp;
> > +   int saved_errno = errno;
> > +   fp = fdopen(log_lock.tempfile.fd, "a");
> 
> We usually use xfdopen() to handle (unlikely) errors rather than
> segfaulting.  But I think you'd actually want fdopen_lock_file(), which
> attaches the fd to the tempfile for flushing and cleanup purposes.
> 
> That said, I'm not sure I understand why you're opening a new stdio
> filehandle. We know that stderr already points to our logfile (that's
> how content gets there in the first place). If there's a problem with
> the file or the descriptor, opening a new filehandle around the same
> descriptor won't help.
> 
> Speaking of stderr, I wonder if this function should be calling
> fflush(stderr) before looking at the fstat result. There could be
> contents buffered there that haven't been written out yet (not from
> child processes, but perhaps ones written in this process itself).
> Probably unlikely in practice, since stderr is typically unbuffered by
> default.

process_log_file_at_exit calls fflush.  Will fix the other.


[PATCH v6] gc: ignore old gc.log files

2017-02-10 Thread David Turner
A server can end up in a state where there are lots of unreferenced
loose objects (say, because many users are doing a bunch of rebasing
and pushing their rebased branches).  Running "git gc --auto" in
this state would cause a gc.log file to be created, preventing
future auto gcs, causing pack files to pile up.  Since many git
operations are O(n) in the number of pack files, this would lead to
poor performance.

Git should never get itself into a state where it refuses to do any
maintenance, just because at some point some piece of the maintenance
didn't make progress.

Teach Git to ignore gc.log files which are older than (by default)
one day old, which can be tweaked via the gc.logExpiry configuration
variable.  That way, these pack files will get cleaned up, if
necessary, at least once per day.  And operators who find a need for
more-frequent gcs can adjust gc.logExpiry to meet their needs.

There is also some cleanup: a successful manual gc, or a
warning-free auto gc with an old log file, will remove any old
gc.log files.

It might still happen that manual intervention is required
(e.g. because the repo is corrupt), but at the very least it won't
be because Git is too dumb to try again.

Signed-off-by: David Turner <dtur...@twosigma.com>
Helped-by: Jeff King <p...@peff.net>
Signed-off-by: Junio C Hamano <gits...@pobox.com>
---
 Documentation/config.txt |  6 +
 builtin/gc.c | 57 ++--
 t/t6500-gc.sh| 15 +
 3 files changed, 71 insertions(+), 7 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index fc5a28a32..a684b7e3e 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1402,6 +1402,12 @@ gc.autoDetach::
Make `git gc --auto` return immediately and run in background
if the system supports it. Default is true.
 
+gc.logExpiry::
+   If the file gc.log exists, then `git gc --auto` won't run
+   unless that file is more than 'gc.logExpiry' old.  Default is
+   "1.day".  See `gc.pruneExpire` for more ways to specify its
+   value.
+
 gc.packRefs::
Running `git pack-refs` in a repository renders it
unclonable by Git versions prior to 1.5.1.2 over dumb
diff --git a/builtin/gc.c b/builtin/gc.c
index 331f21926..a2b9e8924 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -33,6 +33,8 @@ static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
 static int detach_auto = 1;
+static unsigned long gc_log_expire_time;
+static const char *gc_log_expire = "1.day.ago";
 static const char *prune_expire = "2.weeks.ago";
 static const char *prune_worktrees_expire = "3.months.ago";
 
@@ -76,10 +78,28 @@ static void git_config_date_string(const char *key, const 
char **output)
 static void process_log_file(void)
 {
struct stat st;
-   if (!fstat(get_lock_file_fd(&log_lock), &st) && st.st_size)
+   if (fstat(get_lock_file_fd(&log_lock), &st)) {
+   /*
+* Perhaps there was an i/o error or another
+* unlikely situation.  Try to make a note of
+* this in gc.log along with any existing
+* messages.
+*/
+   int saved_errno = errno;
+   fprintf(stderr, _("Failed to fstat %s: %s"),
+   get_tempfile_path(&log_lock.tempfile),
+   strerror(saved_errno));
+   fflush(stderr);
commit_lock_file(&log_lock);
-   else
+   errno = saved_errno;
+   } else if (st.st_size) {
+   /* There was some error recorded in the lock file */
+   commit_lock_file(&log_lock);
+   } else {
+   /* No error, clean up any old gc.log */
+   unlink(git_path("gc.log"));
rollback_lock_file(&log_lock);
+   }
 }
 
 static void process_log_file_at_exit(void)
@@ -113,6 +133,8 @@ static void gc_config(void)
git_config_get_bool("gc.autodetach", &detach_auto);
git_config_date_string("gc.pruneexpire", &prune_expire);
git_config_date_string("gc.worktreepruneexpire", &prune_worktrees_expire);
+   git_config_date_string("gc.logexpiry", &gc_log_expire);
+
git_config(git_default_config, NULL);
 }
 
@@ -290,19 +312,34 @@ static const char *lock_repo_for_gc(int force, pid_t* 
ret_pid)
 static int report_last_gc_error(void)
 {
struct strbuf sb = STRBUF_INIT;
-   int ret;
+   int ret = 0;
+   struct stat st;
+   char *gc_log_path = git_pathdup("gc.log");
 
-   ret = strbuf_read_file(&sb, git_path("gc.log"), 0);
+   if (stat(gc_log_path, &st)) {
+   if (errno == ENOENT)
+   goto done;
+
+   ret = error_errno(_("Can't stat %s"), gc_log_path);
+

RE: [PATCH v3] gc: ignore old gc.log files

2017-02-10 Thread David Turner


> -Original Message-
> From: Jeff King [mailto:p...@peff.net]
> Sent: Friday, February 10, 2017 3:09 PM
> To: David Turner <david.tur...@twosigma.com>
> Cc: git@vger.kernel.org; pclo...@gmail.com
> Subject: Re: [PATCH v3] gc: ignore old gc.log files
> 
> On Fri, Feb 10, 2017 at 02:20:19PM -0500, David Turner wrote:
> 
> > @@ -76,10 +77,47 @@ static void git_config_date_string(const char
> > *key, const char **output)  static void process_log_file(void)  {
> > struct stat st;
> > -   if (!fstat(get_lock_file_fd(&log_lock), &st) && st.st_size)
> > +   if (fstat(get_lock_file_fd(&log_lock), &st)) {
> > +   if (errno == ENOENT) {
> > +   /*
> > +* The user has probably intentionally deleted
> > +* gc.log.lock (perhaps because they're blowing
> > +* away the whole repo), so there's no need to
> > +* report anything here.  But we also won't
> > +* delete gc.log, because we don't know what
> > +* the user's intentions are.
> > +*/
> 
> Hrm. Does fstat actually trigger ENOENT in that case? There's no
> pathname lookup happening at all. A simple test on Linux seems to show
> that it does not. Build:
> 
> >   #include <sys/stat.h>
> >   #include <fcntl.h>
> >   #include <unistd.h>
> 
>   int main(int argc, char **argv)
>   {
>   struct stat st;
>   int fd = open(argv[1], O_WRONLY|O_CREAT|O_EXCL, 0600);
>   unlink(argv[1]);
> >   fstat(fd, &st);
>   return 0;
>   }
> 
> and run:
> 
>   strace ./a.out tmp
> 
> which shows:
> 
>   open("tmp", O_WRONLY|O_CREAT|O_EXCL, 056660) = 3
>   unlink("tmp")   = 0
>   fstat(3, {st_mode=S_IFREG|S_ISUID|S_ISGID|0640, st_size=0, ...}) = 0
> 
> But maybe there are other systems emulating fstat() would trigger here.
> I dunno.

Yeah, I'm also not sure.  On the other hand, if we're going to catch fstat 
errors anyway, we might as well do something sensible with this one.

> > +   } else {
> > +   FILE *fp;
> > +   int fd;
> > +   int saved_errno = errno;
> > +   /*
> > +* Perhaps there was an i/o error or another
> > +* unlikely situation.  Try to make a note of
> > +* this in gc.log.  If this fails again,
> > +* give up and leave gc.log as it was.
> > +*/
> > +   rollback_lock_file(&log_lock);
> > +   fd = hold_lock_file_for_update(&log_lock,
> > +  git_path("gc.log"),
> > +  LOCK_DIE_ON_ERROR);
> 
> If there was an i/o error, then gc.log.lock still exists. And this
> attempt to lock will therefore fail, calling die(). Which would trigger
> our atexit() to call process_log(), which would hit this code again,
> and so forth. I'm not sure if we'd actually recurse when an atexit
> handler calls exit(). But it seems questionable.

No, because we call rollback_lock_file first.

> I'm also not sure why we need to re-open the file in the first place.
> We have an open descriptor (and we even redirected stderr to it
> already).  Why don't we just write to it?

If fstat failed, that probably indicates something bad about the old fd.
I'm not actually sure why fstat would ever fail, since in all likelihood,
the kernel keeps information about inodes corresponding to open fds
in-memory.  Maybe someone forcibly unmounted the drive?

> > @@ -113,6 +151,9 @@ static void gc_config(void)
> > git_config_get_bool("gc.autodetach", _auto);
> > git_config_date_string("gc.pruneexpire", _expire);
> > git_config_date_string("gc.worktreepruneexpire",
> > _worktrees_expire);
> > +   if (!git_config_get_value("gc.logexpiry", &value))
> > +   parse_expiry_date(value, &gc_log_expire_time);
> > +
> 
> Should you be using git_config_date_string() here? It looks like it
> does some extra sanity-checking. It annoyingly just gets the string,
> and doesn't parse it. Perhaps it would be worth adding a
> git_config_date_value() helper.

That seems like a good idea, but I'm going to skip it for now and promise to
do it next time I need a date config.

> Or alternatively, save the date string here, and then parse once later
> on, after having resolved all config (and overwritten the default value).

Sure.


[PATCH v5] gc: ignore old gc.log files

2017-02-10 Thread David Turner
A server can end up in a state where there are lots of unreferenced
loose objects (say, because many users are doing a bunch of rebasing
and pushing their rebased branches).  Running "git gc --auto" in
this state would cause a gc.log file to be created, preventing
future auto gcs, causing pack files to pile up.  Since many git
operations are O(n) in the number of pack files, this would lead to
poor performance.

Git should never get itself into a state where it refuses to do any
maintenance, just because at some point some piece of the maintenance
didn't make progress.

Teach Git to ignore gc.log files which are older than (by default)
one day old, which can be tweaked via the gc.logExpiry configuration
variable.  That way, these pack files will get cleaned up, if
necessary, at least once per day.  And operators who find a need for
more-frequent gcs can adjust gc.logExpiry to meet their needs.

There is also some cleanup: a successful manual gc, or a
warning-free auto gc with an old log file, will remove any old
gc.log files.

It might still happen that manual intervention is required
(e.g. because the repo is corrupt), but at the very least it won't
be because Git is too dumb to try again.

Signed-off-by: David Turner <dtur...@twosigma.com>
Helped-by: Jeff King <p...@peff.net>
Signed-off-by: Junio C Hamano <gits...@pobox.com>
---
 Documentation/config.txt |  6 +
 builtin/gc.c | 59 ++--
 t/t6500-gc.sh| 15 
 3 files changed, 73 insertions(+), 7 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index fc5a28a32..a684b7e3e 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1402,6 +1402,12 @@ gc.autoDetach::
Make `git gc --auto` return immediately and run in background
if the system supports it. Default is true.
 
+gc.logExpiry::
+   If the file gc.log exists, then `git gc --auto` won't run
+   unless that file is more than 'gc.logExpiry' old.  Default is
+   "1.day".  See `gc.pruneExpire` for more ways to specify its
+   value.
+
 gc.packRefs::
Running `git pack-refs` in a repository renders it
unclonable by Git versions prior to 1.5.1.2 over dumb
diff --git a/builtin/gc.c b/builtin/gc.c
index 331f21926..8d355feb0 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -33,6 +33,8 @@ static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
 static int detach_auto = 1;
+static unsigned long gc_log_expire_time;
+static const char *gc_log_expire = "1.day.ago";
 static const char *prune_expire = "2.weeks.ago";
 static const char *prune_worktrees_expire = "3.months.ago";
 
@@ -76,10 +78,30 @@ static void git_config_date_string(const char *key, const 
char **output)
 static void process_log_file(void)
 {
struct stat st;
-   if (!fstat(get_lock_file_fd(&log_lock), &st) && st.st_size)
+   if (fstat(get_lock_file_fd(&log_lock), &st)) {
+   /*
+* Perhaps there was an i/o error or another
+* unlikely situation.  Try to make a note of
+* this in gc.log along with any existing
+* messages.
+*/
+   FILE *fp;
+   int saved_errno = errno;
+   fp = fdopen(log_lock.tempfile.fd, "a");
+   fprintf(fp, _("Failed to fstat %s: %s"),
+   get_tempfile_path(&log_lock.tempfile),
+   strerror(saved_errno));
+   fclose(fp);
commit_lock_file(&log_lock);
-   else
+   errno = saved_errno;
+   } else if (st.st_size) {
+   /* There was some error recorded in the lock file */
+   commit_lock_file(&log_lock);
+   } else {
+   /* No error, clean up any old gc.log */
+   unlink(git_path("gc.log"));
rollback_lock_file(&log_lock);
+   }
 }
 
 static void process_log_file_at_exit(void)
@@ -113,6 +135,8 @@ static void gc_config(void)
git_config_get_bool("gc.autodetach", &detach_auto);
git_config_date_string("gc.pruneexpire", &prune_expire);
git_config_date_string("gc.worktreepruneexpire", &prune_worktrees_expire);
+   git_config_date_string("gc.logexpiry", &gc_log_expire);
+
git_config(git_default_config, NULL);
 }
 
@@ -290,19 +314,34 @@ static const char *lock_repo_for_gc(int force, pid_t* 
ret_pid)
 static int report_last_gc_error(void)
 {
struct strbuf sb = STRBUF_INIT;
-   int ret;
+   int ret = 0;
+   struct stat st;
+   char *gc_log_path = git_pathdup("gc.log");
 
-   ret = strbuf_read_file(&sb, git_path("gc.log"), 0);
+   if (stat(gc_log_path, &st)) {
+   if (errno == ENOENT)
+   goto done;
+
+  

[PATCH v4] gc: ignore old gc.log files

2017-02-10 Thread David Turner
Git should never get itself into a state where it refuses to do any
maintenance, just because at some point some piece of the maintenance
didn't make progress.

In this commit, git learns to ignore gc.log files which are older than
(by default) one day old.  It also learns about a config, gc.logExpiry
to manage this.  There is also some cleanup: a successful manual gc,
or a warning-free auto gc, will remove any old gc.log files.

It might still happen that manual intervention is required
(e.g. because the repo is corrupt), but at the very least it won't be
because Git is too dumb to try again.

Automatic gc was intended to make client repositories be
self-maintaining.  It would be good if automatic gc were also useful
to server operators.  A server can end up in a state where there are
lots of unreferenced loose objects (say, because many users are doing
a bunch of rebasing and pushing their rebased branches). Before this
patch, this state would cause a gc.log file to be created, preventing
future auto gcs.  Then pack files could pile up.  Since many git
operations are O(n) in the number of pack files, this would lead to
poor performance.  Now, the pack files will get cleaned up, if
necessary, at least once per day.  And operators who find a need for
more-frequent gcs can adjust gc.logExpiry to meet their needs.

Signed-off-by: David Turner <dtur...@twosigma.com>
Helped-by: Jeff King <p...@peff.net>
---
 Documentation/config.txt |  6 
 builtin/gc.c | 76 +++-
 t/t6500-gc.sh| 12 
 3 files changed, 87 insertions(+), 7 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index fc5a28a32..a684b7e3e 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1402,6 +1402,12 @@ gc.autoDetach::
Make `git gc --auto` return immediately and run in background
if the system supports it. Default is true.
 
+gc.logExpiry::
+   If the file gc.log exists, then `git gc --auto` won't run
+   unless that file is more than 'gc.logExpiry' old.  Default is
+   "1.day".  See `gc.pruneExpire` for more ways to specify its
+   value.
+
 gc.packRefs::
Running `git pack-refs` in a repository renders it
unclonable by Git versions prior to 1.5.1.2 over dumb
diff --git a/builtin/gc.c b/builtin/gc.c
index 331f21926..55c441115 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -33,6 +33,7 @@ static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
 static int detach_auto = 1;
+static unsigned long gc_log_expire_time;
 static const char *prune_expire = "2.weeks.ago";
 static const char *prune_worktrees_expire = "3.months.ago";
 
@@ -76,10 +77,48 @@ static void git_config_date_string(const char *key, const 
char **output)
 static void process_log_file(void)
 {
struct stat st;
-   if (!fstat(get_lock_file_fd(&log_lock), &st) && st.st_size)
+   if (fstat(get_lock_file_fd(&log_lock), &st)) {
+   if (errno == ENOENT) {
+   /*
+* The user has probably intentionally deleted
+* gc.log.lock (perhaps because they're blowing
+* away the whole repo), so there's no need to
+* report anything here.  But we also won't
+* delete gc.log, because we don't know what
+* the user's intentions are.
+*/
+   } else {
+   FILE *fp;
+   int fd;
+   int saved_errno = errno;
+   /*
+* Perhaps there was an i/o error or another
+* unlikely situation.  Try to make a note of
+* this in gc.log.  If this fails again,
+* give up and leave gc.log as it was.
+*/
+   rollback_lock_file(&log_lock);
+   fd = hold_lock_file_for_update(&log_lock,
+  git_path("gc.log"),
+  LOCK_DIE_ON_ERROR);
+
+   fp = fdopen(fd, "w");
+   fprintf(fp, _("Failed to fstat %s: %s"),
+   get_tempfile_path(&log_lock.tempfile),
+   strerror(saved_errno));
+   fclose(fp);
+   commit_lock_file(&log_lock);
+   errno = saved_errno;
+   }
+
+   } else if (st.st_size) {
+   /* There was some error recorded in the lock file */
commit_lock_file(&log_lock);
-   else
+   } else {
+   /* No error, clean up any old gc.log */
+   unlink(git_path("gc.log"));

[PATCH v3] gc: ignore old gc.log files

2017-02-10 Thread David Turner
Git should never get itself into a state where it refuses to do any
maintenance, just because at some point some piece of the maintenance
didn't make progress.

In this commit, git learns to ignore gc.log files which are older than
(by default) one day old.  It also learns about a config, gc.logExpiry
to manage this.  There is also some cleanup: a successful manual gc,
or a warning-free auto gc with an old log file, will remove any old
gc.log files.

It might still happen that manual intervention is required
(e.g. because the repo is corrupt), but at the very least it won't be
because Git is too dumb to try again.

Automatic gc was intended to make client repositories be
self-maintaining.  It would be good if automatic gc were also useful
to server operators.  A server can end up in a state where there are
lots of unreferenced loose objects (say, because many users are doing
a bunch of rebasing and pushing their rebased branches). Before this
patch, this state would cause a gc.log file to be created, preventing
future auto gcs.  Then pack files could pile up.  Since many git
operations are O(n) in the number of pack files, this would lead to
poor performance.  Now, the pack files will get cleaned up, if
necessary, at least once per day.  And operators who find a need for
more-frequent gcs can adjust gc.logExpiry to meet their needs.

Signed-off-by: David Turner <dtur...@twosigma.com>
Helped-by: Jeff King <p...@peff.net>
---
 Documentation/config.txt |  6 
 builtin/gc.c | 75 +++-
 t/t6500-gc.sh| 13 +
 3 files changed, 87 insertions(+), 7 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index fc5a28a32..a684b7e3e 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1402,6 +1402,12 @@ gc.autoDetach::
Make `git gc --auto` return immediately and run in background
if the system supports it. Default is true.
 
+gc.logExpiry::
+   If the file gc.log exists, then `git gc --auto` won't run
+   unless that file is more than 'gc.logExpiry' old.  Default is
+   "1.day".  See `gc.pruneExpire` for more ways to specify its
+   value.
+
 gc.packRefs::
Running `git pack-refs` in a repository renders it
unclonable by Git versions prior to 1.5.1.2 over dumb
diff --git a/builtin/gc.c b/builtin/gc.c
index 331f21926..6f297aa91 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -33,6 +33,7 @@ static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
 static int detach_auto = 1;
+static unsigned long gc_log_expire_time;
 static const char *prune_expire = "2.weeks.ago";
 static const char *prune_worktrees_expire = "3.months.ago";
 
@@ -76,10 +77,47 @@ static void git_config_date_string(const char *key, const 
char **output)
 static void process_log_file(void)
 {
struct stat st;
-   if (!fstat(get_lock_file_fd(&log_lock), &st) && st.st_size)
+   if (fstat(get_lock_file_fd(&log_lock), &st)) {
+   if (errno == ENOENT) {
+   /*
+* The user has probably intentionally deleted
+* gc.log.lock (perhaps because they're blowing
+* away the whole repo), so there's no need to
+* report anything here.  But we also won't
+* delete gc.log, because we don't know what
+* the user's intentions are.
+*/
+   } else {
+   FILE *fp;
+   int fd;
+   int saved_errno = errno;
+   /*
+* Perhaps there was an i/o error or another
+* unlikely situation.  Try to make a note of
+* this in gc.log.  If this fails again,
+* give up and leave gc.log as it was.
+*/
+   rollback_lock_file(&log_lock);
+   fd = hold_lock_file_for_update(&log_lock,
+  git_path("gc.log"),
+  LOCK_DIE_ON_ERROR);
+
+   fp = fdopen(fd, "w");
+   fprintf(fp, _("Failed to fstat %s: %s"),
+   get_tempfile_path(&log_lock.tempfile),
+   strerror(errno));
+   fclose(fp);
+   commit_lock_file(&log_lock);
+   }
+
+   } else if (st.st_size) {
+   /* There was some error recorded in the lock file */
commit_lock_file(&log_lock);
-   else
+   } else {
+   /* No error, clean up any old gc.log */
+   unlink(git_path("gc.log"));
rollback_lock_file(&log_lock);

[PATCH v2] gc: ignore old gc.log files

2017-02-09 Thread David Turner
The intent of automatic gc is to have a git repository be relatively
low-maintenance from a server-operator perspective.  Of course, large
operators like GitHub will need a more complicated management strategy,
but for ordinary usage, git should just work.

In this commit, git learns to ignore gc.log files which are older than
(by default) one day old.  It also learns about a config, gc.logExpiry
to manage this.  There is also some cleanup: a successful manual gc,
or a warning-free auto gc with an old log file, will remove any old
gc.log files.

So git should never get itself into a state where it refuses to do any
maintenance, just because at some point some piece of the maintenance
didn't make progress.  That might still happen (e.g. because the repo
is corrupt), but at the very least it won't be because Git is too dumb
to try again.

Signed-off-by: David Turner <dtur...@twosigma.com>
Helped-by: Jeff King <p...@peff.net>
---
 Documentation/config.txt |  5 +
 builtin/gc.c | 42 +++---
 t/t6500-gc.sh| 13 +
 3 files changed, 53 insertions(+), 7 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index fc5a28a32..2c2c9c75c 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1402,6 +1402,11 @@ gc.autoDetach::
Make `git gc --auto` return immediately and run in background
if the system supports it. Default is true.
 
+gc.logExpiry::
+   If the file gc.log exists, then `git gc --auto` won't run
+   unless that file is more than 'gc.logExpiry' old.  Default is
+   "1.day".  See `gc.pruneExpire` for more possible values.
+
 gc.packRefs::
Running `git pack-refs` in a repository renders it
unclonable by Git versions prior to 1.5.1.2 over dumb
diff --git a/builtin/gc.c b/builtin/gc.c
index 331f21926..46edcff30 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -33,6 +33,7 @@ static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
 static int detach_auto = 1;
+static unsigned long gc_log_expire_time;
 static const char *prune_expire = "2.weeks.ago";
 static const char *prune_worktrees_expire = "3.months.ago";
 
@@ -76,10 +77,12 @@ static void git_config_date_string(const char *key, const 
char **output)
 static void process_log_file(void)
 {
struct stat st;
-   if (!fstat(get_lock_file_fd(&log_lock), &st) && st.st_size)
+   if (!fstat(get_lock_file_fd(&log_lock), &st) && st.st_size) {
commit_lock_file(_lock);
-   else
+   } else {
+   unlink(git_path("gc.log"));
rollback_lock_file(&log_lock);
+   }
 }
 
 static void process_log_file_at_exit(void)
@@ -111,6 +114,11 @@ static void gc_config(void)
git_config_get_int("gc.auto", _auto_threshold);
git_config_get_int("gc.autopacklimit", _auto_pack_limit);
git_config_get_bool("gc.autodetach", _auto);
+
+   if (!git_config_get_value("gc.logexpiry", )) {
+   parse_expiry_date(value, _log_expire_time);
+   }
+
git_config_date_string("gc.pruneexpire", _expire);
git_config_date_string("gc.worktreepruneexpire", 
_worktrees_expire);
git_config(git_default_config, NULL);
@@ -290,19 +298,34 @@ static const char *lock_repo_for_gc(int force, pid_t* 
ret_pid)
 static int report_last_gc_error(void)
 {
struct strbuf sb = STRBUF_INIT;
-   int ret;
+   int ret = 0;
+   struct stat st;
+   char *gc_log_path = git_pathdup("gc.log");
 
-   ret = strbuf_read_file(&sb, git_path("gc.log"), 0);
+   if (stat(gc_log_path, &st)) {
+   if (errno == ENOENT)
+   goto done;
+
+   ret = error(_("Can't read %s"), gc_log_path);
+   goto done;
+   }
+
+   if (st.st_mtime < gc_log_expire_time)
+   goto done;
+
+   ret = strbuf_read_file(&sb, gc_log_path, 0);
if (ret > 0)
-   return error(_("The last gc run reported the following. "
+   ret = error(_("The last gc run reported the following. "
   "Please correct the root cause\n"
   "and remove %s.\n"
   "Automatic cleanup will not be performed "
   "until the file is removed.\n\n"
   "%s"),
-git_path("gc.log"), sb.buf);
+   gc_log_path, sb.buf);
strbuf_release(&sb);
-   return 0;
+done:
+   free(gc_log_path);
+   return ret;
 }
 
 static int gc_before_repack(void)
@@ -349,6 +372,8 @@ int cmd_gc(int argc, const char **argv, const char *prefi

Re: "disabling bitmap writing, as some objects are not being packed"?

2017-02-08 Thread David Turner
On Wed, 2017-02-08 at 09:44 -0800, Junio C Hamano wrote:
> Duy Nguyen  writes:
> 
> > On second thought, perhaps gc.autoDetach should default to false if
> > there's no tty, since its main point it to stop breaking interactive
> > usage. That would make the server side happy (no tty there).
> 
> Sounds like an idea, but wouldn't that keep the end-user coming over
> the network waiting after accepting a push until the GC completes, I
> wonder.  If an impatient user disconnects, would that end up killing
> an ongoing GC?  etc.

Regardless, it's impolite to keep the user waiting. So, I think we
should just not write the "too many unreachable loose objects" message
if auto-gc is on.  Does that sound OK?




[PATCH] gc: ignore old gc.log files

2017-02-08 Thread David Turner
The intent of automatic gc is to have a git repository be relatively
low-maintenance from a server-operator perspective.  Of course, large
operators like GitHub will need a more complicated management strategy,
but for ordinary usage, git should just work.

In this commit, git learns to ignore gc.log files which are older than
(by default) one day old.  It also learns about a config, gc.maxLogAge
to manage this.

So git should never get itself into a state where it refuses to do any
maintenance, just because at some point some piece of the maintenance
didn't make progress.  That might still happen (e.g. because the repo
is corrupt), but at the very least it won't be because Git is too dumb
to try again.

Signed-off-by: David Turner <dtur...@twosigma.com>
Helped-by: Jeff King <p...@peff.net>
---
 Documentation/config.txt |  5 +
 builtin/gc.c | 15 ++-
 t/t6500-gc.sh| 13 +
 3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index fc5a28a32..6751371cf 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1402,6 +1402,11 @@ gc.autoDetach::
Make `git gc --auto` return immediately and run in background
if the system supports it. Default is true.
 
+gc.maxLogAge::
+   If the file gc.log exists, then `git gc --auto` won't run
+   unless that file is more than maxLogAge seconds old.  Default
+   is 86400, one day.
+
 gc.packRefs::
Running `git pack-refs` in a repository renders it
unclonable by Git versions prior to 1.5.1.2 over dumb
diff --git a/builtin/gc.c b/builtin/gc.c
index 331f21926..62fc84815 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -33,6 +33,7 @@ static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
 static int detach_auto = 1;
+static int gc_max_log_age_seconds = 86400;
 static const char *prune_expire = "2.weeks.ago";
 static const char *prune_worktrees_expire = "3.months.ago";
 
@@ -111,6 +112,7 @@ static void gc_config(void)
git_config_get_int("gc.auto", _auto_threshold);
git_config_get_int("gc.autopacklimit", _auto_pack_limit);
git_config_get_bool("gc.autodetach", _auto);
+   git_config_get_int("gc.maxlogage", _max_log_age_seconds);
git_config_date_string("gc.pruneexpire", _expire);
git_config_date_string("gc.worktreepruneexpire", 
_worktrees_expire);
git_config(git_default_config, NULL);
@@ -291,8 +293,19 @@ static int report_last_gc_error(void)
 {
struct strbuf sb = STRBUF_INIT;
int ret;
+   struct stat st;
+   const char *gc_log_path = git_path("gc.log");
+
+   if (stat(gc_log_path, &st)) {
+   if (errno == ENOENT)
+   return 0;
+   return error(_("Can't read %s"), gc_log_path);
+   }
+
+   if (time(NULL) - st.st_mtime > gc_max_log_age_seconds)
+   return 0;
 
-   ret = strbuf_read_file(&sb, git_path("gc.log"), 0);
+   ret = strbuf_read_file(&sb, gc_log_path, 0);
if (ret > 0)
return error(_("The last gc run reported the following. "
   "Please correct the root cause\n"
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 1762dfa6a..b69c2c190 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -67,5 +67,18 @@ test_expect_success 'auto gc with too many loose objects 
does not attempt to cre
test_line_count = 2 new # There is one new pack and its .idx
 '
 
+test_expect_success 'background auto gc does not run if gc.log is present and 
recent but does if it is old' '
+   keep=$(ls .git/objects/pack/*.pack|head -1|sed -e "s/pack$/keep/") &&
+   test_commit foo &&
+   test_commit bar &&
+   git repack &&
+   test_config gc.autopacklimit 1 &&
+   test_config gc.autodetach true &&
+   echo fleem >.git/gc.log &&
+   test_must_fail git gc --auto 2>err &&
+   test_i18ngrep "^error:" err &&
+   test-chmtime =-86401 .git/gc.log &&
+   git gc --auto
+'
 
 test_done
-- 
2.11.GIT



Re: "disabling bitmap writing, as some objects are not being packed"?

2017-02-08 Thread David Turner
On Wed, 2017-02-08 at 14:08 -0500, Jeff King wrote:
> On Wed, Feb 08, 2017 at 02:05:42PM -0500, David Turner wrote:
> 
> > On Wed, 2017-02-08 at 09:44 -0800, Junio C Hamano wrote:
> > > Duy Nguyen <pclo...@gmail.com> writes:
> > > 
> > > > On second thought, perhaps gc.autoDetach should default to false if
> > > > there's no tty, since its main point it to stop breaking interactive
> > > > usage. That would make the server side happy (no tty there).
> > > 
> > > Sounds like an idea, but wouldn't that keep the end-user coming over
> > > the network waiting after accepting a push until the GC completes, I
> > > wonder.  If an impatient user disconnects, would that end up killing
> > > an ongoing GC?  etc.
> > 
> > Regardless, it's impolite to keep the user waiting. So, I think we
> > should just not write the "too many unreachable loose objects" message
> > if auto-gc is on.  Does that sound OK?
> 
> I thought the point of that message was to prevent auto-gc from kicking
> in over and over again due to objects that won't actually get pruned.
> 
> I wonder if you'd want to either bump the auto-gc object limit, or
> possibly reduce the gc.pruneExpire limit to keep this situation from
> coming up in the first place (or at least mitigating the amount of time
> it's the case).

Auto-gc might not succeed in pruning objects, but it will at least
reduce the number of packs, which is super-important for performance.

I think the intent of automatic gc is to have a git repository be
relatively low-maintenance from a server-operator perspective.  (Side
note: it's fairly trivial for a user with push access to mess with the
check simply by pushing a bunch of objects whose shas start with 17).
It seems odd that git gets itself into a state where it refuses to do
any maintenance just because at some point some piece of the maintenance
didn't make progress.
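
(For context, the check being gamed here is an estimate, not a full count;
roughly, and assuming the stock gc.auto default of 6700:)

   # approximation of what gc --auto does: sample .git/objects/17 and
   # extrapolate across all 256 fan-out directories
   n=$(ls .git/objects/17 2>/dev/null | wc -l)
   test $((n * 256)) -gt 6700 && echo "auto gc would trigger"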

Sure, I could change my configuration, but that doesn't help the other
folks (e.g. https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=813084 )
who run into this.

I have three thoughts on this:

Idea 1: when gc --auto would issue this message, instead it could create
a file named gc.too-much-garbage (instead of gc.log), with this message.
If that file exists, and it is less than one day (?) old, then we don't
attempt to do a full gc; instead we just run git repack -A -d.  (If it's
more than one day old, we just delete it and continue anyway).

Idea 2 : Like idea 1, but instead of repacking, just smash the existing
packs together into one big pack.  In other words, don't consider
dangling objects, or recompute deltas.  Twitter has a tool called "git
combine-pack" that does this:
https://github.com/dturner-tw/git/blob/dturner/journal/builtin/combine-pack.c

That's less space-efficient than a true repack, but it's no worse than
having the packs separate, and it's a win for read performance because
there's no need to do a linear search over N packs to find an object.
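
(A rough stock-git approximation, assuming a git new enough to have
repack's -k/--keep-unreachable; unlike combine-pack it still recomputes
deltas:)

   # one pack, nothing dropped: unreachable objects are packed, not pruned
   git repack -a -d -k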

Idea 3: As I suggested last time, separate fatal and non-fatal errors.
If gc fails because of EIO or something, we probably don't want to touch
the disk anymore. But here, the worst consequence is that we waste some
processing power. And it's better to occasionally waste processing power
in a non-interactive setting than it is to do so when a user will be
blocked on it.  So non-fatal warnings should go to gc.log, and fatal
errors should go to gc.fatal.  gc.log won't block gc from running. I
think this is my preferred option.
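
In pseudo-shell, with gc.fatal being the hypothetical new file (a sketch
of the policy, not actual git code):

   if test -e .git/gc.fatal; then
           cat .git/gc.fatal >&2 && exit 1     # fatal: block maintenance
   fi
   test -e .git/gc.log && cat .git/gc.log >&2  # non-fatal: only report
   git gc --auto                               # and proceed anyway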



Re: "disabling bitmap writing, as some objects are not being packed"?

2017-02-08 Thread David Turner
On Wed, 2017-02-08 at 13:45 +0700, Duy Nguyen wrote:
> On Wed, Feb 8, 2017 at 8:03 AM, David Turner <nova...@novalis.org> wrote:
> > On Sat, 2016-12-17 at 14:50 +0700, Duy Nguyen wrote:
> >> And we can't grep for fatal errors anyway. The problem that led to
> >> 329e6e8794 was this line
> >>
> >> warning: There are too many unreachable loose objects; run 'git
> >> prune' to remove them.
> >>
> >> which is not fatal.
> >
> > So, speaking of that message, I noticed that our git servers were
> > getting slow again and found that message in gc.log.
> >
> > I propose to make auto gc not write that message either. Any objections?
> 
> Does that really help? auto gc would run more often, but unreachable
> loose objects are still present and potentially make your servers
> slow? Should these servers run periodic and explicit gc/prune?

At least pack files wouldn't accumulate.  This is the major cause of
slowdown, since each pack file must be checked for each object.

(And, also, maybe those unreachable loose objects are too new to get
gc'd, but if we retry next week, we'll gc them).




Re: "disabling bitmap writing, as some objects are not being packed"?

2017-02-07 Thread David Turner
On Sat, 2016-12-17 at 14:50 +0700, Duy Nguyen wrote:
> And we can't grep for fatal errors anyway. The problem that led to
> 329e6e8794 was this line
> 
> warning: There are too many unreachable loose objects; run 'git
> prune' to remove them.
> 
> which is not fatal.

So, speaking of that message, I noticed that our git servers were
getting slow again and found that message in gc.log.

I propose to make auto gc not write that message either. Any objections?




Re: git cat-file on a submodule

2017-01-11 Thread David Turner
On Wed, 2017-01-11 at 07:53 -0500, Jeff King wrote:
> On Tue, Jan 10, 2017 at 07:11:40PM -0500, David Turner wrote:
> 
> > Why does git cat-file -t $sha:foo, where foo is a submodule, not work?
> 
> Because "cat-file" is about inspecting items in the object database, and
> typically the submodule commit is not present in the superproject's
> database. So we cannot know its type. You can infer what it _should_ be
> from the surrounding tree, but you cannot actually do the object lookup.
> 
> Likewise, "git cat-file -t $sha1:Makefile" is not just telling you that
> we found a 100644 entry in the tree, so we expect a blob. It's resolving
> to a sha1, and then checking the type of that sha1 in the database. It
> _should_ be a blob, but if it isn't, then cat-file is the tool that
> should tell you that it is not.
> 
> > git rev-parse $sha:foo works.
> 
> Right. Because that command is about resolving a name to a sha1, which
> we can do even without the object.
> 
> > By "why", I mean "would anyone complain if I fixed it?"  FWIW, I think
> > -p should just return the submodule's sha.
> 
> I'm not sure if I'm complaining or not. I can't immediately think of
> something that would be horribly broken. But it really feels like you
> are using the wrong tool, and patching the tool to handle this case will
> probably lead to weird cognitive dissonance down the road.

OK, this makes sense to me.  I tried cat-file because that is the tool I
was familiar with, but that doesn't mean that it was the right thing to
do.

> Maybe it would help to describe your use case more fully. If what you
> care about is the presumed type based on the surrounding tree, then
> maybe:
> 
>   git --literal-pathspecs ls-tree $sha -- foo

That (minus --literal-pathspecs, which does not exist in the version of
git I happen to be using) is fine for my purposes.  Thanks.
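
For the archives, the working recipe looks roughly like this (with a
hypothetical submodule at path "foo" in HEAD):

   git rev-parse HEAD:foo      # resolves the gitlink to its commit sha1
   git ls-tree HEAD -- foo     # shows: 160000 commit <sha1>  foo
   git cat-file -t HEAD:foo    # still fails unless the commit is in our odb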



git cat-file on a submodule

2017-01-10 Thread David Turner
Why does git cat-file -t $sha:foo, where foo is a submodule, not work? 

git rev-parse $sha:foo works.  

By "why", I mean "would anyone complain if I fixed it?"  FWIW, I think
-p should just return the submodule's sha.





Re: [PATCH] submodule update --init: displays correct path from submodule

2017-01-06 Thread David Turner
(for the test)
Signed-off-by: David Turner <dtur...@twosigma.com>

TIL: super-prefix!


On Fri, 2017-01-06 at 16:19 -0800, Stefan Beller wrote:
> In the submodule helper we did not correctly handle the display path
> for initializing submodules when both the submodule is inside a
> subdirectory as well as the command being invoked from a subdirectory
> (as viewed from the superproject).
> 
> This was broken in 3604242f080, which was written at a time where
> there was no super-prefix available, so we abused the --prefix option
> for the same purpose and could get only one case right (the call from
> within a subdirectory, not the submodule being in a subdirectory).
> 
> Test-provided-by: David Turner <nova...@novalis.org>
> Signed-off-by: Stefan Beller <sbel...@google.com>
> ---
> 
>   applies on sb/submodule-embed-gitdir as that contains 89c862655
>   (submodule helper: support super prefix)
> 
>  builtin/submodule--helper.c | 13 +++--
>  git-submodule.sh|  2 +-
>  t/t7406-submodule-update.sh | 17 +
>  3 files changed, 25 insertions(+), 7 deletions(-)
> 
> diff --git a/builtin/submodule--helper.c b/builtin/submodule--helper.c
> index 242d9911a6..7b3f9fc293 100644
> --- a/builtin/submodule--helper.c
> +++ b/builtin/submodule--helper.c
> @@ -305,32 +305,36 @@ static int module_list(int argc, const char **argv, 
> const char *prefix)
>  
>   utf8_fprintf(stdout, "%s\n", ce->name);
>   }
>   return 0;
>  }
>  
>  static void init_submodule(const char *path, const char *prefix, int quiet)
>  {
>   const struct submodule *sub;
>   struct strbuf sb = STRBUF_INIT;
>   char *upd = NULL, *url = NULL, *displaypath;
>  
>   /* Only loads from .gitmodules, no overlay with .git/config */
>   gitmodules_config();
>  
> - if (prefix) {
> - strbuf_addf(&sb, "%s%s", prefix, path);
> + if (prefix && get_super_prefix())
> + die("BUG: cannot have prefix and superprefix");
> + else if (prefix)
> + displaypath = xstrdup(relative_path(path, prefix, &sb));
> + else if (get_super_prefix()) {
> + strbuf_addf(&sb, "%s%s", get_super_prefix(), path);
>   displaypath = strbuf_detach(&sb, NULL);
>   } else
>   displaypath = xstrdup(path);
>  
>   sub = submodule_from_path(null_sha1, path);
>  
>   if (!sub)
>   die(_("No url found for submodule path '%s' in .gitmodules"),
>   displaypath);
>  
>   /*
>* Copy url setting when it is not set yet.
>* To look up the url in .git/config, we must not fall back to
>* .gitmodules, so look it up directly.
>*/
> @@ -391,33 +395,30 @@ static void init_submodule(const char *path, const char 
> *prefix, int quiet)
>   }
>   strbuf_release(&sb);
>   free(displaypath);
>   free(url);
>   free(upd);
>  }
>  
>  static int module_init(int argc, const char **argv, const char *prefix)
>  {
>   struct pathspec pathspec;
>   struct module_list list = MODULE_LIST_INIT;
>   int quiet = 0;
>   int i;
>  
>   struct option module_init_options[] = {
> - OPT_STRING(0, "prefix", &prefix,
> -N_("path"),
> -N_("alternative anchor for relative paths")),
>   OPT__QUIET(&quiet, N_("Suppress output for initializing a 
> submodule")),
>   OPT_END()
>   };
>  
>   const char *const git_submodule_helper_usage[] = {
>   N_("git submodule--helper init []"),
>   NULL
>   };
>  
>   argc = parse_options(argc, argv, prefix, module_init_options,
>git_submodule_helper_usage, 0);
>  
> if (module_list_compute(argc, argv, prefix, &pathspec, &list) < 0)
>   return 1;
>  
> @@ -1117,31 +1118,31 @@ static int absorb_git_dirs(int argc, const char 
> **argv, const char *prefix)
>  
>  struct cmd_struct {
>   const char *cmd;
>   int (*fn)(int, const char **, const char *);
>   unsigned option;
>  };
>  
>  static struct cmd_struct commands[] = {
>   {"list", module_list, 0},
>   {"name", module_name, 0},
>   {"clone", module_clone, 0},
>   {"update-clone", update_clone, 0},
>   {"relative-path", resolve_relative_path, 0},
>   {"resolve-relative-url", resolve_relative_url, 0},
>   {"resolve-relative-url-test", resolve_relative_url_test, 0},
> 

minor bug: git submodule update --init from a subdir shows wrong path

2017-01-06 Thread David Turner
I believe (from bisection) that this was introduced in the transition to
C at 3604242f080.

I've attached a repro (against master).  At the time the bug was
introduced, the output in question went to stdout instead of stderr, so
the attached test case will not quite work on the older version.  But if
you run under -v, you'll be able to see the bad ("foo/foo/sub" instead
of "foo/sub" or just "sub") output.


diff --git a/t/t7406-submodule-update.sh b/t/t7406-submodule-update.sh
index 64f322c..e1deb17 100755
--- a/t/t7406-submodule-update.sh
+++ b/t/t7406-submodule-update.sh
@@ -140,6 +140,23 @@ test_expect_success 'submodule update --init --recursive from subdirectory' '
 	test_i18ncmp expect2 actual2
 '
 
+cat ../../actual2
+	 )
+	) &&
+	test_i18ncmp expect2 actual2
+'
+
 apos="'";
 test_expect_success 'submodule update does not fetch already present commits' '
 	(cd submodule &&


[PATCH v5 2/2] repack: die on incremental + write-bitmap-index

2016-12-28 Thread David Turner
The bitmap index only works for single packs, so requesting an
incremental repack with bitmap indexes makes no sense.

Signed-off-by: David Turner <dtur...@twosigma.com>
---
 builtin/repack.c| 9 +
 t/t5310-pack-bitmaps.sh | 8 +++-
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 80dd06b..677bc7c 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -18,6 +18,12 @@ static const char *const git_repack_usage[] = {
NULL
 };
 
+static const char incremental_bitmap_conflict_error[] = N_(
+"Incremental repacks are incompatible with bitmap indexes.  Use\n"
+"--no-write-bitmap-index or disable the pack.writebitmaps configuration."
+);
+
+
 static int repack_config(const char *var, const char *value, void *cb)
 {
if (!strcmp(var, "repack.usedeltabaseoffset")) {
@@ -206,6 +212,9 @@ int cmd_repack(int argc, const char **argv, const char 
*prefix)
if (pack_kept_objects < 0)
pack_kept_objects = write_bitmaps;
 
+   if (write_bitmaps && !(pack_everything & ALL_INTO_ONE))
+   die(_(incremental_bitmap_conflict_error));
+
packdir = mkpathdup("%s/pack", get_object_directory());
packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid());
 
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index b4c7a6f..424bec7 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -118,12 +118,10 @@ test_expect_success 'fetch (partial bitmap)' '
test_cmp expect actual
 '
 
-test_expect_success 'incremental repack cannot create bitmaps' '
+test_expect_success 'incremental repack fails when bitmaps are requested' '
test_commit more-1 &&
-   find .git/objects/pack -name "*.bitmap" >expect &&
-   git repack -d &&
-   find .git/objects/pack -name "*.bitmap" >actual &&
-   test_cmp expect actual
+   test_must_fail git repack -d 2>err &&
+   test_i18ngrep "Incremental repacks are incompatible with bitmap" err
 '
 
 test_expect_success 'incremental repack can disable bitmaps' '
-- 
2.8.0.rc4.22.g8ae061a



[PATCH v5 1/2] auto gc: don't write bitmaps for incremental repacks

2016-12-28 Thread David Turner
When git gc --auto does an incremental repack of loose objects, we do
not expect to be able to write a bitmap; it is very likely that
objects in the new pack will have references to objects outside of the
pack.  So we shouldn't try to write a bitmap, because doing so will
likely issue a warning.

This warning was making its way into gc.log.  When the gc.log was
present, future auto gc runs would refuse to run.

Patch by Jeff King.
Bug report, test, and commit message by David Turner.

Signed-off-by: David Turner <dtur...@twosigma.com>
Signed-off-by: Jeff King <p...@peff.net>
---
 builtin/gc.c  |  9 -
 t/t6500-gc.sh | 25 +
 2 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index 069950d..331f219 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -191,6 +191,11 @@ static void add_repack_all_option(void)
}
 }
 
+static void add_repack_incremental_option(void)
+{
+   argv_array_push(&repack, "--no-write-bitmap-index");
+}
+
 static int need_to_gc(void)
 {
/*
@@ -208,7 +213,9 @@ static int need_to_gc(void)
 */
if (too_many_packs())
add_repack_all_option();
-   else if (!too_many_loose_objects())
+   else if (too_many_loose_objects())
+   add_repack_incremental_option();
+   else
return 0;
 
if (run_hook_le(NULL, "pre-auto-gc", NULL))
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 5d7d414..1762dfa 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -43,4 +43,29 @@ test_expect_success 'gc is not aborted due to a stale 
symref' '
)
 '
 
+test_expect_success 'auto gc with too many loose objects does not attempt to 
create bitmaps' '
+   test_config gc.auto 3 &&
+   test_config gc.autodetach false &&
+   test_config pack.writebitmaps true &&
+   # We need to create two objects whose sha1s start with 17
+   # since this is what git gc counts.  As it happens, these
+   # two blobs will do so.
+   test_commit 263 &&
+   test_commit 410 &&
+   # Our first gc will create a pack; our second will create a second pack
+   git gc --auto &&
+   ls .git/objects/pack | sort >existing_packs &&
+   test_commit 523 &&
+   test_commit 790 &&
+
+   git gc --auto 2>err &&
+   test_i18ngrep ! "^warning:" err &&
+   ls .git/objects/pack/ | sort >post_packs &&
+   comm -1 -3 existing_packs post_packs >new &&
+   comm -2 -3 existing_packs post_packs >del &&
+   test_line_count = 0 del && # No packs are deleted
+   test_line_count = 2 new # There is one new pack and its .idx
+'
+
+
 test_done
-- 
2.8.0.rc4.22.g8ae061a



[PATCH v5 0/2] repack (oops)

2016-12-28 Thread David Turner
This version addresses Johannes Sixt's comments on v4.  Also, I
messed up the rebase on v4.

David Turner (2):
  auto gc: don't write bitmaps for incremental repacks
  repack: die on incremental + write-bitmap-index

 builtin/gc.c|  9 -
 builtin/repack.c|  9 +
 t/t5310-pack-bitmaps.sh |  8 +++-
 t/t6500-gc.sh   | 25 +
 4 files changed, 45 insertions(+), 6 deletions(-)

-- 
2.8.0.rc4.22.g8ae061a



[PATCH v4 1/2] auto gc: don't write bitmaps for incremental repacks

2016-12-28 Thread David Turner
When git gc --auto does an incremental repack of loose objects, we do
not expect to be able to write a bitmap; it is very likely that
objects in the new pack will have references to objects outside of the
pack.  So we shouldn't try to write a bitmap, because doing so will
likely issue a warning.

This warning was making its way into gc.log.  When the gc.log was
present, future auto gc runs would refuse to run.

Patch by Jeff King.
Bug report, test, and commit message by David Turner.

Signed-off-by: David Turner <dtur...@twosigma.com>
Signed-off-by: Jeff King <p...@peff.net>
---
 builtin/gc.c  |  9 -
 t/t6500-gc.sh | 25 +
 2 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index 069950d..331f219 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -191,6 +191,11 @@ static void add_repack_all_option(void)
}
 }
 
+static void add_repack_incremental_option(void)
+{
+   argv_array_push(&repack, "--no-write-bitmap-index");
+}
+
 static int need_to_gc(void)
 {
/*
@@ -208,7 +213,9 @@ static int need_to_gc(void)
 */
if (too_many_packs())
add_repack_all_option();
-   else if (!too_many_loose_objects())
+   else if (too_many_loose_objects())
+   add_repack_incremental_option();
+   else
return 0;
 
if (run_hook_le(NULL, "pre-auto-gc", NULL))
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 5d7d414..def2aca 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -43,4 +43,29 @@ test_expect_success 'gc is not aborted due to a stale 
symref' '
)
 '
 
+test_expect_success 'auto gc with too many loose objects does not attempt to 
create bitmaps' '
+   test_config gc.auto 3 &&
+   test_config gc.autodetach false &&
+   test_config pack.writebitmaps true &&
+   # We need to create two objects whose sha1s start with 17
+   # since this is what git gc counts.  As it happens, these
+   # two blobs will do so.
+   test_commit 263 &&
+   test_commit 410 &&
+   # Our first gc will create a pack; our second will create a second pack
+   git gc --auto &&
+   ls .git/objects/pack | grep -v bitmap | sort >existing_packs &&
+   test_commit 523 &&
+   test_commit 790 &&
+
+   git gc --auto 2>err &&
+   test_i18ngrep ! "^warning:" err &&
+   ls .git/objects/pack/ | grep -v bitmap | sort >post_packs &&
+   comm --output-delimiter , -1 -3 existing_packs post_packs >new &&
+   comm --output-delimiter , -2 -3 existing_packs post_packs >del &&
+   test_line_count = 0 del && # No packs are deleted
+   test_line_count = 2 new # There is one new pack and its .idx
+'
+
+
 test_done
-- 
2.8.0.rc4.22.g8ae061a



[PATCH v4 2/2] repack: die on incremental + write-bitmap-index

2016-12-28 Thread David Turner
The bitmap index only works for single packs, so requesting an
incremental repack with bitmap indexes makes no sense.

Signed-off-by: David Turner <dtur...@twosigma.com>
---
 builtin/repack.c| 9 +
 t/t5310-pack-bitmaps.sh | 8 +++-
 t/t6500-gc.sh   | 8 
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 80dd06b..9c3dd09 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -18,6 +18,12 @@ static const char *const git_repack_usage[] = {
NULL
 };
 
+static const char incremental_bitmap_conflict_error[] = N_(
+"Incremental repacks are incompatible with bitmap indexes.  Use \n"
+"--no-write-bitmap-index or disable the pack.writebitmaps configuration."
+);
+
+
 static int repack_config(const char *var, const char *value, void *cb)
 {
if (!strcmp(var, "repack.usedeltabaseoffset")) {
@@ -206,6 +212,9 @@ int cmd_repack(int argc, const char **argv, const char 
*prefix)
if (pack_kept_objects < 0)
pack_kept_objects = write_bitmaps;
 
+   if (write_bitmaps && !(pack_everything & ALL_INTO_ONE))
+   die(incremental_bitmap_conflict_error);
+
packdir = mkpathdup("%s/pack", get_object_directory());
packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid());
 
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index b4c7a6f..424bec7 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -118,12 +118,10 @@ test_expect_success 'fetch (partial bitmap)' '
test_cmp expect actual
 '
 
-test_expect_success 'incremental repack cannot create bitmaps' '
+test_expect_success 'incremental repack fails when bitmaps are requested' '
test_commit more-1 &&
-   find .git/objects/pack -name "*.bitmap" >expect &&
-   git repack -d &&
-   find .git/objects/pack -name "*.bitmap" >actual &&
-   test_cmp expect actual
+   test_must_fail git repack -d 2>err &&
+   test_i18ngrep "Incremental repacks are incompatible with bitmap" err
 '
 
 test_expect_success 'incremental repack can disable bitmaps' '
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index def2aca..1762dfa 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -54,15 +54,15 @@ test_expect_success 'auto gc with too many loose objects 
does not attempt to cre
test_commit 410 &&
# Our first gc will create a pack; our second will create a second pack
git gc --auto &&
-   ls .git/objects/pack | grep -v bitmap | sort >existing_packs &&
+   ls .git/objects/pack | sort >existing_packs &&
test_commit 523 &&
test_commit 790 &&
 
git gc --auto 2>err &&
test_i18ngrep ! "^warning:" err &&
-   ls .git/objects/pack/ | grep -v bitmap | sort >post_packs &&
-   comm --output-delimiter , -1 -3 existing_packs post_packs >new &&
-   comm --output-delimiter , -2 -3 existing_packs post_packs >del &&
+   ls .git/objects/pack/ | sort >post_packs &&
+   comm -1 -3 existing_packs post_packs >new &&
+   comm -2 -3 existing_packs post_packs >del &&
test_line_count = 0 del && # No packs are deleted
test_line_count = 2 new # There is one new pack and its .idx
 '
-- 
2.8.0.rc4.22.g8ae061a



[PATCH v3 0/2] pack bitmaps and incremental packing

2016-12-23 Thread David Turner
Cleaned up the first patch's test code.
Decided to remove the unnecessary checks from the second patch.

David Turner (2):
  auto gc: don't write bitmaps for incremental repacks
  repack: die on incremental + write-bitmap-index

 builtin/gc.c            |  9 ++++++++-
 builtin/repack.c        |  9 +++++++++
 t/t5310-pack-bitmaps.sh |  5 +++--
 t/t6500-gc.sh           | 25 +++++++++++++++++++++++++
 4 files changed, 45 insertions(+), 3 deletions(-)

-- 
2.8.0.rc4.22.g8ae061a



[PATCH v3 2/2] repack: die on incremental + write-bitmap-index

2016-12-23 Thread David Turner
The bitmap index only works for single packs, so requesting an
incremental repack with bitmap indexes makes no sense.

Signed-off-by: David Turner <dtur...@twosigma.com>
---
 builtin/repack.c        | 9 +++++++++
 t/t5310-pack-bitmaps.sh | 8 +++-----
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 80dd06b..9c3dd09 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -18,6 +18,12 @@ static const char *const git_repack_usage[] = {
NULL
 };
 
+static const char incremental_bitmap_conflict_error[] = N_(
+"Incremental repacks are incompatible with bitmap indexes.  Use \n"
+"--no-write-bitmap-index or disable the pack.writebitmaps configuration."
+);
+
+
 static int repack_config(const char *var, const char *value, void *cb)
 {
if (!strcmp(var, "repack.usedeltabaseoffset")) {
@@ -206,6 +212,9 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
if (pack_kept_objects < 0)
pack_kept_objects = write_bitmaps;
 
+   if (write_bitmaps && !(pack_everything & ALL_INTO_ONE))
+   die(incremental_bitmap_conflict_error);
+
packdir = mkpathdup("%s/pack", get_object_directory());
packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid());
 
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index b4c7a6f..424bec7 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -118,12 +118,10 @@ test_expect_success 'fetch (partial bitmap)' '
test_cmp expect actual
 '
 
-test_expect_success 'incremental repack cannot create bitmaps' '
+test_expect_success 'incremental repack fails when bitmaps are requested' '
test_commit more-1 &&
-   find .git/objects/pack -name "*.bitmap" >expect &&
-   git repack -d &&
-   find .git/objects/pack -name "*.bitmap" >actual &&
-   test_cmp expect actual
+   test_must_fail git repack -d 2>err &&
+   test_i18ngrep "Incremental repacks are incompatible with bitmap" err
 '
 
 test_expect_success 'incremental repack can disable bitmaps' '
-- 
2.8.0.rc4.22.g8ae061a



[PATCH v3 1/2] auto gc: don't write bitmaps for incremental repacks

2016-12-23 Thread David Turner
When git gc --auto does an incremental repack of loose objects, we do
not expect to be able to write a bitmap; it is very likely that
objects in the new pack will have references to objects outside of the
pack.  So we shouldn't try to write a bitmap, because doing so will
likely issue a warning.

This warning was making its way into gc.log.  When the gc.log was
present, future auto gc runs would refuse to run.
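
To make the failure loop concrete (a sketch, not verbatim output;
assumes enough loose objects to put auto gc into incremental-repack
mode, with pack.writebitmaps set):

$ git gc --auto        # incremental repack; bitmap write warns
$ test -s .git/gc.log  # the warning was captured here
$ git gc --auto        # refuses to run while gc.log exists

With this patch the incremental repack is invoked with
--no-write-bitmap-index, so no warning is emitted and gc.log stays
empty.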

Patch by Jeff King.
Bug report, test, and commit message by David Turner.

Signed-off-by: David Turner <dtur...@twosigma.com>
Signed-off-by: Jeff King <p...@peff.net>
---
 builtin/gc.c  |  9 ++++++++-
 t/t6500-gc.sh | 25 +++++++++++++++++++++++++
 2 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index 069950d..331f219 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -191,6 +191,11 @@ static void add_repack_all_option(void)
}
 }
 
+static void add_repack_incremental_option(void)
+{
+   argv_array_push(&repack, "--no-write-bitmap-index");
+}
+
 static int need_to_gc(void)
 {
/*
@@ -208,7 +213,9 @@ static int need_to_gc(void)
 */
if (too_many_packs())
add_repack_all_option();
-   else if (!too_many_loose_objects())
+   else if (too_many_loose_objects())
+   add_repack_incremental_option();
+   else
return 0;
 
if (run_hook_le(NULL, "pre-auto-gc", NULL))
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 5d7d414..def2aca 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -43,4 +43,29 @@ test_expect_success 'gc is not aborted due to a stale symref' '
)
 '
 
+test_expect_success 'auto gc with too many loose objects does not attempt to create bitmaps' '
+   test_config gc.auto 3 &&
+   test_config gc.autodetach false &&
+   test_config pack.writebitmaps true &&
+   # We need to create two objects whose sha1s start with 17
+   # since this is what git gc counts.  As it happens, these
+   # two blobs will do so.
+   test_commit 263 &&
+   test_commit 410 &&
+   # Our first gc will create a pack; our second will create a second pack
+   git gc --auto &&
+   ls .git/objects/pack | grep -v bitmap | sort >existing_packs &&
+   test_commit 523 &&
+   test_commit 790 &&
+
+   git gc --auto 2>err &&
+   test_i18ngrep ! "^warning:" err &&
+   ls .git/objects/pack/ | grep -v bitmap | sort >post_packs &&
+   comm --output-delimiter , -1 -3 existing_packs post_packs >new &&
+   comm --output-delimiter , -2 -3 existing_packs post_packs >del &&
+   test_line_count = 0 del && # No packs are deleted
+   test_line_count = 2 new # There is one new pack and its .idx
+'
+
+
 test_done
-- 
2.8.0.rc4.22.g8ae061a
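A note on the "17" trick in the test above: git gc --auto does not
enumerate every loose object; too_many_loose_objects() counts only the
entries under .git/objects/17/ and extrapolates by 256 (the number of
fan-out directories). With gc.auto = 3, two loose objects whose sha1s
begin with 17 are therefore enough to trigger an incremental repack.
A rough way to watch the estimate (a sketch, not the exact code):

$ ls .git/objects/17 | wc -l    # gc extrapolates from this one bucket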



[PATCH v2 1/2] auto gc: don't write bitmaps for incremental repacks

2016-12-23 Thread David Turner
When git gc --auto does an incremental repack of loose objects, we do
not expect to be able to write a bitmap; it is very likely that
objects in the new pack will have references to objects outside of the
pack.  So we shouldn't try to write a bitmap, because doing so will
likely issue a warning.

This warning was making its way into gc.log.  When the gc.log was
present, future auto gc runs would refuse to run.

Patch by Jeff King.
Bug report, test, and commit message by David Turner.

Signed-off-by: David Turner <dtur...@twosigma.com>
Signed-off-by: Jeff King <p...@peff.net>
---
 builtin/gc.c  |  9 ++++++++-
 t/t6500-gc.sh | 22 ++++++++++++++++++++++
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index 069950d..331f219 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -191,6 +191,11 @@ static void add_repack_all_option(void)
}
 }
 
+static void add_repack_incremental_option(void)
+{
+   argv_array_push(&repack, "--no-write-bitmap-index");
+}
+
 static int need_to_gc(void)
 {
/*
@@ -208,7 +213,9 @@ static int need_to_gc(void)
 */
if (too_many_packs())
add_repack_all_option();
-   else if (!too_many_loose_objects())
+   else if (too_many_loose_objects())
+   add_repack_incremental_option();
+   else
return 0;
 
if (run_hook_le(NULL, "pre-auto-gc", NULL))
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 5d7d414..b83a08c 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -43,4 +43,26 @@ test_expect_success 'gc is not aborted due to a stale symref' '
)
 '
 
+test_expect_success 'auto gc with too many loose objects does not attempt to create bitmaps' '
+   test_config gc.auto 3 &&
+   test_config gc.autodetach false &&
+   test_config pack.writebitmaps true &&
+   # We need to create two objects whose sha1s start with 17
+   # since this is what git gc counts.  As it happens, these
+   # two blobs will do so.
+   test_commit 263 &&
+   test_commit 410 &&
+   # Our first gc will create a pack; our second will create a second pack
+   git gc --auto &&
+   ls .git/objects/pack |grep -v bitmap >existing_packs &&
+   test_commit 523 &&
+   test_commit 790 &&
+
+   git gc --auto 2>err &&
+   test_i18ngrep ! "^warning:" err &&
+   ls .git/objects/pack/ | grep -v bitmap >post_packs &&
+   ! test_cmp existing_packs post_packs
+'
+
+
 test_done
-- 
2.8.0.rc4.22.g8ae061a



[PATCH v2 2/2] repack: die on incremental + write-bitmap-index

2016-12-23 Thread David Turner
The bitmap index only works for single packs, so requesting an
incremental repack with bitmap indexes makes no sense.

Signed-off-by: David Turner <dtur...@twosigma.com>
---
 builtin/repack.c        | 9 +++++++++
 t/t5310-pack-bitmaps.sh | 5 +++--
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 80dd06b..9c3dd09 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -18,6 +18,12 @@ static const char *const git_repack_usage[] = {
NULL
 };
 
+static const char incremental_bitmap_conflict_error[] = N_(
+"Incremental repacks are incompatible with bitmap indexes.  Use \n"
+"--no-write-bitmap-index or disable the pack.writebitmaps configuration."
+);
+
+
 static int repack_config(const char *var, const char *value, void *cb)
 {
if (!strcmp(var, "repack.usedeltabaseoffset")) {
@@ -206,6 +212,9 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
if (pack_kept_objects < 0)
pack_kept_objects = write_bitmaps;
 
+   if (write_bitmaps && !(pack_everything & ALL_INTO_ONE))
+   die(incremental_bitmap_conflict_error);
+
packdir = mkpathdup("%s/pack", get_object_directory());
packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid());
 
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index b4c7a6f..e9a2771 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -118,10 +118,11 @@ test_expect_success 'fetch (partial bitmap)' '
test_cmp expect actual
 '
 
-test_expect_success 'incremental repack cannot create bitmaps' '
+test_expect_success 'incremental repack fails when bitmaps are requested' '
test_commit more-1 &&
find .git/objects/pack -name "*.bitmap" >expect &&
-   git repack -d &&
+   test_must_fail git repack -d 2>err &&
+   test_i18ngrep "Incremental repacks are incompatible with bitmap" err &&
find .git/objects/pack -name "*.bitmap" >actual &&
test_cmp expect actual
 '
-- 
2.8.0.rc4.22.g8ae061a



RE: [PATCHv4 0/5] git-rm absorbs submodule git directory before deletion

2016-12-19 Thread David Turner
Other than Brandon's issue on 3/5 (which I noticed myself but wanted to 
double-check that my mailer wasn't playing tricks on me and so lost the race to 
report), this version looks good.

> -----Original Message-----
> From: Stefan Beller [mailto:sbel...@google.com]
> Sent: Monday, December 19, 2016 6:28 PM
> To: gits...@pobox.com
> Cc: git@vger.kernel.org; bmw...@google.com; sand...@crustytoothpaste.net;
> David Turner; Stefan Beller
> Subject: [PATCHv4 0/5] git-rm absorbs submodule git directory before
> deletion
> 
> v4:
> * reworded commit messages of the last 2 patches
> * introduced a new patch introducing {run,start,finish}_command_or_die
> * found an existing function in dir.h to use to remove a directory
>   which deals gracefully with the cornercases, that Junio pointed out.
> 
> v3:
> * removed the patch to enhance ok_to_remove_submodule to absorb the
> submodule
>   if needed
> * Removed all the error reporting from git-rm that was related to
> submodule
>   git directories not absorbed.
> * instead just absorb the git repositories or let the absorb function die
>   with an appropriate error message.
> 
> v2:
> * new base where to apply the patch:
>   sb/submodule-embed-gitdir merged with sb/t3600-cleanup.
>   I got merge conflicts and resolved them this way:
> #@@@ -709,9 -687,10 +687,9 @@@ test_expect_success 'checking out a com
> #  git commit -m "submodule removal" submod &&
> #  git checkout HEAD^ &&
> #  git submodule update &&
> #- git checkout -q HEAD^ 2>actual &&
> #+ git checkout -q HEAD^ &&
> #  git checkout -q master 2>actual &&
> # -echo "warning: unable to rmdir submod: Directory not empty"
> >expected &&
> # -test_i18ncmp expected actual &&
> # +test_i18ngrep "^warning: unable to rmdir submod:" actual &&
> #  git status -s submod >actual &&
> #  echo "?? submod/" >expected &&
> #  test_cmp expected actual &&
> #
> 
> * improved commit message in "ok_to_remove_submodule: absorb the submodule
> git dir"
>   (David Turner offered me some advice on how to write better English off
> list)
> * simplified code in last patch:
>   -> dropped wrong comment for fallthrough
>   -> moved redundant code out of both bodies of an if-clause.
> * Fixed last patch's commit message to have "or_die" instead of or_dir.
> 
> v1:
> The "checkout --recurse-submodules" series got too large to comfortably
> send it out for review, so I had to break it up into smaller series'; this
> is the first subseries, but it makes sense on its own.
> 
> This series teaches git-rm to absorb the git directory of a submodule
> instead of failing and complaining about the git directory preventing
> deletion.
> 
> It applies on origin/sb/submodule-embed-gitdir.
> 
> Any feedback welcome!
> 
> Thanks,
> Stefan
> 
> Stefan Beller (5):
>   submodule.h: add extern keyword to functions
>   submodule: modernize ok_to_remove_submodule to use argv_array
>   run-command: add {run,start,finish}_command_or_die
>   submodule: add flags to ok_to_remove_submodule
>   rm: absorb a submodules git dir before deletion
> 
>  builtin/rm.c  | 82 +++----
>  run-command.c | 28
>  run-command.h |  4 +++
>  submodule.c   | 27 ++--
>  submodule.h   | 58 --
>  t/t3600-rm.sh | 39 +++-
>  6 files changed, 114 insertions(+), 124 deletions(-)
> 
> --
> 2.11.0.rc2.53.gb7b3fba.dirty
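
For context, the user-visible effect of the series is roughly the
following (a sketch; assumes a submodule "submod" with an embedded .git
directory, and uses "git submodule absorbgitdirs" from the
sb/submodule-embed-gitdir topic this series applies on top of):

$ git rm submod     # used to fail because of submod/.git
$ # with this series applied, conceptually equivalent to:
$ git submodule absorbgitdirs submod
$ git rm submod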


