RE: cherry-pick very slow on big repository

2017-11-10 Thread Kevin Willford
Since this is happening during a merge, you might need to use merge.renameLimit
or the merge strategy option of -Xno-renames.  Although the code does fallback
to use the diff.renameLimit but there is still a lot that is done before even 
checking
the rename limit so I would first try getting renames turned off.

Thanks,
Kevin

> -Original Message-
> From: git-ow...@vger.kernel.org [mailto:git-ow...@vger.kernel.org] On Behalf
> Of Peter Krefting
> Sent: Friday, November 10, 2017 7:05 AM
> To: Derrick Stolee 
> Cc: Jeff King ; Git Mailing List 
> Subject: Re: cherry-pick very slow on big repository
> 
> Derrick Stolee:
> 
> > Git is spending time detecting renames, which implies you probably
> > renamed a folder or added and deleted a large number of files. This
> > rename detection is quadratic (# adds times # deletes).
> 
> Yes, a couple of directories with a lot of template files have been
> renamed (and some removed, some added) between the current development
> branch and this old maintenance branch. I get the "Performing inexact
> rename detection" a lot when merging changes in the other direction.
> 
> However, none of them applies to these particular commits, which only
> touches files that are in the exact same location on both branches.
> 
> > You can remove this rename detection by running your cherry-pick
> > with `git -c diff.renameLimit=1 cherry-pick ...`
> 
> That didn't work, actually it failed to finish with this setting in
> effect, it hangs in such a way that I can't stop it with Ctrl+C
> (neither when running from the command line, nor when running inside
> gdb). It didn't finish in the 20 minutes I gave it.
> 
> I also tried with diff.renames=false, which also seemed to fail.
> 
> --
> \\// Peter -
> https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.softw
> olves.pp.se%2F=02%7C01%7Ckewillf%40microsoft.com%7C6b831a75739e4
> 0428d3808d52844106c%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636
> 459195209466999=kJtNLAs1LSoPy%2B%2BNADJkuEBPMZVcxkSkKzOEEeIG
> VpM%3D=0


RE: [PATCH 1/1] reset: fix reset when using the sparse-checkout feature.

2017-09-15 Thread Kevin Willford
From: Junio C Hamano [mailto:gits...@pobox.com]
Sent: Thursday, September 14, 2017 11:00 PM
> 
> Kevin Willford <kewi...@microsoft.com> writes:
> 
> > 1. Does this statement, "I only care about the files in this
> > sparse checkout, and do not concern me with anything else", mean
> > that git should not change files outside the sparse-checkout whether
> > that be in the working directory or in the index?  Or does that only
> > apply to the working directory and the index version of files can
> > change to whatever git the git command would do without using
> > sparse?  For example if I am doing a 'git reset HEAD~1'  should the
> > version in the index of files outside the sparse not be changed or
> > change to the HEAD~1 version with the skip-worktree bit on?
> 
> My understanding of the purpose of "sparse checkout" thing is that
> the user still wants to create correct whole-tree commit even the
> user does not have the whole-tree checkout.  The object names for
> blobs recorded in the index that are meant to be included in the
> next commit MUST be the same as those that would be in the index
> when the "sparse" feature is not in use.  "reset HEAD~1" should
> match the index entries to the tree entries in HEAD~1.  So, the
> latter, I would think, among your two alternatives.
> 
> IOW, after "git reset HEAD~", if you drop the skip-worktree bit from
> all index entries, "git diff --cached HEAD" must say "there is no
> changes".
> 
> The only difference between the "sparse" and normal would be that,
> because the "sparse" user does not intend to change anything outside
> the "sparse" area, these paths outside her "sparse" area would not
> materialize on the filesystem.  For the next "write-tree" out of the
> index to still write the correct tree out, the entries outside her
> "sparse" area in the index MUST match the tree of the commit she
> started working from.
> 

Makes sense.  And even though the reset might only change entries
outside the sparse area and the next status will report "nothing to commit,
working tree clean",  that's okay because the user hasn't changed
anything in their sparse area and intended to roll back the index to
whatever they specified in their reset command.

> > 2. How will this work with other git commands like merge, rebase,
> > cherry-pick, etc.?
> > 3. What about when there is a merge conflict with a file that is outside
> > the sparse checkout?
> 
> I would say, rather than forbidding such a merge, it should let her
> see and deal with the conflict by dropping the "this is outside the
> sparse area, so do not bother materializing it to the filesystem"
> bit, but tell her loudly what it did ("I checked out a half-merged
> file outside your sparse-checkout area because you'd need it while
> resolving the conflict").  By doing things that way, the user can
> decide if she wants to go ahead and complete the merge, even if the
> conflict is outside the area she is currently interested in, or
> postpone the merge and continue working on what she has been working
> on inside the narrowed-down area first.
> 
> I do not have a strong opinion whether the sparse-checkout
> configuration file should be adjusted to match when the command must
> tentatively bust the sparse checkout area; I'd imagine it can be
> argued either way.
> 
> Note that "sparse" is not my itch, and I would not be surprised if
> those who designed it may want it to work differently from my
> knee-jerk reaction in the previous two paragraphs, and I may even
> find such an alternative solution preferable.
> 
> But it is highly unlikely for any sensible solution would violate
> the basic premise, i.e. "the indexed contents will stay the same as
> the case without any 'sparse', so the next write-tree will do the
> right thing".

There was one other case that I thought about while implementing
this approach and it is when the user creates a file that is outside their
sparse definition.  From your explanation above I will attempt to
explain how I think it should work and please correct me if you see
it working differently.

The user creates a file that is outside the sparse area and it will show
up as untracked.  No problem here since the untracked are outside the
scope of using sparse.  Next the user adds the untracked file to the index.
The skip-worktree bit should be off and stay off since the user could make
additional changes and want to add them.  Once the user commits the
newly created file, I could see turning on the skip-worktree bit and
removing the file from the working tree after the c

RE: [PATCH 1/1] reset: fix reset when using the sparse-checkout feature.

2017-09-14 Thread Kevin Willford
> From: Junio C Hamano [mailto:gits...@pobox.com]
> Sent: Wednesday, September 13, 2017 4:18 PM
> 
> Jacob Keller  writes:
> 
> > By definition if you do a sparse checkout, you're telling git "I only
> > care about the files in this sparse checkout, and do not concern me
> > with anything else"... So the proposed fix is "since git cleared the
> > skip-worktree flag, we should actually also copy the file out again."
> > but I think the real problem is that we're clearing skip worktree to
> > begin with?
> 
> As this 100% agree with what I've been saying, I do not have
> anything to add to the discussion at the moment, so I'll stop
> speaking now but will continue to monitor the thread so that I may
> hear a new argument and datapoint that would make me change my mind.
> 
> Thanks for a healthy discussion.

Hi Junio,

Thanks for the feedback.  I will give this implementation a try.  I still
have the following unanswered questions, if you wouldn't mind
answering.

1. Does this statement, "I only care about the files in this
sparse checkout, and do not concern me with anything else", mean
that git should not change files outside the sparse-checkout whether
that be in the working directory or in the index?  Or does that only
apply to the working directory and the index version of files can
change to whatever git the git command would do without using
sparse?  For example if I am doing a 'git reset HEAD~1'  should the
version in the index of files outside the sparse not be changed or
change to the HEAD~1 version with the skip-worktree bit on?
2. How will this work with other git commands like merge, rebase,
cherry-pick, etc.?  
3. What about when there is a merge conflict with a file that is outside
the sparse checkout?  Should there not be a conflict because
git shouldn't be trying to merge the file since it is outside the sparse
checkout area?  Should the conflicted file get written outside the
sparse checkout so the user can resolve it?

Thanks,
Kevin



RE: [PATCH 1/1] reset: fix reset when using the sparse-checkout feature.

2017-09-13 Thread Kevin Willford
> From: Jacob Keller [mailto:jacob.kel...@gmail.com]
> Sent: Tuesday, September 12, 2017 7:39 PM
> 
> On Tue, Sep 12, 2017 at 4:30 PM, Kevin Willford <kewi...@microsoft.com> wrote:
> >
> > Sorry.  It was not in the sparse-checkout file.
> > $ git config core.sparsecheckout true
> > $ echo /i > .git/info/sparse-checkout
> > $ git checkout -f
> > was ran in the modified file case as well and "i" was the only file in the
> > working directory before reset.
> >
> 
> 
> I'm assuming then that you mean that some file "m" is modified by the
> commit, and do not mean to say that it has local modifications in the
> working tree? That is not what I had inferred from earlier, so I was
> very much confused.
> 

Yes

> In this example, the only file actually checked out is "i", as
> everything else will have the skip-worktree bit set..?
> 

Yes

> So doing git reset HEAD~1 will reset the branch back one commit, and
> now because of this reset is clearing the skip worktree flag, and
> since you never had a copy of it checked out it is notified as
> deleted, because it's no longer marked as skip-worktree?
> 

Correct

> 
> I think the real fix is to stop having reset clear skip-worktree, no?
> 
> By definition if you do a sparse checkout, you're telling git "I only
> care about the files in this sparse checkout, and do not concern me
> with anything else"... So the proposed fix is "since git cleared the
> skip-worktree flag, we should actually also copy the file out again."
> but I think the real problem is that we're clearing skip worktree to
> begin with?

This certainly is an option but I would have some questions with this
approach.  Does this statement, "I only care about the files in this
sparse checkout, and do not concern me with anything else", mean
that git should not change files outside the sparse-checkout whether
they are in the working directory or the index?  Or does that only
apply to the working directory and the index version of files can
change to whatever git decides?  So in the example above would
"m" be the HEAD~1 version of the file in the index or the HEAD
version before the reset?  Does this apply to all git commands,
merge, rebase, cherry-pick, etc?  What about when there is a merge
conflict with a file that is outside the sparse checkout?

Seems to me it is a lot more complex than only caring about the files
in the sparse checkout and no concern for anything else.  Personally
I would like to error on the side of letting the user decide what they
want to do, even if that means turning off the skip-worktree bit and
putting the working directory into an expected state.




RE: [PATCH 1/1] reset: fix reset when using the sparse-checkout feature.

2017-09-12 Thread Kevin Willford
> From: Jacob Keller [mailto:jacob.kel...@gmail.com]
> Sent: Tuesday, September 12, 2017 4:29 PM
> 
> On Tue, Sep 12, 2017 at 1:20 PM, Kevin Willford <kewi...@microsoft.com> wrote:
> >
> > I think this is where I need to do a better job of explaining so here is a
> > simple example.
> >
> > I have a file "a" that was added in the latest commit.
> > $ git log --oneline
> > c1fa646 (HEAD -> reset, master) add file a
> > 40b342c Initial commit with file i
> >
> > Running the reset without using a sparse-checkout file
> >
> > $ git reset HEAD~1
> > $ git status
> > On branch reset
> > Untracked files:
> >   (use "git add ..." to include in what will be committed)
> >
> > a
> >
> > nothing added to commit but untracked files present (use "git add" to track)
> >
> > Turning on sparse-checkout and running checkout to make my working
> > directory sparse
> >
> > $ git config core.sparsecheckout true
> > $ echo /i > .git/info/sparse-checkout
> > $ git checkout -f
> >
> > Running reset gives me
> > $ git reset HEAD~1
> > $ git status
> > On branch reset
> > nothing to commit, working tree clean
> > $ git ls-files
> > i
> >
> > file a is gone.  Not in the index and not in the working directory.
> > Nothing to let the user know that anything changed.
> >
> > With a modified file no sparse-checkout
> > $ git log --oneline
> > 6fbd34a (HEAD -> reset, modified) modified file m
> > c734d72 Initial commit with file i and m
> > $ git reset HEAD~1
> > Unstaged changes after reset:
> > M   m
> > $ git status
> > On branch reset
> > Changes not staged for commit:
> >   (use "git add ..." to update what will be committed)
> >   (use "git checkout -- ..." to discard changes in working directory)
> >
> > modified:   m
> >
> > no changes added to commit (use "git add" and/or "git commit -a")
> >
> > With sparse-checkout
> > $ git reset HEAD~1
> > Unstaged changes after reset:
> > D   m
> > $ git status
> > On branch reset
> > Changes not staged for commit:
> >   (use "git add/rm ..." to update what will be committed)
> >   (use "git checkout -- ..." to discard changes in working directory)
> >
> > deleted:m
> >
> > no changes added to commit (use "git add" and/or "git commit -a")
> >
> 
> Wasn't "m" outside the sparse checkout? Or was it a file in the sparse
> checkout? I mean to say, the file after setting up sparse checkout was
> one of the "interesting" files that sparse checked out?
> 
> Or was it in fact a separate file which wasn't there?

Sorry.  It was not in the sparse-checkout file. 
$ git config core.sparsecheckout true
$ echo /i > .git/info/sparse-checkout
$ git checkout -f
was ran in the modified file case as well and "i" was the only file in the
working directory before reset. 

> 
> I would think that in sparse-checkout world, you should only *ever*
> have the files you list in sparse.
> 
> So files outside sparse world should be ignored, not shown and not
> show up in status, but they should absolutely not show up in the
> working tree either.

Sparse checkout uses the skip-worktree bit which has the following
documentation:
https://git-scm.com/docs/git-update-index 
"Skip-worktree bit can be defined in one (long) sentence: When reading
an entry, if it is marked as skip-worktree, then Git pretends its working
directory version is up to date and read the index version instead.

To elaborate, "reading" means checking for file existence, reading file
attributes or file content. The working directory version may be present
or absent. If present, its content may match against the index version or
not. Writing is not affected by this bit, content safety is still first 
priority.
Note that Git can update working directory file, that is marked skip-worktree,
if it is safe to do so (i.e. working directory version matches index version)"

I'm not saying that is the right behavior, just pointing out the documentation.
And from my experience git will turn on and off the skip-worktree flag
depending on the command as we see happening with reset.

> 
> You're not "changing" any commits, because the status of the file at
> HEAD~1 is exactly what HEAD~1 says it is, but you just don't have a
> checked out copy of it.

In the case of reset yes and it matches HEAD and if the sparse flag was on,
the user does not know that the file was chan

RE: [PATCH 1/1] reset: fix reset when using the sparse-checkout feature.

2017-09-12 Thread Kevin Willford
> From: Junio C Hamano [mailto:gits...@pobox.com]
> Sent: Monday, September 11, 2017 9:57 PM
> 
> Let's imagine a path P that is outside the sparse checkout area.
> And we check out a commit that has P.  P would still be recorded in
> the index but it would not materialize in the working tree.  "git
> status" and friends are asked not to treat this as "locally removed",
> to prevent "commit -a" from recording a removal, of course.
> 
> Now, let's further imagine that you have a checkout of the same
> project but at a commit that does not have P.  Then you reset to
> another commit that does have P.  My understanding of what Kevin's
> first test wants to demonstrate is that the index is populated with
> P (because you did reset to a commit with that path) but it does not
> materialize in the working tree (perhaps because that is outside the
> sparse checkout area?), yet there is something missing compared to
> the earlier case where "git status" and friends are asked not to
> treat P as "locally removed".  They instead show P as locally removed,
> and "commit -a" would record a removal---that is indeed a problem.
> 
> Am I reading the problem description correctly so far?  If so, then
> my answer to my first question (are we solving a right problem?) is
> yes.
> 

I think this is where I need to do a better job of explaining so here is a
simple example.

I have a file "a" that was added in the latest commit.  
$ git log --oneline
c1fa646 (HEAD -> reset, master) add file a
40b342c Initial commit with file i

Running the reset without using a sparse-checkout file

$ git reset HEAD~1
$ git status
On branch reset
Untracked files:
  (use "git add ..." to include in what will be committed)

a

nothing added to commit but untracked files present (use "git add" to track)

Turning on sparse-checkout and running checkout to make my working
directory sparse

$ git config core.sparsecheckout true
$ echo /i > .git/info/sparse-checkout
$ git checkout -f

Running reset gives me 
$ git reset HEAD~1
$ git status
On branch reset
nothing to commit, working tree clean
$ git ls-files
i

file a is gone.  Not in the index and not in the working directory.
Nothing to let the user know that anything changed.

With a modified file no sparse-checkout
$ git log --oneline
6fbd34a (HEAD -> reset, modified) modified file m
c734d72 Initial commit with file i and m
$ git reset HEAD~1
Unstaged changes after reset:
M   m
$ git status
On branch reset
Changes not staged for commit:
  (use "git add ..." to update what will be committed)
  (use "git checkout -- ..." to discard changes in working directory)

modified:   m

no changes added to commit (use "git add" and/or "git commit -a")

With sparse-checkout
$ git reset HEAD~1
Unstaged changes after reset:
D   m
$ git status
On branch reset
Changes not staged for commit:
  (use "git add/rm ..." to update what will be committed)
  (use "git checkout -- ..." to discard changes in working directory)

deleted:m

no changes added to commit (use "git add" and/or "git commit -a")

I think we can agree that this is not the correct behavior.

> But this time, I am trying to see if the approach is good.  I am not
> so sure if the approach taken by this patch is so obviously good as
> you seem to think.  A logical consequence of the approach "git
> status thinks that P appears in the index and missing in the working
> tree is a local removal, so let's squelch it by creating the file in
> the working tree" is that we will end up fully populating the
> working tree with paths that are clearly outside the area the user
> designated as the sparse checkout area---surely that may squelch
> "status", but to me, it looks like it is breaking what the user
> wanted to achieve with "sparse checkout" in the first place.
> 

I don't think that we are trying to "squelch" status so much as make
it consistent with what the user would expect to happen.  If that means
not resetting entries with the skip-worktree bit or resetting the entries
but keeping the skip-worktree bit on, okay, but I would argue that is
not what the user wants because if you are now saying that sparse
means git will not change files outside the sparse-checkout entries,
what about merge, rebase, cherry-pick, apply?  Should those only
change the files that are in the sparse definition?  If so we would
be changing the commits from the original, i.e.  cherry-pick 123 would
create a different commit depending on whether or not you are using
sparse as well as a different commit depending on what is in your
sparse-checkout.  

I see reset being a similar scenario in that if everything is clean, after I
"reset HEAD~1" I should be able to run "add ." + "commit" and have
the same commit as before the reset.  If this is changed to only reset
the sparse entries, there will be staged changes after the reset because
HEAD has changed but we didn't update the index versions of the files.
If we do update the index with the "HEAD~1" 

RE: What's cooking in git.git (Sep 2017, #02; Mon, 11)

2017-09-11 Thread Kevin Willford
> 
> * kw/write-index-reduce-alloc (2017-09-08) 2 commits
>  - read-cache: fix index corruption with index v4
>  - Add t/helper/test-write-cache to .gitignore
> 
>  Expecting a reroll.
>  cf.  q...@mail.gmail.com>
> 
I didn't submit these patches so what would you like
me to do?

The reroll for read-cache fix was submitted here
https://public-inbox.org/git/20170907192412.8085-1-t.gumme...@gmail.com/

- Kevin



RE: [PATCH 1/1] reset: fix reset when using the sparse-checkout feature.

2017-09-08 Thread Kevin Willford
> From: Junio C Hamano [mailto:gits...@pobox.com]
> Sent: Friday, September 8, 2017 9:18 PM
> 
> Kevin Willford <kewi...@microsoft.com> writes:
> 
> > 1. reset mixed when there were files that were added
> >
> > In this case the index will no longer have the entry at all because
> > the reset is making the index look like before the file was added
> > which didn't have it. When not using the sparse-checkout this is fine
> > because the file is in the working directory and the reset didn't touch
> > it.  But using the sparse-checkout on that file and if the file was not
> > in the working directory, the index gets reset and the entry for that
> > file is gone and if we don't put the index version of the file before the
> > reset into the working directory, then we have lost the content for
> > that file
> 
> I do not quite understand this argument.  If you do
> 
>   edit $path
>   git add $path
>   rm $path
>   git reset
> 
> for a $path that is not involved in the sparse thing, the version
> that was previously indexed will be lost, but that is fine---the
> user said that version is expendable by saying "reset".
> 
> How would that be different when the $path were not to be
> materialized in the working tree due to sparseness?  Where did that
> "blob" object in the index immediately before you called "reset"
> came from, and why do you say that the user does *not* consider that
> one expendable, unlike the case for non-sparse path example above?
> 

I guess that I should have said files that were newly added, meaning they
are new files that were created and added in the previous commit.
 
I think that the difference is that the user explicitly removed the file. When
using sparse it is git that is causing the removal of the file.  For example
if I have /file in my spare-checkout file so that I am only working on the one
file.  The previous commit had new file2 added and I run a git reset HEAD~1.
I as the user do not expect that file2 just disappear, but yet that is what
happens.  So from you example above if I do.

create $path
git add $path
git commit
git checkout // where $path is not in the sparse-checkout
git reset HEAD~1

$path will be gone yet I the user did not remove it.  I guess you could
argue that the user did when they specified their sparse-checkout and
ran checkout but I wouldn't know what would go missing unless I ran
a diff before the reset to see.  There could have been X number of
files created and added in the previous commit(s) and status after the
reset would not report them and they are gone.  So I could clone,
setup sparse-checkout, checkout, reset HEAD~X and possibly lose
data I didn't expect to.

In the modified case where the previous commits have modifications
to files outside the sparse-checkout at least the status after the reset reports
the file as deleted so the user sees that something has happened to it.

I suppose the entry could stay in the index with the skip-worktree bit
on and not removed like it is now so that a git reset will only apply
to the entries in the sparse-checkout?  That seems like it would be
changing the meaning of reset.

> I suspect that a similar reasoning would apply to your 2., but I
> didn't think it through.
> 
> The possible misconception, which I perceive in both of these, is
> that you are somehow disagreeing with this basic assumption: by
> saying "git reset []", the user is telling us that the
> version in the index, even if that is different from HEAD,
> , or the file in the working tree, is *unwanted* and be
> replaced with the one in HEAD (or  when given).  Touching
> the working tree files upon "git reset" is the last thing the user
> expects to happen.
> 

I agree with this when you are not dealing with a sparse-checkout.
When using a sparse-checkout I expect git not to touch things
outside of what I have specified in my sparse-checkout file.  If it
does, it should let me know or put my working directory in a
state that is expected.  Especially when it is changing the
skip-worktree bits causing files outside the sparse-checkout to be
reported incorrectly by status.

 



RE: [PATCH 1/1] reset: fix reset when using the sparse-checkout feature.

2017-09-08 Thread Kevin Willford
From: Junio C Hamano [mailto:gits...@pobox.com]
Sent: Friday, September 8, 2017 1:02 PM
> Kevin Willford <kewi...@microsoft.com> writes:
> 
> > diff --git a/builtin/reset.c b/builtin/reset.c
> > index d72c7d1c96..1b8bb45989 100644
> > --- a/builtin/reset.c
> > +++ b/builtin/reset.c
> > @@ -24,6 +24,7 @@
> >  #include "cache-tree.h"
> >  #include "submodule.h"
> >  #include "submodule-config.h"
> > +#include "dir.h"
> >
> >  static const char * const git_reset_usage[] = {
> > N_("git reset [--mixed | --soft | --hard | --merge | --keep] [-q]
> []"),
> > @@ -124,8 +125,32 @@ static void update_index_from_diff(struct
> diff_queue_struct *q,
> >
> > for (i = 0; i < q->nr; i++) {
> > struct diff_filespec *one = q->queue[i]->one;
> > +   struct diff_filespec *two = q->queue[i]->two;
> > +   int pos;
> > int is_missing = !(one->mode && !is_null_oid(>oid));
> > +   int was_missing = !two->mode && is_null_oid(>oid);
> > struct cache_entry *ce;
> > +   struct cache_entry *ceBefore;
> > +   struct checkout state = CHECKOUT_INIT;
> 
> The five new variables are only used in the new block, so it
> probably is better to limit their scope to the "we do something
> unusual when sparse checkout is in effect" block as well.  The scope
> for the latter two can further be narrowed down to "we do need to
> force a checkout of this entry" block.
> 
> We prefer to name our variables with underscore (e.g. ce_before)
> over camelCase (e.g. ceBefore) unless there is a compelling reason
> (e.g. a platform specific code in compat/ layer to match platform
> convention).
> 

Will update.

> > +
> > +   if (core_apply_sparse_checkout && !file_exists(two->path)) {
> > +   pos = cache_name_pos(two->path, strlen(two->path));
> > +   if ((pos >= 0 && ce_skip_worktree(active_cache[pos]))
> &&
> > +   (is_missing || !was_missing))
> > +   {
> > +   state.force = 1;
> > +   state.refresh_cache = 1;
> > +   state.istate = _index;
> > +   ceBefore = make_cache_entry(two->mode,
> > +   two->oid.hash,
> > +   two->path, 0, 0);
> > +   if (!ceBefore)
> > +   die(_("make_cache_entry failed for path
> '%s'"),
> > +   two->path);
> > +
> > +   checkout_entry(ceBefore, , NULL);
> > +   }
> > +   }
> 
> Can we tell between the case where the reason why the path was not
> there in the working tree was due to the path being excluded by the
> sparse checkout and the path being removed manually by the end user?
> 
> I guess ce_skip_worktree() check is sufficient; we force checkout only
> when the path is marked to be skipped due to "sparse" thing.
> 
> Do we have to worry about the reverse case, in which file_exists(two->path)
> is true (i.e. the user created a file there manually) even though
> the path is marked to be skipped due to "sparse" setting?
> 

I don't believe so because if the user has a file there whether they modified
it or not, it is what the user did and we just leave it there and a diff with
what the index gets reset to will show how the file is different from what the
index got reset to.

> Other than that, the patch looks quite cleanly done and well explained.
> 
> Thanks.
> 


RE: [PATCH 1/1] reset: fix reset when using the sparse-checkout feature.

2017-09-08 Thread Kevin Willford

From: Junio C Hamano [mailto:gits...@pobox.com]
Sent: Friday, September 8, 2017 1:12 PM
> Kevin Willford <kewi...@microsoft.com> writes:
> 
> > When using the sparse checkout feature the git reset command will add
> > entries to the index that will have the skip-worktree bit off but will
> > leave the working directory empty.  File data is lost because the index
> > version of the files have been changed but there is nothing that is in the
> > working directory.  This will cause the next status call to show either
> > deleted for files modified or deleting or nothing for files added.
> 
> Getting rid of sparseness may of course be one way to correct the
> discrepancy, but it is unclear to me if that is the best way to do
> so.  An alternative may be to add entries to the index that does
> have the bit on for paths that are meant to be skipped due to
> "sparse" thing, no?  Am I being naive, missing some reason why that
> would give us a worse result?

There are two cases that this is trying to solve.

1. reset mixed when there were files that were added

In this case the index will no longer have the entry at all because
the reset is making the index look like before the file was added
which didn't have it. When not using the sparse-checkout this is fine
because the file is in the working directory and the reset didn't touch
it.  But using the sparse-checkout on that file and if the file was not
in the working directory, the index gets reset and the entry for that
file is gone and if we don't put the index version of the file before the
reset into the working directory, then we have lost the content for
that file

2. reset mixed when there were files modified

This case is similar but with modified files there is an entry in the index
but it is getting changed to a previous version of the file.  If we don't get
the file on disk then the version of the file that, in the non sparse-checkout
case, would be on disk is lost and cannot be recovered.  So even if we turn
on the skip-worktree bit for this entry and it doesn't show up in status
calls, we lost the previous version of the file.




[PATCH 0/1] reset: fix mixed reset when using sparse-checkout

2017-09-08 Thread Kevin Willford
Original discussion is here
https://public-inbox.org/git/20170407192357.948-4-kewi...@microsoft.com/

When running a reset mixed and using the sparse-checkout the working
directory needs to be updated so that there is not data loss when the
index is updated.  This is because the index is getting updated
potentially removing entries without changing the working directory.
When using the sparse-checkout feature the entries removed might not
be on disk and are lost.

This patch writes the before version of the file to disk if the mixed
reset is going to change the index of a file that had the skip-wortree
bit so that the file contents before the reset is preserved on disk and
status will reports the correct results.

Kevin Willford (1):
  reset: fix reset when using the sparse-checkout feature.

 builtin/reset.c  | 25 +
 t/t7114-reset-sparse-checkout.sh | 60 
 2 files changed, 85 insertions(+)
 create mode 100755 t/t7114-reset-sparse-checkout.sh

-- 
2.14.1.474.g0558484247.dirty



[PATCH 1/1] reset: fix reset when using the sparse-checkout feature.

2017-09-08 Thread Kevin Willford
When using the sparse checkout feature the git reset command will add
entries to the index that will have the skip-worktree bit off but will
leave the working directory empty.  File data is lost because the index
version of the files have been changed but there is nothing that is in the
working directory.  This will cause the next status call to show either
deleted for files modified or deleting or nothing for files added.  The
added files should be shown as untracked and modified files should be
shown as modified.

To fix this when the reset is running if there is not a file in the working
directory and if it will be missing with the new index entry or was not
missing in the previous version, we create the previous index version of
the file in the working directory so that status will report correctly
and the files will be availble for the user to deal with.

Signed-off-by: Kevin Willford <kewi...@microsoft.com>
---
 builtin/reset.c  | 25 +
 t/t7114-reset-sparse-checkout.sh | 60 
 2 files changed, 85 insertions(+)
 create mode 100755 t/t7114-reset-sparse-checkout.sh

diff --git a/builtin/reset.c b/builtin/reset.c
index d72c7d1c96..1b8bb45989 100644
--- a/builtin/reset.c
+++ b/builtin/reset.c
@@ -24,6 +24,7 @@
 #include "cache-tree.h"
 #include "submodule.h"
 #include "submodule-config.h"
+#include "dir.h"
 
 static const char * const git_reset_usage[] = {
N_("git reset [--mixed | --soft | --hard | --merge | --keep] [-q] 
[]"),
@@ -124,8 +125,32 @@ static void update_index_from_diff(struct 
diff_queue_struct *q,
 
for (i = 0; i < q->nr; i++) {
struct diff_filespec *one = q->queue[i]->one;
+   struct diff_filespec *two = q->queue[i]->two;
+   int pos;
int is_missing = !(one->mode && !is_null_oid(>oid));
+   int was_missing = !two->mode && is_null_oid(>oid);
struct cache_entry *ce;
+   struct cache_entry *ceBefore;
+   struct checkout state = CHECKOUT_INIT;
+
+   if (core_apply_sparse_checkout && !file_exists(two->path)) {
+   pos = cache_name_pos(two->path, strlen(two->path));
+   if ((pos >= 0 && ce_skip_worktree(active_cache[pos])) &&
+   (is_missing || !was_missing))
+   {
+   state.force = 1;
+   state.refresh_cache = 1;
+   state.istate = _index;
+   ceBefore = make_cache_entry(two->mode,
+   two->oid.hash,
+   two->path, 0, 0);
+   if (!ceBefore)
+   die(_("make_cache_entry failed for path 
'%s'"),
+   two->path);
+
+   checkout_entry(ceBefore, , NULL);
+   }
+   }
 
if (is_missing && !intent_to_add) {
remove_file_from_cache(one->path);
diff --git a/t/t7114-reset-sparse-checkout.sh b/t/t7114-reset-sparse-checkout.sh
new file mode 100755
index 00..f2a5426847
--- /dev/null
+++ b/t/t7114-reset-sparse-checkout.sh
@@ -0,0 +1,60 @@
+#!/bin/sh
+
+test_description='reset when using a sparse-checkout'
+
+. ./test-lib.sh
+
+# reset using a sparse-checkout file
+
+test_expect_success 'setup' '
+   test_tick &&
+   echo "checkout file" >c &&
+   echo "modify file" >m &&
+   echo "delete file" >d &&
+   git add . &&
+   git commit -m "initial commit" &&
+   echo "added file" >a &&
+   echo "modification of a file" >m &&
+   git rm d &&
+   git add . &&
+   git commit -m "second commit" &&
+   git checkout -b endCommit
+'
+
+test_expect_success 'reset when there is a sparse-checkout' '
+   echo "/c" >.git/info/sparse-checkout &&
+   test_config core.sparsecheckout true &&
+   git checkout -b resetBranch &&
+   test_path_is_missing m &&
+   test_path_is_missing a &&
+   test_path_is_missing d &&
+   git reset HEAD~1 &&
+   test "checkout file" = "$(cat c)" &&
+   test "modification of a file" = "$(cat m)" &&
+   test "added file" = "$(cat a)" &&
+   test_path_is_missing d
+'
+
+test_expect_success 'reset after 

[PATCH v3 3/3] merge-recursive: change current file dir string_lists to hashmap

2017-09-07 Thread Kevin Willford
The code was using two string_lists, one for the directories and
one for the files.  The code never checks the lists independently
so we should be able to only use one list.  The string_list also
is a O(log n) for lookup and insertion.  Switching this to use a
hashmap will give O(1) which will save some time when there are
millions of paths that will be checked.

Signed-off-by: Kevin Willford <kewi...@microsoft.com>
---
 merge-recursive.c | 56 ---
 merge-recursive.h |  3 +--
 2 files changed, 46 insertions(+), 13 deletions(-)

diff --git a/merge-recursive.c b/merge-recursive.c
index d47f40ea81..35af3761ba 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -24,6 +24,31 @@
 #include "dir.h"
 #include "submodule.h"
 
+struct path_hashmap_entry {
+   struct hashmap_entry e;
+   char path[FLEX_ARRAY];
+};
+
+static int path_hashmap_cmp(const void *cmp_data,
+   const void *entry,
+   const void *entry_or_key,
+   const void *keydata)
+{
+   const struct path_hashmap_entry *a = entry;
+   const struct path_hashmap_entry *b = entry_or_key;
+   const char *key = keydata;
+
+   if (ignore_case)
+   return strcasecmp(a->path, key ? key : b->path);
+   else
+   return strcmp(a->path, key ? key : b->path);
+}
+
+static unsigned int path_hash(const char *path)
+{
+   return ignore_case ? strihash(path) : strhash(path);
+}
+
 static void flush_output(struct merge_options *o)
 {
if (o->buffer_output < 2 && o->obuf.len) {
@@ -314,15 +339,15 @@ static int save_files_dirs(const unsigned char *sha1,
struct strbuf *base, const char *path,
unsigned int mode, int stage, void *context)
 {
+   struct path_hashmap_entry *entry;
int baselen = base->len;
struct merge_options *o = context;
 
strbuf_addstr(base, path);
 
-   if (S_ISDIR(mode))
-   string_list_insert(>current_directory_set, base->buf);
-   else
-   string_list_insert(>current_file_set, base->buf);
+   FLEX_ALLOC_MEM(entry, path, base->buf, base->len);
+   hashmap_entry_init(entry, path_hash(entry->path));
+   hashmap_add(>current_file_dir_set, entry);
 
strbuf_setlen(base, baselen);
return (S_ISDIR(mode) ? READ_TREE_RECURSIVE : 0);
@@ -642,6 +667,7 @@ static void add_flattened_path(struct strbuf *out, const 
char *s)
 
 static char *unique_path(struct merge_options *o, const char *path, const char 
*branch)
 {
+   struct path_hashmap_entry *entry;
struct strbuf newpath = STRBUF_INIT;
int suffix = 0;
size_t base_len;
@@ -650,14 +676,16 @@ static char *unique_path(struct merge_options *o, const 
char *path, const char *
add_flattened_path(, branch);
 
base_len = newpath.len;
-   while (string_list_has_string(>current_file_set, newpath.buf) ||
-  string_list_has_string(>current_directory_set, newpath.buf) ||
+   while (hashmap_get_from_hash(>current_file_dir_set,
+path_hash(newpath.buf), newpath.buf) ||
   (!o->call_depth && file_exists(newpath.buf))) {
strbuf_setlen(, base_len);
strbuf_addf(, "_%d", suffix++);
}
 
-   string_list_insert(>current_file_set, newpath.buf);
+   FLEX_ALLOC_MEM(entry, path, newpath.buf, newpath.len);
+   hashmap_entry_init(entry, path_hash(entry->path));
+   hashmap_add(>current_file_dir_set, entry);
return strbuf_detach(, NULL);
 }
 
@@ -1941,8 +1969,14 @@ int merge_trees(struct merge_options *o,
if (unmerged_cache()) {
struct string_list *entries, *re_head, *re_merge;
int i;
-   string_list_clear(>current_file_set, 1);
-   string_list_clear(>current_directory_set, 1);
+   /*
+* Only need the hashmap while processing entries, so
+* initialize it here and free it when we are done running
+* through the entries. Keeping it in the merge_options as
+* opposed to decaring a local hashmap is for convenience
+* so that we don't have to pass it to around.
+*/
+   hashmap_init(>current_file_dir_set, path_hashmap_cmp, NULL, 
512);
get_files_dirs(o, head);
get_files_dirs(o, merge);
 
@@ -1978,6 +2012,8 @@ int merge_trees(struct merge_options *o,
string_list_clear(re_head, 0);
string_list_clear(entries, 1);
 
+   hashmap_free(>current_file_dir_set, 1);
+
free(re_merge);
free(re_head);
free(entries);
@@ -2179,8 +2215,6 @@ void init_mer

[PATCH v3 2/3] merge-recursive: remove return value from get_files_dirs

2017-09-07 Thread Kevin Willford
The return value of the get_files_dirs call is never being used.
Looking at the history of the file and it was originally only
being used for debug output statements.  Also when
read_tree_recursive return value is non zero it is changed to
zero.  This leads me to believe that it doesn't matter if
read_tree_recursive gets an error.

Since the debug output has been removed and the caller isn't
checking the return value there is no reason to keep calculating
and returning a value.

Signed-off-by: Kevin Willford <kewi...@microsoft.com>
---
 merge-recursive.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/merge-recursive.c b/merge-recursive.c
index 033d7cd406..d47f40ea81 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -328,15 +328,11 @@ static int save_files_dirs(const unsigned char *sha1,
return (S_ISDIR(mode) ? READ_TREE_RECURSIVE : 0);
 }
 
-static int get_files_dirs(struct merge_options *o, struct tree *tree)
+static void get_files_dirs(struct merge_options *o, struct tree *tree)
 {
-   int n;
struct pathspec match_all;
memset(_all, 0, sizeof(match_all));
-   if (read_tree_recursive(tree, "", 0, 0, _all, save_files_dirs, o))
-   return 0;
-   n = o->current_file_set.nr + o->current_directory_set.nr;
-   return n;
+   read_tree_recursive(tree, "", 0, 0, _all, save_files_dirs, o);
 }
 
 /*
-- 
2.14.1.329.gcdd497e120.dirty



[PATCH v3 0/3] merge-recursive: replace string_list with hashmap

2017-09-07 Thread Kevin Willford
Code was using two string_lists, one for the directories and
one for the files.  The code never checks the lists independently
so we should be able to only use one list.  The string_list also
is a O(log n) for lookup and insertion.  Switching this to use a
hashmap will give O(1) which will save some time when there are
millions of paths that will be checked.

Also cleaned up a memory leak and method where the return was not
being used.

Changes since last version:
1. Removed the function pointers and just check the ignore_case in the
compare and hash methods.
2. Added a comment about the hashmap and why it is getting initialized
and freed but not a local.
3. Use hashmap_get_from_hash and remove the dummy entry

Kevin Willford (3):
  merge-recursive: fix memory leak
  merge-recursive: remove return value from get_files_dirs
  merge-recursive: change current file dir string_lists to hashmap

 merge-recursive.c | 76 ---
 merge-recursive.h |  3 +--
 2 files changed, 57 insertions(+), 22 deletions(-)

-- 
2.14.1.329.gcdd497e120.dirty



[PATCH v3 1/3] merge-recursive: fix memory leak

2017-09-07 Thread Kevin Willford
In merge_trees if process_renames or process_entry returns less
than zero, the method will just return and not free re_merge,
re_head, or entries.

This change cleans up the allocated variables before returning
to the caller.

Signed-off-by: Kevin Willford <kewi...@microsoft.com>
---
 merge-recursive.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/merge-recursive.c b/merge-recursive.c
index 1494ffdb82..033d7cd406 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -1956,7 +1956,7 @@ int merge_trees(struct merge_options *o,
re_merge = get_renames(o, merge, common, head, merge, entries);
clean = process_renames(o, re_head, re_merge);
if (clean < 0)
-   return clean;
+   goto cleanup;
for (i = entries->nr-1; 0 <= i; i--) {
const char *path = entries->items[i].string;
struct stage_data *e = entries->items[i].util;
@@ -1964,8 +1964,10 @@ int merge_trees(struct merge_options *o,
int ret = process_entry(o, path, e);
if (!ret)
clean = 0;
-   else if (ret < 0)
-   return ret;
+   else if (ret < 0) {
+   clean = ret;
+   goto cleanup;
+   }
}
}
for (i = 0; i < entries->nr; i++) {
@@ -1975,6 +1977,7 @@ int merge_trees(struct merge_options *o,
entries->items[i].string);
}
 
+cleanup:
string_list_clear(re_merge, 0);
string_list_clear(re_head, 0);
string_list_clear(entries, 1);
@@ -1982,6 +1985,9 @@ int merge_trees(struct merge_options *o,
free(re_merge);
free(re_head);
free(entries);
+
+   if (clean < 0)
+   return clean;
}
else
clean = 1;
-- 
2.14.1.329.gcdd497e120.dirty



[PATCH v2] merge-recursive: change current file dir string_lists to hashmap

2017-09-06 Thread Kevin Willford
The code was using two string_lists, one for the directories and
one for the files.  The code never checks the lists independently
so we should be able to only use one list.  The string_list also
is a O(log n) for lookup and insertion.  Switching this to use a
hashmap will give O(1) which will save some time when there are
millions of paths that will be checked.

Also cleaned up a memory leak and method where the return was not
being used.

Signed-off-by: Kevin Willford <kewi...@microsoft.com>
---
 merge-recursive.c | 76 ---
 merge-recursive.h |  3 +--
 2 files changed, 57 insertions(+), 22 deletions(-)

diff --git a/merge-recursive.c b/merge-recursive.c
index 1494ffdb82..ebfe01017f 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -24,6 +24,31 @@
 #include "dir.h"
 #include "submodule.h"
 
+struct path_hashmap_entry {
+   struct hashmap_entry;
+   char path[FLEX_ARRAY];
+};
+
+static int path_hashmap_cmp(const void *cmp_data,
+   const void *entry,
+   const void *entry_or_key,
+   const void *keydata)
+{
+   const struct path_hashmap_entry *a = entry;
+   const struct path_hashmap_entry *b = entry_or_key;
+   const char *key = keydata;
+
+   if (ignore_case)
+   return strcasecmp(a->path, key ? key : b->path);
+   else
+   return strcmp(a->path, key ? key : b->path);
+}
+
+static unsigned int path_hash(const char *path)
+{
+   return ignore_case ? strihash(path) : strhash(path);
+}
+
 static void flush_output(struct merge_options *o)
 {
if (o->buffer_output < 2 && o->obuf.len) {
@@ -314,29 +339,25 @@ static int save_files_dirs(const unsigned char *sha1,
struct strbuf *base, const char *path,
unsigned int mode, int stage, void *context)
 {
+   struct path_hashmap_entry *entry;
int baselen = base->len;
struct merge_options *o = context;
 
strbuf_addstr(base, path);
 
-   if (S_ISDIR(mode))
-   string_list_insert(>current_directory_set, base->buf);
-   else
-   string_list_insert(>current_file_set, base->buf);
+   FLEX_ALLOC_MEM(entry, path, base->buf, base->len);
+   hashmap_entry_init(entry, path_hash(entry->path));
+   hashmap_add(>current_file_dir_set, entry);
 
strbuf_setlen(base, baselen);
return (S_ISDIR(mode) ? READ_TREE_RECURSIVE : 0);
 }
 
-static int get_files_dirs(struct merge_options *o, struct tree *tree)
+static void get_files_dirs(struct merge_options *o, struct tree *tree)
 {
-   int n;
struct pathspec match_all;
memset(_all, 0, sizeof(match_all));
-   if (read_tree_recursive(tree, "", 0, 0, _all, save_files_dirs, o))
-   return 0;
-   n = o->current_file_set.nr + o->current_directory_set.nr;
-   return n;
+   read_tree_recursive(tree, "", 0, 0, _all, save_files_dirs, o);
 }
 
 /*
@@ -646,6 +667,7 @@ static void add_flattened_path(struct strbuf *out, const 
char *s)
 
 static char *unique_path(struct merge_options *o, const char *path, const char 
*branch)
 {
+   struct path_hashmap_entry *entry;
struct strbuf newpath = STRBUF_INIT;
int suffix = 0;
size_t base_len;
@@ -654,14 +676,16 @@ static char *unique_path(struct merge_options *o, const 
char *path, const char *
add_flattened_path(, branch);
 
base_len = newpath.len;
-   while (string_list_has_string(>current_file_set, newpath.buf) ||
-  string_list_has_string(>current_directory_set, newpath.buf) ||
+   while (hashmap_get_from_hash(>current_file_dir_set,
+path_hash(newpath.buf), newpath.buf) ||
   (!o->call_depth && file_exists(newpath.buf))) {
strbuf_setlen(, base_len);
strbuf_addf(, "_%d", suffix++);
}
 
-   string_list_insert(>current_file_set, newpath.buf);
+   FLEX_ALLOC_MEM(entry, path, newpath.buf, newpath.len);
+   hashmap_entry_init(entry, path_hash(entry->path));
+   hashmap_add(>current_file_dir_set, entry);
return strbuf_detach(, NULL);
 }
 
@@ -1945,8 +1969,14 @@ int merge_trees(struct merge_options *o,
if (unmerged_cache()) {
struct string_list *entries, *re_head, *re_merge;
int i;
-   string_list_clear(>current_file_set, 1);
-   string_list_clear(>current_directory_set, 1);
+   /*
+* Only need the hashmap while processing entries, so
+* initialize it here and free it when we are done running
+* through the entries. Keeping it in the merge_options as
+* opposed to decaring a local hashmap

RE: [PATCH] read-cache: fix index corruption with index v4

2017-09-05 Thread Kevin Willford
> From: Thomas Gummerer [mailto:t.gumme...@gmail.com]
> Sent: Monday, September 4, 2017 4:58 PM
> 
> ce012deb98 ("read-cache: avoid allocating every ondisk entry when
> writing", 2017-08-21) changed the way cache entries are written to the
> index file.  While previously it wrote the name to an struct that was
> allocated using xcalloc(), it now uses ce_write() directly.  Previously
> ce_namelen - common bytes were written to the cache entry, which would
> automatically make it nul terminated, as it was allocated using calloc.
> 
> Now we are writing ce_namelen - common + 1 bytes directly from the
> ce->name to the index.  As ce->name is however not nul terminated, and
> index-v4 needs the nul terminator to split between one index entry and
> the next, this would end up in a corrupted index.
> 
> Fix that by only writing ce_namelen - common bytes directly from
> ce->name to the index, and adding the nul terminator in an extra call to
> ce_write.
> 
> This bug was turned up by setting TEST_GIT_INDEX_VERSION = 4 in
> config.mak and running the test suite (t1700 specifically broke).
> 
> Signed-off-by: Thomas Gummerer 
> ---
> 
> I unfortunately didn't have more time to dig so
> 
> > As ce->name is however not nul terminated
> 
> just comes from my memory and from the patch below actually fixing the
> corruption, so it's really the most likely cause.  Would be great if
> someone who can remember more about the index could confirm that this
> is indeed the case.
> 

Digging into this and ce->name IS nul terminated.  The issue comes in when
the CE_STRIP_NAME is set, which is only set when using a split index. 
This sets the ce->ce_namelen = 0 without changing the actual ce->name buffer.
When writing the entry for the split index version 4 it was using the first 
character
in the ce->name buffer because of the + 1, which obviously isn't correct.
Before
it was using a newly allocated name buffer from the ondisk struct which was
allocated based on the ce_namelen of zero.

>  read-cache.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/read-cache.c b/read-cache.c
> index 40da87ea71..80830ddcfc 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -2103,7 +2103,9 @@ static int ce_write_entry(git_SHA_CTX *c, int fd, struct
> cache_entry *ce,
>   if (!result)
>   result = ce_write(c, fd, to_remove_vi, prefix_size);
>   if (!result)
> - result = ce_write(c, fd, ce->name + common,
> ce_namelen(ce) - common + 1);
> + result = ce_write(c, fd, ce->name + common,
> ce_namelen(ce) - common);
> + if (!result)
> + result = ce_write(c, fd, "\0", 1);

You could use the padding variable here as well which is used in the < version 4
ce_write.

> 
>   strbuf_splice(previous_name, common, to_remove,
> ce->name + common, ce_namelen(ce) - common);
> --
> 2.14.1.480.gb18f417b89

While looking at the code I was wondering if we could get around the
whole setting ce->ce_namelen to zero bit but that would be much bigger
patch and possibly introduce other bugs so this seems the appropriate
fix for now.

Thanks for finding this!
Kevin 


RE: [PATCH 2/3] merge-recursive: remove return value from get_files_dirs

2017-08-29 Thread Kevin Willford
> 
> On Mon, Aug 28, 2017 at 02:28:28PM -0600, Kevin Willford wrote:
> 
> > The return value of the get_files_dirs call is never being used.
> > Looking at the history of the file and it was originally only
> > being used for debug output statements.  Also when
> > read_tree_recursive return value is non zero it is changed to
> > zero.  This leads me to believe that it doesn't matter if
> > read_tree_recursive gets an error.
> 
> Or that the function is buggy. :)

That was one of my questions as well.  Should read_tree_recursive
be propagating a -1 and merge_trees be checking for that and bail
when the call to get_files_dirs return is < 0?  I made a commit with
this change and ran the tests and they all still passed so either this
return really doesn't matter or there are not sufficient tests covering
it.

I went with this change because it was not changing any of the
current functionality and if we find a case where it matters that
read_tree_recursive fails due to bad tree or something else we
can address it then.

> 
> I'm tempted to say that we should probably die() when
> read_tree_recursive fails. This should only happen if we fail to parse
> the tree, or if our callback (save_files_dirs here) returns failure, and
> the latter looks like it never happens.
> 
> > Since the debug output has been removed and the caller isn't
> > checking the return value there is no reason to keep calulating
> > and returning a value.
> 
> Agreed, and I'm happy to see dead code go.
> 
> Minor nit: s/calulating/calculating/ in your commit message.

When will that spell checker for git messages be ready? ;)

> 
> -Peff


[PATCH 3/3] merge-recursive: change current file dir string_lists to hashmap

2017-08-28 Thread Kevin Willford
The code was using two string_lists, one for the directories and
one for the files.  The code never checks the lists independently
so we should be able to only use one list.  The string_list also
is a O(log n) for lookup and insertion.  Switching this to use a
hashmap will give O(1) which will save some time when there are
millions of paths that will be checked.

Signed-off-by: Kevin Willford <kewi...@microsoft.com>
---
 merge-recursive.c | 48 +---
 merge-recursive.h |  3 +--
 2 files changed, 38 insertions(+), 13 deletions(-)

diff --git a/merge-recursive.c b/merge-recursive.c
index d47f40ea81..ef4fe5f09f 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -24,6 +24,26 @@
 #include "dir.h"
 #include "submodule.h"
 
+struct path_hashmap_entry {
+   struct hashmap_entry;
+   char path[FLEX_ARRAY];
+};
+
+static unsigned int (*pathhash)(const char *path);
+static int (*pathcmp)(const char *a, const char *b);
+
+static int path_hashmap_cmp(const void *cmp_data,
+   const void *entry,
+   const void *entry_or_key,
+   const void *keydata)
+{
+   const struct path_hashmap_entry *a = entry;
+   const struct path_hashmap_entry *b = entry_or_key;
+   const char *key = keydata;
+
+   return pathcmp(a->path, key ? key : b->path);
+}
+
 static void flush_output(struct merge_options *o)
 {
if (o->buffer_output < 2 && o->obuf.len) {
@@ -314,15 +334,15 @@ static int save_files_dirs(const unsigned char *sha1,
struct strbuf *base, const char *path,
unsigned int mode, int stage, void *context)
 {
+   struct path_hashmap_entry *entry;
int baselen = base->len;
struct merge_options *o = context;
 
strbuf_addstr(base, path);
 
-   if (S_ISDIR(mode))
-   string_list_insert(>current_directory_set, base->buf);
-   else
-   string_list_insert(>current_file_set, base->buf);
+   FLEX_ALLOC_MEM(entry, path, base->buf, base->len);
+   hashmap_entry_init(entry, pathhash(entry->path));
+   hashmap_add(>current_file_dir_set, entry);
 
strbuf_setlen(base, baselen);
return (S_ISDIR(mode) ? READ_TREE_RECURSIVE : 0);
@@ -642,6 +662,8 @@ static void add_flattened_path(struct strbuf *out, const 
char *s)
 
 static char *unique_path(struct merge_options *o, const char *path, const char 
*branch)
 {
+   struct path_hashmap_entry dummy;
+   struct path_hashmap_entry *entry;
struct strbuf newpath = STRBUF_INIT;
int suffix = 0;
size_t base_len;
@@ -650,14 +672,17 @@ static char *unique_path(struct merge_options *o, const 
char *path, const char *
add_flattened_path(, branch);
 
base_len = newpath.len;
-   while (string_list_has_string(>current_file_set, newpath.buf) ||
-  string_list_has_string(>current_directory_set, newpath.buf) ||
+   hashmap_entry_init(, pathhash(newpath.buf));
+   while (hashmap_get(>current_file_dir_set, , newpath.buf) ||
   (!o->call_depth && file_exists(newpath.buf))) {
strbuf_setlen(, base_len);
strbuf_addf(, "_%d", suffix++);
+   hashmap_entry_init(, pathhash(newpath.buf));
}
 
-   string_list_insert(>current_file_set, newpath.buf);
+   FLEX_ALLOC_MEM(entry, path, newpath.buf, newpath.len);
+   hashmap_entry_init(entry, pathhash(entry->path));
+   hashmap_add(>current_file_dir_set, entry);
return strbuf_detach(, NULL);
 }
 
@@ -1941,8 +1966,7 @@ int merge_trees(struct merge_options *o,
if (unmerged_cache()) {
struct string_list *entries, *re_head, *re_merge;
int i;
-   string_list_clear(>current_file_set, 1);
-   string_list_clear(>current_directory_set, 1);
+   hashmap_init(>current_file_dir_set, path_hashmap_cmp, NULL, 
512);
get_files_dirs(o, head);
get_files_dirs(o, merge);
 
@@ -1978,6 +2002,8 @@ int merge_trees(struct merge_options *o,
string_list_clear(re_head, 0);
string_list_clear(entries, 1);
 
+   hashmap_free(>current_file_dir_set, 1);
+
free(re_merge);
free(re_head);
free(entries);
@@ -2179,8 +2205,8 @@ void init_merge_options(struct merge_options *o)
if (o->verbosity >= 5)
o->buffer_output = 0;
strbuf_init(>obuf, 0);
-   string_list_init(>current_file_set, 1);
-   string_list_init(>current_directory_set, 1);
+   pathhash = ignore_case ? strihash : strhash;
+   pathcmp = ignore_case ? strcasecmp : strcmp;
string_list_init(>df_conflict_file_set, 1);
 }
 
diff --git a/merge-recurs

[PATCH 2/3] merge-recursive: remove return value from get_files_dirs

2017-08-28 Thread Kevin Willford
The return value of the get_files_dirs call is never being used.
Looking at the history of the file and it was originally only
being used for debug output statements.  Also when
read_tree_recursive return value is non zero it is changed to
zero.  This leads me to believe that it doesn't matter if
read_tree_recursive gets an error.

Since the debug output has been removed and the caller isn't
checking the return value there is no reason to keep calulating
and returning a value.

Signed-off-by: Kevin Willford <kewi...@microsoft.com>
---
 merge-recursive.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/merge-recursive.c b/merge-recursive.c
index 033d7cd406..d47f40ea81 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -328,15 +328,11 @@ static int save_files_dirs(const unsigned char *sha1,
return (S_ISDIR(mode) ? READ_TREE_RECURSIVE : 0);
 }
 
-static int get_files_dirs(struct merge_options *o, struct tree *tree)
+static void get_files_dirs(struct merge_options *o, struct tree *tree)
 {
-   int n;
struct pathspec match_all;
memset(_all, 0, sizeof(match_all));
-   if (read_tree_recursive(tree, "", 0, 0, _all, save_files_dirs, o))
-   return 0;
-   n = o->current_file_set.nr + o->current_directory_set.nr;
-   return n;
+   read_tree_recursive(tree, "", 0, 0, _all, save_files_dirs, o);
 }
 
 /*
-- 
2.14.1.329.g6edf0add19



[PATCH 0/3] merge-recursive: replace string_list with hashmap

2017-08-28 Thread Kevin Willford
Code was using two string_lists, one for the directories and
one for the files.  The code never checks the lists independently
so we should be able to only use one list.  The string_list also
is a O(log n) for lookup and insertion.  Switching this to use a
hashmap will give O(1) which will save some time when there are
millions of paths that will be checked.

Also cleaned up a memory leak and method where the return was not
being used.

Kevin Willford (3):
  merge-recursive: fix memory leak
  merge-recursive: remove return value from get_files_dirs
  merge-recursive: change current file dir string_lists to hashmap

 merge-recursive.c | 68 +++
 merge-recursive.h |  3 +--
 2 files changed, 49 insertions(+), 22 deletions(-)

-- 
2.14.1.329.g6edf0add19



[PATCH 1/3] merge-recursive: fix memory leak

2017-08-28 Thread Kevin Willford
In merge_trees if process_renames or process_entry returns less
than zero, the method will just return and not free re_merge,
re_head, or entries.

This change cleans up the allocated variables before returning
to the caller.

Signed-off-by: Kevin Willford <kewi...@microsoft.com>
---
 merge-recursive.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/merge-recursive.c b/merge-recursive.c
index 1494ffdb82..033d7cd406 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -1956,7 +1956,7 @@ int merge_trees(struct merge_options *o,
re_merge = get_renames(o, merge, common, head, merge, entries);
clean = process_renames(o, re_head, re_merge);
if (clean < 0)
-   return clean;
+   goto cleanup;
for (i = entries->nr-1; 0 <= i; i--) {
const char *path = entries->items[i].string;
struct stage_data *e = entries->items[i].util;
@@ -1964,8 +1964,10 @@ int merge_trees(struct merge_options *o,
int ret = process_entry(o, path, e);
if (!ret)
clean = 0;
-   else if (ret < 0)
-   return ret;
+   else if (ret < 0) {
+   clean = ret;
+   goto cleanup;
+   }
}
}
for (i = 0; i < entries->nr; i++) {
@@ -1975,6 +1977,7 @@ int merge_trees(struct merge_options *o,
entries->items[i].string);
}
 
+cleanup:
string_list_clear(re_merge, 0);
string_list_clear(re_head, 0);
string_list_clear(entries, 1);
@@ -1982,6 +1985,9 @@ int merge_trees(struct merge_options *o,
free(re_merge);
free(re_head);
free(entries);
+
+   if (clean < 0)
+   return clean;
}
else
clean = 1;
-- 
2.14.1.329.g6edf0add19



[PATCH 1/3] perf: add test for writing the index

2017-08-21 Thread Kevin Willford
A performance test for writing the index to be able to
determine if changes to allocating ondisk structure help.

Signed-off-by: Kevin Willford <kewi...@microsoft.com>
---
 Makefile|  1 +
 t/helper/test-write-cache.c | 23 +++
 t/perf/p0007-write-cache.sh | 29 +
 3 files changed, 53 insertions(+)
 create mode 100644 t/helper/test-write-cache.c
 create mode 100755 t/perf/p0007-write-cache.sh

diff --git a/Makefile b/Makefile
index 86ec29202b..c6b061086f 100644
--- a/Makefile
+++ b/Makefile
@@ -655,6 +655,7 @@ TEST_PROGRAMS_NEED_X += test-parse-options
 TEST_PROGRAMS_NEED_X += test-path-utils
 TEST_PROGRAMS_NEED_X += test-prio-queue
 TEST_PROGRAMS_NEED_X += test-read-cache
+TEST_PROGRAMS_NEED_X += test-write-cache
 TEST_PROGRAMS_NEED_X += test-ref-store
 TEST_PROGRAMS_NEED_X += test-regex
 TEST_PROGRAMS_NEED_X += test-revision-walking
diff --git a/t/helper/test-write-cache.c b/t/helper/test-write-cache.c
new file mode 100644
index 00..b7ee039669
--- /dev/null
+++ b/t/helper/test-write-cache.c
@@ -0,0 +1,23 @@
+#include "cache.h"
+#include "lockfile.h"
+
+static struct lock_file index_lock;
+
+int cmd_main(int argc, const char **argv)
+{
+   int i, cnt = 1, lockfd;
+   if (argc == 2)
+   cnt = strtol(argv[1], NULL, 0);
+   setup_git_directory();
+   read_cache();
+   for (i = 0; i < cnt; i++) {
+   lockfd = hold_locked_index(_lock, LOCK_DIE_ON_ERROR);
+   if (0 <= lockfd) {
+   write_locked_index(_index, _lock, 
COMMIT_LOCK);
+   } else {
+   rollback_lock_file(_lock);
+   }
+   }
+
+   return 0;
+}
diff --git a/t/perf/p0007-write-cache.sh b/t/perf/p0007-write-cache.sh
new file mode 100755
index 00..261fe92fd9
--- /dev/null
+++ b/t/perf/p0007-write-cache.sh
@@ -0,0 +1,29 @@
+#!/bin/sh
+
+test_description="Tests performance of writing the index"
+
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+test_expect_success "setup repo" '
+   if git rev-parse --verify refs/heads/p0006-ballast^{commit}
+   then
+   echo Assuming synthetic repo from many-files.sh
+   git config --local core.sparsecheckout 1
+   cat >.git/info/sparse-checkout <<-EOF
+   /*
+   !ballast/*
+   EOF
+   else
+   echo Assuming non-synthetic repo...
+   fi &&
+   nr_files=$(git ls-files | wc -l)
+'
+
+count=3
+test_perf "write_locked_index $count times ($nr_files files)" "
+   test-write-cache $count
+"
+
+test_done
-- 
2.14.1.205.g2812f3410d



[PATCH 2/3] read-cache: fix memory leak in do_write_index

2017-08-21 Thread Kevin Willford
The previous_name_buf was never getting released when there
was an error in ce_write_entry or allow was false and execution
was returned to the caller.

Signed-off-by: Kevin Willford <kewi...@microsoft.com>
---
 read-cache.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/read-cache.c b/read-cache.c
index acfb028f48..47220cc30d 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -2192,7 +2192,7 @@ static int do_write_index(struct index_state *istate, 
struct tempfile *tempfile,
int newfd = tempfile->fd;
git_SHA_CTX c;
struct cache_header hdr;
-   int i, err, removed, extended, hdr_version;
+   int i, err = 0, removed, extended, hdr_version;
struct cache_entry **cache = istate->cache;
int entries = istate->cache_nr;
struct stat st;
@@ -2247,15 +2247,21 @@ static int do_write_index(struct index_state *istate, 
struct tempfile *tempfile,
if (allow)
warning(msg, ce->name);
else
-   return error(msg, ce->name);
+   err = error(msg, ce->name);
 
drop_cache_tree = 1;
}
if (ce_write_entry(, newfd, ce, previous_name) < 0)
-   return -1;
+   err = -1;
+
+   if (err)
+   break;
}
strbuf_release(_name_buf);
 
+   if (err)
+   return err;
+
/* Write extension data here */
if (!strip_extensions && istate->split_index) {
struct strbuf sb = STRBUF_INIT;
-- 
2.14.1.205.g2812f3410d



[PATCH 3/3] read-cache: avoid allocating every ondisk entry when writing

2017-08-21 Thread Kevin Willford
When writing the index for each entry an ondisk struct will be
allocated and freed in ce_write_entry.  We can do better by
using a ondisk struct on the stack for each entry.

This is accomplished by using a stack ondisk_cache_entry_extended
outside looping through the entries in do_write_index.  Only the
fixed fields of this struct are used when writing and depending on
whether it is extended or not the flags2 field will be written.
The name field is not used and instead the cache_entry name field
is used directly when writing out the name.  Because ce_write is
using a buffer and memcpy to fill the buffer before flushing to disk,
we don't have to worry about doing multiple ce_write calls.

Running the p0007-write-cache.sh tests would save anywhere
between 3-7% when the index had over a million entries with no
performance degradation on small repos.

Signed-off-by: Kevin Willford <kewi...@microsoft.com>
---
 read-cache.c | 50 +-
 1 file changed, 25 insertions(+), 25 deletions(-)

diff --git a/read-cache.c b/read-cache.c
index 47220cc30d..694bed8d82 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1499,6 +1499,7 @@ struct ondisk_cache_entry_extended {
 };
 
 /* These are only used for v3 or lower */
+#define align_padding_size(size, len) ((size + (len) + 8) & ~7) - (size + len)
 #define align_flex_name(STRUCT,len) ((offsetof(struct STRUCT,name) + (len) + 
8) & ~7)
 #define ondisk_cache_entry_size(len) align_flex_name(ondisk_cache_entry,len)
 #define ondisk_cache_entry_extended_size(len) 
align_flex_name(ondisk_cache_entry_extended,len)
@@ -2032,7 +2033,7 @@ static void ce_smudge_racily_clean_entry(struct 
cache_entry *ce)
 }
 
 /* Copy miscellaneous fields but not the name */
-static char *copy_cache_entry_to_ondisk(struct ondisk_cache_entry *ondisk,
+static void copy_cache_entry_to_ondisk(struct ondisk_cache_entry *ondisk,
   struct cache_entry *ce)
 {
short flags;
@@ -2056,32 +2057,35 @@ static char *copy_cache_entry_to_ondisk(struct 
ondisk_cache_entry *ondisk,
struct ondisk_cache_entry_extended *ondisk2;
ondisk2 = (struct ondisk_cache_entry_extended *)ondisk;
ondisk2->flags2 = htons((ce->ce_flags & CE_EXTENDED_FLAGS) >> 
16);
-   return ondisk2->name;
-   }
-   else {
-   return ondisk->name;
}
 }
 
 static int ce_write_entry(git_SHA_CTX *c, int fd, struct cache_entry *ce,
- struct strbuf *previous_name)
+ struct strbuf *previous_name, struct 
ondisk_cache_entry *ondisk)
 {
int size;
-   struct ondisk_cache_entry *ondisk;
int saved_namelen = saved_namelen; /* compiler workaround */
-   char *name;
int result;
+   static unsigned char padding[8] = { 0x00 };
 
if (ce->ce_flags & CE_STRIP_NAME) {
saved_namelen = ce_namelen(ce);
ce->ce_namelen = 0;
}
 
+   if (ce->ce_flags & CE_EXTENDED)
+   size = offsetof(struct ondisk_cache_entry_extended, name);
+   else
+   size = offsetof(struct ondisk_cache_entry, name);
+
if (!previous_name) {
-   size = ondisk_ce_size(ce);
-   ondisk = xcalloc(1, size);
-   name = copy_cache_entry_to_ondisk(ondisk, ce);
-   memcpy(name, ce->name, ce_namelen(ce));
+   int len = ce_namelen(ce);
+   copy_cache_entry_to_ondisk(ondisk, ce);
+   result = ce_write(c, fd, ondisk, size);
+   if (!result)
+   result = ce_write(c, fd, ce->name, len);
+   if (!result)
+   result = ce_write(c, fd, padding, 
align_padding_size(size, len));
} else {
int common, to_remove, prefix_size;
unsigned char to_remove_vi[16];
@@ -2094,16 +2098,12 @@ static int ce_write_entry(git_SHA_CTX *c, int fd, 
struct cache_entry *ce,
to_remove = previous_name->len - common;
prefix_size = encode_varint(to_remove, to_remove_vi);
 
-   if (ce->ce_flags & CE_EXTENDED)
-   size = offsetof(struct ondisk_cache_entry_extended, 
name);
-   else
-   size = offsetof(struct ondisk_cache_entry, name);
-   size += prefix_size + (ce_namelen(ce) - common + 1);
-
-   ondisk = xcalloc(1, size);
-   name = copy_cache_entry_to_ondisk(ondisk, ce);
-   memcpy(name, to_remove_vi, prefix_size);
-   memcpy(name + prefix_size, ce->name + common, ce_namelen(ce) - 
common);
+   copy_cache_entry_to_ondisk(ondisk, ce);
+   result = ce_write(c, fd, ondisk, size);
+   if (!result)
+   result = ce_write(c, fd, to_remove_vi, prefix_size);

[PATCH 0/3] read-cache: use stack ondisk struct when writing index

2017-08-21 Thread Kevin Willford
When writing out the index, the ondisk struct was being allocated and freed
within the loop of cache entries.  A better way would be to use a ondisk
struct on the stack and reuse it avoiding the alloc and free calls.

A test has been added to measure the performance of writing the index
(p0007-write-cache).  Running this test on smaller repos showed no
degradation in performance.  On larger repos there was ~3-7% improvement.

0007.2: write_locked_index 10 times (101 files)   5.98(0.00+0.04)   
5.75(0.01+0.04) -3.8%
0007.2: write_locked_index 10 times (101 files)   6.20(0.00+0.06)   
5.86(0.01+0.03) -5.5%
0007.2: write_locked_index 3 times (4394531 files)   10.29(0.04+0.03)   
9.75(0.04+0.01) -5.2%
0007.2: write_locked_index 3 times (4394531 files)   10.52(0.00+0.04)   
9.79(0.03+0.03) -6.9%

Kevin Willford (3):
  perf: add test for writing the index
  read-cache: fix memory leak in do_write_index
  read-cache: avoid allocating every ondisk entry when writing

 Makefile|  1 +
 read-cache.c| 62 +
 t/helper/test-write-cache.c | 23 +
 t/perf/p0007-write-cache.sh | 29 +
 4 files changed, 87 insertions(+), 28 deletions(-)
 create mode 100644 t/helper/test-write-cache.c
 create mode 100755 t/perf/p0007-write-cache.sh

-- 
2.14.1.205.g2812f3410d



RE: [PATCH v2] commit: skip discarding the index if there is no pre-commit hook

2017-08-14 Thread Kevin Willford
> On Mon, Aug 14, 2017 at 03:54:25PM -0600, Kevin Willford wrote:
> 
> > If there is not a pre-commit hook, there is no reason to discard
> > the index and reread it.
> >
> > This change checks to presence of a pre-commit hook and then only
> > discards the index if there was one.
> >
> > Signed-off-by: Kevin Willford <kewi...@microsoft.com>
> > ---
> >  builtin/commit.c | 15 +--
> >  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> Thanks, this looks nice and simple.
> 
> > diff --git a/builtin/commit.c b/builtin/commit.c
> > index e7a2cb6285..ab71b93518 100644
> > --- a/builtin/commit.c
> > +++ b/builtin/commit.c
> > @@ -940,12 +940,15 @@ static int prepare_to_commit(const char *index_file,
> const char *prefix,
> > return 0;
> > }
> >
> > -   /*
> > -* Re-read the index as pre-commit hook could have updated it,
> > -* and write it out as a tree.  We must do this before we invoke
> > -* the editor and after we invoke run_status above.
> > -*/
> > -   discard_cache();
> > +   if (!no_verify && find_hook("pre-commit")) {
> > +   /*
> > +* Re-read the index as pre-commit hook could have updated it,
> > +* and write it out as a tree.  We must do this before we invoke
> > +* the editor and after we invoke run_status above.
> > +*/
> > +   discard_cache();
> > +   }
> > +
> > read_cache_from(index_file);
> 
> This read_cache_from() should be a noop, right, because it immediately
> sees istate->initialized is set? So it shouldn't matter that it is not
> in the conditional with discard_cache(). Though if its only purpose is
> to re-read the just-discarded contents, perhaps it makes sense to put it
> there for readability.
> 
> -Peff

I thought about that and didn't know if there were cases when this would be 
called
and the cache has not been loaded.  It didn't look like it since it is only 
called from 
cmd_commit and prepare_index is called before it.  Also if in the future this 
call would
be made when it had not read the index yet so thought it was safest just to 
leave
this as always being called since it is basically a noop if the 
istate->initialized is set.

Also based on this commit
https://github.com/git/git/commit/2888605c649ccd423232161186d72c0e6c458a48
it looked like the discard_cache was added independent of the read_cache_from 
call,
which made me think that the two were not tied together.

Kevin


[PATCH v2] commit: skip discarding the index if there is no pre-commit hook

2017-08-14 Thread Kevin Willford
If there is not a pre-commit hook, there is no reason to discard
the index and reread it.

This change checks to presence of a pre-commit hook and then only
discards the index if there was one.

Signed-off-by: Kevin Willford <kewi...@microsoft.com>
---
 builtin/commit.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/builtin/commit.c b/builtin/commit.c
index e7a2cb6285..ab71b93518 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -940,12 +940,15 @@ static int prepare_to_commit(const char *index_file, 
const char *prefix,
return 0;
}
 
-   /*
-* Re-read the index as pre-commit hook could have updated it,
-* and write it out as a tree.  We must do this before we invoke
-* the editor and after we invoke run_status above.
-*/
-   discard_cache();
+   if (!no_verify && find_hook("pre-commit")) {
+   /*
+* Re-read the index as pre-commit hook could have updated it,
+* and write it out as a tree.  We must do this before we invoke
+* the editor and after we invoke run_status above.
+*/
+   discard_cache();
+   }
+
read_cache_from(index_file);
if (update_main_cache_tree(0)) {
error(_("Error building trees"));
-- 
2.14.1.481.g785c1dc9ae



Re: [PATCH] cache-tree: remove use of strbuf_addf in update_one

2017-08-10 Thread Kevin Willford



On 8/10/2017 3:03 PM, Jeff King wrote:

On Thu, Aug 10, 2017 at 11:58:34AM -0700, Stefan Beller wrote:


On Thu, Aug 10, 2017 at 11:47 AM, Kevin Willford <kcwillf...@gmail.com> wrote:

String formatting can be a performance issue when there are
hundreds of thousands of trees.

When changing this for the sake of performance, could you give
an example (which kind of repository you need for this to become
a bottleneck? I presume the large Windows repo? Or can I
reproduce it with a small repo such as linux.git or even git.git?)
and some numbers how this improves the performance?

I was about to say the same thing. Normally I don't mind a small
optimization without numbers if the result is obviously an improvement.

But in this case the result is a lot less readable, and it's not
entirely clear to me that it would always be an improvement (we now
always run 3 strbuf calls instead of one, and have to check the length
for each one).

What I'm wondering specifically is if vsnprintf() on Kevin's platform
(which I'll assume is Windows) is slow, and we would do better to
replace it with a faster compat/ routine.

-Peff


The strbuf_add call is essentially only having to do a memcpy whereas
the strbuf_addf will have to parse the string, determine the types,
convert the data, and then get it in the buffer.  That could be made
faster with a better compat/ routine but I fear still far from
the length check and memcpy.

void strbuf_add(struct strbuf *sb, const void *data, size_t len)
{
strbuf_grow(sb, len);
memcpy(sb->buf + sb->len, data, len);
strbuf_setlen(sb, sb->len + len);
}

Here are some of the performance numbers from the windows repo.
I will work on writing a perf test for this change so that we
have a better idea on smaller repo what the impact of this change
is on them.

 | w/o | with fix |
---
git checkout | 36.08 s | 33.34 s  |
---
git checkout | 32.54 s | 28.26 s  |
---
git checkout | 44.10 s | 38.13 s  |
---
git merge| 32.90 s | 30.56 s  |
---
git rebase   | 46.14 s | 42.18 s  |





[PATCH] commit: skip discarding the index if there is no pre-commit hook

2017-08-10 Thread Kevin Willford
If there is not a pre-commit hook, there is no reason to discard
the index and reread it.

This change checks to presence of a pre-commit hook and then only
discards the index if there was one.

Signed-off-by: Kevin Willford <kewi...@microsoft.com>
---
 builtin/commit.c | 29 +
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/builtin/commit.c b/builtin/commit.c
index e7a2cb6285..443949d87b 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -671,12 +671,22 @@ static int prepare_to_commit(const char *index_file, 
const char *prefix,
const char *hook_arg2 = NULL;
int clean_message_contents = (cleanup_mode != CLEANUP_NONE);
int old_display_comment_prefix;
+   const char *precommit_hook = NULL;
 
/* This checks and barfs if author is badly specified */
determine_author_info(author_ident);
 
-   if (!no_verify && run_commit_hook(use_editor, index_file, "pre-commit", 
NULL))
-   return 0;
+
+   if (!no_verify) {
+   /*
+* Check to see if there is a pre-commit hook
+* If there not one we can skip discarding the index later on
+*/
+   precommit_hook = find_hook("pre-commit");
+   if (precommit_hook &&
+   run_commit_hook(use_editor, index_file, "pre-commit", NULL))
+   return 0;
+   }
 
if (squash_message) {
/*
@@ -940,12 +950,15 @@ static int prepare_to_commit(const char *index_file, 
const char *prefix,
return 0;
}
 
-   /*
-* Re-read the index as pre-commit hook could have updated it,
-* and write it out as a tree.  We must do this before we invoke
-* the editor and after we invoke run_status above.
-*/
-   discard_cache();
+   if (!no_verify && precommit_hook) {
+   /*
+* Re-read the index as pre-commit hook could have updated it,
+* and write it out as a tree.  We must do this before we invoke
+* the editor and after we invoke run_status above.
+*/
+   discard_cache();
+   }
+
read_cache_from(index_file);
if (update_main_cache_tree(0)) {
error(_("Error building trees"));
-- 
2.14.0.rc0.286.g44127d70e4



[PATCH] cache-tree: remove use of strbuf_addf in update_one

2017-08-10 Thread Kevin Willford
String formatting can be a performance issue when there are
hundreds of thousands of trees.

Change to stop using the strbuf_addf and just add the strings
or characters individually.

There are a limited number of modes so added a switch for the
known ones and a default case if something comes through that
are not a known one for git.

Signed-off-by: Kevin Willford <kewi...@microsoft.com>
---
 cache-tree.c | 24 +++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/cache-tree.c b/cache-tree.c
index 2440d1dc89..41744b3db7 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -390,7 +390,29 @@ static int update_one(struct cache_tree *it,
continue;
 
strbuf_grow(, entlen + 100);
-   strbuf_addf(, "%o %.*s%c", mode, entlen, path + baselen, 
'\0');
+
+   switch (mode) {
+   case 0100644:
+   strbuf_add(, "100644 ", 7);
+   break;
+   case 0100664:
+   strbuf_add(, "100664 ", 7);
+   break;
+   case 0100755:
+   strbuf_add(, "100755 ", 7);
+   break;
+   case 012:
+   strbuf_add(, "12 ", 7);
+   break;
+   case 016:
+   strbuf_add(, "16 ", 7);
+   break;
+   default:
+   strbuf_addf(, "%o ", mode);
+   break;
+   }
+   strbuf_add(, path + baselen, entlen);
+   strbuf_addch(, '\0');
strbuf_add(, sha1, 20);
 
 #if DEBUG
-- 
2.14.0.rc0.286.g44127d70e4



[PATCH v2 0/2] Add progress for format-patch and rebase

2017-08-10 Thread Kevin Willford
Changes since last patch:
1. Use start_progress_delay so progress isn't shown if generating
   the patches is fast enough
2. Updated to have text of "Generating patches"
3. Only show progress when the --progress flag is passed
4. In the rebase script check stderr and the quiet option is not
   set before propagating the progress flag to format-patch

Kevin Willford (2):
  format-patch: have progress option while generating patches
  rebase: turn on progress option by default for format-patch

 Documentation/git-format-patch.txt |  4 
 builtin/log.c  | 10 ++
 git-rebase--am.sh  |  1 +
 git-rebase.sh  |  6 ++
 4 files changed, 21 insertions(+)

-- 
2.14.0.rc0.286.g44127d70e4



[PATCH v2 1/2] format-patch: have progress option while generating patches

2017-08-10 Thread Kevin Willford
When generating patches for the rebase command if the user does
not realize the branch they are rebasing onto is thousands of
commits different there is no progress indication after initial
rewinding message.

The progress meter as presented in this patch assumes the thousands of
patches to have a fine granularity as well as assuming to require all the
same amount of work/time for each, such that a steady progress bar
is achieved.

We do not want to estimate the time for each patch based e.g.
on their size or number of touched files (or parents) as that is too
expensive for just a progress meter.

This patch allows a progress option to be passed to format-patch
so that the user can be informed the progress of generating the
patch.  This option is then used by the rebase command when
calling format-patch.

Signed-off-by: Kevin Willford <kewi...@microsoft.com>
---
 Documentation/git-format-patch.txt |  4 
 builtin/log.c  | 10 ++
 2 files changed, 14 insertions(+)

diff --git a/Documentation/git-format-patch.txt 
b/Documentation/git-format-patch.txt
index c890328b02..6cbe462a77 100644
--- a/Documentation/git-format-patch.txt
+++ b/Documentation/git-format-patch.txt
@@ -23,6 +23,7 @@ SYNOPSIS
   [(--reroll-count|-v) ]
   [--to=] [--cc=]
   [--[no-]cover-letter] [--quiet] [--notes[=]]
+  [--progress]
   []
   [  |  ]
 
@@ -283,6 +284,9 @@ you can use `--suffix=-patch` to get 
`0001-description-of-my-change-patch`.
range are always formatted as creation patches, independently
of this flag.
 
+--progress::
+   Show progress reports on stderr as patches are generated.
+
 CONFIGURATION
 -
 You can specify extra mail header lines to be added to each message,
diff --git a/builtin/log.c b/builtin/log.c
index 725c7b8a1a..b07a5529c2 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -27,6 +27,7 @@
 #include "version.h"
 #include "mailmap.h"
 #include "gpg-interface.h"
+#include "progress.h"
 
 /* Set a default date-time format for git log ("log.date" config variable) */
 static const char *default_date_mode = NULL;
@@ -1422,6 +1423,8 @@ int cmd_format_patch(int argc, const char **argv, const 
char *prefix)
char *branch_name = NULL;
char *base_commit = NULL;
struct base_tree_info bases;
+   int show_progress = 0;
+   struct progress *progress = NULL;
 
const struct option builtin_format_patch_options[] = {
{ OPTION_CALLBACK, 'n', "numbered", , NULL,
@@ -1493,6 +1496,8 @@ int cmd_format_patch(int argc, const char **argv, const 
char *prefix)
OPT_FILENAME(0, "signature-file", _file,
N_("add a signature from a file")),
OPT__QUIET(, N_("don't print the patch filenames")),
+   OPT_BOOL(0, "progress", _progress,
+N_("show progress while generating patches")),
OPT_END()
};
 
@@ -1752,8 +1757,12 @@ int cmd_format_patch(int argc, const char **argv, const 
char *prefix)
start_number--;
}
rev.add_signoff = do_signoff;
+
+   if (show_progress)
+   progress = start_progress_delay(_("Generating patches"), total, 
0, 1);
while (0 <= --nr) {
int shown;
+   display_progress(progress, total - nr);
commit = list[nr];
rev.nr = total - nr + (start_number - 1);
/* Make the second and subsequent mails replies to the first */
@@ -1818,6 +1827,7 @@ int cmd_format_patch(int argc, const char **argv, const 
char *prefix)
if (!use_stdout)
fclose(rev.diffopt.file);
}
+   stop_progress();
free(list);
free(branch_name);
string_list_clear(_to, 0);
-- 
2.14.0.rc0.286.g44127d70e4



[PATCH v2 2/2] rebase: turn on progress option by default for format-patch

2017-08-10 Thread Kevin Willford
This change passes the progress option of format-patch checking
that stderr is attached and rebase is not being run in quiet mode.

Signed-off-by: Kevin Willford <kewi...@microsoft.com>
---
 git-rebase--am.sh | 1 +
 git-rebase.sh | 6 ++
 2 files changed, 7 insertions(+)

diff --git a/git-rebase--am.sh b/git-rebase--am.sh
index 375239341f..ff98fe3a73 100644
--- a/git-rebase--am.sh
+++ b/git-rebase--am.sh
@@ -53,6 +53,7 @@ else
 
git format-patch -k --stdout --full-index --cherry-pick --right-only \
--src-prefix=a/ --dst-prefix=b/ --no-renames --no-cover-letter \
+   $git_format_patch_opt \
"$revisions" ${restrict_revision+^$restrict_revision} \
>"$GIT_DIR/rebased-patches"
ret=$?
diff --git a/git-rebase.sh b/git-rebase.sh
index f8b3d1fd97..ad8415e3cf 100755
--- a/git-rebase.sh
+++ b/git-rebase.sh
@@ -74,6 +74,7 @@ test "$(git config --bool rebase.stat)" = true && diffstat=t
 autostash="$(git config --bool rebase.autostash || echo false)"
 fork_point=auto
 git_am_opt=
+git_format_patch_opt=
 rebase_root=
 force_rebase=
 allow_rerere_autoupdate=
@@ -445,6 +446,11 @@ else
state_dir="$apply_dir"
 fi
 
+if test -t 2 && test -z "$GIT_QUIET"
+then
+   git_format_patch_opt="$git_format_patch_opt --progress"
+fi
+
 if test -z "$rebase_root"
 then
case "$#" in
-- 
2.14.0.rc0.286.g44127d70e4



RE: [PATCH 2/2] rebase: turn on progress option by default for format-patch

2017-05-31 Thread Kevin Willford
> -Original Message-
> From: Stefan Beller [mailto:sbel...@google.com]
> Sent: Wednesday, May 31, 2017 1:09 PM
> To: Kevin Willford <kcwillf...@gmail.com>
> Cc: git@vger.kernel.org; Junio C Hamano <gits...@pobox.com>; Kevin
> Willford <kewi...@microsoft.com>
> Subject: Re: [PATCH 2/2] rebase: turn on progress option by default for
> format-patch
> 
> On Wed, May 31, 2017 at 8:04 AM, Kevin Willford <kcwillf...@gmail.com>
> wrote:
> > This change passes the progress option of format-patch by default and
> > passes the -q --quiet option through to the format-patch call so that
> > it is respected as well.
> 
> This is not conflicting with Johannes rewrite of rebase in C?
> (rebase is a huge beast IIUC)

I will check with Johannes and see what possible conflicts there could be.
Since these are flags that get passed to the format-patch code, it shouldn't 
take much to put it in the C code as well.

> 
> When passing the progress option by default to formatting patches, maybe
> we should use start_progress_delay in the previous patch instead to omit
> the progress for short lived patch formatting sessions?
> (say a delay of one second?)
> 

I thought about that and certainly could do it but I have found it nice to have 
the number of patches that are generated in the output even for a small number 
or commits.  For example when I run a `git rebase master` and expect there to 
be only 2 commits, the message "Generating patch: 100% (2/2), done."  Gives me 
that good feeling that I did it right and didn't mess something up.  I'm good 
either way though.

> Thanks,
> Stefan
> 
> >
> > Signed-off-by: Kevin Willford <kewi...@microsoft.com>
> > ---
> >  git-rebase--am.sh | 5 +++--
> >  git-rebase.sh | 2 ++
> >  2 files changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/git-rebase--am.sh b/git-rebase--am.sh index
> > 375239341f..ab2be30abf 100644
> > --- a/git-rebase--am.sh
> > +++ b/git-rebase--am.sh
> > @@ -51,8 +51,9 @@ then
> >  else
> > rm -f "$GIT_DIR/rebased-patches"
> >
> > -   git format-patch -k --stdout --full-index --cherry-pick 
> > --right-only \
> > -   --src-prefix=a/ --dst-prefix=b/ --no-renames 
> > --no-cover-letter \
> > +   git format-patch $git_format_patch_opt -k --stdout --full-index \
> > +   --cherry-pick --right-only --src-prefix=a/ --dst-prefix=b/ \
> > +   --no-renames --no-cover-letter --progress \
> > "$revisions" ${restrict_revision+^$restrict_revision} \
> > >"$GIT_DIR/rebased-patches"
> > ret=$?
> > diff --git a/git-rebase.sh b/git-rebase.sh index
> > db1deed846..b72e319ac9 100755
> > --- a/git-rebase.sh
> > +++ b/git-rebase.sh
> > @@ -73,6 +73,7 @@ test "$(git config --bool rebase.stat)" = true &&
> > diffstat=t  autostash="$(git config --bool rebase.autostash || echo false)"
> >  fork_point=auto
> >  git_am_opt=
> > +git_format_patch_opt=
> >  rebase_root=
> >  force_rebase=
> >  allow_rerere_autoupdate=
> > @@ -308,6 +309,7 @@ do
> > --quiet)
> > GIT_QUIET=t
> > git_am_opt="$git_am_opt -q"
> > +   git_format_patch_opt="$git_format_patch_opt -q"
> > verbose=
> > diffstat=
> > ;;
> > --
> > 2.13.0.92.g73a4ce6a77
> >


RE: [PATCH 1/2] format-patch: have progress option while generating patches

2017-05-31 Thread Kevin Willford
> -Original Message-
> From: git-ow...@vger.kernel.org [mailto:git-ow...@vger.kernel.org] On
> Behalf Of Stefan Beller
> Sent: Wednesday, May 31, 2017 12:40 PM
> To: Kevin Willford <kcwillf...@gmail.com>
> Cc: git@vger.kernel.org; Junio C Hamano <gits...@pobox.com>; Kevin
> Willford <kewi...@microsoft.com>
> Subject: Re: [PATCH 1/2] format-patch: have progress option while
> generating patches
> 
> On Wed, May 31, 2017 at 8:04 AM, Kevin Willford <kcwillf...@gmail.com>
> wrote:
> > When generating patches for the rebase command if the user does not
> > realize the branch they are rebasing onto is thousands of commits
> > different there is no progress indication after initial rewinding
> > message.
> >
> > This patch allows a progress option to be passed to format-patch so
> > that the user can be informed the progress of generating the patch.
> > This option will then be used by the rebase command when calling
> > format-patch.
> 
> After reading the code, I was looking for some explanation on the underlying
> assumptions, such as:
> 
>   The progress meter as presented in this patch assumes the thousands of
>   patches to have a fine granularity as well as assuming to require all the
>   same amount of work/time for each, such that a steady progress bar
>   is achieved.
> 
>   We do not want to estimate the time for each patch based e.g.
>   on their size or number of touched files (or parents) as that is too
>   expensive for just a progress meter.
> 

Sounds good.  I will add some explanation to the commit message.

> 
> >
> > Signed-off-by: Kevin Willford <kewi...@microsoft.com>
> > ---
> >  Documentation/git-format-patch.txt |  8 
> >  builtin/log.c  | 10 ++
> >  2 files changed, 18 insertions(+)
> >
> > diff --git a/Documentation/git-format-patch.txt
> > b/Documentation/git-format-patch.txt
> > index c890328b02..ee5f99f606 100644
> > --- a/Documentation/git-format-patch.txt
> > +++ b/Documentation/git-format-patch.txt
> > @@ -23,6 +23,7 @@ SYNOPSIS
> >[(--reroll-count|-v) ]
> >[--to=] [--cc=]
> >[--[no-]cover-letter] [--quiet] [--notes[=]]
> > +  [--progress]
> >[]
> >[  |  ]
> >
> > @@ -260,6 +261,7 @@ you can use `--suffix=-patch` to get `0001-
> description-of-my-change-patch`.
> >  -q::
> >  --quiet::
> > Do not print the names of the generated files to standard output.
> > +   Progress is not reported to the standard error stream.
> >
> >  --no-binary::
> > Do not output contents of changes in binary files, instead @@
> > -283,6 +285,12 @@ you can use `--suffix=-patch` to get `0001-description-
> of-my-change-patch`.
> > range are always formatted as creation patches, independently
> > of this flag.
> >
> > +--progress::
> > +   Progress status is reported on the standard error stream
> > +   by default when it is attached to a terminal, unless -q
> > +   is specified. This flag forces progress status even if the
> > +   standard error stream is not directed to a terminal.
> > +
> >  CONFIGURATION
> >  -
> >  You can specify extra mail header lines to be added to each message,
> > diff --git a/builtin/log.c b/builtin/log.c index
> > 631fbc984f..02c50431b6 100644
> > --- a/builtin/log.c
> > +++ b/builtin/log.c
> > @@ -26,6 +26,7 @@
> >  #include "version.h"
> >  #include "mailmap.h"
> >  #include "gpg-interface.h"
> > +#include "progress.h"
> >
> >  /* Set a default date-time format for git log ("log.date" config
> > variable) */  static const char *default_date_mode = NULL; @@ -1409,6
> > +1410,8 @@ int cmd_format_patch(int argc, const char **argv, const char
> *prefix)
> > char *branch_name = NULL;
> > char *base_commit = NULL;
> > struct base_tree_info bases;
> > +   int show_progress = 0;
> > +   struct progress *progress = NULL;
> >
> > const struct option builtin_format_patch_options[] = {
> > { OPTION_CALLBACK, 'n', "numbered", , NULL,
> > @@ -1480,6 +1483,8 @@ int cmd_format_patch(int argc, const char **argv,
> const char *prefix)
> > OPT_FILENAME(0, "signature-file", _file,
> > N_("add a signature from a file")),
> >

[PATCH 0/2] Add progress to format-patch and rebase

2017-05-31 Thread Kevin Willford
This patch series is to add progress for the user when generating
patches in format-patch.  This is to aid the user when performing 
large rebases and have some indication to the user that progress
is being made.  

These patches just add the normal progress indication when
generating patches and makes it the default when running rebase.
It will also respect the -q[uiet] flag and not show when it
is present.

Kevin Willford (2):
  format-patch: have progress option while generating patches
  rebase: turn on progress option by default for format-patch

 Documentation/git-format-patch.txt |  8 
 builtin/log.c  | 10 ++
 git-rebase--am.sh  |  5 +++--
 git-rebase.sh  |  2 ++
 4 files changed, 23 insertions(+), 2 deletions(-)

-- 
2.13.0.92.g73a4ce6a77



[PATCH 1/2] format-patch: have progress option while generating patches

2017-05-31 Thread Kevin Willford
When generating patches for the rebase command if the user does
not realize the branch they are rebasing onto is thousands of
commits different there is no progress indication after initial
rewinding message.

This patch allows a progress option to be passed to format-patch
so that the user can be informed the progress of generating the
patch.  This option will then be used by the rebase command when
calling format-patch.

Signed-off-by: Kevin Willford <kewi...@microsoft.com>
---
 Documentation/git-format-patch.txt |  8 
 builtin/log.c  | 10 ++
 2 files changed, 18 insertions(+)

diff --git a/Documentation/git-format-patch.txt 
b/Documentation/git-format-patch.txt
index c890328b02..ee5f99f606 100644
--- a/Documentation/git-format-patch.txt
+++ b/Documentation/git-format-patch.txt
@@ -23,6 +23,7 @@ SYNOPSIS
   [(--reroll-count|-v) ]
   [--to=] [--cc=]
   [--[no-]cover-letter] [--quiet] [--notes[=]]
+  [--progress]
   []
   [  |  ]
 
@@ -260,6 +261,7 @@ you can use `--suffix=-patch` to get 
`0001-description-of-my-change-patch`.
 -q::
 --quiet::
Do not print the names of the generated files to standard output.
+   Progress is not reported to the standard error stream.
 
 --no-binary::
Do not output contents of changes in binary files, instead
@@ -283,6 +285,12 @@ you can use `--suffix=-patch` to get 
`0001-description-of-my-change-patch`.
range are always formatted as creation patches, independently
of this flag.
 
+--progress::
+   Progress status is reported on the standard error stream
+   by default when it is attached to a terminal, unless -q
+   is specified. This flag forces progress status even if the
+   standard error stream is not directed to a terminal.
+
 CONFIGURATION
 -
 You can specify extra mail header lines to be added to each message,
diff --git a/builtin/log.c b/builtin/log.c
index 631fbc984f..02c50431b6 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -26,6 +26,7 @@
 #include "version.h"
 #include "mailmap.h"
 #include "gpg-interface.h"
+#include "progress.h"
 
 /* Set a default date-time format for git log ("log.date" config variable) */
 static const char *default_date_mode = NULL;
@@ -1409,6 +1410,8 @@ int cmd_format_patch(int argc, const char **argv, const 
char *prefix)
char *branch_name = NULL;
char *base_commit = NULL;
struct base_tree_info bases;
+   int show_progress = 0;
+   struct progress *progress = NULL;
 
const struct option builtin_format_patch_options[] = {
{ OPTION_CALLBACK, 'n', "numbered", , NULL,
@@ -1480,6 +1483,8 @@ int cmd_format_patch(int argc, const char **argv, const 
char *prefix)
OPT_FILENAME(0, "signature-file", _file,
N_("add a signature from a file")),
OPT__QUIET(, N_("don't print the patch filenames")),
+   OPT_BOOL(0, "progress", _progress,
+N_("show progress")),
OPT_END()
};
 
@@ -1739,8 +1744,12 @@ int cmd_format_patch(int argc, const char **argv, const 
char *prefix)
start_number--;
}
rev.add_signoff = do_signoff;
+
+   if (show_progress && !quiet)
+   progress = start_progress(_("Generating patch"), total);
while (0 <= --nr) {
int shown;
+   display_progress(progress, total - nr);
commit = list[nr];
rev.nr = total - nr + (start_number - 1);
/* Make the second and subsequent mails replies to the first */
@@ -1805,6 +1814,7 @@ int cmd_format_patch(int argc, const char **argv, const 
char *prefix)
if (!use_stdout)
fclose(rev.diffopt.file);
}
+   stop_progress();
free(list);
free(branch_name);
string_list_clear(_to, 0);
-- 
2.13.0.92.g73a4ce6a77



[PATCH 2/2] rebase: turn on progress option by default for format-patch

2017-05-31 Thread Kevin Willford
This change passes the progress option of format-patch by
default and passes the -q --quiet option through to the
format-patch call so that it is respected as well.

Signed-off-by: Kevin Willford <kewi...@microsoft.com>
---
 git-rebase--am.sh | 5 +++--
 git-rebase.sh | 2 ++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/git-rebase--am.sh b/git-rebase--am.sh
index 375239341f..ab2be30abf 100644
--- a/git-rebase--am.sh
+++ b/git-rebase--am.sh
@@ -51,8 +51,9 @@ then
 else
rm -f "$GIT_DIR/rebased-patches"
 
-   git format-patch -k --stdout --full-index --cherry-pick --right-only \
-   --src-prefix=a/ --dst-prefix=b/ --no-renames --no-cover-letter \
+   git format-patch $git_format_patch_opt -k --stdout --full-index \
+   --cherry-pick --right-only --src-prefix=a/ --dst-prefix=b/ \
+   --no-renames --no-cover-letter --progress \
"$revisions" ${restrict_revision+^$restrict_revision} \
>"$GIT_DIR/rebased-patches"
ret=$?
diff --git a/git-rebase.sh b/git-rebase.sh
index db1deed846..b72e319ac9 100755
--- a/git-rebase.sh
+++ b/git-rebase.sh
@@ -73,6 +73,7 @@ test "$(git config --bool rebase.stat)" = true && diffstat=t
 autostash="$(git config --bool rebase.autostash || echo false)"
 fork_point=auto
 git_am_opt=
+git_format_patch_opt=
 rebase_root=
 force_rebase=
 allow_rerere_autoupdate=
@@ -308,6 +309,7 @@ do
--quiet)
GIT_QUIET=t
git_am_opt="$git_am_opt -q"
+   git_format_patch_opt="$git_format_patch_opt -q"
verbose=
diffstat=
;;
-- 
2.13.0.92.g73a4ce6a77



RE: [PATCH 3/3] reset.c: update files when using sparse to avoid data loss.

2017-04-12 Thread Kevin Willford

> -Original Message-
> From: git-ow...@vger.kernel.org [mailto:git-ow...@vger.kernel.org] On
> Behalf Of Duy Nguyen
> Sent: Wednesday, April 12, 2017 7:21 AM
> To: Kevin Willford <kewi...@microsoft.com>
> Cc: Kevin Willford <kcwillf...@gmail.com>; git@vger.kernel.org;
> gits...@pobox.com; p...@peff.net
> Subject: Re: [PATCH 3/3] reset.c: update files when using sparse to avoid
> data loss.
> 
> On Wed, Apr 12, 2017 at 5:30 AM, Kevin Willford <kewi...@microsoft.com>
> wrote:
> > The loss of the skip-worktree bits is part of the problem if you are
> > talking about modified files.  The other issue that I was having is
> > when running a reset and there were files added in the commit that is
> > being reset, there will not be an entry in the index and not a file on
> > disk so the data for that file is completely lost at that point.
> > "status" also doesn't include anything about this loss of data.  On
> > modified files status will at least have the file as deleted since
> > there is still an index entry but again the previous version of the file 
> > and it's
> data is lost.
> 
> Well, we could have "deleted" index entries, those marked with
> CE_REMOVE. They are never written down to file though, so 'status'
> won't benefit from that. Hopefully we can restore the file before the index
> file is written down and we really lose skip-worktree bits?

Because this is a reset --mixed it will never run through unpack_trees and 
The entries are never marked with CE_REMOVE.

> 
> > To me this is totally unexpected behavior, for example if I am on a
> > commit where there are only files that where added and run a reset
> > HEAD~1 and then a status, it will show a clean working directory.
> > Regardless of skip-worktree bits the user needs to have the data in
> > the working directory after the reset or data is lost which is always bad.
> 
> I agree we no longer have a place to save the skip-worktree bit, we have to
> restore the state back as if skip-worktree bit does not exist.
> It would be good if we could keep the logic in unpack_trees() though.
> For example, if the file is present on disk even if skip-worktree bit is on,
> unpack_trees() would abort instead of silently overwriting it.
> This is a difference between skip-worktree and assume-unchanged bits.
> If you do explicit checkout_entry() you might have to add more checks to
> keep behavior consistent.
> --
> Duy

Because this is a reset --mixed it will follow the code path calling 
read_from_tree
and ends up calling update_index_from_diff in the format_callback of the diff,
so unpack_trees() is never called in the --mixed case.  This code change also 
only applies
when the file does not exist and the skip-worktree bit is on and the updated 
index entry either will be missing (covers the added scenario) or was not 
missing
before (covers the modified scenario).  If there is a better way to get the 
previous
index entry to disk than what I am doing, I am happy to implement it correctly.

Thanks,
Kevin


RE: [PATCH 3/3] reset.c: update files when using sparse to avoid data loss.

2017-04-11 Thread Kevin Willford
> -Original Message-
> From: Duy Nguyen [mailto:pclo...@gmail.com]
> Sent: Monday, April 10, 2017 4:24 AM
> To: Kevin Willford <kcwillf...@gmail.com>
> Cc: git@vger.kernel.org; gits...@pobox.com; p...@peff.net; Kevin Willford
> <kewi...@microsoft.com>
> Subject: Re: [PATCH 3/3] reset.c: update files when using sparse to avoid
> data loss.
> 
> On Fri, Apr 07, 2017 at 12:23:57PM -0700, Kevin Willford wrote:
> > When using the sparse checkout feature the git reset command will add
> 
> "git reset" has three different modes. It would be good if you mention what
> mode is affected here. The tests are for --mixed only. I wonder if we need to
> do anything for --hard and --soft?
> 
> --soft touches branch SHA-1 index only, not worktree, so probably not.
> 
> --hard should be handled by unpack_trees(), I think.
> 
> But it would be good to cover these in the commit message as well to stop
> readers from wondering.

Sounds good.

> 
> > entries to the index that will have the skip-worktree bit off but will
> > leave the working directory empty.  File data is lost because the
> > index version of the files has been changed but there is nothing that
> > is in the working directory.  This will cause the next status call to
> > show either deleted for files modified or deleting or nothing for files
> added.
> > The added files should be shown as untracked and modified files should
> > be shown as modified.
> 
> Hmm.. reading --mixed documentation again ("Resets the index but not
> working tree"), I think the current behavior is expected regardless of skip-
> worktree bits.
> 
> Perhaps the problem is the loss of skip-worktree bits on entries added by
> update_index_from_diff()? If the bits are at the right place, then it should
> not matter if the same version exists on worktree or not and "status" or
> "commit" should work as expected, I think.
> 
> --
> Duy

The loss of the skip-worktree bits is part of the problem if you are talking
about modified files.  The other issue that I was having is when running a reset
and there were files added in the commit that is being reset, there will not
be an entry in the index and not a file on disk so the data for that file is
completely lost at that point.  "status" also doesn't include anything about
this loss of data.  On modified files status will at least have the file as 
deleted
since there is still an index entry but again the previous version of the file
and it's data is lost.

To me this is totally unexpected behavior, for example if I am on a commit
where there are only files that where added and run a reset HEAD~1 and
then a status, it will show a clean working directory. Regardless of 
skip-worktree bits the user needs to have the data in the working directory
after the reset or data is lost which is always bad.

Kevin



[PATCH 2/3] apply.c: do not checkout file when skip-worktree bit set

2017-04-07 Thread Kevin Willford
When using the sparse-checkout feature git should not write to
the working directory for files with the skip-worktree bit on.
With the skip-worktree bit on the file may or may not be in
the working directory and if it is not we don't want or need to
create it by calling checkout_entry.

There are two callers of checkout_target.  Both of which check
that the file does not exist before calling checkout_target.
load_current which make a call to lstat right before calling checkout_target
and check_preimage which will only run checkout_taret it stat_ret
is less than zero.  It sets stat_ret to zero and only if
!stat->cached will it lstat the file and set stat_ret to
something other than zero.

This patch checks if skip-worktree bit is on in checkout_target
and just returns so that the entry doesn't not end up in the
working directory.  This is so that apply will not create a file
in the working directory, then update the index but not keep the
working directory up to date with the changes that happened in the index.

Signed-off-by: Kevin Willford <kewi...@microsoft.com>
---
 apply.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/apply.c b/apply.c
index 0e2caeab9c..0da5d0b7c9 100644
--- a/apply.c
+++ b/apply.c
@@ -3327,6 +3327,24 @@ static int checkout_target(struct index_state *istate,
 {
struct checkout costate = CHECKOUT_INIT;
 
+   /*
+* Do not checkout the entry if the skipworktree bit is set
+*
+* Both callers of this method (check_preimage and load_current)
+* check for the existance of the file before calling this
+* method so we know that the file doesn't exist at this point
+* and we don't need to perform that check again here.
+* We just need to check the skip-worktree and return.
+*
+* This is to prevent git from creating a file in the
+* working directory that has the skip-worktree bit on,
+* then updating the index from the patch and not keeping
+* the working directory version up to date with what it
+* changed the index version to be.
+*/
+   if (ce_skip_worktree(ce))
+   return 0;
+
costate.refresh_cache = 1;
costate.istate = istate;
if (checkout_entry(ce, , NULL) || lstat(ce->name, st))
-- 
2.12.2.windows.2.1.g7df5db8d31



[PATCH 3/3] reset.c: update files when using sparse to avoid data loss.

2017-04-07 Thread Kevin Willford
When using the sparse checkout feature the git reset command will add
entries to the index that will have the skip-worktree bit off but will
leave the working directory empty.  File data is lost because the index
version of the files has been changed but there is nothing that is in the
working directory.  This will cause the next status call to show either
deleted for files modified or deleting or nothing for files added.
The added files should be shown as untracked and modified files should
be shown as modified.

To fix this when the reset is running if there is not a file in the
working directory and if it will be missing with the new index entry
or was not missing in the previous version, we create the previous index
version of the file in the working directory so that status will report
correctly and the files will be availble for the user to deal with.

Signed-off-by: Kevin Willford <kewi...@microsoft.com>
---
 builtin/reset.c  | 34 +++
 t/t7114-reset-sparse-checkout.sh | 58 
 2 files changed, 92 insertions(+)
 create mode 100755 t/t7114-reset-sparse-checkout.sh

diff --git a/builtin/reset.c b/builtin/reset.c
index 8ab915bfcb..8ba97999f4 100644
--- a/builtin/reset.c
+++ b/builtin/reset.c
@@ -21,6 +21,7 @@
 #include "parse-options.h"
 #include "unpack-trees.h"
 #include "cache-tree.h"
+#include "dir.h"
 
 static const char * const git_reset_usage[] = {
N_("git reset [--mixed | --soft | --hard | --merge | --keep] [-q] 
[]"),
@@ -117,12 +118,45 @@ static void update_index_from_diff(struct 
diff_queue_struct *q,
struct diff_options *opt, void *data)
 {
int i;
+   int pos;
int intent_to_add = *(int *)data;
 
for (i = 0; i < q->nr; i++) {
struct diff_filespec *one = q->queue[i]->one;
+   struct diff_filespec *two = q->queue[i]->two;
int is_missing = !(one->mode && !is_null_oid(>oid));
+   int was_missing = !two->mode && is_null_oid(>oid);
struct cache_entry *ce;
+   struct cache_entry *ceBefore;
+   struct checkout state = CHECKOUT_INIT;
+
+   /*
+* When using the sparse-checkout feature the cache entries 
that are
+* added here will not have the skip-worktree bit set.
+* Without this code there is data that is lost because the 
files that
+* would normally be in the working directory are not there and 
show as
+* deleted for the next status or in the case of added files 
just disappear.
+* We need to create the previous version of the files in the 
working
+* directory so that they will have the right content and the 
next
+* status call will show modified or untracked files correctly.
+*/
+   if (core_apply_sparse_checkout && !file_exists(two->path))
+   {
+   pos = cache_name_pos(two->path, strlen(two->path));
+   if ((pos >= 0 && ce_skip_worktree(active_cache[pos])) 
&& (is_missing || !was_missing))
+   {
+   state.force = 1;
+   state.refresh_cache = 1;
+   state.istate = _index;
+   ceBefore = make_cache_entry(two->mode, 
two->oid.hash, two->path,
+   0, 0);
+   if (!ceBefore)
+   die(_("make_cache_entry failed for path 
'%s'"),
+   two->path);
+
+   checkout_entry(ceBefore, , NULL);
+   }
+   }
 
if (is_missing && !intent_to_add) {
remove_file_from_cache(one->path);
diff --git a/t/t7114-reset-sparse-checkout.sh b/t/t7114-reset-sparse-checkout.sh
new file mode 100755
index 00..8dd88fd46d
--- /dev/null
+++ b/t/t7114-reset-sparse-checkout.sh
@@ -0,0 +1,58 @@
+#!/bin/sh
+
+test_description='reset when using a sparse-checkout'
+
+. ./test-lib.sh
+
+# reset using a sparse-checkout file
+
+test_expect_success 'setup' '
+   test_tick &&
+   echo "checkout file" >c &&
+   echo "modify file" >m &&
+   echo "delete file" >d &&
+   git add . &&
+   git commit -m "initial commit" &&
+   echo "added file" >a &&
+   echo "modification of a file" >m &&
+   git rm d &&
+   git add . &&
+   git commit -m "second commit" &a

[PATCH 0/3] fix working directory file issues while using sparse-checkout

2017-04-07 Thread Kevin Willford
While using the sparse-checkout feature there are scenarios where 
the working directory should and should not be updated.  This patch
series addresses issues found in reset, apply, and merge conflicts.

Kevin Willford (3):
  merge-recursive.c: conflict using sparse should update file
  apply.c: do not checkout file when skip-worktree bit set
  reset.c: update files when using sparse to avoid data loss.

 apply.c  | 18 +
 builtin/reset.c  | 34 +++
 merge-recursive.c|  8 ++
 t/t7114-reset-sparse-checkout.sh | 58 
 t/t7614-merge-sparse-checkout.sh | 27 +++
 5 files changed, 145 insertions(+)
 create mode 100755 t/t7114-reset-sparse-checkout.sh
 create mode 100755 t/t7614-merge-sparse-checkout.sh

-- 
2.12.2.windows.2.1.g7df5db8d31



[PATCH 1/3] merge-recursive.c: conflict using sparse should update file

2017-04-07 Thread Kevin Willford
Update the file when there is a conflict with a modify/delete scenario
when using the sparse-checkout feature since the file might not be on disk
because the skip-worktree bit is on and the user will need the file and
content to determine how to resolve the conflict.

Signed-off-by: Kevin Willford <kewi...@microsoft.com>
---
 merge-recursive.c|  8 
 t/t7614-merge-sparse-checkout.sh | 27 +++
 2 files changed, 35 insertions(+)
 create mode 100755 t/t7614-merge-sparse-checkout.sh

diff --git a/merge-recursive.c b/merge-recursive.c
index b7ff1ada3c..54fce93dae 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -1103,6 +1103,14 @@ static int handle_change_delete(struct merge_options *o,
   "and %s in %s. Version %s of %s left in tree."),
   change, path, o->branch2, change_past,
   o->branch1, o->branch1, path);
+   /*
+* In a sparse checkout the file may not exist. Sadly,
+* the CE_SKIP_WORKTREE flag is not preserved in the
+* case of conflicts, therefore we do the next best
+* thing: verify that the file exists.
+*/
+   if (!file_exists(path))
+   ret = update_file(o, 0, a_oid, a_mode, path);
} else {
output(o, 1, _("CONFLICT (%s/delete): %s deleted in %s "
   "and %s in %s. Version %s of %s left in tree at 
%s."),
diff --git a/t/t7614-merge-sparse-checkout.sh b/t/t7614-merge-sparse-checkout.sh
new file mode 100755
index 00..6922f0244f
--- /dev/null
+++ b/t/t7614-merge-sparse-checkout.sh
@@ -0,0 +1,27 @@
+#!/bin/sh
+
+test_description='merge can handle sparse-checkout'
+
+. ./test-lib.sh
+
+# merges with conflicts
+
+test_expect_success 'setup' '
+   test_commit a &&
+   test_commit file &&
+   git checkout -b delete-file &&
+   git rm file.t &&
+   test_tick &&
+   git commit -m "remove file" &&
+   git checkout master &&
+   test_commit modify file.t changed
+'
+
+test_expect_success 'merge conflict deleted file and modified' '
+   echo "/a" >.git/info/sparse-checkout &&
+   test_config core.sparsecheckout true &&
+   test_must_fail git merge delete-file &&
+   test_path_is_file file.t
+'
+
+test_done
-- 
2.12.2.windows.2.1.g7df5db8d31



[[PATCH v2] 2/4] patch-ids: replace the seen indicator with a commit pointer

2016-07-29 Thread Kevin Willford
From: Kevin Willford <kewi...@microsoft.com>

The cherry_pick_list was looping through the original side checking the
seen indicator and setting the cherry_flag on the commit.  If we save
off the commit in the patch_id we can set the cherry_flag on the correct
commit when running through the other side when a patch_id match is found.

Signed-off-by: Kevin Willford <kcwillf...@gmail.com>
---
 patch-ids.c |  1 +
 patch-ids.h |  2 +-
 revision.c  | 18 +++---
 3 files changed, 5 insertions(+), 16 deletions(-)

diff --git a/patch-ids.c b/patch-ids.c
index db31fa6..bafaae2 100644
--- a/patch-ids.c
+++ b/patch-ids.c
@@ -43,6 +43,7 @@ static int init_patch_id_entry(struct patch_id *patch,
   struct commit *commit,
   struct patch_ids *ids)
 {
+   patch->commit = commit;
if (commit_patch_id(commit, >diffopts, patch->patch_id))
return -1;
 
diff --git a/patch-ids.h b/patch-ids.h
index 9569ee0..dea1ecd 100644
--- a/patch-ids.h
+++ b/patch-ids.h
@@ -4,7 +4,7 @@
 struct patch_id {
struct hashmap_entry ent;
unsigned char patch_id[GIT_SHA1_RAWSZ];
-   char seen;
+   struct commit *commit;
 };
 
 struct patch_ids {
diff --git a/revision.c b/revision.c
index edba5b7..19cc067 100644
--- a/revision.c
+++ b/revision.c
@@ -846,7 +846,7 @@ static void cherry_pick_list(struct commit_list *list, 
struct rev_info *revs)
 */
if (left_first != !!(flags & SYMMETRIC_LEFT))
continue;
-   commit->util = add_commit_patch_id(commit, );
+   add_commit_patch_id(commit, );
}
 
/* either cherry_mark or cherry_pick are true */
@@ -873,21 +873,9 @@ static void cherry_pick_list(struct commit_list *list, 
struct rev_info *revs)
id = has_commit_patch_id(commit, );
if (!id)
continue;
-   id->seen = 1;
-   commit->object.flags |= cherry_flag;
-   }
 
-   /* Now check the original side for seen ones */
-   for (p = list; p; p = p->next) {
-   struct commit *commit = p->item;
-   struct patch_id *ent;
-
-   ent = commit->util;
-   if (!ent)
-   continue;
-   if (ent->seen)
-   commit->object.flags |= cherry_flag;
-   commit->util = NULL;
+   commit->object.flags |= cherry_flag;
+   id->commit->object.flags |= cherry_flag;
}
 
free_patch_ids();
-- 
2.9.2.gvfs.2.42.gb7633a3

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[[PATCH v2] 3/4] patch-ids: add flag to create the diff patch id using header only data

2016-07-29 Thread Kevin Willford
From: Kevin Willford <kewi...@microsoft.com>

This will allow a diff patch id to be created using only the header data
so that the contents of the file will not have to be loaded.

Signed-off-by: Kevin Willford <kcwillf...@gmail.com>
---
 diff.c  | 16 ++--
 diff.h  |  2 +-
 patch-ids.c |  2 +-
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/diff.c b/diff.c
index 7d03419..28a4190 100644
--- a/diff.c
+++ b/diff.c
@@ -4455,7 +4455,7 @@ static void patch_id_consume(void *priv, char *line, 
unsigned long len)
 }
 
 /* returns 0 upon success, and writes result into sha1 */
-static int diff_get_patch_id(struct diff_options *options, unsigned char *sha1)
+static int diff_get_patch_id(struct diff_options *options, unsigned char 
*sha1, int diff_header_only)
 {
struct diff_queue_struct *q = _queued_diff;
int i;
@@ -4490,9 +4490,6 @@ static int diff_get_patch_id(struct diff_options 
*options, unsigned char *sha1)
 
diff_fill_sha1_info(p->one);
diff_fill_sha1_info(p->two);
-   if (fill_mmfile(, p->one) < 0 ||
-   fill_mmfile(, p->two) < 0)
-   return error("unable to read files to diff");
 
len1 = remove_space(p->one->path, strlen(p->one->path));
len2 = remove_space(p->two->path, strlen(p->two->path));
@@ -4527,6 +4524,13 @@ static int diff_get_patch_id(struct diff_options 
*options, unsigned char *sha1)
len2, p->two->path);
git_SHA1_Update(, buffer, len1);
 
+   if (diff_header_only)
+   continue;
+
+   if (fill_mmfile(, p->one) < 0 ||
+   fill_mmfile(, p->two) < 0)
+   return error("unable to read files to diff");
+
if (diff_filespec_is_binary(p->one) ||
diff_filespec_is_binary(p->two)) {
git_SHA1_Update(, oid_to_hex(>one->oid),
@@ -4549,11 +4553,11 @@ static int diff_get_patch_id(struct diff_options 
*options, unsigned char *sha1)
return 0;
 }
 
-int diff_flush_patch_id(struct diff_options *options, unsigned char *sha1)
+int diff_flush_patch_id(struct diff_options *options, unsigned char *sha1, int 
diff_header_only)
 {
struct diff_queue_struct *q = _queued_diff;
int i;
-   int result = diff_get_patch_id(options, sha1);
+   int result = diff_get_patch_id(options, sha1, diff_header_only);
 
for (i = 0; i < q->nr; i++)
diff_free_filepair(q->queue[i]);
diff --git a/diff.h b/diff.h
index 125447b..7883729 100644
--- a/diff.h
+++ b/diff.h
@@ -342,7 +342,7 @@ extern int run_diff_files(struct rev_info *revs, unsigned 
int option);
 extern int run_diff_index(struct rev_info *revs, int cached);
 
 extern int do_diff_cache(const unsigned char *, struct diff_options *);
-extern int diff_flush_patch_id(struct diff_options *, unsigned char *);
+extern int diff_flush_patch_id(struct diff_options *, unsigned char *, int);
 
 extern int diff_result_code(struct diff_options *, int);
 
diff --git a/patch-ids.c b/patch-ids.c
index bafaae2..69a14a3 100644
--- a/patch-ids.c
+++ b/patch-ids.c
@@ -13,7 +13,7 @@ int commit_patch_id(struct commit *commit, struct 
diff_options *options,
else
diff_root_tree_sha1(commit->object.oid.hash, "", options);
diffcore_std(options);
-   return diff_flush_patch_id(options, sha1);
+   return diff_flush_patch_id(options, sha1, 0);
 }
 
 static int patch_id_cmp(struct patch_id *a,
-- 
2.9.2.gvfs.2.42.gb7633a3

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[[PATCH v2] 4/4] rebase: avoid computing unnecessary patch IDs

2016-07-29 Thread Kevin Willford
From: Kevin Willford <kewi...@microsoft.com>

The `rebase` family of Git commands avoid applying patches that were
already integrated upstream. They do that by using the revision walking
option that computes the patch IDs of the two sides of the rebase
(local-only patches vs upstream-only ones) and skipping those local
patches whose patch ID matches one of the upstream ones.

In many cases, this causes unnecessary churn, as already the set of
paths touched by a given commit would suffice to determine that an
upstream patch has no local equivalent.

This hurts performance in particular when there are a lot of upstream
patches, and/or large ones.

Therefore, let's introduce the concept of a "diff-header-only" patch ID,
compare those first, and only evaluate the "full" patch ID lazily.

Please note that in contrast to the "full" patch IDs, those
"diff-header-only" patch IDs are prone to collide with one another, as
adjacent commits frequently touch the very same files. Hence we now
have to be careful to allow multiple hash entries with the same hash.
We accomplish that by using the hashmap_add() function that does not even
test for hash collisions.  This also allows us to evaluate the full patch ID
lazily, i.e. only when we found commits with matching diff-header-only
patch IDs.

We add a performance test that demonstrates ~1-6% improvement.  In
practice this will depend on various factors such as how many upstream
changes and how big those changes are along with whether file system
caches are cold or warm.  As Git's test suite has no way of catching
performance regressions, we also add a regression test that verifies
that the full patch ID computation is skipped when the diff-header-only
computation suffices.

Signed-off-by: Kevin Willford <kcwillf...@gmail.com>
---
 builtin/log.c|  2 +-
 patch-ids.c  | 22 --
 patch-ids.h  |  2 +-
 t/perf/p3400-rebase.sh   | 36 
 t/t6007-rev-list-cherry-pick-file.sh | 17 +
 5 files changed, 71 insertions(+), 8 deletions(-)
 create mode 100644 t/perf/p3400-rebase.sh

diff --git a/builtin/log.c b/builtin/log.c
index fd1652f..b076993 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -1331,7 +1331,7 @@ static void prepare_bases(struct base_tree_info *bases,
struct object_id *patch_id;
if (commit->util)
continue;
-   if (commit_patch_id(commit, , sha1))
+   if (commit_patch_id(commit, , sha1, 0))
die(_("cannot get patch id"));
ALLOC_GROW(bases->patch_id, bases->nr_patch_id + 1, 
bases->alloc_patch_id);
patch_id = bases->patch_id + bases->nr_patch_id;
diff --git a/patch-ids.c b/patch-ids.c
index 69a14a3..0a4828a 100644
--- a/patch-ids.c
+++ b/patch-ids.c
@@ -5,7 +5,7 @@
 #include "patch-ids.h"
 
 int commit_patch_id(struct commit *commit, struct diff_options *options,
-   unsigned char *sha1)
+   unsigned char *sha1, int diff_header_only)
 {
if (commit->parents)
diff_tree_sha1(commit->parents->item->object.oid.hash,
@@ -13,13 +13,21 @@ int commit_patch_id(struct commit *commit, struct 
diff_options *options,
else
diff_root_tree_sha1(commit->object.oid.hash, "", options);
diffcore_std(options);
-   return diff_flush_patch_id(options, sha1, 0);
+   return diff_flush_patch_id(options, sha1, diff_header_only);
 }
 
 static int patch_id_cmp(struct patch_id *a,
struct patch_id *b,
-   void *keydata)
+   struct diff_options *opt)
 {
+   if (is_null_sha1(a->patch_id) &&
+   commit_patch_id(a->commit, opt, a->patch_id, 0))
+   return error("Could not get patch ID for %s",
+   oid_to_hex(>commit->object.oid));
+   if (is_null_sha1(b->patch_id) &&
+   commit_patch_id(b->commit, opt, b->patch_id, 0))
+   return error("Could not get patch ID for %s",
+   oid_to_hex(>commit->object.oid));
return hashcmp(a->patch_id, b->patch_id);
 }
 
@@ -43,11 +51,13 @@ static int init_patch_id_entry(struct patch_id *patch,
   struct commit *commit,
   struct patch_ids *ids)
 {
+   unsigned char header_only_patch_id[GIT_SHA1_RAWSZ];
+
patch->commit = commit;
-   if (commit_patch_id(commit, >diffopts, patch->patch_id))
+   if (commit_patch_id(commit, >diffopts, header_only_patch_id, 1))
return -1;
 
-   hashmap_entry_init(patch, sha1hash(patch->patch_id));
+   has

[[PATCH v2] 1/4] patch-ids: stop using a hand-rolled hashmap implementation

2016-07-29 Thread Kevin Willford
From: Kevin Willford <kewi...@microsoft.com>

This change will use the hashmap from the hashmap.h to keep track of the
patch_ids that have been encountered instead of using an internal
implementation.  This simplifies the implementation of the patch ids.

Signed-off-by: Kevin Willford <kcwillf...@gmail.com>
---
 patch-ids.c | 86 +
 patch-ids.h |  7 +++--
 2 files changed, 32 insertions(+), 61 deletions(-)

diff --git a/patch-ids.c b/patch-ids.c
index a4d0016..db31fa6 100644
--- a/patch-ids.c
+++ b/patch-ids.c
@@ -16,90 +16,62 @@ int commit_patch_id(struct commit *commit, struct 
diff_options *options,
return diff_flush_patch_id(options, sha1);
 }
 
-static const unsigned char *patch_id_access(size_t index, void *table)
+static int patch_id_cmp(struct patch_id *a,
+   struct patch_id *b,
+   void *keydata)
 {
-   struct patch_id **id_table = table;
-   return id_table[index]->patch_id;
+   return hashcmp(a->patch_id, b->patch_id);
 }
 
-static int patch_pos(struct patch_id **table, int nr, const unsigned char *id)
-{
-   return sha1_pos(id, table, nr, patch_id_access);
-}
-
-#define BUCKET_SIZE 190 /* 190 * 21 = 3990, with slop close enough to 4K */
-struct patch_id_bucket {
-   struct patch_id_bucket *next;
-   int nr;
-   struct patch_id bucket[BUCKET_SIZE];
-};
-
 int init_patch_ids(struct patch_ids *ids)
 {
memset(ids, 0, sizeof(*ids));
diff_setup(>diffopts);
DIFF_OPT_SET(>diffopts, RECURSIVE);
diff_setup_done(>diffopts);
+   hashmap_init(>patches, (hashmap_cmp_fn)patch_id_cmp, 256);
return 0;
 }
 
 int free_patch_ids(struct patch_ids *ids)
 {
-   struct patch_id_bucket *next, *patches;
-
-   free(ids->table);
-   for (patches = ids->patches; patches; patches = next) {
-   next = patches->next;
-   free(patches);
-   }
+   hashmap_free(>patches, 1);
return 0;
 }
 
-static struct patch_id *add_commit(struct commit *commit,
-  struct patch_ids *ids,
-  int no_add)
+static int init_patch_id_entry(struct patch_id *patch,
+  struct commit *commit,
+  struct patch_ids *ids)
 {
-   struct patch_id_bucket *bucket;
-   struct patch_id *ent;
-   unsigned char sha1[20];
-   int pos;
-
-   if (commit_patch_id(commit, >diffopts, sha1))
-   return NULL;
-   pos = patch_pos(ids->table, ids->nr, sha1);
-   if (0 <= pos)
-   return ids->table[pos];
-   if (no_add)
-   return NULL;
-
-   pos = -1 - pos;
+   if (commit_patch_id(commit, >diffopts, patch->patch_id))
+   return -1;
 
-   bucket = ids->patches;
-   if (!bucket || (BUCKET_SIZE <= bucket->nr)) {
-   bucket = xcalloc(1, sizeof(*bucket));
-   bucket->next = ids->patches;
-   ids->patches = bucket;
-   }
-   ent = >bucket[bucket->nr++];
-   hashcpy(ent->patch_id, sha1);
-
-   ALLOC_GROW(ids->table, ids->nr + 1, ids->alloc);
-   if (pos < ids->nr)
-   memmove(ids->table + pos + 1, ids->table + pos,
-   sizeof(ent) * (ids->nr - pos));
-   ids->nr++;
-   ids->table[pos] = ent;
-   return ids->table[pos];
+   hashmap_entry_init(patch, sha1hash(patch->patch_id));
+   return 0;
 }
 
 struct patch_id *has_commit_patch_id(struct commit *commit,
 struct patch_ids *ids)
 {
-   return add_commit(commit, ids, 1);
+   struct patch_id patch;
+
+   memset(, 0, sizeof(patch));
+   if (init_patch_id_entry(, commit, ids))
+   return NULL;
+
+   return hashmap_get(>patches, , NULL);
 }
 
 struct patch_id *add_commit_patch_id(struct commit *commit,
 struct patch_ids *ids)
 {
-   return add_commit(commit, ids, 0);
+   struct patch_id *key = xcalloc(1, sizeof(*key));
+
+   if (init_patch_id_entry(key, commit, ids)) {
+   free(key);
+   return NULL;
+   }
+
+   hashmap_add(>patches, key);
+   return key;
 }
diff --git a/patch-ids.h b/patch-ids.h
index eeb56b3..9569ee0 100644
--- a/patch-ids.h
+++ b/patch-ids.h
@@ -2,15 +2,14 @@
 #define PATCH_IDS_H
 
 struct patch_id {
-   unsigned char patch_id[20];
+   struct hashmap_entry ent;
+   unsigned char patch_id[GIT_SHA1_RAWSZ];
char seen;
 };
 
 struct patch_ids {
+   struct hashmap patches;
struct diff_options diffopts;
-   int nr, alloc;
-   struct patch_id **table;
-   struct patch_id_bucket *patches;
 };
 
 int commit_patch_id(struct commit *commit, struct diff_opt

[[PATCH v2] 0/4] Use header data patch ids for rebase to avoid loading file content

2016-07-29 Thread Kevin Willford
This patch series is to remove the hand rolled hashmap in the patch_ids
and use the hashmap.h implementation.  It also introduces the idea of having
a header only patch id so that the contents of the files do not have to be
loaded in order to determine if two commits are different.


Kevin Willford (4):
  patch-ids: stop using a hand-rolled hashmap implementation
  patch-ids: replace the seen indicator with a commit pointer
  patch-ids: add flag to create the diff patch id using header only data
  rebase: avoid computing unnecessary patch IDs

 builtin/log.c|  2 +-
 diff.c   | 16 +++---
 diff.h   |  2 +-
 patch-ids.c  | 99 +++-
 patch-ids.h  | 11 ++--
 revision.c   | 18 ++-
 t/perf/p3400-rebase.sh   | 36 +
 t/t6007-rev-list-cherry-pick-file.sh | 17 +++
 8 files changed, 114 insertions(+), 87 deletions(-)
 create mode 100644 t/perf/p3400-rebase.sh

-- 
2.9.2.windows.1

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Use path comparison for patch ids before the file content

2016-07-14 Thread Kevin Willford
When limiting the list in a revision walk using cherry pick, patch ids are
calculated by producing the diff of the content of the files.  This would
be more efficent by using a patch id looking at the paths that were
changed in the commits and only if all the file changed are the same fall
back to getting the content of the files in the commits to determine if
the commits are the same.

This change uses a hashmap to store entries with a hash of the patch id
calculated by just using the file paths.  The entries constist of the
commit and the patch id calculated using file contents which is initially
empty.  If there are commits that are found in the hashmap it means that
the same files were changed in the commits and the file contents need to
be checked in order to determine if the commits are truly the same.  The
patch id that is calcuated based on the file contents is then stored in the
hashmap entry if needed in later comparisons.  If the commits are determined to 
be
the same the cherry_flag is set on the commit being checked as well as the
commit in the hashmap entry saving running though the original list of
commits checking a seen flag.  This will speed up a rebase where the
upstream has many changes but none of them have been pulled into the
current branch.
---
 diff.c  |  16 +
 diff.h  |   2 +-
 patch-ids.c | 114 +---
 patch-ids.h |   7 ++--
 revision.c  |  19 ++
 5 files changed, 73 insertions(+), 85 deletions(-)

diff --git a/diff.c b/diff.c
index fa78fc1..f38b663 100644
--- a/diff.c
+++ b/diff.c
@@ -4449,7 +4449,7 @@ static void patch_id_consume(void *priv, char *line, 
unsigned long len)
 }
 
 /* returns 0 upon success, and writes result into sha1 */
-static int diff_get_patch_id(struct diff_options *options, unsigned char *sha1)
+static int diff_get_patch_id(struct diff_options *options, unsigned char 
*sha1, int use_path_only)
 {
struct diff_queue_struct *q = _queued_diff;
int i;
@@ -4484,9 +4484,6 @@ static int diff_get_patch_id(struct diff_options 
*options, unsigned char *sha1)
 
diff_fill_sha1_info(p->one);
diff_fill_sha1_info(p->two);
-   if (fill_mmfile(, p->one) < 0 ||
-   fill_mmfile(, p->two) < 0)
-   return error("unable to read files to diff");
 
len1 = remove_space(p->one->path, strlen(p->one->path));
len2 = remove_space(p->two->path, strlen(p->two->path));
@@ -4521,6 +4518,13 @@ static int diff_get_patch_id(struct diff_options 
*options, unsigned char *sha1)
len2, p->two->path);
git_SHA1_Update(, buffer, len1);
 
+   if (use_path_only)
+   continue;
+
+   if (fill_mmfile(, p->one) < 0 ||
+   fill_mmfile(, p->two) < 0)
+   return error("unable to read files to diff");
+
if (diff_filespec_is_binary(p->one) ||
diff_filespec_is_binary(p->two)) {
git_SHA1_Update(, sha1_to_hex(p->one->sha1), 40);
@@ -4541,11 +4545,11 @@ static int diff_get_patch_id(struct diff_options 
*options, unsigned char *sha1)
return 0;
 }
 
-int diff_flush_patch_id(struct diff_options *options, unsigned char *sha1)
+int diff_flush_patch_id(struct diff_options *options, unsigned char *sha1, int 
use_path_only)
 {
struct diff_queue_struct *q = _queued_diff;
int i;
-   int result = diff_get_patch_id(options, sha1);
+   int result = diff_get_patch_id(options, sha1, use_path_only);
 
for (i = 0; i < q->nr; i++)
diff_free_filepair(q->queue[i]);
diff --git a/diff.h b/diff.h
index 125447b..7883729 100644
--- a/diff.h
+++ b/diff.h
@@ -342,7 +342,7 @@ extern int run_diff_files(struct rev_info *revs, unsigned 
int option);
 extern int run_diff_index(struct rev_info *revs, int cached);
 
 extern int do_diff_cache(const unsigned char *, struct diff_options *);
-extern int diff_flush_patch_id(struct diff_options *, unsigned char *);
+extern int diff_flush_patch_id(struct diff_options *, unsigned char *, int);
 
 extern int diff_result_code(struct diff_options *, int);
 
diff --git a/patch-ids.c b/patch-ids.c
index a4d0016..f0262ce 100644
--- a/patch-ids.c
+++ b/patch-ids.c
@@ -4,8 +4,9 @@
 #include "sha1-lookup.h"
 #include "patch-ids.h"
 
-int commit_patch_id(struct commit *commit, struct diff_options *options,
-   unsigned char *sha1)
+
+static int ll_commit_patch_id(struct commit *commit, struct diff_options 
*options,
+   unsigned char *sha1, int use_path_only)
 {
if (commit->parents)
diff_tree_sha1(commit->parents->item->object.oid.hash,
@@ -13,93 +14,90 @@ int commit_patch_id(struct commit *commit, struct 
diff_options *options,
else
diff_root_tree_sha1(commit->object.oid.hash, "",