Re: [RFC PATCH v1] http: add http.keepRejectedCredentials config

2018-06-07 Thread Jeff King
On Thu, Jun 07, 2018 at 08:15:16PM -0700, Lars Schneider wrote:

> > In fact, this patch probably should give the user some advice in that
> > regard (either in the documentation, or as a warning when we skip the
> > rejection). If you _do_ have a bogus credential and set the new option,
> > you'd need to reject it manually (you can do it with "git credential
> > reject", but it's probably easier to just unset the option temporarily
> > and re-invoke the original command).
> 
> I like the advice idea very much!
> 
> How about this?
> 
> $ git fetch
> hint: Git has stored invalid credentials.
> hint: Reject them with 'git credential reject' or
> hint: disable the Git config 'http.keepRejectedCredentials'.
> remote: Invalid username or password.
> fatal: Authentication failed for 'https://server.com/myrepo.git/'
> 
> I am not really sure about the grammar :-)

It's probably not worth pointing the user at "git credential reject",
since it's not really meant to be friendly to users. In particular, you
have to speak the credential protocol on stdin.

I _think_

  echo https://server.com/myrepo.git | git credential reject

might be enough, but I didn't test. Probably better advice is to just
repeat the command. Maybe:

  hint: Git kept invalid credentials due to the value of
  hint: http.keepRejectedCredentials. If you wish to drop these
  hint: credentials and be prompted for new ones, re-run your
  hint: command with "git -c http.keepRejectedCredentials=false".

or something?
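
For reference, the stdin format is just key=value attributes terminated by a
blank line (per git-credential(1)). An untested sketch, using the example URL
from above:

  # url= is shorthand that git expands into protocol/host/path itself
  printf 'url=https://server.com/myrepo.git\n\n' | git credential reject

  # equivalent spelled-out form
  printf 'protocol=https\nhost=server.com\npath=myrepo.git\n\n' |
  git credential reject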

-Peff


Re: [ANNOUNCE] Git v2.16.0-rc0

2018-06-07 Thread Jeff King
On Tue, Jan 02, 2018 at 09:35:16PM -0800, Jonathan Nieder wrote:

> > bturner@ubuntu:~$ ssh -V
> > OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8, OpenSSL 1.0.1f 6 Jan 2014
> >
> > bturner@ubuntu:~$ ssh -G -p 7999 localhost
> > unknown option -- G
> > usage: ssh [-1246AaCfgKkMNnqsTtVvXxYy] [-b bind_address] [-c cipher_spec]
> [...]
> > Is it possible to adjust the check, somehow, so it doesn't impact
> > older OpenSSH versions like this? As it stands, it seems likely a fair
> > number of users who have an SSH command that does support -4, -6 and
> > -p are going to end up getting "penalized" because it doesn't also
> > support -G, and have to manually set their SSH variant to "ssh" (or
> > something other than "auto") to avoid the automatic detection.
> >
> > I'd love to say I have a brilliant idea for how to work around this,
> > oh and here's a patch, but I don't. One option might be trying to
> > actually review the output, and another might be to run "ssh -V", but
> > both of those have their own flaws (and the extra process forks aren't
> > "free").
> 
> I have tomorrow off, so I've filed https://crbug.com/git/7 to make
> sure I remember to follow up the day after.  Of course I'll be happy
> if someone updates that bug saying they've fixed it in the meantime.

It doesn't look like we ever applied anything to deal with this
regression. Just FYI, this bit me today when upgrading my git on a
system that has an ssh wrapper that understands "-p" just fine, but not
"-G". So the behavior described in [1], namely to just fallback to
assuming some basic openssh-ish options, would have worked for me.

To be honest, I could easily see an argument that I _should_ be setting
GIT_SSH_VARIANT to explain what my wrapper is expecting, even though it
happened to work before. But it seems like this discussion ended in
favor of calling this a regression that should be fixed, and AFAICT
nothing happened after. So I thought I'd ping and mention one more data
point.
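
For anyone else who hits this before a fix lands, the explicit override I'm
referring to is roughly one of these (ssh.variant and GIT_SSH_VARIANT are the
documented knobs; "ssh" means "assume plain OpenSSH-style -4/-6/-p options"):

  # one-off, via the environment
  GIT_SSH_VARIANT=ssh git fetch

  # or persistently in the config
  git config --global ssh.variant ssh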

-Peff

[1] https://public-inbox.org/git/xmqqk1wyhcey@gitster.mtv.corp.google.com/


Re: [L10N] Kickoff for Git 2.18.0 l10n round 2

2018-06-07 Thread Jiang Xin
All these 144 new messages in this round of Git l10n are introduced by
commit f318d73915 (generate-cmds.sh: export all commands to
command-list.h). The updated script `generate-cmds.sh` will export
more commands and one-line introductions in `command-list.h` than
v2.18.0-rc0:

$ git checkout -q v2.18.0-rc0 && ./generate-cmdlist.sh command-list.txt | wc -l
  38

$ git checkout -q v2.18.0-rc1 && ./generate-cmdlist.sh command-list.txt | wc -l
  207

2018-06-08 9:21 GMT+08:00 Jiang Xin :
> Hi,
>
> Git 2.18.0-rc1 has been released, and introduced 36 more messages (144
> total) that need to be translated. Let's start the 2nd round of l10n for
> Git 2.18.0.
>
> The new "git.pot" is generated in commit v2.18.0-rc1:
>
> l10n: git.pot: v2.18.0 round 2 (144 new, 6 removed)
>
> Generate po/git.pot from v2.18.0-rc1 for git v2.18.0 l10n round 2.
>
> Signed-off-by: Jiang Xin 
>
> You can get it from the usual place:
>
> https://github.com/git-l10n/git-po/
>
> For how to update your XX.po and help translate Git, please see
> "Updating a XX.po file" and the other sections in the "po/README" file.
>
> --
> Jiang Xin


Re: [RFC PATCH v1] http: add http.keepRejectedCredentials config

2018-06-07 Thread Lars Schneider


> On 04 Jun 2018, at 11:55, Jeff King  wrote:
> 
> On Mon, Jun 04, 2018 at 12:18:59PM -0400, Martin-Louis Bright wrote:
> 
>> Why must the credentials be deleted after receiving the 401 (or
>> any) error? What's the rationale for this?
> 
> Because Git only tries a single credential per invocation. So if a
> helper provides one, it doesn't prompt. If you get a 401 and then the
> program aborts, invoking it again is just going to try the same
> credential over and over. Dropping the credential from the helper breaks
> out of that loop.
> 
> In fact, this patch probably should give the user some advice in that
> regard (either in the documentation, or as a warning when we skip the
> rejection). If you _do_ have a bogus credential and set the new option,
> you'd need to reject it manually (you can do it with "git credential
> reject", but it's probably easier to just unset the option temporarily
> and re-invoke the original command).

I like the advice idea very much!

How about this?

$ git fetch
hint: Git has stored invalid credentials.
hint: Reject them with 'git credential reject' or
hint: disable the Git config 'http.keepRejectedCredentials'.
remote: Invalid username or password.
fatal: Authentication failed for 'https://server.com/myrepo.git/'

I am not really sure about the grammar :-)

Thanks,
Lars


Re: GDPR compliance best practices?

2018-06-07 Thread Theodore Y. Ts'o
On Fri, Jun 08, 2018 at 01:21:29AM +0200, Peter Backes wrote:
> On Thu, Jun 07, 2018 at 03:38:49PM -0700, David Lang wrote:
> > > Again: The GDPR certainly allows you to keep a proof of copyright
> > > privately if you have it. However, it does not allow you to keep
> > > publishing it if someone exercises his right to be forgotten.
> > someone is granting the world the right to use the code and you are claiming
> > that the evidence that they have granted this right is illegal to have?
> 
> Hell no! Please read what I wrote:
> 
> - "allows you to keep a proof ... privately"
> - "However, it does not allow you to keep publishing it"

The problem is that you've left undefined who "you" is.  With an open
source project, anyone who has contributed to the project has a
copyright interest.  That hobbyist in Germany who submitted a patch?
They have a copyright interest.  That US company based in Redmond,
Washington?  They own a copyright interest.  Huawei in China?  They
have a copyright interest.

So there is no "privately".  And "you" numbers in the thousands and
thousands of copyright holders of portions of the open source code.

And of course, that's the other thing you seem to fundamentally not
understand about how git works.  Every developer in the world working
on that open source project has their own copy.  There is
fundamentally no way that you can expunge that information from every
single git repository in the world.  You can remove a git note from a
single repository.  But that doesn't affect my copy of the repository
on my laptop.  And if I push that repository to my server, the git note
will be out there for the whole world to see.

So someone could *try* sending a public request to the entire world,
saying, "I am a European and I demand that you disassociate commit
DEADBEF12345 from my name".  They could try serving legal papers on
everyone.  But at this point, it's going to trigger something called
the "Streisand Effect".  If you haven't heard of it, I suggest you
look it up:

http://mentalfloss.com/article/67299/how-barbra-streisand-inspired-streisand-effect

Regards,

- Ted


Proposal

2018-06-07 Thread Mr. Fawaz KhE. Al Saleh




--
Good day,

i know you do not know me personally but i have checked your profile  
and i see generosity in you, There's an urgent offer attach

to your name here in the office of Mr. Fawaz KhE. Al Saleh Member of
the Board of Directors, Kuveyt Türk Participation Bank  (Turkey) and
head of private banking and wealth management
Regards,
Mr. Fawaz KhE. Al Saleh



[L10N] Kickoff for Git 2.18.0 l10n round 2

2018-06-07 Thread Jiang Xin
Hi,

Git 2.18.0-rc1 has been released, and introduced 36 more messages (144
total) that need to be translated. Let's start the 2nd round of l10n for
Git 2.18.0.

The new "git.pot" is generated in commit v2.18.0-rc1:

l10n: git.pot: v2.18.0 round 2 (144 new, 6 removed)

Generate po/git.pot from v2.18.0-rc1 for git v2.18.0 l10n round 2.

Signed-off-by: Jiang Xin 

You can get it from the usual place:

https://github.com/git-l10n/git-po/

For how to update your XX.po and help translate Git, please see
"Updating a XX.po file" and the other sections in the "po/README" file.

--
Jiang Xin


Re: [PATCH 0/8] Reroll of sb/diff-color-move-more

2018-06-07 Thread Jacob Keller
On Thu, May 17, 2018 at 12:46 PM, Stefan Beller  wrote:
>>> * sb/diff-color-move-more (2018-04-25) 7 commits
>>...
>>>
>>>  Will merge to 'next'.
>>
>>I did not get around to fix it up, there are still review
>>comments outstanding. (The test is broken in the last commit.)
>
> This is a reroll of sb/diff-color-move-more, with the test fixed as well
> as another extra patch, that would have caught the bad test.
>
> The range diff is below.
>
> Thanks,
> Stefan
>
> Stefan Beller (8):
>   xdiff/xdiff.h: remove unused flags
>   xdiff/xdiffi.c: remove unneeded function declarations
>   diff.c: do not pass diff options as keydata to hashmap
>   diff.c: adjust hash function signature to match hashmap expectation
>   diff.c: add a blocks mode for moved code detection
>   diff.c: decouple white space treatment from move detection algorithm
>   diff.c: add --color-moved-ignore-space-delta option
>   diff: color-moved white space handling options imply color-moved
>

I've been using this locally, and it's really nice. One question I
had: are there plans to make the whitespace options configurable? I
really like the option for letting lines count as moved when the
whitespace changes uniformly (it helps make changes more obvious when
doing indentation changes such as wrapping code within a block).
However, it's rather a long option name to type out. I didn't see any
obvious config option to enable it by default though.
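
Until something like a config knob exists, the best workaround I have found is
to hide the long name behind an alias, roughly (option names as proposed in
this series):

  git config --global alias.dmoved \
      'diff --color-moved=blocks --color-moved-ignore-space-delta'
  git dmoved HEAD~1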

Thanks,
Jake


Re: GDPR compliance best practices?

2018-06-07 Thread David Lang

On Fri, 8 Jun 2018, Peter Backes wrote:

> On Thu, Jun 07, 2018 at 03:38:49PM -0700, David Lang wrote:
> > > Again: The GDPR certainly allows you to keep a proof of copyright
> > > privately if you have it. However, it does not allow you to keep
> > > publishing it if someone exercises his right to be forgotten.
> > someone is granting the world the right to use the code and you are claiming
> > that the evidence that they have granted this right is illegal to have?
>
> Hell no! Please read what I wrote:
>
> - "allows you to keep a proof ... privately"
> - "However, it does not allow you to keep publishing it"
>
> > And you are incorrect to say that the GDPR lets you keep records privately
> > and only applies to publishing them. The GDPR is specifically targeted at
> > companies like Facebook and Google that want to keep lots of data privately.
> > It does no good to ask Facebook to not publish your info, they don't want to
> > publish it in the first place, they want to keep it internally and use it.
>
> How can you misunderstand what I wrote so badly?
>
> Sure the GDPR does not let you keep records privately at will. You
> ultimately need to have overriding legitimate grounds for doing so.
>
> However, overriding legitimate grounds for keeping private records are
> rarely overriding legitimate grounds for publishing them.

the license is granted to the world, so the world has an interest in it.

Unless you are going to argue that the GDPR outlawed open source development.

> In case of git history metadata, for publishing, you may have consent
> or even legitimate interests, but not overriding legitimate grounds.
> For keeping a private copy of the metadata, you probably have
> overriding legitimate grounds, however.
>
> The GDPR is not an "all or nothing" thing.
>
> Facebook and Google certainly do not have overriding legitimate grounds
> for most of the data they keep privately.
>
> Is it so hard to understand?

you are the one arguing that the GDPR prohibits Git from storing and revealing
this license granting data, not me.


David Lang


Re: GDPR compliance best practices?

2018-06-07 Thread Peter Backes
On Thu, Jun 07, 2018 at 03:38:49PM -0700, David Lang wrote:
> > Again: The GDPR certainly allows you to keep a proof of copyright
> > privately if you have it. However, it does not allow you to keep
> > publishing it if someone exercises his right to be forgotten.
> someone is granting the world the right to use the code and you are claiming
> that the evidence that they have granted this right is illegal to have?

Hell no! Please read what I wrote:

- "allows you to keep a proof ... privately"
- "However, it does not allow you to keep publishing it"

> And you are incorrect to say that the GDPR lets you keep records privately
> and only applies to publishing them. The GDPR is specifically targeted at
> companies like Facebook and Google that want to keep lots of data privately.
> It does no good to ask Facebook to not publish your info, they don't want to
> publish it in the first place, they want to keep it internally and use it.

How can you misunderstand what I wrote so badly?

Sure the GDPR does not let you keep records privately at will. You 
ultimately need to have overriding legitimate grounds for doing so. 

However, overriding legitimate grounds for keeping private records are 
rarely overriding legitimate grounds for publishing them.

In case of git history metadata, for publishing, you may have consent 
or even legitimate interests, but not overriding legitimate grounds. 
For keeping a private copy of the metadata, you probably have 
overriding legitimate grounds, however.

The GDPR is not an "all or nothing" thing.

Facebook and Google certainly do not have overriding legitimate grounds 
for most of the data they keep privately.

Is it so hard to understand?

Best wishes
Peter
-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: GDPR compliance best practices?

2018-06-07 Thread David Lang

On Fri, 8 Jun 2018, Peter Backes wrote:

> On Thu, Jun 07, 2018 at 10:28:47PM +0100, Philip Oakley wrote:
> > Some of Peter's fine distinctions may be technically valid, but that does
> > not stop there being legal grounds. The proof of copyright is a legal
> > grounds.
>
> Again: The GDPR certainly allows you to keep a proof of copyright
> privately if you have it. However, it does not allow you to keep
> publishing it if someone exercises his right to be forgotten.

someone is granting the world the right to use the code and you are claiming
that the evidence that they have granted this right is illegal to have?

the GDPR recognizes that there are legal reasons why records need to be kept and
does not insist that they be deleted.

you can't sign a deal to buy something, then insist that the GDPR allows your
name to be removed from the contract.

And you are incorrect to say that the GDPR lets you keep records privately and
only applies to publishing them. The GDPR is specifically targeted at companies
like Facebook and Google that want to keep lots of data privately. It does no
good to ask Facebook to not publish your info, they don't want to publish it in
the first place, they want to keep it internally and use it.

David Lang


> There is simply no justification for publishing against the explicit
> will of the subject, except for the rare circumstances where there are
> overriding legitimate grounds for doing so. I hardly see those for the
> average author entry in your everyday git repo. Such a justification is
> extremely fragile.
>
> > Unfortunately once one gets into legal nitpicking the wording becomes
> > tortuous and helps no-one.
>
> That's not nitpicking. If what you say were true, the GDPR would be
> without any practical validity at all.
>
> > If one starts from an absolute "right to be forgotten" perspective one can
> > demand all evidence of wrong doing, or authority to do something, be
> > forgotten. The GDPR has the right to retain such evidence.
>
> Yes, but not to keep it published.
>
> > I'll try and comment where I see the distinctions to be.
>
> You're essentially repeating what you already said there.
>
> > Publishing (the meta data) is *distinct* from having it.
>
> Absolutely right. That is my point.
>
> > You either start off public and stay public, or you start off private and
> > stay there.
>
> Nope. The GDPR says you have to go from public to private if the
> subject wishes so and there are no overriding legitimate grounds.
>
> That is the entire purpose of the GDPR's right to be forgotten.
>
> Best wishes
> Peter




Re: GDPR compliance best practices?

2018-06-07 Thread Peter Backes
On Thu, Jun 07, 2018 at 10:28:47PM +0100, Philip Oakley wrote:
> Some of Peter's fine distinctions may be technically valid, but that does
> not stop there being legal grounds. The proof of copyright is a legal
> grounds.

Again: The GDPR certainly allows you to keep a proof of copyright 
privately if you have it. However, it does not allow you to keep 
publishing it if someone exercises his right to be forgotten.

There is simply no justification for publishing against the explicit 
will of the subject, except for the rare circumstances where there are 
overriding legitimate grounds for doing so. I hardly see those for the 
average author entry in your everyday git repo. Such a justification is 
extremely fragile.

> Unfortunately once one gets into legal nitpicking the wording becomes
> tortuous and helps no-one.

That's not nitpicking. If what you say were true, the GDPR would be 
without any practical validity at all.

> If one starts from an absolute "right to be forgotten" perspective one can
> demand all evidence of wrong doing, or authority to do something, be
> forgotten. The GDPR has the right to retain such evidence.

Yes, but not to keep it published.

> I'll try and comment where I see the distinctions to be.

You're essentially repeating what you already said there.

> Publishing (the meta data) is *distinct* from having it.

Absolutely right. That is my point.

> You either start off public and stay public, or you start off private and
> stay there.

Nope. The GDPR says you have to go from public to private if the 
subject wishes so and there are no overriding legitimate grounds.

That is the entire purpose of the GDPR's right to be forgotten.

Best wishes
Peter

-- 
Peter Backes, r...@helen.plasma.xg8.de


Re: GDPR compliance best practices?

2018-06-07 Thread Philip Oakley

Hi Peter, David,

I thought that the legal notice (aka 'disclaimer') was pretty reasonable.

Some of Peter's fine distinctions may be technically valid, but that does 
not stop there being legal grounds. The proof of copyright is a legal 
grounds.


Unfortunately once one gets into legal nitpicking the wording becomes 
tortuous and helps no-one.


If one starts from an absolute "right to be forgotten" perspective one can 
demand all evidence of wrong doing, or authority to do something, be
forgotten. The GDPR has the right to retain such evidence.


I'll try and comment where I see the distinctions to be.

From: "Peter Backes" 


Hi David,

thanks for your input on the issue.


LEGAL GDPR NOTICE:
According to the European data protection laws (GDPR), we would like to 
make you

aware that contributing to rsyslog via git will permanently store the
name and email address you provide as well as the actual commit and the
time and date you made it inside git's version history.


This is simply an information statement


  This is inevitable,

because it is a main feature of git.


The "inevitable" word creates a point of argument within the GDPR. Removing 
the word (and 'because/main') brings the sentance back to be an informative 
statement without a GDPR claim.


As we can see, rsyslog tries to solve the issue by the already
discussed legal "technology" of disclaimers (which is certainly not
accepted as state of the art technology by the GDPR). In essence, they
are giving excuses for why they are not honoring the right to be
forgotten.

Disclaimers do not work. They have no legal effect, they are placebos.

The GDPR does not accept such excuses. If it would, companies could
arbitrarily design their data storage such as to make it "the main
feature" to not honor the right to be forgotten and/or other GDPR
rights. It is obvious that this cannot work, as it would completely
undermine those rights.

The GDPR honors technology as a means to protect the individual's
rights, not as a means to subvert them.


If you are concerned about your
privacy, we strongly recommend to use

--author "anonymous "

together with your commit.


The [key] missing information here is whether rsyslog has a DCO (Developer 
Certificate of Origin) and what that contains.


The git.git DCO is here 
https://github.com/git/git/blob/master/Documentation/SubmittingPatches#L304-L349


This will also help discriminate between the "name" part and the  
identifier, as both could be separately anonymised (given the right DCO). 
Thus it may be that the name is recorded as "anonymous", but with a 
 that bridges the legal evidence/right to be forgotten 
bridge.


This can only be a solution if the project rejects any commits which
are not anonymous.


However, we have valid reasons why we cannot remove that information
later on. The reasons are:

* this would break git history and make future merges unworkable


This is not a valid excuse (see above).


Within the GDPR, that is correct. It (breaking history validation), of 
itself, should not be the reason.



 The technology has to be
designed or applied in such a way that the individual's rights are
honored, not the other way around.

In absence of other means, the project has to rewrite history if it
gets a valid request by someone exercising his right to be forgotten,
even if that causes a lot of hassle for everyone.

* the rsyslog project has a legitimate interest to keep a permanent record 
of the

  contributor identity, once given, for
  - copyright verification
  - being able to provide proof should a malicious commit be made


True, but that doesn't justify publishing that information and keeping
it published even when someone exercises his right to be forgotten.


Publishing (the meta data) is *distinct* from having it.

However, publishing the content and its legal copyright is also associated 
with identifying the copyright holder (who has released it). This can be the 
uid if they hide behind a legal entity. This creates the catch 22 scenario. 
You either start off public and stay public, or you start off private and 
stay there.


Whether the rsyslog folk want to accept copyrighted work without appropriate 
legal release (who guards the guards, what's their badge number?) is part of 
the same information requirement.


Malicious intent makes the submission (commit) part of the legal evidence one 
needs to retain, so it is supported by the GDPR.




In that case, "legitimate interest" is not enough. There need to be
"overriding legitimate grounds". I don't see them here.

Please also note that your commit is public and as such will potentially 
be
processed by many third-parties. Git's distributed nature makes it 
impossible
to track where exactly your commit, and thus your personal data, will be 
stored
and be processed. If you would not like to accept this risk, please do 
either

commit anonymously or refrain from contributing to the rsyslog project.


The onward publishing and release 

Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-07 Thread Johannes Sixt

On 07.06.2018 at 16:53, g...@jeffhostetler.com wrote:

From: Jeff Hostetler 

I've been working to add code to Git to optionally collect telemetry data.
The goal is to be able to collect performance data from Git commands and
allow it to be aggregated over a user community to find "slow commands".


Seriously? "add code to collect telemetry data" said by somebody whose 
email address ends with @microsoft.com is very irritating. I really 
don't want to have yet another switch that I must check after every 
update that it is still off.


-- Hannes


Re: git grep with leading inverted bracket expression

2018-06-07 Thread Ævar Arnfjörð Bjarmason


On Thu, Jun 07 2018, Matthew Wilcox wrote:

> On Thu, Jun 07, 2018 at 09:09:25PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> On Thu, Jun 07 2018, Matthew Wilcox wrote:
>> > If the first atom of a regex is a bracket expression with an inverted 
>> > range,
>> > git grep is very slow.
>>
>> I have some WIP patches to fix all of this, which I'll hopefully submit
>> before 2.19 is out the door.
>>
>> What you've discovered here is how shitty your libc regex engine is,
>> because unless you provide -P and compile with a reasonably up-to-date
>> libpcre (preferably v2) with JIT that's what you'll get.
>
> I'm using Debian's build, and it is linked against a recent libpcre2:
> $ ldd /usr/lib/git-core/git
>   libpcre2-8.so.0 => /usr/lib/x86_64-linux-gnu/libpcre2-8.so.0 
> (0x7f59ad5f2000)
> $ dpkg --status libpcre2-8-0
> Version: 10.31-3
>
> But I wasn't using -P.  If I do, then I see the performance numbers you do:
>
> $ time git grep -P '[^t]truct_size' >/dev/null
> real  0m0.354s
> user  0m0.340s
> sys   0m0.639s
> $ time git grep -P 'struct_size' >/dev/null
> real  0m0.336s
> user  0m0.552s
> sys   0m0.457s
> $ time git grep 'struct_size' >/dev/null
> real  0m0.335s
> user  0m0.535s
> sys   0m0.474s
>
>> So you need to just use an up-to-date libpcre2 & -P and performance
>> won't suck.

Yeah that's recent enough & will get you all the benefits.

> I don't tend to use terribly advanced regexps, so I'll just set
> grep.patternType to 'perl' and then it'll automatically be fast for me
> without your patches ;-)

Indeed, if you're happy with that that'll do it.

>> My WIP patches will make us use PCRE for all grep modes, using an API it
>> has to convert basic & extended regexp syntax to its own syntax, so
>> we'll be able to do that transparently.
>
> That's clearly the right answer.  Thanks!

Yeah, unfortunately git-grep's default is "basic" regexp which has a
really atrocious syntax that's different enough from extended & Perl's
that we probably couldn't just switch it over.

That won't be needed with my patches, but maybe I'll follow up with
something to s/basic/extended/g by default, because one side effect of
having the pattern converter is that we could have a warning whenever
the user has a pattern that would be different under extended/perl, so
we can see how common that is.


Re: git grep with leading inverted bracket expression

2018-06-07 Thread Matthew Wilcox
On Thu, Jun 07, 2018 at 09:09:25PM +0200, Ævar Arnfjörð Bjarmason wrote:
> On Thu, Jun 07 2018, Matthew Wilcox wrote:
> > If the first atom of a regex is a bracket expression with an inverted range,
> > git grep is very slow.
> 
> I have some WIP patches to fix all of this, which I'll hopefully submit
> before 2.19 is out the door.
> 
> What you've discovered here is how shitty your libc regex engine is,
> because unless you provide -P and compile with a reasonably up-to-date
> libpcre (preferably v2) with JIT that's what you'll get.

I'm using Debian's build, and it is linked against a recent libpcre2:
$ ldd /usr/lib/git-core/git
libpcre2-8.so.0 => /usr/lib/x86_64-linux-gnu/libpcre2-8.so.0 
(0x7f59ad5f2000)
$ dpkg --status libpcre2-8-0
Version: 10.31-3

But I wasn't using -P.  If I do, then I see the performance numbers you do:

$ time git grep -P '[^t]truct_size' >/dev/null
real0m0.354s
user0m0.340s
sys 0m0.639s
$ time git grep -P 'struct_size' >/dev/null
real0m0.336s
user0m0.552s
sys 0m0.457s
$ time git grep 'struct_size' >/dev/null
real0m0.335s
user0m0.535s
sys 0m0.474s

> So you need to just use an up-to-date libpcre2 & -P and performance
> won't suck.

I don't tend to use terribly advanced regexps, so I'll just set
grep.patternType to 'perl' and then it'll automatically be fast for me
without your patches ;-)
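
Concretely (picking --global as the scope), that's just:

  git config --global grep.patternType perl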

> My WIP patches will make us use PCRE for all grep modes, using an API it
> has to convert basic & extended regexp syntax to its own syntax, so
> we'll be able to do that transparently.

That's clearly the right answer.  Thanks!


Re: git grep with leading inverted bracket expression

2018-06-07 Thread Ævar Arnfjörð Bjarmason


On Thu, Jun 07 2018, Matthew Wilcox wrote:

> If the first atom of a regex is a bracket expression with an inverted range,
> git grep is very slow.
>
> $ time git grep 'struct_size' >/dev/null
>
> real  0m0.368s
> user  0m0.563s
> sys   0m0.453s
>
> $ time git grep '[^t]truct_size' >/dev/null
>
> real  0m31.529s
> user  1m54.909s
> sys   0m0.805s
>
> If the bracket expression is moved to even the second position in the string,
> it runs much faster:
>
> $ time git grep 's[^p]ruct_size' >/dev/null
>
> real  0m3.989s
> user  0m13.939s
> sys   0m0.403s
>
> It's pretty bad with even a '.' as the first character:
>
> $ time git grep '.truct_size' >/dev/null
>
> real  0m14.514s
> user  0m52.624s
> sys   0m0.598s
>
> $ git --version
> git version 2.17.1
>
> Setting LANG=C improves matters by a factor of 3-4 (depending if you
> count real or user time):
>
> $ time git grep '[^t]truct_size' >/dev/null
> real  0m10.035s
> user  0m28.795s
> sys   0m0.537s
>
> (this is using something pretty close to Linus' current HEAD of the
> linux repository, an i7-7500, 16GB memory).

I have some WIP patches to fix all of this, which I'll hopefully submit
before 2.19 is out the door.

What you've discovered here is how shitty your libc regex engine is,
because unless you provide -P and compile with a reasonably up-to-date
libpcre (preferably v2) with JIT that's what you'll get.
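
Roughly, that means building with the PCRE v2 knob in the Makefile (assuming
libpcre2 and its headers are installed), and then using -P or
grep.patternType=perl to route patterns through it:

  make USE_LIBPCRE2=YesPlease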

The reason stuff like 'struct_size' is so much faster is because there
we don't use any regex engine at all, but rather an optimized
fixed-string searcher.

With our own benchmarks modified per your E-Mail:

diff --git a/t/perf/p7820-grep-engines.sh b/t/perf/p7820-grep-engines.sh
index 8b09c5bf32..fe4c5681da 100755
--- a/t/perf/p7820-grep-engines.sh
+++ b/t/perf/p7820-grep-engines.sh
@@ -28,11 +28,10 @@ then
 fi

 for pattern in \
-   'how.to' \
-   '^how to' \
-   '[how] to' \
-   '\(e.t[^ ]*\|v.ry\) rare' \
-   'm\(ú\|u\)lt.b\(æ\|y\)te'
+   'struct size' \
+   '[^t]truct_size' \
+   's[^p]ruct_size' \
+   '.truct_size'
 do
for engine in basic extended perl
do

I get these results against linux.git:

$ GIT_PERF_LARGE_REPO=~/g/linux ./run p7820-grep-engines.sh
[...]
Test  this tree
--
7820.1: basic grep 'struct size'  0.23(0.52+0.76)
7820.2: extended grep 'struct size'   0.22(0.60+0.61)
7820.3: perl grep 'struct size'   0.22(0.56+0.65)
7820.5: basic grep '[^t]truct_size'   4.29(29.43+0.51)
7820.6: extended grep '[^t]truct_size'4.27(29.59+0.36)
7820.7: perl grep '[^t]truct_size'0.21(0.40+0.69)
7820.9: basic grep 's[^p]ruct_size'   0.49(2.22+0.49)
7820.10: extended grep 's[^p]ruct_size'   0.43(2.24+0.48)
7820.11: perl grep 's[^p]ruct_size'   0.21(0.38+0.71)
7820.13: basic grep '.truct_size' 4.42(31.29+0.44)
7820.14: extended grep '.truct_size'  4.50(31.18+0.46)
7820.15: perl grep '.truct_size'  0.21(0.35+0.75)

So you need to just use an up-to-date libpcre2 & -P and performance
won't suck.

My WIP patches will make us use PCRE for all grep modes, using an API it
has to convert basic & extended regexp syntax to its own syntax, so
we'll be able to do that transparently.


[PATCH 2/2] pack-bitmap: add free function

2018-06-07 Thread Jonathan Tan
Add a function to free struct bitmap_index instances, and use it where
needed (except when rebuild_existing_bitmaps() is used, since it creates
references to the bitmaps within the struct bitmap_index passed to it).

Note that the hashes field in struct bitmap_index is not freed because
it points to another field within the same struct. The documentation for
that field has been updated to clarify that.

Signed-off-by: Jonathan Tan 
---
 builtin/pack-objects.c |  1 +
 builtin/rev-list.c |  2 ++
 pack-bitmap-write.c|  4 
 pack-bitmap.c  | 35 +--
 pack-bitmap.h  |  1 +
 5 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index d064f944b..896e41300 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2945,6 +2945,7 @@ static int get_object_list_from_bitmap(struct rev_info 
*revs)
}
 
traverse_bitmap_commit_list(bitmap_git, &add_object_entry_from_bitmap);
+   free_bitmap_index(bitmap_git);
return 0;
 }
 
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index cce42ae1d..62776721f 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -521,6 +521,7 @@ int cmd_rev_list(int argc, const char **argv, const char 
*prefix)
if (max_count >= 0 && max_count < commit_count)
commit_count = max_count;
printf("%d\n", commit_count);
+   free_bitmap_index(bitmap_git);
return 0;
}
} else if (revs.max_count < 0 &&
@@ -528,6 +529,7 @@ int cmd_rev_list(int argc, const char **argv, const char 
*prefix)
struct bitmap_index *bitmap_git;
if ((bitmap_git = prepare_bitmap_walk())) {
traverse_bitmap_commit_list(bitmap_git, &show_object_fast);
+   free_bitmap_index(bitmap_git);
return 0;
}
}
diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index 03e122563..7896fedd3 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -367,6 +367,10 @@ void bitmap_writer_reuse_bitmaps(struct packing_data 
*to_pack)
writer.reused = kh_init_sha1();
rebuild_existing_bitmaps(bitmap_git, to_pack, writer.reused,
 writer.show_progress);
+   /*
+* NEEDSWORK: rebuild_existing_bitmaps() makes writer.reused reference
+* some bitmaps in bitmap_git, so we can't free the latter.
+*/
 }
 
 static struct ewah_bitmap *find_reused_bitmap(const unsigned char *sha1)
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 7795444b0..c06b19f49 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -66,7 +66,7 @@ struct bitmap_index {
/* Number of bitmapped commits */
uint32_t entry_count;
 
-   /* Name-hash cache (or NULL if not present). */
+   /* If not NULL, this is a name-hash cache pointing into map. */
uint32_t *hashes;
 
/*
@@ -350,6 +350,7 @@ struct bitmap_index *prepare_bitmap_git(void)
if (!open_pack_bitmap(bitmap_git) && !load_pack_bitmap(bitmap_git))
return bitmap_git;
 
+   free_bitmap_index(bitmap_git);
return NULL;
 }
 
@@ -690,7 +691,7 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info 
*revs)
/* try to open a bitmapped pack, but don't parse it yet
 * because we may not need to use it */
if (open_pack_bitmap(bitmap_git) < 0)
-   return NULL;
+   goto cleanup;
 
for (i = 0; i < revs->pending.nr; ++i) {
struct object *object = revs->pending.objects[i].item;
@@ -723,11 +724,11 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info 
*revs)
 * optimize here
 */
if (haves && !in_bitmapped_pack(bitmap_git, haves))
-   return NULL;
+   goto cleanup;
 
/* if we don't want anything, we're done here */
if (!wants)
-   return NULL;
+   goto cleanup;
 
/*
 * now we're going to use bitmaps, so load the actual bitmap entries
@@ -735,7 +736,7 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info 
*revs)
 * becomes invalidated and we must perform the revwalk through bitmaps
 */
if (!bitmap_git->loaded && load_pack_bitmap(bitmap_git) < 0)
-   return NULL;
+   goto cleanup;
 
object_array_clear(&revs->pending);
 
@@ -761,6 +762,10 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info 
*revs)
 
bitmap_free(haves_bitmap);
return bitmap_git;
+
+cleanup:
+   free_bitmap_index(bitmap_git);
+   return NULL;
 }
 
 int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git,
@@ -1001,7 +1006,7 @@ void 

[PATCH 0/2] Object store refactoring: make bitmap_git not global

2018-06-07 Thread Jonathan Tan
This is a continuation of the object store refactoring effort.

We cannot truly free an object store without ensuring that any generated
bitmaps are first freed, so here are patches to drastically reduce the
lifetime of any bitmaps generated. As a bonus, the API is also improved,
and global state reduced.

Jonathan Tan (2):
  pack-bitmap: remove bitmap_git global variable
  pack-bitmap: add free function

 builtin/pack-objects.c |   7 +-
 builtin/rev-list.c |  13 +-
 pack-bitmap-write.c|  10 +-
 pack-bitmap.c  | 344 -
 pack-bitmap.h  |  20 ++-
 5 files changed, 234 insertions(+), 160 deletions(-)

-- 
2.17.0.768.g1526ddbba1.dirty



[PATCH 1/2] pack-bitmap: remove bitmap_git global variable

2018-06-07 Thread Jonathan Tan
Remove the bitmap_git global variable. Instead, generate on demand an
instance of struct bitmap_index for code that needs to access it.

This allows us significant control over the lifetime of instances of
struct bitmap_index. In particular, packs can now be closed without
worrying if an unnecessarily long-lived "pack" field in struct
bitmap_index still points to it.

The bitmap API is also clearer in that we need to first obtain a struct
bitmap_index, then we use it.

Helped-by: Stefan Beller 
Signed-off-by: Jonathan Tan 
---
 builtin/pack-objects.c |   6 +-
 builtin/rev-list.c |  11 +-
 pack-bitmap-write.c|   6 +-
 pack-bitmap.c  | 317 ++---
 pack-bitmap.h  |  19 ++-
 5 files changed, 201 insertions(+), 158 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 71056d829..d064f944b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2929,11 +2929,13 @@ static int pack_options_allow_reuse(void)
 
 static int get_object_list_from_bitmap(struct rev_info *revs)
 {
-   if (prepare_bitmap_walk(revs) < 0)
+   struct bitmap_index *bitmap_git;
+   if (!(bitmap_git = prepare_bitmap_walk(revs)))
return -1;
 
if (pack_options_allow_reuse() &&
!reuse_partial_packfile_from_bitmap(
+   bitmap_git,
&reuse_packfile,
&reuse_packfile_objects,
&reuse_packfile_offset)) {
@@ -2942,7 +2944,7 @@ static int get_object_list_from_bitmap(struct rev_info 
*revs)
display_progress(progress_state, nr_result);
}
 
-   traverse_bitmap_commit_list(&add_object_entry_from_bitmap);
+   traverse_bitmap_commit_list(bitmap_git, &add_object_entry_from_bitmap);
return 0;
 }
 
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index fadd3ec14..cce42ae1d 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -16,6 +16,7 @@
 #include "reflog-walk.h"
 #include "oidset.h"
 #include "packfile.h"
+#include "object-store.h"
 
 static const char rev_list_usage[] =
 "git rev-list [OPTION] ... [ -- paths... ]\n"
@@ -514,8 +515,9 @@ int cmd_rev_list(int argc, const char **argv, const char 
*prefix)
if (revs.count && !revs.left_right && !revs.cherry_mark) {
uint32_t commit_count;
int max_count = revs.max_count;
-   if (!prepare_bitmap_walk()) {
-   count_bitmap_commit_list(&commit_count, NULL, NULL, NULL);
+   struct bitmap_index *bitmap_git;
+   if ((bitmap_git = prepare_bitmap_walk())) {
+   count_bitmap_commit_list(bitmap_git, &commit_count, NULL, NULL, NULL);
if (max_count >= 0 && max_count < commit_count)
commit_count = max_count;
printf("%d\n", commit_count);
@@ -523,8 +525,9 @@ int cmd_rev_list(int argc, const char **argv, const char 
*prefix)
}
} else if (revs.max_count < 0 &&
   revs.tag_objects && revs.tree_objects && 
revs.blob_objects) {
-   if (!prepare_bitmap_walk()) {
-   traverse_bitmap_commit_list(&show_object_fast);
+   struct bitmap_index *bitmap_git;
+   if ((bitmap_git = prepare_bitmap_walk())) {
+   traverse_bitmap_commit_list(bitmap_git, &show_object_fast);
return 0;
}
}
diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index 7b2dc3e7d..03e122563 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -360,11 +360,13 @@ static int date_compare(const void *_a, const void *_b)
 
 void bitmap_writer_reuse_bitmaps(struct packing_data *to_pack)
 {
-   if (prepare_bitmap_git() < 0)
+   struct bitmap_index *bitmap_git;
+   if (!(bitmap_git = prepare_bitmap_git()))
return;
 
writer.reused = kh_init_sha1();
-   rebuild_existing_bitmaps(to_pack, writer.reused, writer.show_progress);
+   rebuild_existing_bitmaps(bitmap_git, to_pack, writer.reused,
+writer.show_progress);
 }
 
 static struct ewah_bitmap *find_reused_bitmap(const unsigned char *sha1)
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 06771113f..7795444b0 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -25,14 +25,14 @@ struct stored_bitmap {
 };
 
 /*
- * The currently active bitmap index. By design, repositories only have
+ * The active bitmap index for a repository. By design, repositories only have
  * a single bitmap index available (the index for the biggest packfile in
  * the repository), since bitmap indexes need full closure.
  *
  * If there is more than one bitmap index available (e.g. because of 
alternates),
  * the 

Re: [PATCH v7 2/2] json-writer: t0019: add Python unit test

2018-06-07 Thread Ramsay Jones



On 07/06/18 03:23, Jeff King wrote:
> On Thu, Jun 07, 2018 at 01:16:14AM +0100, Ramsay Jones wrote:
> 
>>> Probably. We may want to go the same route as we did for perl in 
>>> a0e0ec9f7d (t: provide a perl() function which uses $PERL_PATH,
>>> 2013-10-28) so that test writers don't have to remember this.
>>>
>>> That said, I wonder if it would be hard to simply do the python bits
>>> here in perl. This is the first use of python in our test scripts (and
>>
>> Hmm, not quite the _first_ use:
>>
>> $ git grep PYTHON_PATH -- t
>> t/lib-git-p4.sh:(cd / && "$PYTHON_PATH" -c 'import time; 
>> print(int(time.time()))')
>> t/lib-git-p4.sh:"$PYTHON_PATH" "$TRASH_DIRECTORY/marshal-dump.py"
>> t/t9020-remote-svn.sh:export PATH PYTHON_PATH GIT_BUILD_DIR
>> t/t9020-remote-svn.sh:exec "$PYTHON_PATH" 
>> "$GIT_BUILD_DIR/contrib/svn-fe/svnrdump_sim.py" "$@"
>> t/t9802-git-p4-filetype.sh: "$PYTHON_PATH" 
>> "$TRASH_DIRECTORY/k_smush.py" <"$cli/k-text-k" >cli-k-text-k-smush &&
>> t/t9802-git-p4-filetype.sh: "$PYTHON_PATH" 
>> "$TRASH_DIRECTORY/ko_smush.py" <"$cli/k-text-ko" >cli-k-text-ko-smush &&
>> t/t9802-git-p4-filetype.sh: "$PYTHON_PATH" 
>> "$TRASH_DIRECTORY/gendouble.py" >%double.png &&
>> t/t9810-git-p4-rcs.sh:  "$PYTHON_PATH" "$TRASH_DIRECTORY/scrub_k.py" 
>> <"$git/$file" >"$scrub" &&
>> t/t9810-git-p4-rcs.sh:  "$PYTHON_PATH" "$TRASH_DIRECTORY/scrub_ko.py" 
>> <"$git/$file" >"$scrub" &&
>> $ 
> 
> OK, the first for a feature that is not already written in python
> (leading into my second claim that python is used only for fringe
> commands ;) ).
> 
> Though maybe I am wrong that the remote-svn stuff requires python. I
> thought it did, but poking around, it looks like it's all C, and just
> the "svnrdump_sim" helper is python.

Heh, I was a bit tired last night, so although I knew that
I required python to be installed to run the test-suite, I
could not remember why. As soon as I went to bed, ... :)

I recently installed Ubuntu 18.04, so that I could get a heads
up on any possible problems later this month when I install
Linux Mint 19. In order to get the test-suite to run, I had
to set 'PYTHON_PATH = /usr/bin/python3' in my config.mak file.
(yes, I could have set NO_PYTHON, but you get the idea).
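
Concretely, that was just one line in config.mak (the NO_PYTHON alternative
skips the python-dependent tests instead):

  echo 'PYTHON_PATH = /usr/bin/python3' >>config.mak
  # or:
  # echo 'NO_PYTHON = YesPlease' >>config.mak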

Ubuntu 18.04 no longer installs python2 by default (it has
ported to python3), but presumably you can still install it
as '/usr/bin/python' (I didn't check).

In any event, it was t9020-remote-svn.sh that was failing
which, despite the name, does not depend on 'svn'. As you
already noted, it does run svnrdump_sim.py and git-remote-testsvn.

> At any rate, I think the point still stands that perl is our main
> scripting language. I'd rather keep us to that unless there's a good
> reason not to.

Agreed. And I see it has come to pass ... :-D

ATB,
Ramsay Jones




Re: [PATCH 06/23] midx: struct midxed_git and 'read' subcommand

2018-06-07 Thread Duy Nguyen
On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee  wrote:
> diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
> index dcaeb1a91b..919283fdd8 100644
> --- a/Documentation/git-midx.txt
> +++ b/Documentation/git-midx.txt
> @@ -23,6 +23,11 @@ OPTIONS
> /packs/multi-pack-index for the current MIDX file, and
> /packs for the pack-files to index.
>
> +read::
> +   When given as the verb, read the current MIDX file and output
> +   basic information about its contents. Used for debugging
> +   purposes only.

On second thought: if you just need a temporary debugging interface,
adding a program in t/helper may be a better option. In the end we
might still need 'read' to dump a file out, but we should have some
stable output format (and json might be a good choice).

That's it, I'm done for today. I will continue on the rest some day, hopefully.
-- 
Duy


Re: [PATCH 09/23] midx: write pack names in chunk

2018-06-07 Thread Duy Nguyen
On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee  wrote:
> @@ -74,6 +80,31 @@ struct midxed_git *load_midxed_git(const char *object_dir)
> m->num_chunks = *(m->data + 6);
> m->num_packs = get_be32(m->data + 8);
>
> +   for (i = 0; i < m->num_chunks; i++) {
> +   uint32_t chunk_id = get_be32(m->data + 12 + 
> MIDX_CHUNKLOOKUP_WIDTH * i);
> +   uint64_t chunk_offset = get_be64(m->data + 16 + 
> MIDX_CHUNKLOOKUP_WIDTH * i);

Would be good to reduce magic numbers like 12 and 16, I think you have
some header length constants for those already.

> +   switch (chunk_id) {
> +   case MIDX_CHUNKID_PACKNAMES:
> +   m->chunk_pack_names = m->data + chunk_offset;
> +   break;
> +
> +   case 0:
> +   die("terminating MIDX chunk id appears 
> earlier than expected");

_()

> +   break;
> +
> +   default:
> +   /*
> +* Do nothing on unrecognized chunks, 
> allowing future
> +* extensions to add optional chunks.
> +*/

I wrote about the chunk term reminding me of PNG format, then deleted
it. But it may help to do something similar to PNG here: the first letter can
let us know if the chunk is optional and can be safely ignored. E.g.
uppercase first letter cannot be ignored, lowercase go wild.

> +   break;
> +   }
> +   }
> +
> +   if (!m->chunk_pack_names)
> +   die("MIDX missing required pack-name chunk");

_()

> +
> return m;
>
>  cleanup_fail:
> @@ -99,18 +130,88 @@ static size_t write_midx_header(struct hashfile *f,
> return MIDX_HEADER_SIZE;
>  }
>
> +struct pack_pair {
> +   uint32_t pack_int_id;

can this be just pack_id?

> +   char *pack_name;
> +};
> +
> +static int pack_pair_compare(const void *_a, const void *_b)
> +{
> +   struct pack_pair *a = (struct pack_pair *)_a;
> +   struct pack_pair *b = (struct pack_pair *)_b;
> +   return strcmp(a->pack_name, b->pack_name);
> +}
> +
> +static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, 
> uint32_t *perm)
> +{
> +   uint32_t i;
> +   struct pack_pair *pairs;
> +
> +   ALLOC_ARRAY(pairs, nr_packs);
> +
> +   for (i = 0; i < nr_packs; i++) {
> +   pairs[i].pack_int_id = i;
> +   pairs[i].pack_name = pack_names[i];
> +   }
> +
> +   QSORT(pairs, nr_packs, pack_pair_compare);
> +
> +   for (i = 0; i < nr_packs; i++) {
> +   pack_names[i] = pairs[i].pack_name;
> +   perm[pairs[i].pack_int_id] = i;
> +   }

pairs[] is leaked?

> +}
> +
> +static size_t write_midx_pack_names(struct hashfile *f,
> +   char **pack_names,
> +   uint32_t num_packs)
> +{
> +   uint32_t i;
> +   unsigned char padding[MIDX_CHUNK_ALIGNMENT];
> +   size_t written = 0;
> +
> +   for (i = 0; i < num_packs; i++) {
> +   size_t writelen = strlen(pack_names[i]) + 1;
> +
> +   if (i && strcmp(pack_names[i], pack_names[i - 1]) <= 0)
> +   BUG("incorrect pack-file order: %s before %s",
> +   pack_names[i - 1],
> +   pack_names[i]);
> +
> +   hashwrite(f, pack_names[i], writelen);
> +   written += writelen;

side note. This pattern happens a lot. It may be a good idea to make
hashwrite() return writelen so we can just write

written += hashwrite(f, ..., writelen);

> +   }
> +
> +   /* add padding to be aligned */
> +   i = MIDX_CHUNK_ALIGNMENT - (written % MIDX_CHUNK_ALIGNMENT);
> +   if (i < MIDX_CHUNK_ALIGNMENT) {
> +   bzero(padding, sizeof(padding));
> +   hashwrite(f, padding, i);
> +   written += i;
> +   }
> +
> +   return written;
> +}
> +
>  int write_midx_file(const char *object_dir)
>  {
> -   unsigned char num_chunks = 0;
> +   unsigned char cur_chunk, num_chunks = 0;
> char *midx_name;
> struct hashfile *f;
> struct lock_file lk;
> struct packed_git **packs = NULL;
> +   char **pack_names = NULL;
> +   uint32_t *pack_perm;
> uint32_t i, nr_packs = 0, alloc_packs = 0;
> +   uint32_t alloc_pack_names = 0;
> DIR *dir;
> struct dirent *de;
> struct strbuf pack_dir = STRBUF_INIT;
> size_t pack_dir_len;
> +   uint64_t pack_name_concat_len = 0;
> +   uint64_t written = 0;
> +   uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
> +   uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];

This long list of local vars may be a good indicator that this
function needs to be split up into smaller ones.

>
> midx_name = 

Re: [PATCH 08/23] midx: read packfiles from pack directory

2018-06-07 Thread Duy Nguyen
On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee  wrote:
> @@ -114,14 +119,56 @@ int write_midx_file(const char *object_dir)
>   midx_name);
> }
>
> +   strbuf_addf(&pack_dir, "%s/pack", object_dir);
> +   dir = opendir(pack_dir.buf);
> +
> +   if (!dir) {
> +   error_errno("unable to open pack directory: %s",
> +   pack_dir.buf);

_()

> +   strbuf_release(&pack_dir);
> +   return 1;
> +   }
> +
> +   strbuf_addch(&pack_dir, '/');
> +   pack_dir_len = pack_dir.len;
> +   ALLOC_ARRAY(packs, alloc_packs);
> +   while ((de = readdir(dir)) != NULL) {
> +   if (is_dot_or_dotdot(de->d_name))
> +   continue;
> +
> +   if (ends_with(de->d_name, ".idx")) {
> +   ALLOC_GROW(packs, nr_packs + 1, alloc_packs);
> +
> +   strbuf_setlen(&pack_dir, pack_dir_len);
> +   strbuf_addstr(&pack_dir, de->d_name);
> +
> +   packs[nr_packs] = add_packed_git(pack_dir.buf,
> +pack_dir.len,
> +0);
> +   if (!packs[nr_packs])
> +   warning("failed to add packfile '%s'",
> +   pack_dir.buf);
> +   else
> +   nr_packs++;
> +   }
> +   }
> +   closedir(dir);
> +   strbuf_release(&pack_dir);

Can we refactor and share this scanning-for-packs code with
packfile.c? I'm pretty sure it does something similar in there.

> -   write_midx_header(f, num_chunks, num_packs);
> +   write_midx_header(f, num_chunks, nr_packs);

Hmm.. could have stuck to one name from the beginning...
-- 
Duy


Re: [PATCH 06/23] midx: struct midxed_git and 'read' subcommand

2018-06-07 Thread Duy Nguyen
On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee  wrote:
> As we build the multi-pack-index feature by adding chunks at a time,
> we want to test that the data is being written correctly.
>
> Create struct midxed_git to store an in-memory representation of a

A word play on 'packed_git'? Amusing. Some more descriptive name would
be better though. midxed looks almost like random letters thrown
together.

> multi-pack-index and a memory-map of the binary file. Initialize this
> struct in load_midxed_git(object_dir).

> +static int read_midx_file(const char *object_dir)
> +{
> +   struct midxed_git *m = load_midxed_git(object_dir);
> +
> +   if (!m)
> +   return 0;

This looks like an error case; please don't just return zero, which is
typically used to say "success". I don't know if this command stays
"for debugging purposes" until the end. Of course in that case it does
not really matter.

> +struct midxed_git *load_midxed_git(const char *object_dir)
> +{
> +   struct midxed_git *m;
> +   int fd;
> +   struct stat st;
> +   size_t midx_size;
> +   void *midx_map;
> +   const char *midx_name = get_midx_filename(object_dir);

mem leak? This function returns allocated memory if I remember correctly.

> +
> +   fd = git_open(midx_name);
> +   if (fd < 0)
> +   return NULL;

do an error_errno() so we know what went wrong at least.

> +   if (fstat(fd, &st)) {
> +   close(fd);
> +   return NULL;

same here, we should know why fstat() fails.

> +   }
> +   midx_size = xsize_t(st.st_size);
> +
> +   if (midx_size < MIDX_MIN_SIZE) {
> +   close(fd);
> +   die("multi-pack-index file %s is too small", midx_name);

_()

The use of die() should be discouraged though. Many people still try
(or wish) to libify code, and a new die() does not help. I think error()
here would be enough, then you can return NULL. Or you can go fancier
and store the error string in a strbuf like refs code.

> +   }
> +
> +   midx_map = xmmap(NULL, midx_size, PROT_READ, MAP_PRIVATE, fd, 0);
> +
> +   m = xcalloc(1, sizeof(*m) + strlen(object_dir) + 1);
> +   strcpy(m->object_dir, object_dir);
> +   m->data = midx_map;
> +
> +   m->signature = get_be32(m->data);
> +   if (m->signature != MIDX_SIGNATURE) {
> +   error("multi-pack-index signature %X does not match signature 
> %X",
> + m->signature, MIDX_SIGNATURE);

_(). Maybe 0x%08x instead of %x

> +   goto cleanup_fail;
> +   }
> +
> +   m->version = *(m->data + 4);

m->data[4] instead? shorter and easier to understand.

Same comment on "*(m->data + x)" and error() without _() for the rest.

> +   if (m->version != MIDX_VERSION) {
> +   error("multi-pack-index version %d not recognized",
> + m->version);

_()

> +   goto cleanup_fail;
> +   }
> +
> +   m->hash_version = *(m->data + 5);

m->data[5]

> +cleanup_fail:
> +   FREE_AND_NULL(m);
> +   munmap(midx_map, midx_size);
> +   close(fd);
> +   exit(1);

It's bad enough that you die() but exit() in this code seems too much.
Please just return NULL and let the caller handle the error.

> diff --git a/midx.h b/midx.h
> index 3a63673952..a1d18ed991 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -1,4 +1,13 @@
> +#ifndef MIDX_H
> +#define MIDX_H
> +
> +#include "git-compat-util.h"
>  #include "cache.h"
> +#include "object-store.h"

I don't really think you need object-store here (git-compat-util.h
too). "struct midxed_git;" would be enough for the load_midxed_git
declaration below.

>  #include "packfile.h"
>
> +struct midxed_git *load_midxed_git(const char *object_dir);
> +
>  int write_midx_file(const char *object_dir);
> +
> +#endif
> diff --git a/object-store.h b/object-store.h
> index d683112fd7..77cb82621a 100644
> --- a/object-store.h
> +++ b/object-store.h
> @@ -84,6 +84,25 @@ struct packed_git {
> char pack_name[FLEX_ARRAY]; /* more */
>  };
>
> +struct midxed_git {
> +   struct midxed_git *next;

Do we really have multiple midx files?

> +
> +   int fd;
> +
> +   const unsigned char *data;
> +   size_t data_len;
> +
> +   uint32_t signature;
> +   unsigned char version;
> +   unsigned char hash_version;
> +   unsigned char hash_len;
> +   unsigned char num_chunks;
> +   uint32_t num_packs;
> +   uint32_t num_objects;
> +
> +   char object_dir[FLEX_ARRAY];

Why do you need to keep object_dir when it could be easily retrieved
when the repo is available?

> +};
> +
>  struct raw_object_store {
> /*
>  * Path to the repository's object store.
-- 
Duy


Re: [PATCH 05/23] midx: write header information to lockfile

2018-06-07 Thread Duy Nguyen
On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee  wrote:
> +static char *get_midx_filename(const char *object_dir)
> +{
> +   struct strbuf midx_name = STRBUF_INIT;
> +   strbuf_addstr(&midx_name, object_dir);
> +   strbuf_addstr(&midx_name, "/pack/multi-pack-index");
> +   return strbuf_detach(&midx_name, NULL);
> +}

I think this whole function can be written as
xstrfmt("%s/pack/multi-pack-index", object_dir);

> +
> +static size_t write_midx_header(struct hashfile *f,
> +   unsigned char num_chunks,
> +   uint32_t num_packs)
> +{
> +   char byte_values[4];

unsigned char just to be on the safe side? Plain 'char' has
implementation-defined signedness (it is unsigned on ARM, for example).
-- 
Duy


Re: [PATCH 04/23] midx: add 'write' subcommand and basic wiring

2018-06-07 Thread Duy Nguyen
On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee  wrote:
> diff --git a/builtin/midx.c b/builtin/midx.c
> index 59ea92178f..dc0a5acd3f 100644
> --- a/builtin/midx.c
> +++ b/builtin/midx.c
> @@ -3,9 +3,10 @@
>  #include "config.h"
>  #include "git-compat-util.h"
>  #include "parse-options.h"
> +#include "midx.h"
>
>  static char const * const builtin_midx_usage[] ={
> -   N_("git midx [--object-dir ]"),
> +   N_("git midx [--object-dir ] [write]"),
> NULL
>  };
>
> @@ -34,5 +35,11 @@ int cmd_midx(int argc, const char **argv, const char 
> *prefix)
> if (!opts.object_dir)
> opts.object_dir = get_object_directory();
>
> +   if (argc == 0)
> +   return 0;

Isn't it better to die here when no verb is given? I don't see any
good use case for running a no-op "git midx" without verbs. It's more
likely a mistake (e.g. "git midx $foo" where foo happens to be empty)
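Something like (sketch):

  if (argc == 0)
          usage_with_options(builtin_midx_usage, builtin_midx_options);

would catch that case and print the usage string for free.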

> +
> +   if (!strcmp(argv[0], "write"))
> +   return write_midx_file(opts.object_dir);
> +
> return 0;
>  }
> diff --git a/midx.c b/midx.c
> new file mode 100644
> index 00..616af66b13
> --- /dev/null
> +++ b/midx.c
> @@ -0,0 +1,9 @@
> +#include "git-compat-util.h"
> +#include "cache.h"

Only one of the two is needed

> +#include "dir.h"

Not needed yet. It's better to include it in the patch that actually needs it.

> +#include "midx.h"
> +
> +int write_midx_file(const char *object_dir)
> +{
> +   return 0;
> +}
> diff --git a/midx.h b/midx.h
> new file mode 100644
> index 00..3a63673952
> --- /dev/null
> +++ b/midx.h
> @@ -0,0 +1,4 @@
> +#include "cache.h"
> +#include "packfile.h"

These includes are not needed, at least not now. And please protect
the header file with #ifndef MIDX_H .. #endif.
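I.e. something like:

  #ifndef MIDX_H
  #define MIDX_H

  int write_midx_file(const char *object_dir);

  #endif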

> +
> +int write_midx_file(const char *object_dir);
> diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
> new file mode 100755
> index 00..a590137af7
> --- /dev/null
> +++ b/t/t5319-midx.sh
> @@ -0,0 +1,10 @@
> +#!/bin/sh
> +
> +test_description='multi-pack-indexes'
> +. ./test-lib.sh
> +
> +test_expect_success 'write midx with no pakcs' '

no packs


> +   git midx --object-dir=. write
> +'
> +
> +test_done
> --
> 2.18.0.rc1
>



-- 
Duy


Re: [PATCH v8 1/2] json_writer: new routines to create data in JSON format

2018-06-07 Thread Eric Sunshine
On Thu, Jun 7, 2018 at 10:12 AM,   wrote:
> Add a series of jw_ routines and "struct json_writer" structure to compose
> JSON data.  The resulting string data can then be output by commands wanting
> to support a JSON output format.
> [...]
> Signed-off-by: Jeff Hostetler 
> ---
> diff --git a/t/t0019-json-writer.sh b/t/t0019-json-writer.sh
> @@ -0,0 +1,236 @@
> +test_expect_success 'simple object' '
> +   cat >expect <<-\EOF &&
> +   {"a":"abc","b":42,"c":3.14,"d":true,"e":false,"f":null}
> +   EOF
> +   test-json-writer >actual \
> +   @object \
> +   @object-string a abc \
> +   @object-int b 42 \
> +   @object-double c 2 3.140 \
> +   @object-true d \
> +   @object-false e \
> +   @object-null f \
> +   @end &&
> +   test_cmp expect actual
> +'

To make it easier on people writing these tests, it might be nice for
this to be less noisy by getting rid of "@" and "\". To get rid of
"\", the test program could grab its script commands from stdin (one
instruction per line) rather than from argv[]. For instance:

test-json-writer >actual <<-\EOF &&
object
object-string a abc
...
end
EOF
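
On the helper side that could be a small loop over stdin, roughly (untested
sketch; process_line() stands in for whatever dispatches to the existing
per-token handlers):

  struct strbuf line = STRBUF_INIT;

  while (strbuf_getline(&line, stdin) != EOF)
          process_line(line.buf);
  strbuf_release(&line);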

Not a big deal, and certainly not worth a re-roll.


Re: [PATCH 03/23] midx: add midx builtin

2018-06-07 Thread Duy Nguyen
On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee  wrote:
> diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
> new file mode 100644
> index 00..2bd886f1a2
> --- /dev/null
> +++ b/Documentation/git-midx.txt
> @@ -0,0 +1,29 @@
> +git-midx(1)
> +
> +
> +NAME
> +
> +git-midx - Write and verify multi-pack-indexes (MIDX files).

No full stop. This headline is collected automatically together with the
others, and its having a full stop while the rest do not looks strange.

> diff --git a/builtin/midx.c b/builtin/midx.c
> new file mode 100644
> index 00..59ea92178f
> --- /dev/null
> +++ b/builtin/midx.c
> @@ -0,0 +1,38 @@
> +#include "builtin.h"
> +#include "cache.h"
> +#include "config.h"
> +#include "git-compat-util.h"

You only need either cache.h or git-compat-util.h. If cache.h is here,
git-compat-util can be removed.

> +#include "parse-options.h"
> +
> +static char const * const builtin_midx_usage[] ={
> +   N_("git midx [--object-dir ]"),
> +   NULL
> +};
> +
> +static struct opts_midx {
> +   const char *object_dir;
> +} opts;
> +
> +int cmd_midx(int argc, const char **argv, const char *prefix)
> +{
> +   static struct option builtin_midx_options[] = {
> +   { OPTION_STRING, 0, "object-dir", &opts.object_dir,

For paths (including dir), OPTION_FILENAME may be a better option to
handle correctly when the command is run in a subdir. See df217ed643
(parse-opts: add OPT_FILENAME and transition builtins - 2009-05-23)
for more info.
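I.e. (sketch):

  OPT_FILENAME(0, "object-dir", &opts.object_dir,
               N_("object directory containing packfile and pack-index pairs")),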

> + N_("dir"),
> + N_("The object directory containing set of packfile and 
> pack-index pairs.") },

Other help strings do not have full stop either (I only checked a
couple commands though)

Also, doesn't OPT_STRING() work here too (if you avoid OPTION_FILENAME
for some reason)?

> +   OPT_END(),
> +   };
> +
> +   if (argc == 2 && !strcmp(argv[1], "-h"))
> +   usage_with_options(builtin_midx_usage, builtin_midx_options);
> +
> +   git_config(git_default_config, NULL);
> +
> +   argc = parse_options(argc, argv, prefix,
> +builtin_midx_options,
> +builtin_midx_usage, 0);
> +
> +   if (!opts.object_dir)
> +   opts.object_dir = get_object_directory();
> +
> +   return 0;
> +}

> diff --git a/git.c b/git.c
> index c2f48d53dd..400fadd677 100644
> --- a/git.c
> +++ b/git.c
> @@ -503,6 +503,7 @@ static struct cmd_struct commands[] = {
> { "merge-recursive-theirs", cmd_merge_recursive, RUN_SETUP | 
> NEED_WORK_TREE | NO_PARSEOPT },
> { "merge-subtree", cmd_merge_recursive, RUN_SETUP | NEED_WORK_TREE | 
> NO_PARSEOPT },
> { "merge-tree", cmd_merge_tree, RUN_SETUP | NO_PARSEOPT },
> +   { "midx", cmd_midx, RUN_SETUP },

If it's a plumbing and can take an --object-dir, then I don't think
you should require it to run in a repo (with RUN_SETUP).
RUN_SETUP_GENTLY may be better. You could even leave it empty here and
call setup_git_directory() only when --object-dir is not set.
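The latter would look roughly like this (untested):

  { "midx", cmd_midx, 0 },

and then in cmd_midx(), after parse_options():

  if (!opts.object_dir) {
          setup_git_directory();
          opts.object_dir = get_object_directory();
  }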

> { "mktag", cmd_mktag, RUN_SETUP | NO_PARSEOPT },
> { "mktree", cmd_mktree, RUN_SETUP },
> { "mv", cmd_mv, RUN_SETUP | NEED_WORK_TREE },
-- 
Duy


[RFC PATCH 2/3] rebase: Implement --merge via git-rebase--interactive

2018-06-07 Thread Elijah Newren
Interactive rebases are implemented in terms of cherry-pick rather than
the merge-recursive builtin, but cherry-pick also calls into the recursive
merge machinery by default and can accept special merge strategies and/or
special strategy options.  As such, there really is not any need for
having both git-rebase--merge and git-rebase--interactive anymore.

Delete git-rebase--merge.sh and have the --merge option call in to
git-rebase--interactive.  Add one new variable ($actually_interactive) to
help keep cases separate, and adjust the detection of the special
circumstances under which a rebase boils down to a simple fast forward so
that it keeps working with the new structure.

Note that this change fixes a few known test failures (see t3421).

testcase modification notes:
  t3406: --interactive and --merge had slightly different progress output
 while running; adjust a test to match
  t3420: tests of precise output while running, but rebase--am,
 rebase--merge, and rebase--interactive all were built on very
 different commands (am, merge-recursive, cherry-pick), so the
 tests expected different output for each type.  Now we expect
 --merge and --interactive to have the same output.
  t3421: --interactive fixes some bugs in --merge!  Wahoo!
  t3425: topology linearization was inconsistent across flavors of rebase,
 as already noted in a TODO comment in the testcase.  This was not
 considered a bug before, so getting a different linearization due
 to switching out backends should not be considered a bug now.
  t5407: different rebase types varied slightly in how many times checkout
 or commit or equivalents were called based on a quick comparison
 of this tests and previous ones which covered different rebase
 flavors.  I think this is just attributable to this difference.

Signed-off-by: Elijah Newren 
---
 .gitignore|   1 -
 Documentation/git-rebase.txt  |  16 +--
 Makefile  |   1 -
 git-rebase--interactive.sh|   5 +-
 git-rebase--merge.sh  | 164 --
 git-rebase.sh |  49 -
 t/t3406-rebase-message.sh |   7 +-
 t/t3420-rebase-autostash.sh   |  78 ++
 t/t3421-rebase-topology-linear.sh |  10 +-
 t/t3425-rebase-topology-merges.sh |   6 +-
 t/t5407-post-rewrite-hook.sh  |   1 +
 11 files changed, 55 insertions(+), 283 deletions(-)
 delete mode 100644 git-rebase--merge.sh

diff --git a/.gitignore b/.gitignore
index 388cc4beee..747be69d10 100644
--- a/.gitignore
+++ b/.gitignore
@@ -118,7 +118,6 @@
 /git-rebase--am
 /git-rebase--helper
 /git-rebase--interactive
-/git-rebase--merge
 /git-receive-pack
 /git-reflog
 /git-remote
diff --git a/Documentation/git-rebase.txt b/Documentation/git-rebase.txt
index 451252c173..28d1658d7a 100644
--- a/Documentation/git-rebase.txt
+++ b/Documentation/git-rebase.txt
@@ -275,6 +275,10 @@ branch on top of the  branch.  Because of this, 
when a merge
 conflict happens, the side reported as 'ours' is the so-far rebased
 series, starting with , and 'theirs' is the working branch.  In
 other words, the sides are swapped.
++
+This uses the `--interactive` machinery internally, and as such,
+anything that is incompatible with --interactive is incompatible
+with this option.
 
 -s ::
 --strategy=::
@@ -328,8 +332,8 @@ which makes little sense.
and after each change.  When fewer lines of surrounding
context exist they all must match.  By default no context is
ever ignored.
-   Incompatible with the --merge and --interactive options, or
-   anything that implies those options or their machinery.
+   Incompatible with the --interactive option, or anything that
+   uses the `--interactive` machinery.
 
 -f::
 --force-rebase::
@@ -361,15 +365,15 @@ default is `--no-fork-point`, otherwise the default is 
`--fork-point`.
 --whitespace=::
These flag are passed to the 'git apply' program
(see linkgit:git-apply[1]) that applies the patch.
-   Incompatible with the --merge and --interactive options, or
-   anything that implies those options or their machinery.
+   Incompatible with the --interactive option, or anything that
+   uses the `--interactive` machinery.
 
 --committer-date-is-author-date::
 --ignore-date::
These flags are passed to 'git am' to easily change the dates
of the rebased commits (see linkgit:git-am[1]).
-   Incompatible with the --merge and --interactive options, or
-   anything that implies those options or their machinery.
+   Incompatible with the --interactive option, or anything that
+   uses the `--interactive` machinery.
 
 --signoff::
Add a Signed-off-by: trailer to all the rebased commits. Note
diff --git a/Makefile b/Makefile
index 1d27f36365..8d24aaf638 100644
--- a/Makefile
+++ b/Makefile
@@ -620,7 +620,6 @@ 

[RFC PATCH 3/3] git-rebase.sh: make git-rebase--interactive the default

2018-06-07 Thread Elijah Newren
am-based rebases suffer from a reduced ability to detect directory
renames upstream, which is a fundamental consequence of the fact that it throws away
information about the history: in particular, it dispenses with the
original commits involved by turning them into patches, and without the
knowledge of the original commits we cannot determine a proper merge base.

The am-based rebase will proceed by first trying git-apply, and, only if
it fails, will try to reconstruct a provisional merge base using
build_fake_ancestor().  This fake ancestor will ONLY contain the files
modified in the patch; without the full list of files in the real merge
base, renames of any missing files cannot be detected.  Directory rename
detection works by looking at individual file renames and deducing when a
full directory has been renamed.

Trying to modify build_fake_ancestor() to instead create a merge base that
includes common file information by looking for a commit that contained
all the same blobs as the pre-image of the patch would require a very
expensive search through history, may find the wrong commit, and outside
of rebase may not find any commit that matches.

But we had all the relevant information to start.  So, instead of
attempting to fix am which just doesn't have the relevant information,
instead note its strength and weaknesses, and change the default rebase
machinery to interactive since it does not suffer from the same problems.

Documentation updates:

  Several documentation updates were made as part of previous patches to
  note which sets of options were incompatible.  However, adding a new
  --am option allows us to make it clear which options imply this
  machinery and simplify the discussion of incompatible sets of options
  significantly.

testcase modification notes:
  t3400: git-rebase--interactive.sh explicitly captures output and
 adds die "could not detach HEAD"; adjust to match
  t3401: fixes directory rename detection problems for rebase by default,
 though --am still fails on it
  t3404: testcase said it was for a non-interactive rebase, so add --am
  t3407: instead of having one test per rebase type, also add an extra one
 for the default rebase type.  That'll duplicate --merge for now,
 but will make switching default machinery again in the future
 clearer, if we ever choose to do so.  Also, previously there was
 a conspicuously missing test for all --quit combinations.
  t3420: command output varies by type of rebase and this test script
 has lots of helper functions providing the appropriate
 expectation.  Make sure we call the relevant one.
  t3425: Similar to t3407, add default rebase type to list of types to
 test.  See also comments about t3425 in the switchover of
 --merge to the interactive machinery.
  t5407: This suite has tests for different rebase types, so specify some
 are --am ones.
  t5520: interactive rebases capture and reprint informational message
 about conflicts to stdout rather than stderr.  Also, patches for
 interactive rebases are stored differently under .git than for
 am rebases.
  t9903: Add another testcase for am rebases.  Prompt for default rebases
 is now REBASE-m.

Signed-off-by: Elijah Newren 
---
 Documentation/git-rebase.txt  | 26 --
 git-rebase.sh | 30 +-
 t/t3400-rebase.sh |  7 ---
 t/t3401-rebase-and-am-rename.sh   | 18 +-
 t/t3404-rebase-interactive.sh |  2 +-
 t/t3407-rebase-abort.sh   | 28 +++-
 t/t3420-rebase-autostash.sh   | 15 ++-
 t/t3425-rebase-topology-merges.sh |  9 ++---
 t/t5407-post-rewrite-hook.sh  |  4 ++--
 t/t5520-pull.sh   |  9 ++---
 t/t9903-bash-prompt.sh| 12 +++-
 11 files changed, 117 insertions(+), 43 deletions(-)

diff --git a/Documentation/git-rebase.txt b/Documentation/git-rebase.txt
index 28d1658d7a..6cfb7479fd 100644
--- a/Documentation/git-rebase.txt
+++ b/Documentation/git-rebase.txt
@@ -240,13 +240,16 @@ leave out at most one of A and B, in which case it 
defaults to HEAD.
original branch. The index and working tree are also left
unchanged as a result.
 
+--am::
+   Use git-am internally to rebase.  This may run faster, but have
+   problems with rename detection on the upstream side.
+   Incompatible with the --interactive option, or anything that
+   uses the `--interactive` machinery.
+
 --keep-empty::
Keep the commits that do not change anything from its
parents in the result.
-+
-This uses the `--interactive` machinery internally, and as such,
-anything that is incompatible with --interactive is incompatible
-with this option.
+   This uses the `--interactive` machinery internally.
 
 --allow-empty-message::
By default, rebasing commits with 

[RFC PATCH 0/3] Send more rebases through interactive machinery

2018-06-07 Thread Elijah Newren
This series:

  * deletes git-rebase--merge, making git-rebase--interactive handle
those cases instead
  * adds an --am option to git rebase, and changes the rebase default
from using git-rebase--am to git-rebase--interactive, fixing
directory rename detection for default rebases.

However:

  * this series has several minor conflicts with ag/rebase-p in pu.
But it's just an RFC for now; I can re-roll on top of that after
getting some feedback.

  * this series depends on the other preparatory fixups under
  
https://public-inbox.org/git/CABPp-BGxaroePB6aKWAkZeADLB7VE3y1CPy2RyNwpn=+c01...@mail.gmail.com/
To get this series with the preparatory fixups:

Fetch-It-Via: git fetch https://github.com/newren/git rebase-new-default

Elijah Newren (3):
  git-rebase, sequencer: add a --quiet option for the interactive
machinery
  rebase: Implement --merge via git-rebase--interactive
  git-rebase.sh: make git-rebase--interactive the default

 .gitignore|   1 -
 Documentation/git-rebase.txt  |  24 +++--
 Makefile  |   1 -
 git-rebase--interactive.sh|  14 ++-
 git-rebase--merge.sh  | 164 --
 git-rebase.sh |  75 --
 sequencer.c   |  19 ++--
 sequencer.h   |   1 +

Yes, there are a significant number of testcase changes below.  See
the relevant commit messages where I explain why each was changed.

 t/t3400-rebase.sh |   7 +-
 t/t3401-rebase-and-am-rename.sh   |  18 +++-
 t/t3404-rebase-interactive.sh |   2 +-
 t/t3406-rebase-message.sh |   7 +-
 t/t3407-rebase-abort.sh   |  28 -
 t/t3420-rebase-autostash.sh   |  89 +++-
 t/t3421-rebase-topology-linear.sh |  10 +-
 t/t3425-rebase-topology-merges.sh |  15 +--
 t/t5407-post-rewrite-hook.sh  |   5 +-
 t/t5520-pull.sh   |   9 +-
 t/t9903-bash-prompt.sh|  12 ++-
 19 files changed, 180 insertions(+), 321 deletions(-)
 delete mode 100644 git-rebase--merge.sh

-- 
2.18.0.rc1.12.g2996b9442d


[RFC PATCH 1/3] git-rebase, sequencer: add a --quiet option for the interactive machinery

2018-06-07 Thread Elijah Newren
While 'quiet' and 'interactive' may sound like antonyms, the interactive
machinery actually has logic that implements several
interactive_rebase=implied cases (--exec, --keep-empty, --rebase-merges)
which won't pop up an editor.  Further, we want to make the interactive
machinery also take over for git-rebase--merge and become the default
merge strategy, so it makes sense for these other cases to have a quiet
option.

git-rebase--interactive was already somewhat quieter than
git-rebase--merge and git-rebase--am, possibly because cherry-pick has
just traditionally been quieter.  As such, we only drop a few
informational messages -- "Rebasing (n/m)" and "Successfully rebased..."

Signed-off-by: Elijah Newren 
---
 git-rebase--interactive.sh |  9 +++--
 git-rebase.sh  |  2 +-
 sequencer.c| 19 +--
 sequencer.h|  1 +
 4 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/git-rebase--interactive.sh b/git-rebase--interactive.sh
index 06a7b79307..1f2401f702 100644
--- a/git-rebase--interactive.sh
+++ b/git-rebase--interactive.sh
@@ -139,8 +139,12 @@ mark_action_done () {
if test "$last_count" != "$new_count"
then
last_count=$new_count
-   eval_gettext "Rebasing (\$new_count/\$total)"; printf "\r"
-   test -z "$verbose" || echo
+   if test -z "$GIT_QUIET"
+   then
+   eval_gettext "Rebasing (\$new_count/\$total)";
+   printf "\r"
+   test -z "$verbose" || echo
+   fi
fi
 }
 
@@ -713,6 +717,7 @@ Commit or stash your changes, and then run
"$hook" rebase < "$rewritten_list"
true # we don't care if this hook failed
fi &&
+   test -z "$GIT_QUIET" &&
warn "$(eval_gettext "Successfully rebased and updated 
\$head_name.")"
 
return 1 # not failure; just to break the do_rest loop
diff --git a/git-rebase.sh b/git-rebase.sh
index 7d1612b31b..b639c0d4fe 100755
--- a/git-rebase.sh
+++ b/git-rebase.sh
@@ -136,7 +136,7 @@ write_basic_state () {
echo "$head_name" > "$state_dir"/head-name &&
echo "$onto" > "$state_dir"/onto &&
echo "$orig_head" > "$state_dir"/orig-head &&
-   echo "$GIT_QUIET" > "$state_dir"/quiet &&
+   test t = "$GIT_QUIET" && : > "$state_dir"/quiet
test t = "$verbose" && : > "$state_dir"/verbose
test -n "$strategy" && echo "$strategy" > "$state_dir"/strategy
test -n "$strategy_opts" && echo "$strategy_opts" > \
diff --git a/sequencer.c b/sequencer.c
index a2843e3906..d418d40bee 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -143,6 +143,7 @@ static GIT_PATH_FUNC(rebase_path_refs_to_delete, 
"rebase-merge/refs-to-delete")
 static GIT_PATH_FUNC(rebase_path_gpg_sign_opt, "rebase-merge/gpg_sign_opt")
 static GIT_PATH_FUNC(rebase_path_orig_head, "rebase-merge/orig-head")
 static GIT_PATH_FUNC(rebase_path_verbose, "rebase-merge/verbose")
+static GIT_PATH_FUNC(rebase_path_quiet, "rebase-merge/quiet")
 static GIT_PATH_FUNC(rebase_path_signoff, "rebase-merge/signoff")
 static GIT_PATH_FUNC(rebase_path_head_name, "rebase-merge/head-name")
 static GIT_PATH_FUNC(rebase_path_onto, "rebase-merge/onto")
@@ -2251,6 +2252,9 @@ static int read_populate_opts(struct replay_opts *opts)
if (file_exists(rebase_path_verbose()))
opts->verbose = 1;
 
+   if (file_exists(rebase_path_quiet()))
+   opts->quiet = 1;
+
if (file_exists(rebase_path_signoff())) {
opts->allow_ff = 0;
opts->signoff = 1;
@@ -3171,10 +3175,11 @@ static int pick_commits(struct todo_list *todo_list, 
struct replay_opts *opts)
fprintf(f, "%d\n", todo_list->done_nr);
fclose(f);
}
-   fprintf(stderr, "Rebasing (%d/%d)%s",
-   todo_list->done_nr,
-   todo_list->total_nr,
-   opts->verbose ? "\n" : "\r");
+   if (!opts->quiet)
+   fprintf(stderr, "Rebasing (%d/%d)%s",
+   todo_list->done_nr,
+   todo_list->total_nr,
+   opts->verbose ? "\n" : "\r");
}
unlink(rebase_path_message());
unlink(rebase_path_author_script());
@@ -3385,8 +3390,10 @@ static int pick_commits(struct todo_list *todo_list, 
struct replay_opts *opts)
}
apply_autostash(opts);
 
-   fprintf(stderr, "Successfully rebased and updated %s.\n",
-   

Re: [PATCH v8 2/2] json-writer: t0019: add perl unit test

2018-06-07 Thread Eric Sunshine
On Thu, Jun 7, 2018 at 10:12 AM,   wrote:
> Test json-writer output using perl script.
>
> Signed-off-by: Jeff Hostetler 
> ---
> diff --git a/t/t0019/parse_json.perl b/t/t0019/parse_json.perl
> @@ -0,0 +1,52 @@
> +#!/usr/bin/perl
> +use strict;
> +use warnings;
> +use JSON;

This new script is going to have to be protected by a prerequisite
since the JSON module is not part of the standard Perl installation,
thus will not necessarily be installed everywhere (it isn't on any of
my machines, for instance). Something like:

test_lazy_prereq PERLJSON '
perl -MJSON -e "exit 0"
'

which would be used like this:

test_expect_success PERLJSON 'parse JSON using Perl' '
...
'


Re: [PATCH 00/23] Multi-pack-index (MIDX)

2018-06-07 Thread Derrick Stolee

On 6/7/2018 10:45 AM, Ævar Arnfjörð Bjarmason wrote:

On Thu, Jun 07 2018, Derrick Stolee wrote:


To test the performance in this situation, I created a
script that organizes the Linux repository in a similar
fashion. I split the commit history into 50 parts by
creating branches on every 10,000 commits of the first-
parent history. Then, `git rev-list --objects A ^B`
provides the list of objects reachable from A but not B,
so I could send that to `git pack-objects` to create
these "time-based" packfiles. With these 50 packfiles
(deleting the old one from my fresh clone, and deleting
all tags as they were no longer on-disk) I could then
test 'git rev-list --objects HEAD^{tree}' and see:

 Before: 0.17s
 After:  0.13s
 % Diff: -23.5%

By adding logic to count hits and misses to bsearch_pack,
I was able to see that the command above calls that
method 266,930 times with a hit rate of 33%. The MIDX
has the same number of calls with a 100% hit rate.

Do you have the script you used for this? It would be very interesting
as something we could stick in t/perf/ to test this use-case in the
future.

How does this & the numbers below compare to just a naïve
--max-pack-size= on linux.git?

Is it possible for you to tar this test repo up and share it as a
one-off? I've been polishing the core.validateAbbrev series I have, and
it would be interesting to compare some of the (abbrev) numbers.


Here is what I used. You will want to adjust your constants for whatever 
repo you are using. This is for the Linux kernel which has a 
first-parent history of ~50,000 commits. It also leaves a bunch of extra 
files around, so it is nowhere near ready to be incorporated into the code.


#!/bin/bash

for i in `seq 1 50`
do
    ORDER=$((51 - $i))
    NUM_BACK=$((1000 * ($i - 1)))
    echo creating batch/$ORDER
    git branch -f batch/$ORDER HEAD~$NUM_BACK
    echo batch/$ORDER
    git rev-parse batch/$ORDER
done

lastbranch=""
for i in `seq 1 50`
do
    branch=batch/$i
    if [ -z "$lastbranch" ]
    then
    echo "$branch"
    git rev-list --objects $branch | sed 's/ .*//' 
>objects-$i.txt

    else
    echo "$lastbranch"
    echo "$branch"
    git rev-list --objects $branch ^$lastbranch | sed 's/ 
.*//' >objects-$i.txt

    fi

    git pack-objects --no-reuse-delta .git/objects/pack/branch-split2 <objects-$i.txt
    lastbranch=$branch
done


for tag in `git tag --list`
do
    git tag -d $tag
done

rm -rf .git/objects/pack/pack-*
git midx write



git grep with leading inverted bracket expression

2018-06-07 Thread Matthew Wilcox


If the first atom of a regex is a bracket expression with an inverted range,
git grep is very slow.

$ time git grep 'struct_size' >/dev/null

real0m0.368s
user0m0.563s
sys 0m0.453s

$ time git grep '[^t]truct_size' >/dev/null

real0m31.529s
user1m54.909s
sys 0m0.805s

If the bracket expression is moved to even the second position in the string,
it runs much faster:

$ time git grep 's[^p]ruct_size' >/dev/null

real0m3.989s
user0m13.939s
sys 0m0.403s

It's pretty bad with even a '.' as the first character:

$ time git grep '.truct_size' >/dev/null

real0m14.514s
user0m52.624s
sys 0m0.598s

$ git --version
git version 2.17.1

Setting LANG=C improves matters by a factor of 3-4 (depending if you
count real or user time):

$ time git grep '[^t]truct_size' >/dev/null
real0m10.035s
user0m28.795s
sys 0m0.537s

(this is using something pretty close to Linus' current HEAD of the
linux repository, an i7-7500, 16GB memory).


Re: [PATCH v4 04/23] unpack-tress: convert clear_ce_flags* to avoid the_index

2018-06-07 Thread Duy Nguyen
On Thu, Jun 7, 2018 at 9:41 AM, Elijah Newren  wrote:
> Subject line: unpack-trees rather than unpack-tress.
>
>
>
> On Wed, Jun 6, 2018 at 9:49 AM, Nguyễn Thái Ngọc Duy  
> wrote:
>> Prior to fba92be8f7, this code implicitly (and incorrectly) assumes
>> the_index when running the exclude machinery. fba92be8f7 helps show
>> this problem clearer because unpack-trees operation is supposed to
>> work on whatever index the caller specifies... not specifically
>> the_index.
>>
>> Update the code to use "istate" argument that's originally from
>> mark_new_skip_worktree(). From the call sites, both in unpack_trees(),
>> you can see that this function works on two separate indexes:
>> o->src_index and o->result. The second mark_new_skip_worktree() so far
>> has incorecctly applied exclude rules on o->src_index instead of
>> o->result. It's unclear what is the consequences of this, but it's
>> definitely wrong.
>>
>> [1] fba92be8f7 (dir: convert is_excluded_from_list to take an index -
>> 2017-05-05)
>>
>> Signed-off-by: Nguyễn Thái Ngọc Duy 
>
> A somewhat curious finding: when I was rebuilding and re-testing all
> 23 patches, I got a failure on this patch in test 31 of
> t7063-status-untracked-cache.sh. I did not get any test failures with
> any of the other patches.  However, after re-running that test or the
> whole suite half a dozen times with just up to this patch applied, I
> was not able to trigger the failure again.  Is there a rare race in
> that testcase?

Untracked cache tests are very time-sensitive. I'll try to run and
re-run them a couple times to understand more. Thanks for pointing it
out.
-- 
Duy


Re: [RFC PATCH 0/5] format-patch: automate cover letter range-diff

2018-06-07 Thread Duy Nguyen
On Thu, Jun 7, 2018 at 10:34 AM, Eric Sunshine  wrote:
> On Wed, Jun 6, 2018 at 3:16 PM, Duy Nguyen  wrote:
>> On Wed, May 30, 2018 at 10:03 AM, Eric Sunshine  
>> wrote:
>>> Dscho recently implemented a 'tbdiff' replacement as a Git builtin named
>>> git-branch-diff[1] which computes differences between two versions of a
>>> patch series. Such a diff can be a useful aid for reviewers when
>>> inserted into a cover letter. However, doing so requires manual
>>> generation (invoking git-branch-diff) and copy/pasting the result into
>>> the cover letter.
>>
>> Another option which I wanted to go is delegate part of cover letter
>> generation to a hook (or just a config key that contains a shell
>> command). This way it's easier to customize cover letters. We could
>> still have a good fallback that does shortlog, diffstat and tbdiff.
>
> It is common on this mailing list to turn down requests for new hooks
> when the requested functionality could just as easily be implemented
> via a wrapper script. So, my knee-jerk reaction is that a hook to
> customize the cover letter may be overkill when the same functionality
> could likely be implemented relatively easily by a shell script which
> invokes git-format-patch and customizes the cover letter
> after-the-fact. Same argument regarding a config key holding a shell
> command. But, perhaps there are cases which don't occur to me which
> could be helped by a config variable or such.

I think format-patch --cover-letter nowadays does more stuff that's
not so easy to simply rewrite in a shell script. My original
problem with format-patch is that it hard codes shortlog settings and you
can't list patches with patch numbers (e.g. "[1/2] foo bar"). The
simplest way is to let format-patch do its stuff as usual and
"outsource" some of the cover letter's body generation to a script.

But it's ok. I could try to code the patch numbering thing in
format-patch and maybe submit a patch or two for that later.
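
(FWIW, the wrapper-script route Eric mentions would be something along these
lines -- only a sketch, and I'm guessing at branch-diff's exact syntax:

  git format-patch --cover-letter -o outgoing/ master..topic &&
  git branch-diff master topic-v1 topic >range-diff.txt

and then paste range-diff.txt into outgoing/0000-cover-letter.patch before
sending.)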

> Of course, by the same reasoning, the --range-diff functionality
> implemented by this patch series, which is just a convenience, could
> be handled by a wrapper script, thus is not strictly needed. On the
> other hand, given that interdiffs and range-diffs are so regularly
> used in re-rolls on this list (and perhaps other mailing list-based
> projects) may be argument enough in favor of having such an option
> built into git-format-patch.
-- 
Duy


[RFC PATCH v1] telemetry design overview (part 1)

2018-06-07 Thread git
From: Jeff Hostetler 

I've been working to add code to Git to optionally collect telemetry data.
The goal is to be able to collect performance data from Git commands and
allow it to be aggregated over a user community to find "slow commands".

I'm going to break this up into several parts rather than sending one large
patch series.  I think it is easier to review in pieces and in stages.

Part 1 contains the overall design documentation.
Part 2 will contain the basic telemetry event mechanism and the cmd_start
and cmd_exit events.

I'll post part 2 shortly -- as soon as I can convert my tests from python
to perl.  :-)

Jeff Hostetler (1):
  telemetry: design documenation

 Documentation/technical/telemetry.txt | 475 ++
 1 file changed, 475 insertions(+)
 create mode 100644 Documentation/technical/telemetry.txt

-- 
2.9.3



[RFC PATCH v1] telemetry: design documenation

2018-06-07 Thread git
From: Jeff Hostetler 

Create design documentation to describe the telemetry feature.

Signed-off-by: Jeff Hostetler 
---
 Documentation/technical/telemetry.txt | 475 ++
 1 file changed, 475 insertions(+)
 create mode 100644 Documentation/technical/telemetry.txt

diff --git a/Documentation/technical/telemetry.txt 
b/Documentation/technical/telemetry.txt
new file mode 100644
index 000..0a708ad
--- /dev/null
+++ b/Documentation/technical/telemetry.txt
@@ -0,0 +1,475 @@
+Telemetry Design Notes
+==
+
+The telemetry feature allows Git to generate structured telemetry data
+for executed commands.  Data includes command line arguments, execution
+times, error codes and messages, and information about child processes.
+
+Structured data is produced in a JSON-like format.  (See the UTF-8 related
+"limitations" described in json-writer.h)
+
+Telemetry data can be written to a local file or sent to a dynamically
+loaded shared library via a plugin API.
+
+The telemetry feature is similar to the existing trace API (defined in
+Documentation/technical/api-trace.txt).  Telemetry events are generated
+thoughout the life of a Git command just like trace messages.  But where
+as trace messages are essentially developer debug messages, telemetry
+events are intended for logging and automated analysis.
+
+The goal of the telemetry feature is to be able to gather usage data across
+a group of production users to identify real-world performance problems in
+production.  Additionally, it might help identify common user errors and
+guide future user training.
+
+By default, telemetry is disabled.  Telemetry is controlled using config
+settings (see "telemetry.*" in Documentation/config.txt).
+
+
+Telemetry Events
+
+
+Telemetry data is generated as a series of events.  Each event is written
+as a self-describing JSON object.
+
+Events: cmd_start and cmd_exit
+--
+
+The `cmd_start` event is emitted at the very beginning of the git.exe process
+in cmd_main() and the `cmd_exit` event is emitted at the end of the process in
+the atexit cleanup routine.
+
+For example, running "git version" produces:
+
+{
+  "event_name": "cmd_start",
+  "argv": [
+"C:\\work\\gfw\\git.exe",
+"version"
+  ],
+  "clock": 1525978509976086000,
+  "pid": 25460,
+  "git_version": "2.17.0.windows.1",
+  "telemetry_version": "1",
+  "session_id": "1525978509976086000-25460"
+}
+{
+  "event_name": "cmd_exit",
+  "argv": [
+"C:\\work\\gfw\\git.exe",
+"version"
+  ],
+  "clock": 1525978509980903391,
+  "pid": 25460,
+  "git_version": "2.17.0.windows.1",
+  "telemetry_version": "1",
+  "session_id": "1525978509976086000-25460",
+  "is_interactive": false,
+  "exit_code": 0,
+  "elapsed_time_core": 0.004814,
+  "elapsed_time_total": 0.004817,
+  "builtin": {
+"name": "version"
+  }
+}
+
+Fields common to all events:
+ * `event_name` is the name of the event.
+ * `argv` is the array of command line arguments.
+ * `clock` is the time of the event in nanoseconds since the epoch.
+ * `pid` is the process id.
+ * `git_version` is the git version string.
+ * `telemetry_version` is the version of the telemetry format.
+ * `session_id` is described in a later section.
+
+Additional fields in cmd_exit:
+ * `is_interactive` is true if git.exe spawned an interactive child process,
+  such as a pager, editor, prompt, or gui tool.
+ * `exit_code` is the value passed to exit() from main().
+ * `error_message` (not shown) is the array of error messages.
+ * `elapsed_time_core` measures the time in seconds until exit() was called.
+ * `elapsed_time_total` measures the time until the atexit() routine starts
+  (which will include time spent in other atexit() routines cleaning up
+  child processes, etc.).
+ * `alias` (not shown) the updated argv after alias expansion.
+ * `builtin.name` is the canonical command name (from the cmd_struct[]
+  table) of a builtin command.
+ * `builtin.mode` (not shown) is shown for some commands that have different
+  major modes and performance times.  For example, checkout can switch
+  branches or repair a single file.
+ * `child_summary` (not shown) is described in a later section.
+ * `timers` (not shown) is described in a later section.
+ * `aux` (not shown) is described in a later section.
+
+
+Events: child_start and child_exit
+--
+
+The child-start event is emitted just before a child process is started.
+It includes a unique child-id and the child's command line arguments.
+
+The child-exit event is emitted after a child process exits and has
+been reaped.  This event extends the start event with the child's exit
+status and elapsed time.
+
+For example, during a "git fetch origin", git.exe runs gc in the background
+and these events are emitted by the fetch process before and after the
+child gc process:
+
+{
+  "event_name": "child_start",
+  "argv": [
+

Re: [PATCH 00/23] Multi-pack-index (MIDX)

2018-06-07 Thread Ævar Arnfjörð Bjarmason


On Thu, Jun 07 2018, Derrick Stolee wrote:

> To test the performance in this situation, I created a
> script that organizes the Linux repository in a similar
> fashion. I split the commit history into 50 parts by
> creating branches on every 10,000 commits of the first-
> parent history. Then, `git rev-list --objects A ^B`
> provides the list of objects reachable from A but not B,
> so I could send that to `git pack-objects` to create
> these "time-based" packfiles. With these 50 packfiles
> (deleting the old one from my fresh clone, and deleting
> all tags as they were no longer on-disk) I could then
> test 'git rev-list --objects HEAD^{tree}' and see:
>
> Before: 0.17s
> After:  0.13s
> % Diff: -23.5%
>
> By adding logic to count hits and misses to bsearch_pack,
> I was able to see that the command above calls that
> method 266,930 times with a hit rate of 33%. The MIDX
> has the same number of calls with a 100% hit rate.

Do you have the script you used for this? It would be very interesting
as something we could stick in t/perf/ to test this use-case in the
future.

How does this & the numbers below compare to just a naïve
--max-pack-size= on linux.git?

Is it possible for you to tar this test repo up and share it as a
one-off? I've been polishing the core.validateAbbrev series I have, and
it would be interesting to compare some of the (abbrev) numbers.

> Abbreviation Speedups
> -
>
> To fully disambiguate an abbreviation, we must iterate
> through all packfiles to ensure no collision exists in
> any packfile. This requires O(P log N) time. With the
> MIDX, this is only O(log N) time. Our standard test [2]
> is 'git log --oneline --parents --raw' because it writes
> many abbreviations while also doing a lot of other work
> (walking commits and trees to compute the raw diff).
>
> For a copy of the Linux repository with 50 packfiles
> split by time, we observed the following:
>
> Before: 100.5 s
> After:   58.2 s
> % Diff: -59.7%
>
>
> Request for Review Attention
> 
>
> I tried my best to take the feedback from the commit-graph
> feature and apply it to this feature. I also worked to
> follow the object-store refactoring as I could. I also have
> some local commits that create a 'verify' subcommand and
> integrate with 'fsck' similar to the commit-graph, but I'll
> leave those for a later series (and review is still underway
> for that part of the commit-graph).
>
> One place where I could use some guidance is related to the
> current state of 'the_hash_algo' patches. The file format
> allows a different "hash version" which then indicates the
> length of the hash. What's the best way to ensure this
> feature doesn't cause extra pain in the hash-agnostic series?
> This will inform how I go back and make the commit-graph
> feature better in this area, too.
>
>
> Thanks,
> -Stolee


[PATCH v8 1/2] json_writer: new routines to create data in JSON format

2018-06-07 Thread git
From: Jeff Hostetler 

Add a series of jw_ routines and "struct json_writer" structure to compose
JSON data.  The resulting string data can then be output by commands wanting
to support a JSON output format.

The json-writer routines can be used to generate structured data in a
JSON-like format.  We say "JSON-like" because we do not enforce the Unicode
(usually UTF-8) requirement on string fields.  Internally, Git does not
necessarily have Unicode/UTF-8 data for most fields, so it is currently
+unclear how best to enforce that requirement.  For example, on Linux,
pathnames can contain arbitrary 8-bit character data, so a command like
"status" would not know how to encode the reported pathnames.  We may want
to revisit this (or double encode such strings) in the future.

The initial use for the json-writer routines is for generating telemetry
data for executed Git commands.  Later, we may want to use them in other
commands, such as status.

Helped-by: René Scharfe 
Helped-by: Wink Saville 
Helped-by: Ramsay Jones 
Signed-off-by: Jeff Hostetler 
---
 Makefile|   2 +
 json-writer.c   | 419 
 json-writer.h   | 113 +
 t/helper/test-json-writer.c | 572 
 t/t0019-json-writer.sh  | 236 ++
 5 files changed, 1342 insertions(+)
 create mode 100644 json-writer.c
 create mode 100644 json-writer.h
 create mode 100644 t/helper/test-json-writer.c
 create mode 100755 t/t0019-json-writer.sh

diff --git a/Makefile b/Makefile
index a1d8775..4ae6946 100644
--- a/Makefile
+++ b/Makefile
@@ -666,6 +666,7 @@ TEST_PROGRAMS_NEED_X += test-fake-ssh
 TEST_PROGRAMS_NEED_X += test-genrandom
 TEST_PROGRAMS_NEED_X += test-hashmap
 TEST_PROGRAMS_NEED_X += test-index-version
+TEST_PROGRAMS_NEED_X += test-json-writer
 TEST_PROGRAMS_NEED_X += test-lazy-init-name-hash
 TEST_PROGRAMS_NEED_X += test-line-buffer
 TEST_PROGRAMS_NEED_X += test-match-trees
@@ -820,6 +821,7 @@ LIB_OBJS += hashmap.o
 LIB_OBJS += help.o
 LIB_OBJS += hex.o
 LIB_OBJS += ident.o
+LIB_OBJS += json-writer.o
 LIB_OBJS += kwset.o
 LIB_OBJS += levenshtein.o
 LIB_OBJS += line-log.o
diff --git a/json-writer.c b/json-writer.c
new file mode 100644
index 000..f35ce19
--- /dev/null
+++ b/json-writer.c
@@ -0,0 +1,419 @@
+#include "cache.h"
+#include "json-writer.h"
+
+void jw_init(struct json_writer *jw)
+{
+   strbuf_reset(&jw->json);
+   strbuf_reset(&jw->open_stack);
+   strbuf_reset(&jw->first_stack);
+   jw->pretty = 0;
+}
+
+void jw_release(struct json_writer *jw)
+{
+   strbuf_release(&jw->json);
+   strbuf_release(&jw->open_stack);
+   strbuf_release(&jw->first_stack);
+}
+
+/*
+ * Append JSON-quoted version of the given string to 'out'.
+ */
+static void append_quoted_string(struct strbuf *out, const char *in)
+{
+   unsigned char c;
+
+   strbuf_addch(out, '"');
+   while ((c = *in++) != '\0') {
+   if (c == '"')
+   strbuf_addstr(out, "\\\"");
+   else if (c == '\\')
+   strbuf_addstr(out, "");
+   else if (c == '\n')
+   strbuf_addstr(out, "\\n");
+   else if (c == '\r')
+   strbuf_addstr(out, "\\r");
+   else if (c == '\t')
+   strbuf_addstr(out, "\\t");
+   else if (c == '\f')
+   strbuf_addstr(out, "\\f");
+   else if (c == '\b')
+   strbuf_addstr(out, "\\b");
+   else if (c < 0x20)
+   strbuf_addf(out, "\\u%04x", c);
+   else
+   strbuf_addch(out, c);
+   }
+   strbuf_addch(out, '"');
+}
+
+static inline void indent_pretty(struct json_writer *jw)
+{
+   int k;
+
+   if (!jw->pretty)
+   return;
+
+   for (k = 0; k < jw->open_stack.len; k++)
+   strbuf_addstr(&jw->json, "  ");
+}
+
+/*
+ * Begin an object or array (either top-level or nested within the currently
+ * open object or array).
+ */
+static inline void begin(struct json_writer *jw, char ch_open, int pretty)
+{
+   jw->pretty = pretty;
+
+   strbuf_addch(&jw->json, ch_open);
+
+   strbuf_addch(&jw->open_stack, ch_open);
+   strbuf_addch(&jw->first_stack, '1');
+}
+
+/*
+ * Assert that the top of the open-stack is an object.
+ */
+static inline void assert_in_object(const struct json_writer *jw, const char 
*key)
+{
+   if (!jw->open_stack.len)
+   BUG("json-writer: object: missing jw_object_begin(): '%s'", 
key);
+   if (jw->open_stack.buf[jw->open_stack.len - 1] != '{')
+   BUG("json-writer: object: not in object: '%s'", key);
+}
+
+/*
+ * Assert that the top of the open-stack is an array.
+ */
+static inline void assert_in_array(const struct json_writer *jw)
+{
+   if (!jw->open_stack.len)
+   BUG("json-writer: array: missing jw_array_begin()");
+   if 

[PATCH v8 2/2] json-writer: t0019: add perl unit test

2018-06-07 Thread git
From: Jeff Hostetler 

Test json-writer output using perl script.

Signed-off-by: Jeff Hostetler 
---
 t/t0019-json-writer.sh  | 38 
 t/t0019/parse_json.perl | 52 +
 2 files changed, 90 insertions(+)
 create mode 100644 t/t0019/parse_json.perl

diff --git a/t/t0019-json-writer.sh b/t/t0019-json-writer.sh
index c9c2e23..fd61fe4 100755
--- a/t/t0019-json-writer.sh
+++ b/t/t0019-json-writer.sh
@@ -233,4 +233,42 @@ test_expect_success 'inline array with no members' '
test_cmp expect actual
 '
 
+# As a sanity check, ask Perl to parse our generated JSON and recursively
+# dump the resulting data in sorted order.  Confirm that that matches our
+# expectations.
+test_expect_success 'parse JSON using Perl' '
+   cat >expect <<-\EOF &&
+   row[0].a abc
+   row[0].b 42
+   row[0].sub1 hash
+   row[0].sub1.c 3.14
+   row[0].sub1.d 1
+   row[0].sub1.sub2 array
+   row[0].sub1.sub2[0] 0
+   row[0].sub1.sub2[1] hash
+   row[0].sub1.sub2[1].g 0
+   row[0].sub1.sub2[1].h 1
+   row[0].sub1.sub2[2] null
+   EOF
+   test-json-writer >output.json \
+   @object \
+   @object-string a abc \
+   @object-int b 42 \
+   @object-object "sub1" \
+   @object-double c 2 3.140 \
+   @object-true d \
+   @object-array "sub2" \
+   @array-false \
+   @array-object \
+   @object-int g 0 \
+   @object-int h 1 \
+   @end \
+   @array-null \
+   @end \
+   @end \
+   @end &&
+   perl "$TEST_DIRECTORY"/t0019/parse_json.perl actual &&
+   test_cmp expect actual
+'
+
 test_done
diff --git a/t/t0019/parse_json.perl b/t/t0019/parse_json.perl
new file mode 100644
index 000..ca4e5bf
--- /dev/null
+++ b/t/t0019/parse_json.perl
@@ -0,0 +1,52 @@
+#!/usr/bin/perl
+use strict;
+use warnings;
+use JSON;
+
+sub dump_array {
+my ($label_in, $ary_ref) = @_;
+my @ary = @$ary_ref;
+
+for ( my $i = 0; $i <= $#{ $ary_ref }; $i++ )
+{
+   my $label = "$label_in\[$i\]";
+   dump_item($label, $ary[$i]);
+}
+}
+
+sub dump_hash {
+my ($label_in, $obj_ref) = @_;
+my %obj = %$obj_ref;
+
+foreach my $k (sort keys %obj) {
+   my $label = (length($label_in) > 0) ? "$label_in.$k" : "$k";
+   my $value = $obj{$k};
+
+   dump_item($label, $value);
+}
+}
+
+sub dump_item {
+my ($label_in, $value) = @_;
+if (ref($value) eq 'ARRAY') {
+   print "$label_in array\n";
+   dump_array($label_in, $value);
+} elsif (ref($value) eq 'HASH') {
+   print "$label_in hash\n";
+   dump_hash($label_in, $value);
+} elsif (defined $value) {
+   print "$label_in $value\n";
+} else {
+   print "$label_in null\n";
+}
+}
+
+my $row = 0;
+while (<>) {
+my $data = decode_json( $_ );
+my $label = "row[$row]";
+
+dump_hash($label, $data);
+$row++;
+}
+
-- 
2.9.3



[PATCH v8 0/2] json-writer V8

2018-06-07 Thread git
From: Jeff Hostetler 

Here is V8 of my json-writer patches.  Please replace the existing V5/V6/V7
version of the jh/json-writer branch with this one.

This version uses perl rather than python to test the generated JSON.

Jeff Hostetler (2):
  json_writer: new routines to create data in JSON format
  json-writer: t0019: add perl unit test

 Makefile|   2 +
 json-writer.c   | 419 
 json-writer.h   | 113 +
 t/helper/test-json-writer.c | 572 
 t/t0019-json-writer.sh  | 274 +
 t/t0019/parse_json.perl |  52 
 6 files changed, 1432 insertions(+)
 create mode 100644 json-writer.c
 create mode 100644 json-writer.h
 create mode 100644 t/helper/test-json-writer.c
 create mode 100755 t/t0019-json-writer.sh
 create mode 100644 t/t0019/parse_json.perl

-- 
2.9.3



Re: [PATCH 00/23] Multi-pack-index (MIDX)

2018-06-07 Thread Derrick Stolee

On 6/7/2018 10:03 AM, Derrick Stolee wrote:

This patch series includes a rewrite of the previous
multi-pack-index RFC [1] using the feedback from the
commit-graph feature.


Sorry to everyone who got a duplicate copy of this series. I misspelled 
'kernel.org' and it didn't go to the list.


I also have this series available as a GitHub PR [1]

[1] https://github.com/derrickstolee/git/pull/7



[PATCH 07/23] midx: expand test data

2018-06-07 Thread Derrick Stolee
As we build the multi-pack-index file format, we want to test the format
on real repositories. Add tests to t5319-midx.sh that create repository
data including multiple packfiles with both version 1 and version 2
formats.

The current 'git midx write' command will always write the same file
with no "real" data. This will be expanded in future commits, along with
the test expectations.
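
In outline, the new tests repack the same objects with both index versions
and rewrite the midx each time, roughly (paraphrased; the pack base names
are only illustrative):

  git pack-objects --index-version=1 pack/test-1 <obj-list &&
  git midx --object-dir=. write &&
  git pack-objects --index-version=2 pack/test-2 <obj-list &&
  git midx --object-dir=. write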

Signed-off-by: Derrick Stolee 
---
 t/t5319-midx.sh | 101 
 1 file changed, 101 insertions(+)

diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index e78514d8e9..2c25a69744 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -14,8 +14,109 @@ midx_read_expect() {
 
 test_expect_success 'write midx with no packs' '
git midx --object-dir=. write &&
+   test_when_finished rm pack/multi-pack-index &&
test_path_is_file pack/multi-pack-index &&
midx_read_expect
 '
 
+test_expect_success 'create objects' '
+   for i in `test_seq 1 5`
+   do
+   iii=$(printf '%03i' $i)
+   test-tool genrandom "bar" 200 > wide_delta_$iii &&
+   test-tool genrandom "baz $iii" 50 >> wide_delta_$iii &&
+   test-tool genrandom "foo"$i 100 > deep_delta_$iii &&
+   test-tool genrandom "foo"$(expr $i + 1) 100 >> deep_delta_$iii 
&&
+   test-tool genrandom "foo"$(expr $i + 2) 100 >> deep_delta_$iii 
&&
+   echo $iii >file_$iii &&
+   test-tool genrandom "$iii" 8192 >>file_$iii &&
+   git update-index --add file_$iii deep_delta_$iii 
wide_delta_$iii &&
+   i=$(expr $i + 1) || return 1
+   done &&
+   { echo 101 && test-tool genrandom 100 8192; } >file_101 &&
+   git update-index --add file_101 &&
+   tree=$(git write-tree) &&
+   commit=$(git commit-tree $tree obj-list &&
+   git update-ref HEAD $commit
+'
+
+test_expect_success 'write midx with one v1 pack' '
+   pack=$(git pack-objects --index-version=1 pack/test  wide_delta_$iii &&
+   test-tool genrandom "baz $iii" 50 >> wide_delta_$iii &&
+   test-tool genrandom "foo"$i 100 > deep_delta_$iii &&
+   test-tool genrandom "foo"$(expr $i + 1) 100 >> deep_delta_$iii 
&&
+   test-tool genrandom "foo"$(expr $i + 2) 100 >> deep_delta_$iii 
&&
+   echo $iii >file_$iii &&
+   test-tool genrandom "$iii" 8192 >>file_$iii &&
+   git update-index --add file_$iii deep_delta_$iii 
wide_delta_$iii &&
+   i=$(expr $i + 1) || return 1
+   done &&
+   { echo 101 && test-tool genrandom 100 8192; } >file_101 &&
+   git update-index --add file_101 &&
+   tree=$(git write-tree) &&
+   commit=$(git commit-tree $tree -p HEADobj-list2 &&
+   git update-ref HEAD $commit
+'
+
+test_expect_success 'write midx with two packs' '
+   pack1=$(git pack-objects --index-version=1 pack/test-1  wide_delta_$iii &&
+   test-tool genrandom "baz $iii" 50 >> wide_delta_$iii &&
+   test-tool genrandom "foo"$i 100 > deep_delta_$iii &&
+   test-tool genrandom "foo"$(expr $i + 1) 100 >> deep_delta_$iii 
&&
+   test-tool genrandom "foo"$(expr $i + 2) 100 >> deep_delta_$iii 
&&
+   echo $iii >file_$iii &&
+   test-tool genrandom "$iii" 8192 >>file_$iii &&
+   git update-index --add file_$iii deep_delta_$iii 
wide_delta_$iii &&
+   { echo 101 && test-tool genrandom 100 8192; } >file_101 &&
+   git update-index --add file_101 &&
+   tree=$(git write-tree) &&
+   commit=$(git commit-tree $tree -p HEADobj-list &&
+   git update-ref HEAD $commit &&
+   git pack-objects --index-version=2 test-pack 

[PATCH 06/23] midx: struct midxed_git and 'read' subcommand

2018-06-07 Thread Derrick Stolee
As we build the multi-pack-index feature by adding chunks at a time,
we want to test that the data is being written correctly.

Create struct midxed_git to store an in-memory representation of a
multi-pack-index and a memory-map of the binary file. Initialize this
struct in load_midxed_git(object_dir).

Create the 'git midx read' subcommand to output basic information about
the multi-pack-index file. This will be expanded as more information is
written to the file.
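
For example, on an object directory that contains no packfiles (mirroring
the t5319 expectations):

  $ git midx --object-dir=. write
  $ git midx --object-dir=. read
  header: 4d494458 1 1 0 0
  object_dir: .

where the header fields are the signature, version, hash version, chunk
count and pack count.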

Signed-off-by: Derrick Stolee 
---
 Documentation/git-midx.txt | 11 +++
 builtin/midx.c | 23 +-
 midx.c | 65 ++
 midx.h |  9 ++
 object-store.h | 19 +++
 t/t5319-midx.sh| 12 ++-
 6 files changed, 137 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
index dcaeb1a91b..919283fdd8 100644
--- a/Documentation/git-midx.txt
+++ b/Documentation/git-midx.txt
@@ -23,6 +23,11 @@ OPTIONS
<dir>/packs/multi-pack-index for the current MIDX file, and
<dir>/packs for the pack-files to index.
 
+read::
+   When given as the verb, read the current MIDX file and output
+   basic information about its contents. Used for debugging
+   purposes only.
+
 write::
When given as the verb, write a new MIDX file to
<dir>/packs/multi-pack-index.
@@ -43,6 +48,12 @@ $ git midx write
 $ git midx --object-dir <alt> write
 ---
 
+* Read the MIDX file in the .git/objects folder.
++
+---
+$ git midx read
+---
+
 
 GIT
 ---
diff --git a/builtin/midx.c b/builtin/midx.c
index dc0a5acd3f..c7002f664a 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -6,7 +6,7 @@
 #include "midx.h"
 
 static char const * const builtin_midx_usage[] ={
-   N_("git midx [--object-dir ] [write]"),
+   N_("git midx [--object-dir ] [read|write]"),
NULL
 };
 
@@ -14,6 +14,25 @@ static struct opts_midx {
const char *object_dir;
 } opts;
 
+static int read_midx_file(const char *object_dir)
+{
+   struct midxed_git *m = load_midxed_git(object_dir);
+
+   if (!m)
+   return 0;
+
+   printf("header: %08x %d %d %d %d\n",
+  m->signature,
+  m->version,
+  m->hash_version,
+  m->num_chunks,
+  m->num_packs);
+
+   printf("object_dir: %s\n", m->object_dir);
+
+   return 0;
+}
+
 int cmd_midx(int argc, const char **argv, const char *prefix)
 {
static struct option builtin_midx_options[] = {
@@ -38,6 +57,8 @@ int cmd_midx(int argc, const char **argv, const char *prefix)
if (argc == 0)
return 0;
 
+   if (!strcmp(argv[0], "read"))
+   return read_midx_file(opts.object_dir);
if (!strcmp(argv[0], "write"))
return write_midx_file(opts.object_dir);
 
diff --git a/midx.c b/midx.c
index 3e55422a21..fa18770f1d 100644
--- a/midx.c
+++ b/midx.c
@@ -3,12 +3,15 @@
 #include "dir.h"
 #include "csum-file.h"
 #include "lockfile.h"
+#include "object-store.h"
 #include "midx.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
 #define MIDX_HASH_VERSION 1 /* SHA-1 */
 #define MIDX_HEADER_SIZE 12
+#define MIDX_HASH_LEN 20
+#define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
 static char *get_midx_filename(const char *object_dir)
 {
@@ -18,6 +21,68 @@ static char *get_midx_filename(const char *object_dir)
return strbuf_detach(&midx_name, NULL);
 }
 
+struct midxed_git *load_midxed_git(const char *object_dir)
+{
+   struct midxed_git *m;
+   int fd;
+   struct stat st;
+   size_t midx_size;
+   void *midx_map;
+   const char *midx_name = get_midx_filename(object_dir);
+
+   fd = git_open(midx_name);
+   if (fd < 0)
+   return NULL;
+   if (fstat(fd, &st)) {
+   close(fd);
+   return NULL;
+   }
+   midx_size = xsize_t(st.st_size);
+
+   if (midx_size < MIDX_MIN_SIZE) {
+   close(fd);
+   die("multi-pack-index file %s is too small", midx_name);
+   }
+
+   midx_map = xmmap(NULL, midx_size, PROT_READ, MAP_PRIVATE, fd, 0);
+
+   m = xcalloc(1, sizeof(*m) + strlen(object_dir) + 1);
+   strcpy(m->object_dir, object_dir);
+   m->data = midx_map;
+
+   m->signature = get_be32(m->data);
+   if (m->signature != MIDX_SIGNATURE) {
+   error("multi-pack-index signature %X does not match signature 
%X",
+ m->signature, MIDX_SIGNATURE);
+   goto cleanup_fail;
+   }
+
+   m->version = *(m->data + 4);
+   if (m->version != MIDX_VERSION) {
+   error("multi-pack-index version %d not recognized",
+ m->version);
+   goto cleanup_fail;
+   }
+
+   m->hash_version = 

[PATCH 08/23] midx: read packfiles from pack directory

2018-06-07 Thread Derrick Stolee
When constructing a multi-pack-index file for a given object directory,
read the files within the enclosed pack directory and find matches that
end with ".idx" and find the correct paired packfile using
add_packed_git().

Signed-off-by: Derrick Stolee 
---
 midx.c  | 51 +++--
 t/t5319-midx.sh | 15 ---
 2 files changed, 57 insertions(+), 9 deletions(-)

diff --git a/midx.c b/midx.c
index fa18770f1d..9fb89c80a2 100644
--- a/midx.c
+++ b/midx.c
@@ -102,10 +102,15 @@ static size_t write_midx_header(struct hashfile *f,
 int write_midx_file(const char *object_dir)
 {
unsigned char num_chunks = 0;
-   uint32_t num_packs = 0;
char *midx_name;
struct hashfile *f;
struct lock_file lk;
+   struct packed_git **packs = NULL;
+   uint32_t i, nr_packs = 0, alloc_packs = 0;
+   DIR *dir;
+   struct dirent *de;
+   struct strbuf pack_dir = STRBUF_INIT;
+   size_t pack_dir_len;
 
midx_name = get_midx_filename(object_dir);
if (safe_create_leading_directories(midx_name)) {
@@ -114,14 +119,56 @@ int write_midx_file(const char *object_dir)
  midx_name);
}
 
+   strbuf_addf(&pack_dir, "%s/pack", object_dir);
+   dir = opendir(pack_dir.buf);
+
+   if (!dir) {
+   error_errno("unable to open pack directory: %s",
+   pack_dir.buf);
+   strbuf_release(&pack_dir);
+   return 1;
+   }
+
+   strbuf_addch(&pack_dir, '/');
+   pack_dir_len = pack_dir.len;
+   ALLOC_ARRAY(packs, alloc_packs);
+   while ((de = readdir(dir)) != NULL) {
+   if (is_dot_or_dotdot(de->d_name))
+   continue;
+
+   if (ends_with(de->d_name, ".idx")) {
+   ALLOC_GROW(packs, nr_packs + 1, alloc_packs);
+
+   strbuf_setlen(&pack_dir, pack_dir_len);
+   strbuf_addstr(&pack_dir, de->d_name);
+
+   packs[nr_packs] = add_packed_git(pack_dir.buf,
+pack_dir.len,
+0);
+   if (!packs[nr_packs])
+   warning("failed to add packfile '%s'",
+   pack_dir.buf);
+   else
+   nr_packs++;
+   }
+   }
+   closedir(dir);
+   strbuf_release(&pack_dir);
+
hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
FREE_AND_NULL(midx_name);
 
-   write_midx_header(f, num_chunks, num_packs);
+   write_midx_header(f, num_chunks, nr_packs);
 
finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
	commit_lock_file(&lk);
 
+   for (i = 0; i < nr_packs; i++) {
+   close_pack(packs[i]);
+   FREE_AND_NULL(packs[i]);
+   }
+
+   FREE_AND_NULL(packs);
return 0;
 }
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index 2c25a69744..abe545c7c4 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -4,8 +4,9 @@ test_description='multi-pack-indexes'
 . ./test-lib.sh
 
 midx_read_expect() {
+   NUM_PACKS=$1
cat >expect <<- EOF
-   header: 4d494458 1 1 0 0
+   header: 4d494458 1 1 0 $NUM_PACKS
object_dir: .
EOF
git midx read --object-dir=. >actual &&
@@ -16,7 +17,7 @@ test_expect_success 'write midx with no packs' '
git midx --object-dir=. write &&
test_when_finished rm pack/multi-pack-index &&
test_path_is_file pack/multi-pack-index &&
-   midx_read_expect
+   midx_read_expect 0
 '
 
 test_expect_success 'create objects' '
@@ -47,14 +48,14 @@ test_expect_success 'write midx with one v1 pack' '
pack=$(git pack-objects --index-version=1 pack/test obj-list &&
git update-ref HEAD $commit &&
-   git pack-objects --index-version=2 test-pack 

[PATCH 17/23] midx: read objects from multi-pack-index

2018-06-07 Thread Derrick Stolee
Signed-off-by: Derrick Stolee 
---
 midx.c | 96 --
 midx.h |  2 ++
 object-store.h |  1 +
 packfile.c |  8 -
 4 files changed, 104 insertions(+), 3 deletions(-)

diff --git a/midx.c b/midx.c
index 5e9290ca8f..6eca8f1b12 100644
--- a/midx.c
+++ b/midx.c
@@ -3,6 +3,7 @@
 #include "dir.h"
 #include "csum-file.h"
 #include "lockfile.h"
+#include "sha1-lookup.h"
 #include "object-store.h"
 #include "packfile.h"
 #include "midx.h"
@@ -64,7 +65,7 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 
m = xcalloc(1, sizeof(*m) + strlen(object_dir) + 1);
strcpy(m->object_dir, object_dir);
-   m->data = midx_map;
+   m->data = (const unsigned char*)midx_map;
 
m->signature = get_be32(m->data);
if (m->signature != MIDX_SIGNATURE) {
@@ -145,7 +146,9 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 
m->num_objects = ntohl(m->chunk_oid_fanout[255]);
 
-   m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
+   m->packs = xcalloc(m->num_packs, sizeof(*m->packs));
+
+   ALLOC_ARRAY(m->pack_names, m->num_packs);
for (i = 0; i < m->num_packs; i++) {
if (i) {
if (ntohl(m->chunk_pack_lookup[i]) <= 
ntohl(m->chunk_pack_lookup[i - 1])) {
@@ -175,6 +178,95 @@ struct midxed_git *load_midxed_git(const char *object_dir)
exit(1);
 }
 
+static int prepare_midx_pack(struct midxed_git *m, uint32_t pack_int_id)
+{
+   struct strbuf pack_name = STRBUF_INIT;
+
+   if (pack_int_id >= m->num_packs)
+   BUG("bad pack-int-id");
+
+   if (m->packs[pack_int_id])
+   return 0;
+
+   strbuf_addstr(&pack_name, m->object_dir);
+   strbuf_addstr(&pack_name, "/pack/");
+   strbuf_addstr(&pack_name, m->pack_names[pack_int_id]);
+
+   m->packs[pack_int_id] = add_packed_git(pack_name.buf, pack_name.len, 1);
+   strbuf_release(&pack_name);
+   return !m->packs[pack_int_id];
+}
+
+int bsearch_midx(const struct object_id *oid, struct midxed_git *m, uint32_t 
*result)
+{
+   return bsearch_hash(oid->hash, m->chunk_oid_fanout, m->chunk_oid_lookup,
+   MIDX_HASH_LEN, result);
+}
+
+static off_t nth_midxed_offset(struct midxed_git *m, uint32_t pos)
+{
+   const unsigned char *offset_data;
+   uint32_t offset32;
+
+   offset_data = m->chunk_object_offsets + pos * MIDX_CHUNK_OFFSET_WIDTH;
+   offset32 = get_be32(offset_data + sizeof(uint32_t));
+
+   if (m->chunk_large_offsets && offset32 & MIDX_LARGE_OFFSET_NEEDED) {
+   if (sizeof(offset32) < sizeof(uint64_t))
+   die(_("multi-pack-index stores a 64-bit offset, but off_t is too small"));
+
+   offset32 ^= MIDX_LARGE_OFFSET_NEEDED;
+   return get_be64(m->chunk_large_offsets + sizeof(uint64_t) * offset32);
+   }
+
+   return offset32;
+}
+
+static uint32_t nth_midxed_pack_int_id(struct midxed_git *m, uint32_t pos)
+{
+   return get_be32(m->chunk_object_offsets + pos * 
MIDX_CHUNK_OFFSET_WIDTH);
+}
+
+static int nth_midxed_pack_entry(struct midxed_git *m, struct pack_entry *e, 
uint32_t pos)
+{
+   uint32_t pack_int_id;
+   struct packed_git *p;
+
+   if (pos >= m->num_objects)
+   return 0;
+
+   pack_int_id = nth_midxed_pack_int_id(m, pos);
+
+   if (prepare_midx_pack(m, pack_int_id))
+   die(_("error preparing packfile from multi-pack-index"));
+   p = m->packs[pack_int_id];
+
+   /*
+   * We are about to tell the caller where they can locate the
+   * requested object.  We better make sure the packfile is
+   * still here and can be accessed before supplying that
+   * answer, as it may have been deleted since the MIDX was
+   * loaded!
+   */
+   if (!is_pack_valid(p))
+   return 0;
+
+   e->offset = nth_midxed_offset(m, pos);
+   e->p = p;
+
+   return 1;
+}
+
+int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct 
midxed_git *m)
+{
+   uint32_t pos;
+
+   if (!bsearch_midx(oid, m, &pos))
+   return 0;
+
+   return nth_midxed_pack_entry(m, e, pos);
+}
+
 int prepare_midxed_git_one(struct repository *r, const char *object_dir)
 {
struct midxed_git *m = r->objects->midxed_git;
diff --git a/midx.h b/midx.h
index 793203fc4a..0c66812229 100644
--- a/midx.h
+++ b/midx.h
@@ -8,6 +8,8 @@
 #include "repository.h"
 
 struct midxed_git *load_midxed_git(const char *object_dir);
+int bsearch_midx(const struct object_id *oid, struct midxed_git *m, uint32_t 
*result);
+int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct 
midxed_git *m);
 int prepare_midxed_git_one(struct repository *r, const char *object_dir);
 
 int write_midx_file(const char *object_dir);
diff --git a/object-store.h b/object-store.h
index 7908d46e34..5af2a852bc 100644
--- a/object-store.h
+++ 

[PATCH 02/23] midx: add midx format details to pack-format.txt

2018-06-07 Thread Derrick Stolee
The multi-pack-index (MIDX) feature generalizes the existing pack-
index (IDX) feature by indexing objects across multiple pack-files.

Describe the basic file format, using a 12-byte header followed by
a lookup table for a list of "chunks" which will be described later.
The file ends with a footer containing a checksum using the hash
algorithm.

The header allows later versions to create breaking changes by
advancing the version number. We can also change the hash algorithm
using a different version value.

We will add the individual chunk format information as we introduce
the code that writes that information.
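
(Not part of the patch -- an illustrative, self-contained sketch of how
a reader might walk the layout described above. The helper names
read_be32()/read_be64() and dump_midx_layout() are invented for the
example; 'data' and 'len' stand for a mapped multi-pack-index file.)

  #include <inttypes.h>
  #include <stddef.h>
  #include <stdint.h>
  #include <stdio.h>

  static uint32_t read_be32(const unsigned char *p)
  {
          return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
                 ((uint32_t)p[2] << 8) | (uint32_t)p[3];
  }

  static uint64_t read_be64(const unsigned char *p)
  {
          return ((uint64_t)read_be32(p) << 32) | read_be32(p + 4);
  }

  static int dump_midx_layout(const unsigned char *data, size_t len)
  {
          uint32_t i, nr_chunks;

          if (len < 12 || read_be32(data) != 0x4d494458) /* "MIDX" */
                  return -1;

          printf("version %d, oid version %d, chunks %d, base files %d, packs %"PRIu32"\n",
                 data[4], data[5], data[6], data[7], read_be32(data + 8));

          nr_chunks = data[6];
          if (len < 12 + 12 * ((size_t)nr_chunks + 1))
                  return -1;

          /*
           * (C + 1) chunk lookup entries of a 4-byte id plus an 8-byte offset.
           * The terminating entry (id 0) points just past the last chunk, so
           * a chunk's length is the next entry's offset minus its own.
           */
          for (i = 0; i <= nr_chunks; i++) {
                  const unsigned char *entry = data + 12 + 12 * i;
                  printf("chunk %08"PRIx32" at offset %"PRIu64"\n",
                         read_be32(entry), read_be64(entry + 4));
          }
          return 0;
  }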

Signed-off-by: Derrick Stolee 
---
 Documentation/technical/pack-format.txt | 49 +
 1 file changed, 49 insertions(+)

diff --git a/Documentation/technical/pack-format.txt 
b/Documentation/technical/pack-format.txt
index 70a99fd142..17666b4bfc 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -252,3 +252,52 @@ Pack file entry: <+
 corresponding packfile.
 
 20-byte SHA-1-checksum of all of the above.
+
+== midx-*.midx files have the following format:
+
+The meta-index files refer to multiple pack-files and loose objects.
+
+In order to allow extensions that add extra data to the MIDX, we organize
+the body into "chunks" and provide a lookup table at the beginning of the
+body. The header includes certain length values, such as the number of packs,
+the number of base MIDX files, hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+   4-byte signature:
+   The signature is: {'M', 'I', 'D', 'X'}
+
+   1-byte version number:
+   Git only writes or recognizes version 1
+
+   1-byte Object Id Version
+   Git only writes or recognizes version 1 (SHA-1)
+
+   1-byte number (C) of "chunks"
+
+   1-byte number (I) of base multi-pack-index files:
+   This value is currently always zero.
+
+   4-byte number (P) of pack files
+
+CHUNK LOOKUP:
+
+   (C + 1) * 12 bytes providing the chunk offsets:
+   First 4 bytes describe chunk id. Value 0 is a terminating label.
+   Other 8 bytes provide offset in current file for chunk to start.
+   (Chunks are provided in file-order, so you can infer the length
+   using the next chunk position if necessary.)
+
+   The remaining data in the body is described one chunk at a time, and
+   these chunks may be given in any order. Chunks are required unless
+   otherwise specified.
+
+CHUNK DATA:
+
+   (This section intentionally left incomplete.)
+
+TRAILER:
+
+   H-byte HASH-checksum of all of the above.
-- 
2.18.0.rc1



[PATCH 05/23] midx: write header information to lockfile

2018-06-07 Thread Derrick Stolee
As we begin writing the multi-pack-index format to disk, start with
the basics: the 12-byte header and the 20-byte checksum footer. Start
with these basics so we can add the rest of the format in small
increments.

As we implement the format, we will use a technique to check that our
computed offsets within the multi-pack-index file match what we are
actually writing. Each method that writes to the hashfile will return
the number of bytes written, and we will track that those values match
our expectations.

Currently, write_midx_header() returns 12, but is not checked. We will
check the return value in a later commit.
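
(Not from the patch -- a toy, self-contained illustration of that
bookkeeping, with invented names like struct chunk_plan and
write_one_chunk(): every writer reports the byte count it produced, and
the caller checks the running total against the offset it planned to
advertise for that chunk.)

  #include <inttypes.h>
  #include <stdio.h>

  struct chunk_plan {
          uint32_t id;
          uint64_t expected_offset;
          size_t payload_len;
  };

  static size_t write_one_chunk(FILE *f, const struct chunk_plan *c)
  {
          size_t i;
          for (i = 0; i < c->payload_len; i++)
                  fputc(0, f); /* stand-in for real chunk data */
          return c->payload_len;
  }

  int main(void)
  {
          struct chunk_plan plan[] = {
                  { 0x504e414d /* "PNAM" */, 12, 64 },
                  { 0x4f49444c /* "OIDL" */, 12 + 64, 400 },
          };
          uint64_t written = 12; /* the 12-byte header was already written */
          FILE *f = tmpfile();
          int i;

          if (!f)
                  return 1;
          for (i = 0; i < 2; i++) {
                  if (written != plan[i].expected_offset) {
                          fprintf(stderr, "offset drift at chunk %08"PRIx32"\n",
                                  plan[i].id);
                          return 1;
                  }
                  written += write_one_chunk(f, &plan[i]);
          }
          return 0;
  }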

Signed-off-by: Derrick Stolee 
---
 midx.c  | 53 +
 t/t5319-midx.sh |  5 +++--
 2 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/midx.c b/midx.c
index 616af66b13..3e55422a21 100644
--- a/midx.c
+++ b/midx.c
@@ -1,9 +1,62 @@
 #include "git-compat-util.h"
 #include "cache.h"
 #include "dir.h"
+#include "csum-file.h"
+#include "lockfile.h"
 #include "midx.h"
 
+#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
+#define MIDX_VERSION 1
+#define MIDX_HASH_VERSION 1 /* SHA-1 */
+#define MIDX_HEADER_SIZE 12
+
+static char *get_midx_filename(const char *object_dir)
+{
+   struct strbuf midx_name = STRBUF_INIT;
+   strbuf_addstr(&midx_name, object_dir);
+   strbuf_addstr(&midx_name, "/pack/multi-pack-index");
+   return strbuf_detach(&midx_name, NULL);
+}
+
+static size_t write_midx_header(struct hashfile *f,
+   unsigned char num_chunks,
+   uint32_t num_packs)
+{
+   char byte_values[4];
+   hashwrite_be32(f, MIDX_SIGNATURE);
+   byte_values[0] = MIDX_VERSION;
+   byte_values[1] = MIDX_HASH_VERSION;
+   byte_values[2] = num_chunks;
+   byte_values[3] = 0; /* unused */
+   hashwrite(f, byte_values, sizeof(byte_values));
+   hashwrite_be32(f, num_packs);
+
+   return MIDX_HEADER_SIZE;
+}
+
 int write_midx_file(const char *object_dir)
 {
+   unsigned char num_chunks = 0;
+   uint32_t num_packs = 0;
+   char *midx_name;
+   struct hashfile *f;
+   struct lock_file lk;
+
+   midx_name = get_midx_filename(object_dir);
+   if (safe_create_leading_directories(midx_name)) {
+   UNLEAK(midx_name);
+   die_errno(_("unable to create leading directories of %s"),
+ midx_name);
+   }
+
+   hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
+   f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
+   FREE_AND_NULL(midx_name);
+
+   write_midx_header(f, num_chunks, num_packs);
+
+   finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
+   commit_lock_file(&lk);
+
return 0;
 }
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index a590137af7..80f9389837 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -3,8 +3,9 @@
 test_description='multi-pack-indexes'
 . ./test-lib.sh
 
-test_expect_success 'write midx with no pakcs' '
-   git midx --object-dir=. write
+test_expect_success 'write midx with no packs' '
+   git midx --object-dir=. write &&
+   test_path_is_file pack/multi-pack-index
 '
 
 test_done
-- 
2.18.0.rc1



[PATCH 21/23] midx: prevent duplicate packfile loads

2018-06-07 Thread Derrick Stolee
If the multi-pack-index contains a packfile, then we do not need to add
that packfile to the packed_git linked list or the MRU list.

Signed-off-by: Derrick Stolee 
---
 midx.c | 23 +++
 midx.h |  1 +
 packfile.c |  7 +++
 3 files changed, 31 insertions(+)

diff --git a/midx.c b/midx.c
index 388d79b7d9..3242646fe0 100644
--- a/midx.c
+++ b/midx.c
@@ -278,6 +278,29 @@ int fill_midx_entry(const struct object_id *oid, struct 
pack_entry *e, struct mi
return nth_midxed_pack_entry(m, e, pos);
 }
 
+int midx_contains_pack(struct midxed_git *m, const char *idx_name)
+{
+   uint32_t first = 0, last = m->num_packs;
+
+   while (first < last) {
+   uint32_t mid = first + (last - first) / 2;
+   const char *current;
+   int cmp;
+
+   current = m->pack_names[mid];
+   cmp = strcmp(idx_name, current);
+   if (!cmp)
+   return 1;
+   if (cmp > 0) {
+   first = mid + 1;
+   continue;
+   }
+   last = mid;
+   }
+
+   return 0;
+}
+
 int prepare_midxed_git_one(struct repository *r, const char *object_dir)
 {
struct midxed_git *m = r->objects->midxed_git;
diff --git a/midx.h b/midx.h
index 497bdcc77c..c1db58d8c4 100644
--- a/midx.h
+++ b/midx.h
@@ -13,6 +13,7 @@ struct object_id *nth_midxed_object_oid(struct object_id *oid,
struct midxed_git *m,
uint32_t n);
 int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct 
midxed_git *m);
+int midx_contains_pack(struct midxed_git *m, const char *idx_name);
 int prepare_midxed_git_one(struct repository *r, const char *object_dir);
 
 int write_midx_file(const char *object_dir);
diff --git a/packfile.c b/packfile.c
index 059b2aa097..479cb69b9f 100644
--- a/packfile.c
+++ b/packfile.c
@@ -746,6 +746,11 @@ static void prepare_packed_git_one(struct repository *r, 
char *objdir, int local
DIR *dir;
struct dirent *de;
struct string_list garbage = STRING_LIST_INIT_DUP;
+   struct midxed_git *m = r->objects->midxed_git;
+
+   /* look for the multi-pack-index for this object directory */
+   while (m && strcmp(m->object_dir, objdir))
+   m = m->next;
 
	strbuf_addstr(&path, objdir);
	strbuf_addstr(&path, "/pack");
@@ -772,6 +777,8 @@ static void prepare_packed_git_one(struct repository *r, 
char *objdir, int local
base_len = path.len;
if (strip_suffix_mem(path.buf, _len, ".idx")) {
/* Don't reopen a pack we already have. */
+   if (m && midx_contains_pack(m, de->d_name))
+   continue;
for (p = r->objects->packed_git; p;
 p = p->next) {
size_t len;
-- 
2.18.0.rc1



[PATCH 15/23] midx: create core.midx config setting

2018-06-07 Thread Derrick Stolee
The core.midx config setting controls the multi-pack-index (MIDX)
feature. If false, the setting will disable all reads from the
multi-pack-index file.

Add comparison commands in t5319-midx.sh to check that typical Git behavior
remains the same as the config setting is turned on and off. This
currently includes 'git rev-list' and 'git log' commands to trigger
several object database reads.

Signed-off-by: Derrick Stolee 
---
 Documentation/config.txt |  4 +++
 cache.h  |  1 +
 config.c |  5 
 environment.c|  1 +
 t/t5319-midx.sh  | 57 
 5 files changed, 57 insertions(+), 11 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index ab641bf5a9..e78150e452 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -908,6 +908,10 @@ core.commitGraph::
Enable git commit graph feature. Allows reading from the
commit-graph file.
 
+core.midx::
+   Enable multi-pack-index feature. Allows reading from the multi-
+   pack-index file.
+
 core.sparseCheckout::
Enable "sparse checkout" feature. See section "Sparse checkout" in
linkgit:git-read-tree[1] for more information.
diff --git a/cache.h b/cache.h
index 89a107a7f7..c7967f7643 100644
--- a/cache.h
+++ b/cache.h
@@ -814,6 +814,7 @@ extern char *git_replace_ref_base;
 extern int fsync_object_files;
 extern int core_preload_index;
 extern int core_commit_graph;
+extern int core_midx;
 extern int core_apply_sparse_checkout;
 extern int precomposed_unicode;
 extern int protect_hfs;
diff --git a/config.c b/config.c
index fbbf0f8e9f..0df3dbdf74 100644
--- a/config.c
+++ b/config.c
@@ -1313,6 +1313,11 @@ static int git_default_core_config(const char *var, 
const char *value)
return 0;
}
 
+   if (!strcmp(var, "core.midx")) {
+   core_midx = git_config_bool(var, value);
+   return 0;
+   }
+
if (!strcmp(var, "core.sparsecheckout")) {
core_apply_sparse_checkout = git_config_bool(var, value);
return 0;
diff --git a/environment.c b/environment.c
index 2a6de2330b..dcb4417604 100644
--- a/environment.c
+++ b/environment.c
@@ -67,6 +67,7 @@ enum object_creation_mode object_creation_mode = 
OBJECT_CREATION_MODE;
 char *notes_ref_name;
 int grafts_replace_parents = 1;
 int core_commit_graph;
+int core_midx;
 int core_apply_sparse_checkout;
 int merge_log_config = -1;
 int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
index 709652c635..1a50987778 100755
--- a/t/t5319-midx.sh
+++ b/t/t5319-midx.sh
@@ -3,6 +3,8 @@
 test_description='multi-pack-indexes'
 . ./test-lib.sh
 
+objdir=.git/objects
+
 midx_read_expect() {
NUM_PACKS=$1
NUM_OBJECTS=$2
@@ -62,13 +64,42 @@ test_expect_success 'write midx with one v1 pack' '
midx_read_expect 1 17 5 .
 '
 
+midx_git_two_modes() {
+   git -c core.midx=false $1 >expect &&
+   git -c core.midx=true $1 >actual &&
+   test_cmp expect actual
+}
+
+compare_results_with_midx() {
+   MSG=$1
+   test_expect_success "check normal git operations: $MSG" '
+   midx_git_two_modes "rev-list --objects --all" &&
+   midx_git_two_modes "log --raw"
+   '
+}
+
 test_expect_success 'write midx with one v2 pack' '
-   pack=$(git pack-objects --index-version=2,0x40 pack/test expect &&
+   git -c core.midx=true $1 >actual &&
+   test_cmp expect actual
+}
+
+compare_results_with_midx() {
+   MSG=$1
+   test_expect_success "check normal git operations: $MSG" '
+   midx_git_two_modes "rev-list --objects --all" &&
+   midx_git_two_modes "log --raw"
+   '
+}
+
+compare_results_with_midx "one v2 pack"
+
 test_expect_success 'Add more objects' '
for i in `test_seq 6 10`
do
@@ -94,12 +125,13 @@ test_expect_success 'Add more objects' '
 '
 
 test_expect_success 'write midx with two packs' '
-   pack1=$(git pack-objects --index-version=1 pack/test-1 obj-list &&
git update-ref HEAD $commit &&
-   git pack-objects --index-version=2 pack/test-pack   []
 corrupt_data() {
-- 
2.18.0.rc1



[PATCH 19/23] midx: use existing midx when writing new one

2018-06-07 Thread Derrick Stolee
Signed-off-by: Derrick Stolee 
---
 midx.c | 68 +-
 1 file changed, 63 insertions(+), 5 deletions(-)

diff --git a/midx.c b/midx.c
index 25d8142c2a..388d79b7d9 100644
--- a/midx.c
+++ b/midx.c
@@ -389,6 +389,23 @@ static int midx_oid_compare(const void *_a, const void *_b)
return a->pack_int_id - b->pack_int_id;
 }
 
+static int nth_midxed_pack_midx_entry(struct midxed_git *m,
+ uint32_t *pack_perm,
+ struct pack_midx_entry *e,
+ uint32_t pos)
+{
+   if (pos >= m->num_objects)
+   return 1;
+
+   nth_midxed_object_oid(&e->oid, m, pos);
+   e->pack_int_id = pack_perm[nth_midxed_pack_int_id(m, pos)];
+   e->offset = nth_midxed_offset(m, pos);
+
+   /* consider objects in midx to be from "old" packs */
+   e->pack_mtime = 0;
+   return 0;
+}
+
 static void fill_pack_entry(uint32_t pack_int_id,
struct packed_git *p,
uint32_t cur_object,
@@ -414,7 +431,8 @@ static void fill_pack_entry(uint32_t pack_int_id,
  * Copy only the de-duplicated entries (selected by most-recent modified time
  * of a packfile containing the object).
  */
-static struct pack_midx_entry *get_sorted_entries(struct packed_git **p,
+static struct pack_midx_entry *get_sorted_entries(struct midxed_git *m,
+ struct packed_git **p,
  uint32_t *perm,
  uint32_t nr_packs,
  uint32_t *nr_objects)
@@ -423,8 +441,9 @@ static struct pack_midx_entry *get_sorted_entries(struct 
packed_git **p,
uint32_t nr_fanout, alloc_fanout, alloc_objects, total_objects = 0;
struct pack_midx_entry *entries_by_fanout = NULL;
struct pack_midx_entry *deduplicated_entries = NULL;
+   uint32_t start_pack = m ? m->num_packs : 0;
 
-   for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) {
+   for (cur_pack = start_pack; cur_pack < nr_packs; cur_pack++) {
if (open_pack_index(p[cur_pack]))
continue;
 
@@ -445,7 +464,23 @@ static struct pack_midx_entry *get_sorted_entries(struct 
packed_git **p,
for (cur_fanout = 0; cur_fanout < 256; cur_fanout++) {
nr_fanout = 0;
 
-   for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) {
+   if (m) {
+   uint32_t start = 0, end;
+
+   if (cur_fanout)
+   start = ntohl(m->chunk_oid_fanout[cur_fanout - 1]);
+   end = ntohl(m->chunk_oid_fanout[cur_fanout]);
+
+   for (cur_object = start; cur_object < end; cur_object++) {
+   ALLOC_GROW(entries_by_fanout, nr_fanout + 1, alloc_fanout);
+   nth_midxed_pack_midx_entry(m, perm,
+  &entries_by_fanout[nr_fanout],
+  cur_object);
+   nr_fanout++;
+   }
+   }
+
+   for (cur_pack = start_pack; cur_pack < nr_packs; cur_pack++) {
uint32_t start = 0, end;
 
if (cur_fanout)
@@ -654,6 +689,7 @@ int write_midx_file(const char *object_dir)
struct pack_midx_entry *entries;
uint32_t nr_entries, num_large_offsets = 0;
int large_offsets_needed = 0;
+   struct midxed_git *m = NULL;
 
midx_name = get_midx_filename(object_dir);
if (safe_create_leading_directories(midx_name)) {
@@ -662,6 +698,8 @@ int write_midx_file(const char *object_dir)
  midx_name);
}
 
+   m = load_midxed_git(object_dir);
+
	strbuf_addf(&pack_dir, "%s/pack", object_dir);
dir = opendir(pack_dir.buf);
 
@@ -676,11 +714,27 @@ int write_midx_file(const char *object_dir)
pack_dir_len = pack_dir.len;
ALLOC_ARRAY(packs, alloc_packs);
ALLOC_ARRAY(pack_names, alloc_pack_names);
+
+   if (m) {
+   for (i = 0; i < m->num_packs; i++) {
+   ALLOC_GROW(packs, nr_packs + 1, alloc_packs);
+   ALLOC_GROW(pack_names, nr_packs + 1, alloc_pack_names);
+
+   packs[nr_packs] = NULL;
+   pack_names[nr_packs] = xstrdup(m->pack_names[i]);
+   pack_name_concat_len += strlen(pack_names[nr_packs]) + 
1;
+   nr_packs++;
+   }
+   }
+
while ((de = readdir(dir)) != NULL) {
if (is_dot_or_dotdot(de->d_name))
continue;
 
if (ends_with(de->d_name, ".idx")) {
+   if (m && 

[PATCH 14/23] midx: write object offsets

2018-06-07 Thread Derrick Stolee
The final pair of chunks for the multi-pack-index (MIDX) file stores the
object offsets. We default to using 32-bit offsets as in the pack-index
version 1 format, but if there exists an offset larger than 32-bits, we
use a trick similar to the pack-index version 2 format by storing all
offsets at least 2^31 in a 64-bit table; we use the 32-bit table to
point into that 64-bit table as necessary.

We only store these 64-bit offsets if necessary, so create a test that
manipulates a version 2 pack-index to fake a large offset. This allows
us to test that the large offset table is created, but the data does not
match the actual packfile offsets. The MIDX offset does match the
(corrupted) pack-index offset, so a later commit will compare these
offsets during a 'verify' step.
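
(Not the patch's code -- a minimal sketch of the decoding rule described
above. Here 'entry32' is the 4-byte offset field from the object offsets
chunk and 'large_offsets' the optional 64-bit table, assumed to be
already byte-swapped into host order; the names are invented for the
example.)

  #include <stdint.h>

  #define LARGE_OFFSET_FLAG 0x80000000u

  static uint64_t decode_offset(uint32_t entry32, const uint64_t *large_offsets)
  {
          /* without the flag bit, the value is the pack offset itself */
          if (!(entry32 & LARGE_OFFSET_FLAG))
                  return entry32;

          /* the low 31 bits are a row number in the 64-bit table */
          return large_offsets[entry32 & ~LARGE_OFFSET_FLAG];
  }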

Signed-off-by: Derrick Stolee 
---
 Documentation/technical/pack-format.txt |  15 +++-
 builtin/midx.c  |   4 +
 midx.c  | 100 +++-
 object-store.h  |   2 +
 t/t5319-midx.sh |  45 ---
 5 files changed, 151 insertions(+), 15 deletions(-)

diff --git a/Documentation/technical/pack-format.txt 
b/Documentation/technical/pack-format.txt
index 77e88f85e4..0256cfb5e0 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -316,7 +316,20 @@ CHUNK DATA:
The OIDs for all objects in the MIDX are stored in lexicographic
order in this chunk.
 
-   (This section intentionally left incomplete.)
+   Object Offsets (ID: {'O', 'O', 'F', 'F'}) (N * 8 bytes)
+   Stores two 4-byte values for every object.
+   1: The pack-int-id for the pack storing this object.
+   2: The offset within the pack.
+   If all offsets are less than 2^31, then the large offset chunk
+   will not exist and offsets are stored as in IDX v1.
+   If there is at least one offset value larger than 2^32-1, then
+   the large offset chunk must exist. If the large offset chunk
+   exists and the 31st bit is on, then removing that bit reveals
+   the row in the large offsets containing the 8-byte offset of
+   this object.
+
+   [Optional] Object Large Offsets (ID: {'L', 'O', 'F', 'F'})
+   8-byte offsets into large packfiles.
 
 TRAILER:
 
diff --git a/builtin/midx.c b/builtin/midx.c
index e1fd0e0de4..607d2b3544 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -39,6 +39,10 @@ static int read_midx_file(const char *object_dir)
printf(" oid_fanout");
if (m->chunk_oid_lookup)
printf(" oid_lookup");
+   if (m->chunk_object_offsets)
+   printf(" object_offsets");
+   if (m->chunk_large_offsets)
+   printf(" large_offsets");
 
printf("\nnum_objects: %d\n", m->num_objects);
 
diff --git a/midx.c b/midx.c
index 9458ced208..a49300bf75 100644
--- a/midx.c
+++ b/midx.c
@@ -14,14 +14,19 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
-#define MIDX_MAX_CHUNKS 4
+#define MIDX_MAX_CHUNKS 6
 #define MIDX_CHUNK_ALIGNMENT 4
 #define MIDX_CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */
 #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
 #define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
+#define MIDX_CHUNKID_OBJECTOFFSETS 0x4f4f4646 /* "OOFF" */
+#define MIDX_CHUNKID_LARGEOFFSETS 0x4c4f4646 /* "LOFF" */
 #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
 #define MIDX_CHUNK_FANOUT_SIZE (sizeof(uint32_t) * 256)
+#define MIDX_CHUNK_OFFSET_WIDTH (2 * sizeof(uint32_t))
+#define MIDX_CHUNK_LARGE_OFFSET_WIDTH (sizeof(uint64_t))
 #define MIDX_LARGE_OFFSET_NEEDED 0x80000000
 
 static char *get_midx_filename(const char *object_dir)
 {
@@ -106,6 +111,14 @@ struct midxed_git *load_midxed_git(const char *object_dir)
m->chunk_oid_lookup = m->data + chunk_offset;
break;
 
+   case MIDX_CHUNKID_OBJECTOFFSETS:
+   m->chunk_object_offsets = m->data + 
chunk_offset;
+   break;
+
+   case MIDX_CHUNKID_LARGEOFFSETS:
+   m->chunk_large_offsets = m->data + chunk_offset;
+   break;
+
case 0:
+   die("terminating MIDX chunk id appears earlier than expected");
break;
@@ -127,6 +140,8 @@ struct midxed_git *load_midxed_git(const char *object_dir)
die("MIDX missing required OID fanout chunk");
if (!m->chunk_oid_lookup)
die("MIDX missing required OID lookup chunk");
+   if (!m->chunk_object_offsets)
+   die("MIDX missing required object offsets chunk");
 

[PATCH 20/23] midx: use midx in approximate_object_count

2018-06-07 Thread Derrick Stolee
Signed-off-by: Derrick Stolee 
---
 packfile.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/packfile.c b/packfile.c
index 638e113972..059b2aa097 100644
--- a/packfile.c
+++ b/packfile.c
@@ -819,11 +819,14 @@ unsigned long approximate_object_count(void)
 {
if (!the_repository->objects->approximate_object_count_valid) {
unsigned long count;
+   struct midxed_git *m;
struct packed_git *p;
 
prepare_packed_git(the_repository);
count = 0;
-   for (p = the_repository->objects->packed_git; p; p = p->next) {
+   for (m = get_midxed_git(the_repository); m; m = m->next)
+   count += m->num_objects;
+   for (p = get_packed_git(the_repository); p; p = p->next) {
if (open_pack_index(p))
continue;
count += p->num_objects;
-- 
2.18.0.rc1



[PATCH 23/23] midx: clear midx on repack

2018-06-07 Thread Derrick Stolee
If a 'git repack' command replaces existing packfiles, then we must
clear the existing multi-pack-index before moving the packfiles it
references.

Signed-off-by: Derrick Stolee 
---
 builtin/repack.c | 8 
 midx.c   | 8 
 midx.h   | 1 +
 3 files changed, 17 insertions(+)

diff --git a/builtin/repack.c b/builtin/repack.c
index 6c636e159e..66a7d8e8ea 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -8,6 +8,7 @@
 #include "strbuf.h"
 #include "string-list.h"
 #include "argv-array.h"
+#include "midx.h"
 
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
@@ -174,6 +175,7 @@ int cmd_repack(int argc, const char **argv, const char 
*prefix)
int no_update_server_info = 0;
int quiet = 0;
int local = 0;
+   int midx_cleared = 0;
 
struct option builtin_repack_options[] = {
OPT_BIT('a', NULL, _everything,
@@ -340,6 +342,12 @@ int cmd_repack(int argc, const char **argv, const char 
*prefix)
continue;
}
 
+   if (!midx_cleared) {
+   /* if we move a packfile, it will invalidate the midx */
+   clear_midx_file(get_object_directory());
+   midx_cleared = 1;
+   }
+
fname_old = mkpathdup("%s/old-%s%s", packdir,
item->string, exts[ext].name);
if (file_exists(fname_old))
diff --git a/midx.c b/midx.c
index e46f392fa4..1043c01fa7 100644
--- a/midx.c
+++ b/midx.c
@@ -913,3 +913,11 @@ int write_midx_file(const char *object_dir)
FREE_AND_NULL(pack_names);
return 0;
 }
+
+void clear_midx_file(const char *object_dir)
+{
+   char *midx = get_midx_filename(object_dir);
+
+   if (remove_path(midx))
+   die(_("failed to clear multi-pack-index at %s"), midx);
+}
diff --git a/midx.h b/midx.h
index 6996b5ff6b..46f9f44c94 100644
--- a/midx.h
+++ b/midx.h
@@ -18,5 +18,6 @@ int midx_contains_pack(struct midxed_git *m, const char 
*idx_name);
 int prepare_midxed_git_one(struct repository *r, const char *object_dir);
 
 int write_midx_file(const char *object_dir);
+void clear_midx_file(const char *object_dir);
 
 #endif
-- 
2.18.0.rc1



[PATCH 13/23] midx: write object id fanout chunk

2018-06-07 Thread Derrick Stolee
Signed-off-by: Derrick Stolee 
---
 Documentation/technical/pack-format.txt |  5 +++
 builtin/midx.c  |  4 +-
 midx.c  | 53 +++--
 object-store.h  |  1 +
 t/t5319-midx.sh | 18 +
 5 files changed, 69 insertions(+), 12 deletions(-)

diff --git a/Documentation/technical/pack-format.txt 
b/Documentation/technical/pack-format.txt
index de9ac778b6..77e88f85e4 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -307,6 +307,11 @@ CHUNK DATA:
name. This is the only chunk not guaranteed to be a multiple of four
bytes in length, so should be the last chunk for alignment reasons.
 
+   OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+   The ith entry, F[i], stores the number of OIDs with first
+   byte at most i. Thus F[255] stores the total
+   number of objects (N).
+
OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
The OIDs for all objects in the MIDX are stored in lexicographic
order in this chunk.
diff --git a/builtin/midx.c b/builtin/midx.c
index 86edd30174..e1fd0e0de4 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -35,10 +35,12 @@ static int read_midx_file(const char *object_dir)
printf(" pack_lookup");
if (m->chunk_pack_names)
printf(" pack_names");
+   if (m->chunk_oid_fanout)
+   printf(" oid_fanout");
if (m->chunk_oid_lookup)
printf(" oid_lookup");
 
-   printf("\n");
+   printf("\nnum_objects: %d\n", m->num_objects);
 
printf("packs:\n");
for (i = 0; i < m->num_packs; i++)
diff --git a/midx.c b/midx.c
index d06bc6876a..9458ced208 100644
--- a/midx.c
+++ b/midx.c
@@ -14,12 +14,14 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
-#define MIDX_MAX_CHUNKS 3
+#define MIDX_MAX_CHUNKS 4
 #define MIDX_CHUNK_ALIGNMENT 4
 #define MIDX_CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */
 #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
+#define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
+#define MIDX_CHUNK_FANOUT_SIZE (sizeof(uint32_t) * 256)
 
 static char *get_midx_filename(const char *object_dir)
 {
@@ -96,6 +98,10 @@ struct midxed_git *load_midxed_git(const char *object_dir)
m->chunk_pack_names = m->data + chunk_offset;
break;
 
+   case MIDX_CHUNKID_OIDFANOUT:
+   m->chunk_oid_fanout = (uint32_t *)(m->data + 
chunk_offset);
+   break;
+
case MIDX_CHUNKID_OIDLOOKUP:
m->chunk_oid_lookup = m->data + chunk_offset;
break;
@@ -117,9 +123,13 @@ struct midxed_git *load_midxed_git(const char *object_dir)
die("MIDX missing required pack lookup chunk");
if (!m->chunk_pack_names)
die("MIDX missing required pack-name chunk");
+   if (!m->chunk_oid_fanout)
+   die("MIDX missing required OID fanout chunk");
if (!m->chunk_oid_lookup)
die("MIDX missing required OID lookup chunk");
 
+   m->num_objects = ntohl(m->chunk_oid_fanout[255]);
+
m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
for (i = 0; i < m->num_packs; i++) {
if (i) {
@@ -377,6 +387,35 @@ static size_t write_midx_pack_names(struct hashfile *f,
return written;
 }
 
+static size_t write_midx_oid_fanout(struct hashfile *f,
+   struct pack_midx_entry *objects,
+   uint32_t nr_objects)
+{
+   struct pack_midx_entry *list = objects;
+   struct pack_midx_entry *last = objects + nr_objects;
+   uint32_t count = 0;
+   uint32_t i;
+
+   /*
+   * Write the first-level table (the list is sorted,
+   * but we use a 256-entry lookup to be able to avoid
+   * having to do eight extra binary search iterations).
+   */
+   for (i = 0; i < 256; i++) {
+   struct pack_midx_entry *next = list;
+
+   while (next < last && next->oid.hash[0] == i) {
+   count++;
+   next++;
+   }
+
+   hashwrite_be32(f, count);
+   list = next;
+   }
+
+   return MIDX_CHUNK_FANOUT_SIZE;
+}
+
 static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
struct pack_midx_entry *objects,
uint32_t nr_objects)
@@ -489,7 +528,7 @@ int write_midx_file(const char *object_dir)
FREE_AND_NULL(midx_name);
 
  

[PATCH 18/23] midx: use midx in abbreviation calculations

2018-06-07 Thread Derrick Stolee
Signed-off-by: Derrick Stolee 
---
 midx.c  | 11 
 midx.h  |  3 +++
 packfile.c  |  6 +
 packfile.h  |  1 +
 sha1-name.c | 70 +
 t/t5319-midx.sh |  3 ++-
 6 files changed, 93 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 6eca8f1b12..25d8142c2a 100644
--- a/midx.c
+++ b/midx.c
@@ -203,6 +203,17 @@ int bsearch_midx(const struct object_id *oid, struct 
midxed_git *m, uint32_t *re
MIDX_HASH_LEN, result);
 }
 
+struct object_id *nth_midxed_object_oid(struct object_id *oid,
+   struct midxed_git *m,
+   uint32_t n)
+{
+   if (n >= m->num_objects)
+   return NULL;
+
+   hashcpy(oid->hash, m->chunk_oid_lookup + m->hash_len * n);
+   return oid;
+}
+
 static off_t nth_midxed_offset(struct midxed_git *m, uint32_t pos)
 {
const unsigned char *offset_data;
diff --git a/midx.h b/midx.h
index 0c66812229..497bdcc77c 100644
--- a/midx.h
+++ b/midx.h
@@ -9,6 +9,9 @@
 
 struct midxed_git *load_midxed_git(const char *object_dir);
 int bsearch_midx(const struct object_id *oid, struct midxed_git *m, uint32_t 
*result);
+struct object_id *nth_midxed_object_oid(struct object_id *oid,
+   struct midxed_git *m,
+   uint32_t n);
 int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct 
midxed_git *m);
 int prepare_midxed_git_one(struct repository *r, const char *object_dir);
 
diff --git a/packfile.c b/packfile.c
index 73f8cc28ee..638e113972 100644
--- a/packfile.c
+++ b/packfile.c
@@ -919,6 +919,12 @@ struct packed_git *get_packed_git(struct repository *r)
return r->objects->packed_git;
 }
 
+struct midxed_git *get_midxed_git(struct repository *r)
+{
+   prepare_packed_git(r);
+   return r->objects->midxed_git;
+}
+
 struct list_head *get_packed_git_mru(struct repository *r)
 {
prepare_packed_git(r);
diff --git a/packfile.h b/packfile.h
index e0a38aba93..01e14b93fd 100644
--- a/packfile.h
+++ b/packfile.h
@@ -39,6 +39,7 @@ extern void install_packed_git(struct repository *r, struct 
packed_git *pack);
 
 struct packed_git *get_packed_git(struct repository *r);
 struct list_head *get_packed_git_mru(struct repository *r);
+struct midxed_git *get_midxed_git(struct repository *r);
 
 /*
  * Give a rough count of objects in the repository. This sacrifices accuracy
diff --git a/sha1-name.c b/sha1-name.c
index 60d9ef3c7e..d975a186c9 100644
--- a/sha1-name.c
+++ b/sha1-name.c
@@ -12,6 +12,7 @@
 #include "packfile.h"
 #include "object-store.h"
 #include "repository.h"
+#include "midx.h"
 
 static int get_oid_oneline(const char *, struct object_id *, struct 
commit_list *);
 
@@ -149,6 +150,32 @@ static int match_sha(unsigned len, const unsigned char *a, 
const unsigned char *
return 1;
 }
 
+static void unique_in_midx(struct midxed_git *m,
+  struct disambiguate_state *ds)
+{
+   uint32_t num, i, first = 0;
+   const struct object_id *current = NULL;
+   num = m->num_objects;
+
+   if (!num)
+   return;
+
+   bsearch_midx(&ds->bin_pfx, m, &first);
+
+   /*
+* At this point, "first" is the location of the lowest object
+* with an object name that could match "bin_pfx".  See if we have
+* 0, 1 or more objects that actually match(es).
+*/
+   for (i = first; i < num && !ds->ambiguous; i++) {
+   struct object_id oid;
+   current = nth_midxed_object_oid(&oid, m, i);
+   if (!match_sha(ds->len, ds->bin_pfx.hash, current->hash))
+   break;
+   update_candidates(ds, current);
+   }
+}
+
 static void unique_in_pack(struct packed_git *p,
   struct disambiguate_state *ds)
 {
@@ -177,8 +204,12 @@ static void unique_in_pack(struct packed_git *p,
 
 static void find_short_packed_object(struct disambiguate_state *ds)
 {
+   struct midxed_git *m;
struct packed_git *p;
 
+   for (m = get_midxed_git(the_repository); m && !ds->ambiguous;
+m = m->next)
+   unique_in_midx(m, ds);
for (p = get_packed_git(the_repository); p && !ds->ambiguous;
 p = p->next)
unique_in_pack(p, ds);
@@ -527,6 +558,42 @@ static int extend_abbrev_len(const struct object_id *oid, 
void *cb_data)
return 0;
 }
 
+static void find_abbrev_len_for_midx(struct midxed_git *m,
+struct min_abbrev_data *mad)
+{
+   int match = 0;
+   uint32_t num, first = 0;
+   struct object_id oid;
+   const struct object_id *mad_oid;
+
+   if (!m->num_objects)
+   return;
+
+   num = m->num_objects;
+   mad_oid = mad->oid;
+   match = bsearch_midx(mad_oid, m, &first);
+
+   /*
+* first is now the position in 

[PATCH 22/23] midx: use midx to find ref-deltas

2018-06-07 Thread Derrick Stolee
Signed-off-by: Derrick Stolee 
---
 midx.c |  2 +-
 midx.h |  1 +
 packfile.c | 15 +++
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 3242646fe0..e46f392fa4 100644
--- a/midx.c
+++ b/midx.c
@@ -214,7 +214,7 @@ struct object_id *nth_midxed_object_oid(struct object_id 
*oid,
return oid;
 }
 
-static off_t nth_midxed_offset(struct midxed_git *m, uint32_t pos)
+off_t nth_midxed_offset(struct midxed_git *m, uint32_t pos)
 {
const unsigned char *offset_data;
uint32_t offset32;
diff --git a/midx.h b/midx.h
index c1db58d8c4..6996b5ff6b 100644
--- a/midx.h
+++ b/midx.h
@@ -9,6 +9,7 @@
 
 struct midxed_git *load_midxed_git(const char *object_dir);
 int bsearch_midx(const struct object_id *oid, struct midxed_git *m, uint32_t 
*result);
+off_t nth_midxed_offset(struct midxed_git *m, uint32_t n);
 struct object_id *nth_midxed_object_oid(struct object_id *oid,
struct midxed_git *m,
uint32_t n);
diff --git a/packfile.c b/packfile.c
index 479cb69b9f..9b814c89c7 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1794,6 +1794,21 @@ off_t find_pack_entry_one(const unsigned char *sha1,
uint32_t result;
 
if (!index) {
+   /*
+* If we have a MIDX, then we want to
+* check the MIDX for the offset instead.
+*/
+   struct midxed_git *m;
+
+   for (m = get_midxed_git(the_repository); m; m = m->next) {
+   if (midx_contains_pack(m, p->pack_name)) {
+   if (bsearch_midx(, m, ))
+   return nth_midxed_offset(m, result);
+
+   break;
+   }
+   }
+
if (open_pack_index(p))
return 0;
}
-- 
2.18.0.rc1



[PATCH 11/23] midx: sort and deduplicate objects from packfiles

2018-06-07 Thread Derrick Stolee
Before writing a list of objects and their offsets to a multi-pack-index
(MIDX), we need to collect the list of objects contained in the
packfiles. There may be multiple copies of some objects, so this list
must be deduplicated.

It is possible to artificially get into a state where there are many
duplicate copies of objects. That can create high memory pressure if we
are to create a list of all objects before de-duplication. To reduce
this memory pressure without a significant performance drop,
automatically group objects by the first byte of their object id. Use
the IDX fanout tables to group the data, copy to a local array, then
sort.

Copy only the de-duplicated entries. Select the duplicate based on the
most-recent modified time of a packfile containing the object.
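
(Not the patch itself -- a compressed, self-contained sketch of the
per-fanout de-duplication described above, with simplified types. The
'bucket' array is assumed to hold every copy, from all packs, of the
objects whose first OID byte equals the current fanout value; the newest
containing pack wins for each OID.)

  #include <stdint.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>

  struct entry {
          unsigned char oid[20];
          uint32_t pack_int_id;
          time_t pack_mtime;
          uint64_t offset;
  };

  static int by_oid_then_newest(const void *a_, const void *b_)
  {
          const struct entry *a = a_, *b = b_;
          int cmp = memcmp(a->oid, b->oid, 20);

          if (cmp)
                  return cmp;
          if (a->pack_mtime != b->pack_mtime)
                  return a->pack_mtime > b->pack_mtime ? -1 : 1; /* newest first */
          return (int)a->pack_int_id - (int)b->pack_int_id;
  }

  /* sort one fanout bucket, then append only the first copy of each OID */
  static size_t dedup_bucket(struct entry *bucket, size_t n,
                             struct entry *out, size_t nr_out)
  {
          size_t i;

          qsort(bucket, n, sizeof(*bucket), by_oid_then_newest);
          for (i = 0; i < n; i++) {
                  if (i && !memcmp(bucket[i].oid, bucket[i - 1].oid, 20))
                          continue; /* duplicate of a newer copy already kept */
                  out[nr_out++] = bucket[i];
          }
          return nr_out;
  }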

Signed-off-by: Derrick Stolee 
---
 midx.c | 138 +
 1 file changed, 138 insertions(+)

diff --git a/midx.c b/midx.c
index 923acda72e..b20d52713c 100644
--- a/midx.c
+++ b/midx.c
@@ -4,6 +4,7 @@
 #include "csum-file.h"
 #include "lockfile.h"
 #include "object-store.h"
+#include "packfile.h"
 #include "midx.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
@@ -190,6 +191,140 @@ static void sort_packs_by_name(char **pack_names, 
uint32_t nr_packs, uint32_t *p
}
 }
 
+static uint32_t get_pack_fanout(struct packed_git *p, uint32_t value)
+{
+   const uint32_t *level1_ofs = p->index_data;
+
+   if (!level1_ofs) {
+   if (open_pack_index(p))
+   return 0;
+   level1_ofs = p->index_data;
+   }
+
+   if (p->index_version > 1) {
+   level1_ofs += 2;
+   }
+
+   return ntohl(level1_ofs[value]);
+}
+
+struct pack_midx_entry {
+   struct object_id oid;
+   uint32_t pack_int_id;
+   time_t pack_mtime;
+   uint64_t offset;
+};
+
+static int midx_oid_compare(const void *_a, const void *_b)
+{
+   struct pack_midx_entry *a = (struct pack_midx_entry *)_a;
+   struct pack_midx_entry *b = (struct pack_midx_entry *)_b;
+   int cmp = oidcmp(>oid, >oid);
+
+   if (cmp)
+   return cmp;
+
+   if (a->pack_mtime > b->pack_mtime)
+   return -1;
+   else if (a->pack_mtime < b->pack_mtime)
+   return 1;
+
+   return a->pack_int_id - b->pack_int_id;
+}
+
+static void fill_pack_entry(uint32_t pack_int_id,
+   struct packed_git *p,
+   uint32_t cur_object,
+   struct pack_midx_entry *entry)
+{
+   if (!nth_packed_object_oid(&entry->oid, p, cur_object))
+   die("failed to locate object %d in packfile", cur_object);
+
+   entry->pack_int_id = pack_int_id;
+   entry->pack_mtime = p->mtime;
+
+   entry->offset = nth_packed_object_offset(p, cur_object);
+}
+
+/*
+ * It is possible to artificially get into a state where there are many
+ * duplicate copies of objects. That can create high memory pressure if
+ * we are to create a list of all objects before de-duplication. To reduce
+ * this memory pressure without a significant performance drop, automatically
+ * group objects by the first byte of their object id. Use the IDX fanout
+ * tables to group the data, copy to a local array, then sort.
+ *
+ * Copy only the de-duplicated entries (selected by most-recent modified time
+ * of a packfile containing the object).
+ */
+static struct pack_midx_entry *get_sorted_entries(struct packed_git **p,
+ uint32_t *perm,
+ uint32_t nr_packs,
+ uint32_t *nr_objects)
+{
+   uint32_t cur_fanout, cur_pack, cur_object;
+   uint32_t nr_fanout, alloc_fanout, alloc_objects, total_objects = 0;
+   struct pack_midx_entry *entries_by_fanout = NULL;
+   struct pack_midx_entry *deduplicated_entries = NULL;
+
+   for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) {
+   if (open_pack_index(p[cur_pack]))
+   continue;
+
+   total_objects += p[cur_pack]->num_objects;
+   }
+
+   /*
+* As we de-duplicate by fanout value, we expect the fanout
+* slices to be evenly distributed, with some noise. Hence,
+* allocate slightly more than one 256th.
+*/
+   alloc_objects = alloc_fanout = total_objects > 3200 ? total_objects / 200 : 16;
+
+   ALLOC_ARRAY(entries_by_fanout, alloc_fanout);
+   ALLOC_ARRAY(deduplicated_entries, alloc_objects);
+   *nr_objects = 0;
+
+   for (cur_fanout = 0; cur_fanout < 256; cur_fanout++) {
+   nr_fanout = 0;
+
+   for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) {
+   uint32_t start = 0, end;
+
+   if (cur_fanout)
+   start = get_pack_fanout(p[cur_pack], cur_fanout - 1);
+   end = 

[PATCH 10/23] midx: write a lookup into the pack names chunk

2018-06-07 Thread Derrick Stolee
Signed-off-by: Derrick Stolee 
---
 Documentation/technical/pack-format.txt |  5 +++
 builtin/midx.c  |  7 
 midx.c  | 56 +++--
 object-store.h  |  2 +
 t/t5319-midx.sh | 11 +++--
 5 files changed, 75 insertions(+), 6 deletions(-)

diff --git a/Documentation/technical/pack-format.txt 
b/Documentation/technical/pack-format.txt
index 2b37be7b33..29bf87283a 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -296,6 +296,11 @@ CHUNK LOOKUP:
 
 CHUNK DATA:
 
+   Packfile Name Lookup (ID: {'P', 'L', 'O', 'O'}) (P * 4 bytes)
+   P * 4 bytes storing the offset in the packfile name chunk for
+   the null-terminated string containing the filename for the
+   ith packfile.
+
Packfile Names (ID: {'P', 'N', 'A', 'M'})
Stores the packfile names as concatenated, null-terminated strings.
Packfiles must be listed in lexicographic order for fast lookups by
diff --git a/builtin/midx.c b/builtin/midx.c
index fe56560853..3a261e9bbf 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -16,6 +16,7 @@ static struct opts_midx {
 
 static int read_midx_file(const char *object_dir)
 {
+   uint32_t i;
struct midxed_git *m = load_midxed_git(object_dir);
 
if (!m)
@@ -30,11 +31,17 @@ static int read_midx_file(const char *object_dir)
 
printf("chunks:");
 
+   if (m->chunk_pack_lookup)
+   printf(" pack_lookup");
if (m->chunk_pack_names)
printf(" pack_names");
 
printf("\n");
 
+   printf("packs:\n");
+   for (i = 0; i < m->num_packs; i++)
+   printf("%s\n", m->pack_names[i]);
+
printf("object_dir: %s\n", m->object_dir);
 
return 0;
diff --git a/midx.c b/midx.c
index d4f4a01a51..923acda72e 100644
--- a/midx.c
+++ b/midx.c
@@ -13,8 +13,9 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
-#define MIDX_MAX_CHUNKS 1
+#define MIDX_MAX_CHUNKS 2
 #define MIDX_CHUNK_ALIGNMENT 4
+#define MIDX_CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */
 #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
 #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
 
@@ -85,6 +86,10 @@ struct midxed_git *load_midxed_git(const char *object_dir)
uint64_t chunk_offset = get_be64(m->data + 16 + 
MIDX_CHUNKLOOKUP_WIDTH * i);
 
switch (chunk_id) {
+   case MIDX_CHUNKID_PACKLOOKUP:
+   m->chunk_pack_lookup = (uint32_t *)(m->data + 
chunk_offset);
+   break;
+
case MIDX_CHUNKID_PACKNAMES:
m->chunk_pack_names = m->data + chunk_offset;
break;
@@ -102,9 +107,32 @@ struct midxed_git *load_midxed_git(const char *object_dir)
}
}
 
+   if (!m->chunk_pack_lookup)
+   die("MIDX missing required pack lookup chunk");
if (!m->chunk_pack_names)
die("MIDX missing required pack-name chunk");
 
+   m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
+   for (i = 0; i < m->num_packs; i++) {
+   if (i) {
+   if (ntohl(m->chunk_pack_lookup[i]) <= 
ntohl(m->chunk_pack_lookup[i - 1])) {
+   error("MIDX pack lookup value %d before %d",
+ ntohl(m->chunk_pack_lookup[i - 1]),
+ ntohl(m->chunk_pack_lookup[i]));
+   goto cleanup_fail;
+   }
+   }
+
+   m->pack_names[i] = (const char *)(m->chunk_pack_names + 
ntohl(m->chunk_pack_lookup[i]));
+
+   if (i && strcmp(m->pack_names[i], m->pack_names[i - 1]) <= 0) {
+   error("MIDX pack names out of order: '%s' before '%s'",
+ m->pack_names[i - 1],
+ m->pack_names[i]);
+   goto cleanup_fail;
+   }
+   }
+
return m;
 
 cleanup_fail:
@@ -162,6 +190,20 @@ static void sort_packs_by_name(char **pack_names, uint32_t 
nr_packs, uint32_t *p
}
 }
 
+static size_t write_midx_pack_lookup(struct hashfile *f,
+char **pack_names,
+uint32_t nr_packs)
+{
+   uint32_t i, cur_len = 0;
+
+   for (i = 0; i < nr_packs; i++) {
+   hashwrite_be32(f, cur_len);
+   cur_len += strlen(pack_names[i]) + 1;
+   }
+
+   return sizeof(uint32_t) * (size_t)nr_packs;
+}
+
 static size_t write_midx_pack_names(struct hashfile *f,
char **pack_names,
uint32_t num_packs)
@@ -275,13 +317,17 @@ int 

[PATCH 04/23] midx: add 'write' subcommand and basic wiring

2018-06-07 Thread Derrick Stolee
In anticipation of writing multi-pack-indexes (MIDX files), add a
'git midx write' subcommand and send the options to a write_midx_file()
method. Also create a basic test file that tests the 'write' subcommand.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-midx.txt | 22 +-
 Makefile   |  1 +
 builtin/midx.c |  9 -
 midx.c |  9 +
 midx.h |  4 
 t/t5319-midx.sh| 10 ++
 6 files changed, 53 insertions(+), 2 deletions(-)
 create mode 100644 midx.c
 create mode 100644 midx.h
 create mode 100755 t/t5319-midx.sh

diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
index 2bd886f1a2..dcaeb1a91b 100644
--- a/Documentation/git-midx.txt
+++ b/Documentation/git-midx.txt
@@ -9,7 +9,7 @@ git-midx - Write and verify multi-pack-indexes (MIDX files).
 SYNOPSIS
 
 [verse]
-'git midx' [--object-dir <dir>]
+'git midx' [--object-dir <dir>] <verb>
 
 DESCRIPTION
 ---
@@ -23,6 +23,26 @@ OPTIONS
/packs/multi-pack-index for the current MIDX file, and
/packs for the pack-files to index.
 
+write::
+   When given as the verb, write a new MIDX file to
+   <dir>/packs/multi-pack-index.
+
+
+EXAMPLES
+
+
+* Write a MIDX file for the packfiles in the current .git folder.
++
+---
+$ git midx write
+---
+
+* Write a MIDX file for the packfiles in an alternate.
++
+---
+$ git midx --object-dir  write
+---
+
 
 GIT
 ---
diff --git a/Makefile b/Makefile
index 88958c7b42..aa86fcd8ec 100644
--- a/Makefile
+++ b/Makefile
@@ -890,6 +890,7 @@ LIB_OBJS += merge.o
 LIB_OBJS += merge-blobs.o
 LIB_OBJS += merge-recursive.o
 LIB_OBJS += mergesort.o
+LIB_OBJS += midx.o
 LIB_OBJS += name-hash.o
 LIB_OBJS += notes.o
 LIB_OBJS += notes-cache.o
diff --git a/builtin/midx.c b/builtin/midx.c
index 59ea92178f..dc0a5acd3f 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -3,9 +3,10 @@
 #include "config.h"
 #include "git-compat-util.h"
 #include "parse-options.h"
+#include "midx.h"
 
 static char const * const builtin_midx_usage[] ={
-   N_("git midx [--object-dir <dir>]"),
+   N_("git midx [--object-dir <dir>] [write]"),
NULL
 };
 
@@ -34,5 +35,11 @@ int cmd_midx(int argc, const char **argv, const char *prefix)
if (!opts.object_dir)
opts.object_dir = get_object_directory();
 
+   if (argc == 0)
+   return 0;
+
+   if (!strcmp(argv[0], "write"))
+   return write_midx_file(opts.object_dir);
+
return 0;
 }
diff --git a/midx.c b/midx.c
new file mode 100644
index 00..616af66b13
--- /dev/null
+++ b/midx.c
@@ -0,0 +1,9 @@
+#include "git-compat-util.h"
+#include "cache.h"
+#include "dir.h"
+#include "midx.h"
+
+int write_midx_file(const char *object_dir)
+{
+   return 0;
+}
diff --git a/midx.h b/midx.h
new file mode 100644
index 00..3a63673952
--- /dev/null
+++ b/midx.h
@@ -0,0 +1,4 @@
+#include "cache.h"
+#include "packfile.h"
+
+int write_midx_file(const char *object_dir);
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
new file mode 100755
index 00..a590137af7
--- /dev/null
+++ b/t/t5319-midx.sh
@@ -0,0 +1,10 @@
+#!/bin/sh
+
+test_description='multi-pack-indexes'
+. ./test-lib.sh
+
+test_expect_success 'write midx with no pakcs' '
+   git midx --object-dir=. write
+'
+
+test_done
-- 
2.18.0.rc1



[PATCH 12/23] midx: write object ids in a chunk

2018-06-07 Thread Derrick Stolee
Signed-off-by: Derrick Stolee 
---
 Documentation/technical/pack-format.txt |  4 ++
 builtin/midx.c  |  2 +
 midx.c  | 50 +++--
 object-store.h  |  1 +
 t/t5319-midx.sh |  4 +-
 5 files changed, 55 insertions(+), 6 deletions(-)

diff --git a/Documentation/technical/pack-format.txt 
b/Documentation/technical/pack-format.txt
index 29bf87283a..de9ac778b6 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -307,6 +307,10 @@ CHUNK DATA:
name. This is the only chunk not guaranteed to be a multiple of four
bytes in length, so should be the last chunk for alignment reasons.
 
+   OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+   The OIDs for all objects in the MIDX are stored in lexicographic
+   order in this chunk.
+
(This section intentionally left incomplete.)
 
 TRAILER:
diff --git a/builtin/midx.c b/builtin/midx.c
index 3a261e9bbf..86edd30174 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -35,6 +35,8 @@ static int read_midx_file(const char *object_dir)
printf(" pack_lookup");
if (m->chunk_pack_names)
printf(" pack_names");
+   if (m->chunk_oid_lookup)
+   printf(" oid_lookup");
 
printf("\n");
 
diff --git a/midx.c b/midx.c
index b20d52713c..d06bc6876a 100644
--- a/midx.c
+++ b/midx.c
@@ -14,10 +14,11 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
-#define MIDX_MAX_CHUNKS 2
+#define MIDX_MAX_CHUNKS 3
 #define MIDX_CHUNK_ALIGNMENT 4
 #define MIDX_CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */
 #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
+#define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
 
 static char *get_midx_filename(const char *object_dir)
@@ -95,6 +96,10 @@ struct midxed_git *load_midxed_git(const char *object_dir)
m->chunk_pack_names = m->data + chunk_offset;
break;
 
+   case MIDX_CHUNKID_OIDLOOKUP:
+   m->chunk_oid_lookup = m->data + chunk_offset;
+   break;
+
case 0:
	die("terminating MIDX chunk id appears earlier than expected");
break;
@@ -112,6 +117,8 @@ struct midxed_git *load_midxed_git(const char *object_dir)
die("MIDX missing required pack lookup chunk");
if (!m->chunk_pack_names)
die("MIDX missing required pack-name chunk");
+   if (!m->chunk_oid_lookup)
+   die("MIDX missing required OID lookup chunk");
 
m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
for (i = 0; i < m->num_packs; i++) {
@@ -370,6 +377,32 @@ static size_t write_midx_pack_names(struct hashfile *f,
return written;
 }
 
+static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
+   struct pack_midx_entry *objects,
+   uint32_t nr_objects)
+{
+   struct pack_midx_entry *list = objects;
+   uint32_t i;
+   size_t written = 0;
+
+   for (i = 0; i < nr_objects; i++) {
+   struct pack_midx_entry *obj = list++;
+
+   if (i < nr_objects - 1) {
+   struct pack_midx_entry *next = list;
+   if (oidcmp(&obj->oid, &next->oid) >= 0)
+   BUG("OIDs not in order: %s >= %s",
+   oid_to_hex(>oid),
+   oid_to_hex(>oid));
+   }
+
+   hashwrite(f, obj->oid.hash, (int)hash_len);
+   written += hash_len;
+   }
+
+   return written;
+}
+
 int write_midx_file(const char *object_dir)
 {
unsigned char cur_chunk, num_chunks = 0;
@@ -389,6 +422,7 @@ int write_midx_file(const char *object_dir)
uint64_t written = 0;
uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
+   struct pack_midx_entry *entries;
uint32_t nr_entries;
 
midx_name = get_midx_filename(object_dir);
@@ -448,14 +482,14 @@ int write_midx_file(const char *object_dir)
ALLOC_ARRAY(pack_perm, nr_packs);
sort_packs_by_name(pack_names, nr_packs, pack_perm);
 
-   get_sorted_entries(packs, pack_perm, nr_packs, &nr_entries);
+   entries = get_sorted_entries(packs, pack_perm, nr_packs, &nr_entries);
 
	hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR);
f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
FREE_AND_NULL(midx_name);
 
cur_chunk = 0;
-   num_chunks = 2;
+   num_chunks = 3;
 
written = write_midx_header(f, num_chunks, 

[PATCH 00/23] Multi-pack-index (MIDX)

2018-06-07 Thread Derrick Stolee
This patch series includes a rewrite of the previous
multi-pack-index RFC [1] using the feedback from the
commit-graph feature.

I based this series on 'next' as it requires the
recent object-store patches.

The multi-pack-index (MIDX) is explained fully in
the design document 'Documentation/technical/midx.txt'.
The short description is that the MIDX stores the
information from all of the IDX files in a pack
directory. The crucial design decision is that the
IDX files still exist, so we can fall back to the IDX
files if there is any issue with the MIDX (or core.midx
is set to false, or a user downgrades Git, etc.)

The MIDX feature has been part of our GVFS releases
for a few months (since the RFC). It has behaved well,
indexing over 31 million commits and trees across up
to 250 packfiles. These MIDX files are nearly 1GB in
size and take ~20 seconds to rewrite when adding new
IDX information. This ~20s mark is something I'd like
to improve, and I mention how to make the file
incremental (similar to split-index) in the design
document. I also want to make the commit-graph file
incremental, so I'd like to do that at the same time
after both the MIDX and commit-graph are stable.


Lookup Speedups
---

When looking for an object, Git uses a most-recently-
used (MRU) cache of packfiles. This does pretty well to
minimize the number of misses when searching through
packfiles for an object, especially if there is one
"big" packfile that contains most of the objects (so it
will rarely miss and is usually one of the first two
packfiles in the list). The MIDX does provide a way
to remove these misses, improving lookup time. However,
this lookup time greatly depends on the arrangement of
the packfiles.

For instance, if you take the Linux repository and repack
using `git repack -adfF --max-pack-size=128m` then all
commits will be in one packfile, all trees will be in
a small set of packfiles and organized well so 'git
rev-list --objects HEAD^{tree}' only inspects one or two
packfiles.

GVFS has the notion of a "prefetch packfile". These are
packfiles that are precomputed by cache servers to
contain the commits and trees introduced to the remote
each day. GVFS downloads these packfiles and places them
in an alternate. Since these are organized by "first
time introduced" and the working directory is so large,
the MRU misses are significant when performing a checkout
and updating the .git/index file.

To test the performance in this situation, I created a
script that organizes the Linux repository in a similar
fashion. I split the commit history into 50 parts by
creating branches on every 10,000 commits of the first-
parent history. Then, `git rev-list --objects A ^B`
provides the list of objects reachable from A but not B,
so I could send that to `git pack-objects` to create
these "time-based" packfiles. With these 50 packfiles
(deleting the old one from my fresh clone, and deleting
all tags as they were no longer on-disk) I could then
test 'git rev-list --objects HEAD^{tree}' and see:

Before: 0.17s
After:  0.13s
% Diff: -23.5%

By adding logic to count hits and misses to bsearch_pack,
I was able to see that the command above calls that
method 266,930 times with a hit rate of 33%. The MIDX
has the same number of calls with a 100% hit rate.
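
For reference, the splitting step can be scripted roughly like this (a
sketch, assuming branches split/1 .. split/50 already mark every
10,000th first-parent commit; the very first part, everything reachable
from split/1, is packed the same way without the exclusion):

  for i in $(seq 1 49)
  do
          # objects reachable from the newer split point but not the older one
          git rev-list --objects "split/$((i + 1))" "^split/$i" |
          git pack-objects .git/objects/pack/pack-part$i
  done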



Abbreviation Speedups
---------------------

To fully disambiguate an abbreviation, we must iterate
through all packfiles to ensure no collision exists in
any packfile. This requires O(P log N) time. With the
MIDX, this is only O(log N) time. Our standard test [2]
is 'git log --oneline --parents --raw' because it writes
many abbreviations while also doing a lot of other work
(walking commits and trees to compute the raw diff).

For a copy of the Linux repository with 50 packfiles
split by time, we observed the following:

Before: 100.5 s
After:   58.2 s
% Diff: -42.1%
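
The comparison can be reproduced by timing the same command with and
without the new file (a sketch; this assumes a multi-pack-index has
already been written and that core.midx gates reads as described in the
design document):

  time git -c core.midx=false log --oneline --parents --raw >/dev/null
  time git -c core.midx=true log --oneline --parents --raw >/dev/null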


Request for Review Attention
----------------------------

I tried my best to take the feedback from the commit-graph
feature and apply it to this feature. I also worked to
follow the object-store refactoring as I could. I also have
some local commits that create a 'verify' subcommand and
integrate with 'fsck' similar to the commit-graph, but I'll
leave those for a later series (and review is still underway
for that part of the commit-graph).

One place where I could use some guidance is related to the
current state of 'the_hash_algo' patches. The file format
allows a different "hash version" which then indicates the
length of the hash. What's the best way to ensure this
feature doesn't cause extra pain in the hash-agnostic series?
This will inform how I go back and make the commit-graph
feature better in this area, too.


Thanks,
-Stolee

[1] 
https://public-inbox.org/git/20180107181459.222909-1-dsto...@microsoft.com/T/#u
Previous MIDX RFC.

[2] https://public-inbox.org/git/20171012120220.226427-1-dsto...@microsoft.com/
A patch series on 

[PATCH 03/23] midx: add midx builtin

2018-06-07 Thread Derrick Stolee
This new 'git midx' builtin will be the plumbing access for writing,
reading, and checking multi-pack-index (MIDX) files. The initial
implementation is a no-op.

Signed-off-by: Derrick Stolee 
---
 .gitignore |  1 +
 Documentation/git-midx.txt | 29 +
 Makefile   |  1 +
 builtin.h  |  1 +
 builtin/midx.c | 38 ++
 command-list.txt   |  1 +
 git.c  |  1 +
 7 files changed, 72 insertions(+)
 create mode 100644 Documentation/git-midx.txt
 create mode 100644 builtin/midx.c

diff --git a/.gitignore b/.gitignore
index 388cc4beee..e309644d6b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -97,6 +97,7 @@
 /git-merge-subtree
 /git-mergetool
 /git-mergetool--lib
+/git-midx
 /git-mktag
 /git-mktree
 /git-name-rev
diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
new file mode 100644
index 00..2bd886f1a2
--- /dev/null
+++ b/Documentation/git-midx.txt
@@ -0,0 +1,29 @@
+git-midx(1)
+===========
+
+NAME
+----
+git-midx - Write and verify multi-pack-indexes (MIDX files).
+
+
+SYNOPSIS
+--------
+[verse]
+'git midx' [--object-dir <dir>]
+
+DESCRIPTION
+-----------
+Write or verify a MIDX file.
+
+OPTIONS
+-------
+
+--object-dir <dir>::
+   Use given directory for the location of Git objects. We check
+   <dir>/packs/multi-pack-index for the current MIDX file, and
+   <dir>/packs for the pack-files to index.
+
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Makefile b/Makefile
index 1d27f36365..88958c7b42 100644
--- a/Makefile
+++ b/Makefile
@@ -1045,6 +1045,7 @@ BUILTIN_OBJS += builtin/merge-index.o
 BUILTIN_OBJS += builtin/merge-ours.o
 BUILTIN_OBJS += builtin/merge-recursive.o
 BUILTIN_OBJS += builtin/merge-tree.o
+BUILTIN_OBJS += builtin/midx.o
 BUILTIN_OBJS += builtin/mktag.o
 BUILTIN_OBJS += builtin/mktree.o
 BUILTIN_OBJS += builtin/mv.o
diff --git a/builtin.h b/builtin.h
index 4e0f64723e..7b5bd46c7d 100644
--- a/builtin.h
+++ b/builtin.h
@@ -189,6 +189,7 @@ extern int cmd_merge_ours(int argc, const char **argv, 
const char *prefix);
 extern int cmd_merge_file(int argc, const char **argv, const char *prefix);
 extern int cmd_merge_recursive(int argc, const char **argv, const char 
*prefix);
 extern int cmd_merge_tree(int argc, const char **argv, const char *prefix);
+extern int cmd_midx(int argc, const char **argv, const char *prefix);
 extern int cmd_mktag(int argc, const char **argv, const char *prefix);
 extern int cmd_mktree(int argc, const char **argv, const char *prefix);
 extern int cmd_mv(int argc, const char **argv, const char *prefix);
diff --git a/builtin/midx.c b/builtin/midx.c
new file mode 100644
index 00..59ea92178f
--- /dev/null
+++ b/builtin/midx.c
@@ -0,0 +1,38 @@
+#include "builtin.h"
+#include "cache.h"
+#include "config.h"
+#include "git-compat-util.h"
+#include "parse-options.h"
+
+static char const * const builtin_midx_usage[] = {
+   N_("git midx [--object-dir <dir>]"),
+   NULL
+};
+
+static struct opts_midx {
+   const char *object_dir;
+} opts;
+
+int cmd_midx(int argc, const char **argv, const char *prefix)
+{
+   static struct option builtin_midx_options[] = {
+   { OPTION_STRING, 0, "object-dir", &opts.object_dir,
+ N_("dir"),
+ N_("The object directory containing set of packfile and 
pack-index pairs.") },
+   OPT_END(),
+   };
+
+   if (argc == 2 && !strcmp(argv[1], "-h"))
+   usage_with_options(builtin_midx_usage, builtin_midx_options);
+
+   git_config(git_default_config, NULL);
+
+   argc = parse_options(argc, argv, prefix,
+builtin_midx_options,
+builtin_midx_usage, 0);
+
+   if (!opts.object_dir)
+   opts.object_dir = get_object_directory();
+
+   return 0;
+}
diff --git a/command-list.txt b/command-list.txt
index e1c26c1bb7..a21bd7470e 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -123,6 +123,7 @@ git-merge-index plumbingmanipulators
 git-merge-one-file  purehelpers
 git-mergetool   ancillarymanipulators   
complete
 git-merge-tree  ancillaryinterrogators
+git-midx   plumbingmanipulators
 git-mktag   plumbingmanipulators
 git-mktree  plumbingmanipulators
 git-mv  mainporcelain   worktree
diff --git a/git.c b/git.c
index c2f48d53dd..400fadd677 100644
--- a/git.c
+++ b/git.c
@@ -503,6 +503,7 @@ static struct cmd_struct commands[] = {
{ "merge-recursive-theirs", cmd_merge_recursive, RUN_SETUP | 
NEED_WORK_TREE | NO_PARSEOPT },
{ "merge-subtree", cmd_merge_recursive, RUN_SETUP | NEED_WORK_TREE | 
NO_PARSEOPT },
{ "merge-tree", cmd_merge_tree, RUN_SETUP | NO_PARSEOPT },
+   { "midx", 

[PATCH 01/23] midx: add design document

2018-06-07 Thread Derrick Stolee
Signed-off-by: Derrick Stolee 
---
 Documentation/technical/midx.txt | 109 +++
 1 file changed, 109 insertions(+)
 create mode 100644 Documentation/technical/midx.txt

diff --git a/Documentation/technical/midx.txt b/Documentation/technical/midx.txt
new file mode 100644
index 00..789f410d71
--- /dev/null
+++ b/Documentation/technical/midx.txt
@@ -0,0 +1,109 @@
+Multi-Pack-Index (MIDX) Design Notes
+====================================
+
+The Git object directory contains a 'pack' directory containing
+packfiles (with suffix ".pack") and pack-indexes (with suffix
+".idx"). The pack-indexes provide a way to lookup objects and
+navigate to their offset within the pack, but these must come
+in pairs with the packfiles. This pairing depends on the file
+names, as the pack-index differs only in suffix with its pack-
+file. While the pack-indexes provide fast lookup per packfile,
+this performance degrades as the number of packfiles increases,
+because abbreviations need to inspect every packfile and we are
+more likely to have a miss on our most-recently-used packfile.
+For some large repositories, repacking into a single packfile
+is not feasible due to storage space or excessive repack times.
+
+The multi-pack-index (MIDX for short) stores a list of objects
+and their offsets into multiple packfiles. It contains:
+
+- A list of packfile names.
+- A sorted list of object IDs.
+- A list of metadata for the ith object ID including:
+  - A value j referring to the jth packfile.
+  - An offset within the jth packfile for the object.
+- If large offsets are required, we use another list of large
+  offsets similar to version 2 pack-indexes.
+
+Thus, we can provide O(log N) lookup time for any number
+of packfiles.
+
+Design Details
+--------------
+
+- The MIDX is stored in a file named 'multi-pack-index' in the
+  .git/objects/pack directory. This could be stored in the pack
+  directory of an alternate. It refers only to packfiles in that
+  same directory.
+
+- The core.midx config setting must be on to consume MIDX files.
+
+- The file format includes parameters for the object ID hash
+  function, so a future change of hash algorithm does not require
+  a change in format.
+
+- The MIDX keeps only one record per object ID. If an object appears
+  in multiple packfiles, then the MIDX selects the copy in the most-
+  recently modified packfile.
+
+- If there exist packfiles in the pack directory not registered in
+  the MIDX, then those packfiles are loaded into the `packed_git`
+  list and `packed_git_mru` cache.
+
+- The pack-indexes (.idx files) remain in the pack directory so we
+  can delete the MIDX file, set core.midx to false, or downgrade
+  without any loss of information.
+
+- The MIDX file format uses a chunk-based approach (similar to the
+  commit-graph file) that allows optional data to be added.
+
+Future Work
+-----------
+
+- Add a 'verify' subcommand to the 'git midx' builtin to verify the
+  contents of the multi-pack-index file match the offsets listed in
+  the corresponding pack-indexes.
+
+- The multi-pack-index allows many packfiles, especially in a context
+  where repacking is expensive (such as a very large repo), or
+  unexpected maintenance time is unacceptable (such as a high-demand
+  build machine). However, the multi-pack-index needs to be rewritten
+  in full every time. We can extend the format to be incremental, so
+  writes are fast. By storing a small "tip" multi-pack-index that
+  points to large "base" MIDX files, we can keep writes fast while
+  still reducing the number of binary searches required for object
+  lookups.
+
+- The reachability bitmap is currently paired directly with a single
+  packfile, using the pack-order as the object order to hopefully
+  compress the bitmaps well using run-length encoding. This could be
+  extended to pair a reachability bitmap with a multi-pack-index. If
+  the multi-pack-index is extended to store a "stable object order"
+  (a function Order(hash) = integer that is constant for a given hash,
+  even as the multi-pack-index is updated) then a reachability bitmap
+  could point to a multi-pack-index and be updated independently.
+
+- Packfiles can be marked as "special" using empty files that share
+  the initial name but replace ".pack" with ".keep" or ".promisor".
+  We can add an optional chunk of data to the multi-pack-index that
+  records flags of information about the packfiles. This allows new
+  states, such as 'repacked' or 'redeltified', that can help with
+  pack maintenance in a multi-pack environment. It may also be
+  helpful to organize packfiles by object type (commit, tree, blob,
+  etc.) and use this metadata to help that maintenance.
+
+- The partial clone feature records special "promisor" packs that
+  may point to objects that are not stored locally, but available
+  on request to a server. The multi-pack-index does not currently
+  track these promisor packs.
+
+Related 
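
As a concrete picture of the layout described above, a pack directory
using this feature would contain something like the following
(filenames illustrative):

  $ ls .git/objects/pack/
  multi-pack-index
  pack-<hash1>.idx
  pack-<hash1>.pack
  pack-<hash2>.idx
  pack-<hash2>.keep       <- optional "special" marker mentioned above
  pack-<hash2>.pack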

[PATCH 09/23] midx: write pack names in chunk

2018-06-07 Thread Derrick Stolee
The multi-pack-index (MIDX) needs to track which pack-files are covered
by the MIDX file. Store these in our first required chunk. Since
filenames are not well structured, add padding to keep good alignment in
later chunks.

Modify the 'git midx read' subcommand to output the existence of the
pack-file name chunk. Modify t5319-midx.sh to reflect this new output
and the new expected number of chunks.

Defense in depth: A pattern we are using in the multi-pack-index feature
is to verify the data as we write it. We want to ensure we never write
invalid data to the multi-pack-index. There are many checks during the
write of a MIDX file that double-check that the values we are writing
fit the format definitions. If any value is incorrect, then we notice
before writing invalid data. This mainly helps developers while working
on the feature, but it can also identify issues that only appear when
dealing with very large data sets. These large sets are hard to encode
into test cases.

Signed-off-by: Derrick Stolee 
---
 Documentation/technical/pack-format.txt |   6 +
 builtin/midx.c  |   7 +
 midx.c  | 176 +++-
 object-store.h  |   2 +
 t/t5319-midx.sh |   3 +-
 5 files changed, 188 insertions(+), 6 deletions(-)

diff --git a/Documentation/technical/pack-format.txt 
b/Documentation/technical/pack-format.txt
index 17666b4bfc..2b37be7b33 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -296,6 +296,12 @@ CHUNK LOOKUP:
 
 CHUNK DATA:
 
+   Packfile Names (ID: {'P', 'N', 'A', 'M'})
+   Stores the packfile names as concatenated, null-terminated strings.
+   Packfiles must be listed in lexicographic order for fast lookups by
+   name. This is the only chunk not guaranteed to be a multiple of four
+   bytes in length, so should be the last chunk for alignment reasons.
+
(This section intentionally left incomplete.)
 
 TRAILER:
diff --git a/builtin/midx.c b/builtin/midx.c
index c7002f664a..fe56560853 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -28,6 +28,13 @@ static int read_midx_file(const char *object_dir)
   m->num_chunks,
   m->num_packs);
 
+   printf("chunks:");
+
+   if (m->chunk_pack_names)
+   printf(" pack_names");
+
+   printf("\n");
+
printf("object_dir: %s\n", m->object_dir);
 
return 0;
diff --git a/midx.c b/midx.c
index 9fb89c80a2..d4f4a01a51 100644
--- a/midx.c
+++ b/midx.c
@@ -13,6 +13,11 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
+#define MIDX_MAX_CHUNKS 1
+#define MIDX_CHUNK_ALIGNMENT 4
+#define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
+#define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
+
 static char *get_midx_filename(const char *object_dir)
 {
struct strbuf midx_name = STRBUF_INIT;
@@ -29,6 +34,7 @@ struct midxed_git *load_midxed_git(const char *object_dir)
size_t midx_size;
void *midx_map;
const char *midx_name = get_midx_filename(object_dir);
+   uint32_t i;
 
fd = git_open(midx_name);
if (fd < 0)
@@ -74,6 +80,31 @@ struct midxed_git *load_midxed_git(const char *object_dir)
m->num_chunks = *(m->data + 6);
m->num_packs = get_be32(m->data + 8);
 
+   for (i = 0; i < m->num_chunks; i++) {
+   uint32_t chunk_id = get_be32(m->data + 12 + 
MIDX_CHUNKLOOKUP_WIDTH * i);
+   uint64_t chunk_offset = get_be64(m->data + 16 + 
MIDX_CHUNKLOOKUP_WIDTH * i);
+
+   switch (chunk_id) {
+   case MIDX_CHUNKID_PACKNAMES:
+   m->chunk_pack_names = m->data + chunk_offset;
+   break;
+
+   case 0:
+   die("terminating MIDX chunk id appears earlier 
than expected");
+   break;
+
+   default:
+   /*
+* Do nothing on unrecognized chunks, allowing 
future
+* extensions to add optional chunks.
+*/
+   break;
+   }
+   }
+
+   if (!m->chunk_pack_names)
+   die("MIDX missing required pack-name chunk");
+
return m;
 
 cleanup_fail:
@@ -99,18 +130,88 @@ static size_t write_midx_header(struct hashfile *f,
return MIDX_HEADER_SIZE;
 }
 
+struct pack_pair {
+   uint32_t pack_int_id;
+   char *pack_name;
+};
+
+static int pack_pair_compare(const void *_a, const void *_b)
+{
+   struct pack_pair *a = (struct pack_pair *)_a;
+   struct pack_pair *b = (struct pack_pair *)_b;
+   return strcmp(a->pack_name, b->pack_name);
+}
+
+static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, 
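
Based on the printf calls shown above, inspecting a written file with
the read subcommand should produce output along these lines (a sketch;
the header line comes from code that is not quoted here, and the exact
option spelling is an assumption):

  $ git midx read --object-dir .git/objects
  header: ...
  chunks: pack_names
  object_dir: .git/objects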

[PATCH 16/23] midx: prepare midxed_git struct

2018-06-07 Thread Derrick Stolee
Signed-off-by: Derrick Stolee 
---
 midx.c | 22 ++
 midx.h |  2 ++
 object-store.h |  7 +++
 packfile.c |  6 +-
 4 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index a49300bf75..5e9290ca8f 100644
--- a/midx.c
+++ b/midx.c
@@ -175,6 +175,28 @@ struct midxed_git *load_midxed_git(const char *object_dir)
exit(1);
 }
 
+int prepare_midxed_git_one(struct repository *r, const char *object_dir)
+{
+   struct midxed_git *m = r->objects->midxed_git;
+   struct midxed_git *m_search;
+
+   if (!core_midx)
+   return 0;
+
+   for (m_search = m; m_search; m_search = m_search->next)
+   if (!strcmp(object_dir, m_search->object_dir))
+   return 1;
+
+   r->objects->midxed_git = load_midxed_git(object_dir);
+
+   if (r->objects->midxed_git) {
+   r->objects->midxed_git->next = m;
+   return 1;
+   }
+
+   return 0;
+}
+
 static size_t write_midx_header(struct hashfile *f,
unsigned char num_chunks,
uint32_t num_packs)
diff --git a/midx.h b/midx.h
index a1d18ed991..793203fc4a 100644
--- a/midx.h
+++ b/midx.h
@@ -5,8 +5,10 @@
 #include "cache.h"
 #include "object-store.h"
 #include "packfile.h"
+#include "repository.h"
 
 struct midxed_git *load_midxed_git(const char *object_dir);
+int prepare_midxed_git_one(struct repository *r, const char *object_dir);
 
 int write_midx_file(const char *object_dir);
 
diff --git a/object-store.h b/object-store.h
index 9b671f1b0a..7908d46e34 100644
--- a/object-store.h
+++ b/object-store.h
@@ -130,6 +130,13 @@ struct raw_object_store {
 */
struct oidmap *replace_map;
 
+   /*
+* private data
+*
+* should only be accessed directly by packfile.c and midx.c
+*/
+   struct midxed_git *midxed_git;
+
/*
 * private data
 *
diff --git a/packfile.c b/packfile.c
index 1a714fbde9..b91ca9b9f5 100644
--- a/packfile.c
+++ b/packfile.c
@@ -15,6 +15,7 @@
 #include "tree-walk.h"
 #include "tree.h"
 #include "object-store.h"
+#include "midx.h"
 
 char *odb_pack_name(struct strbuf *buf,
const unsigned char *sha1,
@@ -893,10 +894,13 @@ static void prepare_packed_git(struct repository *r)
 
if (r->objects->packed_git_initialized)
return;
+   prepare_midxed_git_one(r, r->objects->objectdir);
prepare_packed_git_one(r, r->objects->objectdir, 1);
prepare_alt_odb(r);
-   for (alt = r->objects->alt_odb_list; alt; alt = alt->next)
+   for (alt = r->objects->alt_odb_list; alt; alt = alt->next) {
+   prepare_midxed_git_one(r, alt->path);
prepare_packed_git_one(r, alt->path, 0);
+   }
rearrange_packed_git(r);
prepare_packed_git_mru(r);
r->objects->packed_git_initialized = 1;
-- 
2.18.0.rc1



RE: git question from a newbie

2018-06-07 Thread Heinz, Steve
Bryan

Thank you.  I didn't realize that when you set up a remote repository, it is 
just a folder.  I thought the fact that I had it set up as a website was going 
to handle what I needed.
It wasn't until your email that I realized I had to use some type of client.  I 
installed Bonobo as the remote repository and bam it worked!

You are right that the info on Windows is a bit sparse.  I learned a lot and 
want to thank you again.

Steve Heinz


Steve Heinz | Lead Programmer Analyst, Information Technology
AAA Northeast | 1415 Kellum Place | Garden City, NY 11530
X8042 | T 516-535-2581 | F 516-873-2211
she...@aaanortheast.com | AAA.com


 It Pays to Belong.


-Original Message-
From: Bryan Turner 
Sent: Tuesday, June 05, 2018 6:29 PM
To: Heinz, Steve 
Cc: Git Users 
Subject: Re: git question from a newbie

On Tue, Jun 5, 2018 at 2:33 PM Heinz, Steve  wrote:
>
> Hi.
>
> I am new to Git and have read quite a few articles on it.
> I am planning on setting up a remote repository on a windows 2012 R2 server 
> and will access it via HTTPS.
> I am setting up a local repository on my desk top (others in my group will do 
> the same).
> On "server1":  I install Git and create a repository "repos".
> On "server1":  I create a dummy webpage "default.htm" and place it in the 
> repo folder.
> On "server1":  I create a web application in IIS pointing to Git
> On Server1":   change permissions so IIS_User  has access to the folders.
> On "server1":  inside the "repos" folder and right click and choose "bash 
> here"
> On "server1":   $ git init  -bare(it's really 2 hyphens)


This might create a _repository_, but it's not going to set up any Git hosting 
processing for it. You might be able to clone using the fallback to the "dumb" 
HTTP protocol (though I doubt it, with the steps you've shown), but you won't 
be able to push.

You need handlers for git-http-backend which handle info/refs and other 
requests that are related to the Git HTTP wire protocol.[1]

Documentation for setting up Git's HTTP protocol via Apache are pretty easy to 
find[2], but IIS instructions are a bit more sparse. I don't know of any good 
ones off the top of my head. But that's your issue; your IIS setup isn't really 
a valid Git remote; it's just a Git repository with contents visible via HTTP.

[1] 
https://github.com/git/git/blob/master/Documentation/technical/http-protocol.txt
[2] 
https://github.com/git/git/blob/master/Documentation/howto/setup-git-server-over-http.txt
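
As an aside, one quick way to check whether a URL really speaks the
smart HTTP protocol is to probe the info/refs endpoint (the URL below is
the placeholder from the report):

  curl -si "https://xyz.domainname.com/repos/info/refs?service=git-upload-pack" | head
  # a working smart remote answers with
  #   Content-Type: application/x-git-upload-pack-advertisement
  # and a pkt-line containing "# service=git-upload-pack";
  # a plain file listing or a 404 means git-http-backend is not wired up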

Bryan


Re: [PATCH v2] completion: reduce overhead of clearing cached --options

2018-06-07 Thread Rick van Hattem
The zsh script expects the bash completion script to be available so
that might be the issue here.

To reproduce this is what I do. First, a sparse checkout:
# mkdir ~/git
# cd ~/git
# git init
# git remote add origin g...@github.com:git/git.git
# git config core.sparseCheckout true
# mkdir .git/info
# echo contrib/completion >> .git/info/sparse-checkout
# git pull origin master --depth 1

After that I simply link the zsh script to _git:
# cd git/contrib/completion
# ln git-completion.zsh _git

And add the following to a new .zshrc file:
autoload -U compinit; compinit
autoload -U bashcompinit; bashcompinit
fpath=(~/git/contrib/completion $fpath)

And that appears to be enough to reproduce:
# git
git/contrib/completion/git-completion.bash:354: read-only variable: QISUFFIX
zsh:12: command not found: ___main
zsh:15: _default: function definition file not found
_dispatch:70: bad math expression: operand expected at `/usr/bin/g...'
zsh: segmentation fault  zsh

~rick

On 7 June 2018 at 07:48, Jonathan Nieder  wrote:
> Hi,
>
> SZEDER Gábor wrote:
>
>> Other Bash versions, notably 4.4.19 on macOS via homebrew (i.e. a
>> newer version on the same platform) and 3.2.25 on CentOS (i.e. a
>> slightly earlier version, though on a different platform) are not
>> affected.  ZSH in macOS (the versions shipped by default or installed
>> via homebrew) or on other platforms isn't affected either.
> [...]
>> --- a/contrib/completion/git-completion.bash
>> +++ b/contrib/completion/git-completion.bash
>> @@ -282,7 +282,11 @@ __gitcomp ()
>>
>>  # Clear the variables caching builtins' options when (re-)sourcing
>>  # the completion script.
>> -unset $(set |sed -ne 
>> 's/^\(__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*\)=.*/\1/p') 2>/dev/null
>> +if [[ -n ${ZSH_VERSION-} ]]; then
>> + unset $(set |sed -ne 
>> 's/^\(__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*\)=.*/\1/p') 2>/dev/null
>> +else
>> + unset $(compgen -v __gitcomp_builtin_)
>> +fi
>
> As discussed at [1], this fails catastrophically when sourced by
> git-completion.zsh, which explicitly clears ZSH_VERSION.  By
> catastrophically, I mean that Rick van Hattem and Dave Borowitz report
> that it segfaults.
>
> I can't reproduce it since I am having trouble following the
> instructions at the top of git-completion.zsh to get the recommended
> zsh completion script loading mechanism working at all.  (I've
> succeeded in using git's bash completion on zsh before, but that was
> many years ago.)  First attempt:
>
>  1. rm -fr ~/.zsh ~/.zshrc
> # you can move them out of the way instead for a less destructive
> # reproduction recipe
>  2. zsh
>  3. hit "q"
>  4. zstyle ':completion:*:*:git:*' script \
>   $(pwd)/contrib/completion/git-completion.zsh
>  5. git checkout 
>
> Expected result: segfault
>
> Actual result: succeeds, using zsh's standard completion script (not
> git's).
>
> I also tried
>
>  1. rm -fr ~/.zsh ~/.zshrc
> # you can move them out of the way instead for a less destructive
> # reproduction recipe
>  2. zsh
>  3. hit "2"
>  4. mkdir ~/.zsh; cp contrib/completion/git-completion.zsh ~/.zsh/_git
>  5. echo 'fpath=(~/.zsh $fpath)' >>~/.zshrc
>  6. ^D; zsh
>  7. git checkout 
>
> and a similar process with fpath earlier in the zshrc file.  Whatever
> I try, I end up with one of two results: either it uses zsh's standard
> completion, or it spews a stream of errors about missing functions
> like compgen.  What am I doing wrong?
>
> Related question: what would it take to add a zsh completion sanity
> check to t/t9902-completion.sh?
>
> When running with "set -x", here is what Dave Borowitz gets before the
> segfault.  I don't have a stacktrace because, as mentioned above, I
> haven't been able to reproduce it.
>
> str=git
> [[ -n git ]]
> service=git
> i=
> _compskip=default
> eval ''
> ret=0
> [[ default = *patterns* ]]
> [[ default = all ]]
> str=/usr/bin/git
> [[ -n /usr/bin/git ]]
> service=_dispatch:70: bad math expression: operand expected at 
> `/usr/bin/g...'
>
> zsh: segmentation fault  zsh
>
> Here's a minimal fix, untested.  As a followup, as Gábor suggests at [2],
> it would presumably make sense to stop overriding ZSH_VERSION, using
> this GIT_ scoped variable everywhere instead.
>
> Thoughts?
>
> Reported-by: Rick van Hattem 
> Reported-by: Dave Borowitz 
> Signed-off-by: Jonathan Nieder 
>
> [1] 
> https://public-inbox.org/git/01020163c683e753-04629405-15f8-4a30-9dc3-e4e3f2a5aa26-000...@eu-west-1.amazonses.com/
> [2] https://public-inbox.org/git/20180606114147.7753-1-szeder@gmail.com/
>
> diff --git i/contrib/completion/git-completion.bash 
> w/contrib/completion/git-completion.bash
> index 12814e9bbf..e4bcc435ea 100644
> --- i/contrib/completion/git-completion.bash
> +++ w/contrib/completion/git-completion.bash
> @@ -348,7 +348,7 @@ __gitcomp ()
>
>  # Clear the variables caching builtins' options when (re-)sourcing

RE: git question from a newbie

2018-06-07 Thread Heinz, Steve
Randall

Thank you, I tried it but that didn't work either.  I did find out what my 
issue was.  I need some type of client that would be set up to listen for the 
requests.

Steve


-Original Message-
From: Randall S. Becker 
Sent: Tuesday, June 05, 2018 6:19 PM
To: Heinz, Steve ; git@vger.kernel.org
Subject: RE: git question from a newbie

On June 5, 2018 5:24 PM, Steve Heinz wrote:
> I am new to Git and have read quite a few articles on it.
> I am planning on setting up a remote repository on a windows 2012 R2 server
> and will access it via HTTPS.
> I am setting up a local repository on my desk top (others in my group will do
> the same).
> On "server1":  I install Git and create a repository "repos".
> On "server1":  I create a dummy webpage "default.htm" and place it in
> the repo folder.
> On "server1":  I create a web application in IIS pointing to Git
> On Server1":   change permissions so IIS_User  has access to the folders.
> On "server1":  inside the "repos" folder and right click and choose
> "bash here"
> On "server1":   $ git init  -bare(it's really 2 hyphens)
>
> On Desktop:  open Chrome and type in URL to make sure we can access it
> https://xyz/repos/default.htm
>   ** make sure you have access, no certificate issues or firewall
> issues.  The pages shows up fine
>
> On Desktop:  install Git and create repository "repos".
> On Desktop:  right click in "repos" folder and choose "bash here"
> On Desktop:  $ git init
> On Desktop : add a folder "testProject" under the "repos" folder and
> add some files to the folder
> On Desktop:  $ git add . (will add files and folder to working tree)
> On Desktop   $ git status   (shows it recognizes the filed were added)
> On Desktop   $ git commit -m "test project commit"   (will stage changes)
> On Desktop   $ git push https://xyz.domainname.com/repos master
>
> ** this is the error I get,  I have tried many different things.  I am
> sure I am doing something stupid
> ** I have tried a bunch of variations but I always get the same error.
> It looks like some type of network/permission
> ** thing but everything seems OK.
>Fatal: repository 'https://xyz.domainname.com/repos/' not found
>
> *** this is where I get the error trying to push staged items to the
> remote repository.
> *** I have tried to clone the empty remote repository still same error
> *** I checked port 443 is opened and being used for https
> *** tried to set origin to https://xyz.domainname.com/repos; and then $git
> push origin master   (same error)
> *** I tried passing credentials to the remote server as well

Missing glue - git remote

git remote add origin https://xyz.domainname.com/repos

Cheers,
Randall

-- Brief whoami:
 NonStop developer since approximately 2112884442  UNIX developer since 
approximately 421664400
-- In my real life, I talk too much.





Re: [PATCH v2] completion: reduce overhead of clearing cached --options

2018-06-07 Thread Dave Borowitz
On Thu, Jun 7, 2018 at 1:48 AM Jonathan Nieder  wrote:
> Whatever
> I try, I end up with one of two results: either it uses zsh's standard
> completion, or it spews a stream of errors about missing functions
> like compgen.  What am I doing wrong?

Try adding to the top of your .zshrc:
autoload -U compinit; compinit
autoload -U bashcompinit; bashcompinit

Those are both in my .zshrc, and I think they are sufficient magic to
define compgen.


[PATCH] Use hyphenated "remote-tracking branch" (docs and comments)

2018-06-07 Thread Robert P. J. Day


Use the obvious consensus of hyphenated "remote-tracking branch", and
fix an obvious typo, all in documentation and comments.

Signed-off-by: Robert P. J. Day 

---

diff --git a/Documentation/git-submodule.txt b/Documentation/git-submodule.txt
index 4a5cc38a6f..ef9d9d28a9 100644
--- a/Documentation/git-submodule.txt
+++ b/Documentation/git-submodule.txt
@@ -42,8 +42,8 @@ have to use '../foo.git' instead of './foo.git' - as one 
might expect
 when following the rules for relative URLs - because the evaluation
 of relative URLs in Git is identical to that of relative directories).
 +
-The default remote is the remote of the remote tracking branch
-of the current branch. If no such remote tracking branch exists or
+The default remote is the remote of the remote-tracking branch
+of the current branch. If no such remote-tracking branch exists or
 the HEAD is detached, "origin" is assumed to be the default remote.
 If the superproject doesn't have a default remote configured
 the superproject is its own authoritative upstream and the current
diff --git a/builtin/branch.c b/builtin/branch.c
index d53f6e2ad4..5217ba3bde 100644
--- a/builtin/branch.c
+++ b/builtin/branch.c
@@ -701,7 +701,7 @@ int cmd_branch(int argc, const char **argv, const char 
*prefix)
 * If no sorting parameter is given then we default to sorting
 * by 'refname'. This would give us an alphabetically sorted
 * array with the 'HEAD' ref at the beginning followed by
-* local branches 'refs/heads/...' and finally remote-tacking
+* local branches 'refs/heads/...' and finally remote-tracking
 * branches 'refs/remotes/...'.
 */
if (!sorting)
diff --git a/builtin/pull.c b/builtin/pull.c
index 1f2ecf3a88..49cc3beb4c 100644
--- a/builtin/pull.c
+++ b/builtin/pull.c
@@ -673,7 +673,7 @@ static const char *get_upstream_branch(const char *remote)
 }

 /**
- * Derives the remote tracking branch from the remote and refspec.
+ * Derives the remote-tracking branch from the remote and refspec.
  *
  * FIXME: The current implementation assumes the default mapping of
  * refs/heads/ to refs/remotes//.
@@ -711,7 +711,7 @@ static const char *get_tracking_branch(const char *remote, 
const char *refspec)

 /**
  * Given the repo and refspecs, sets fork_point to the point at which the
- * current branch forked from its remote tracking branch. Returns 0 on success,
+ * current branch forked from its remote-tracking branch. Returns 0 on success,
  * -1 on failure.
  */
 static int get_rebase_fork_point(struct object_id *fork_point, const char 
*repo,

-- 


Robert P. J. Day Ottawa, Ontario, CANADA
  http://crashcourse.ca/dokuwiki

Twitter:   http://twitter.com/rpjday
LinkedIn:   http://ca.linkedin.com/in/rpjday



[GSoC][PATCH v3 1/1] rebase--interactive: rewrite append_todo_help() in C

2018-06-07 Thread Alban Gruin
This rewrites append_todo_help() from shell to C. It also incorporates
some parts of initiate_action() and complete_action() that also write
help texts to the todo file.

Two flags are added to rebase--helper.c: one to call
append_todo_help() (`--append-todo-help`), and another one to tell
append_todo_help() to write the help text suited for the edit-todo
mode (`--write-edit-todo`).

Finally, append_todo_help() is removed from git-rebase--interactive.sh
to use `rebase--helper --append-todo-help` instead.

Signed-off-by: Alban Gruin 
---
 builtin/rebase--helper.c   | 10 ++--
 git-rebase--interactive.sh | 52 ++--
 sequencer.c| 60 ++
 sequencer.h|  1 +
 4 files changed, 71 insertions(+), 52 deletions(-)

diff --git a/builtin/rebase--helper.c b/builtin/rebase--helper.c
index f7c2a5fdc..ded5e291d 100644
--- a/builtin/rebase--helper.c
+++ b/builtin/rebase--helper.c
@@ -12,12 +12,12 @@ static const char * const builtin_rebase_helper_usage[] = {
 int cmd_rebase__helper(int argc, const char **argv, const char *prefix)
 {
struct replay_opts opts = REPLAY_OPTS_INIT;
-   unsigned flags = 0, keep_empty = 0, rebase_merges = 0;
+   unsigned flags = 0, keep_empty = 0, rebase_merges = 0, write_edit_todo 
= 0;
int abbreviate_commands = 0, rebase_cousins = -1;
enum {
CONTINUE = 1, ABORT, MAKE_SCRIPT, SHORTEN_OIDS, EXPAND_OIDS,
CHECK_TODO_LIST, SKIP_UNNECESSARY_PICKS, REARRANGE_SQUASH,
-   ADD_EXEC
+   ADD_EXEC, APPEND_TODO_HELP
} command = 0;
struct option options[] = {
OPT_BOOL(0, "ff", _ff, N_("allow fast-forward")),
@@ -27,6 +27,8 @@ int cmd_rebase__helper(int argc, const char **argv, const 
char *prefix)
OPT_BOOL(0, "rebase-merges", _merges, N_("rebase merge 
commits")),
OPT_BOOL(0, "rebase-cousins", _cousins,
 N_("keep original branch points of cousins")),
+   OPT_BOOL(0, "write-edit-todo", _edit_todo,
+N_("append the edit-todo message to the todo-list")),
OPT_CMDMODE(0, "continue", , N_("continue rebase"),
CONTINUE),
OPT_CMDMODE(0, "abort", , N_("abort rebase"),
@@ -45,6 +47,8 @@ int cmd_rebase__helper(int argc, const char **argv, const 
char *prefix)
N_("rearrange fixup/squash lines"), REARRANGE_SQUASH),
OPT_CMDMODE(0, "add-exec-commands", ,
N_("insert exec commands in todo list"), ADD_EXEC),
+   OPT_CMDMODE(0, "append-todo-help", ,
+   N_("insert the help in the todo list"), 
APPEND_TODO_HELP),
OPT_END()
};
 
@@ -84,5 +88,7 @@ int cmd_rebase__helper(int argc, const char **argv, const 
char *prefix)
return !!rearrange_squash();
if (command == ADD_EXEC && argc == 2)
return !!sequencer_add_exec_commands(argv[1]);
+   if (command == APPEND_TODO_HELP && argc == 1)
+   return !!append_todo_help(write_edit_todo, keep_empty);
usage_with_options(builtin_rebase_helper_usage, options);
 }
diff --git a/git-rebase--interactive.sh b/git-rebase--interactive.sh
index 299ded213..94c23a7af 100644
--- a/git-rebase--interactive.sh
+++ b/git-rebase--interactive.sh
@@ -39,38 +39,6 @@ comment_for_reflog () {
esac
 }
 
-append_todo_help () {
-   gettext "
-Commands:
-p, pick  = use commit
-r, reword  = use commit, but edit the commit message
-e, edit  = use commit, but stop for amending
-s, squash  = use commit, but meld into previous commit
-f, fixup  = like \"squash\", but discard this commit's log message
-x, exec  = run command (the rest of the line) using shell
-d, drop  = remove commit
-l, label  = label current HEAD with a name
-t, reset  = reset HEAD to a label
-m, merge [-C  | -c ]  [# ]
-.   create a merge commit using the original merge commit's
-.   message (or the oneline, if no original merge commit was
-.   specified). Use -c  to reword the commit message.
-
-These lines can be re-ordered; they are executed from top to bottom.
-" | git stripspace --comment-lines >>"$todo"
-
-   if test $(get_missing_commit_check_level) = error
-   then
-   gettext "
-Do not remove any line. Use 'drop' explicitly to remove a commit.
-" | git stripspace --comment-lines >>"$todo"
-   else
-   gettext "
-If you remove a line here THAT COMMIT WILL BE LOST.
-" | git stripspace --comment-lines >>"$todo"
-   fi
-}
-
 die_abort () {
apply_autostash
rm -rf "$state_dir"
@@ -143,13 +111,7 @@ initiate_action () {
git stripspace --strip-comments <"$todo" >"$todo".new
mv -f "$todo".new "$todo"
collapse_todo_ids
-   append_todo_help
-   
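
The removed shell call is replaced by invocations of the helper with
the new flags (a sketch of the intended call sites; the exact hunk is
cut off above):

  # normal interactive-rebase path
  git rebase--helper --append-todo-help

  # edit-todo path
  git rebase--helper --write-edit-todo --append-todo-help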

[GSoC][PATCH v3 0/1] rebase -i: rewrite append_todo_help() in C

2018-06-07 Thread Alban Gruin
This patch rewrites append_todo_help() from shell to C. The C version
covers a bit more than the old shell version. To achieve that, some
parameters were added to rebase--helper.

This is part of the effort to rewrite interactive rebase in C.

Changes since v2:

 - Renaming the variable `edit_todo` to `write_edit_todo` to avoid
   confusions, after a comment by Christian Couder[1].

[1] https://github.com/git/git/pull/503#discussion_r193392270

Alban Gruin (1):
  rebase--interactive: rewrite append_todo_help() in C

 builtin/rebase--helper.c   | 10 ++--
 git-rebase--interactive.sh | 52 ++--
 sequencer.c| 60 ++
 sequencer.h|  1 +
 4 files changed, 71 insertions(+), 52 deletions(-)

-- 
2.16.4



Re: [RFC PATCH 0/5] format-patch: automate cover letter range-diff

2018-06-07 Thread Eric Sunshine
On Wed, Jun 6, 2018 at 3:16 PM, Duy Nguyen  wrote:
> On Wed, May 30, 2018 at 10:03 AM, Eric Sunshine  
> wrote:
>> Dscho recently implemented a 'tbdiff' replacement as a Git builtin named
>> git-branch-diff[1] which computes differences between two versions of a
>> patch series. Such a diff can be a useful aid for reviewers when
>> inserted into a cover letter. However, doing so requires manual
>> generation (invoking git-branch-diff) and copy/pasting the result into
>> the cover letter.
>
> Another option which I wanted to go is delegate part of cover letter
> generation to a hook (or just a config key that contains a shell
> command). This way it's easier to customize cover letters. We could
> still have a good fallback that does shortlog, diffstat and tbdiff.

It is common on this mailing list to turn down requests for new hooks
when the requested functionality could just as easily be implemented
via a wrapper script. So, my knee-jerk reaction is that a hook to
customize the cover letter may be overkill when the same functionality
could likely be implemented relatively easily by a shell script which
invokes git-format-patch and customizes the cover letter
after-the-fact. Same argument regarding a config key holding a shell
command. But, perhaps there are cases which don't occur to me which
could be helped by a config variable or such.

Of course, by the same reasoning, the --range-diff functionality
implemented by this patch series, which is just a convenience, could
be handled by a wrapper script, thus is not strictly needed. On the
other hand, given that interdiffs and range-diffs are so regularly
used in re-rolls on this list (and perhaps other mailing list-based
projects) may be argument enough in favor of having such an option
built into git-format-patch.
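
For illustration, the wrapper-script route could look something like
this (a sketch; the tbdiff invocation, branch names and output location
are assumptions, not taken from this thread):

  #!/bin/sh
  # format the series with a cover letter, then append a comparison
  # against the previous round for reviewers to read
  git format-patch --cover-letter -o outgoing/ master..topic-v2 &&
  git tbdiff master..topic-v1 master..topic-v2 >>outgoing/0000-cover-letter.patch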


Re: [PATCH v4 00/23] Fix incorrect use of the_index

2018-06-07 Thread Elijah Newren
On Wed, Jun 6, 2018 at 9:49 AM, Nguyễn Thái Ngọc Duy  wrote:
> v4 fixes some commit messages and killed a couple more the_index
> references after I tried to merge this with 'pu'

Thanks for tackling this.  The first 8 patches look good to me other
than the typo I pointed out on patch 4.  However, my eyes glazed over
a bit on the attr.c stuff in patch 7 that you specifically mentioned
in the commit message, so my "looks good" doesn't count for as much on
that one.

I'm getting sleepy, but I'll try to circle back and look over the rest
of the patches in the next few days.


Re: [PATCH v4 04/23] unpack-tress: convert clear_ce_flags* to avoid the_index

2018-06-07 Thread Elijah Newren
Subject line: unpack-trees rather than unpack-tress.



On Wed, Jun 6, 2018 at 9:49 AM, Nguyễn Thái Ngọc Duy  wrote:
> Prior to fba92be8f7, this code implicitly (and incorrectly) assumes
> the_index when running the exclude machinery. fba92be8f7 helps show
> this problem clearer because unpack-trees operation is supposed to
> work on whatever index the caller specifies... not specifically
> the_index.
>
> Update the code to use "istate" argument that's originally from
> mark_new_skip_worktree(). From the call sites, both in unpack_trees(),
> you can see that this function works on two separate indexes:
> o->src_index and o->result. The second mark_new_skip_worktree() so far
> has incorecctly applied exclude rules on o->src_index instead of
> o->result. It's unclear what is the consequences of this, but it's
> definitely wrong.
>
> [1] fba92be8f7 (dir: convert is_excluded_from_list to take an index -
> 2017-05-05)
>
> Signed-off-by: Nguyễn Thái Ngọc Duy 

A somewhat curious finding: when I was rebuilding and re-testing all
23 patches, I got a failure on this patch in test 31 of
t7063-status-untracked-cache.sh. I did not get any test failures with
any of the other patches.  However, after re-running that test or the
whole suite half a dozen times with just up to this patch applied, I
was not able to trigger the failure again.  Is there a rare race in
that testcase?  I certainly don't see anything in this patch that
appears problematic, and the fact that I couldn't readily reproduce
suggests it could well have been there before any of these patches.

Everything else in the patch looks fine to me.


Re: GDPR compliance best practices?

2018-06-07 Thread Peter Backes
Hi David,

thanks for your input on the issue.

> LEGAL GDPR NOTICE:
> According to the European data protection laws (GDPR), we would like to make 
> you
> aware that contributing to rsyslog via git will permanently store the
> name and email address you provide as well as the actual commit and the
> time and date you made it inside git's version history. This is inevitable,
> because it is a main feature git.

As we can see, rsyslog tries to solve the issue by the already 
discussed legal "technology" of disclaimers (which is certainly not 
accepted as state of the art technology by the GDPR). In essence, they 
are giving excuses for why they are not honoring the right to be 
forgotten.

Disclaimers do not work. They have no legal effect, they are placebos.

The GDPR does not accept such excuses. If it would, companies could 
arbitrarily design their data storage such as to make it "the main 
feature" to not honor the right to be forgotten and/or other GDPR 
rights. It is obvious that this cannot work, as it would completely 
undermine those rights.

The GDPR honors technology as a means to protect the individual's 
rights, not as a means to subvert them.

> If you are concerned about your
> privacy, we strongly recommend to use
> 
> --author "anonymous "
> 
> together with your commit.

This can only be a solution if the project rejects any commits which 
are not anonymous.

> However, we have valid reasons why we cannot remove that information
> later on. The reasons are:
> 
> * this would break git history and make future merges unworkable

This is not a valid excuse (see above). The technology has to be 
designed or applied in such a way that the individuals rights are 
honored, not the other way around.

In the absence of other means, the project has to rewrite history if it 
gets a valid request by someone exercising his right to be forgotten, 
even if that causes a lot of hassle for everyone.

> * the rsyslog projects has legitimate interest to keep a permanent record of 
> the
>   contributor identity, once given, for
>   - copyright verification
>   - being able to provide proof should a malicious commit be made

True, but that doesn't justify publishing that information and keeping 
it published even when someone exercises his right to be forgotten.

In that case, "legitimate interest" is not enough. There need to be 
"overriding legitimate grounds". I don't see them here.

> Please also note that your commit is public and as such will potentially be
> processed by many third-parties. Git's distributed nature makes it impossible
> to track where exactly your commit, and thus your personal data, will be 
> stored
> and be processed. If you would not like to accept this risk, please do either
> commit anonymously or refrain from contributing to the rsyslog project.

This is one of those statements that ultimately say "we do not honor 
the GDPR; either accept that or don't submit". That's the old, arguably 
ignorant mentality, and won't stand.

The project has to have a legal basis for publishing the personal 
metadata contained in the repository. In doubt, it needs to be consent 
based, as that is practically the only basis that allows putting the 
data on the internet for everyone to download. And consent can be 
withdrawn at any time.

The GDPR's transitional period started over two years ago. There was 
enough time to get everything GDPR compliant.

It might be possible to implement my solution without changing git, 
btw. Simply use the anonymous hash as author name, and store the random 
number and the author as a git-notes. git-notes can be rewritten or 
deleted at any time without changing the commit ID. I am currently 
looking into this solution. One just needs to add something that can 
verify and resolve those anonymous hashes.
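
A minimal sketch of that scheme (the secret, names, and notes ref below
are illustrative only):

  # commit under a pseudonymous identity derived from a private secret
  id=$(printf '%s %s' "$secret" "Real Name <real@example.org>" | sha256sum | cut -c1-16)
  git commit --author "anon-$id <anon-$id@example.invalid>" -m "some change"

  # keep the mapping in a note; notes can be rewritten or deleted later
  # without changing the commit ID
  git notes --ref=identities add -m "anon-$id = Real Name <real@example.org>" HEAD

  # to honor an erasure request later:
  git notes --ref=identities remove HEAD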

Best wishes
Peter


-- 
Peter Backes, r...@helen.plasma.xg8.de