Re: [RFC PATCH v1] http: add http.keepRejectedCredentials config
On Thu, Jun 07, 2018 at 08:15:16PM -0700, Lars Schneider wrote:

> > In fact, this patch probably should give the user some advice in that
> > regard (either in the documentation, or as a warning when we skip the
> > rejection). If you _do_ have a bogus credential and set the new option,
> > you'd need to reject it manually (you can do it with "git credential
> > reject", but it's probably easier to just unset the option temporarily
> > and re-invoke the original command).
>
> I like the advice idea very much!
>
> How about this?
>
>     $ git fetch
>     hint: Git has stored invalid credentials.
>     hint: Reject them with 'git credential reject' or
>     hint: disable the Git config 'http.keepRejectedCredentials'.
>     remote: Invalid username or password.
>     fatal: Authentication failed for 'https://server.com/myrepo.git/'
>
> I am not really sure about the grammar :-)

It's probably not worth pointing the user at "git credential reject",
since it's not really meant to be friendly to users. In particular, you
have to speak the credential protocol on stdin. I _think_

  echo https://server.com/myrepo.git | git credential reject

might be enough, but I didn't test.

Probably better advice is to just repeat the command. Maybe:

  hint: Git kept invalid credentials due to the value of
  hint: http.keepRejectedCredentials. If you wish to drop these
  hint: credentials and be prompted for new ones, re-run your
  hint: command with "git -c http.keepRejectedCredentials=false".

or something?

-Peff
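[Editor's note: for reference, the credential-helper protocol Peff alludes to is "key=value" lines terminated by a blank line, so a bare URL on stdin is not quite it, but git-credential(1) does accept a url= key. A small untested sketch, reusing the hypothetical repository URL from the thread:]

```shell
# Payload for `git credential reject`: key=value lines, then a blank line.
# Piping this into `git credential reject` asks all configured helpers to
# forget the credential for that URL (a no-op if none is stored).
printf 'url=https://server.com/myrepo.git\n\n'
```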
Re: [ANNOUNCE] Git v2.16.0-rc0
On Tue, Jan 02, 2018 at 09:35:16PM -0800, Jonathan Nieder wrote:

> > bturner@ubuntu:~$ ssh -V
> > OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8, OpenSSL 1.0.1f 6 Jan 2014
> >
> > bturner@ubuntu:~$ ssh -G -p 7999 localhost
> > unknown option -- G
> > usage: ssh [-1246AaCfgKkMNnqsTtVvXxYy] [-b bind_address] [-c cipher_spec]
> [...]
> > Is it possible to adjust the check, somehow, so it doesn't impact
> > older OpenSSH versions like this? As it stands, it seems likely a fair
> > number of users who have an SSH command that does support -4, -6 and
> > -p are going to end up getting "penalized" because it doesn't also
> > support -G, and have to manually set their SSH variant to "ssh" (or
> > something other than "auto") to avoid the automatic detection.
> >
> > I'd love to say I have a brilliant idea for how to work around this,
> > oh and here's a patch, but I don't. One option might be trying to
> > actually review the output, and another might be to run "ssh -V", but
> > both of those have their own flaws (and the extra process forks aren't
> > "free").
>
> I have tomorrow off, so I've filed https://crbug.com/git/7 to make
> sure I remember to follow up the day after. Of course I'll be happy
> if someone updates that bug saying they've fixed it in the meantime.

It doesn't look like we ever applied anything to deal with this
regression. Just FYI, this bit me today when upgrading my git on a
system that has an ssh wrapper that understands "-p" just fine, but not
"-G". So the behavior described in [1], namely to just fall back to
assuming some basic openssh-ish options, would have worked for me.

To be honest, I could easily see an argument that I _should_ be setting
GIT_SSH_VARIANT to explain what my wrapper is expecting, even though it
happened to work before. But it seems like this discussion ended in
favor of calling this a regression that should be fixed, and AFAICT
nothing happened after. So I thought I'd ping and mention one more data
point.
-Peff

[1] https://public-inbox.org/git/xmqqk1wyhcey@gitster.mtv.corp.google.com/
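[Editor's note: pinning the variant disables the `ssh -G` probe entirely. The documented escape hatches are the `ssh.variant` config key and the `GIT_SSH_VARIANT` environment variable; a minimal ~/.gitconfig fragment:]

```
# ~/.gitconfig — assume plain OpenSSH-style options (-4/-6/-p/-P) and
# skip auto-detection. Equivalent to exporting GIT_SSH_VARIANT=ssh.
[ssh]
	variant = ssh
```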
Re: [L10N] Kickoff for Git 2.18.0 l10n round 2
All these 144 new messages in this round of Git l10n are introduced by
commit f318d73915 (generate-cmds.sh: export all commands to
command-list.h). The updated script will export more commands and
one-line introductions in the file `command-list.h` than v2.18.0-rc0:

    $ git checkout -q v2.18.0-rc0 && ./generate-cmdlist.sh command-list.txt | wc -l
    38
    $ git checkout -q v2.18.0-rc1 && ./generate-cmdlist.sh command-list.txt | wc -l
    207

2018-06-08 9:21 GMT+08:00 Jiang Xin :
> Hi,
>
> Git 2.18.0-rc1 has been released, and introduced 36 more messages (144
> total) need to be translated. Let's start the 2nd round of l10n for
> Git 2.18.0.
>
> The new "git.pot" is generated in commit v2.18.0-rc1:
>
>     l10n: git.pot: v2.18.0 round 2 (144 new, 6 removed)
>
>     Generate po/git.pot from v2.18.0-rc1 for git v2.18.0 l10n round 2.
>
>     Signed-off-by: Jiang Xin
>
> You can get it from the usual place:
>
>     https://github.com/git-l10n/git-po/
>
> As how to update your XX.po and help to translate Git, please see
> "Updating a XX.po file" and other sections in "po/README" file.
>
> --
> Jiang Xin
Re: [RFC PATCH v1] http: add http.keepRejectedCredentials config
> On 04 Jun 2018, at 11:55, Jeff King wrote:
>
> On Mon, Jun 04, 2018 at 12:18:59PM -0400, Martin-Louis Bright wrote:
>
>> Why must the credentials must be deleted after receiving the 401 (or
>> any) error? What's the rationale for this?
>
> Because Git only tries a single credential per invocation. So if a
> helper provides one, it doesn't prompt. If you get a 401 and then the
> program aborts, invoking it again is just going to try the same
> credential over and over. Dropping the credential from the helper breaks
> out of that loop.
>
> In fact, this patch probably should give the user some advice in that
> regard (either in the documentation, or as a warning when we skip the
> rejection). If you _do_ have a bogus credential and set the new option,
> you'd need to reject it manually (you can do it with "git credential
> reject", but it's probably easier to just unset the option temporarily
> and re-invoke the original command).

I like the advice idea very much!

How about this?

    $ git fetch
    hint: Git has stored invalid credentials.
    hint: Reject them with 'git credential reject' or
    hint: disable the Git config 'http.keepRejectedCredentials'.
    remote: Invalid username or password.
    fatal: Authentication failed for 'https://server.com/myrepo.git/'

I am not really sure about the grammar :-)

Thanks,
Lars
Re: GDPR compliance best practices?
On Fri, Jun 08, 2018 at 01:21:29AM +0200, Peter Backes wrote:
> On Thu, Jun 07, 2018 at 03:38:49PM -0700, David Lang wrote:
> > > Again: The GDPR certainly allows you to keep a proof of copyright
> > > privately if you have it. However, it does not allow you to keep
> > > publishing it if someone exercises his right to be forgotten.
> > someone is granting the world the right to use the code and you are
> > claiming that the evidence that they have granted this right is
> > illegal to have?
>
> Hell no! Please read what I wrote:
>
> - "allows you to keep a proof ... privately"
> - "However, it does not allow you to keep publishing it"

The problem is that you've left undefined who "you" is. With an open
source project, anyone who has contributed to the project has a
copyright interest. That hobbyist in Germany who submitted a patch?
They have a copyright interest. That US company based in Redmond,
Washington? They own a copyright interest. Huawei in China? They have a
copyright interest.

So there is no "privately". And "you" numbers in the thousands and
thousands of copyright holders of portions of the open source code.

And of course, that's the other thing you seem to fundamentally not
understand about how git works. Every developer in the world working on
that open source project has their own copy. There is fundamentally no
way that you can expunge that information from every single git
repository in the world. You can remove a git note from a single
repository, but that doesn't affect my copy of the repository on my
laptop. And if I push that repository to my server, the git note will
be out there for the whole world to see.

So someone could *try* sending a public request to the entire world,
saying, "I am a European and I demand that you disassociate commit
DEADBEF12345 from my name". They could try serving legal papers on
everyone. But at this point, it's going to trigger something called the
"Streisand Effect". If you haven't heard of it, I suggest you look it
up:

http://mentalfloss.com/article/67299/how-barbra-streisand-inspired-streisand-effect

Regards,

- Ted
[L10N] Kickoff for Git 2.18.0 l10n round 2
Hi,

Git 2.18.0-rc1 has been released, and introduced 36 more messages (144
total) that need to be translated. Let's start the 2nd round of l10n
for Git 2.18.0.

The new "git.pot" is generated in commit v2.18.0-rc1:

    l10n: git.pot: v2.18.0 round 2 (144 new, 6 removed)

    Generate po/git.pot from v2.18.0-rc1 for git v2.18.0 l10n round 2.

    Signed-off-by: Jiang Xin

You can get it from the usual place:

    https://github.com/git-l10n/git-po/

For how to update your XX.po and help to translate Git, please see
"Updating a XX.po file" and other sections in the "po/README" file.

--
Jiang Xin
Re: [PATCH 0/8] Reroll of sb/diff-color-move-more
On Thu, May 17, 2018 at 12:46 PM, Stefan Beller wrote:
>>> * sb/diff-color-move-more (2018-04-25) 7 commits
>>> ...
>>>
>>> Will merge to 'next'.
>>
>> I did not get around to fix it up, there are still review
>> comments outstanding. (The test is broken in the last commit.)
>
> This is a reroll of sb/diff-color-move-more, with the test fixed as well
> as another extra patch, that would have caught the bad test.
>
> The range diff is below.
>
> Thanks,
> Stefan
>
> Stefan Beller (8):
>   xdiff/xdiff.h: remove unused flags
>   xdiff/xdiffi.c: remove unneeded function declarations
>   diff.c: do not pass diff options as keydata to hashmap
>   diff.c: adjust hash function signature to match hashmap expectation
>   diff.c: add a blocks mode for moved code detection
>   diff.c: decouple white space treatment from move detection algorithm
>   diff.c: add --color-moved-ignore-space-delta option
>   diff: color-moved white space handling options imply color-moved

I've been using this locally, and it's really nice. One question I had:
are there plans to make the whitespace options configurable? I really
like the option for enabling lines to count as moved when the
whitespace changes uniformly (it helps make changes more obvious when
doing indentation changes such as wrapping code within a block).
However, it's rather a long option name to type out. I didn't see any
obvious config options to enable it by default, though.

Thanks,
Jake
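[Editor's note: move detection itself already had a config knob at this point; a sketch of what Jake is after. The `diff.colorMoved` key is real (Git 2.15+), the "blocks" mode and the long whitespace option come from this very series, and the series as posted offered no config equivalent for the latter, so the comment below is illustrative:]

```
# ~/.gitconfig — turn on move detection by default; the whitespace-delta
# behavior still has to be requested per invocation at this point, e.g.:
#   git diff --color-moved-ignore-space-delta
[diff]
	colorMoved = blocks
```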
Re: GDPR compliance best practices?
On Fri, 8 Jun 2018, Peter Backes wrote:

> On Thu, Jun 07, 2018 at 03:38:49PM -0700, David Lang wrote:
> > > Again: The GDPR certainly allows you to keep a proof of copyright
> > > privately if you have it. However, it does not allow you to keep
> > > publishing it if someone exercises his right to be forgotten.
> > someone is granting the world the right to use the code and you are
> > claiming that the evidence that they have granted this right is
> > illegal to have?
>
> Hell no! Please read what I wrote:
>
> - "allows you to keep a proof ... privately"
> - "However, it does not allow you to keep publishing it"
>
> > And you are incorrect to say that the GDPR lets you keep records
> > privately and only applies to publishing them. The GDPR is
> > specifically targeted at companies like Facebook and Google that want
> > to keep lots of data privately. It does no good to ask Facebook to
> > not publish your info, they don't want to publish it in the first
> > place, they want to keep it internally and use it.
>
> How can you misunderstand so badly what I wrote? Sure, the GDPR does
> not let you keep records privately at will. You ultimately need to have
> overriding legitimate grounds for doing so. However, overriding
> legitimate grounds for keeping private records are rarely overriding
> legitimate grounds for publishing them.

the license is granted to the world, so the world has an interest in
it. Unless you are going to argue that the GDPR outlawed open source
development.

> In case of git history metadata, for publishing, you may have consent
> or even legitimate interests, but not overriding legitimate grounds.
> For keeping a private copy of the metadata, you probably have
> overriding legitimate grounds, however.
>
> The GDPR is not an "all or nothing" thing. Facebook and Google
> certainly do not have overriding legitimate grounds for most of the
> data they keep privately.
>
> Is it so hard to understand?

you are the one arguing that the GDPR prohibits Git from storing and
revealing this license granting data, not me.

David Lang
Re: GDPR compliance best practices?
On Thu, Jun 07, 2018 at 03:38:49PM -0700, David Lang wrote:
> > Again: The GDPR certainly allows you to keep a proof of copyright
> > privately if you have it. However, it does not allow you to keep
> > publishing it if someone exercises his right to be forgotten.
> someone is granting the world the right to use the code and you are claiming
> that the evidence that they have granted this right is illegal to have?

Hell no! Please read what I wrote:

- "allows you to keep a proof ... privately"
- "However, it does not allow you to keep publishing it"

> And you are incorrect to say that the GDPR lets you keep records privately
> and only applies to publishing them. The GDPR is specifically targeted at
> companies like Facebook and Google that want to keep lots of data privately.
> It does no good to ask Facebook to not publish your info, they don't want to
> publish it in the first place, they want to keep it internally and use it.

How can you misunderstand so badly what I wrote? Sure, the GDPR does not
let you keep records privately at will. You ultimately need to have
overriding legitimate grounds for doing so. However, overriding
legitimate grounds for keeping private records are rarely overriding
legitimate grounds for publishing them.

In case of git history metadata, for publishing, you may have consent
or even legitimate interests, but not overriding legitimate grounds.
For keeping a private copy of the metadata, you probably have
overriding legitimate grounds, however.

The GDPR is not an "all or nothing" thing. Facebook and Google
certainly do not have overriding legitimate grounds for most of the
data they keep privately.

Is it so hard to understand?

Best wishes
Peter

--
Peter Backes, r...@helen.plasma.xg8.de
Re: GDPR compliance best practices?
On Fri, 8 Jun 2018, Peter Backes wrote:

> On Thu, Jun 07, 2018 at 10:28:47PM +0100, Philip Oakley wrote:
> > Some of Peter's fine distinctions may be technically valid, but that
> > does not stop there being legal grounds. The proof of copyright is a
> > legal grounds.
>
> Again: The GDPR certainly allows you to keep a proof of copyright
> privately if you have it. However, it does not allow you to keep
> publishing it if someone exercises his right to be forgotten.

someone is granting the world the right to use the code and you are
claiming that the evidence that they have granted this right is illegal
to have?

the GDPR recognizes that there are legal reasons why records need to be
kept and does not insist that they be deleted. you can't sign a deal to
buy something, then insist that the GDPR allows your name to be removed
from the contract.

And you are incorrect to say that the GDPR lets you keep records
privately and only applies to publishing them. The GDPR is specifically
targeted at companies like Facebook and Google that want to keep lots
of data privately. It does no good to ask Facebook to not publish your
info, they don't want to publish it in the first place, they want to
keep it internally and use it.

David Lang

> There is simply no justification for publishing against the explicit
> will of the subject, except for the rare circumstances where there are
> overriding legitimate grounds for doing so. I hardly see those for the
> average author entry in your everyday git repo. Such a justification
> is extremely fragile.
>
> > Unfortunately once one gets into legal nitpicking the wording becomes
> > tortuous and helps no-one.
>
> That's not nitpicking. If what you say were true, the GDPR would be
> without any practical validity at all.
>
> > If one starts from an absolute "right to be forgotten" perspective
> > one can demand all evidence of wrong doing, or authority to do
> > something, be forgotten. The GDPR has the right to retain such
> > evidence.
>
> Yes, but not to keep it published.
>
> > I'll try and comment where I see the distinctions to be.
>
> You're essentially repeating what you already said there.
>
> > Publishing (the meta data) is *distinct* from having it.
>
> Absolutely right. That is my point.
>
> > You either start off public and stay public, or you start off private
> > and stay there.
>
> Nope. The GDPR says you have to go from public to private if the
> subject wishes so and there are no overriding legitimate grounds. That
> is the entire purpose of the GDPR's right to be forgotten.
>
> Best wishes
> Peter
Re: GDPR compliance best practices?
On Thu, Jun 07, 2018 at 10:28:47PM +0100, Philip Oakley wrote:
> Some of Peter's fine distinctions may be technically valid, but that does
> not stop there being legal grounds. The proof of copyright is a legal
> grounds.

Again: The GDPR certainly allows you to keep a proof of copyright
privately if you have it. However, it does not allow you to keep
publishing it if someone exercises his right to be forgotten.

There is simply no justification for publishing against the explicit
will of the subject, except for the rare circumstances where there are
overriding legitimate grounds for doing so. I hardly see those for the
average author entry in your everyday git repo. Such a justification is
extremely fragile.

> Unfortunately once one gets into legal nitpicking the wording becomes
> tortuous and helps no-one.

That's not nitpicking. If what you say were true, the GDPR would be
without any practical validity at all.

> If one starts from an absolute "right to be forgotten" perspective one can
> demand all evidence of wrong doing, or authority to do something, be
> forgotten. The GDPR has the right to retain such evidence.

Yes, but not to keep it published.

> I'll try and comment where I see the distinctions to be.

You're essentially repeating what you already said there.

> Publishing (the meta data) is *distinct* from having it.

Absolutely right. That is my point.

> You either start off public and stay public, or you start off private and
> stay there.

Nope. The GDPR says you have to go from public to private if the
subject wishes so and there are no overriding legitimate grounds. That
is the entire purpose of the GDPR's right to be forgotten.

Best wishes
Peter

--
Peter Backes, r...@helen.plasma.xg8.de
Re: GDPR compliance best practices?
Hi Peter, David,

I thought that the legal notice (aka 'disclaimer') was pretty
reasonable.

Some of Peter's fine distinctions may be technically valid, but that
does not stop there being legal grounds. The proof of copyright is a
legal grounds.

Unfortunately once one gets into legal nitpicking the wording becomes
tortuous and helps no-one.

If one starts from an absolute "right to be forgotten" perspective one
can demand all evidence of wrong doing, or authority to do something,
be forgotten. The GDPR has the right to retain such evidence.

I'll try and comment where I see the distinctions to be.

From: "Peter Backes"
> Hi David,
>
> thanks for your input on the issue.
>
> > LEGAL GDPR NOTICE: According to the European data protection laws
> > (GDPR), we would like to make you aware that contributing to rsyslog
> > via git will permanently store the name and email address you provide
> > as well as the actual commit and the time and date you made it inside
> > git's version history.

This is simply an information statement.

> > This is inevitable, because it is a main feature of git.

The "inevitable" word creates a point of argument within the GDPR.
Removing the word (and 'because/main') brings the sentence back to
being an informative statement without a GDPR claim.

> As we can see, rsyslog tries to solve the issue by the already
> discussed legal "technology" of disclaimers (which is certainly not
> accepted as state of the art technology by the GDPR). In essence, they
> are giving excuses for why they are not honoring the right to be
> forgotten.
>
> Disclaimers do not work. They have no legal effect, they are placebos.
> The GDPR does not accept such excuses. If it would, companies could
> arbitrarily design their data storage such as to make it "the main
> feature" to not honor the right to be forgotten and/or other GDPR
> rights. It is obvious that this cannot work, as it would completely
> undermine those rights. The GDPR honors technology as a means to
> protect the individual's rights, not as a means to subvert them.
>
> > If you are concerned about your privacy, we strongly recommend to use
> > --author "anonymous " together with your commit.

The [key] missing information here is whether rsyslog has a DCO
(Developer Certificate of Origin) and what that contains. The git.git
DCO is here:
https://github.com/git/git/blob/master/Documentation/SubmittingPatches#L304-L349

This will also help discriminate between the "name" part and the
identifier, as both could be separately anonymised (given the right
DCO). Thus it may be that the name is recorded as "anonymous", but with
a that bridges the legal evidence/right to be forgotten bridge.

> This can only be a solution if the project rejects any commits which
> are not anonymous.
>
> > However, we have valid reasons why we cannot remove that information
> > later on. The reasons are:
> >
> > * this would break git history and make future merges unworkable
>
> This is not a valid excuse (see above).

Within the GDPR, that is correct. It (breaking history validation), of
itself, should not be the reason.

> The technology has to be designed or applied in such a way that the
> individual's rights are honored, not the other way around. In absence
> of other means, the project has to rewrite history if it gets a valid
> request by someone exercising his right to be forgotten, even if that
> causes a lot of hassle for everyone.
>
> > * the rsyslog project has legitimate interest to keep a permanent
> >   record of the contributor identity, once given, for
> >   - copyright verification
> >   - being able to provide proof should a malicious commit be made
>
> True, but that doesn't justify publishing that information and keeping
> it published even when someone exercises his right to be forgotten.

Publishing (the meta data) is *distinct* from having it. However,
publishing the content and its legal copyright is also associated with
identifying the copyright holder (who has released it). This can be the
uid if they hide behind a legal entity. This creates the catch-22
scenario.

You either start off public and stay public, or you start off private
and stay there.

Whether the rsyslog folk want to accept copyrighted work without
appropriate legal release (who guards the guards, what's their badge
number?) is part of the same information requirement. Malicious intent
makes the submission (commit) part of a legal evidence one needs to
retain, so is supported by GDPR.

> In that case, "legitimate interest" is not enough. There need to be
> "overriding legitimate grounds". I don't see them here.
>
> > Please also note that your commit is public and as such will
> > potentially be processed by many third-parties. Git's distributed
> > nature makes it impossible to track where exactly your commit, and
> > thus your personal data, will be stored and be processed. If you
> > would not like to accept this risk, please do either commit
> > anonymously or refrain from contributing to the rsyslog project.

The onward publishing and release
Re: [RFC PATCH v1] telemetry design overview (part 1)
On 07.06.2018 at 16:53, g...@jeffhostetler.com wrote:
> From: Jeff Hostetler
>
> I've been working to add code to Git to optionally collect telemetry
> data. The goal is to be able to collect performance data from Git
> commands and allow it to be aggregated over a user community to find
> "slow commands".

Seriously? "add code to collect telemetry data" said by somebody whose
email address ends with @microsoft.com is very irritating. I really
don't want to have yet another switch that I must check after every
update to make sure that it is still off.

-- Hannes
Re: git grep with leading inverted bracket expression
On Thu, Jun 07 2018, Matthew Wilcox wrote:

> On Thu, Jun 07, 2018 at 09:09:25PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> On Thu, Jun 07 2018, Matthew Wilcox wrote:
>> > If the first atom of a regex is a bracket expression with an inverted
>> > range, git grep is very slow.
>>
>> I have some WIP patches to fix all of this, which I'll hopefully submit
>> before 2.19 is out the door.
>>
>> What you've discovered here is how shitty your libc regex engine is,
>> because unless you provide -P and compile with a reasonably up-to-date
>> libpcre (preferably v2) with JIT that's what you'll get.
>
> I'm using Debian's build, and it is linked against a recent libpcre2:
>
>     $ ldd /usr/lib/git-core/git
>             libpcre2-8.so.0 => /usr/lib/x86_64-linux-gnu/libpcre2-8.so.0 (0x7f59ad5f2000)
>     $ dpkg --status libpcre2-8-0
>     Version: 10.31-3
>
> But I wasn't using -P. If I do, then I see the performance numbers you do:
>
>     $ time git grep -P '[^t]truct_size' >/dev/null
>     real    0m0.354s
>     user    0m0.340s
>     sys     0m0.639s
>
>     $ time git grep -P 'struct_size' >/dev/null
>     real    0m0.336s
>     user    0m0.552s
>     sys     0m0.457s
>
>     $ time git grep 'struct_size' >/dev/null
>     real    0m0.335s
>     user    0m0.535s
>     sys     0m0.474s
>
>> So you need to just use an up-to-date libpcre2 & -P and performance
>> won't suck.

Yeah that's recent enough & will get you all the benefits.

> I don't tend to use terribly advanced regexps, so I'll just set
> grep.patternType to 'perl' and then it'll automatically be fast for me
> without your patches ;-)

Indeed, if you're happy with that that'll do it.

>> My WIP patches will make us use PCRE for all grep modes, using an API it
>> has to convert basic & extended regexp syntax to its own syntax, so
>> we'll be able to do that transparently.
>
> That's clearly the right answer. Thanks!

Yeah, unfortunately git-grep's default is "basic" regexp which has a
really atrocious syntax that's different enough from extended & Perl's
that we probably couldn't just switch it over.

That won't be needed with my patches, but maybe I'll follow up with
something to s/basic/extended/g by default, because one side effect of
having the pattern converter is that we could have a warning whenever
the user has a pattern that would be different under extended/perl, so
we can see how common that is.
Re: git grep with leading inverted bracket expression
On Thu, Jun 07, 2018 at 09:09:25PM +0200, Ævar Arnfjörð Bjarmason wrote:
> On Thu, Jun 07 2018, Matthew Wilcox wrote:
> > If the first atom of a regex is a bracket expression with an inverted
> > range, git grep is very slow.
>
> I have some WIP patches to fix all of this, which I'll hopefully submit
> before 2.19 is out the door.
>
> What you've discovered here is how shitty your libc regex engine is,
> because unless you provide -P and compile with a reasonably up-to-date
> libpcre (preferably v2) with JIT that's what you'll get.

I'm using Debian's build, and it is linked against a recent libpcre2:

    $ ldd /usr/lib/git-core/git
            libpcre2-8.so.0 => /usr/lib/x86_64-linux-gnu/libpcre2-8.so.0 (0x7f59ad5f2000)
    $ dpkg --status libpcre2-8-0
    Version: 10.31-3

But I wasn't using -P. If I do, then I see the performance numbers you do:

    $ time git grep -P '[^t]truct_size' >/dev/null
    real    0m0.354s
    user    0m0.340s
    sys     0m0.639s

    $ time git grep -P 'struct_size' >/dev/null
    real    0m0.336s
    user    0m0.552s
    sys     0m0.457s

    $ time git grep 'struct_size' >/dev/null
    real    0m0.335s
    user    0m0.535s
    sys     0m0.474s

> So you need to just use an up-to-date libpcre2 & -P and performance
> won't suck.

I don't tend to use terribly advanced regexps, so I'll just set
grep.patternType to 'perl' and then it'll automatically be fast for me
without your patches ;-)

> My WIP patches will make us use PCRE for all grep modes, using an API it
> has to convert basic & extended regexp syntax to its own syntax, so
> we'll be able to do that transparently.

That's clearly the right answer. Thanks!
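[Editor's note: Matthew's one-key workaround is the documented `grep.patternType` setting, which makes every `git grep` behave as if -P had been passed; a minimal ~/.gitconfig fragment:]

```
# ~/.gitconfig — default git grep to the PCRE engine, equivalent to
# passing -P/--perl-regexp on every invocation.
[grep]
	patternType = perl
```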
Re: git grep with leading inverted bracket expression
On Thu, Jun 07 2018, Matthew Wilcox wrote:

> If the first atom of a regex is a bracket expression with an inverted
> range, git grep is very slow.
>
>     $ time git grep 'struct_size' >/dev/null
>
>     real    0m0.368s
>     user    0m0.563s
>     sys     0m0.453s
>
>     $ time git grep '[^t]truct_size' >/dev/null
>
>     real    0m31.529s
>     user    1m54.909s
>     sys     0m0.805s
>
> If the bracket expression is moved to even the second position in the
> string, it runs much faster:
>
>     $ time git grep 's[^p]ruct_size' >/dev/null
>
>     real    0m3.989s
>     user    0m13.939s
>     sys     0m0.403s
>
> It's pretty bad with even a '.' as the first character:
>
>     $ time git grep '.truct_size' >/dev/null
>
>     real    0m14.514s
>     user    0m52.624s
>     sys     0m0.598s
>
>     $ git --version
>     git version 2.17.1
>
> Setting LANG=C improves matters by a factor of 3-4 (depending if you
> count real or user time):
>
>     $ time git grep '[^t]truct_size' >/dev/null
>     real    0m10.035s
>     user    0m28.795s
>     sys     0m0.537s
>
> (this is using something pretty close to Linus' current HEAD of the
> linux repository, an i7-7500, 16GB memory).

I have some WIP patches to fix all of this, which I'll hopefully submit
before 2.19 is out the door.

What you've discovered here is how shitty your libc regex engine is,
because unless you provide -P and compile with a reasonably up-to-date
libpcre (preferably v2) with JIT that's what you'll get.

The reason stuff like 'struct_size' is so much faster is because there
we don't use any regex engine at all, but rather an optimized
fixed-string searcher.

With our own benchmarks modified per your E-Mail:

    diff --git a/t/perf/p7820-grep-engines.sh b/t/perf/p7820-grep-engines.sh
    index 8b09c5bf32..fe4c5681da 100755
    --- a/t/perf/p7820-grep-engines.sh
    +++ b/t/perf/p7820-grep-engines.sh
    @@ -28,11 +28,10 @@ then
     fi
     
     for pattern in \
    -	'how.to' \
    -	'^how to' \
    -	'[how] to' \
    -	'\(e.t[^ ]*\|v.ry\) rare' \
    -	'm\(ú\|u\)lt.b\(æ\|y\)te'
    +	'struct size' \
    +	'[^t]truct_size' \
    +	's[^p]ruct_size' \
    +	'.truct_size'
     do
     	for engine in basic extended perl
     	do

I get these results against linux.git:

    $ GIT_PERF_LARGE_REPO=~/g/linux ./run p7820-grep-engines.sh
    [...]
    Test                                      this tree
    ---------------------------------------------------------
    7820.1: basic grep 'struct size'          0.23(0.52+0.76)
    7820.2: extended grep 'struct size'       0.22(0.60+0.61)
    7820.3: perl grep 'struct size'           0.22(0.56+0.65)
    7820.5: basic grep '[^t]truct_size'       4.29(29.43+0.51)
    7820.6: extended grep '[^t]truct_size'    4.27(29.59+0.36)
    7820.7: perl grep '[^t]truct_size'        0.21(0.40+0.69)
    7820.9: basic grep 's[^p]ruct_size'       0.49(2.22+0.49)
    7820.10: extended grep 's[^p]ruct_size'   0.43(2.24+0.48)
    7820.11: perl grep 's[^p]ruct_size'       0.21(0.38+0.71)
    7820.13: basic grep '.truct_size'         4.42(31.29+0.44)
    7820.14: extended grep '.truct_size'      4.50(31.18+0.46)
    7820.15: perl grep '.truct_size'          0.21(0.35+0.75)

So you need to just use an up-to-date libpcre2 & -P and performance
won't suck.

My WIP patches will make us use PCRE for all grep modes, using an API it
has to convert basic & extended regexp syntax to its own syntax, so
we'll be able to do that transparently.
[PATCH 2/2] pack-bitmap: add free function
Add a function to free struct bitmap_index instances, and use it where needed (except when rebuild_existing_bitmaps() is used, since it creates references to the bitmaps within the struct bitmap_index passed to it). Note that the hashes field in struct bitmap_index is not freed because it points to another field within the same struct. The documentation for that field has been updated to clarify that. Signed-off-by: Jonathan Tan --- builtin/pack-objects.c | 1 + builtin/rev-list.c | 2 ++ pack-bitmap-write.c| 4 pack-bitmap.c | 35 +-- pack-bitmap.h | 1 + 5 files changed, 37 insertions(+), 6 deletions(-) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index d064f944b..896e41300 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -2945,6 +2945,7 @@ static int get_object_list_from_bitmap(struct rev_info *revs) } traverse_bitmap_commit_list(bitmap_git, _object_entry_from_bitmap); + free_bitmap_index(bitmap_git); return 0; } diff --git a/builtin/rev-list.c b/builtin/rev-list.c index cce42ae1d..62776721f 100644 --- a/builtin/rev-list.c +++ b/builtin/rev-list.c @@ -521,6 +521,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix) if (max_count >= 0 && max_count < commit_count) commit_count = max_count; printf("%d\n", commit_count); + free_bitmap_index(bitmap_git); return 0; } } else if (revs.max_count < 0 && @@ -528,6 +529,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix) struct bitmap_index *bitmap_git; if ((bitmap_git = prepare_bitmap_walk())) { traverse_bitmap_commit_list(bitmap_git, _object_fast); + free_bitmap_index(bitmap_git); return 0; } } diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c index 03e122563..7896fedd3 100644 --- a/pack-bitmap-write.c +++ b/pack-bitmap-write.c @@ -367,6 +367,10 @@ void bitmap_writer_reuse_bitmaps(struct packing_data *to_pack) writer.reused = kh_init_sha1(); rebuild_existing_bitmaps(bitmap_git, to_pack, writer.reused, writer.show_progress); + /* +* NEEDSWORK: 
rebuild_existing_bitmaps() makes writer.reused reference +* some bitmaps in bitmap_git, so we can't free the latter. +*/ } static struct ewah_bitmap *find_reused_bitmap(const unsigned char *sha1) diff --git a/pack-bitmap.c b/pack-bitmap.c index 7795444b0..c06b19f49 100644 --- a/pack-bitmap.c +++ b/pack-bitmap.c @@ -66,7 +66,7 @@ struct bitmap_index { /* Number of bitmapped commits */ uint32_t entry_count; - /* Name-hash cache (or NULL if not present). */ + /* If not NULL, this is a name-hash cache pointing into map. */ uint32_t *hashes; /* @@ -350,6 +350,7 @@ struct bitmap_index *prepare_bitmap_git(void) if (!open_pack_bitmap(bitmap_git) && !load_pack_bitmap(bitmap_git)) return bitmap_git; + free_bitmap_index(bitmap_git); return NULL; } @@ -690,7 +691,7 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs) /* try to open a bitmapped pack, but don't parse it yet * because we may not need to use it */ if (open_pack_bitmap(bitmap_git) < 0) - return NULL; + goto cleanup; for (i = 0; i < revs->pending.nr; ++i) { struct object *object = revs->pending.objects[i].item; @@ -723,11 +724,11 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs) * optimize here */ if (haves && !in_bitmapped_pack(bitmap_git, haves)) - return NULL; + goto cleanup; /* if we don't want anything, we're done here */ if (!wants) - return NULL; + goto cleanup; /* * now we're going to use bitmaps, so load the actual bitmap entries @@ -735,7 +736,7 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs) * becomes invalidated and we must perform the revwalk through bitmaps */ if (!bitmap_git->loaded && load_pack_bitmap(bitmap_git) < 0) - return NULL; + goto cleanup; object_array_clear(>pending); @@ -761,6 +762,10 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs) bitmap_free(haves_bitmap); return bitmap_git; + +cleanup: + free_bitmap_index(bitmap_git); + return NULL; } int reuse_partial_packfile_from_bitmap(struct bitmap_index *bitmap_git, @@ -1001,7 
+1006,7 @@ void
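The ownership rule the commit message above describes (free the fields a struct owns, but skip fields like `hashes` that merely alias another field) can be shown in a self-contained miniature. Everything below is illustrative: the struct and function names are invented for this sketch and are not git's actual `bitmap_index` code, and a malloc'd buffer stands in for the mmap'd file.

```c
#include <stdlib.h>

/* Hypothetical miniature of the bitmap_index ownership pattern:
 * `hashes` points into `map`, so only `map` itself is freed. */
struct mini_index {
	unsigned char *map;           /* owned: the whole "mapped" file */
	size_t map_size;
	const unsigned char *hashes;  /* borrowed: aliases map, never freed */
};

struct mini_index *mini_index_open(size_t size)
{
	struct mini_index *m = calloc(1, sizeof(*m));
	if (!m)
		return NULL;
	m->map = malloc(size);
	if (!m->map) {
		free(m);
		return NULL;
	}
	m->map_size = size;
	m->hashes = m->map + size / 2; /* alias into the owned buffer */
	return m;
}

void free_mini_index(struct mini_index *m)
{
	if (!m)
		return;       /* safe on NULL, like free() */
	free(m->map);         /* also releases the memory hashes points at */
	/* no free(m->hashes): it aliases map, freeing it would double-free */
	free(m);
}
```

The same reasoning explains why the patch documents `hashes` as "pointing into map" rather than freeing it separately.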
[PATCH 0/2] Object store refactoring: make bitmap_git not global
This is a continuation of the object store refactoring effort. We cannot truly free an object store without ensuring that any generated bitmaps are first freed, so here are patches to drastically reduce the lifetime of any bitmaps generated. As a bonus, the API is also improved, and global state is reduced.

Jonathan Tan (2):
  pack-bitmap: remove bitmap_git global variable
  pack-bitmap: add free function

 builtin/pack-objects.c |   7 +-
 builtin/rev-list.c     |  13 +-
 pack-bitmap-write.c    |  10 +-
 pack-bitmap.c          | 344 -
 pack-bitmap.h          |  20 ++-
 5 files changed, 234 insertions(+), 160 deletions(-)

-- 
2.17.0.768.g1526ddbba1.dirty
[PATCH 1/2] pack-bitmap: remove bitmap_git global variable
Remove the bitmap_git global variable. Instead, generate on demand an instance of struct bitmap_index for code that needs to access it. This allows us significant control over the lifetime of instances of struct bitmap_index. In particular, packs can now be closed without worrying if an unnecessarily long-lived "pack" field in struct bitmap_index still points to it. The bitmap API is also clearer in that we need to first obtain a struct bitmap_index, then we use it. Helped-by: Stefan Beller Signed-off-by: Jonathan Tan --- builtin/pack-objects.c | 6 +- builtin/rev-list.c | 11 +- pack-bitmap-write.c| 6 +- pack-bitmap.c | 317 ++--- pack-bitmap.h | 19 ++- 5 files changed, 201 insertions(+), 158 deletions(-) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 71056d829..d064f944b 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -2929,11 +2929,13 @@ static int pack_options_allow_reuse(void) static int get_object_list_from_bitmap(struct rev_info *revs) { - if (prepare_bitmap_walk(revs) < 0) + struct bitmap_index *bitmap_git; + if (!(bitmap_git = prepare_bitmap_walk(revs))) return -1; if (pack_options_allow_reuse() && !reuse_partial_packfile_from_bitmap( + bitmap_git, _packfile, _packfile_objects, _packfile_offset)) { @@ -2942,7 +2944,7 @@ static int get_object_list_from_bitmap(struct rev_info *revs) display_progress(progress_state, nr_result); } - traverse_bitmap_commit_list(_object_entry_from_bitmap); + traverse_bitmap_commit_list(bitmap_git, _object_entry_from_bitmap); return 0; } diff --git a/builtin/rev-list.c b/builtin/rev-list.c index fadd3ec14..cce42ae1d 100644 --- a/builtin/rev-list.c +++ b/builtin/rev-list.c @@ -16,6 +16,7 @@ #include "reflog-walk.h" #include "oidset.h" #include "packfile.h" +#include "object-store.h" static const char rev_list_usage[] = "git rev-list [OPTION] ... [ -- paths... 
]\n" @@ -514,8 +515,9 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix) if (revs.count && !revs.left_right && !revs.cherry_mark) { uint32_t commit_count; int max_count = revs.max_count; - if (!prepare_bitmap_walk()) { - count_bitmap_commit_list(_count, NULL, NULL, NULL); + struct bitmap_index *bitmap_git; + if ((bitmap_git = prepare_bitmap_walk())) { + count_bitmap_commit_list(bitmap_git, _count, NULL, NULL, NULL); if (max_count >= 0 && max_count < commit_count) commit_count = max_count; printf("%d\n", commit_count); @@ -523,8 +525,9 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix) } } else if (revs.max_count < 0 && revs.tag_objects && revs.tree_objects && revs.blob_objects) { - if (!prepare_bitmap_walk()) { - traverse_bitmap_commit_list(_object_fast); + struct bitmap_index *bitmap_git; + if ((bitmap_git = prepare_bitmap_walk())) { + traverse_bitmap_commit_list(bitmap_git, _object_fast); return 0; } } diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c index 7b2dc3e7d..03e122563 100644 --- a/pack-bitmap-write.c +++ b/pack-bitmap-write.c @@ -360,11 +360,13 @@ static int date_compare(const void *_a, const void *_b) void bitmap_writer_reuse_bitmaps(struct packing_data *to_pack) { - if (prepare_bitmap_git() < 0) + struct bitmap_index *bitmap_git; + if (!(bitmap_git = prepare_bitmap_git())) return; writer.reused = kh_init_sha1(); - rebuild_existing_bitmaps(to_pack, writer.reused, writer.show_progress); + rebuild_existing_bitmaps(bitmap_git, to_pack, writer.reused, +writer.show_progress); } static struct ewah_bitmap *find_reused_bitmap(const unsigned char *sha1) diff --git a/pack-bitmap.c b/pack-bitmap.c index 06771113f..7795444b0 100644 --- a/pack-bitmap.c +++ b/pack-bitmap.c @@ -25,14 +25,14 @@ struct stored_bitmap { }; /* - * The currently active bitmap index. By design, repositories only have + * The active bitmap index for a repository. 
By design, repositories only have * a single bitmap index available (the index for the biggest packfile in * the repository), since bitmap indexes need full closure. * * If there is more than one bitmap index available (e.g. because of alternates), * the
Re: [PATCH v7 2/2] json-writer: t0019: add Python unit test
On 07/06/18 03:23, Jeff King wrote: > On Thu, Jun 07, 2018 at 01:16:14AM +0100, Ramsay Jones wrote: > >>> Probably. We may want to go the same route as we did for perl in >>> a0e0ec9f7d (t: provide a perl() function which uses $PERL_PATH, >>> 2013-10-28) so that test writers don't have to remember this. >>> >>> That said, I wonder if it would be hard to simply do the python bits >>> here in perl. This is the first use of python in our test scripts (and >> >> Hmm, not quite the _first_ use: >> >> $ git grep PYTHON_PATH -- t >> t/lib-git-p4.sh:(cd / && "$PYTHON_PATH" -c 'import time; >> print(int(time.time()))') >> t/lib-git-p4.sh:"$PYTHON_PATH" "$TRASH_DIRECTORY/marshal-dump.py" >> t/t9020-remote-svn.sh:export PATH PYTHON_PATH GIT_BUILD_DIR >> t/t9020-remote-svn.sh:exec "$PYTHON_PATH" >> "$GIT_BUILD_DIR/contrib/svn-fe/svnrdump_sim.py" "$@" >> t/t9802-git-p4-filetype.sh: "$PYTHON_PATH" >> "$TRASH_DIRECTORY/k_smush.py" <"$cli/k-text-k" >cli-k-text-k-smush && >> t/t9802-git-p4-filetype.sh: "$PYTHON_PATH" >> "$TRASH_DIRECTORY/ko_smush.py" <"$cli/k-text-ko" >cli-k-text-ko-smush && >> t/t9802-git-p4-filetype.sh: "$PYTHON_PATH" >> "$TRASH_DIRECTORY/gendouble.py" >%double.png && >> t/t9810-git-p4-rcs.sh: "$PYTHON_PATH" "$TRASH_DIRECTORY/scrub_k.py" >> <"$git/$file" >"$scrub" && >> t/t9810-git-p4-rcs.sh: "$PYTHON_PATH" "$TRASH_DIRECTORY/scrub_ko.py" >> <"$git/$file" >"$scrub" && >> $ > > OK, the first for a feature that is not already written in python > (leading into my second claim that python is used only for fringe > commands ;) ). > > Though maybe I am wrong that the remote-svn stuff requires python. I > thought it did, but poking around, it looks like it's all C, and just > the "svnrdump_sim" helper is python. Heh, I was a bit tired last night, so although I knew that I required python to be installed to run the test-suite, I could not remember why. As soon as I went to bed, ... 
:) I recently installed Ubuntu 18.04, so that I could get a heads-up on any possible problems later this month when I install Linux Mint 19. In order to get the test-suite to run, I had to set 'PYTHON_PATH = /usr/bin/python3' in my config.mak file (yes, I could have set NO_PYTHON, but you get the idea).

Ubuntu 18.04 no longer installs python2 by default (it has moved to python3), but presumably you can still install it as '/usr/bin/python' (I didn't check). In any event, it was t9020-remote-svn.sh that was failing, which, despite the name, does not depend on 'svn'. As you already noted, it does run svnrdump_sim.py and git-remote-testsvn.

> At any rate, I think the point still stands that perl is our main
> scripting language. I'd rather keep us to that unless there's a good
> reason not to.

Agreed. And I see it has come to pass ... :-D

ATB,
Ramsay Jones
Re: [PATCH 06/23] midx: struct midxed_git and 'read' subcommand
On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee wrote:
> diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
> index dcaeb1a91b..919283fdd8 100644
> --- a/Documentation/git-midx.txt
> +++ b/Documentation/git-midx.txt
> @@ -23,6 +23,11 @@ OPTIONS
>         /packs/multi-pack-index for the current MIDX file, and
>         /packs for the pack-files to index.
>
> +read::
> +       When given as the verb, read the current MIDX file and output
> +       basic information about its contents. Used for debugging
> +       purposes only.

On second thought, if you just need a temporary debugging interface, adding a program in t/helper may be a better option. In the end we might still need 'read' to dump a file out, but we should have some stable output format (and json might be a good choice).

That's it, I'm done for today. I will continue on the rest some day, hopefully.
-- 
Duy
Re: [PATCH 09/23] midx: write pack names in chunk
On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee wrote: > @@ -74,6 +80,31 @@ struct midxed_git *load_midxed_git(const char *object_dir) > m->num_chunks = *(m->data + 6); > m->num_packs = get_be32(m->data + 8); > > + for (i = 0; i < m->num_chunks; i++) { > + uint32_t chunk_id = get_be32(m->data + 12 + > MIDX_CHUNKLOOKUP_WIDTH * i); > + uint64_t chunk_offset = get_be64(m->data + 16 + > MIDX_CHUNKLOOKUP_WIDTH * i); Would be good to reduce magic numbers like 12 and 16, I think you have some header length constants for those already. > + switch (chunk_id) { > + case MIDX_CHUNKID_PACKNAMES: > + m->chunk_pack_names = m->data + chunk_offset; > + break; > + > + case 0: > + die("terminating MIDX chunk id appears > earlier than expected"); _() > + break; > + > + default: > + /* > +* Do nothing on unrecognized chunks, > allowing future > +* extensions to add optional chunks. > +*/ I wrote about the chunk term reminding me of PNG format then deleted it. But it may help to do similar to PNG here. The first letter can let us know if the chunk is optional and can be safely ignored. E.g. uppercase first letter cannot be ignored, lowercase go wild. > + break; > + } > + } > + > + if (!m->chunk_pack_names) > + die("MIDX missing required pack-name chunk"); _() > + > return m; > > cleanup_fail: > @@ -99,18 +130,88 @@ static size_t write_midx_header(struct hashfile *f, > return MIDX_HEADER_SIZE; > } > > +struct pack_pair { > + uint32_t pack_int_id; can this be just pack_id? 
> + char *pack_name; > +}; > + > +static int pack_pair_compare(const void *_a, const void *_b) > +{ > + struct pack_pair *a = (struct pack_pair *)_a; > + struct pack_pair *b = (struct pack_pair *)_b; > + return strcmp(a->pack_name, b->pack_name); > +} > + > +static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, > uint32_t *perm) > +{ > + uint32_t i; > + struct pack_pair *pairs; > + > + ALLOC_ARRAY(pairs, nr_packs); > + > + for (i = 0; i < nr_packs; i++) { > + pairs[i].pack_int_id = i; > + pairs[i].pack_name = pack_names[i]; > + } > + > + QSORT(pairs, nr_packs, pack_pair_compare); > + > + for (i = 0; i < nr_packs; i++) { > + pack_names[i] = pairs[i].pack_name; > + perm[pairs[i].pack_int_id] = i; > + } pairs[] is leaked? > +} > + > +static size_t write_midx_pack_names(struct hashfile *f, > + char **pack_names, > + uint32_t num_packs) > +{ > + uint32_t i; > + unsigned char padding[MIDX_CHUNK_ALIGNMENT]; > + size_t written = 0; > + > + for (i = 0; i < num_packs; i++) { > + size_t writelen = strlen(pack_names[i]) + 1; > + > + if (i && strcmp(pack_names[i], pack_names[i - 1]) <= 0) > + BUG("incorrect pack-file order: %s before %s", > + pack_names[i - 1], > + pack_names[i]); > + > + hashwrite(f, pack_names[i], writelen); > + written += writelen; side note. This pattern happens a lot. 
It may be a good idea to make hashwrite() return writelen so we can just write written += hashwrite(f, ..., writelen); > + } > + > + /* add padding to be aligned */ > + i = MIDX_CHUNK_ALIGNMENT - (written % MIDX_CHUNK_ALIGNMENT); > + if (i < MIDX_CHUNK_ALIGNMENT) { > + bzero(padding, sizeof(padding)); > + hashwrite(f, padding, i); > + written += i; > + } > + > + return written; > +} > + > int write_midx_file(const char *object_dir) > { > - unsigned char num_chunks = 0; > + unsigned char cur_chunk, num_chunks = 0; > char *midx_name; > struct hashfile *f; > struct lock_file lk; > struct packed_git **packs = NULL; > + char **pack_names = NULL; > + uint32_t *pack_perm; > uint32_t i, nr_packs = 0, alloc_packs = 0; > + uint32_t alloc_pack_names = 0; > DIR *dir; > struct dirent *de; > struct strbuf pack_dir = STRBUF_INIT; > size_t pack_dir_len; > + uint64_t pack_name_concat_len = 0; > + uint64_t written = 0; > + uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1]; > + uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1]; This long list of local vars may be a good indicator that this function needs split up into smaller ones. > > midx_name =
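Two of the review suggestions above (replace the magic 12/16 offsets with named constants, and borrow PNG's convention where the case of a chunk id's first byte says whether it may be ignored) can be sketched together in standalone C. The constant names and the 12-byte header / 12-byte lookup-entry widths below follow the MIDX draft format as quoted in the patch, but this is an assumption-laden illustration, not the series' actual code.

```c
#include <stdint.h>
#include <stddef.h>

/* Named offsets instead of bare 12/16: the chunk lookup table starts
 * right after the header, and each entry is a 4-byte id followed by
 * an 8-byte offset (so the offset lives at entry + 4, i.e. the "16"). */
#define MIDX_HEADER_SIZE       12
#define MIDX_CHUNKLOOKUP_WIDTH 12   /* 4-byte id + 8-byte offset */

static uint32_t get_be32_(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | (p[1] << 16) | (p[2] << 8) | p[3];
}

uint32_t chunk_id_at(const unsigned char *data, int i)
{
	return get_be32_(data + MIDX_HEADER_SIZE + MIDX_CHUNKLOOKUP_WIDTH * i);
}

/* PNG-style convention floated in the review: an uppercase first byte
 * marks a chunk the reader must understand, a lowercase one marks a
 * chunk that can be safely skipped. */
int chunk_is_required(uint32_t id)
{
	unsigned char first = (id >> 24) & 0xff;
	return first >= 'A' && first <= 'Z';
}
```

With this, the "do nothing on unrecognized chunks" default in the patch could be tightened to "skip only if `!chunk_is_required(chunk_id)`, otherwise die".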
Re: [PATCH 08/23] midx: read packfiles from pack directory
On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee wrote: > @@ -114,14 +119,56 @@ int write_midx_file(const char *object_dir) > midx_name); > } > > + strbuf_addf(_dir, "%s/pack", object_dir); > + dir = opendir(pack_dir.buf); > + > + if (!dir) { > + error_errno("unable to open pack directory: %s", > + pack_dir.buf); _() > + strbuf_release(_dir); > + return 1; > + } > + > + strbuf_addch(_dir, '/'); > + pack_dir_len = pack_dir.len; > + ALLOC_ARRAY(packs, alloc_packs); > + while ((de = readdir(dir)) != NULL) { > + if (is_dot_or_dotdot(de->d_name)) > + continue; > + > + if (ends_with(de->d_name, ".idx")) { > + ALLOC_GROW(packs, nr_packs + 1, alloc_packs); > + > + strbuf_setlen(_dir, pack_dir_len); > + strbuf_addstr(_dir, de->d_name); > + > + packs[nr_packs] = add_packed_git(pack_dir.buf, > +pack_dir.len, > +0); > + if (!packs[nr_packs]) > + warning("failed to add packfile '%s'", > + pack_dir.buf); > + else > + nr_packs++; > + } > + } > + closedir(dir); > + strbuf_release(_dir); Can we refactor and share this scanning-for-packs code with packfile.c? I'm pretty sure it does something similar in there. > - write_midx_header(f, num_chunks, num_packs); > + write_midx_header(f, num_chunks, nr_packs); Hmm.. could have stuck to one name from the beginning... -- Duy
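The scanning loop the review wants shared with packfile.c boils down to a readdir() walk that collects entries ending in ".idx". Here is a standalone miniature of that shape; the function names are invented, `ends_with_()` stands in for git's `ends_with()`, and the error return stands in for the `error_errno()` reporting the review asks for.

```c
#include <dirent.h>
#include <string.h>

static int ends_with_(const char *s, const char *suffix)
{
	size_t ls = strlen(s), lf = strlen(suffix);
	return ls >= lf && !strcmp(s + ls - lf, suffix);
}

/* Walk a pack directory and count the .idx files; a real version
 * would open each as a packed_git, as the quoted patch does. */
int count_idx_files(const char *path)
{
	DIR *dir = opendir(path);
	struct dirent *de;
	int n = 0;

	if (!dir)
		return -1;  /* caller reports with error_errno() in git */
	while ((de = readdir(dir)) != NULL)
		if (ends_with_(de->d_name, ".idx"))
			n++;
	closedir(dir);
	return n;
}
```

Factoring this walk into one helper would let both the midx writer and packfile.c's prepare-packs path share the same directory-scanning behavior.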
Re: [PATCH 06/23] midx: struct midxed_git and 'read' subcommand
On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee wrote: > As we build the multi-pack-index feature by adding chunks at a time, > we want to test that the data is being written correctly. > > Create struct midxed_git to store an in-memory representation of a A word play on 'packed_git'? Amusing. Some more descriptive name would be better though. midxed looks almost like random letters thrown together. > multi-pack-index and a memory-map of the binary file. Initialize this > struct in load_midxed_git(object_dir). > +static int read_midx_file(const char *object_dir) > +{ > + struct midxed_git *m = load_midxed_git(object_dir); > + > + if (!m) > + return 0; This looks like an error case, please don't just return zero, typically used to say "success". I don't know if this command stays "for debugging purposes" until the end. Of course in that case it does not really matter. > +struct midxed_git *load_midxed_git(const char *object_dir) > +{ > + struct midxed_git *m; > + int fd; > + struct stat st; > + size_t midx_size; > + void *midx_map; > + const char *midx_name = get_midx_filename(object_dir); mem leak? This function returns allocated memory if I remember correctly. > + > + fd = git_open(midx_name); > + if (fd < 0) > + return NULL; do an error_errno() so we know what went wrong at least. > + if (fstat(fd, )) { > + close(fd); > + return NULL; same here, we should know why fstat() fails. > + } > + midx_size = xsize_t(st.st_size); > + > + if (midx_size < MIDX_MIN_SIZE) { > + close(fd); > + die("multi-pack-index file %s is too small", midx_name); _() The use of die() should be discouraged though. Many people still try (or wish) to libify code and new die() does not help. I think error() here would be enough then you can return NULL. Or you can go fancier and store the error string in a strbuf like refs code. 
> + } > + > + midx_map = xmmap(NULL, midx_size, PROT_READ, MAP_PRIVATE, fd, 0); > + > + m = xcalloc(1, sizeof(*m) + strlen(object_dir) + 1); > + strcpy(m->object_dir, object_dir); > + m->data = midx_map; > + > + m->signature = get_be32(m->data); > + if (m->signature != MIDX_SIGNATURE) { > + error("multi-pack-index signature %X does not match signature > %X", > + m->signature, MIDX_SIGNATURE); _(). Maybe 0x%08x instead of %x > + goto cleanup_fail; > + } > + > + m->version = *(m->data + 4); m->data[4] instead? shorter and easier to understand. Same comment on "*(m->data + x)" and error() without _() for the rest. > + if (m->version != MIDX_VERSION) { > + error("multi-pack-index version %d not recognized", > + m->version); _() > + goto cleanup_fail; > + } > + > + m->hash_version = *(m->data + 5); m->data[5] > +cleanup_fail: > + FREE_AND_NULL(m); > + munmap(midx_map, midx_size); > + close(fd); > + exit(1); It's bad enough that you die() but exit() in this code seems too much. Please just return NULL and let the caller handle the error. > diff --git a/midx.h b/midx.h > index 3a63673952..a1d18ed991 100644 > --- a/midx.h > +++ b/midx.h > @@ -1,4 +1,13 @@ > +#ifndef MIDX_H > +#define MIDX_H > + > +#include "git-compat-util.h" > #include "cache.h" > +#include "object-store.h" I don't really think you need object-store here (git-compat-util.h too). "struct mixed_git;" would be enough for load_midxed_git declaration below. > #include "packfile.h" > > +struct midxed_git *load_midxed_git(const char *object_dir); > + > int write_midx_file(const char *object_dir); > + > +#endif > diff --git a/object-store.h b/object-store.h > index d683112fd7..77cb82621a 100644 > --- a/object-store.h > +++ b/object-store.h > @@ -84,6 +84,25 @@ struct packed_git { > char pack_name[FLEX_ARRAY]; /* more */ > }; > > +struct midxed_git { > + struct midxed_git *next; Do we really have multiple midx files? 
> + > + int fd; > + > + const unsigned char *data; > + size_t data_len; > + > + uint32_t signature; > + unsigned char version; > + unsigned char hash_version; > + unsigned char hash_len; > + unsigned char num_chunks; > + uint32_t num_packs; > + uint32_t num_objects; > + > + char object_dir[FLEX_ARRAY]; Why do you need to keep object_dir when it could be easily retrieved when the repo is available? > +}; > + > struct raw_object_store { > /* > * Path to the repository's object store. -- Duy
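The error-handling style the review asks for (report with error(), funnel every failure through one cleanup label, and return NULL instead of die()/exit() so libified callers can recover) looks like this in a self-contained sketch. `error_()` here is a stand-in for git's `error()`, which prints and returns -1; the handle struct and signature value are illustrative only.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static int error_(const char *msg)
{
	fprintf(stderr, "error: %s\n", msg);
	return -1;  /* like git's error(), report but let the caller decide */
}

#define GOOD_SIGNATURE 0x4d494458u  /* "MIDX" */

struct handle { uint32_t signature; };

struct handle *load_handle(uint32_t signature)
{
	struct handle *h = calloc(1, sizeof(*h));
	if (!h)
		return NULL;
	if (signature != GOOD_SIGNATURE) {
		error_("signature mismatch");
		goto cleanup;   /* single cleanup path, no die()/exit(1) */
	}
	h->signature = signature;
	return h;
cleanup:
	free(h);
	return NULL;            /* caller decides whether this is fatal */
}
```

The same shape fixes both complaints at once: the `exit(1)` under `cleanup_fail` disappears, and the caller of `load_midxed_git()` can distinguish "no midx file" from "corrupt midx file" if the error strings are stored rather than printed.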
Re: [PATCH 05/23] midx: write header information to lockfile
On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee wrote:
> +static char *get_midx_filename(const char *object_dir)
> +{
> +       struct strbuf midx_name = STRBUF_INIT;
> +       strbuf_addstr(&midx_name, object_dir);
> +       strbuf_addstr(&midx_name, "/pack/multi-pack-index");
> +       return strbuf_detach(&midx_name, NULL);
> +}

I think this whole function can be written as

    xstrfmt("%s/pack/multi-pack-index", object_dir);

> +
> +static size_t write_midx_header(struct hashfile *f,
> +                               unsigned char num_chunks,
> +                               uint32_t num_packs)
> +{
> +       char byte_values[4];

unsigned char just to be on the safe side? 'char' is signed on ARM if I remember correctly.
-- 
Duy
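git's `xstrfmt()` (allocate-and-format in one call) is not available outside the tree, but the one-liner the review suggests is easy to mirror with plain snprintf. This stand-in shows the same shape; `fmt_path()` is an invented name for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build "<object_dir>/pack/multi-pack-index" in one allocation,
 * mirroring xstrfmt("%s/pack/multi-pack-index", object_dir). */
char *fmt_path(const char *object_dir)
{
	const char *suffix = "/pack/multi-pack-index";
	size_t n = strlen(object_dir) + strlen(suffix) + 1;
	char *buf = malloc(n);

	if (!buf)
		return NULL;
	snprintf(buf, n, "%s%s", object_dir, suffix);
	return buf;  /* caller frees, like strbuf_detach()'s result */
}
```

Either way, the caller still owns the returned string, which is the leak Duy flags in the later load_midxed_git() review.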
Re: [PATCH 04/23] midx: add 'write' subcommand and basic wiring
On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee wrote: > diff --git a/builtin/midx.c b/builtin/midx.c > index 59ea92178f..dc0a5acd3f 100644 > --- a/builtin/midx.c > +++ b/builtin/midx.c > @@ -3,9 +3,10 @@ > #include "config.h" > #include "git-compat-util.h" > #include "parse-options.h" > +#include "midx.h" > > static char const * const builtin_midx_usage[] ={ > - N_("git midx [--object-dir ]"), > + N_("git midx [--object-dir ] [write]"), > NULL > }; > > @@ -34,5 +35,11 @@ int cmd_midx(int argc, const char **argv, const char > *prefix) > if (!opts.object_dir) > opts.object_dir = get_object_directory(); > > + if (argc == 0) > + return 0; Isn't it better to die here when no verb is given? I don't see any good use case for running a no-op "git midx" without verbs. It's more likely a mistake (e.g. "git midx $foo" where foo happens to be empty) > + > + if (!strcmp(argv[0], "write")) > + return write_midx_file(opts.object_dir); > + > return 0; > } > diff --git a/midx.c b/midx.c > new file mode 100644 > index 00..616af66b13 > --- /dev/null > +++ b/midx.c > @@ -0,0 +1,9 @@ > +#include "git-compat-util.h" > +#include "cache.h" Only one of the two is needed > +#include "dir.h" Not needed yet. It's better to include it in the patch that actually needs it. > +#include "midx.h" > + > +int write_midx_file(const char *object_dir) > +{ > + return 0; > +} > diff --git a/midx.h b/midx.h > new file mode 100644 > index 00..3a63673952 > --- /dev/null > +++ b/midx.h > @@ -0,0 +1,4 @@ > +#include "cache.h" > +#include "packfile.h" These includes are not needed, at least not now. And please protect the header file with #ifndef __MINDX_H__ .. #endif. > + > +int write_midx_file(const char *object_dir); > diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh > new file mode 100755 > index 00..a590137af7 > --- /dev/null > +++ b/t/t5319-midx.sh > @@ -0,0 +1,10 @@ > +#!/bin/sh > + > +test_description='multi-pack-indexes' > +. 
./test-lib.sh > + > +test_expect_success 'write midx with no pakcs' ' no packs > + git midx --object-dir=. write > +' > + > +test_done > -- > 2.18.0.rc1 > -- Duy
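The dispatch change the review asks for ("die when no verb is given rather than silently succeed") is a small guard at the top of the verb switch. A standalone sketch, with an invented `dispatch()` and 129 standing in for git's usage-error exit code (what `usage_with_options()` produces):

```c
#include <string.h>

/* With no verb, fail loudly: "git midx" with an accidentally empty
 * argument (e.g. "git midx $foo" where foo is unset) is more likely
 * a mistake than a request for a no-op. */
int dispatch(int argc, const char **argv)
{
	if (argc == 0)
		return 129;        /* would be usage_with_options() in git */
	if (!strcmp(argv[0], "write"))
		return 0;          /* would call write_midx_file() */
	return 129;                /* unknown verb: also a usage error */
}
```

Rejecting unknown verbs (the final branch) is the natural companion to rejecting the missing verb, since the quoted patch currently returns 0 for both.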
Re: [PATCH v8 1/2] json_writer: new routines to create data in JSON format
On Thu, Jun 7, 2018 at 10:12 AM, wrote: > Add a series of jw_ routines and "struct json_writer" structure to compose > JSON data. The resulting string data can then be output by commands wanting > to support a JSON output format. > [...] > Signed-off-by: Jeff Hostetler > --- > diff --git a/t/t0019-json-writer.sh b/t/t0019-json-writer.sh > @@ -0,0 +1,236 @@ > +test_expect_success 'simple object' ' > + cat >expect <<-\EOF && > + {"a":"abc","b":42,"c":3.14,"d":true,"e":false,"f":null} > + EOF > + test-json-writer >actual \ > + @object \ > + @object-string a abc \ > + @object-int b 42 \ > + @object-double c 2 3.140 \ > + @object-true d \ > + @object-false e \ > + @object-null f \ > + @end && > + test_cmp expect actual > +' To make it easier on people writing these tests, it might be nice for this to be less noisy by getting rid of "@" and "\". To get rid of "\", the test program could grab its script commands from stdin (one instruction per line) rather than from argv[]. For instance: test-json-writer >actual <<-\EOF && object object-string a abc ... end EOF Not a big deal, and certainly not worth a re-roll.
Re: [PATCH 03/23] midx: add midx builtin
On Thu, Jun 7, 2018 at 4:03 PM, Derrick Stolee wrote: > diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt > new file mode 100644 > index 00..2bd886f1a2 > --- /dev/null > +++ b/Documentation/git-midx.txt > @@ -0,0 +1,29 @@ > +git-midx(1) > + > + > +NAME > + > +git-midx - Write and verify multi-pack-indexes (MIDX files). No full stop. This head line is collected automatically with others and its having a full stop while the rest does not looks strange/ > diff --git a/builtin/midx.c b/builtin/midx.c > new file mode 100644 > index 00..59ea92178f > --- /dev/null > +++ b/builtin/midx.c > @@ -0,0 +1,38 @@ > +#include "builtin.h" > +#include "cache.h" > +#include "config.h" > +#include "git-compat-util.h" You only need either cache.h or git-compat-util.h. If cache.h is here, git-compat-util can be removed. > +#include "parse-options.h" > + > +static char const * const builtin_midx_usage[] ={ > + N_("git midx [--object-dir ]"), > + NULL > +}; > + > +static struct opts_midx { > + const char *object_dir; > +} opts; > + > +int cmd_midx(int argc, const char **argv, const char *prefix) > +{ > + static struct option builtin_midx_options[] = { > + { OPTION_STRING, 0, "object-dir", _dir, For paths (including dir), OPTION_FILENAME may be a better option to handle correctly when the command is run in a subdir. See df217ed643 (parse-opts: add OPT_FILENAME and transition builtins - 2009-05-23) for more info. > + N_("dir"), > + N_("The object directory containing set of packfile and > pack-index pairs.") }, Other help strings do not have full stop either (I only checked a couple commands though) Also, doesn't OPT_STRING() work here too (if you avoid OPTION_FILENAME for some reason)? 
> + OPT_END(), > + }; > + > + if (argc == 2 && !strcmp(argv[1], "-h")) > + usage_with_options(builtin_midx_usage, builtin_midx_options); > + > + git_config(git_default_config, NULL); > + > + argc = parse_options(argc, argv, prefix, > +builtin_midx_options, > +builtin_midx_usage, 0); > + > + if (!opts.object_dir) > + opts.object_dir = get_object_directory(); > + > + return 0; > +} > diff --git a/git.c b/git.c > index c2f48d53dd..400fadd677 100644 > --- a/git.c > +++ b/git.c > @@ -503,6 +503,7 @@ static struct cmd_struct commands[] = { > { "merge-recursive-theirs", cmd_merge_recursive, RUN_SETUP | > NEED_WORK_TREE | NO_PARSEOPT }, > { "merge-subtree", cmd_merge_recursive, RUN_SETUP | NEED_WORK_TREE | > NO_PARSEOPT }, > { "merge-tree", cmd_merge_tree, RUN_SETUP | NO_PARSEOPT }, > + { "midx", cmd_midx, RUN_SETUP }, If it's a plumbing and can take an --object-dir, then I don't think you should require it to run in a repo (with RUN_SETUP). RUN_SETUP_GENTLY may be better. You could even leave it empty here and only call setup_git_directory() only when --object-dir is not set. > { "mktag", cmd_mktag, RUN_SETUP | NO_PARSEOPT }, > { "mktree", cmd_mktree, RUN_SETUP }, > { "mv", cmd_mv, RUN_SETUP | NEED_WORK_TREE }, -- Duy
[RFC PATCH 2/3] rebase: Implement --merge via git-rebase--interactive
Interactive rebases are implemented in terms of cherry-pick rather than the merge-recursive builtin, but cherry-pick also calls into the recursive merge machinery by default and can accept special merge strategies and/or special strategy options. As such, there really is not any need for having both git-rebase--merge and git-rebase--interactive anymore. Delete git-rebase--merge.sh and have the --merge option call into git-rebase--interactive.

Add one new variable ($actually_interactive) to help keep cases separate, and adjust the detection of the special circumstances under which a rebase boils down to a simple fast-forward so that it keeps working with the new structure.

Note that this change fixes a few known test failures (see t3421).

Testcase modification notes:

t3406: --interactive and --merge had slightly different progress output
       while running; adjust a test to match

t3420: tests of precise output while running, but rebase--am,
       rebase--merge, and rebase--interactive were all built on very
       different commands (am, merge-recursive, cherry-pick), so the
       tests expected different output for each type. Now we expect
       --merge and --interactive to have the same output.

t3421: --interactive fixes some bugs in --merge! Wahoo!

t3425: topology linearization was inconsistent across flavors of
       rebase, as already noted in a TODO comment in the testcase.
       This was not considered a bug before, so getting a different
       linearization due to switching out backends should not be
       considered a bug now.

t5407: different rebase types varied slightly in how many times
       checkout or commit or equivalents were called, based on a quick
       comparison of this test and previous ones which covered
       different rebase flavors. I think this is just attributable to
       this difference.
Signed-off-by: Elijah Newren --- .gitignore| 1 - Documentation/git-rebase.txt | 16 +-- Makefile | 1 - git-rebase--interactive.sh| 5 +- git-rebase--merge.sh | 164 -- git-rebase.sh | 49 - t/t3406-rebase-message.sh | 7 +- t/t3420-rebase-autostash.sh | 78 ++ t/t3421-rebase-topology-linear.sh | 10 +- t/t3425-rebase-topology-merges.sh | 6 +- t/t5407-post-rewrite-hook.sh | 1 + 11 files changed, 55 insertions(+), 283 deletions(-) delete mode 100644 git-rebase--merge.sh diff --git a/.gitignore b/.gitignore index 388cc4beee..747be69d10 100644 --- a/.gitignore +++ b/.gitignore @@ -118,7 +118,6 @@ /git-rebase--am /git-rebase--helper /git-rebase--interactive -/git-rebase--merge /git-receive-pack /git-reflog /git-remote diff --git a/Documentation/git-rebase.txt b/Documentation/git-rebase.txt index 451252c173..28d1658d7a 100644 --- a/Documentation/git-rebase.txt +++ b/Documentation/git-rebase.txt @@ -275,6 +275,10 @@ branch on top of the branch. Because of this, when a merge conflict happens, the side reported as 'ours' is the so-far rebased series, starting with , and 'theirs' is the working branch. In other words, the sides are swapped. ++ +This uses the `--interactive` machinery internally, and as such, +anything that is incompatible with --interactive is incompatible +with this option. -s :: --strategy=:: @@ -328,8 +332,8 @@ which makes little sense. and after each change. When fewer lines of surrounding context exist they all must match. By default no context is ever ignored. - Incompatible with the --merge and --interactive options, or - anything that implies those options or their machinery. + Incompatible with the --interactive option, or anything that + uses the `--interactive` machinery. -f:: --force-rebase:: @@ -361,15 +365,15 @@ default is `--no-fork-point`, otherwise the default is `--fork-point`. --whitespace=:: These flag are passed to the 'git apply' program (see linkgit:git-apply[1]) that applies the patch. 
- Incompatible with the --merge and --interactive options, or - anything that implies those options or their machinery. + Incompatible with the --interactive option, or anything that + uses the `--interactive` machinery. --committer-date-is-author-date:: --ignore-date:: These flags are passed to 'git am' to easily change the dates of the rebased commits (see linkgit:git-am[1]). - Incompatible with the --merge and --interactive options, or - anything that implies those options or their machinery. + Incompatible with the --interactive option, or anything that + uses the `--interactive` machinery. --signoff:: Add a Signed-off-by: trailer to all the rebased commits. Note diff --git a/Makefile b/Makefile index 1d27f36365..8d24aaf638 100644 --- a/Makefile +++ b/Makefile @@ -620,7 +620,6 @@
[RFC PATCH 3/3] git-rebase.sh: make git-rebase--interactive the default
am-based rebases suffer from a reduced ability to detect directory renames upstream, a limitation fundamental to the fact that they throw away information about the history: in particular, they dispense with the original commits involved by turning them into patches, and without the knowledge of the original commits we cannot determine a proper merge base.

The am-based rebase will proceed by first trying git-apply and, only if that fails, will try to reconstruct a provisional merge base using build_fake_ancestor(). This fake ancestor will ONLY contain the files modified in the patch; without the full list of files in the real merge base, renames of any missing files cannot be detected. Directory rename detection works by looking at individual file renames and deducing when a full directory has been renamed.

Trying to modify build_fake_ancestor() to instead create a merge base that includes common file information (by looking for a commit that contained all the same blobs as the pre-image of the patch) would require a very expensive search through history, may find the wrong commit, and outside of rebase may not find any commit that matches. But we had all the relevant information to start with.

So, instead of attempting to fix am, which just doesn't have the relevant information, note its strengths and weaknesses, and change the default rebase machinery to interactive, since it does not suffer from the same problems.

Documentation updates: several documentation updates were made as part of previous patches to note which sets of options were incompatible. However, adding a new --am option allows us to make it clear which options imply this machinery and simplify the discussion of incompatible sets of options significantly.
testcase modification notes:

t3400: git-rebase--interactive.sh explicitly captures output and adds
       die "could not detach HEAD"; adjust to match

t3401: fixes directory rename detection problems for rebase by default,
       though --am still fails on it

t3404: testcase said it was for a non-interactive rebase, so add --am

t3407: instead of having one test per rebase type, also add an extra one
       for the default rebase type. That'll duplicate --merge for now, but
       will make switching default machinery again in the future clearer,
       if we ever choose to do so. Also, previously there was a
       conspicuously missing test for all --quit combinations.

t3420: command output varies by type of rebase and this test script has
       lots of helper functions providing the appropriate expectation.
       Make sure we call the relevant one.

t3425: Similar to t3407, add default rebase type to list of types to test.
       See also comments about t3425 in the switchover of --merge to the
       interactive machinery.

t5407: This suite has tests for different rebase types, so specify some
       are --am ones.

t5520: interactive rebases capture and reprint informational message about
       conflicts to stdout rather than stderr. Also, patches for
       interactive rebases are stored differently under .git than for am
       rebases.

t9903: Add another testcase for am rebases. Prompt for default rebases is
       now REBASE-m.
Signed-off-by: Elijah Newren --- Documentation/git-rebase.txt | 26 -- git-rebase.sh | 30 +- t/t3400-rebase.sh | 7 --- t/t3401-rebase-and-am-rename.sh | 18 +- t/t3404-rebase-interactive.sh | 2 +- t/t3407-rebase-abort.sh | 28 +++- t/t3420-rebase-autostash.sh | 15 ++- t/t3425-rebase-topology-merges.sh | 9 ++--- t/t5407-post-rewrite-hook.sh | 4 ++-- t/t5520-pull.sh | 9 ++--- t/t9903-bash-prompt.sh| 12 +++- 11 files changed, 117 insertions(+), 43 deletions(-) diff --git a/Documentation/git-rebase.txt b/Documentation/git-rebase.txt index 28d1658d7a..6cfb7479fd 100644 --- a/Documentation/git-rebase.txt +++ b/Documentation/git-rebase.txt @@ -240,13 +240,16 @@ leave out at most one of A and B, in which case it defaults to HEAD. original branch. The index and working tree are also left unchanged as a result. +--am: + Use git-am internally to rebase. This may run faster, but have + problems with rename detection on the upstream side. + Incompatible with the --interactive option, or anything that + uses the `--interactive` machinery. + --keep-empty:: Keep the commits that do not change anything from its parents in the result. -+ -This uses the `--interactive` machinery internally, and as such, -anything that is incompatible with --interactive is incompatible -with this option. + This uses the `--interactive` machinery internally. --allow-empty-message:: By default, rebasing commits with
[RFC PATCH 0/3] Send more rebases through interactive machinery
This series: * deletes git-rebase--merge, making git-rebase--interactive handle those cases instead * adds an --am option to git rebase, and changes the rebase default from using git-rebase--am to git-rebase--interactive, fixing directory rename detection for default rebases. However: * this series has several minor conflicts with ag/rebase-p in pu. But it's just an RFC for now; I can re-roll on top of that after getting some feedback. * this series depends on the other preparatory fixups under https://public-inbox.org/git/CABPp-BGxaroePB6aKWAkZeADLB7VE3y1CPy2RyNwpn=+c01...@mail.gmail.com/ To get this series with the preparatory fixups: Fetch-It-Via: git fetch https://github.com/newren/git rebase-new-default Elijah Newren (3): git-rebase, sequencer: add a --quiet option for the interactive machinery rebase: Implement --merge via git-rebase--interactive git-rebase.sh: make git-rebase--interactive the default .gitignore| 1 - Documentation/git-rebase.txt | 24 +++-- Makefile | 1 - git-rebase--interactive.sh| 14 ++- git-rebase--merge.sh | 164 -- git-rebase.sh | 75 -- sequencer.c | 19 ++-- sequencer.h | 1 + Yes, there are a significant number of testcase changes below. See the relevant commit messages where I explain why each was changed. t/t3400-rebase.sh | 7 +- t/t3401-rebase-and-am-rename.sh | 18 +++- t/t3404-rebase-interactive.sh | 2 +- t/t3406-rebase-message.sh | 7 +- t/t3407-rebase-abort.sh | 28 - t/t3420-rebase-autostash.sh | 89 +++- t/t3421-rebase-topology-linear.sh | 10 +- t/t3425-rebase-topology-merges.sh | 15 +-- t/t5407-post-rewrite-hook.sh | 5 +- t/t5520-pull.sh | 9 +- t/t9903-bash-prompt.sh| 12 ++- 19 files changed, 180 insertions(+), 321 deletions(-) delete mode 100644 git-rebase--merge.sh -- 2.18.0.rc1.12.g2996b9442d
[RFC PATCH 1/3] git-rebase, sequencer: add a --quiet option for the interactive machinery
While 'quiet' and 'interactive' may sound like antonyms, the interactive
machinery actually has logic that implements several
interactive_rebase=implied cases (--exec, --keep-empty, --rebase-merges)
which won't pop up an editor. Further, we want to make the interactive
machinery also take over for git-rebase--merge and become the default merge
strategy, so it makes sense for these other cases to have a quiet option.

git-rebase--interactive was already somewhat quieter than git-rebase--merge
and git-rebase--am, possibly because cherry-pick has just traditionally been
quieter. As such, we only drop a few informational messages -- "Rebasing
(n/m)" and "Successfully rebased..."

Signed-off-by: Elijah Newren --- git-rebase--interactive.sh | 9 +++-- git-rebase.sh | 2 +- sequencer.c| 19 +-- sequencer.h| 1 + 4 files changed, 22 insertions(+), 9 deletions(-) diff --git a/git-rebase--interactive.sh b/git-rebase--interactive.sh index 06a7b79307..1f2401f702 100644 --- a/git-rebase--interactive.sh +++ b/git-rebase--interactive.sh @@ -139,8 +139,12 @@ mark_action_done () { if test "$last_count" != "$new_count" then last_count=$new_count - eval_gettext "Rebasing (\$new_count/\$total)"; printf "\r" - test -z "$verbose" || echo + if test -z "$GIT_QUIET" + then + eval_gettext "Rebasing (\$new_count/\$total)"; + printf "\r" + test -z "$verbose" || echo + fi fi } @@ -713,6 +717,7 @@ Commit or stash your changes, and then run "$hook" rebase < "$rewritten_list" true # we don't care if this hook failed fi && + test -z "$GIT_QUIET" && warn "$(eval_gettext "Successfully rebased and updated \$head_name.")" return 1 # not failure; just to break the do_rest loop diff --git a/git-rebase.sh b/git-rebase.sh index 7d1612b31b..b639c0d4fe 100755 --- a/git-rebase.sh +++ b/git-rebase.sh @@ -136,7 +136,7 @@ write_basic_state () { echo "$head_name" > "$state_dir"/head-name && echo "$onto" > "$state_dir"/onto && echo "$orig_head" > "$state_dir"/orig-head && - echo "$GIT_QUIET" > "$state_dir"/quiet && + 
test t = "$GIT_QUIET" && : > "$state_dir"/quiet test t = "$verbose" && : > "$state_dir"/verbose test -n "$strategy" && echo "$strategy" > "$state_dir"/strategy test -n "$strategy_opts" && echo "$strategy_opts" > \ diff --git a/sequencer.c b/sequencer.c index a2843e3906..d418d40bee 100644 --- a/sequencer.c +++ b/sequencer.c @@ -143,6 +143,7 @@ static GIT_PATH_FUNC(rebase_path_refs_to_delete, "rebase-merge/refs-to-delete") static GIT_PATH_FUNC(rebase_path_gpg_sign_opt, "rebase-merge/gpg_sign_opt") static GIT_PATH_FUNC(rebase_path_orig_head, "rebase-merge/orig-head") static GIT_PATH_FUNC(rebase_path_verbose, "rebase-merge/verbose") +static GIT_PATH_FUNC(rebase_path_quiet, "rebase-merge/quiet") static GIT_PATH_FUNC(rebase_path_signoff, "rebase-merge/signoff") static GIT_PATH_FUNC(rebase_path_head_name, "rebase-merge/head-name") static GIT_PATH_FUNC(rebase_path_onto, "rebase-merge/onto") @@ -2251,6 +2252,9 @@ static int read_populate_opts(struct replay_opts *opts) if (file_exists(rebase_path_verbose())) opts->verbose = 1; + if (file_exists(rebase_path_quiet())) + opts->quiet = 1; + if (file_exists(rebase_path_signoff())) { opts->allow_ff = 0; opts->signoff = 1; @@ -3171,10 +3175,11 @@ static int pick_commits(struct todo_list *todo_list, struct replay_opts *opts) fprintf(f, "%d\n", todo_list->done_nr); fclose(f); } - fprintf(stderr, "Rebasing (%d/%d)%s", - todo_list->done_nr, - todo_list->total_nr, - opts->verbose ? "\n" : "\r"); + if (!opts->quiet) + fprintf(stderr, "Rebasing (%d/%d)%s", + todo_list->done_nr, + todo_list->total_nr, + opts->verbose ? "\n" : "\r"); } unlink(rebase_path_message()); unlink(rebase_path_author_script()); @@ -3385,8 +3390,10 @@ static int pick_commits(struct todo_list *todo_list, struct replay_opts *opts) } apply_autostash(opts); - fprintf(stderr, "Successfully rebased and updated %s.\n", -
Re: [PATCH v8 2/2] json-writer: t0019: add perl unit test
On Thu, Jun 7, 2018 at 10:12 AM, wrote:
> Test json-writer output using perl script.
>
> Signed-off-by: Jeff Hostetler
> ---
> diff --git a/t/t0019/parse_json.perl b/t/t0019/parse_json.perl
> @@ -0,0 +1,52 @@
> +#!/usr/bin/perl
> +use strict;
> +use warnings;
> +use JSON;

This new script is going to have to be protected by a prerequisite since
the JSON module is not part of the standard Perl installation, thus will
not necessarily be installed everywhere (it isn't on any of my machines,
for instance). Something like:

    test_lazy_prereq PERLJSON '
        perl -MJSON -e "exit 0"
    '

which would be used like this:

    test_expect_success PERLJSON 'parse JSON using Perl' '
        ...
    '
Re: [PATCH 00/23] Multi-pack-index (MIDX)
On 6/7/2018 10:45 AM, Ævar Arnfjörð Bjarmason wrote: On Thu, Jun 07 2018, Derrick Stolee wrote: To test the performance in this situation, I created a script that organizes the Linux repository in a similar fashion. I split the commit history into 50 parts by creating branches on every 10,000 commits of the first- parent history. Then, `git rev-list --objects A ^B` provides the list of objects reachable from A but not B, so I could send that to `git pack-objects` to create these "time-based" packfiles. With these 50 packfiles (deleting the old one from my fresh clone, and deleting all tags as they were no longer on-disk) I could then test 'git rev-list --objects HEAD^{tree}' and see: Before: 0.17s After: 0.13s % Diff: -23.5% By adding logic to count hits and misses to bsearch_pack, I was able to see that the command above calls that method 266,930 times with a hit rate of 33%. The MIDX has the same number of calls with a 100% hit rate. Do you have the script you used for this? It would be very interesting as something we could stick in t/perf/ to test this use-case in the future. How does this & the numbers below compare to just a naïve --max-pack-size= on linux.git? Is it possible for you to tar this test repo up and share it as a one-off? I've been polishing the core.validateAbbrev series I have, and it would be interesting to compare some of the (abbrev) numbers. Here is what I used. You will want to adjust your constants for whatever repo you are using. This is for the Linux kernel which has a first-parent history of ~50,000 commits. It also leaves a bunch of extra files around, so it is nowhere near incorporating into the code. 
#!/bin/bash

for i in `seq 1 50`
do
	ORDER=$((51 - $i))
	NUM_BACK=$((1000 * ($i - 1)))
	echo creating batch/$ORDER
	git branch -f batch/$ORDER HEAD~$NUM_BACK
	echo batch/$ORDER
	git rev-parse batch/$ORDER
done

lastbranch=""
for i in `seq 1 50`
do
	branch=batch/$i
	if [ "$lastbranch" = "" ]
	then
		echo "$branch"
		git rev-list --objects $branch | sed 's/ .*//' >objects-$i.txt
	else
		echo "$lastbranch"
		echo "$branch"
		git rev-list --objects $branch ^$lastbranch | sed 's/ .*//' >objects-$i.txt
	fi
	git pack-objects --no-reuse-delta .git/objects/pack/branch-split2 <objects-$i.txt
	lastbranch=$branch
done

for tag in `git tag --list`
do
	git tag -d $tag
done

rm -rf .git/objects/pack/pack-*

git midx write
git grep with leading inverted bracket expression
If the first atom of a regex is a bracket expression with an inverted
range, git grep is very slow.

$ time git grep 'struct_size' >/dev/null

real	0m0.368s
user	0m0.563s
sys	0m0.453s

$ time git grep '[^t]truct_size' >/dev/null

real	0m31.529s
user	1m54.909s
sys	0m0.805s

If the bracket expression is moved to even the second position in the
string, it runs much faster:

$ time git grep 's[^p]ruct_size' >/dev/null

real	0m3.989s
user	0m13.939s
sys	0m0.403s

It's pretty bad with even a '.' as the first character:

$ time git grep '.truct_size' >/dev/null

real	0m14.514s
user	0m52.624s
sys	0m0.598s

$ git --version
git version 2.17.1

Setting LANG=C improves matters by a factor of 3-4 (depending if you
count real or user time):

$ time git grep '[^t]truct_size' >/dev/null

real	0m10.035s
user	0m28.795s
sys	0m0.537s

(this is using something pretty close to Linus' current HEAD of the
linux repository, an i7-7500, 16GB memory).
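The asymmetry is consistent with the regex engine losing its literal-prefix fast path: with a literal first byte it can skip ahead (memchr/Boyer-Moore style) to candidate offsets, while a leading bracket expression or '.' forces a match attempt at every position. A small Python sketch, illustrative only: it shows the three patterns select the same lines; the cost difference only appears inside an engine scanning a large corpus.

```python
import re

# Three-line corpus; "struct_size" occurs preceded by 's' and by 'x'.
corpus = [
    "int n = struct_size(p, elem, count);",
    "int m = xstruct_size(q, elem, count);",
    "nothing to see here",
]

def count_matching_lines(pattern):
    rx = re.compile(pattern)
    return sum(1 for line in corpus if rx.search(line))

# All three patterns select the same two lines; only the first one
# starts with a literal byte an engine can fast-skip to.
print(count_matching_lines(r"struct_size"))      # 2
print(count_matching_lines(r"[^t]truct_size"))   # 2
print(count_matching_lines(r".truct_size"))      # 2
```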
Re: [PATCH v4 04/23] unpack-tress: convert clear_ce_flags* to avoid the_index
On Thu, Jun 7, 2018 at 9:41 AM, Elijah Newren wrote: > Subject line: unpack-trees rather than unpack-tress. > > > > On Wed, Jun 6, 2018 at 9:49 AM, Nguyễn Thái Ngọc Duy > wrote: >> Prior to fba92be8f7, this code implicitly (and incorrectly) assumes >> the_index when running the exclude machinery. fba92be8f7 helps show >> this problem clearer because unpack-trees operation is supposed to >> work on whatever index the caller specifies... not specifically >> the_index. >> >> Update the code to use "istate" argument that's originally from >> mark_new_skip_worktree(). From the call sites, both in unpack_trees(), >> you can see that this function works on two separate indexes: >> o->src_index and o->result. The second mark_new_skip_worktree() so far >> has incorecctly applied exclude rules on o->src_index instead of >> o->result. It's unclear what is the consequences of this, but it's >> definitely wrong. >> >> [1] fba92be8f7 (dir: convert is_excluded_from_list to take an index - >> 2017-05-05) >> >> Signed-off-by: Nguyễn Thái Ngọc Duy > > A somewhat curious finding: when I was rebuilding and re-testing all > 23 patches, I got a failure on this patch in test 31 of > t7063-status-untracked-cache.sh. I did not get any test failures with > any of the other patches. However, after re-running that test or the > whole suite half a dozen times with just up to this patch applied, I > was not able to trigger the failure again. Is there a rare race in > that testcase? Untracked cache tests are very time-sensitive. I'll try to run and re-run them a couple times to understand more. Thanks for pointing it out. -- Duy
Re: [RFC PATCH 0/5] format-patch: automate cover letter range-diff
On Thu, Jun 7, 2018 at 10:34 AM, Eric Sunshine wrote:
> On Wed, Jun 6, 2018 at 3:16 PM, Duy Nguyen wrote:
>> On Wed, May 30, 2018 at 10:03 AM, Eric Sunshine
>> wrote:
>>> Dscho recently implemented a 'tbdiff' replacement as a Git builtin named
>>> git-branch-diff[1] which computes differences between two versions of a
>>> patch series. Such a diff can be a useful aid for reviewers when
>>> inserted into a cover letter. However, doing so requires manual
>>> generation (invoking git-branch-diff) and copy/pasting the result into
>>> the cover letter.
>>
>> Another option which I wanted to go is delegate part of cover letter
>> generation to a hook (or just a config key that contains a shell
>> command). This way it's easier to customize cover letters. We could
>> still have a good fallback that does shortlog, diffstat and tbdiff.
>
> It is common on this mailing list to turn down requests for new hooks
> when the requested functionality could just as easily be implemented
> via a wrapper script. So, my knee-jerk reaction is that a hook to
> customize the cover letter may be overkill when the same functionality
> could likely be implemented relatively easily by a shell script which
> invokes git-format-patch and customizes the cover letter
> after-the-fact. Same argument regarding a config key holding a shell
> command. But, perhaps there are cases which don't occur to me which
> could be helped by a config variable or such.

I think format-patch --cover-letter nowadays does more stuff, so it's not
so easy to simply rewrite it in a shell script. My original problem with
format-patch is that it hard-codes shortlog settings and you can't list
patches with patch numbers (e.g. "[1/2] foo bar"). The simplest way is to
let format-patch do its stuff as usual and "outsource" some of the cover
letter's body generation to a script. But it's ok. I could try to code the
patch numbering thing in format-patch and maybe submit a patch or two for
that later.
> Of course, by the same reasoning, the --range-diff functionality > implemented by this patch series, which is just a convenience, could > be handled by a wrapper script, thus is not strictly needed. On the > other hand, given that interdiffs and range-diffs are so regularly > used in re-rolls on this list (and perhaps other mailing list-based > projects) may be argument enough in favor of having such an option > built into git-format-patch. -- Duy
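For what it's worth, the wrapper-script approach being debated is genuinely small. Here is a hypothetical sketch of the splice step only: `*** BLURB HERE ***` is the placeholder git format-patch writes into a generated cover letter, while the cover and range-diff contents below are made up for illustration.

```python
def splice_after_blurb(cover_lines, section_lines, marker="*** BLURB HERE ***"):
    """Insert section_lines right after the cover-letter blurb placeholder."""
    out = []
    for line in cover_lines:
        out.append(line)
        if line.strip() == marker:
            out.append("")          # blank separator before the new section
            out.extend(section_lines)
    return out

# A skeletal cover letter of the shape format-patch generates.
cover = [
    "Subject: [PATCH v2 0/3] example series",
    "",
    "*** BLURB HERE ***",
    "",
    "Elijah Newren (3):",
    "  first patch",
]
# Pretend this came from git-branch-diff/tbdiff against v1.
rangediff = [
    "Range-diff against v1:",
    "1:  1111111 = 1:  2222222 first patch",
]
print("\n".join(splice_after_blurb(cover, rangediff)))
```

A real wrapper would run git-format-patch first, then rewrite the `0000-cover-letter.patch` file in place with something like this.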
[RFC PATCH v1] telemetry design overview (part 1)
From: Jeff Hostetler

I've been working to add code to Git to optionally collect telemetry data.
The goal is to be able to collect performance data from Git commands and
allow it to be aggregated over a user community to find "slow commands".

I'm going to break this up into several parts rather than sending one
large patch series. I think it is easier to review in pieces and in
stages.

Part 1 contains the overall design documentation. Part 2 will contain the
basic telemetry event mechanism and the cmd_start and cmd_exit events.
I'll post part 2 shortly -- as soon as I can convert my tests from python
to perl. :-)

Jeff Hostetler (1):
  telemetry: design documentation

 Documentation/technical/telemetry.txt | 475 ++
 1 file changed, 475 insertions(+)
 create mode 100644 Documentation/technical/telemetry.txt

-- 
2.9.3
[RFC PATCH v1] telemetry: design documentation
From: Jeff Hostetler

Create design documentation to describe the telemetry feature.

Signed-off-by: Jeff Hostetler
---
 Documentation/technical/telemetry.txt | 475 ++
 1 file changed, 475 insertions(+)
 create mode 100644 Documentation/technical/telemetry.txt

diff --git a/Documentation/technical/telemetry.txt b/Documentation/technical/telemetry.txt
new file mode 100644
index 000..0a708ad
--- /dev/null
+++ b/Documentation/technical/telemetry.txt
@@ -0,0 +1,475 @@
+Telemetry Design Notes
+==
+
+The telemetry feature allows Git to generate structured telemetry data
+for executed commands.  Data includes command line arguments, execution
+times, error codes and messages, and information about child processes.
+
+Structured data is produced in a JSON-like format.  (See the UTF-8 related
+"limitations" described in json-writer.h)
+
+Telemetry data can be written to a local file or sent to a dynamically
+loaded shared library via a plugin API.
+
+The telemetry feature is similar to the existing trace API (defined in
+Documentation/technical/api-trace.txt).  Telemetry events are generated
+throughout the life of a Git command just like trace messages.  But whereas
+trace messages are essentially developer debug messages, telemetry
+events are intended for logging and automated analysis.
+
+The goal of the telemetry feature is to be able to gather usage data across
+a group of production users to identify real-world performance problems in
+production.  Additionally, it might help identify common user errors and
+guide future user training.
+
+By default, telemetry is disabled.  Telemetry is controlled using config
+settings (see "telemetry.*" in Documentation/config.txt).
+
+
+Telemetry Events
+
+
+Telemetry data is generated as a series of events.  Each event is written
+as a self-describing JSON object.
+
+Events: cmd_start and cmd_exit
+--
+
+The `cmd_start` event is emitted at the very beginning of the git.exe process
+in cmd_main() and the `cmd_exit` event is emitted at the end of the process in
+the atexit cleanup routine.
+
+For example, running "git version" produces:
+
+{
+  "event_name": "cmd_start",
+  "argv": [
+    "C:\\work\\gfw\\git.exe",
+    "version"
+  ],
+  "clock": 1525978509976086000,
+  "pid": 25460,
+  "git_version": "2.17.0.windows.1",
+  "telemetry_version": "1",
+  "session_id": "1525978509976086000-25460"
+}
+{
+  "event_name": "cmd_exit",
+  "argv": [
+    "C:\\work\\gfw\\git.exe",
+    "version"
+  ],
+  "clock": 1525978509980903391,
+  "pid": 25460,
+  "git_version": "2.17.0.windows.1",
+  "telemetry_version": "1",
+  "session_id": "1525978509976086000-25460",
+  "is_interactive": false,
+  "exit_code": 0,
+  "elapsed_time_core": 0.004814,
+  "elapsed_time_total": 0.004817,
+  "builtin": {
+    "name": "version"
+  }
+}
+
+Fields common to all events:
+ * `event_name` is the name of the event.
+ * `argv` is the array of command line arguments.
+ * `clock` is the time of the event in nanoseconds since the epoch.
+ * `pid` is the process id.
+ * `git_version` is the git version string.
+ * `telemetry_version` is the version of the telemetry format.
+ * `session_id` is described in a later section.
+
+Additional fields in cmd_exit:
+ * `is_interactive` is true if git.exe spawned an interactive child process,
+   such as a pager, editor, prompt, or gui tool.
+ * `exit_code` is the value passed to exit() from main().
+ * `error_message` (not shown) is the array of error messages.
+ * `elapsed_time_core` measures the time in seconds until exit() was called.
+ * `elapsed_time_total` measures the time until the atexit() routine starts
+   (which will include time spent in other atexit() routines cleaning up
+   child processes and etc.).
+ * `alias` (not shown) is the updated argv after alias expansion.
+ * `builtin.name` is the canonical command name (from the cmd_struct[]
+   table) of a builtin command.
+ * `builtin.mode` (not shown) is shown for some commands that have different
+   major modes and performance times.  For example, checkout can switch
+   branches or repair a single file.
+ * `child_summary` (not shown) is described in a later section.
+ * `timers` (not shown) is described in a later section.
+ * `aux` (not shown) is described in a later section.
+
+
+Events: child_start and child_exit
+--
+
+The child-start event is emitted just before a child process is started.
+It includes a unique child-id and the child's command line arguments.
+
+The child-exit event is emitted after a child process exits and has
+been reaped.  This event extends the start event with the child's exit
+status and elapsed time.
+
+For example, during a "git fetch origin", git.exe runs gc in the background
+and these events are emitted by the fetch process before and after the
+child gc process:
+
+{
+  "event_name": "child_start",
+  "argv": [
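As a rough model of how the documented fields compose — note in the examples above that `session_id` is simply `"<clock>-<pid>"` — here is a hypothetical Python sketch of a cmd_start event. Field names are taken from the examples; this is not the proposed C implementation.

```python
import json
import os
import time

def cmd_start_event(argv, git_version="2.17.0"):
    # clock is nanoseconds since the epoch; session_id ties all events
    # emitted by one process together as "<clock>-<pid>".
    clock = time.time_ns()
    pid = os.getpid()
    return {
        "event_name": "cmd_start",
        "argv": argv,
        "clock": clock,
        "pid": pid,
        "git_version": git_version,
        "telemetry_version": "1",
        "session_id": "%d-%d" % (clock, pid),
    }

event = cmd_start_event(["git", "version"])
print(json.dumps(event))
```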
Re: [PATCH 00/23] Multi-pack-index (MIDX)
On Thu, Jun 07 2018, Derrick Stolee wrote: > To test the performance in this situation, I created a > script that organizes the Linux repository in a similar > fashion. I split the commit history into 50 parts by > creating branches on every 10,000 commits of the first- > parent history. Then, `git rev-list --objects A ^B` > provides the list of objects reachable from A but not B, > so I could send that to `git pack-objects` to create > these "time-based" packfiles. With these 50 packfiles > (deleting the old one from my fresh clone, and deleting > all tags as they were no longer on-disk) I could then > test 'git rev-list --objects HEAD^{tree}' and see: > > Before: 0.17s > After: 0.13s > % Diff: -23.5% > > By adding logic to count hits and misses to bsearch_pack, > I was able to see that the command above calls that > method 266,930 times with a hit rate of 33%. The MIDX > has the same number of calls with a 100% hit rate. Do you have the script you used for this? It would be very interesting as something we could stick in t/perf/ to test this use-case in the future. How does this & the numbers below compare to just a naïve --max-pack-size= on linux.git? Is it possible for you to tar this test repo up and share it as a one-off? I've been polishing the core.validateAbbrev series I have, and it would be interesting to compare some of the (abbrev) numbers. > Abbreviation Speedups > - > > To fully disambiguate an abbreviation, we must iterate > through all packfiles to ensure no collision exists in > any packfile. This requires O(P log N) time. With the > MIDX, this is only O(log N) time. Our standard test [2] > is 'git log --oneline --parents --raw' because it writes > many abbreviations while also doing a lot of other work > (walking commits and trees to compute the raw diff). 
> > For a copy of the Linux repository with 50 packfiles > split by time, we observed the following: > > Before: 100.5 s > After: 58.2 s > % Diff: -59.7% > > > Request for Review Attention > > > I tried my best to take the feedback from the commit-graph > feature and apply it to this feature. I also worked to > follow the object-store refactoring as I could. I also have > some local commits that create a 'verify' subcommand and > integrate with 'fsck' similar to the commit-graph, but I'll > leave those for a later series (and review is still underway > for that part of the commit-graph). > > One place where I could use some guidance is related to the > current state of 'the_hash_algo' patches. The file format > allows a different "hash version" which then indicates the > length of the hash. What's the best way to ensure this > feature doesn't cause extra pain in the hash-agnostic series? > This will inform how I go back and make the commit-graph > feature better in this area, too. > > > Thanks, > -Stolee
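The O(P log N) versus O(log N) abbreviation claim is easy to model: with P per-pack indexes, proving an abbreviation unambiguous takes one binary search per pack, while a single merged index needs one search total. A toy sketch with made-up object names (not real pack-index code):

```python
from bisect import bisect_left

# Three "packs", each a sorted list of (shortened) object ids.
packs = [
    sorted(["1a2b", "3c4d", "9f00"]),
    sorted(["1a99", "77aa", "beef"]),
    sorted(["0042", "c0de", "face"]),
]
# The midx is conceptually one sorted list over all packs' objects.
midx = sorted(oid for pack in packs for oid in pack)

def hits_with_prefix(sorted_oids, prefix):
    # One binary search, then scan forward over consecutive matches.
    i = bisect_left(sorted_oids, prefix)
    n = 0
    while i + n < len(sorted_oids) and sorted_oids[i + n].startswith(prefix):
        n += 1
    return n

# Per-pack: P binary searches just to learn that "1a" is ambiguous ...
per_pack = sum(hits_with_prefix(p, "1a") for p in packs)
# ... versus a single search in the merged index.
with_midx = hits_with_prefix(midx, "1a")
print(per_pack, with_midx)
```

Both approaches find the same two objects matching "1a", but the per-pack loop ran three searches to the midx's one; that factor of P is what disappears in the timings above.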
[PATCH v8 1/2] json_writer: new routines to create data in JSON format
From: Jeff Hostetler

Add a series of jw_ routines and "struct json_writer" structure to compose
JSON data.  The resulting string data can then be output by commands
wanting to support a JSON output format.

The json-writer routines can be used to generate structured data in a
JSON-like format.  We say "JSON-like" because we do not enforce the
Unicode (usually UTF-8) requirement on string fields.  Internally, Git
does not necessarily have Unicode/UTF-8 data for most fields, so it is
currently unclear the best way to enforce that requirement.  For example,
on Linux, pathnames can contain arbitrary 8-bit character data, so a
command like "status" would not know how to encode the reported
pathnames.  We may want to revisit this (or double encode such strings)
in the future.

The initial use for the json-writer routines is for generating telemetry
data for executed Git commands.  Later, we may want to use them in other
commands, such as status.

Helped-by: René Scharfe
Helped-by: Wink Saville
Helped-by: Ramsay Jones
Signed-off-by: Jeff Hostetler
---
 Makefile                    |    2 +
 json-writer.c               |  419 
 json-writer.h               |  113 +
 t/helper/test-json-writer.c |  572 
 t/t0019-json-writer.sh      |  236 ++
 5 files changed, 1342 insertions(+)
 create mode 100644 json-writer.c
 create mode 100644 json-writer.h
 create mode 100644 t/helper/test-json-writer.c
 create mode 100755 t/t0019-json-writer.sh

diff --git a/Makefile b/Makefile
index a1d8775..4ae6946 100644
--- a/Makefile
+++ b/Makefile
@@ -666,6 +666,7 @@ TEST_PROGRAMS_NEED_X += test-fake-ssh
 TEST_PROGRAMS_NEED_X += test-genrandom
 TEST_PROGRAMS_NEED_X += test-hashmap
 TEST_PROGRAMS_NEED_X += test-index-version
+TEST_PROGRAMS_NEED_X += test-json-writer
 TEST_PROGRAMS_NEED_X += test-lazy-init-name-hash
 TEST_PROGRAMS_NEED_X += test-line-buffer
 TEST_PROGRAMS_NEED_X += test-match-trees
@@ -820,6 +821,7 @@ LIB_OBJS += hashmap.o
 LIB_OBJS += help.o
 LIB_OBJS += hex.o
 LIB_OBJS += ident.o
+LIB_OBJS += json-writer.o
 LIB_OBJS += kwset.o
 LIB_OBJS += levenshtein.o
 LIB_OBJS +=
line-log.o

diff --git a/json-writer.c b/json-writer.c
new file mode 100644
index 000..f35ce19
--- /dev/null
+++ b/json-writer.c
@@ -0,0 +1,419 @@
+#include "cache.h"
+#include "json-writer.h"
+
+void jw_init(struct json_writer *jw)
+{
+	strbuf_reset(&jw->json);
+	strbuf_reset(&jw->open_stack);
+	strbuf_reset(&jw->first_stack);
+	jw->pretty = 0;
+}
+
+void jw_release(struct json_writer *jw)
+{
+	strbuf_release(&jw->json);
+	strbuf_release(&jw->open_stack);
+	strbuf_release(&jw->first_stack);
+}
+
+/*
+ * Append JSON-quoted version of the given string to 'out'.
+ */
+static void append_quoted_string(struct strbuf *out, const char *in)
+{
+	unsigned char c;
+
+	strbuf_addch(out, '"');
+	while ((c = *in++) != '\0') {
+		if (c == '"')
+			strbuf_addstr(out, "\\\"");
+		else if (c == '\\')
+			strbuf_addstr(out, "\\\\");
+		else if (c == '\n')
+			strbuf_addstr(out, "\\n");
+		else if (c == '\r')
+			strbuf_addstr(out, "\\r");
+		else if (c == '\t')
+			strbuf_addstr(out, "\\t");
+		else if (c == '\f')
+			strbuf_addstr(out, "\\f");
+		else if (c == '\b')
+			strbuf_addstr(out, "\\b");
+		else if (c < 0x20)
+			strbuf_addf(out, "\\u%04x", c);
+		else
+			strbuf_addch(out, c);
+	}
+	strbuf_addch(out, '"');
+}
+
+static inline void indent_pretty(struct json_writer *jw)
+{
+	int k;
+
+	if (!jw->pretty)
+		return;
+
+	for (k = 0; k < jw->open_stack.len; k++)
+		strbuf_addstr(&jw->json, "  ");
+}
+
+/*
+ * Begin an object or array (either top-level or nested within the currently
+ * open object or array).
+ */
+static inline void begin(struct json_writer *jw, char ch_open, int pretty)
+{
+	jw->pretty = pretty;
+
+	strbuf_addch(&jw->json, ch_open);
+
+	strbuf_addch(&jw->open_stack, ch_open);
+	strbuf_addch(&jw->first_stack, '1');
+}
+
+/*
+ * Assert that the top of the open-stack is an object.
+ */
+static inline void assert_in_object(const struct json_writer *jw, const char *key)
+{
+	if (!jw->open_stack.len)
+		BUG("json-writer: object: missing jw_object_begin(): '%s'", key);
+	if (jw->open_stack.buf[jw->open_stack.len - 1] != '{')
+		BUG("json-writer: object: not in object: '%s'", key);
+}
+
+/*
+ * Assert that the top of the open-stack is an array.
+ */
+static inline void assert_in_array(const struct json_writer *jw)
+{
+	if (!jw->open_stack.len)
+		BUG("json-writer: array: missing jw_array_begin()");
+	if
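append_quoted_string() above implements the usual JSON string escaping rules. For readers following along, here is the same escape set rendered in a few lines of Python (quote, backslash, the \n \r \t \f \b shorthands, and \u00XX for any remaining control byte) — a sketch of the rules, not the C code:

```python
SHORT_ESCAPES = {
    '"': '\\"', '\\': '\\\\', '\n': '\\n', '\r': '\\r',
    '\t': '\\t', '\f': '\\f', '\b': '\\b',
}

def quote_json_string(s):
    # Mirror of the escaping cases in append_quoted_string().
    out = ['"']
    for ch in s:
        if ch in SHORT_ESCAPES:
            out.append(SHORT_ESCAPES[ch])
        elif ord(ch) < 0x20:
            out.append("\\u%04x" % ord(ch))
        else:
            out.append(ch)
    out.append('"')
    return "".join(out)

print(quote_json_string('tab\there, "quoted", ctrl-\x01'))
```

Round-tripping the result through a strict parser (json.loads) returns the original string, which is a quick way to check the escape table is complete.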
[PATCH v8 2/2] json-writer: t0019: add perl unit test
From: Jeff Hostetler

Test json-writer output using perl script.

Signed-off-by: Jeff Hostetler
---
 t/t0019-json-writer.sh  | 38 
 t/t0019/parse_json.perl | 52 +
 2 files changed, 90 insertions(+)
 create mode 100644 t/t0019/parse_json.perl

diff --git a/t/t0019-json-writer.sh b/t/t0019-json-writer.sh
index c9c2e23..fd61fe4 100755
--- a/t/t0019-json-writer.sh
+++ b/t/t0019-json-writer.sh
@@ -233,4 +233,42 @@ test_expect_success 'inline array with no members' '
 	test_cmp expect actual
 '
 
+# As a sanity check, ask Perl to parse our generated JSON and recursively
+# dump the resulting data in sorted order. Confirm that it matches our
+# expectations.
+test_expect_success 'parse JSON using Perl' '
+	cat >expect <<-\EOF &&
+	row[0].a abc
+	row[0].b 42
+	row[0].sub1 hash
+	row[0].sub1.c 3.14
+	row[0].sub1.d 1
+	row[0].sub1.sub2 array
+	row[0].sub1.sub2[0] 0
+	row[0].sub1.sub2[1] hash
+	row[0].sub1.sub2[1].g 0
+	row[0].sub1.sub2[1].h 1
+	row[0].sub1.sub2[2] null
+	EOF
+	test-json-writer >output.json \
+		@object \
+			@object-string a abc \
+			@object-int b 42 \
+			@object-object "sub1" \
+				@object-double c 2 3.140 \
+				@object-true d \
+				@object-array "sub2" \
+					@array-false \
+					@array-object \
+						@object-int g 0 \
+						@object-int h 1 \
+					@end \
+					@array-null \
+				@end \
+			@end \
+		@end &&
+	perl "$TEST_DIRECTORY"/t0019/parse_json.perl <output.json >actual &&
+	test_cmp expect actual
+'
+
 test_done

diff --git a/t/t0019/parse_json.perl b/t/t0019/parse_json.perl
new file mode 100644
index 000..ca4e5bf
--- /dev/null
+++ b/t/t0019/parse_json.perl
@@ -0,0 +1,52 @@
+#!/usr/bin/perl
+use strict;
+use warnings;
+use JSON;
+
+sub dump_array {
+	my ($label_in, $ary_ref) = @_;
+	my @ary = @$ary_ref;
+
+	for ( my $i = 0; $i <= $#{ $ary_ref }; $i++ )
+	{
+		my $label = "$label_in\[$i\]";
+		dump_item($label, $ary[$i]);
+	}
+}
+
+sub dump_hash {
+	my ($label_in, $obj_ref) = @_;
+	my %obj = %$obj_ref;
+
+	foreach my $k (sort keys %obj) {
+		my $label = (length($label_in) > 0) ?
"$label_in.$k" : "$k"; + my $value = $obj{$k}; + + dump_item($label, $value); +} +} + +sub dump_item { +my ($label_in, $value) = @_; +if (ref($value) eq 'ARRAY') { + print "$label_in array\n"; + dump_array($label_in, $value); +} elsif (ref($value) eq 'HASH') { + print "$label_in hash\n"; + dump_hash($label_in, $value); +} elsif (defined $value) { + print "$label_in $value\n"; +} else { + print "$label_in null\n"; +} +} + +my $row = 0; +while (<>) { +my $data = decode_json( $_ ); +my $label = "row[$row]"; + +dump_hash($label, $data); +$row++; +} + -- 2.9.3
[PATCH v8 0/2] json-writer V8
From: Jeff Hostetler Here is V8 of my json-writer patches. Please replace the existing V5/V6/V7 version of the jh/json-writer branch with this one. This version uses perl rather than python to test the generated JSON. Jeff Hostetler (2): json_writer: new routines to create data in JSON format json-writer: t0019: add perl unit test Makefile| 2 + json-writer.c | 419 json-writer.h | 113 + t/helper/test-json-writer.c | 572 t/t0019-json-writer.sh | 274 + t/t0019/parse_json.perl | 52 6 files changed, 1432 insertions(+) create mode 100644 json-writer.c create mode 100644 json-writer.h create mode 100644 t/helper/test-json-writer.c create mode 100755 t/t0019-json-writer.sh create mode 100644 t/t0019/parse_json.perl -- 2.9.3
Re: [PATCH 00/23] Multi-pack-index (MIDX)
On 6/7/2018 10:03 AM, Derrick Stolee wrote: This patch series includes a rewrite of the previous multi-pack-index RFC [1] using the feedback from the commit-graph feature. Sorry to everyone who got a duplicate copy of this series. I misspelled 'kernel.org' and it didn't go to the list. I also have this series available as a GitHub PR [1] [1] https://github.com/derrickstolee/git/pull/7
[PATCH 07/23] midx: expand test data
As we build the multi-pack-index file format, we want to test the format on real repositories. Add tests to t5319-midx.sh that create repository data including multiple packfiles with both version 1 and version 2 formats. The current 'git midx write' command will always write the same file with no "real" data. This will be expanded in future commits, along with the test expectations. Signed-off-by: Derrick Stolee --- t/t5319-midx.sh | 101 1 file changed, 101 insertions(+) diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh index e78514d8e9..2c25a69744 100755 --- a/t/t5319-midx.sh +++ b/t/t5319-midx.sh @@ -14,8 +14,109 @@ midx_read_expect() { test_expect_success 'write midx with no packs' ' git midx --object-dir=. write && + test_when_finished rm pack/multi-pack-index && test_path_is_file pack/multi-pack-index && midx_read_expect ' +test_expect_success 'create objects' ' + for i in `test_seq 1 5` + do + iii=$(printf '%03i' $i) + test-tool genrandom "bar" 200 > wide_delta_$iii && + test-tool genrandom "baz $iii" 50 >> wide_delta_$iii && + test-tool genrandom "foo"$i 100 > deep_delta_$iii && + test-tool genrandom "foo"$(expr $i + 1) 100 >> deep_delta_$iii && + test-tool genrandom "foo"$(expr $i + 2) 100 >> deep_delta_$iii && + echo $iii >file_$iii && + test-tool genrandom "$iii" 8192 >>file_$iii && + git update-index --add file_$iii deep_delta_$iii wide_delta_$iii && + i=$(expr $i + 1) || return 1 + done && + { echo 101 && test-tool genrandom 100 8192; } >file_101 && + git update-index --add file_101 && + tree=$(git write-tree) && + commit=$(git commit-tree $tree obj-list && + git update-ref HEAD $commit +' + +test_expect_success 'write midx with one v1 pack' ' + pack=$(git pack-objects --index-version=1 pack/test wide_delta_$iii && + test-tool genrandom "baz $iii" 50 >> wide_delta_$iii && + test-tool genrandom "foo"$i 100 > deep_delta_$iii && + test-tool genrandom "foo"$(expr $i + 1) 100 >> deep_delta_$iii && + test-tool genrandom "foo"$(expr $i + 2) 100 >> 
deep_delta_$iii && + echo $iii >file_$iii && + test-tool genrandom "$iii" 8192 >>file_$iii && + git update-index --add file_$iii deep_delta_$iii wide_delta_$iii && + i=$(expr $i + 1) || return 1 + done && + { echo 101 && test-tool genrandom 100 8192; } >file_101 && + git update-index --add file_101 && + tree=$(git write-tree) && + commit=$(git commit-tree $tree -p HEADobj-list2 && + git update-ref HEAD $commit +' + +test_expect_success 'write midx with two packs' ' + pack1=$(git pack-objects --index-version=1 pack/test-1 wide_delta_$iii && + test-tool genrandom "baz $iii" 50 >> wide_delta_$iii && + test-tool genrandom "foo"$i 100 > deep_delta_$iii && + test-tool genrandom "foo"$(expr $i + 1) 100 >> deep_delta_$iii && + test-tool genrandom "foo"$(expr $i + 2) 100 >> deep_delta_$iii && + echo $iii >file_$iii && + test-tool genrandom "$iii" 8192 >>file_$iii && + git update-index --add file_$iii deep_delta_$iii wide_delta_$iii && + { echo 101 && test-tool genrandom 100 8192; } >file_101 && + git update-index --add file_101 && + tree=$(git write-tree) && + commit=$(git commit-tree $tree -p HEADobj-list && + git update-ref HEAD $commit && + git pack-objects --index-version=2 test-pack
[PATCH 06/23] midx: struct midxed_git and 'read' subcommand
As we build the multi-pack-index feature by adding chunks at a time, we want to test that the data is being written correctly. Create struct midxed_git to store an in-memory representation of a multi-pack-index and a memory-map of the binary file. Initialize this struct in load_midxed_git(object_dir). Create the 'git midx read' subcommand to output basic information about the multi-pack-index file. This will be expanded as more information is written to the file. Signed-off-by: Derrick Stolee --- Documentation/git-midx.txt | 11 +++ builtin/midx.c | 23 +- midx.c | 65 ++ midx.h | 9 ++ object-store.h | 19 +++ t/t5319-midx.sh| 12 ++- 6 files changed, 137 insertions(+), 2 deletions(-) diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt index dcaeb1a91b..919283fdd8 100644 --- a/Documentation/git-midx.txt +++ b/Documentation/git-midx.txt @@ -23,6 +23,11 @@ OPTIONS /packs/multi-pack-index for the current MIDX file, and /packs for the pack-files to index. +read:: + When given as the verb, read the current MIDX file and output + basic information about its contents. Used for debugging + purposes only. + write:: When given as the verb, write a new MIDX file to /packs/multi-pack-index. @@ -43,6 +48,12 @@ $ git midx write $ git midx --object-dir write --- +* Read the MIDX file in the .git/objects folder. 
++ +--- +$ git midx read +--- + GIT --- diff --git a/builtin/midx.c b/builtin/midx.c index dc0a5acd3f..c7002f664a 100644 --- a/builtin/midx.c +++ b/builtin/midx.c @@ -6,7 +6,7 @@ #include "midx.h" static char const * const builtin_midx_usage[] ={ - N_("git midx [--object-dir ] [write]"), + N_("git midx [--object-dir ] [read|write]"), NULL }; @@ -14,6 +14,25 @@ static struct opts_midx { const char *object_dir; } opts; +static int read_midx_file(const char *object_dir) +{ + struct midxed_git *m = load_midxed_git(object_dir); + + if (!m) + return 0; + + printf("header: %08x %d %d %d %d\n", + m->signature, + m->version, + m->hash_version, + m->num_chunks, + m->num_packs); + + printf("object_dir: %s\n", m->object_dir); + + return 0; +} + int cmd_midx(int argc, const char **argv, const char *prefix) { static struct option builtin_midx_options[] = { @@ -38,6 +57,8 @@ int cmd_midx(int argc, const char **argv, const char *prefix) if (argc == 0) return 0; + if (!strcmp(argv[0], "read")) + return read_midx_file(opts.object_dir); if (!strcmp(argv[0], "write")) return write_midx_file(opts.object_dir); diff --git a/midx.c b/midx.c index 3e55422a21..fa18770f1d 100644 --- a/midx.c +++ b/midx.c @@ -3,12 +3,15 @@ #include "dir.h" #include "csum-file.h" #include "lockfile.h" +#include "object-store.h" #include "midx.h" #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */ #define MIDX_VERSION 1 #define MIDX_HASH_VERSION 1 /* SHA-1 */ #define MIDX_HEADER_SIZE 12 +#define MIDX_HASH_LEN 20 +#define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN) static char *get_midx_filename(const char *object_dir) { @@ -18,6 +21,68 @@ static char *get_midx_filename(const char *object_dir) return strbuf_detach(_name, NULL); } +struct midxed_git *load_midxed_git(const char *object_dir) +{ + struct midxed_git *m; + int fd; + struct stat st; + size_t midx_size; + void *midx_map; + const char *midx_name = get_midx_filename(object_dir); + + fd = git_open(midx_name); + if (fd < 0) + return NULL; + if (fstat(fd, )) 
{ + close(fd); + return NULL; + } + midx_size = xsize_t(st.st_size); + + if (midx_size < MIDX_MIN_SIZE) { + close(fd); + die("multi-pack-index file %s is too small", midx_name); + } + + midx_map = xmmap(NULL, midx_size, PROT_READ, MAP_PRIVATE, fd, 0); + + m = xcalloc(1, sizeof(*m) + strlen(object_dir) + 1); + strcpy(m->object_dir, object_dir); + m->data = midx_map; + + m->signature = get_be32(m->data); + if (m->signature != MIDX_SIGNATURE) { + error("multi-pack-index signature %X does not match signature %X", + m->signature, MIDX_SIGNATURE); + goto cleanup_fail; + } + + m->version = *(m->data + 4); + if (m->version != MIDX_VERSION) { + error("multi-pack-index version %d not recognized", + m->version); + goto cleanup_fail; + } + + m->hash_version =
[PATCH 08/23] midx: read packfiles from pack directory
When constructing a multi-pack-index file for a given object directory, read the files within the enclosed pack directory and find matches that end with ".idx" and find the correct paired packfile using add_packed_git(). Signed-off-by: Derrick Stolee --- midx.c | 51 +++-- t/t5319-midx.sh | 15 --- 2 files changed, 57 insertions(+), 9 deletions(-) diff --git a/midx.c b/midx.c index fa18770f1d..9fb89c80a2 100644 --- a/midx.c +++ b/midx.c @@ -102,10 +102,15 @@ static size_t write_midx_header(struct hashfile *f, int write_midx_file(const char *object_dir) { unsigned char num_chunks = 0; - uint32_t num_packs = 0; char *midx_name; struct hashfile *f; struct lock_file lk; + struct packed_git **packs = NULL; + uint32_t i, nr_packs = 0, alloc_packs = 0; + DIR *dir; + struct dirent *de; + struct strbuf pack_dir = STRBUF_INIT; + size_t pack_dir_len; midx_name = get_midx_filename(object_dir); if (safe_create_leading_directories(midx_name)) { @@ -114,14 +119,56 @@ int write_midx_file(const char *object_dir) midx_name); } + strbuf_addf(&pack_dir, "%s/pack", object_dir); + dir = opendir(pack_dir.buf); + + if (!dir) { + error_errno("unable to open pack directory: %s", + pack_dir.buf); + strbuf_release(&pack_dir); + return 1; + } + + strbuf_addch(&pack_dir, '/'); + pack_dir_len = pack_dir.len; + ALLOC_ARRAY(packs, alloc_packs); + while ((de = readdir(dir)) != NULL) { + if (is_dot_or_dotdot(de->d_name)) + continue; + + if (ends_with(de->d_name, ".idx")) { + ALLOC_GROW(packs, nr_packs + 1, alloc_packs); + + strbuf_setlen(&pack_dir, pack_dir_len); + strbuf_addstr(&pack_dir, de->d_name); + + packs[nr_packs] = add_packed_git(pack_dir.buf, +pack_dir.len, +0); + if (!packs[nr_packs]) + warning("failed to add packfile '%s'", + pack_dir.buf); + else + nr_packs++; + } + } + closedir(dir); + strbuf_release(&pack_dir); + hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR); f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf); FREE_AND_NULL(midx_name); - write_midx_header(f, num_chunks, num_packs); + 
write_midx_header(f, num_chunks, nr_packs); finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM); commit_lock_file(&lk); + for (i = 0; i < nr_packs; i++) { + close_pack(packs[i]); + FREE_AND_NULL(packs[i]); + } + + FREE_AND_NULL(packs); return 0; } diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh index 2c25a69744..abe545c7c4 100755 --- a/t/t5319-midx.sh +++ b/t/t5319-midx.sh @@ -4,8 +4,9 @@ test_description='multi-pack-indexes' . ./test-lib.sh midx_read_expect() { + NUM_PACKS=$1 cat >expect <<- EOF - header: 4d494458 1 1 0 0 + header: 4d494458 1 1 0 $NUM_PACKS object_dir: . EOF git midx read --object-dir=. >actual && @@ -16,7 +17,7 @@ test_expect_success 'write midx with no packs' ' git midx --object-dir=. write && test_when_finished rm pack/multi-pack-index && test_path_is_file pack/multi-pack-index && - midx_read_expect + midx_read_expect 0 ' test_expect_success 'create objects' ' @@ -47,14 +48,14 @@ test_expect_success 'write midx with one v1 pack' ' pack=$(git pack-objects --index-version=1 pack/test obj-list && git update-ref HEAD $commit && - git pack-objects --index-version=2 test-pack
[PATCH 17/23] midx: read objects from multi-pack-index
Signed-off-by: Derrick Stolee --- midx.c | 96 -- midx.h | 2 ++ object-store.h | 1 + packfile.c | 8 - 4 files changed, 104 insertions(+), 3 deletions(-) diff --git a/midx.c b/midx.c index 5e9290ca8f..6eca8f1b12 100644 --- a/midx.c +++ b/midx.c @@ -3,6 +3,7 @@ #include "dir.h" #include "csum-file.h" #include "lockfile.h" +#include "sha1-lookup.h" #include "object-store.h" #include "packfile.h" #include "midx.h" @@ -64,7 +65,7 @@ struct midxed_git *load_midxed_git(const char *object_dir) m = xcalloc(1, sizeof(*m) + strlen(object_dir) + 1); strcpy(m->object_dir, object_dir); - m->data = midx_map; + m->data = (const unsigned char*)midx_map; m->signature = get_be32(m->data); if (m->signature != MIDX_SIGNATURE) { @@ -145,7 +146,9 @@ struct midxed_git *load_midxed_git(const char *object_dir) m->num_objects = ntohl(m->chunk_oid_fanout[255]); - m->pack_names = xcalloc(m->num_packs, sizeof(const char *)); + m->packs = xcalloc(m->num_packs, sizeof(*m->packs)); + + ALLOC_ARRAY(m->pack_names, m->num_packs); for (i = 0; i < m->num_packs; i++) { if (i) { if (ntohl(m->chunk_pack_lookup[i]) <= ntohl(m->chunk_pack_lookup[i - 1])) { @@ -175,6 +178,95 @@ struct midxed_git *load_midxed_git(const char *object_dir) exit(1); } +static int prepare_midx_pack(struct midxed_git *m, uint32_t pack_int_id) +{ + struct strbuf pack_name = STRBUF_INIT; + + if (pack_int_id >= m->num_packs) + BUG("bad pack-int-id"); + + if (m->packs[pack_int_id]) + return 0; + + strbuf_addstr(_name, m->object_dir); + strbuf_addstr(_name, "/pack/"); + strbuf_addstr(_name, m->pack_names[pack_int_id]); + + m->packs[pack_int_id] = add_packed_git(pack_name.buf, pack_name.len, 1); + strbuf_release(_name); + return !m->packs[pack_int_id]; +} + +int bsearch_midx(const struct object_id *oid, struct midxed_git *m, uint32_t *result) +{ + return bsearch_hash(oid->hash, m->chunk_oid_fanout, m->chunk_oid_lookup, + MIDX_HASH_LEN, result); +} + +static off_t nth_midxed_offset(struct midxed_git *m, uint32_t pos) +{ + const unsigned 
char *offset_data; + uint32_t offset32; + + offset_data = m->chunk_object_offsets + pos * MIDX_CHUNK_OFFSET_WIDTH; + offset32 = get_be32(offset_data + sizeof(uint32_t)); + + if (m->chunk_large_offsets && offset32 & MIDX_LARGE_OFFSET_NEEDED) { + if (sizeof(offset32) < sizeof(uint64_t)) + die(_("multi-pack-index stores a 64-bit offset, but off_t is too small")); + + offset32 ^= MIDX_LARGE_OFFSET_NEEDED; + return get_be64(m->chunk_large_offsets + sizeof(uint64_t) * offset32); + } + + return offset32; +} + +static uint32_t nth_midxed_pack_int_id(struct midxed_git *m, uint32_t pos) +{ + return get_be32(m->chunk_object_offsets + pos * MIDX_CHUNK_OFFSET_WIDTH); +} + +static int nth_midxed_pack_entry(struct midxed_git *m, struct pack_entry *e, uint32_t pos) +{ + uint32_t pack_int_id; + struct packed_git *p; + + if (pos >= m->num_objects) + return 0; + + pack_int_id = nth_midxed_pack_int_id(m, pos); + + if (prepare_midx_pack(m, pack_int_id)) + die(_("error preparing packfile from multi-pack-index")); + p = m->packs[pack_int_id]; + + /* + * We are about to tell the caller where they can locate the + * requested object. We better make sure the packfile is + * still here and can be accessed before supplying that + * answer, as it may have been deleted since the MIDX was + * loaded! 
+ */ + if (!is_pack_valid(p)) + return 0; + + e->offset = nth_midxed_offset(m, pos); + e->p = p; + + return 1; +} + +int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct midxed_git *m) +{ + uint32_t pos; + + if (!bsearch_midx(oid, m, &pos)) + return 0; + + return nth_midxed_pack_entry(m, e, pos); +} + int prepare_midxed_git_one(struct repository *r, const char *object_dir) { struct midxed_git *m = r->objects->midxed_git; diff --git a/midx.h b/midx.h index 793203fc4a..0c66812229 100644 --- a/midx.h +++ b/midx.h @@ -8,6 +8,8 @@ #include "repository.h" struct midxed_git *load_midxed_git(const char *object_dir); +int bsearch_midx(const struct object_id *oid, struct midxed_git *m, uint32_t *result); +int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct midxed_git *m); int prepare_midxed_git_one(struct repository *r, const char *object_dir); int write_midx_file(const char *object_dir); diff --git a/object-store.h b/object-store.h index 7908d46e34..5af2a852bc 100644 --- a/object-store.h +++
[PATCH 02/23] midx: add midx format details to pack-format.txt
The multi-pack-index (MIDX) feature generalizes the existing pack-index (IDX) feature by indexing objects across multiple pack-files. Describe the basic file format, using a 12-byte header followed by a lookup table for a list of "chunks" which will be described later. The file ends with a footer containing a checksum using the hash algorithm. The header allows later versions to create breaking changes by advancing the version number. We can also change the hash algorithm using a different version value. We will add the individual chunk format information as we introduce the code that writes that information. Signed-off-by: Derrick Stolee --- Documentation/technical/pack-format.txt | 49 + 1 file changed, 49 insertions(+) diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt index 70a99fd142..17666b4bfc 100644 --- a/Documentation/technical/pack-format.txt +++ b/Documentation/technical/pack-format.txt @@ -252,3 +252,52 @@ Pack file entry: <+ corresponding packfile. 20-byte SHA-1-checksum of all of the above. + +== midx-*.midx files have the following format: + +The meta-index files refer to multiple pack-files and loose objects. + +In order to allow extensions that add extra data to the MIDX, we organize +the body into "chunks" and provide a lookup table at the beginning of the +body. The header includes certain length values, such as the number of packs, +the number of base MIDX files, hash lengths and types. + +All 4-byte numbers are in network order. + +HEADER: + + 4-byte signature: + The signature is: {'M', 'I', 'D', 'X'} + + 1-byte version number: + Git only writes or recognizes version 1 + + 1-byte Object Id Version + Git only writes or recognizes version 1 (SHA-1) + + 1-byte number (C) of "chunks" + + 1-byte number (I) of base multi-pack-index files: + This value is currently always zero. 
+ + 4-byte number (P) of pack files + +CHUNK LOOKUP: + + (C + 1) * 12 bytes providing the chunk offsets: + First 4 bytes describe chunk id. Value 0 is a terminating label. + Other 8 bytes provide offset in current file for chunk to start. + (Chunks are provided in file-order, so you can infer the length + using the next chunk position if necessary.) + + The remaining data in the body is described one chunk at a time, and + these chunks may be given in any order. Chunks are required unless + otherwise specified. + +CHUNK DATA: + + (This section intentionally left incomplete.) + +TRAILER: + + H-byte HASH-checksum of all of the above. -- 2.18.0.rc1
[PATCH 05/23] midx: write header information to lockfile
As we begin writing the multi-pack-index format to disk, start with the basics: the 12-byte header and the 20-byte checksum footer. Start with these basics so we can add the rest of the format in small increments. As we implement the format, we will use a technique to check that our computed offsets within the multi-pack-index file match what we are actually writing. Each method that writes to the hashfile will return the number of bytes written, and we will track that those values match our expectations. Currently, write_midx_header() returns 12, but is not checked. We will check the return value in a later commit. Signed-off-by: Derrick Stolee --- midx.c | 53 + t/t5319-midx.sh | 5 +++-- 2 files changed, 56 insertions(+), 2 deletions(-) diff --git a/midx.c b/midx.c index 616af66b13..3e55422a21 100644 --- a/midx.c +++ b/midx.c @@ -1,9 +1,62 @@ #include "git-compat-util.h" #include "cache.h" #include "dir.h" +#include "csum-file.h" +#include "lockfile.h" #include "midx.h" +#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */ +#define MIDX_VERSION 1 +#define MIDX_HASH_VERSION 1 /* SHA-1 */ +#define MIDX_HEADER_SIZE 12 + +static char *get_midx_filename(const char *object_dir) +{ + struct strbuf midx_name = STRBUF_INIT; + strbuf_addstr(&midx_name, object_dir); + strbuf_addstr(&midx_name, "/pack/multi-pack-index"); + return strbuf_detach(&midx_name, NULL); +} + +static size_t write_midx_header(struct hashfile *f, + unsigned char num_chunks, + uint32_t num_packs) +{ + char byte_values[4]; + hashwrite_be32(f, MIDX_SIGNATURE); + byte_values[0] = MIDX_VERSION; + byte_values[1] = MIDX_HASH_VERSION; + byte_values[2] = num_chunks; + byte_values[3] = 0; /* unused */ + hashwrite(f, byte_values, sizeof(byte_values)); + hashwrite_be32(f, num_packs); + + return MIDX_HEADER_SIZE; +} + int write_midx_file(const char *object_dir) { + unsigned char num_chunks = 0; + uint32_t num_packs = 0; + char *midx_name; + struct hashfile *f; + struct lock_file lk; + + midx_name = get_midx_filename(object_dir); + if 
(safe_create_leading_directories(midx_name)) { + UNLEAK(midx_name); + die_errno(_("unable to create leading directories of %s"), + midx_name); + } + + hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR); + f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf); + FREE_AND_NULL(midx_name); + + write_midx_header(f, num_chunks, num_packs); + + finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM); + commit_lock_file(&lk); + return 0; } diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh index a590137af7..80f9389837 100755 --- a/t/t5319-midx.sh +++ b/t/t5319-midx.sh @@ -3,8 +3,9 @@ test_description='multi-pack-indexes' . ./test-lib.sh -test_expect_success 'write midx with no pakcs' ' - git midx --object-dir=. write +test_expect_success 'write midx with no packs' ' + git midx --object-dir=. write && + test_path_is_file pack/multi-pack-index ' test_done -- 2.18.0.rc1
[PATCH 21/23] midx: prevent duplicate packfile loads
If the multi-pack-index contains a packfile, then we do not need to add that packfile to the packed_git linked list or the MRU list. Signed-off-by: Derrick Stolee --- midx.c | 23 +++ midx.h | 1 + packfile.c | 7 +++ 3 files changed, 31 insertions(+) diff --git a/midx.c b/midx.c index 388d79b7d9..3242646fe0 100644 --- a/midx.c +++ b/midx.c @@ -278,6 +278,29 @@ int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct mi return nth_midxed_pack_entry(m, e, pos); } +int midx_contains_pack(struct midxed_git *m, const char *idx_name) +{ + uint32_t first = 0, last = m->num_packs; + + while (first < last) { + uint32_t mid = first + (last - first) / 2; + const char *current; + int cmp; + + current = m->pack_names[mid]; + cmp = strcmp(idx_name, current); + if (!cmp) + return 1; + if (cmp > 0) { + first = mid + 1; + continue; + } + last = mid; + } + + return 0; +} + int prepare_midxed_git_one(struct repository *r, const char *object_dir) { struct midxed_git *m = r->objects->midxed_git; diff --git a/midx.h b/midx.h index 497bdcc77c..c1db58d8c4 100644 --- a/midx.h +++ b/midx.h @@ -13,6 +13,7 @@ struct object_id *nth_midxed_object_oid(struct object_id *oid, struct midxed_git *m, uint32_t n); int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct midxed_git *m); +int midx_contains_pack(struct midxed_git *m, const char *idx_name); int prepare_midxed_git_one(struct repository *r, const char *object_dir); int write_midx_file(const char *object_dir); diff --git a/packfile.c b/packfile.c index 059b2aa097..479cb69b9f 100644 --- a/packfile.c +++ b/packfile.c @@ -746,6 +746,11 @@ static void prepare_packed_git_one(struct repository *r, char *objdir, int local DIR *dir; struct dirent *de; struct string_list garbage = STRING_LIST_INIT_DUP; + struct midxed_git *m = r->objects->midxed_git; + + /* look for the multi-pack-index for this object directory */ + while (m && strcmp(m->object_dir, objdir)) + m = m->next; strbuf_addstr(&path, objdir); 
strbuf_addstr(&path, "/pack"); @@ -772,6 +777,8 @@ static void prepare_packed_git_one(struct repository *r, char *objdir, int local base_len = path.len; if (strip_suffix_mem(path.buf, &base_len, ".idx")) { /* Don't reopen a pack we already have. */ + if (m && midx_contains_pack(m, de->d_name)) + continue; for (p = r->objects->packed_git; p; p = p->next) { size_t len; -- 2.18.0.rc1
[PATCH 15/23] midx: create core.midx config setting
The core.midx config setting controls the multi-pack-index (MIDX) feature. If false, the setting will disable all reads from the multi-pack-index file. Add comparison commands in t5319-midx.sh to check typical Git behavior remains the same as the config setting is turned on and off. This currently includes 'git rev-list' and 'git log' commands to trigger several object database reads. Signed-off-by: Derrick Stolee --- Documentation/config.txt | 4 +++ cache.h | 1 + config.c | 5 environment.c| 1 + t/t5319-midx.sh | 57 5 files changed, 57 insertions(+), 11 deletions(-) diff --git a/Documentation/config.txt b/Documentation/config.txt index ab641bf5a9..e78150e452 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -908,6 +908,10 @@ core.commitGraph:: Enable git commit graph feature. Allows reading from the commit-graph file. +core.midx:: + Enable multi-pack-index feature. Allows reading from the multi- + pack-index file. + core.sparseCheckout:: Enable "sparse checkout" feature. See section "Sparse checkout" in linkgit:git-read-tree[1] for more information. 
diff --git a/cache.h b/cache.h index 89a107a7f7..c7967f7643 100644 --- a/cache.h +++ b/cache.h @@ -814,6 +814,7 @@ extern char *git_replace_ref_base; extern int fsync_object_files; extern int core_preload_index; extern int core_commit_graph; +extern int core_midx; extern int core_apply_sparse_checkout; extern int precomposed_unicode; extern int protect_hfs; diff --git a/config.c b/config.c index fbbf0f8e9f..0df3dbdf74 100644 --- a/config.c +++ b/config.c @@ -1313,6 +1313,11 @@ static int git_default_core_config(const char *var, const char *value) return 0; } + if (!strcmp(var, "core.midx")) { + core_midx = git_config_bool(var, value); + return 0; + } + if (!strcmp(var, "core.sparsecheckout")) { core_apply_sparse_checkout = git_config_bool(var, value); return 0; diff --git a/environment.c b/environment.c index 2a6de2330b..dcb4417604 100644 --- a/environment.c +++ b/environment.c @@ -67,6 +67,7 @@ enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE; char *notes_ref_name; int grafts_replace_parents = 1; int core_commit_graph; +int core_midx; int core_apply_sparse_checkout; int merge_log_config = -1; int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */ diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh index 709652c635..1a50987778 100755 --- a/t/t5319-midx.sh +++ b/t/t5319-midx.sh @@ -3,6 +3,8 @@ test_description='multi-pack-indexes' . ./test-lib.sh +objdir=.git/objects + midx_read_expect() { NUM_PACKS=$1 NUM_OBJECTS=$2 @@ -62,13 +64,42 @@ test_expect_success 'write midx with one v1 pack' ' midx_read_expect 1 17 5 . 
' +midx_git_two_modes() { + git -c core.midx=false $1 >expect && + git -c core.midx=true $1 >actual && + test_cmp expect actual +} + +compare_results_with_midx() { + MSG=$1 + test_expect_success "check normal git operations: $MSG" ' + midx_git_two_modes "rev-list --objects --all" && + midx_git_two_modes "log --raw" + ' +} + test_expect_success 'write midx with one v2 pack' ' - pack=$(git pack-objects --index-version=2,0x40 pack/test expect && + git -c core.midx=true $1 >actual && + test_cmp expect actual +} + +compare_results_with_midx() { + MSG=$1 + test_expect_success "check normal git operations: $MSG" ' + midx_git_two_modes "rev-list --objects --all" && + midx_git_two_modes "log --raw" + ' +} + +compare_results_with_midx "one v2 pack" + test_expect_success 'Add more objects' ' for i in `test_seq 6 10` do @@ -94,12 +125,13 @@ test_expect_success 'Add more objects' ' ' test_expect_success 'write midx with two packs' ' - pack1=$(git pack-objects --index-version=1 pack/test-1 obj-list && git update-ref HEAD $commit && - git pack-objects --index-version=2 pack/test-pack [] corrupt_data() { -- 2.18.0.rc1
[PATCH 19/23] midx: use existing midx when writing new one
Signed-off-by: Derrick Stolee --- midx.c | 68 +- 1 file changed, 63 insertions(+), 5 deletions(-) diff --git a/midx.c b/midx.c index 25d8142c2a..388d79b7d9 100644 --- a/midx.c +++ b/midx.c @@ -389,6 +389,23 @@ static int midx_oid_compare(const void *_a, const void *_b) return a->pack_int_id - b->pack_int_id; } +static int nth_midxed_pack_midx_entry(struct midxed_git *m, + uint32_t *pack_perm, + struct pack_midx_entry *e, + uint32_t pos) +{ + if (pos >= m->num_objects) + return 1; + + nth_midxed_object_oid(&e->oid, m, pos); + e->pack_int_id = pack_perm[nth_midxed_pack_int_id(m, pos)]; + e->offset = nth_midxed_offset(m, pos); + + /* consider objects in midx to be from "old" packs */ + e->pack_mtime = 0; + return 0; +} + static void fill_pack_entry(uint32_t pack_int_id, struct packed_git *p, uint32_t cur_object, @@ -414,7 +431,8 @@ static void fill_pack_entry( * Copy only the de-duplicated entries (selected by most-recent modified time * of a packfile containing the object). */ -static struct pack_midx_entry *get_sorted_entries(struct packed_git **p, +static struct pack_midx_entry *get_sorted_entries(struct midxed_git *m, + struct packed_git **p, uint32_t *perm, uint32_t nr_packs, uint32_t *nr_objects) @@ -423,8 +441,9 @@ static struct pack_midx_entry *get_sorted_entries(struct packed_git **p, uint32_t nr_fanout, alloc_fanout, alloc_objects, total_objects = 0; struct pack_midx_entry *entries_by_fanout = NULL; struct pack_midx_entry *deduplicated_entries = NULL; + uint32_t start_pack = m ? 
m->num_packs : 0; - for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) { + for (cur_pack = start_pack; cur_pack < nr_packs; cur_pack++) { if (open_pack_index(p[cur_pack])) continue; @@ -445,7 +464,23 @@ static struct pack_midx_entry *get_sorted_entries(struct packed_git **p, for (cur_fanout = 0; cur_fanout < 256; cur_fanout++) { nr_fanout = 0; - for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) { + if (m) { + uint32_t start = 0, end; + + if (cur_fanout) + start = ntohl(m->chunk_oid_fanout[cur_fanout - 1]); + end = ntohl(m->chunk_oid_fanout[cur_fanout]); + + for (cur_object = start; cur_object < end; cur_object++) { + ALLOC_GROW(entries_by_fanout, nr_fanout + 1, alloc_fanout); + nth_midxed_pack_midx_entry(m, perm, + &entries_by_fanout[nr_fanout], + cur_object); + nr_fanout++; + } + } + + for (cur_pack = start_pack; cur_pack < nr_packs; cur_pack++) { uint32_t start = 0, end; if (cur_fanout) @@ -654,6 +689,7 @@ int write_midx_file(const char *object_dir) struct pack_midx_entry *entries; uint32_t nr_entries, num_large_offsets = 0; int large_offsets_needed = 0; + struct midxed_git *m = NULL; midx_name = get_midx_filename(object_dir); if (safe_create_leading_directories(midx_name)) { @@ -662,6 +698,8 @@ midx_name); } + m = load_midxed_git(object_dir); + strbuf_addf(&pack_dir, "%s/pack", object_dir); dir = opendir(pack_dir.buf); @@ -676,11 +714,27 @@ pack_dir_len = pack_dir.len; ALLOC_ARRAY(packs, alloc_packs); ALLOC_ARRAY(pack_names, alloc_pack_names); + + if (m) { + for (i = 0; i < m->num_packs; i++) { + ALLOC_GROW(packs, nr_packs + 1, alloc_packs); + ALLOC_GROW(pack_names, nr_packs + 1, alloc_pack_names); + + packs[nr_packs] = NULL; + pack_names[nr_packs] = xstrdup(m->pack_names[i]); + pack_name_concat_len += strlen(pack_names[nr_packs]) + 1; + nr_packs++; + } + } + while ((de = readdir(dir)) != NULL) { + if (is_dot_or_dotdot(de->d_name)) continue; if (ends_with(de->d_name, ".idx")) { + if (m &&
[PATCH 14/23] midx: write object offsets
The final pair of chunks for the multi-pack-index (MIDX) file stores the object offsets. We default to using 32-bit offsets as in the pack-index version 1 format, but if there exists an offset larger than 32-bits, we use a trick similar to the pack-index version 2 format by storing all offsets at least 2^31 in a 64-bit table; we use the 32-bit table to point into that 64-bit table as necessary. We only store these 64-bit offsets if necessary, so create a test that manipulates a version 2 pack-index to fake a large offset. This allows us to test that the large offset table is created, but the data does not match the actual packfile offsets. The MIDX offset does match the (corrupted) pack-index offset, so a later commit will compare these offsets during a 'verify' step. Signed-off-by: Derrick Stolee --- Documentation/technical/pack-format.txt | 15 +++- builtin/midx.c | 4 + midx.c | 100 +++- object-store.h | 2 + t/t5319-midx.sh | 45 --- 5 files changed, 151 insertions(+), 15 deletions(-) diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt index 77e88f85e4..0256cfb5e0 100644 --- a/Documentation/technical/pack-format.txt +++ b/Documentation/technical/pack-format.txt @@ -316,7 +316,20 @@ CHUNK DATA: The OIDs for all objects in the MIDX are stored in lexicographic order in this chunk. - (This section intentionally left incomplete.) + Object Offsets (ID: {'O', 'O', 'F', 'F'}) (N * 8 bytes) + Stores two 4-byte values for every object. + 1: The pack-int-id for the pack storing this object. + 2: The offset within the pack. + If all offsets are less than 2^31, then the large offset chunk + will not exist and offsets are stored as in IDX v1. + If there is at least one offset value larger than 2^32-1, then + the large offset chunk must exist. If the large offset chunk + exists and the 31st bit is on, then removing that bit reveals + the row in the large offsets containing the 8-byte offset of + this object. 
+ + [Optional] Object Large Offsets (ID: {'L', 'O', 'F', 'F'}) + 8-byte offsets into large packfiles. TRAILER: diff --git a/builtin/midx.c b/builtin/midx.c index e1fd0e0de4..607d2b3544 100644 --- a/builtin/midx.c +++ b/builtin/midx.c @@ -39,6 +39,10 @@ static int read_midx_file(const char *object_dir) printf(" oid_fanout"); if (m->chunk_oid_lookup) printf(" oid_lookup"); + if (m->chunk_object_offsets) + printf(" object_offsets"); + if (m->chunk_large_offsets) + printf(" large_offsets"); printf("\nnum_objects: %d\n", m->num_objects); diff --git a/midx.c b/midx.c index 9458ced208..a49300bf75 100644 --- a/midx.c +++ b/midx.c @@ -14,14 +14,19 @@ #define MIDX_HASH_LEN 20 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN) -#define MIDX_MAX_CHUNKS 4 +#define MIDX_MAX_CHUNKS 6 #define MIDX_CHUNK_ALIGNMENT 4 #define MIDX_CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */ #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */ #define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */ #define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */ +#define MIDX_CHUNKID_OBJECTOFFSETS 0x4f4f4646 /* "OOFF" */ +#define MIDX_CHUNKID_LARGEOFFSETS 0x4c4f4646 /* "LOFF" */ #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t)) #define MIDX_CHUNK_FANOUT_SIZE (sizeof(uint32_t) * 256) +#define MIDX_CHUNK_OFFSET_WIDTH (2 * sizeof(uint32_t)) +#define MIDX_CHUNK_LARGE_OFFSET_WIDTH (sizeof(uint64_t)) +#define MIDX_LARGE_OFFSET_NEEDED 0x8000 static char *get_midx_filename(const char *object_dir) { @@ -106,6 +111,14 @@ struct midxed_git *load_midxed_git(const char *object_dir) m->chunk_oid_lookup = m->data + chunk_offset; break; + case MIDX_CHUNKID_OBJECTOFFSETS: + m->chunk_object_offsets = m->data + chunk_offset; + break; + + case MIDX_CHUNKID_LARGEOFFSETS: + m->chunk_large_offsets = m->data + chunk_offset; + break; + case 0: die("terminating MIDX chunk id appears earlier than expected"); break; @@ -127,6 +140,8 @@ struct midxed_git *load_midxed_git(const char *object_dir) die("MIDX missing 
required OID fanout chunk"); if (!m->chunk_oid_lookup) die("MIDX missing required OID lookup chunk"); + if (!m->chunk_object_offsets) + die("MIDX missing required object offsets chunk");
[PATCH 20/23] midx: use midx in approximate_object_count
Signed-off-by: Derrick Stolee
---
 packfile.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/packfile.c b/packfile.c
index 638e113972..059b2aa097 100644
--- a/packfile.c
+++ b/packfile.c
@@ -819,11 +819,14 @@ unsigned long approximate_object_count(void)
 {
 	if (!the_repository->objects->approximate_object_count_valid) {
 		unsigned long count;
+		struct midxed_git *m;
 		struct packed_git *p;
 
 		prepare_packed_git(the_repository);
 		count = 0;
-		for (p = the_repository->objects->packed_git; p; p = p->next) {
+		for (m = get_midxed_git(the_repository); m; m = m->next)
+			count += m->num_objects;
+		for (p = get_packed_git(the_repository); p; p = p->next) {
 			if (open_pack_index(p))
 				continue;
 			count += p->num_objects;
-- 
2.18.0.rc1
[PATCH 23/23] midx: clear midx on repack
If a 'git repack' command replaces existing packfiles, then we must
clear the existing multi-pack-index before moving the packfiles it
references.

Signed-off-by: Derrick Stolee
---
 builtin/repack.c | 8 ++++++++
 midx.c           | 8 ++++++++
 midx.h           | 1 +
 3 files changed, 17 insertions(+)

diff --git a/builtin/repack.c b/builtin/repack.c
index 6c636e159e..66a7d8e8ea 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -8,6 +8,7 @@
 #include "strbuf.h"
 #include "string-list.h"
 #include "argv-array.h"
+#include "midx.h"
 
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
@@ -174,6 +175,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int no_update_server_info = 0;
 	int quiet = 0;
 	int local = 0;
+	int midx_cleared = 0;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -340,6 +342,12 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			continue;
 		}
 
+		if (!midx_cleared) {
+			/* if we move a packfile, it will invalidate the midx */
+			clear_midx_file(get_object_directory());
+			midx_cleared = 1;
+		}
+
 		fname_old = mkpathdup("%s/old-%s%s",
 				packdir, item->string, exts[ext].name);
 		if (file_exists(fname_old))
diff --git a/midx.c b/midx.c
index e46f392fa4..1043c01fa7 100644
--- a/midx.c
+++ b/midx.c
@@ -913,3 +913,11 @@ int write_midx_file(const char *object_dir)
 	FREE_AND_NULL(pack_names);
 	return 0;
 }
+
+void clear_midx_file(const char *object_dir)
+{
+	char *midx = get_midx_filename(object_dir);
+
+	if (remove_path(midx))
+		die(_("failed to clear multi-pack-index at %s"), midx);
+}
diff --git a/midx.h b/midx.h
index 6996b5ff6b..46f9f44c94 100644
--- a/midx.h
+++ b/midx.h
@@ -18,5 +18,6 @@ int midx_contains_pack(struct midxed_git *m, const char *idx_name);
 int prepare_midxed_git_one(struct repository *r, const char *object_dir);
 
 int write_midx_file(const char *object_dir);
+void clear_midx_file(const char *object_dir);
 
 #endif
-- 
2.18.0.rc1
[PATCH 13/23] midx: write object id fanout chunk
Signed-off-by: Derrick Stolee --- Documentation/technical/pack-format.txt | 5 +++ builtin/midx.c | 4 +- midx.c | 53 +++-- object-store.h | 1 + t/t5319-midx.sh | 18 + 5 files changed, 69 insertions(+), 12 deletions(-) diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt index de9ac778b6..77e88f85e4 100644 --- a/Documentation/technical/pack-format.txt +++ b/Documentation/technical/pack-format.txt @@ -307,6 +307,11 @@ CHUNK DATA: name. This is the only chunk not guaranteed to be a multiple of four bytes in length, so should be the last chunk for alignment reasons. + OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes) + The ith entry, F[i], stores the number of OIDs with first + byte at most i. Thus F[255] stores the total + number of objects (N). + OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes) The OIDs for all objects in the MIDX are stored in lexicographic order in this chunk. diff --git a/builtin/midx.c b/builtin/midx.c index 86edd30174..e1fd0e0de4 100644 --- a/builtin/midx.c +++ b/builtin/midx.c @@ -35,10 +35,12 @@ static int read_midx_file(const char *object_dir) printf(" pack_lookup"); if (m->chunk_pack_names) printf(" pack_names"); + if (m->chunk_oid_fanout) + printf(" oid_fanout"); if (m->chunk_oid_lookup) printf(" oid_lookup"); - printf("\n"); + printf("\nnum_objects: %d\n", m->num_objects); printf("packs:\n"); for (i = 0; i < m->num_packs; i++) diff --git a/midx.c b/midx.c index d06bc6876a..9458ced208 100644 --- a/midx.c +++ b/midx.c @@ -14,12 +14,14 @@ #define MIDX_HASH_LEN 20 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN) -#define MIDX_MAX_CHUNKS 3 +#define MIDX_MAX_CHUNKS 4 #define MIDX_CHUNK_ALIGNMENT 4 #define MIDX_CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */ #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */ +#define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */ #define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */ #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t)) 
+#define MIDX_CHUNK_FANOUT_SIZE (sizeof(uint32_t) * 256) static char *get_midx_filename(const char *object_dir) { @@ -96,6 +98,10 @@ struct midxed_git *load_midxed_git(const char *object_dir) m->chunk_pack_names = m->data + chunk_offset; break; + case MIDX_CHUNKID_OIDFANOUT: + m->chunk_oid_fanout = (uint32_t *)(m->data + chunk_offset); + break; + case MIDX_CHUNKID_OIDLOOKUP: m->chunk_oid_lookup = m->data + chunk_offset; break; @@ -117,9 +123,13 @@ struct midxed_git *load_midxed_git(const char *object_dir) die("MIDX missing required pack lookup chunk"); if (!m->chunk_pack_names) die("MIDX missing required pack-name chunk"); + if (!m->chunk_oid_fanout) + die("MIDX missing required OID fanout chunk"); if (!m->chunk_oid_lookup) die("MIDX missing required OID lookup chunk"); + m->num_objects = ntohl(m->chunk_oid_fanout[255]); + m->pack_names = xcalloc(m->num_packs, sizeof(const char *)); for (i = 0; i < m->num_packs; i++) { if (i) { @@ -377,6 +387,35 @@ static size_t write_midx_pack_names(struct hashfile *f, return written; } +static size_t write_midx_oid_fanout(struct hashfile *f, + struct pack_midx_entry *objects, + uint32_t nr_objects) +{ + struct pack_midx_entry *list = objects; + struct pack_midx_entry *last = objects + nr_objects; + uint32_t count = 0; + uint32_t i; + + /* + * Write the first-level table (the list is sorted, + * but we use a 256-entry lookup to be able to avoid + * having to do eight extra binary search iterations). + */ + for (i = 0; i < 256; i++) { + struct pack_midx_entry *next = list; + + while (next < last && next->oid.hash[0] == i) { + count++; + next++; + } + + hashwrite_be32(f, count); + list = next; + } + + return MIDX_CHUNK_FANOUT_SIZE; +} + static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len, struct pack_midx_entry *objects, uint32_t nr_objects) @@ -489,7 +528,7 @@ int write_midx_file(const char *object_dir) FREE_AND_NULL(midx_name);
[PATCH 18/23] midx: use midx in abbreviation calculations
Signed-off-by: Derrick Stolee --- midx.c | 11 midx.h | 3 +++ packfile.c | 6 + packfile.h | 1 + sha1-name.c | 70 + t/t5319-midx.sh | 3 ++- 6 files changed, 93 insertions(+), 1 deletion(-) diff --git a/midx.c b/midx.c index 6eca8f1b12..25d8142c2a 100644 --- a/midx.c +++ b/midx.c @@ -203,6 +203,17 @@ int bsearch_midx(const struct object_id *oid, struct midxed_git *m, uint32_t *re MIDX_HASH_LEN, result); } +struct object_id *nth_midxed_object_oid(struct object_id *oid, + struct midxed_git *m, + uint32_t n) +{ + if (n >= m->num_objects) + return NULL; + + hashcpy(oid->hash, m->chunk_oid_lookup + m->hash_len * n); + return oid; +} + static off_t nth_midxed_offset(struct midxed_git *m, uint32_t pos) { const unsigned char *offset_data; diff --git a/midx.h b/midx.h index 0c66812229..497bdcc77c 100644 --- a/midx.h +++ b/midx.h @@ -9,6 +9,9 @@ struct midxed_git *load_midxed_git(const char *object_dir); int bsearch_midx(const struct object_id *oid, struct midxed_git *m, uint32_t *result); +struct object_id *nth_midxed_object_oid(struct object_id *oid, + struct midxed_git *m, + uint32_t n); int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct midxed_git *m); int prepare_midxed_git_one(struct repository *r, const char *object_dir); diff --git a/packfile.c b/packfile.c index 73f8cc28ee..638e113972 100644 --- a/packfile.c +++ b/packfile.c @@ -919,6 +919,12 @@ struct packed_git *get_packed_git(struct repository *r) return r->objects->packed_git; } +struct midxed_git *get_midxed_git(struct repository *r) +{ + prepare_packed_git(r); + return r->objects->midxed_git; +} + struct list_head *get_packed_git_mru(struct repository *r) { prepare_packed_git(r); diff --git a/packfile.h b/packfile.h index e0a38aba93..01e14b93fd 100644 --- a/packfile.h +++ b/packfile.h @@ -39,6 +39,7 @@ extern void install_packed_git(struct repository *r, struct packed_git *pack); struct packed_git *get_packed_git(struct repository *r); struct list_head *get_packed_git_mru(struct 
repository *r); +struct midxed_git *get_midxed_git(struct repository *r); /* * Give a rough count of objects in the repository. This sacrifices accuracy diff --git a/sha1-name.c b/sha1-name.c index 60d9ef3c7e..d975a186c9 100644 --- a/sha1-name.c +++ b/sha1-name.c @@ -12,6 +12,7 @@ #include "packfile.h" #include "object-store.h" #include "repository.h" +#include "midx.h" static int get_oid_oneline(const char *, struct object_id *, struct commit_list *); @@ -149,6 +150,32 @@ static int match_sha(unsigned len, const unsigned char *a, const unsigned char * return 1; } +static void unique_in_midx(struct midxed_git *m, + struct disambiguate_state *ds) +{ + uint32_t num, i, first = 0; + const struct object_id *current = NULL; + num = m->num_objects; + + if (!num) + return; + + bsearch_midx(>bin_pfx, m, ); + + /* +* At this point, "first" is the location of the lowest object +* with an object name that could match "bin_pfx". See if we have +* 0, 1 or more objects that actually match(es). +*/ + for (i = first; i < num && !ds->ambiguous; i++) { + struct object_id oid; + current = nth_midxed_object_oid(, m, i); + if (!match_sha(ds->len, ds->bin_pfx.hash, current->hash)) + break; + update_candidates(ds, current); + } +} + static void unique_in_pack(struct packed_git *p, struct disambiguate_state *ds) { @@ -177,8 +204,12 @@ static void unique_in_pack(struct packed_git *p, static void find_short_packed_object(struct disambiguate_state *ds) { + struct midxed_git *m; struct packed_git *p; + for (m = get_midxed_git(the_repository); m && !ds->ambiguous; +m = m->next) + unique_in_midx(m, ds); for (p = get_packed_git(the_repository); p && !ds->ambiguous; p = p->next) unique_in_pack(p, ds); @@ -527,6 +558,42 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data) return 0; } +static void find_abbrev_len_for_midx(struct midxed_git *m, +struct min_abbrev_data *mad) +{ + int match = 0; + uint32_t num, first = 0; + struct object_id oid; + const struct object_id 
*mad_oid; + + if (!m->num_objects) + return; + + num = m->num_objects; + mad_oid = mad->oid; + match = bsearch_midx(mad_oid, m, ); + + /* +* first is now the position in
[PATCH 22/23] midx: use midx to find ref-deltas
Signed-off-by: Derrick Stolee
---
 midx.c     |  2 +-
 midx.h     |  1 +
 packfile.c | 15 +++++++++++++++
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 3242646fe0..e46f392fa4 100644
--- a/midx.c
+++ b/midx.c
@@ -214,7 +214,7 @@ struct object_id *nth_midxed_object_oid(struct object_id *oid,
 	return oid;
 }
 
-static off_t nth_midxed_offset(struct midxed_git *m, uint32_t pos)
+off_t nth_midxed_offset(struct midxed_git *m, uint32_t pos)
 {
 	const unsigned char *offset_data;
 	uint32_t offset32;
diff --git a/midx.h b/midx.h
index c1db58d8c4..6996b5ff6b 100644
--- a/midx.h
+++ b/midx.h
@@ -9,6 +9,7 @@ struct midxed_git *load_midxed_git(const char *object_dir);
 
 int bsearch_midx(const struct object_id *oid, struct midxed_git *m, uint32_t *result);
+off_t nth_midxed_offset(struct midxed_git *m, uint32_t n);
 struct object_id *nth_midxed_object_oid(struct object_id *oid,
 					struct midxed_git *m,
 					uint32_t n);
 int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct midxed_git *m);
diff --git a/packfile.c b/packfile.c
index 479cb69b9f..9b814c89c7 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1794,6 +1794,21 @@ off_t find_pack_entry_one(const unsigned char *sha1,
 	uint32_t result;
 
 	if (!index) {
+		/*
+		 * If we have a MIDX, then we want to
+		 * check the MIDX for the offset instead.
+		 */
+		struct midxed_git *m;
+
+		for (m = get_midxed_git(the_repository); m; m = m->next) {
+			if (midx_contains_pack(m, p->pack_name)) {
+				if (bsearch_midx(&oid, m, &result))
+					return nth_midxed_offset(m, result);
+
+				break;
+			}
+		}
+
 		if (open_pack_index(p))
 			return 0;
 	}
-- 
2.18.0.rc1
[PATCH 11/23] midx: sort and deduplicate objects from packfiles
Before writing a list of objects and their offsets to a multi-pack-index (MIDX), we need to collect the list of objects contained in the packfiles. There may be multiple copies of some objects, so this list must be deduplicated. It is possible to artificially get into a state where there are many duplicate copies of objects. That can create high memory pressure if we are to create a list of all objects before de-duplication. To reduce this memory pressure without a significant performance drop, automatically group objects by the first byte of their object id. Use the IDX fanout tables to group the data, copy to a local array, then sort. Copy only the de-duplicated entries. Select the duplicate based on the most-recent modified time of a packfile containing the object. Signed-off-by: Derrick Stolee --- midx.c | 138 + 1 file changed, 138 insertions(+) diff --git a/midx.c b/midx.c index 923acda72e..b20d52713c 100644 --- a/midx.c +++ b/midx.c @@ -4,6 +4,7 @@ #include "csum-file.h" #include "lockfile.h" #include "object-store.h" +#include "packfile.h" #include "midx.h" #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */ @@ -190,6 +191,140 @@ static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *p } } +static uint32_t get_pack_fanout(struct packed_git *p, uint32_t value) +{ + const uint32_t *level1_ofs = p->index_data; + + if (!level1_ofs) { + if (open_pack_index(p)) + return 0; + level1_ofs = p->index_data; + } + + if (p->index_version > 1) { + level1_ofs += 2; + } + + return ntohl(level1_ofs[value]); +} + +struct pack_midx_entry { + struct object_id oid; + uint32_t pack_int_id; + time_t pack_mtime; + uint64_t offset; +}; + +static int midx_oid_compare(const void *_a, const void *_b) +{ + struct pack_midx_entry *a = (struct pack_midx_entry *)_a; + struct pack_midx_entry *b = (struct pack_midx_entry *)_b; + int cmp = oidcmp(>oid, >oid); + + if (cmp) + return cmp; + + if (a->pack_mtime > b->pack_mtime) + return -1; + else if (a->pack_mtime < 
b->pack_mtime) + return 1; + + return a->pack_int_id - b->pack_int_id; +} + +static void fill_pack_entry(uint32_t pack_int_id, + struct packed_git *p, + uint32_t cur_object, + struct pack_midx_entry *entry) +{ + if (!nth_packed_object_oid(>oid, p, cur_object)) + die("failed to located object %d in packfile", cur_object); + + entry->pack_int_id = pack_int_id; + entry->pack_mtime = p->mtime; + + entry->offset = nth_packed_object_offset(p, cur_object); +} + +/* + * It is possible to artificially get into a state where there are many + * duplicate copies of objects. That can create high memory pressure if + * we are to create a list of all objects before de-duplication. To reduce + * this memory pressure without a significant performance drop, automatically + * group objects by the first byte of their object id. Use the IDX fanout + * tables to group the data, copy to a local array, then sort. + * + * Copy only the de-duplicated entries (selected by most-recent modified time + * of a packfile containing the object). + */ +static struct pack_midx_entry *get_sorted_entries(struct packed_git **p, + uint32_t *perm, + uint32_t nr_packs, + uint32_t *nr_objects) +{ + uint32_t cur_fanout, cur_pack, cur_object; + uint32_t nr_fanout, alloc_fanout, alloc_objects, total_objects = 0; + struct pack_midx_entry *entries_by_fanout = NULL; + struct pack_midx_entry *deduplicated_entries = NULL; + + for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) { + if (open_pack_index(p[cur_pack])) + continue; + + total_objects += p[cur_pack]->num_objects; + } + + /* +* As we de-duplicate by fanout value, we expect the fanout +* slices to be evenly distributed, with some noise. Hence, +* allocate slightly more than one 256th. +*/ + alloc_objects = alloc_fanout = total_objects > 3200 ? 
total_objects / 200 : 16; + + ALLOC_ARRAY(entries_by_fanout, alloc_fanout); + ALLOC_ARRAY(deduplicated_entries, alloc_objects); + *nr_objects = 0; + + for (cur_fanout = 0; cur_fanout < 256; cur_fanout++) { + nr_fanout = 0; + + for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) { + uint32_t start = 0, end; + + if (cur_fanout) + start = get_pack_fanout(p[cur_pack], cur_fanout - 1); + end =
[PATCH 10/23] midx: write a lookup into the pack names chunk
Signed-off-by: Derrick Stolee --- Documentation/technical/pack-format.txt | 5 +++ builtin/midx.c | 7 midx.c | 56 +++-- object-store.h | 2 + t/t5319-midx.sh | 11 +++-- 5 files changed, 75 insertions(+), 6 deletions(-) diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt index 2b37be7b33..29bf87283a 100644 --- a/Documentation/technical/pack-format.txt +++ b/Documentation/technical/pack-format.txt @@ -296,6 +296,11 @@ CHUNK LOOKUP: CHUNK DATA: + Packfile Name Lookup (ID: {'P', 'L', 'O', 'O'}) (P * 4 bytes) + P * 4 bytes storing the offset in the packfile name chunk for + the null-terminated string containing the filename for the + ith packfile. + Packfile Names (ID: {'P', 'N', 'A', 'M'}) Stores the packfile names as concatenated, null-terminated strings. Packfiles must be listed in lexicographic order for fast lookups by diff --git a/builtin/midx.c b/builtin/midx.c index fe56560853..3a261e9bbf 100644 --- a/builtin/midx.c +++ b/builtin/midx.c @@ -16,6 +16,7 @@ static struct opts_midx { static int read_midx_file(const char *object_dir) { + uint32_t i; struct midxed_git *m = load_midxed_git(object_dir); if (!m) @@ -30,11 +31,17 @@ static int read_midx_file(const char *object_dir) printf("chunks:"); + if (m->chunk_pack_lookup) + printf(" pack_lookup"); if (m->chunk_pack_names) printf(" pack_names"); printf("\n"); + printf("packs:\n"); + for (i = 0; i < m->num_packs; i++) + printf("%s\n", m->pack_names[i]); + printf("object_dir: %s\n", m->object_dir); return 0; diff --git a/midx.c b/midx.c index d4f4a01a51..923acda72e 100644 --- a/midx.c +++ b/midx.c @@ -13,8 +13,9 @@ #define MIDX_HASH_LEN 20 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN) -#define MIDX_MAX_CHUNKS 1 +#define MIDX_MAX_CHUNKS 2 #define MIDX_CHUNK_ALIGNMENT 4 +#define MIDX_CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */ #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */ #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t)) @@ -85,6 +86,10 @@ 
struct midxed_git *load_midxed_git(const char *object_dir) uint64_t chunk_offset = get_be64(m->data + 16 + MIDX_CHUNKLOOKUP_WIDTH * i); switch (chunk_id) { + case MIDX_CHUNKID_PACKLOOKUP: + m->chunk_pack_lookup = (uint32_t *)(m->data + chunk_offset); + break; + case MIDX_CHUNKID_PACKNAMES: m->chunk_pack_names = m->data + chunk_offset; break; @@ -102,9 +107,32 @@ struct midxed_git *load_midxed_git(const char *object_dir) } } + if (!m->chunk_pack_lookup) + die("MIDX missing required pack lookup chunk"); if (!m->chunk_pack_names) die("MIDX missing required pack-name chunk"); + m->pack_names = xcalloc(m->num_packs, sizeof(const char *)); + for (i = 0; i < m->num_packs; i++) { + if (i) { + if (ntohl(m->chunk_pack_lookup[i]) <= ntohl(m->chunk_pack_lookup[i - 1])) { + error("MIDX pack lookup value %d before %d", + ntohl(m->chunk_pack_lookup[i - 1]), + ntohl(m->chunk_pack_lookup[i])); + goto cleanup_fail; + } + } + + m->pack_names[i] = (const char *)(m->chunk_pack_names + ntohl(m->chunk_pack_lookup[i])); + + if (i && strcmp(m->pack_names[i], m->pack_names[i - 1]) <= 0) { + error("MIDX pack names out of order: '%s' before '%s'", + m->pack_names[i - 1], + m->pack_names[i]); + goto cleanup_fail; + } + } + return m; cleanup_fail: @@ -162,6 +190,20 @@ static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *p } } +static size_t write_midx_pack_lookup(struct hashfile *f, +char **pack_names, +uint32_t nr_packs) +{ + uint32_t i, cur_len = 0; + + for (i = 0; i < nr_packs; i++) { + hashwrite_be32(f, cur_len); + cur_len += strlen(pack_names[i]) + 1; + } + + return sizeof(uint32_t) * (size_t)nr_packs; +} + static size_t write_midx_pack_names(struct hashfile *f, char **pack_names, uint32_t num_packs) @@ -275,13 +317,17 @@ int
[PATCH 04/23] midx: add 'write' subcommand and basic wiring
In anticipation of writing multi-pack-indexes (MIDX files), add a
'git midx write' subcommand and send the options to a
write_midx_file() method. Also create a basic test file that tests
the 'write' subcommand.

Signed-off-by: Derrick Stolee
---
 Documentation/git-midx.txt | 22 +++++++++++++++++++++-
 Makefile                   |  1 +
 builtin/midx.c             |  9 ++++++++-
 midx.c                     |  9 +++++++++
 midx.h                     |  4 ++++
 t/t5319-midx.sh            | 10 ++++++++++
 6 files changed, 53 insertions(+), 2 deletions(-)
 create mode 100644 midx.c
 create mode 100644 midx.h
 create mode 100755 t/t5319-midx.sh

diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
index 2bd886f1a2..dcaeb1a91b 100644
--- a/Documentation/git-midx.txt
+++ b/Documentation/git-midx.txt
@@ -9,7 +9,7 @@ git-midx - Write and verify multi-pack-indexes (MIDX files).
 SYNOPSIS
 --------
 [verse]
-'git midx' [--object-dir <dir>]
+'git midx' [--object-dir <dir>] <verb>
 
 DESCRIPTION
 -----------
@@ -23,6 +23,26 @@ OPTIONS
 	<dir>/packs/multi-pack-index for the current MIDX file, and
 	<dir>/packs for the pack-files to index.
 
+write::
+	When given as the verb, write a new MIDX file to
+	<dir>/packs/multi-pack-index.
+
+
+EXAMPLES
+--------
+
+* Write a MIDX file for the packfiles in the current .git folder.
++
+-----------------------------------------------
+$ git midx write
+-----------------------------------------------
+
+* Write a MIDX file for the packfiles in an alternate.
++
+-----------------------------------------------
+$ git midx --object-dir <alt> write
+-----------------------------------------------
+
 GIT
 ---
diff --git a/Makefile b/Makefile
index 88958c7b42..aa86fcd8ec 100644
--- a/Makefile
+++ b/Makefile
@@ -890,6 +890,7 @@ LIB_OBJS += merge.o
 LIB_OBJS += merge-blobs.o
 LIB_OBJS += merge-recursive.o
 LIB_OBJS += mergesort.o
+LIB_OBJS += midx.o
 LIB_OBJS += name-hash.o
 LIB_OBJS += notes.o
 LIB_OBJS += notes-cache.o
diff --git a/builtin/midx.c b/builtin/midx.c
index 59ea92178f..dc0a5acd3f 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -3,9 +3,10 @@
 #include "config.h"
 #include "git-compat-util.h"
 #include "parse-options.h"
+#include "midx.h"
 
 static char const * const builtin_midx_usage[] = {
-	N_("git midx [--object-dir <dir>]"),
+	N_("git midx [--object-dir <dir>] [write]"),
 	NULL
 };
@@ -34,5 +35,11 @@ int cmd_midx(int argc, const char **argv, const char *prefix)
 	if (!opts.object_dir)
 		opts.object_dir = get_object_directory();
 
+	if (argc == 0)
+		return 0;
+
+	if (!strcmp(argv[0], "write"))
+		return write_midx_file(opts.object_dir);
+
 	return 0;
 }
diff --git a/midx.c b/midx.c
new file mode 100644
index 0000000000..616af66b13
--- /dev/null
+++ b/midx.c
@@ -0,0 +1,9 @@
+#include "git-compat-util.h"
+#include "cache.h"
+#include "dir.h"
+#include "midx.h"
+
+int write_midx_file(const char *object_dir)
+{
+	return 0;
+}
diff --git a/midx.h b/midx.h
new file mode 100644
index 0000000000..3a63673952
--- /dev/null
+++ b/midx.h
@@ -0,0 +1,4 @@
+#include "cache.h"
+#include "packfile.h"
+
+int write_midx_file(const char *object_dir);
diff --git a/t/t5319-midx.sh b/t/t5319-midx.sh
new file mode 100755
index 0000000000..a590137af7
--- /dev/null
+++ b/t/t5319-midx.sh
@@ -0,0 +1,10 @@
+#!/bin/sh
+
+test_description='multi-pack-indexes'
+. ./test-lib.sh
+
+test_expect_success 'write midx with no packs' '
+	git midx --object-dir=. write
+'
+
+test_done
-- 
2.18.0.rc1
[PATCH 12/23] midx: write object ids in a chunk
Signed-off-by: Derrick Stolee --- Documentation/technical/pack-format.txt | 4 ++ builtin/midx.c | 2 + midx.c | 50 +++-- object-store.h | 1 + t/t5319-midx.sh | 4 +- 5 files changed, 55 insertions(+), 6 deletions(-) diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt index 29bf87283a..de9ac778b6 100644 --- a/Documentation/technical/pack-format.txt +++ b/Documentation/technical/pack-format.txt @@ -307,6 +307,10 @@ CHUNK DATA: name. This is the only chunk not guaranteed to be a multiple of four bytes in length, so should be the last chunk for alignment reasons. + OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes) + The OIDs for all objects in the MIDX are stored in lexicographic + order in this chunk. + (This section intentionally left incomplete.) TRAILER: diff --git a/builtin/midx.c b/builtin/midx.c index 3a261e9bbf..86edd30174 100644 --- a/builtin/midx.c +++ b/builtin/midx.c @@ -35,6 +35,8 @@ static int read_midx_file(const char *object_dir) printf(" pack_lookup"); if (m->chunk_pack_names) printf(" pack_names"); + if (m->chunk_oid_lookup) + printf(" oid_lookup"); printf("\n"); diff --git a/midx.c b/midx.c index b20d52713c..d06bc6876a 100644 --- a/midx.c +++ b/midx.c @@ -14,10 +14,11 @@ #define MIDX_HASH_LEN 20 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN) -#define MIDX_MAX_CHUNKS 2 +#define MIDX_MAX_CHUNKS 3 #define MIDX_CHUNK_ALIGNMENT 4 #define MIDX_CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */ #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */ +#define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */ #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t)) static char *get_midx_filename(const char *object_dir) @@ -95,6 +96,10 @@ struct midxed_git *load_midxed_git(const char *object_dir) m->chunk_pack_names = m->data + chunk_offset; break; + case MIDX_CHUNKID_OIDLOOKUP: + m->chunk_oid_lookup = m->data + chunk_offset; + break; + case 0: die("terminating MIDX chunk id appears earlier than expected"); 
break; @@ -112,6 +117,8 @@ struct midxed_git *load_midxed_git(const char *object_dir) die("MIDX missing required pack lookup chunk"); if (!m->chunk_pack_names) die("MIDX missing required pack-name chunk"); + if (!m->chunk_oid_lookup) + die("MIDX missing required OID lookup chunk"); m->pack_names = xcalloc(m->num_packs, sizeof(const char *)); for (i = 0; i < m->num_packs; i++) { @@ -370,6 +377,32 @@ static size_t write_midx_pack_names(struct hashfile *f, return written; } +static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len, + struct pack_midx_entry *objects, + uint32_t nr_objects) +{ + struct pack_midx_entry *list = objects; + uint32_t i; + size_t written = 0; + + for (i = 0; i < nr_objects; i++) { + struct pack_midx_entry *obj = list++; + + if (i < nr_objects - 1) { + struct pack_midx_entry *next = list; + if (oidcmp(>oid, >oid) >= 0) + BUG("OIDs not in order: %s >= %s", + oid_to_hex(>oid), + oid_to_hex(>oid)); + } + + hashwrite(f, obj->oid.hash, (int)hash_len); + written += hash_len; + } + + return written; +} + int write_midx_file(const char *object_dir) { unsigned char cur_chunk, num_chunks = 0; @@ -389,6 +422,7 @@ int write_midx_file(const char *object_dir) uint64_t written = 0; uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1]; uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1]; + struct pack_midx_entry *entries; uint32_t nr_entries; midx_name = get_midx_filename(object_dir); @@ -448,14 +482,14 @@ int write_midx_file(const char *object_dir) ALLOC_ARRAY(pack_perm, nr_packs); sort_packs_by_name(pack_names, nr_packs, pack_perm); - get_sorted_entries(packs, pack_perm, nr_packs, _entries); + entries = get_sorted_entries(packs, pack_perm, nr_packs, _entries); hold_lock_file_for_update(, midx_name, LOCK_DIE_ON_ERROR); f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf); FREE_AND_NULL(midx_name); cur_chunk = 0; - num_chunks = 2; + num_chunks = 3; written = write_midx_header(f, num_chunks,
[PATCH 00/23] Multi-pack-index (MIDX)
This patch series includes a rewrite of the previous multi-pack-index
RFC [1] using the feedback from the commit-graph feature. I based this
series on 'next' as it requires the recent object-store patches.

The multi-pack-index (MIDX) is explained fully in the design document
'Documentation/technical/midx.txt'. The short description is that the
MIDX stores the information from all of the IDX files in a pack
directory. The crucial design decision is that the IDX files still
exist, so we can fall back to the IDX files if there is any issue with
the MIDX (or core.midx is set to false, or a user downgrades Git,
etc.).

The MIDX feature has been part of our GVFS releases for a few months
(since the RFC). It has behaved well, indexing over 31 million commits
and trees across up to 250 packfiles. These MIDX files are nearly 1GB
in size and take ~20 seconds to rewrite when adding new IDX
information. This ~20s mark is something I'd like to improve, and I
mention how to make the file incremental (similar to split-index) in
the design document. I also want to make the commit-graph file
incremental, so I'd like to do that at the same time after both the
MIDX and commit-graph are stable.

Lookup Speedups
---------------

When looking for an object, Git uses a most-recently-used (MRU) cache
of packfiles. This does pretty well to minimize the number of misses
when searching through packfiles for an object, especially if there is
one "big" packfile that contains most of the objects (so it will
rarely miss and is usually one of the first two packfiles in the
list).

The MIDX does provide a way to remove these misses, improving lookup
time. However, this lookup time greatly depends on the arrangement of
the packfiles.
For instance, if you take the Linux repository and repack using `git
repack -adfF --max-pack-size=128m`, then all commits will be in one
packfile, and all trees will be in a small set of packfiles organized
well enough that 'git rev-list --objects HEAD^{tree}' only inspects
one or two packfiles.

GVFS has the notion of a "prefetch packfile". These are packfiles that
are precomputed by cache servers to contain the commits and trees
introduced to the remote each day. GVFS downloads these packfiles and
places them in an alternate. Since these are organized by "first time
introduced" and the working directory is so large, the MRU misses are
significant when performing a checkout and updating the .git/index
file.

To test the performance in this situation, I created a script that
organizes the Linux repository in a similar fashion. I split the
commit history into 50 parts by creating branches on every 10,000
commits of the first-parent history. Then, `git rev-list --objects A
^B` provides the list of objects reachable from A but not B, so I
could send that to `git pack-objects` to create these "time-based"
packfiles.

With these 50 packfiles (deleting the old one from my fresh clone, and
deleting all tags as they were no longer on-disk) I could then test
'git rev-list --objects HEAD^{tree}' and see:

    Before: 0.17s
    After:  0.13s
    % Diff: -23.5%

By adding logic to count hits and misses to bsearch_pack, I was able
to see that the command above calls that method 266,930 times with a
hit rate of 33%. The MIDX has the same number of calls with a 100% hit
rate.

Abbreviation Speedups
---------------------

To fully disambiguate an abbreviation, we must iterate through all
packfiles to ensure no collision exists in any packfile. This requires
O(P log N) time. With the MIDX, this is only O(log N) time.

Our standard test [2] is 'git log --oneline --parents --raw' because
it writes many abbreviations while also doing a lot of other work
(walking commits and trees to compute the raw diff).
For a copy of the Linux repository with 50 packfiles split by time, we
observed the following:

    Before: 100.5 s
    After:   58.2 s
    % Diff: -42.1%

Request for Review Attention
----------------------------

I tried my best to take the feedback from the commit-graph feature and
apply it to this feature. I also worked to follow the object-store
refactoring as best I could.

I also have some local commits that create a 'verify' subcommand and
integrate with 'fsck' similar to the commit-graph, but I'll leave
those for a later series (and review is still underway for that part
of the commit-graph).

One place where I could use some guidance is related to the current
state of 'the_hash_algo' patches. The file format allows a different
"hash version", which then indicates the length of the hash. What's
the best way to ensure this feature doesn't cause extra pain in the
hash-agnostic series? This will inform how I go back and make the
commit-graph feature better in this area, too.

Thanks,
-Stolee

[1] https://public-inbox.org/git/20180107181459.222909-1-dsto...@microsoft.com/T/#u
    Previous MIDX RFC.

[2] https://public-inbox.org/git/20171012120220.226427-1-dsto...@microsoft.com/
    A patch series on
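The O(P log N) versus O(log N) difference can be sketched in Python
(a toy model with short hex strings standing in for full object IDs,
not Git's implementation): without a MIDX, *every* pack-index must be
range-searched to prove an abbreviation unique, while the merged MIDX
list needs only one range search.

```python
import bisect

def count_with_prefix(sorted_oids, abbrev):
    """Count objects in one sorted index whose hex OID starts with abbrev."""
    lo = bisect.bisect_left(sorted_oids, abbrev)
    hi = bisect.bisect_left(sorted_oids, abbrev + "g")  # 'g' sorts after any hex digit
    return hi - lo

def is_unique_abbrev(packs, abbrev):
    """Without a MIDX: every pack index is consulted -- O(P log N)."""
    return sum(count_with_prefix(p, abbrev) for p in packs) == 1

def is_unique_abbrev_midx(midx, abbrev):
    """With a MIDX: one range search over the merged list -- O(log N)."""
    return count_with_prefix(midx, abbrev) == 1

packs = [sorted(["1234abcd", "deadbeef"]), sorted(["12ffffff", "cafebabe"])]
midx = sorted(oid for p in packs for oid in p)

assert not is_unique_abbrev(packs, "12")    # collides across two packs
assert is_unique_abbrev(packs, "1234")
assert not is_unique_abbrev_midx(midx, "12")
assert is_unique_abbrev_midx(midx, "1234")
```

Note that the collision between "1234abcd" and "12ffffff" can only be
detected cross-pack, which is why every pack-index must normally be
searched.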
[PATCH 03/23] midx: add midx builtin
This new 'git midx' builtin will be the plumbing access for writing,
reading, and checking multi-pack-index (MIDX) files. The initial
implementation is a no-op.

Signed-off-by: Derrick Stolee
---
 .gitignore                 |  1 +
 Documentation/git-midx.txt | 29 +
 Makefile                   |  1 +
 builtin.h                  |  1 +
 builtin/midx.c             | 38 ++
 command-list.txt           |  1 +
 git.c                      |  1 +
 7 files changed, 72 insertions(+)
 create mode 100644 Documentation/git-midx.txt
 create mode 100644 builtin/midx.c

diff --git a/.gitignore b/.gitignore
index 388cc4beee..e309644d6b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -97,6 +97,7 @@
 /git-merge-subtree
 /git-mergetool
 /git-mergetool--lib
+/git-midx
 /git-mktag
 /git-mktree
 /git-name-rev
diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
new file mode 100644
index 00..2bd886f1a2
--- /dev/null
+++ b/Documentation/git-midx.txt
@@ -0,0 +1,29 @@
+git-midx(1)
+===========
+
+NAME
+----
+git-midx - Write and verify multi-pack-indexes (MIDX files).
+
+SYNOPSIS
+--------
+[verse]
+'git midx' [--object-dir <dir>]
+
+DESCRIPTION
+-----------
+Write or verify a MIDX file.
+
+OPTIONS
+-------
+
+--object-dir <dir>::
+	Use given directory for the location of Git objects. We check
+	<dir>/packs/multi-pack-index for the current MIDX file, and
+	<dir>/packs for the pack-files to index.
+
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Makefile b/Makefile
index 1d27f36365..88958c7b42 100644
--- a/Makefile
+++ b/Makefile
@@ -1045,6 +1045,7 @@ BUILTIN_OBJS += builtin/merge-index.o
 BUILTIN_OBJS += builtin/merge-ours.o
 BUILTIN_OBJS += builtin/merge-recursive.o
 BUILTIN_OBJS += builtin/merge-tree.o
+BUILTIN_OBJS += builtin/midx.o
 BUILTIN_OBJS += builtin/mktag.o
 BUILTIN_OBJS += builtin/mktree.o
 BUILTIN_OBJS += builtin/mv.o
diff --git a/builtin.h b/builtin.h
index 4e0f64723e..7b5bd46c7d 100644
--- a/builtin.h
+++ b/builtin.h
@@ -189,6 +189,7 @@ extern int cmd_merge_ours(int argc, const char **argv, const char *prefix);
 extern int cmd_merge_file(int argc, const char **argv, const char *prefix);
 extern int cmd_merge_recursive(int argc, const char **argv, const char *prefix);
 extern int cmd_merge_tree(int argc, const char **argv, const char *prefix);
+extern int cmd_midx(int argc, const char **argv, const char *prefix);
 extern int cmd_mktag(int argc, const char **argv, const char *prefix);
 extern int cmd_mktree(int argc, const char **argv, const char *prefix);
 extern int cmd_mv(int argc, const char **argv, const char *prefix);
diff --git a/builtin/midx.c b/builtin/midx.c
new file mode 100644
index 00..59ea92178f
--- /dev/null
+++ b/builtin/midx.c
@@ -0,0 +1,38 @@
+#include "builtin.h"
+#include "cache.h"
+#include "config.h"
+#include "git-compat-util.h"
+#include "parse-options.h"
+
+static char const * const builtin_midx_usage[] = {
+	N_("git midx [--object-dir <dir>]"),
+	NULL
+};
+
+static struct opts_midx {
+	const char *object_dir;
+} opts;
+
+int cmd_midx(int argc, const char **argv, const char *prefix)
+{
+	static struct option builtin_midx_options[] = {
+		{ OPTION_STRING, 0, "object-dir", &opts.object_dir,
+		  N_("dir"),
+		  N_("The object directory containing the set of packfile and pack-index pairs.") },
+		OPT_END(),
+	};
+
+	if (argc == 2 && !strcmp(argv[1], "-h"))
+		usage_with_options(builtin_midx_usage, builtin_midx_options);
+
+	git_config(git_default_config, NULL);
+
+	argc = parse_options(argc, argv, prefix,
+			     builtin_midx_options,
+			     builtin_midx_usage, 0);
+
+	if (!opts.object_dir)
+		opts.object_dir = get_object_directory();
+
+	return 0;
+}
diff --git a/command-list.txt b/command-list.txt
index e1c26c1bb7..a21bd7470e 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -123,6 +123,7 @@ git-merge-index plumbingmanipulators
 git-merge-one-file purehelpers
 git-mergetool ancillarymanipulators complete
 git-merge-tree ancillaryinterrogators
+git-midx plumbingmanipulators
 git-mktag plumbingmanipulators
 git-mktree plumbingmanipulators
 git-mv mainporcelain worktree
diff --git a/git.c b/git.c
index c2f48d53dd..400fadd677 100644
--- a/git.c
+++ b/git.c
@@ -503,6 +503,7 @@ static struct cmd_struct commands[] = {
 	{ "merge-recursive-theirs", cmd_merge_recursive, RUN_SETUP | NEED_WORK_TREE | NO_PARSEOPT },
 	{ "merge-subtree", cmd_merge_recursive, RUN_SETUP | NEED_WORK_TREE | NO_PARSEOPT },
 	{ "merge-tree", cmd_merge_tree, RUN_SETUP | NO_PARSEOPT },
+	{ "midx",
[PATCH 01/23] midx: add design document
Signed-off-by: Derrick Stolee
---
 Documentation/technical/midx.txt | 109 +++
 1 file changed, 109 insertions(+)
 create mode 100644 Documentation/technical/midx.txt

diff --git a/Documentation/technical/midx.txt b/Documentation/technical/midx.txt
new file mode 100644
index 00..789f410d71
--- /dev/null
+++ b/Documentation/technical/midx.txt
@@ -0,0 +1,109 @@
+Multi-Pack-Index (MIDX) Design Notes
+====================================
+
+The Git object directory contains a 'pack' directory containing
+packfiles (with suffix ".pack") and pack-indexes (with suffix
+".idx"). The pack-indexes provide a way to look up objects and
+navigate to their offset within the pack, but these must come
+in pairs with the packfiles. This pairing depends on the file
+names, as the pack-index differs only in suffix from its
+packfile. While the pack-indexes provide fast lookup per packfile,
+this performance degrades as the number of packfiles increases,
+because abbreviations need to inspect every packfile and we are
+more likely to have a miss on our most-recently-used packfile.
+For some large repositories, repacking into a single packfile
+is not feasible due to storage space or excessive repack times.
+
+The multi-pack-index (MIDX for short) stores a list of objects
+and their offsets into multiple packfiles. It contains:
+
+- A list of packfile names.
+- A sorted list of object IDs.
+- A list of metadata for the ith object ID including:
+  - A value j referring to the jth packfile.
+  - An offset within the jth packfile for the object.
+- If large offsets are required, we use another list of large
+  offsets similar to version 2 pack-indexes.
+
+Thus, we can provide O(log N) lookup time for any number
+of packfiles.
+
+Design Details
+--------------
+
+- The MIDX is stored in a file named 'multi-pack-index' in the
+  .git/objects/pack directory. This could be stored in the pack
+  directory of an alternate. It refers only to packfiles in that
+  same directory.
+
+- The core.midx config setting must be on to consume MIDX files.
+
+- The file format includes parameters for the object ID hash
+  function, so a future change of hash algorithm does not require
+  a change in format.
+
+- The MIDX keeps only one record per object ID. If an object appears
+  in multiple packfiles, then the MIDX selects the copy in the
+  most-recently modified packfile.
+
+- If there exist packfiles in the pack directory not registered in
+  the MIDX, then those packfiles are loaded into the `packed_git`
+  list and `packed_git_mru` cache.
+
+- The pack-indexes (.idx files) remain in the pack directory so we
+  can delete the MIDX file, set core.midx to false, or downgrade
+  without any loss of information.
+
+- The MIDX file format uses a chunk-based approach (similar to the
+  commit-graph file) that allows optional data to be added.
+
+Future Work
+-----------
+
+- Add a 'verify' subcommand to the 'git midx' builtin to verify that
+  the contents of the multi-pack-index file match the offsets listed
+  in the corresponding pack-indexes.
+
+- The multi-pack-index allows many packfiles, especially in a context
+  where repacking is expensive (such as a very large repo), or
+  unexpected maintenance time is unacceptable (such as a high-demand
+  build machine). However, the multi-pack-index needs to be rewritten
+  in full every time. We can extend the format to be incremental, so
+  writes are fast. By storing a small "tip" multi-pack-index that
+  points to large "base" MIDX files, we can keep writes fast while
+  still reducing the number of binary searches required for object
+  lookups.
+
+- The reachability bitmap is currently paired directly with a single
+  packfile, using the pack-order as the object order to hopefully
+  compress the bitmaps well using run-length encoding. This could be
+  extended to pair a reachability bitmap with a multi-pack-index.
+  If the multi-pack-index is extended to store a "stable object order"
+  (a function Order(hash) = integer that is constant for a given hash,
+  even as the multi-pack-index is updated), then a reachability bitmap
+  could point to a multi-pack-index and be updated independently.
+
+- Packfiles can be marked as "special" using empty files that share
+  the initial name but replace ".pack" with ".keep" or ".promisor".
+  We can add an optional chunk of data to the multi-pack-index that
+  records flags of information about the packfiles. This allows new
+  states, such as 'repacked' or 'redeltified', that can help with
+  pack maintenance in a multi-pack environment. It may also be
+  helpful to organize packfiles by object type (commit, tree, blob,
+  etc.) and use this metadata to help that maintenance.
+
+- The partial clone feature records special "promisor" packs that
+  may point to objects that are not stored locally, but are available
+  on request from a server. The multi-pack-index does not currently
+  track these promisor packs.
+
+Related
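The chunk-based layout mentioned in the design notes (shared with the
commit-graph file) uses a lookup table of (4-byte chunk id, 8-byte file
offset) pairs, terminated by a zero id. A minimal Python sketch of
decoding such a table (illustrative; the byte layout follows the
commit-graph convention, and the on-disk MIDX header fields are not
modeled here):

```python
import struct

CHUNKLOOKUP_WIDTH = 4 + 8   # 4-byte chunk id + 8-byte file offset, big-endian

def parse_chunk_table(data, num_chunks, table_start):
    """Read big-endian (id, offset) pairs; a chunk id of 0 terminates the table."""
    chunks = {}
    for i in range(num_chunks):
        pos = table_start + i * CHUNKLOOKUP_WIDTH
        chunk_id, offset = struct.unpack_from(">IQ", data, pos)
        if chunk_id == 0:
            break               # terminating label reached
        chunks[chunk_id] = offset
    return chunks

# "PNAM" is the pack-name chunk id; unknown ids would simply be skipped
# by a reader, which is what makes chunks optional and extensible.
PNAM = int.from_bytes(b"PNAM", "big")   # 0x504e414d
table = struct.pack(">IQ", PNAM, 64) + struct.pack(">IQ", 0, 128)
chunks = parse_chunk_table(table, 2, 0)
assert chunks == {PNAM: 64}
```

A reader that ignores unrecognized chunk ids can consume files written
by newer versions, which is the extensibility property the design notes
rely on.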
[PATCH 09/23] midx: write pack names in chunk
The multi-pack-index (MIDX) needs to track which pack-files are
covered by the MIDX file. Store these in our first required chunk.
Since filenames are not well structured, add padding to keep good
alignment in later chunks.

Modify the 'git midx read' subcommand to output the existence of the
pack-file name chunk. Modify t5319-midx.sh to reflect this new output
and the new expected number of chunks.

Defense in depth: A pattern we are using in the multi-pack-index
feature is to verify the data as we write it. We want to ensure we
never write invalid data to the multi-pack-index. There are many
checks during the write of a MIDX file that double-check that the
values we are writing fit the format definitions. If any value is
incorrect, then we notice before writing invalid data. This mainly
helps developers while working on the feature, but it can also
identify issues that only appear when dealing with very large data
sets. These large sets are hard to encode into test cases.

Signed-off-by: Derrick Stolee
---
 Documentation/technical/pack-format.txt |   6 +
 builtin/midx.c                          |   7 +
 midx.c                                  | 176 +++-
 object-store.h                          |   2 +
 t/t5319-midx.sh                         |   3 +-
 5 files changed, 188 insertions(+), 6 deletions(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 17666b4bfc..2b37be7b33 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -296,6 +296,12 @@ CHUNK LOOKUP:
 
 CHUNK DATA:
 
+	Packfile Names (ID: {'P', 'N', 'A', 'M'})
+	    Stores the packfile names as concatenated, null-terminated strings.
+	    Packfiles must be listed in lexicographic order for fast lookups by
+	    name. This is the only chunk not guaranteed to be a multiple of four
+	    bytes in length, so it should be the last chunk for alignment reasons.
+
 	(This section intentionally left incomplete.)

TRAILER:

diff --git a/builtin/midx.c b/builtin/midx.c
index c7002f664a..fe56560853 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -28,6 +28,13 @@ static int read_midx_file(const char *object_dir)
 		m->num_chunks,
 		m->num_packs);
 
+	printf("chunks:");
+
+	if (m->chunk_pack_names)
+		printf(" pack_names");
+
+	printf("\n");
+
 	printf("object_dir: %s\n", m->object_dir);
 
 	return 0;
diff --git a/midx.c b/midx.c
index 9fb89c80a2..d4f4a01a51 100644
--- a/midx.c
+++ b/midx.c
@@ -13,6 +13,11 @@
 #define MIDX_HASH_LEN 20
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
 
+#define MIDX_MAX_CHUNKS 1
+#define MIDX_CHUNK_ALIGNMENT 4
+#define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
+#define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
+
 static char *get_midx_filename(const char *object_dir)
 {
 	struct strbuf midx_name = STRBUF_INIT;
@@ -29,6 +34,7 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 	size_t midx_size;
 	void *midx_map;
 	const char *midx_name = get_midx_filename(object_dir);
+	uint32_t i;
 
 	fd = git_open(midx_name);
 	if (fd < 0)
@@ -74,6 +80,31 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 	m->num_chunks = *(m->data + 6);
 	m->num_packs = get_be32(m->data + 8);
 
+	for (i = 0; i < m->num_chunks; i++) {
+		uint32_t chunk_id = get_be32(m->data + 12 + MIDX_CHUNKLOOKUP_WIDTH * i);
+		uint64_t chunk_offset = get_be64(m->data + 16 + MIDX_CHUNKLOOKUP_WIDTH * i);
+
+		switch (chunk_id) {
+		case MIDX_CHUNKID_PACKNAMES:
+			m->chunk_pack_names = m->data + chunk_offset;
+			break;
+
+		case 0:
+			die("terminating MIDX chunk id appears earlier than expected");
+			break;
+
+		default:
+			/*
+			 * Do nothing on unrecognized chunks, allowing future
+			 * extensions to add optional chunks.
+			 */
+			break;
+		}
+	}
+
+	if (!m->chunk_pack_names)
+		die("MIDX missing required pack-name chunk");
+
 	return m;
 
 cleanup_fail:
@@ -99,18 +130,88 @@ static size_t write_midx_header(struct hashfile *f,
 	return MIDX_HEADER_SIZE;
 }
 
+struct pack_pair {
+	uint32_t pack_int_id;
+	char *pack_name;
+};
+
+static int pack_pair_compare(const void *_a, const void *_b)
+{
+	struct pack_pair *a = (struct pack_pair *)_a;
+	struct pack_pair *b = (struct pack_pair *)_b;
+	return strcmp(a->pack_name, b->pack_name);
+}
+
+static void sort_packs_by_name(char **pack_names, uint32_t nr_packs,
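The pack-name chunk's constraints (lexicographic order, NUL-terminated
names, alignment padding) can be sketched in a few lines of Python.
This is illustrative only; the 4-byte padding placement is an
assumption based on the MIDX_CHUNK_ALIGNMENT constant and the commit
message's note about padding, not a byte-exact rendition of the patch:

```python
def write_pack_names_chunk(pack_names):
    """Concatenate NUL-terminated pack names in lexicographic order,
    padding the chunk with NULs to a multiple of 4 bytes."""
    data = b"".join(name.encode() + b"\0" for name in sorted(pack_names))
    if len(data) % 4:
        data += b"\0" * (4 - len(data) % 4)
    return data

chunk = write_pack_names_chunk(["pack-b.idx", "pack-a.idx"])

# Names come back sorted, which is what permits binary search by name.
names = [n.decode() for n in chunk.rstrip(b"\0").split(b"\0")]
assert names == ["pack-a.idx", "pack-b.idx"]
assert len(chunk) % 4 == 0
```

Keeping the names sorted is also what the "defense in depth" checks can
verify cheaply at write time: a single pass confirms strict ordering
before the chunk is committed to disk.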
[PATCH 16/23] midx: prepare midxed_git struct
Signed-off-by: Derrick Stolee
---
 midx.c         | 22 ++
 midx.h         |  2 ++
 object-store.h |  7 +++
 packfile.c     |  6 +-
 4 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index a49300bf75..5e9290ca8f 100644
--- a/midx.c
+++ b/midx.c
@@ -175,6 +175,28 @@ struct midxed_git *load_midxed_git(const char *object_dir)
 	exit(1);
 }
 
+int prepare_midxed_git_one(struct repository *r, const char *object_dir)
+{
+	struct midxed_git *m = r->objects->midxed_git;
+	struct midxed_git *m_search;
+
+	if (!core_midx)
+		return 0;
+
+	for (m_search = m; m_search; m_search = m_search->next)
+		if (!strcmp(object_dir, m_search->object_dir))
+			return 1;
+
+	r->objects->midxed_git = load_midxed_git(object_dir);
+
+	if (r->objects->midxed_git) {
+		r->objects->midxed_git->next = m;
+		return 1;
+	}
+
+	return 0;
+}
+
 static size_t write_midx_header(struct hashfile *f,
 				unsigned char num_chunks,
 				uint32_t num_packs)
diff --git a/midx.h b/midx.h
index a1d18ed991..793203fc4a 100644
--- a/midx.h
+++ b/midx.h
@@ -5,8 +5,10 @@
 #include "cache.h"
 #include "object-store.h"
 #include "packfile.h"
+#include "repository.h"
 
 struct midxed_git *load_midxed_git(const char *object_dir);
+int prepare_midxed_git_one(struct repository *r, const char *object_dir);
 
 int write_midx_file(const char *object_dir);
 
diff --git a/object-store.h b/object-store.h
index 9b671f1b0a..7908d46e34 100644
--- a/object-store.h
+++ b/object-store.h
@@ -130,6 +130,13 @@ struct raw_object_store {
 	 */
 	struct oidmap *replace_map;
 
+	/*
+	 * private data
+	 *
+	 * should only be accessed directly by packfile.c and midx.c
+	 */
+	struct midxed_git *midxed_git;
+
 	/*
 	 * private data
 	 *
diff --git a/packfile.c b/packfile.c
index 1a714fbde9..b91ca9b9f5 100644
--- a/packfile.c
+++ b/packfile.c
@@ -15,6 +15,7 @@
 #include "tree-walk.h"
 #include "tree.h"
 #include "object-store.h"
+#include "midx.h"
 
 char *odb_pack_name(struct strbuf *buf,
 		    const unsigned char *sha1,
@@ -893,10 +894,13 @@ static void prepare_packed_git(struct repository *r)
 	if (r->objects->packed_git_initialized)
 		return;
+	prepare_midxed_git_one(r, r->objects->objectdir);
 	prepare_packed_git_one(r, r->objects->objectdir, 1);
 	prepare_alt_odb(r);
-	for (alt = r->objects->alt_odb_list; alt; alt = alt->next)
+	for (alt = r->objects->alt_odb_list; alt; alt = alt->next) {
+		prepare_midxed_git_one(r, alt->path);
 		prepare_packed_git_one(r, alt->path, 0);
+	}
 	rearrange_packed_git(r);
 	prepare_packed_git_mru(r);
 	r->objects->packed_git_initialized = 1;
-- 
2.18.0.rc1
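The load-at-most-once logic of prepare_midxed_git_one() above can be
mirrored in a short Python sketch (hypothetical names; the `load_midx`
callback stands in for load_midxed_git(), and a list stands in for the
C linked list):

```python
def prepare_midxed_git_one(state, object_dir, load_midx):
    """Load a MIDX for object_dir at most once; newly loaded MIDXs are
    prepended, matching how the C code links new entries at the head."""
    for m in state:
        if m["object_dir"] == object_dir:
            return 1                  # already loaded for this directory
    m = load_midx(object_dir)
    if m is not None:
        state.insert(0, m)            # newest entry becomes the list head
        return 1
    return 0

calls = []
def fake_load(object_dir):
    # hypothetical loader standing in for load_midxed_git()
    calls.append(object_dir)
    return {"object_dir": object_dir}

state = []
prepare_midxed_git_one(state, ".git/objects", fake_load)
prepare_midxed_git_one(state, ".git/objects", fake_load)  # cached, no reload
prepare_midxed_git_one(state, "/alt/objects", fake_load)  # alternate loads too
assert calls == [".git/objects", "/alt/objects"]
```

This mirrors why prepare_packed_git() can safely call the function once
for the main object directory and once per alternate: duplicate calls
for the same directory are cheap no-ops.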
RE: git question from a newbie
Bryan,

Thank you. I didn't realize that when you set up a remote repository,
it is just a folder. I thought the fact that I had it set up as a
website was going to handle what I needed. It wasn't until your email
that I realized I had to use some type of client. I installed Bonobo
as the remote repository and bam, it worked!

You are right that the info on Windows is a bit sparse. I learned a
lot and want to thank you again.

Steve Heinz

Steve Heinz | Lead Programmer Analyst, Information Technology
AAA Northeast | 1415 Kellum Place | Garden City, NY 11530
X8042 | T 516-535-2581 | F 516-873-2211
she...@aaanortheast.com | AAA.com
It Pays to Belong.

-----Original Message-----
From: Bryan Turner
Sent: Tuesday, June 05, 2018 6:29 PM
To: Heinz, Steve
Cc: Git Users
Subject: Re: git question from a newbie

On Tue, Jun 5, 2018 at 2:33 PM Heinz, Steve wrote:
>
> Hi.
>
> I am new to Git and have read quite a few articles on it.
> I am planning on setting up a remote repository on a windows 2012 R2
> server and will access it via HTTPS.
> I am setting up a local repository on my desk top (others in my group
> will do the same).
> On "server1": I install Git and create a repository "repos".
> On "server1": I create a dummy webpage "default.htm" and place it in
> the repo folder.
> On "server1": I create a web application in IIS pointing to Git
> On "server1": change permissions so IIS_User has access to the folders.
> On "server1": inside the "repos" folder, right click and choose "bash here"
> On "server1": $ git init --bare    (it's really 2 hyphens)

This might create a _repository_, but it's not going to set up any Git
hosting processing for it. You might be able to clone using the
fallback to the "dumb" HTTP protocol (though I doubt it, with the
steps you've shown), but you won't be able to push.
You need handlers for git-http-backend which handle info/refs and
other requests that are related to the Git HTTP wire protocol. [1]

Documentation for setting up Git's HTTP protocol via Apache is pretty
easy to find [2], but IIS instructions are a bit more sparse. I don't
know of any good ones off the top of my head. But that's your issue:
your IIS setup isn't really a valid Git remote; it's just a Git
repository with contents visible via HTTP.

[1] https://github.com/git/git/blob/master/Documentation/technical/http-protocol.txt
[2] https://github.com/git/git/blob/master/Documentation/howto/setup-git-server-over-http.txt

Bryan

The information contained in this email message is intended only for the private and confidential use of the recipient(s) named above, unless the sender expressly agrees otherwise. In no event shall AAA Northeast or any of its affiliates accept any responsibility for the loss, use or misuse of any information including confidential information, which is sent to AAA Northeast or its affiliates via email, or email attachment. AAA Northeast does not guarantee the accuracy of any email or email attachment. If the reader of this message is not the intended recipient and/or you have received this email in error, you must take no action based on the information in this email and you are hereby notified that any dissemination, misuse or copying or disclosure of this communication is strictly prohibited. If you have received this communication in error, please notify us immediately by email and delete the original message.
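For reference, the Apache-side setup documented in git-http-backend(1)
boils down to something like the following (paths and the passwd file
location are illustrative; there is no tested IIS equivalent here, but
an IIS CGI handler would need to forward the same requests to
git-http-backend):

```apache
# Export repositories under /srv/git over the smart HTTP protocol.
SetEnv GIT_PROJECT_ROOT /srv/git
SetEnv GIT_HTTP_EXPORT_ALL
ScriptAlias /git/ /usr/libexec/git-core/git-http-backend/

# Require authentication for pushes (git-receive-pack requests).
<LocationMatch "^/git/.*/git-receive-pack$">
    AuthType Basic
    AuthName "Git Access"
    AuthUserFile /etc/apache2/git.passwd
    Require valid-user
</LocationMatch>
```

The key point is the ScriptAlias: info/refs, git-upload-pack, and
git-receive-pack requests must reach the git-http-backend CGI rather
than being served as static files, which is exactly what a plain
website mapping of the repository folder fails to do.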
Re: [PATCH v2] completion: reduce overhead of clearing cached --options
The zsh script expects the bash completion script to be available, so
that might be the issue here. To reproduce, this is what I do.

First, a sparse checkout:

    # mkdir ~/git
    # cd ~/git
    # git init
    # git remote add origin g...@github.com:git/git.git
    # git config core.sparseCheckout true
    # mkdir .git/info
    # echo contrib/completion >> .git/info/sparse-checkout
    # git pull origin master --depth 1

After that I simply link the zsh script to _git:

    # cd git/contrib/completion
    # ln git-completion.zsh _git

And add the following to a new .zshrc file:

    autoload -U compinit; compinit
    autoload -U bashcompinit; bashcompinit
    fpath=(~/git/contrib/completion $fpath)

And that appears to be enough to reproduce:

    # git
    git/contrib/completion/git-completion.bash:354: read-only variable: QISUFFIX
    zsh:12: command not found: ___main
    zsh:15: _default: function definition file not found
    _dispatch:70: bad math expression: operand expected at `/usr/bin/g...'
    zsh: segmentation fault  zsh

~rick

On 7 June 2018 at 07:48, Jonathan Nieder wrote:
> Hi,
>
> SZEDER Gábor wrote:
>
>> Other Bash versions, notably 4.4.19 on macOS via homebrew (i.e. a
>> newer version on the same platform) and 3.2.25 on CentOS (i.e. a
>> slightly earlier version, though on a different platform) are not
>> affected. ZSH in macOS (the versions shipped by default or installed
>> via homebrew) or on other platforms isn't affected either.
> [...]
>> --- a/contrib/completion/git-completion.bash
>> +++ b/contrib/completion/git-completion.bash
>> @@ -282,7 +282,11 @@ __gitcomp ()
>>
>>  # Clear the variables caching builtins' options when (re-)sourcing
>>  # the completion script.
>> -unset $(set |sed -ne 's/^\(__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*\)=.*/\1/p') 2>/dev/null
>> +if [[ -n ${ZSH_VERSION-} ]]; then
>> +	unset $(set |sed -ne 's/^\(__gitcomp_builtin_[a-zA-Z0-9_][a-zA-Z0-9_]*\)=.*/\1/p') 2>/dev/null
>> +else
>> +	unset $(compgen -v __gitcomp_builtin_)
>> +fi
>
> As discussed at [1], this fails catastrophically when sourced by
> git-completion.zsh, which explicitly clears ZSH_VERSION. By
> catastrophically, I mean that Rick van Hattem and Dave Borowitz report
> that it segfaults.
>
> I can't reproduce it since I am having trouble following the
> instructions at the top of git-completion.zsh to get the recommended
> zsh completion script loading mechanism working at all. (I've
> succeeded in using git's bash completion on zsh before, but that was
> many years ago.) First attempt:
>
>  1. rm -fr ~/.zsh ~/.zshrc
>     # you can move them out of the way instead for a less destructive
>     # reproduction recipe
>  2. zsh
>  3. hit "q"
>  4. zstyle ':completion:*:*:git:*' script \
>       $(pwd)/contrib/completion/git-completion.zsh
>  5. git checkout
>
> Expected result: segfault
>
> Actual result: succeeds, using zsh's standard completion script (not
> git's).
>
> I also tried
>
>  1. rm -fr ~/.zsh ~/.zshrc
>     # you can move them out of the way instead for a less destructive
>     # reproduction recipe
>  2. zsh
>  3. hit "2"
>  4. mkdir ~/.zsh; cp contrib/completion/git-completion.zsh ~/.zsh/_git
>  5. echo 'fpath=(~/.zsh $fpath)' >>~/.zshrc
>  6. ^D; zsh
>  7. git checkout
>
> and a similar process with fpath earlier in the zshrc file. Whatever
> I try, I end up with one of two results: either it uses zsh's standard
> completion, or it spews a stream of errors about missing functions
> like compgen. What am I doing wrong?
>
> Related question: what would it take to add a zsh completion sanity
> check to t/t9902-completion.sh?
>
> When running with "set -x", here is what Dave Borowitz gets before the
> segfault.
> I don't have a stacktrace because, as mentioned above, I
> haven't been able to reproduce it.
>
>   str=git
>   [[ -n git ]]
>   service=git
>   i=
>   _compskip=default
>   eval ''
>   ret=0
>   [[ default = *patterns* ]]
>   [[ default = all ]]
>   str=/usr/bin/git
>   [[ -n /usr/bin/git ]]
>   service=_dispatch:70: bad math expression: operand expected at `/usr/bin/g...'
>
>   zsh: segmentation fault  zsh
>
> Here's a minimal fix, untested. As a followup, as Gábor suggests at [2],
> it would presumably make sense to stop overriding ZSH_VERSION, using
> this GIT_ scoped variable everywhere instead.
>
> Thoughts?
>
> Reported-by: Rick van Hattem
> Reported-by: Dave Borowitz
> Signed-off-by: Jonathan Nieder
>
> [1] https://public-inbox.org/git/01020163c683e753-04629405-15f8-4a30-9dc3-e4e3f2a5aa26-000...@eu-west-1.amazonses.com/
> [2] https://public-inbox.org/git/20180606114147.7753-1-szeder@gmail.com/
>
> diff --git i/contrib/completion/git-completion.bash w/contrib/completion/git-completion.bash
> index 12814e9bbf..e4bcc435ea 100644
> --- i/contrib/completion/git-completion.bash
> +++ w/contrib/completion/git-completion.bash
> @@ -348,7 +348,7 @@ __gitcomp ()
>
>  # Clear the variables caching builtins' options when (re-)sourcing
RE: git question from a newbie
Randall,

Thank you. I tried it, but that didn't work either. I did find out
what my issue was: I need some type of client set up on the server to
listen for the requests.

Steve

-----Original Message-----
From: Randall S. Becker
Sent: Tuesday, June 05, 2018 6:19 PM
To: Heinz, Steve ; git@vger.kernel.org
Subject: RE: git question from a newbie

On June 5, 2018 5:24 PM, Steve Heinz wrote:
> I am new to Git and have read quite a few articles on it.
> I am planning on setting up a remote repository on a windows 2012 R2
> server and will access it via HTTPS.
> I am setting up a local repository on my desk top (others in my group
> will do the same).
> On "server1": I install Git and create a repository "repos".
> On "server1": I create a dummy webpage "default.htm" and place it in
> the repo folder.
> On "server1": I create a web application in IIS pointing to Git
> On "server1": change permissions so IIS_User has access to the folders.
> On "server1": inside the "repos" folder, right click and choose
> "bash here"
> On "server1": $ git init --bare    (it's really 2 hyphens)
>
> On Desktop: open Chrome and type in the URL to make sure we can access it
> https://xyz/repos/default.htm
> ** make sure you have access, no certificate issues or firewall issues.
> The page shows up fine.
>
> On Desktop: install Git and create repository "repos".
> On Desktop: right click in "repos" folder and choose "bash here"
> On Desktop: $ git init
> On Desktop: add a folder "testProject" under the "repos" folder and
> add some files to the folder
> On Desktop: $ git add .    (will add files and folder to working tree)
> On Desktop: $ git status   (shows it recognizes the files were added)
> On Desktop: $ git commit -m "test project commit"   (will stage changes)
> On Desktop: $ git push https://xyz.domainname.com/repos master
>
> ** this is the error I get, I have tried many different things. I am
> sure I am doing something stupid
> ** I have tried a bunch of variations but I always get the same error.
> It looks like some type of network/permission thing but everything
> seems OK.
>
>    Fatal: repository 'https://xyz.domainname.com/repos/' not found
>
> *** this is where I get the error trying to push staged items to the
> remote repository.
> *** I have tried to clone the empty remote repository, still same error
> *** I checked port 443 is opened and being used for https
> *** tried to set origin to https://xyz.domainname.com/repos and then
> $ git push origin master (same error)
> *** I tried passing credentials to the remote server as well

Missing glue - git remote:

    git remote add origin https://xyz.domainname.com/repos

Cheers,
Randall

--
Brief whoami: NonStop developer since approximately 2112884442
UNIX developer since approximately 421664400
-- In my real life, I talk too much.
Re: [PATCH v2] completion: reduce overhead of clearing cached --options
On Thu, Jun 7, 2018 at 1:48 AM Jonathan Nieder wrote:
> Whatever
> I try, I end up with one of two results: either it uses zsh's standard
> completion, or it spews a stream of errors about missing functions
> like compgen. What am I doing wrong?

Try adding to the top of your .zshrc:

    autoload -U compinit; compinit
    autoload -U bashcompinit; bashcompinit

Those are both in my .zshrc, and I think they are sufficient magic to
define compgen.
[PATCH] Use hyphenated "remote-tracking branch" (docs and comments)
Use the obvious consensus of hyphenated "remote-tracking branch", and
fix an obvious typo, all in documentation and comments.

Signed-off-by: Robert P. J. Day
---

diff --git a/Documentation/git-submodule.txt b/Documentation/git-submodule.txt
index 4a5cc38a6f..ef9d9d28a9 100644
--- a/Documentation/git-submodule.txt
+++ b/Documentation/git-submodule.txt
@@ -42,8 +42,8 @@ have to use '../foo.git' instead of './foo.git' - as one might expect
 when following the rules for relative URLs - because the evaluation
 of relative URLs in Git is identical to that of relative directories).
 +
-The default remote is the remote of the remote tracking branch
-of the current branch. If no such remote tracking branch exists or
+The default remote is the remote of the remote-tracking branch
+of the current branch. If no such remote-tracking branch exists or
 the HEAD is detached, "origin" is assumed to be the default remote.
 If the superproject doesn't have a default remote configured
 the superproject is its own authoritative upstream and the current
diff --git a/builtin/branch.c b/builtin/branch.c
index d53f6e2ad4..5217ba3bde 100644
--- a/builtin/branch.c
+++ b/builtin/branch.c
@@ -701,7 +701,7 @@ int cmd_branch(int argc, const char **argv, const char *prefix)
 	 * If no sorting parameter is given then we default to sorting
 	 * by 'refname'. This would give us an alphabetically sorted
 	 * array with the 'HEAD' ref at the beginning followed by
-	 * local branches 'refs/heads/...' and finally remote-tacking
+	 * local branches 'refs/heads/...' and finally remote-tracking
 	 * branches 'refs/remotes/...'.
 	 */
 	if (!sorting)
diff --git a/builtin/pull.c b/builtin/pull.c
index 1f2ecf3a88..49cc3beb4c 100644
--- a/builtin/pull.c
+++ b/builtin/pull.c
@@ -673,7 +673,7 @@ static const char *get_upstream_branch(const char *remote)
 }
 
 /**
- * Derives the remote tracking branch from the remote and refspec.
+ * Derives the remote-tracking branch from the remote and refspec.
  *
  * FIXME: The current implementation assumes the default mapping of
  * refs/heads/<branch_name> to refs/remotes/<remote_name>/<branch_name>.
@@ -711,7 +711,7 @@ static const char *get_tracking_branch(const char *remote, const char *refspec)
 
 /**
  * Given the repo and refspecs, sets fork_point to the point at which the
- * current branch forked from its remote tracking branch. Returns 0 on success,
+ * current branch forked from its remote-tracking branch. Returns 0 on success,
  * -1 on failure.
  */
 static int get_rebase_fork_point(struct object_id *fork_point, const char *repo,
-- 

Robert P. J. Day
Ottawa, Ontario, CANADA
http://crashcourse.ca/dokuwiki

Twitter:  http://twitter.com/rpjday
LinkedIn: http://ca.linkedin.com/in/rpjday
[GSoC][PATCH v3 1/1] rebase--interactive: rewrite append_todo_help() in C
This rewrites append_todo_help() from shell to C. It also incorporates
some parts of initiate_action() and complete_action() that also write
help texts to the todo file.

Two flags are added to rebase--helper.c: one to call append_todo_help()
(`--append-todo-help`), and another one to tell append_todo_help() to
write the help text suited for the edit-todo mode (`--write-edit-todo`).

Finally, append_todo_help() is removed from git-rebase--interactive.sh
to use `rebase--helper --append-todo-help` instead.

Signed-off-by: Alban Gruin
---
 builtin/rebase--helper.c   | 10 ++--
 git-rebase--interactive.sh | 52 ++--
 sequencer.c                | 60 ++
 sequencer.h                |  1 +
 4 files changed, 71 insertions(+), 52 deletions(-)

diff --git a/builtin/rebase--helper.c b/builtin/rebase--helper.c
index f7c2a5fdc..ded5e291d 100644
--- a/builtin/rebase--helper.c
+++ b/builtin/rebase--helper.c
@@ -12,12 +12,12 @@ static const char * const builtin_rebase_helper_usage[] = {
 int cmd_rebase__helper(int argc, const char **argv, const char *prefix)
 {
 	struct replay_opts opts = REPLAY_OPTS_INIT;
-	unsigned flags = 0, keep_empty = 0, rebase_merges = 0;
+	unsigned flags = 0, keep_empty = 0, rebase_merges = 0, write_edit_todo = 0;
 	int abbreviate_commands = 0, rebase_cousins = -1;
 	enum {
 		CONTINUE = 1, ABORT, MAKE_SCRIPT, SHORTEN_OIDS, EXPAND_OIDS,
 		CHECK_TODO_LIST, SKIP_UNNECESSARY_PICKS, REARRANGE_SQUASH,
-		ADD_EXEC
+		ADD_EXEC, APPEND_TODO_HELP
 	} command = 0;
 	struct option options[] = {
 		OPT_BOOL(0, "ff", &opts.allow_ff, N_("allow fast-forward")),
@@ -27,6 +27,8 @@ int cmd_rebase__helper(int argc, const char **argv, const char *prefix)
 		OPT_BOOL(0, "rebase-merges", &rebase_merges, N_("rebase merge commits")),
 		OPT_BOOL(0, "rebase-cousins", &rebase_cousins,
 			 N_("keep original branch points of cousins")),
+		OPT_BOOL(0, "write-edit-todo", &write_edit_todo,
+			 N_("append the edit-todo message to the todo-list")),
 		OPT_CMDMODE(0, "continue", &command, N_("continue rebase"), CONTINUE),
 		OPT_CMDMODE(0, "abort", &command, N_("abort rebase"),
@@ -45,6 +47,8 @@ int cmd_rebase__helper(int argc, const char **argv, const char *prefix)
 			    N_("rearrange fixup/squash lines"), REARRANGE_SQUASH),
 		OPT_CMDMODE(0, "add-exec-commands", &command,
 			    N_("insert exec commands in todo list"), ADD_EXEC),
+		OPT_CMDMODE(0, "append-todo-help", &command,
+			    N_("insert the help in the todo list"), APPEND_TODO_HELP),
 		OPT_END()
 	};
@@ -84,5 +88,7 @@ int cmd_rebase__helper(int argc, const char **argv, const char *prefix)
 		return !!rearrange_squash();
 	if (command == ADD_EXEC && argc == 2)
 		return !!sequencer_add_exec_commands(argv[1]);
+	if (command == APPEND_TODO_HELP && argc == 1)
+		return !!append_todo_help(write_edit_todo, keep_empty);
 	usage_with_options(builtin_rebase_helper_usage, options);
 }
diff --git a/git-rebase--interactive.sh b/git-rebase--interactive.sh
index 299ded213..94c23a7af 100644
--- a/git-rebase--interactive.sh
+++ b/git-rebase--interactive.sh
@@ -39,38 +39,6 @@ comment_for_reflog () {
 	esac
 }
 
-append_todo_help () {
-	gettext "
-Commands:
-p, pick = use commit
-r, reword = use commit, but edit the commit message
-e, edit = use commit, but stop for amending
-s, squash = use commit, but meld into previous commit
-f, fixup = like \"squash\", but discard this commit's log message
-x, exec = run command (the rest of the line) using shell
-d, drop = remove commit
-l, label = label current HEAD with a name
-t, reset = reset HEAD to a label
-m, merge [-C | -c ] [# ]
-.       create a merge commit using the original merge commit's
-.       message (or the oneline, if no original merge commit was
-.       specified). Use -c to reword the commit message.
-
-These lines can be re-ordered; they are executed from top to bottom.
-" | git stripspace --comment-lines >>"$todo"
-
-	if test $(get_missing_commit_check_level) = error
-	then
-		gettext "
-Do not remove any line. Use 'drop' explicitly to remove a commit.
-" | git stripspace --comment-lines >>"$todo"
-	else
-		gettext "
-If you remove a line here THAT COMMIT WILL BE LOST.
-" | git stripspace --comment-lines >>"$todo"
-	fi
-}
-
 die_abort () {
 	apply_autostash
 	rm -rf "$state_dir"
@@ -143,13 +111,7 @@ initiate_action () {
 			git stripspace --strip-comments <"$todo" >"$todo".new
 			mv -f "$todo".new "$todo"
 			collapse_todo_ids
-			append_todo_help
-
[GSoC][PATCH v3 0/1] rebase -i: rewrite append_todo_help() in C
This patch rewrites append_todo_help() from shell to C. The C version
covers a bit more than the old shell version. To achieve that, some
parameters were added to rebase--helper. This is part of the effort to
rewrite interactive rebase in C.

Changes since v2:

 - Renamed the variable `edit_todo` to `write_edit_todo` to avoid
   confusion, after a comment by Christian Couder[1].

[1] https://github.com/git/git/pull/503#discussion_r193392270

Alban Gruin (1):
  rebase--interactive: rewrite append_todo_help() in C

 builtin/rebase--helper.c   | 10 ++--
 git-rebase--interactive.sh | 52 ++--
 sequencer.c                | 60 ++
 sequencer.h                |  1 +
 4 files changed, 71 insertions(+), 52 deletions(-)

-- 
2.16.4
Re: [RFC PATCH 0/5] format-patch: automate cover letter range-diff
On Wed, Jun 6, 2018 at 3:16 PM, Duy Nguyen wrote:
> On Wed, May 30, 2018 at 10:03 AM, Eric Sunshine wrote:
>> Dscho recently implemented a 'tbdiff' replacement as a Git builtin
>> named git-branch-diff[1] which computes differences between two
>> versions of a patch series. Such a diff can be a useful aid for
>> reviewers when inserted into a cover letter. However, doing so
>> requires manual generation (invoking git-branch-diff) and
>> copy/pasting the result into the cover letter.
>
> Another option which I wanted to go is delegate part of cover letter
> generation to a hook (or just a config key that contains a shell
> command). This way it's easier to customize cover letters. We could
> still have a good fallback that does shortlog, diffstat and tbdiff.

It is common on this mailing list to turn down requests for new hooks
when the requested functionality could just as easily be implemented
via a wrapper script. So, my knee-jerk reaction is that a hook to
customize the cover letter may be overkill when the same functionality
could likely be implemented relatively easily by a shell script which
invokes git-format-patch and customizes the cover letter
after-the-fact. The same argument applies to a config key holding a
shell command. But perhaps there are cases which don't occur to me
which could be helped by a config variable or such.

Of course, by the same reasoning, the --range-diff functionality
implemented by this patch series, which is just a convenience, could be
handled by a wrapper script, thus is not strictly needed. On the other
hand, the fact that interdiffs and range-diffs are so regularly used in
re-rolls on this list (and perhaps other mailing list-based projects)
may be argument enough in favor of having such an option built into
git-format-patch.
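For what it's worth, the wrapper-script approach described above can be sketched in a few lines of shell. This is only an illustration, not a proposed interface: it assumes the builtin under discussion eventually lands as `git range-diff`, and the function name and its arguments (base, old tip, new tip, output directory) are made up here.

```shell
# Hedged sketch: generate a patch series with a cover letter, then
# append a range-diff between the previous and current rounds to it.
# Usage: cover_with_range_diff <base> <old-tip> <new-tip> <outdir>
cover_with_range_diff () {
	base=$1 old=$2 new=$3 outdir=$4
	# Create the series plus 0000-cover-letter.patch in $outdir.
	git format-patch --cover-letter -o "$outdir" "$base..$new" &&
	{
		printf '\nRange-diff against the previous round:\n\n'
		# Three-argument form: compare $base..$old with $base..$new.
		git range-diff "$base" "$old" "$new"
	} >>"$outdir"/0000-cover-letter.patch
}

# Example (hypothetical branch names):
#	cover_with_range_diff master topic-v1 topic-v2 outgoing
```

The cover letter would then be edited as usual before sending; the point is just that no change to git-format-patch itself is strictly required.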
Re: [PATCH v4 00/23] Fix incorrect use of the_index
On Wed, Jun 6, 2018 at 9:49 AM, Nguyễn Thái Ngọc Duy wrote:
> v4 fixes some commit messages and killed a couple more the_index
> references after I tried to merge this with 'pu'

Thanks for tackling this. The first 8 patches look good to me other
than the typo I pointed out on patch 4. However, my eyes glazed over a
bit on the attr.c stuff in patch 7 that you specifically mentioned in
the commit message, so my "looks good" doesn't count for as much on
that one.

I'm getting sleepy, but I'll try to circle back and look over the rest
of the patches in the next few days.
Re: [PATCH v4 04/23] unpack-tress: convert clear_ce_flags* to avoid the_index
Subject line: unpack-trees rather than unpack-tress.

On Wed, Jun 6, 2018 at 9:49 AM, Nguyễn Thái Ngọc Duy wrote:
> Prior to fba92be8f7, this code implicitly (and incorrectly) assumes
> the_index when running the exclude machinery. fba92be8f7 helps show
> this problem clearer because unpack-trees operation is supposed to
> work on whatever index the caller specifies... not specifically
> the_index.
>
> Update the code to use "istate" argument that's originally from
> mark_new_skip_worktree(). From the call sites, both in unpack_trees(),
> you can see that this function works on two separate indexes:
> o->src_index and o->result. The second mark_new_skip_worktree() so far
> has incorecctly applied exclude rules on o->src_index instead of
> o->result. It's unclear what is the consequences of this, but it's
> definitely wrong.
>
> [1] fba92be8f7 (dir: convert is_excluded_from_list to take an index -
>     2017-05-05)
>
> Signed-off-by: Nguyễn Thái Ngọc Duy

A somewhat curious finding: when I was rebuilding and re-testing all 23
patches, I got a failure on this patch in test 31 of
t7063-status-untracked-cache.sh. I did not get any test failures with
any of the other patches. However, after re-running that test or the
whole suite half a dozen times with just up to this patch applied, I
was not able to trigger the failure again. Is there a rare race in that
testcase? I certainly don't see anything in this patch that appears
problematic, and the fact that I couldn't readily reproduce suggests it
could well have been there before any of these patches.

Everything else in the patch looks fine to me.
Re: GDPR compliance best practices?
Hi David,

thanks for your input on the issue.

> LEGAL GDPR NOTICE:
> According to the European data protection laws (GDPR), we would like
> to make you aware that contributing to rsyslog via git will
> permanently store the name and email address you provide as well as
> the actual commit and the time and date you made it inside git's
> version history. This is inevitable, because it is a main feature git.

As we can see, rsyslog tries to solve the issue by the already
discussed legal "technology" of disclaimers (which is certainly not
accepted as state of the art technology by the GDPR). In essence, they
are giving excuses for why they are not honoring the right to be
forgotten.

Disclaimers do not work. They have no legal effect, they are placebos.
The GDPR does not accept such excuses. If it did, companies could
arbitrarily design their data storage so as to make it "the main
feature" to not honor the right to be forgotten and/or other GDPR
rights. It is obvious that this cannot work, as it would completely
undermine those rights. The GDPR honors technology as a means to
protect the individual's rights, not as a means to subvert them.

> If you are concerned about your privacy, we strongly recommend to use
>
>     --author "anonymous "
>
> together with your commit.

This can only be a solution if the project rejects any commits which
are not anonymous.

> However, we have valid reasons why we cannot remove that information
> later on. The reasons are:
>
> * this would break git history and make future merges unworkable

This is not a valid excuse (see above). The technology has to be
designed or applied in such a way that the individual's rights are
honored, not the other way around. In the absence of other means, the
project has to rewrite history if it gets a valid request by someone
exercising his right to be forgotten, even if that causes a lot of
hassle for everyone.
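As an aside, enforcing such an anonymous-only policy server-side is straightforward in principle. The following is a rough sketch of a pre-receive check, not a vetted design; the literal author name "anonymous" is just the convention used here for illustration.

```shell
# Hedged sketch of a server-side pre-receive check: reject any pushed
# commit whose author name is not the literal "anonymous".
zero=0000000000000000000000000000000000000000
check_push () {
	# The server feeds "old-sha1 new-sha1 refname" lines on stdin.
	while read old new ref
	do
		if test "$new" = "$zero"
		then
			continue	# ref deletion, nothing to check
		fi
		if test "$old" = "$zero"
		then
			range=$new	# new ref: check its whole history
		else
			range=$old..$new
		fi
		# Reject if any pushed commit has a non-anonymous author.
		if git log --format='%an' "$range" | grep -qv '^anonymous$'
		then
			echo "rejected $ref: non-anonymous author present" >&2
			return 1
		fi
	done
	return 0
}

# In an actual hooks/pre-receive script one would end with:
#	check_push || exit 1
```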
> * the rsyslog projects has legitimate interest to keep a permanent
>   record of the contributor identity, once given, for
>   - copyright verification
>   - being able to provide proof should a malicious commit be made

True, but that doesn't justify publishing that information and keeping
it published even when someone exercises his right to be forgotten. In
that case, "legitimate interest" is not enough. There need to be
"overriding legitimate grounds". I don't see them here.

> Please also note that your commit is public and as such will
> potentially be processed by many third-parties. Git's distributed
> nature makes it impossible to track where exactly your commit, and
> thus your personal data, will be stored and be processed. If you
> would not like to accept this risk, please do either commit
> anonymously or refrain from contributing to the rsyslog project.

This is one of those statements that ultimately say "we do not honor
the GDPR; either accept that or don't submit". That's the old, arguably
ignorant mentality, and it won't stand. The project has to have a legal
basis for publishing the personal metadata contained in the repository.
When in doubt, it needs to be consent based, as that is practically the
only basis that allows putting the data on the internet for everyone to
download. And consent can be withdrawn at any time. The GDPR's
transitional period started over two years ago. There was enough time
to get everything GDPR compliant.

It might be possible to implement my solution without changing git,
btw. Simply use the anonymous hash as author name, and store the random
number and the author as a git note. git-notes can be rewritten or
deleted at any time without changing the commit ID. I am currently
looking into this solution. One just needs to add something that can
verify and resolve those anonymous hashes.

Best wishes
Peter

-- 
Peter Backes, r...@helen.plasma.xg8.de
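The git-notes variant of that scheme can be sketched as follows. To be clear, the notes ref name, the salted-hash construction and the sample identity below are all made up for illustration; this is not a vetted anonymization design, just a demonstration that the commit ID survives deletion of the mapping.

```shell
# Hedged sketch: commit under a pseudonymous author derived from a
# salted hash, and keep the salt/identity mapping in a note that can be
# deleted later without changing the commit ID.
# Usage: anon_commit '<real author string>' '<commit message>'
anon_commit () {
	author=$1 msg=$2
	# Random salt, so the pseudonym cannot be reversed by brute force
	# over known identities alone.
	salt=$(head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n')
	anon=$(printf '%s%s' "$author" "$salt" | git hash-object --stdin)
	# Commit with the pseudonym as author; committer comes from config.
	GIT_AUTHOR_NAME="anon-$anon" GIT_AUTHOR_EMAIL="anon-$anon@invalid" \
		git commit -m "$msg" &&
	# Store the resolvable mapping out-of-band, in a notes ref that
	# can be rewritten or dropped without touching the commit.
	git notes --ref=identities add -m "salt=$salt author=$author" HEAD
}

# Honoring a later right-to-be-forgotten request for HEAD:
#	git notes --ref=identities remove HEAD
```

Verifying a claim of authorship then amounts to recomputing the hash from the disclosed salt and identity and comparing it against the author name recorded in the commit.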