Re: Borrowing objects from nearby repositories

2014-03-28 Thread Junio C Hamano
Andrew Keller and...@kellerfarm.com writes:

 Okay, so to re-frame my idea, like you said, the goal is to find a user-
 friendly way for the user to tell git-clone to set up the alternates file
 (or perhaps just use the --alternates parameter), and run a repack,
 and disconnect the alternate.  And yet, we still want to be able to use
 --reference on its own, because there are existing use cases for that.

Here are a few possible action items that came out of this
discussion:

 1. Introduce a new --borrow option to git clone.

The updates to the SYNOPSIS section may go like this:

-'git clone' [--reference repository] ...other options...
+'git clone' [[--reference|--borrow] repository] ...other options...

The new option can be used instead of --reference and they
will be mutually incompatible.  The first implementation of the
--borrow option would do the following:

  (1) run the same "git clone" with the same command line but
      replacing --borrow with --reference; if this fails, exit
      with the same failure.

  (2) in the resulting repository, run "git repack -a -d"; if this
      fails, remove the entire directory the first step created,
      and exit with failure.

  (3) remove .git/objects/info/alternates from the resulting
      repository and exit with success.

and it may be acceptable as the final implementation as well.
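
The three steps above can be sketched with existing commands.  This is
only an illustration under assumptions: --borrow is the proposed option
and does not exist, clone_borrow is a made-up function name, and the
clone is assumed to be non-bare (so the object store is under .git/).

```shell
# Sketch only: emulate the proposed (hypothetical) "git clone --borrow".
# Usage: clone_borrow <reference-repo> <url> <new-directory>
clone_borrow() {
    local reference=$1 url=$2 dest=$3

    # (1) plain "git clone --reference"; on failure, exit with the
    #     same failure.
    git clone --reference "$reference" "$url" "$dest" || return

    # (2) copy the borrowed objects into our own packs; on failure,
    #     remove the directory that step (1) created.
    if ! git -C "$dest" repack -a -d
    then
        rm -rf "$dest"
        return 1
    fi

    # (3) stop borrowing from the reference repository.
    rm -f "$dest/.git/objects/info/alternates"
}
```

For example, "clone_borrow borrowee git://over/there mine" would leave
a self-contained repository in "mine", or no directory at all if the
repack fails.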


 2. Make "git repack" safer for the users of "clone --reference" who
    want to keep sharing objects from the original.

    - Introduce the repack.local configuration variable that can
      be set to either true or false.  A missing variable defaults
      to false.

    - A repack that is run without the -l option on the command line
      will behave as if it was given -l if repack.local is set to
      true.  Add a "repack --no-local" option to countermand this
      configuration variable from the command line.

    - Teach "git clone --reference" (but not "git clone --borrow")
      to set repack.local = true in the configuration of the
      resulting repository.
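
The intended behaviour can be approximated today with a wrapper.  This
is a sketch only: repack.local and --no-local are the names proposed
above, "git repack" itself is not assumed to understand them, and
repack_honoring_local is a made-up function name.

```shell
# Sketch only: a wrapper that gives "git repack" the proposed
# repack.local / --no-local behaviour.
repack_honoring_local() {
    local use_local no_local=false arg

    # A missing variable defaults to false.
    use_local=$(git config --bool --get repack.local) || use_local=false

    # Rebuild the argument list, consuming the hypothetical --no-local.
    for arg
    do
        shift
        case "$arg" in
        --no-local) no_local=true ;;
        *)          set -- "$@" "$arg" ;;
        esac
    done

    # Behave as if -l was given when the configuration asks for it.
    if test "$use_local" = true && test "$no_local" = false
    then
        set -- -l "$@"
    fi
    git repack "$@"
}
```

With repack.local set to true, "repack_honoring_local -a -d" runs
"git repack -l -a -d", while adding --no-local runs a plain
"git repack -a -d".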
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Borrowing objects from nearby repositories

2014-03-28 Thread Andrew Keller
On Mar 26, 2014, at 1:29 PM, Junio C Hamano gits...@pobox.com wrote:

 Andrew Keller and...@kellerfarm.com writes:
 
 On Mar 25, 2014, at 6:17 PM, Junio C Hamano gits...@pobox.com wrote:
 ...
 I think that the standard practice with the existing toolset is to
 clone with reference and then repack.  That is:
 
   $ git clone --reference borrowee git://over/there mine
   $ cd mine
   $ git repack -a -d
 
 And then you can try this:
 
   $ mv .git/objects/info/alternates .git/objects/info/alternates.disabled
   $ git fsck
 
 to make sure that you are no longer borrowing anything from the
 borrowee.  Once you are satisfied, you can remove the saved-away
 alternates.disabled file.
 
 Oh, I forgot to say that I am not opposed if somebody wants to teach
 git clone a new option to copy its objects from two places,
 (hopefully) the majority from near-by reference repository and the
 remainder over the network, without permanently relying on the
 former via the alternates mechanism.  The implementation of such a
 feature could even literally be "clone with reference first and then
 repack", at least initially and perhaps even in the final version.
 
 [Administrivia: please wrap your lines to a reasonable length]
 
 That was actually one of my first ideas - adding some sort of
 '--auto-repack' option to git-clone.  It's a relatively small
 change, and would work.  However, keeping in mind my end goal of
 automating the feature to the point where you could run simply
 'git clone url', an '--auto-repack' option is more difficult to
 undo.  You would need a new parameter to disable the automatic
 adding of reference repositories, and a new parameter to undo
 '--auto-repack', and you'd have to remember to actually undo both
 of those settings.
 
 In contrast, if the new feature was '--borrow', and the evolution
 of the feature was a global configuration 'fetch.autoBorrow', then
 to turn it off temporarily, one only needs a single new parameter
 '--no-auto-borrow'.  I think this is a cleaner approach than the
 former, although much more work.
 
 I think you may have misread me.  With the new option, I was
 hinting that the "clone --reference && repack && rm alternates"
 sequence will be an acceptable internal implementation of the
 --borrow option that was mentioned in the thread.  I am not sure
 where you got the auto-repack from.

Ah, yes - that is better than what I was thinking.  I was thinking a bit
too low-level, and using two arguments in the place of your one.

 One of the reasons you may have misread me may be that I made it
 sound as if "this may work, and when it works you will be happy, but
 if it does not work you did not lose very much" by mentioning
 "mv && fsck".  That wasn't what I meant.
 
 The repack -a procedure is to make the borrower repository no
 longer dependent on the borrowee, and it is supposed to always work.
 In fact, this behaviour was the whole reason why repack later
 learned its -l option to disable it, because people who cloned
 with --reference in order to reduce the disk footprint by sharing
 older and more common objects [*1*] were rightfully surprised to see
 that the borrowed objects were copied over to their borrower
 repository when they ran repack [*2*].
 
 Because this is clone, there is nothing complex to undo.  Either
 it succeeds, or you remove the whole new directory if anything
 fails.
 
 I said "even in the final version" for a simple reason: you cannot
 realistically do any better than the "clone --reference &&
 repack -a -d && rm alternates" sequence.

Wow, that's very insightful - thanks!  So, it sounds like I was right about
the general areas of concern when trying to do this during a fetch, but
I underestimated just how complicated it would be.

Okay, so to re-frame my idea, like you said, the goal is to find a user-
friendly way for the user to tell git-clone to set up the alternates file
(or perhaps just use the --alternates parameter), and run a repack,
and disconnect the alternate.  And yet, we still want to be able to use
--reference on its own, because there are existing use cases for that.

Thanks!
 - Andrew Keller



Re: Borrowing objects from nearby repositories

2014-03-26 Thread Junio C Hamano
Andrew Keller and...@kellerfarm.com writes:

 On Mar 25, 2014, at 6:17 PM, Junio C Hamano gits...@pobox.com wrote:
 ...
 I think that the standard practice with the existing toolset is to
 clone with reference and then repack.  That is:
 
   $ git clone --reference borrowee git://over/there mine
   $ cd mine
   $ git repack -a -d
 
 And then you can try this:
 
   $ mv .git/objects/info/alternates .git/objects/info/alternates.disabled
   $ git fsck
 
 to make sure that you are no longer borrowing anything from the
 borrowee.  Once you are satisfied, you can remove the saved-away
 alternates.disabled file.
 
 Oh, I forgot to say that I am not opposed if somebody wants to teach
 git clone a new option to copy its objects from two places,
 (hopefully) the majority from near-by reference repository and the
 remainder over the network, without permanently relying on the
 former via the alternates mechanism.  The implementation of such a
 feature could even literally be "clone with reference first and then
 repack", at least initially and perhaps even in the final version.

[Administrivia: please wrap your lines to a reasonable length]

 That was actually one of my first ideas - adding some sort of
 '--auto-repack' option to git-clone.  It's a relatively small
 change, and would work.  However, keeping in mind my end goal of
 automating the feature to the point where you could run simply
 'git clone url', an '--auto-repack' option is more difficult to
 undo.  You would need a new parameter to disable the automatic
 adding of reference repositories, and a new parameter to undo
 '--auto-repack', and you'd have to remember to actually undo both
 of those settings.

 In contrast, if the new feature was '--borrow', and the evolution
 of the feature was a global configuration 'fetch.autoBorrow', then
 to turn it off temporarily, one only needs a single new parameter
 '--no-auto-borrow'.  I think this is a cleaner approach than the
 former, although much more work.

I think you may have misread me.  With the new option, I was
hinting that the "clone --reference && repack && rm alternates"
sequence will be an acceptable internal implementation of the
--borrow option that was mentioned in the thread.  I am not sure
where you got the auto-repack from.

One of the reasons you may have misread me may be that I made it
sound as if "this may work, and when it works you will be happy, but
if it does not work you did not lose very much" by mentioning
"mv && fsck".  That wasn't what I meant.

The repack -a procedure is to make the borrower repository no
longer dependent on the borrowee, and it is supposed to always work.
In fact, this behaviour was the whole reason why repack later
learned its -l option to disable it, because people who cloned
with --reference in order to reduce the disk footprint by sharing
older and more common objects [*1*] were rightfully surprised to see
that the borrowed objects were copied over to their borrower
repository when they ran repack [*2*].

Because this is clone, there is nothing complex to undo.  Either
it succeeds, or you remove the whole new directory if anything
fails.

I said "even in the final version" for a simple reason: you cannot
realistically do any better than the "clone --reference &&
repack -a -d && rm alternates" sequence.

But you would need to know a few things about how Git works in order
to come to that realisation.  Here are some:

 * clone --borrow (or whatever we end up calling the option) must
   talk to two repositories:

    - We will need to have one upload-pack session with the distant
      origin repository over the network, which will send a complete
      pack.

    - We need to also copy objects that weren't sent from the
      distant origin to our repository from the reference one.

 * A single repack -a -d (without -l) after clone --reference
   is already a way to do exactly what you need---enumerate what are
   missing in the packfile that was received from the distant origin
   and come up with packfile(s) that contain all and only objects
   the cloned repository needs.

 * You cannot easily concatenate multiple packfiles into a single
   one (or append runs of objects to an existing packfile) to come
   up with a single packfile.

You _could_ shoehorn the logic to enumerate and read from the
reference, and append them at the end of the packfile received from
the distant origin repository into the part that talks to the
distant origin repository, but the object layout in the resulting
packfile will be suboptimal [*3*] and the code complexity required
to do so is not worth it [*4*].


[Footnotes]

*1* From the point of view of supporting both camps, i.e. those who
want their borrower repositories to keep sharing the objects
with the borrowee repository and those who want to use a
borrowee repository temporarily while cloning only to reduce the
network cost from the distant upstream, the current option name
--reference and the proposed name --borrow are backwards.

Re: Borrowing objects from nearby repositories

2014-03-25 Thread Andrew Keller
On Mar 24, 2014, at 5:21 PM, Ævar Arnfjörð Bjarmason ava...@gmail.com wrote:
 On Wed, Mar 12, 2014 at 4:37 AM, Andrew Keller and...@kellerfarm.com wrote:
 Hi all,
 
 I am considering developing a new feature, and I'd like to poll the group 
 for opinions.
 
 Background: A couple years ago, I wrote a set of scripts that speed up 
 cloning of frequently used repositories.  The scripts utilize a bare Git 
 repository located at a known location, and automate providing a --reference 
 parameter to `git clone` and `git submodule update`.  Recently, some 
 coworkers of mine expressed an interest in using the scripts, so I published 
 the current version of my scripts, called `git repocache`, described at the 
 bottom of https://github.com/andrewkeller/ak-git-tools.
 
 Slowly, it has occurred to me that this feature, or something similar to it, 
 may be worth adding to Git, so I've been thinking about the best approach.  
 Here's my best idea so far:
 
 1)  Introduce '--borrow' to `git-fetch`.  This would behave similarly to 
 '--reference', except that it operates on a temporary basis, and does not 
 assume that the reference repository will exist after the operation 
 completes, so any used objects are copied into the local objects database.  
 In theory, this mechanism would be distinct from '--reference', so if both 
 are used, some objects would be copied, and some objects would be accessible 
 via a reference repository referenced by the alternates file.
 
 Isn't this the same as git clone --reference path --no-hardlinks url ?

'--reference' adds an entry to 'info/alternates' inside the objects folder.
When an object is looked up, any objects folder listed in
'objects/info/alternates' is considered to be an extension of the local objects
folder.  So when fetch, for example, decides whether it already has a blob
locally, it may find the blob in one of the reference repositories and not
download it at all.  If I clone one of my 80 GB repositories over SSH using a
reference repository, the resulting clone is only about 175 KB, because it
assumes the reference repository will continue to exist, so it doesn't
actually own any objects itself.
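
A minimal sketch of how one might check whether a clone currently
borrows objects this way.  borrows_objects is an illustrative helper
name, not a Git command, and it assumes a non-bare repository whose
object store lives under .git/objects.

```shell
# Illustrative helper, not part of Git: succeed if the working tree at
# $1 has a non-empty alternates file, i.e. it extends its object
# folder with at least one other repository's objects.
borrows_objects() {
    test -s "$1/.git/objects/info/alternates"
}
```

For example: borrows_objects mine && echo "mine depends on another repository"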

The '--no-hardlinks' option is only applicable when hard linking is available 
in the first place - i.e., when cloning from one local folder to another on the 
same filesystem (assuming the filesystem supports hard links).

Thanks,
 - Andrew



Re: Borrowing objects from nearby repositories

2014-03-25 Thread Junio C Hamano
Ævar Arnfjörð Bjarmason ava...@gmail.com writes:

 1) Introduce '--borrow' to `git-fetch`.  This would behave similarly
 to '--reference', except that it operates on a temporary basis, and
 does not assume that the reference repository will exist after the
 operation completes, so any used objects are copied into the local
 objects database.  In theory, this mechanism would be distinct from
 '--reference', so if both are used, some objects would be copied, and
 some objects would be accessible via a reference repository referenced
 by the alternates file.

 Isn't this the same as git clone --reference path --no-hardlinks
 url ?

 Also without --no-hardlinks we're not assuming that the other repo
 doesn't go away (you could rm-rf it), just that the files won't be
 *modified*, which Git won't do, but you could manually do with other
 tools, so the default is to hardlink.

I think that the standard practice with the existing toolset is to
clone with reference and then repack.  That is:

    $ git clone --reference borrowee git://over/there mine
    $ cd mine
    $ git repack -a -d

And then you can try this:

    $ mv .git/objects/info/alternates .git/objects/info/alternates.disabled
    $ git fsck

to make sure that you are no longer borrowing anything from the
borrowee.  Once you are satisfied, you can remove the saved-away
alternates.disabled file.
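
The mv and fsck check above could be wrapped so that the alternates
file is put back when the check fails.  This is a sketch under
assumptions: verify_standalone is a made-up name, and the repository
is assumed to be non-bare.

```shell
# Sketch only: check that the repository at $1 no longer needs its
# alternates, and drop the alternates file if so.
verify_standalone() {
    local info="$1/.git/objects/info"

    # Nothing to do if the repository has no alternates file.
    test -f "$info/alternates" || return 0

    # Hide the alternates, then let fsck look for missing objects.
    mv "$info/alternates" "$info/alternates.disabled" || return
    if git -C "$1" fsck
    then
        # fsck is happy: the repository stands on its own.
        rm -f "$info/alternates.disabled"
    else
        # fsck reported a problem (for example missing objects):
        # restore the alternates file and report failure.
        mv "$info/alternates.disabled" "$info/alternates"
        return 1
    fi
}
```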


Re: Borrowing objects from nearby repositories

2014-03-25 Thread Junio C Hamano
Junio C Hamano gits...@pobox.com writes:

 Ævar Arnfjörð Bjarmason ava...@gmail.com writes:

 1) Introduce '--borrow' to `git-fetch`.  This would behave similarly
 to '--reference', except that it operates on a temporary basis, and
 does not assume that the reference repository will exist after the
 operation completes, so any used objects are copied into the local
 objects database.  In theory, this mechanism would be distinct from
 '--reference', so if both are used, some objects would be copied, and
 some objects would be accessible via a reference repository referenced
 by the alternates file.

 Isn't this the same as git clone --reference path --no-hardlinks
 url ?

 Also without --no-hardlinks we're not assuming that the other repo
 doesn't go away (you could rm-rf it), just that the files won't be
 *modified*, which Git won't do, but you could manually do with other
 tools, so the default is to hardlink.

 I think that the standard practice with the existing toolset is to
 clone with reference and then repack.  That is:

 $ git clone --reference borrowee git://over/there mine
 $ cd mine
 $ git repack -a -d

 And then you can try this:

 $ mv .git/objects/info/alternates .git/objects/info/alternates.disabled
 $ git fsck

 to make sure that you are no longer borrowing anything from the
 borrowee.  Once you are satisfied, you can remove the saved-away
 alternates.disabled file.

Oh, I forgot to say that I am not opposed if somebody wants to teach
git clone a new option to copy its objects from two places,
(hopefully) the majority from near-by reference repository and the
remainder over the network, without permanently relying on the
former via the alternates mechanism.  The implementation of such a
feature could even literally be "clone with reference first and then
repack", at least initially and perhaps even in the final version.


Re: Borrowing objects from nearby repositories

2014-03-24 Thread Ævar Arnfjörð Bjarmason
On Wed, Mar 12, 2014 at 4:37 AM, Andrew Keller and...@kellerfarm.com wrote:
 Hi all,

 I am considering developing a new feature, and I'd like to poll the group for 
 opinions.

 Background: A couple years ago, I wrote a set of scripts that speed up 
 cloning of frequently used repositories.  The scripts utilize a bare Git 
 repository located at a known location, and automate providing a --reference 
 parameter to `git clone` and `git submodule update`.  Recently, some 
 coworkers of mine expressed an interest in using the scripts, so I published 
 the current version of my scripts, called `git repocache`, described at the 
 bottom of https://github.com/andrewkeller/ak-git-tools.

 Slowly, it has occurred to me that this feature, or something similar to it, 
 may be worth adding to Git, so I've been thinking about the best approach.  
 Here's my best idea so far:

 1)  Introduce '--borrow' to `git-fetch`.  This would behave similarly to 
 '--reference', except that it operates on a temporary basis, and does not 
 assume that the reference repository will exist after the operation 
 completes, so any used objects are copied into the local objects database.  
 In theory, this mechanism would be distinct from '--reference', so if both 
 are used, some objects would be copied, and some objects would be accessible 
 via a reference repository referenced by the alternates file.

Isn't this the same as git clone --reference path --no-hardlinks url ?

Also without --no-hardlinks we're not assuming that the other repo
doesn't go away (you could rm-rf it), just that the files won't be
*modified*, which Git won't do, but you could manually do with other
tools, so the default is to hardlink.

 2)  Teach `git fetch` to read 'repocache.path' (or a better-named 
 configuration), and use it to automatically activate borrowing.

So a default path for --reference path --no-hardlinks ?

 3)  For consistency, `git clone`, `git pull`, and `git submodule update` 
 should probably all learn '--borrow', and forward it to `git fetch`.

 4)  In some scenarios, it may be necessary to temporarily not automatically 
 borrow, so `git fetch`, and everything that calls it may need an argument to 
 do that.

 Intended outcome: With 'repocache.path' set, and the cached repository 
 properly updated, one could run `git clone url`, and the operation would 
 complete much faster than it does now due to less load on the network.

 Things I haven't figured out yet:

 *  What's the best approach to copying the needed objects?  It's probably 
 inefficient to copy individual objects out of pack files one at a time, but 
 it could be wasteful to copy entire pack files just because you need one 
 object.  Hard-linking could help, but that won't always be available.  One of 
 my previous ideas was to add a '--auto-repack' option to `git-clone`, which 
 solves this problem better, but introduces some other front-end usability 
 problems.
 *  To maintain optimal effectiveness, users would have to regularly run a 
 fetch in the cache repository.  Not all users know how to set up a scheduled 
 task on their computer, so this might become a maintenance problem for the 
 user.  This kind of problem I think brings into question the viability of the 
 underlying design here, assuming that the ultimate goal is to clone faster, 
 with very little or no change in the use of git.


 Thoughts?

 Thanks,
 Andrew Keller



Re: Borrowing objects from nearby repositories

2014-03-23 Thread Phil Hord
On Tue, Mar 11, 2014 at 11:37 PM, Andrew Keller and...@kellerfarm.com wrote:
 I am considering developing a new feature, and I'd like to poll the group for 
 opinions.

 Background: A couple years ago, I wrote a set of scripts that speed up 
 cloning of frequently used repositories.  The scripts utilize a bare Git 
 repository located at a known location, and automate providing a --reference 
 parameter to `git clone` and `git submodule update`.  Recently, some 
 coworkers of mine expressed an interest in using the scripts, so I published 
 the current version of my scripts, called `git repocache`, described at the 
 bottom of https://github.com/andrewkeller/ak-git-tools.

 Slowly, it has occurred to me that this feature, or something similar to it, 
 may be worth adding to Git, so I've been thinking about the best approach.  
 Here's my best idea so far:

 1)  Introduce '--borrow' to `git-fetch`.  This would behave similarly to 
 '--reference', except that it operates on a temporary basis, and does not 
 assume that the reference repository will exist after the operation 
 completes, so any used objects are copied into the local objects database.  
 In theory, this mechanism would be distinct from '--reference', so if both 
 are used, some objects would be copied, and some objects would be accessible 
 via a reference repository referenced by the alternates file.

Interesting.  I do something similar on my CI Server to reduce
workload on Gerrit. Having a built-in to support submodules would be
nice.  Currently my script does this:

MIRROR=/path/to/local/mirror
NEW=ssh://gerrit-server
git clone "${MIRROR}/project" && cd project

#-- Init/update submodules from our local mirror if possible
git submodule update --recursive --init

#-- Switch to the remote server URL
git config remote.origin.url \
    "$(git config remote.origin.url | sed -e "s|^${MIRROR}|${NEW}|")"
git submodule sync #--recursive ; recursive not supported :-[

#-- Checkout remote updates (BRANCH is assumed to be set by the CI job)
git pull --ff-only --recurse-submodules origin "${BRANCH}"
git submodule update --recursive --init


Is that about the same as you are aiming for?


 2)  Teach `git fetch` to read 'repocache.path' (or a better-named 
 configuration), and use it to automatically activate borrowing.

Seems like this could be trouble if a local repo is coincidentally
named the same as some unrelated repo you want to clone.  But I can
see the value.

What about something similar to url.insteadOf?   Maybe
'url.${SERVER}.autoBorrow = ${MIRROR}', with replacement semantics
similar to insteadOf.
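
For comparison, the existing url.<base>.insteadOf setting can be set
up as in the sketch below.  redirect_to_mirror is a made-up name, and
autoBorrow remains only a proposal.  Note that insteadOf redirects the
whole fetch to the mirror rather than borrowing objects from it.

```shell
# Sketch only: make URLs that start with $1 (the server prefix) be
# rewritten to start with $2 (the mirror prefix), using the existing
# url.<base>.insteadOf configuration.
redirect_to_mirror() {
    git config --global "url.$2.insteadOf" "$1"
}
```

For example, "redirect_to_mirror ssh://gerrit-server/ /path/to/local/mirror/"
would make "git clone ssh://gerrit-server/project" read from
/path/to/local/mirror/project.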

 3)  For consistency, `git clone`, `git pull`, and `git submodule update` 
 should probably all learn '--borrow', and forward it to `git fetch`.

 4)  In some scenarios, it may be necessary to temporarily not automatically 
 borrow, so `git fetch`, and everything that calls it may need an argument to 
 do that.

--no-borrow

Phil


Borrowing objects from nearby repositories

2014-03-11 Thread Andrew Keller
Hi all,

I am considering developing a new feature, and I'd like to poll the group for 
opinions.

Background: A couple years ago, I wrote a set of scripts that speed up cloning 
of frequently used repositories.  The scripts utilize a bare Git repository 
located at a known location, and automate providing a --reference parameter to 
`git clone` and `git submodule update`.  Recently, some coworkers of mine 
expressed an interest in using the scripts, so I published the current version 
of my scripts, called `git repocache`, described at the bottom of 
https://github.com/andrewkeller/ak-git-tools.

Slowly, it has occurred to me that this feature, or something similar to it, 
may be worth adding to Git, so I've been thinking about the best approach.  
Here's my best idea so far:

1)  Introduce '--borrow' to `git-fetch`.  This would behave similarly to 
'--reference', except that it operates on a temporary basis, and does not 
assume that the reference repository will exist after the operation completes, 
so any used objects are copied into the local objects database.  In theory, 
this mechanism would be distinct from '--reference', so if both are used, some 
objects would be copied, and some objects would be accessible via a reference 
repository referenced by the alternates file.

2)  Teach `git fetch` to read 'repocache.path' (or a better-named 
configuration), and use it to automatically activate borrowing.

3)  For consistency, `git clone`, `git pull`, and `git submodule update` should 
probably all learn '--borrow', and forward it to `git fetch`.

4)  In some scenarios, it may be necessary to temporarily not automatically 
borrow, so `git fetch`, and everything that calls it may need an argument to do 
that.

Intended outcome: With 'repocache.path' set, and the cached repository properly 
updated, one could run `git clone url`, and the operation would complete much 
faster than it does now due to less load on the network.

Things I haven't figured out yet:

*  What's the best approach to copying the needed objects?  It's probably 
inefficient to copy individual objects out of pack files one at a time, but it 
could be wasteful to copy entire pack files just because you need one object.  
Hard-linking could help, but that won't always be available.  One of my 
previous ideas was to add a '--auto-repack' option to `git-clone`, which solves 
this problem better, but introduces some other front-end usability problems.
*  To maintain optimal effectiveness, users would have to regularly run a fetch 
in the cache repository.  Not all users know how to set up a scheduled task on 
their computer, so this might become a maintenance problem for the user.  This 
kind of problem I think brings into question the viability of the underlying 
design here, assuming that the ultimate goal is to clone faster, with very 
little or no change in the use of git.


Thoughts?

Thanks,
Andrew Keller
