Re: git-fetch pulls already-pulled objects?
> I forgot to mention the recent "pack bitmap" addition. It makes the
> set of "can be cheaply proven to exist" a lot larger.

Cool! I tried this feature, and it worked! (At least, it worked for my
small test case.)

I ran on the server (after pushing the objects):

    git config repack.writeBitmaps true
    git repack -Ad

After this, the 'git fetch origin master2' was super quick.

Thanks for your help!

Aside: This test case is using (normal, C/sh) Git. My production
environment uses JGit on the server. I haven't tested this with JGit.

-----Original Message-----
From: Junio C Hamano
Date: Thursday, October 29, 2015 at 11:42 AM
To: Matt Glazer
Cc: "git@vger.kernel.org"
Subject: Re: git-fetch pulls already-pulled objects?

>Matt Glazar writes:
>
>> Would negotiating the tree object hashes be possible on the client
>> without server changes? Is the protocol that flexible?
>
>The protocol is strictly "find common ancestor in the commit
>history". Everything else is done on the sender.
>
>>>The object transfer is done by first finding the common ancestor of
>>>histories of the sending and the receiving sides, which allows the
>>>sender to enumerate commits that the sender has but the receiver
>>>doesn't. From there, all objects [*1*] that are referenced by these
>>>commits that need to be sent.
>
>>>[Footnote]
>>>
>>>*1* There is an optimization to exclude the trees and blobs that can
>>>be cheaply proven to exist on the receiving end. If the receiving
>>>end has a commit that the sending end does *not* have, and that
>>>commit happens to record a tree the sending end needs to send,
>>>however, the sending end cannot prove that the tree does not have to
>>>be sent without first fetching that commit from the receiving end,
>>>which fails "can be cheaply proven to exist" test.
>
>I forgot to mention the recent "pack bitmap" addition. It makes the
>set of "can be cheaply proven to exist" a lot larger.
>
>If for example the sender needs to send one commit C because it
>determined that the receiver has history up to commit C~1, without
>the bitmap, even when C^{tree} (i.e. the tree of C) is identical to
>C~2^{tree} (i.e. the tree of C~2), it would have sent that tree
>object because "proving that the receiver already has it" would
>require the sender to dig its history back, starting from C~1
>(i.e. the commit that is known to exist at the receiver), to
>enumerate the objects contained in the common part of the history,
>which fails the "can be cheaply proven to exist" test.
>
>The "pack bitmap" pre-computes what commits, trees and blobs should
>already exist in the repository given a commit for which bitmap
>exists. Using the bitmap, from C~1 (i.e. the commit known to exist
>at the receiving end), it can be proven cheaply that C^{tree} that
>happens to be identical to C~2^{tree} already exists over there, and
>the sender can use this knowledge to reduce the transfer.
>
>The "pack bitmap" however does not change the fundamental structure.
>If your receiver has a commit that is not known to the sender, and
>that commit happens to record the same tree recorded in the commit
>that needs to be sent, there is no way for the sender to know that
>the receiver has it, exactly because the exchange between them is
>purely "find common ancestor in history".
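
For anyone repeating the bitmap experiment above, here is a minimal sketch for checking that the repack really produced a bitmap index. It runs in a throwaway repository rather than on a real server; the scratch directory, the file name, and the committer identity are assumptions made up for illustration.

```shell
#!/bin/sh
# Sketch: enable bitmap writing, repack, and look for the .bitmap file.
set -eu

cd "$(mktemp -d)"                      # hypothetical scratch repository
git init -q
git config user.name  'Example User'   # placeholder identity
git config user.email 'user@example.com'

echo hello > file
git add file
git commit -q -m 'first commit'

# The two commands from the message above.
git config repack.writeBitmaps true
git repack -A -d -q

# A bitmap index is stored next to the pack it describes.
bitmaps="$(ls .git/objects/pack/*.bitmap)"
printf '%s\n' "$bitmaps"
```

On a bare server repository the pack directory would be `objects/pack` directly under the repository path instead of `.git/objects/pack`.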
Re: git-fetch pulls already-pulled objects?
Matt Glazar writes:

> Would negotiating the tree object hashes be possible on the client without
> server changes? Is the protocol that flexible?

The protocol is strictly "find common ancestor in the commit
history". Everything else is done on the sender.

>>The object transfer is done by first finding the common ancestor of
>>histories of the sending and the receiving sides, which allows the
>>sender to enumerate commits that the sender has but the receiver
>>doesn't. From there, all objects [*1*] that are referenced by these
>>commits that need to be sent.

>>[Footnote]
>>
>>*1* There is an optimization to exclude the trees and blobs that can
>>be cheaply proven to exist on the receiving end. If the receiving
>>end has a commit that the sending end does *not* have, and that
>>commit happens to record a tree the sending end needs to send,
>>however, the sending end cannot prove that the tree does not have to
>>be sent without first fetching that commit from the receiving end,
>>which fails "can be cheaply proven to exist" test.

I forgot to mention the recent "pack bitmap" addition. It makes the
set of "can be cheaply proven to exist" a lot larger.

If, for example, the sender needs to send one commit C because it
determined that the receiver has history up to commit C~1, then without
the bitmap, even when C^{tree} (i.e. the tree of C) is identical to
C~2^{tree} (i.e. the tree of C~2), it would have sent that tree
object because "proving that the receiver already has it" would
require the sender to dig its history back, starting from C~1
(i.e. the commit that is known to exist at the receiver), to
enumerate the objects contained in the common part of the history,
which fails the "can be cheaply proven to exist" test.

The "pack bitmap" pre-computes what commits, trees and blobs should
already exist in the repository given a commit for which a bitmap
exists. Using the bitmap, from C~1 (i.e. the commit known to exist
at the receiving end), it can be proven cheaply that C^{tree}, which
happens to be identical to C~2^{tree}, already exists over there, and
the sender can use this knowledge to reduce the transfer.

The "pack bitmap" however does not change the fundamental structure.
If your receiver has a commit that is not known to the sender, and
that commit happens to record the same tree recorded in the commit
that needs to be sent, there is no way for the sender to know that
the receiver has it, exactly because the exchange between them is
purely "find common ancestor in history".

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
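
The C / C~1 / C~2 situation described above can be reproduced locally. The sketch below is one way to build it, assuming a revert is what makes C record the same tree as C~2 (the 'git revert HEAD' case mentioned elsewhere in this thread); the scratch directory, file name, and identity are hypothetical.

```shell
#!/bin/sh
# Sketch: build a history where C^{tree} is identical to C~2^{tree}.
set -eu

cd "$(mktemp -d)"                      # hypothetical scratch repository
git init -q
git config user.name  'Example User'   # placeholder identity
git config user.email 'user@example.com'

echo one > file
git add file
git commit -q -m 'C~2: original content'

echo two > file
git commit -q -a -m 'C~1: changed content'

# C undoes C~1, so its tree equals the tree of C~2.
git revert --no-edit HEAD >/dev/null

tree_c="$(git rev-parse 'HEAD^{tree}')"
tree_c1="$(git rev-parse 'HEAD~1^{tree}')"
tree_c2="$(git rev-parse 'HEAD~2^{tree}')"

printf 'C   tree: %s\nC~1 tree: %s\nC~2 tree: %s\n' \
    "$tree_c" "$tree_c1" "$tree_c2"
```

The tree of C is not referenced by C~1 itself, which is why the sender needs either a history walk or a bitmap to learn that the receiver already has it.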
Re: git-fetch pulls already-pulled objects?
> What you are expecting _could_ be implemented by exchanging all
> tree and blob objects sending and receiving sides have and computing
> the set difference, but the sender and the receiver do not exchange
> such a huge list.

In my case, I only want to exchange the hash of the tree object pointed
to directly by the commit object; I don't care about all subtrees and
blobs reachable from the commit. I think a naive approach would at worst
double the number of hashes sent.

Would negotiating the tree object hashes be possible on the client without
server changes? Is the protocol that flexible?

If what I want is *not* possible, is it possible to explicitly put a tree
(and its descendants) into its own pack? I think doing this on the server
would speed up the git-fetch a bit. (I know ahead of time what
trees/commits will be sent.) (The server does less work pulling the
objects out of an existing pack and repacking them for the client. (Or
maybe my mental model of git packs is wrong?))

> The object transfer is done by first finding the common ancestor of
> histories of the sending and the receiving sides, which allows the
> sender to enumerate commits that the sender has but the receiver
> doesn't. From there, all objects [*1*] that are referenced by these
> commits that need to be sent.

Thanks for clarifying.

> *1* There is an optimization to exclude the trees and blobs that can
> be cheaply proven to exist on the receiving end.

That makes sense (especially for 'git revert HEAD' situations).

Thank you for your reply, Junio.

-----Original Message-----
From: Junio C Hamano
Date: Thursday, October 29, 2015 at 10:32 AM
To: Matt Glazer
Cc: "git@vger.kernel.org"
Subject: Re: git-fetch pulls already-pulled objects?

>Matt Glazar writes:
>
>> On a remote, I have two Git commit objects which point to the same tree
>> object (created with git commit-tree).
>
>What you are expecting _could_ be implemented by exchanging all
>tree and blob objects sending and receiving sides have and computing
>the set difference, but the sender and the receiver do not exchange
>such a huge list.
>
>The object transfer is done by first finding the common ancestor of
>histories of the sending and the receiving sides, which allows the
>sender to enumerate commits that the sender has but the receiver
>doesn't. From there, all objects [*1*] that are referenced by these
>commits that need to be sent.
>
>
>[Footnote]
>
>*1* There is an optimization to exclude the trees and blobs that can
>be cheaply proven to exist on the receiving end. If the receiving
>end has a commit that the sending end does *not* have, and that
>commit happens to record a tree the sending end needs to send,
>however, the sending end cannot prove that the tree does not have to
>be sent without first fetching that commit from the receiving end,
>which fails "can be cheaply proven to exist" test.
Re: git-fetch pulls already-pulled objects?
Matt Glazar writes:

> On a remote, I have two Git commit objects which point to the same tree
> object (created with git commit-tree).

What you are expecting _could_ be implemented by exchanging all
tree and blob objects sending and receiving sides have and computing
the set difference, but the sender and the receiver do not exchange
such a huge list.

The object transfer is done by first finding the common ancestor of
histories of the sending and the receiving sides, which allows the
sender to enumerate commits that the sender has but the receiver
doesn't. From there, all objects [*1*] that are referenced by these
commits need to be sent.


[Footnote]

*1* There is an optimization to exclude the trees and blobs that can
be cheaply proven to exist on the receiving end. If the receiving
end has a commit that the sending end does *not* have, and that
commit happens to record a tree the sending end needs to send,
however, the sending end cannot prove that the tree does not have to
be sent without first fetching that commit from the receiving end,
which fails the "can be cheaply proven to exist" test.
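
The enumeration described above can be approximated from the command line with `git rev-list --objects`. This is only a rough sketch of the idea, not the code path the server actually runs; the repository layout, file names, and the `have`/`want` variable names are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: list what a sender would consider for "want, given have".
set -eu

cd "$(mktemp -d)"                      # hypothetical scratch repository
git init -q
git config user.name  'Example User'   # placeholder identity
git config user.email 'user@example.com'

echo same > unchanged.txt
echo one  > changed.txt
git add unchanged.txt changed.txt
git commit -q -m 'A: history the receiver already has'
have="$(git rev-parse HEAD)"

echo two > changed.txt
git commit -q -a -m 'B: new commit on the sender'
want="$(git rev-parse HEAD)"

# Commits reachable from $want but not from $have, followed by the
# trees and blobs those commits reference.
objects="$(git rev-list --objects "$want" --not "$have")"
printf '%s\n' "$objects"
```

Here the receiver's commit is the direct parent of the commit being sent, which is the easy case. When the receiver holds the tree through a commit the sender cannot relate to the one being sent, the footnote's caveat applies.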
git-fetch pulls already-pulled objects?
On a remote, I have two Git commit objects which point to the same tree
object (created with git commit-tree). If I fetch one of the commits, the
commit object (including the tree object) is fetched. If I then fetch the
other commit, the tree object (and its dependencies) is fetched *again* (I
think).

I don't want the tree object downloaded again, because it is large
(multi-gigabyte). Because the tree object exists locally, I think it
should not be downloaded.

Is this a bug in Git, or is this by design?

How can I confirm that the tree object (and dependencies) are downloaded
twice?

Is there a more complicated git-fetch (or similar) command I can execute
to not download the already-downloaded tree objects? (I have the hash of
the tree object which would be potentially re-downloaded, if that helps.)

Sequence of commands to reproduce:

    # Replace this with the URL to an empty Git repository.
    remote=ssh://foo/bar.git

    # Create some random data to exaggerate git-fetch times.
    # If you have a slow remote, reduce 'count'.
    mkdir minimal
    cd minimal
    dd if=/dev/urandom of=random bs=65536 count=4096

    # Create our two commits (master and master2).
    git init
    git add random
    git commit -m 'Random data (commit 1)'
    git branch master2 \
        "$(echo 'Random data (commit 2)' \
            | git commit-tree 'HEAD^{tree}')"

    # Push our commits. Expected to take some time.
    git remote add origin "${remote}"
    git push origin \
        master:refs/heads/master \
        master2:refs/heads/master2

    # Clone master into minimal-clone. Expected to take some time.
    cd ..
    git clone --single-branch --branch master "${remote}" minimal-clone

    # Fetch master2. Should be nearly instant, but takes some
    # time. Seems to download everything again.
    cd minimal-clone
    git fetch origin master2

    # Try again. git-fetch takes a while, but shouldn't.
    rm -f .git/FETCH_HEAD
    git gc --prune=all
    git fetch origin master2

Info about my system:

Local (pusher):
    OS: OS X 10.10.5
    git: git version 2.0.1
    ssh: OpenSSH_6.2p2, OSSLShim 0.9.8r 8 Dec 2011

Remote (server):
    OS: Linux 4.0.9 (CentOS 6)
    git: git version 2.4.6
    sshd: OpenSSH_6.7p1-hpn14v5, OpenSSL 1.0.1e-fips 11 Feb 2013
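
On the question of how to confirm a second download, one possible approach is to compare the local object store before and after the fetch. The sketch below is a scaled-down variant of the reproduction above, with a local bare repository standing in for the real remote. All paths, the `file://` URL, the identity, and the smaller `count` are assumptions, and whether the numbers grow will depend on the Git version and server configuration.

```shell
#!/bin/sh
# Sketch: check that the tree is already present, then measure the
# object store before and after fetching master2.
set -eu

work="$(mktemp -d)"                    # hypothetical scratch directory
cd "$work"
git init -q --bare server.git          # stand-in for the real remote

git init -q src
cd src
git config user.name  'Example User'   # placeholder identity
git config user.email 'user@example.com'
dd if=/dev/urandom of=random bs=65536 count=16 2>/dev/null
git add random
git commit -q -m 'Random data (commit 1)'
c2="$(echo 'Random data (commit 2)' | git commit-tree 'HEAD^{tree}')"
tree="$(git rev-parse 'HEAD^{tree}')"
git push -q ../server.git \
    HEAD:refs/heads/master "$c2":refs/heads/master2
cd ..

git clone -q --single-branch --branch master \
    "file://$work/server.git" clone
cd clone

# 'git cat-file -e' exits with status 0 if the object exists locally.
tree_present=no
git cat-file -e "$tree" && tree_present=yes
echo "tree present before fetch: $tree_present"

before="$(git count-objects -v)"
git fetch -q origin master2
after="$(git count-objects -v)"

printf 'before fetch:\n%s\n\nafter fetch:\n%s\n' "$before" "$after"
```

A noticeable jump in `size-pack` (or `size`) between the two reports would suggest the blob data arrived a second time; an increase of only a few hundred bytes would be consistent with just the new commit being sent.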