Re: git-fetch pulls already-pulled objects?

2015-10-29 Thread Matt Glazar
> I forgot to mention the recent "pack bitmap" addition.  It makes the
> set of "can be cheaply proven to exist" a lot larger.

Cool! I tried this feature, and it worked! (At least, it worked for my
small test case.)

I ran on the server (after pushing the objects):

git config repack.writeBitmaps true
git repack -Ad

After this, the 'git fetch origin master2' was super quick.
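
For anyone repeating this, here is a minimal sketch of how one might check that the repack really wrote a bitmap. It assumes a throwaway repository; the temporary path, the identity, and the file contents are made up for illustration and are not from my setup. If bitmaps were written, a .bitmap file should sit next to the pack.

```shell
#!/bin/sh
set -e
# Scratch repository standing in for the server side (hypothetical path).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Example User'
git config user.email 'user@example.com'

# Small stand-in for the real data.
echo 'some data' > random
git add random
git commit -q -m 'Random data (commit 1)'

# The same two commands as above: enable bitmaps, then repack
# everything into a single pack.
git config repack.writeBitmaps true
git repack -q -Ad

# Count the bitmap index files written next to the pack.
bitmap_count=$(find .git/objects/pack -name '*.bitmap' | wc -l | tr -d ' ')
echo "bitmap files: $bitmap_count"
```

A count of zero would suggest the repack did not produce a bitmap, in which case the fetch would presumably fall back to the slower path.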

Thanks for your help!

Aside: This test case is using (normal, C/sh) Git. My production
environment uses JGit on the server. I haven't tested this with JGit.



Re: git-fetch pulls already-pulled objects?

2015-10-29 Thread Junio C Hamano
Matt Glazar  writes:

> Would negotiating the tree object hashes be possible on the client without
> server changes? Is the protocol that flexible?

The protocol is strictly "find common ancestor in the commit
history".  Everything else is done on the sender.

>>The object transfer is done by first finding the common ancestor of
>>histories of the sending and the receiving sides, which allows the
>>sender to enumerate commits that the sender has but the receiver
>>doesn't.  From there, all objects [*1*] that are referenced by these
>>commits need to be sent.

>>[Footnote]
>>
>>*1* There is an optimization to exclude the trees and blobs that can
>>be cheaply proven to exist on the receiving end.  If the receiving
>>end has a commit that the sending end does *not* have, and that
>>commit happens to record a tree the sending end needs to send,
>>however, the sending end cannot prove that the tree does not have to
>>be sent without first fetching that commit from the receiving end,
>>which fails the "can be cheaply proven to exist" test.

I forgot to mention the recent "pack bitmap" addition.  It makes the
set of "can be cheaply proven to exist" a lot larger.

If for example the sender needs to send one commit C because it
determined that the receiver has history up to commit C~1, without
the bitmap, even when C^{tree} (i.e. the tree of C) is identical to
C~2^{tree} (i.e. the tree of C~2), it would have sent that tree
object because "proving that the receiver already has it" would
require the sender to dig its history back, starting from C~1
(i.e. the commit that is known to exist at the receiver), to
enumerate the objects contained in the common part of the history,
which fails the "can be cheaply proven to exist" test.
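
To make the C / C~1 / C~2 situation concrete, here is a minimal sketch that builds such a history with a revert. The scratch repository, the identity, and the file contents are assumptions for illustration only; any history in which C restores the content of C~2 would do.

```shell
#!/bin/sh
set -e
# Hypothetical scratch repository; nothing here comes from a real project.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Example User'
git config user.email 'user@example.com'

echo 'first version' > file
git add file
git commit -q -m 'C~2: original content'

echo 'second, longer version' > file
git commit -q -a -m 'C~1: change the file'

# Reverting C~1 restores the content of C~2, so the new commit C
# records the same tree object as C~2.
git revert --no-edit HEAD >/dev/null

tree_c=$(git rev-parse 'HEAD^{tree}')
tree_c1=$(git rev-parse 'HEAD~1^{tree}')
tree_c2=$(git rev-parse 'HEAD~2^{tree}')
echo "C^{tree}:   $tree_c"
echo "C~1^{tree}: $tree_c1"
echo "C~2^{tree}: $tree_c2"
```

The first and third hashes printed should match, while the middle one differs. That is the case where the receiver already holds the tree but the sender has to look past C~1 to know it.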

The "pack bitmap" pre-computes what commits, trees and blobs should
already exist in the repository given a commit for which a bitmap
exists.  Using the bitmap, from C~1 (i.e. the commit known to exist
at the receiving end), it can be proven cheaply that C^{tree} that
happens to be identical to C~2^{tree} already exists over there, and
the sender can use this knowledge to reduce the transfer.

The "pack bitmap" however does not change the fundamental structure.
If your receiver has a commit that is not known to the sender, and
that commit happens to record the same tree recorded in the commit
that needs to be sent, there is no way for the sender to know that
the receiver has it, exactly because the exchange between them is
purely "find common ancestor in history".
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: git-fetch pulls already-pulled objects?

2015-10-29 Thread Matt Glazar
> What you are expecting _could_ be implemented by exchanging all
> tree and blob objects sending and receiving sides have and computing
> the set difference, but the sender and the receiver do not exchange
> such a huge list.

In my case, I only want to exchange the hash of the tree object pointed to
directly by the commit object; I don't care about all subtrees and blobs
reachable from the commit. I think a naive approach would only double the
number of hashes sent in the worst case.

Would negotiating the tree object hashes be possible on the client without
server changes? Is the protocol that flexible?


If what I want is *not* possible, is it possible to explicitly put a tree
(and its descendants) into its own pack? I think doing this on the server
would speed up the git-fetch a bit, because the server would do less work
pulling the objects out of an existing pack and repacking them for the
client. (I know ahead of time which trees/commits will be sent.) Or maybe
my mental model of git packs is wrong?

> The object transfer is done by first finding the common ancestor of
> histories of the sending and the receiving sides, which allows the
> sender to enumerate commits that the sender has but the receiver
> doesn't.  From there, all objects [*1*] that are referenced by these
> commits need to be sent.

Thanks for clarifying.

> *1* There is an optimization to exclude the trees and blobs that can
> be cheaply proven to exist on the receiving end.

That makes sense (especially for 'git revert HEAD' situations).

Thank you for your reply, Junio.



Re: git-fetch pulls already-pulled objects?

2015-10-29 Thread Junio C Hamano
Matt Glazar  writes:

> On a remote, I have two Git commit objects which point to the same tree
> object (created with git commit-tree).

What you are expecting _could_ be implemented by exchanging all
tree and blob objects sending and receiving sides have and computing
the set difference, but the sender and the receiver do not exchange
such a huge list.

The object transfer is done by first finding the common ancestor of
histories of the sending and the receiving sides, which allows the
sender to enumerate commits that the sender has but the receiver
doesn't.  From there, all objects [*1*] that are referenced by these
commits need to be sent.


[Footnote]

*1* There is an optimization to exclude the trees and blobs that can
be cheaply proven to exist on the receiving end.  If the receiving
end has a commit that the sending end does *not* have, and that
commit happens to record a tree the sending end needs to send,
however, the sending end cannot prove that the tree does not have to
be sent without first fetching that commit from the receiving end,
which fails the "can be cheaply proven to exist" test.
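
As a rough local illustration of the enumeration described above, the following sketch uses git rev-list to list what is reachable from a commit the "sender" has but not from a commit the "receiver" has. It only approximates what the sending side computes and is not the wire protocol itself; the scratch repository and the variable names have and want are assumptions made for this example.

```shell
#!/bin/sh
set -e
# Hypothetical scratch repository playing both roles.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Example User'
git config user.email 'user@example.com'

# Commit A: assume the receiver already has this one.
echo 'first version' > file
git add file
git commit -q -m 'A'
have=$(git rev-parse HEAD)

# Commit B: only the sender has this one.
echo 'second, longer version' > file
git commit -q -a -m 'B'
want=$(git rev-parse HEAD)

# Commits reachable from "want" but not from "have".
commits_to_send=$(git rev-list "$want" --not "$have")

# The same walk, also listing the trees and blobs those commits reference.
objects_to_send=$(git rev-list --objects "$want" --not "$have")

echo "commits to send:"
echo "$commits_to_send"
echo "objects to send:"
echo "$objects_to_send"
```

Here only commit B shows up in the first list, and the second list adds the tree and blob that B introduces.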



git-fetch pulls already-pulled objects?

2015-10-28 Thread Matt Glazar
On a remote, I have two Git commit objects which point to the same tree
object (created with git commit-tree). If I fetch one of the commits, the
commit object (including the tree object) is fetched. If I then fetch the
other commit, the tree object (and its dependencies) is fetched *again* (I
think). I don't want the tree object downloaded again, because it is
large (multi-gigabyte). Because the tree object exists locally, I think it
should not be downloaded.

Is this a bug in Git, or is this by design? How can I confirm that the
tree object (and dependencies) are downloaded twice? Is there a more
complicated git-fetch (or similar) command I can execute to not download
the already-downloaded tree objects? (I have the hash of the tree object
which would be potentially re-downloaded, if that helps.)

Sequence of commands to reproduce:

# Replace this with the URL to an empty Git repository.
remote=ssh://foo/bar.git

# Create some random data to exaggerate git-fetch times.
# If you have a slow remote, reduce 'count'.
mkdir minimal
cd minimal
dd if=/dev/urandom of=random bs=65536 count=4096

# Create our two commits (master and master2).
git init
git add random
git commit -m 'Random data (commit 1)'
git branch master2 \
  "$(echo 'Random data (commit 2)' \
| git commit-tree 'HEAD^{tree}')"

# Push our commits. Expected to take some time.
git remote add origin "${remote}"
git push origin \
  master:refs/heads/master \
  master2:refs/heads/master2

# Clone master into minimal-clone. Expected to take some time.
cd ..
git clone --single-branch --branch master "${remote}" minimal-clone

# Fetch master2. Should be nearly instant, but takes some
# time. Seems to download everything again.
cd minimal-clone
git fetch origin master2

# Try again. git-fetch takes a while, but shouldn't.
rm -f .git/FETCH_HEAD
git gc --prune=all
git fetch origin master2
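
One possible way to look into the "is it downloaded twice?" question without a slow remote is sketched below. It is a scaled-down variant of the script above that uses a local bare repository over file:// as the remote, which is an assumption and not my real setup. It checks that the tree is already present in the clone and then prints the remote's "Total" summary line from the fetch. The exact wording of that line depends on the Git version and locale, so treat the grep as a guess.

```shell
#!/bin/sh
set -e
work=$(mktemp -d)
cd "$work"

# Local bare repository standing in for the remote (hypothetical path).
git init -q --bare remote.git
remote="file://$work/remote.git"

# Two unrelated commits that record the same tree, as in the script above.
git init -q minimal
cd minimal
git symbolic-ref HEAD refs/heads/master
git config user.name 'Example User'
git config user.email 'user@example.com'
echo 'stand-in for the large random file' > random
git add random
git commit -q -m 'Random data (commit 1)'
git branch master2 \
  "$(echo 'Random data (commit 2)' | git commit-tree 'HEAD^{tree}')"
git push -q "$remote" master:refs/heads/master master2:refs/heads/master2
tree=$(git rev-parse 'master2^{tree}')
expected=$(git rev-parse master2)
cd ..

git clone -q --single-branch --branch master "$remote" minimal-clone
cd minimal-clone

# Exit status 0 means the tree is already present in the clone.
git cat-file -e "$tree" && echo "tree $tree is already in the clone"

# --progress asks for the remote's summary even without a terminal.
# The "Total" line, if present, reports how many objects were packed.
git fetch --progress origin master2 2> fetch.log
grep 'Total' fetch.log || true
fetched=$(git rev-parse FETCH_HEAD)
```

If the summary reports more than the single new commit object, that would suggest the tree and its blobs were packed again even though the clone already had them.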

Info about my system:


Local (pusher):
OS: OS X 10.10.5
git: git version 2.0.1
ssh: OpenSSH_6.2p2, OSSLShim 0.9.8r 8 Dec 2011


Remote (server):
OS: Linux 4.0.9 (CentOS 6)
git: git version 2.4.6
sshd: OpenSSH_6.7p1-hpn14v5, OpenSSL 1.0.1e-fips 11 Feb 2013