Re: Proposal for "fetch-any-blob Git protocol" and server design

2017-04-26 Thread Jonathan Tan

On 04/21/2017 09:41 AM, Kevin David wrote:

Hi Jonathan,

Sorry for the delayed response - other work got in the way, unfortunately!


No worries!


I am envisioning (1a) as described in Jeff Hostetler's e-mail [1] ("a
pre-command or hook to identify needed blobs and pre-fetch them before
allowing the actual command to start"), so a Git command would typically
make a single request that contains all the blobs required, but my
proposal can also handle (1c) ('"fault" them in as necessary in
read_object() while the command is running and without any pre-fetch
(either synchronously or asynchronously and with/without a helper
process)').

Even if we decided to go with single-blob requests and responses, it is
still important to send them as packfiles, so that the server can serve
them directly from its compressed storage without first having to
uncompress them.

[1] 
https://public-inbox.org/git/1488999039-37631-1-git-send-email-...@jeffhostetler.com/


Ah, I missed this. If we think we can build meaningfully-sized requests via 
(1a) and (1b),
then I agree - packs are optimal.

However, if single-blob requests/responses dominate, I have a few concerns, 
mostly from
experience playing with something similar:
* Regardless of the size of the object or number returned, the client will need 
to
  `index-pack` the result and create a corresponding `.idx` file, which requires
  decompression to construct (right?)


Yes, although I think that loose objects would need to be decompressed 
too to verify the SHA-1.
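
To make that concrete, here is a rough sketch (in Python, not Git's own
code) of what verifying a loose object entails - the SHA-1 is computed over
the decompressed "<type> <size>\0<body>" stream, so the zlib data has to be
inflated either way:

  import hashlib, zlib
  from pathlib import Path

  def verify_loose_object(git_dir, oid):
      # Loose objects live at objects/<first 2 hex chars>/<remaining hex>.
      path = Path(git_dir) / "objects" / oid[:2] / oid[2:]
      data = zlib.decompress(path.read_bytes())  # "<type> <size>\0<body>"
      return hashlib.sha1(data).hexdigest() == oid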



* Unless the client's repository is repacked aggressively, we'll pollute the
  `.git\objects\pack` directory with little indexes (somewhere around 4KiB 
minimum)
  and packfiles rapidly, which would degrade performance. One rudimentary 
workaround
  would be to loosen these packs on the client if they were under a certain
  size/object count. I think fetch does this already?

In either case, shifting the pack decompression/loose object recompression 
problem
to the client instead of the server is probably a good principle, but in our 
case it
simply wasn't fast enough to serve single blobs interactively (e.g. opening a 
source
file you don't have locally). I'm hopeful that the proposed partial clone 
solutions you
referenced would reduce the frequency of this being required.


I wrote up a proposal for how Git should handle missing blobs, including 
a hook that, when invoked, is allowed to write packed or loose objects 
[1]. I was thinking along similar lines too about the many-packfile 
issue, and that was part of the reason why [1] was written the way it is.


I think that, by default, the wire protocol should transmit packs 
because it causes less load on the server (as you describe) and it 
supports batch requests. Also, this protocol with single-blob fetching 
might be usable as-is for projects that only exclude large blobs (so 
there are relatively few of them), and will be more and more usable as 
we migrate the more commonly used Git commands to provide batch 
requests. However, the ability to customize the hook allows users to use 
another download scheme more suited to their use cases (for example, in 
yours, where you can provision servers that are topologically close to 
your clients).


[1] <20170426221346.25337-1-jonathanta...@google.com>
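
As a purely illustrative example of such a scheme, a hook could look
something like the sketch below. The one-SHA-1-per-line stdin convention
and the blob server URL are assumptions made up for illustration; the
actual hook interface is described in [1].

  #!/usr/bin/env python3
  # Hypothetical missing-blob hook: fetch each requested blob from a nearby
  # blob server and store it as a loose object.
  import subprocess, sys, urllib.request

  BLOB_URL = "https://blobs.example.com/{oid}"  # hypothetical nearby server

  for line in sys.stdin:
      oid = line.strip()
      if not oid:
          continue
      content = urllib.request.urlopen(BLOB_URL.format(oid=oid)).read()
      # hash-object -w stores the bytes as a loose blob; the resulting
      # object name should match the requested SHA-1.
      subprocess.run(["git", "hash-object", "-w", "--stdin", "-t", "blob"],
                     input=content, check=True)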


Being a bit more clever about packing objects (e.g. splitting blobs out
from commits and trees) improved this a bit, but we still hit a bottleneck
from what appeared to be a large number of memory-mapping operations on a
~140GiB packfile of blobs.

Each `pack-objects` process would consume approximately one CPU core for
the duration of the request. It's possible that further splitting of these
large blob packs would have improved performance in some scenarios, but
that would increase the number of pack-index lookups necessary to find a
single object.


I'm not very experienced with mmap, but I thought that memory-mapping a
large file in itself does not incur much of a performance penalty (if
any) - it is the accesses that count. I experimented with mmap on 15,000
MiB and 150,000 MiB files and they seemed to be handled quite well. Also,
how many objects is "pack-objects" packing here?


Back when we took this approach, it was ~4000 blob objects at a time.
Perhaps we were being bitten by the Windows implementation of git_mmap [3]?
When I profiled the 4k blob scenario, the majority of CPU and wall time was
spent in MapViewOfFileEx, which looks like it could reflect the accesses as
well.

[3] https://github.com/git/git/blob/master/compat/win32mmap.c


Hmm...I don't have much experience with this, but this may be something 
to keep in mind.



This seems like a clever way to avoid the canonical
`/info/refs?service=git-upload-pack` capability negotiation on every call.
However, using error handling to fall back seems slightly wonky to me.
Hopefully users are incentivized to upgrade their clients.


By "error handling to fallback", do you mean in my proposal or in a
poss

RE: Proposal for "fetch-any-blob Git protocol" and server design

2017-04-21 Thread Kevin David
Hi Jonathan,

Sorry for the delayed response - other work got in the way, unfortunately!

Kevin

-Original Message-
From: Jonathan Tan [mailto:jonathanta...@google.com] 
Sent: Thursday, April 13, 2017 4:13 PM

>> I know we're considering server behavior here, but how large do you
>> generally expect these blob-want requests to be? I ask because we took
>> an initial approach very similar to this; however, we had a hard time
>> being clever about figuring out what set of blobs to request for those
>> clients that didn't want the entire set, and ended up falling back to
>> single-blob requests.
>>
>> Obviously, this could be due to the nature of our
>> filesystem-virtualization-based client, but I also suspect that the
>> teams attacking this problem are more often than not dealing with very
>> large blob objects, so the cost of a round-trip becomes lower relative
>> to sending the object content itself.
>
> I am envisioning (1a) as described in Jeff Hostetler's e-mail [1] ("a 
> pre-command or hook to identify needed blobs and pre-fetch them before 
> allowing the actual command to start"), so a Git command would typically 
> make a single request that contains all the blobs required, but my 
> proposal can also handle (1c) ('"fault" them in as necessary in 
> read_object() while the command is running and without any pre-fetch 
> (either synchronously or asynchronously and with/without a helper 
> process)').
> 
> Even if we decided to go with single-blob requests and responses, it is 
> still important to send them as packfiles, so that the server can serve 
> them directly from its compressed storage without first having to 
> uncompress them.
> 
> [1] 
> https://public-inbox.org/git/1488999039-37631-1-git-send-email-...@jeffhostetler.com/

Ah, I missed this. If we think we can build meaningfully-sized requests via 
(1a) and (1b), 
then I agree - packs are optimal.

However, if single-blob requests/responses dominate, I have a few concerns, 
mostly from 
experience playing with something similar: 
* Regardless of the size of the object or number returned, the client will need 
to 
  `index-pack` the result and create a corresponding `.idx` file, which 
requires 
  decompression to construct (right?)

* Unless the client's repository is repacked aggressively, we'll pollute the 
  `.git\objects\pack` directory with little indexes (somewhere around 4KiB 
minimum) 
  and packfiles rapidly, which would degrade performance. One rudimentary
  workaround would be to loosen these packs on the client if they were under
  a certain size/object count. I think fetch does this already? (See the
  sketch just below.)
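
Here is a rough sketch of that "loosen small packs" idea, assuming a fixed
object-count threshold (analogous to git's transfer.unpackLimit) and a pack
that has been received as a byte string:

  import struct, subprocess

  UNPACK_LIMIT = 100  # assumed threshold, analogous to transfer.unpackLimit

  def store_pack(pack_bytes):
      # A packfile starts with "PACK", a 4-byte version and a 4-byte
      # big-endian object count.
      if pack_bytes[:4] != b"PACK":
          raise ValueError("not a packfile")
      (count,) = struct.unpack(">I", pack_bytes[8:12])
      if count < UNPACK_LIMIT:
          # Few objects: explode them into loose objects instead of
          # accumulating tiny pack/.idx pairs.
          cmd = ["git", "unpack-objects", "-q"]
      else:
          # Many objects: keep the pack and build its .idx.
          cmd = ["git", "index-pack", "--stdin"]
      subprocess.run(cmd, input=pack_bytes, check=True)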

In either case, shifting the pack decompression/loose object recompression 
problem 
to the client instead of the server is probably a good principle, but in our 
case it 
simply wasn't fast enough to serve single blobs interactively (e.g. opening a 
source 
file you don't have locally). I'm hopeful that the proposed partial clone 
solutions you 
referenced would reduce the frequency of this being required.

>> Being a bit more clever about packing objects (e.g. splitting blobs out
>> from commits and trees) improved this a bit, but we still hit a
>> bottleneck from what appeared to be a large number of memory-mapping
>> operations on a ~140GiB packfile of blobs.
>>
>> Each `pack-objects` process would consume approximately one CPU core for
>> the duration of the request. It's possible that further splitting of
>> these large blob packs would have improved performance in some
>> scenarios, but that would increase the number of pack-index lookups
>> necessary to find a single object.
>
> I'm not very experienced with mmap, but I thought that memory-mapping a
> large file in itself does not incur much of a performance penalty (if
> any) - it is the accesses that count. I experimented with mmap on 15,000
> MiB and 150,000 MiB files and they seemed to be handled quite well. Also,
> how many objects is "pack-objects" packing here?

Back when we took this approach, it was ~4000 blob objects at a time.
Perhaps we were being bitten by the Windows implementation of git_mmap [3]?
When I profiled the 4k blob scenario, the majority of CPU and wall time was
spent in MapViewOfFileEx, which looks like it could reflect the accesses as
well.

[3] https://github.com/git/git/blob/master/compat/win32mmap.c

>> This seems like a clever way to avoid the canonical
>> `/info/refs?service=git-upload-pack` capability negotiation on every
>> call. However, using error handling to fall back seems slightly wonky to
>> me. Hopefully users are incentivized to upgrade their clients.
>
> By "error handling to fallback", do you mean in my proposal or in a 
> possible future one (assuming my proposal is implemented)? I don't think 
> my proposal requires any error handling to fallback (since only new 
> clients can clone partially - old clients will just clone totally and 
> obliviously), but I acknowledge that this propos

Re: Proposal for "fetch-any-blob Git protocol" and server design

2017-04-13 Thread Jonathan Tan

On 04/12/2017 03:02 PM, Kevin David wrote:

Hi Jonathan,

I work on the network protocols for the GVFS project at Microsoft.
I shared a couple thoughts and questions below.


Thanks for your reply!


I know we're considering server behavior here, but how large do you
generally expect these blob-want requests to be? I ask because we took an
initial approach very similar to this; however, we had a hard time being
clever about figuring out what set of blobs to request for those clients
that didn't want the entire set, and ended up falling back to single-blob
requests.

Obviously, this could be due to the nature of our
filesystem-virtualization-based client, but I also suspect that the teams
attacking this problem are more often than not dealing with very large blob
objects, so the cost of a round-trip becomes lower relative to sending the
object content itself.


I am envisioning (1a) as described in Jeff Hostetler's e-mail [1] ("a 
pre-command or hook to identify needed blobs and pre-fetch them before 
allowing the actual command to start"), so a Git command would typically 
make a single request that contains all the blobs required, but my 
proposal can also handle (1c) ('"fault" them in as necessary in 
read_object() while the command is running and without any pre-fetch 
(either synchronously or asynchronously and with/without a helper 
process)').


Even if we decided to go with single-blob requests and responses, it is 
still important to send them as packfiles, so that the server can serve 
them directly from its compressed storage without first having to 
uncompress them.


[1] 
https://public-inbox.org/git/1488999039-37631-1-git-send-email-...@jeffhostetler.com/



Along the same lines as above, this is where we started and it worked well for
low-volume requests. However, when we started ramping up the load,
`pack-objects` operating on a very large packed repository (~150 GiB) became
very computationally expensive, even with `--compression=1 --depth=0 
--window=0`.

Being a bit more clever about packing objects (e.g. splitting blobs out
from commits and trees) improved this a bit, but we still hit a bottleneck
from what appeared to be a large number of memory-mapping operations on a
~140GiB packfile of blobs.

Each `pack-objects` process would consume approximately one CPU core for
the duration of the request. It's possible that further splitting of these
large blob packs would have improved performance in some scenarios, but
that would increase the number of pack-index lookups necessary to find a
single object.


I'm not very experienced with mmap, but I thought that memory-mapping a
large file in itself does not incur much of a performance penalty (if
any) - it is the accesses that count. I experimented with mmap on 15,000
MiB and 150,000 MiB files and they seemed to be handled quite well. Also,
how many objects is "pack-objects" packing here?
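
(As an aside, what I mean by mapping versus accesses is roughly the
following - a Python sketch that times the mmap() call separately from
actually touching pages; the number of pages touched is arbitrary.)

  import mmap, os, time

  def map_then_touch(path, touches=1000):
      fd = os.open(path, os.O_RDONLY)
      try:
          size = os.fstat(fd).st_size
          t0 = time.perf_counter()
          m = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
          t_map = time.perf_counter() - t0   # mapping itself is cheap

          t0 = time.perf_counter()
          step = max(size // touches, mmap.PAGESIZE)
          checksum = 0
          for off in range(0, size, step):
              checksum ^= m[off]             # faulting pages in is the cost
          t_touch = time.perf_counter() - t0
          m.close()
          return t_map, t_touch, checksum
      finally:
          os.close(fd)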



=== Endpoint support for forward compatibility

This "server" endpoint requires that the first line be understood, but
will ignore any other lines starting with words that it does not
understand. This allows new "commands" to be added (distinguished by
their first lines) and existing commands to be "upgraded" with backwards 
compatibility.


This seems like a clever way to avoid the canonical
`/info/refs?service=git-upload-pack` capability negotiation on every call.
However, using error handling to fall back seems slightly wonky to me.
Hopefully users are incentivized to upgrade their clients.


By "error handling to fallback", do you mean in my proposal or in a 
possible future one (assuming my proposal is implemented)? I don't think 
my proposal requires any error handling to fallback (since only new 
clients can clone partially - old clients will just clone totally and 
obliviously), but I acknowledge that this proposal does not mean that 
any future proposal can be done without requiring error handling to 
fallback.



=== Indication to use the proposed endpoint

The client will probably already record that at least one of its
remotes (the one that it successfully performed a "partial clone"
from) supports this new endpoint (if not, it can’t determine whether a
missing blob was caused by repo corruption or by the "partial clone").
This knowledge can be used both to know that the server supports
"fetch-blob-pack" and "fetch-commit-pack" (for the latter, the client
can fall back to "fetch-pack"/"upload-pack" when fetching from other servers).


This makes a lot of sense to me. When we built our caching proxy, we had to be 
careful
when designing how we'd handle clients requesting objects missing from the 
proxy.

For example, a client requests a single blob and the proxy doesn't have it - we 
can't simply
download that object from the "authoritative" remote and stick it in the 
`.git\objects\xx\yyy...`
directory, because the repository would be made corrupt.


By proxy, do you mean a Git repository? Sorry, I don't really understand 
this part.

RE: Proposal for "fetch-any-blob Git protocol" and server design

2017-04-12 Thread Kevin David
Hi Jonathan, 

I work on the network protocols for the GVFS project at Microsoft.
I shared a couple thoughts and questions below.

Thanks,

Kevin

-Original Message-
From: Stefan Beller [mailto:sbel...@google.com]
Sent: Tuesday, March 28, 2017 7:20 PM
To: Jonathan Tan ; Ben Peart 
Cc: git@vger.kernel.org; Mark Thomas ; Jeff Hostetler 

Subject: Re: Proposal for "fetch-any-blob Git protocol" and server design

+cc Ben Peart, who sent
"[RFC] Add support for downloading blobs on demand" to the list recently.
This proposal here seems like it has the same goal, so maybe your review could 
go a long way here?

Thanks,
Stefan

On Tue, Mar 14, 2017 at 3:57 PM, Jonathan Tan  wrote:
> As described in "Background" below, there have been at least 2 patch 
> sets to support "partial clones" and on-demand blob fetches, where the 
> server part that supports on-demand blob fetches was treated at least 
> in outline. Here is a proposal treating that server part in detail.
>
> == Background
>
> The desire for Git to support (i) missing blobs and (ii) fetching them 
> as needed from a remote repository has surfaced on the mailing list a 
> few times, most recently in the form of RFC patch sets [1] [2].
>
> A local repository that supports (i) will be created by a "partial 
> clone", that is, a clone with some special parameters (exact 
> parameters are still being discussed) that does not download all blobs 
> normally downloaded. Such a repository should support (ii), which is what 
> this proposal describes.
>
> == Design
>
> A new endpoint "server" is created. The client will send a message in 
> the following format:
>
> 
> fbp-request = PKT-LINE("fetch-blob-pack")
>   1*want
>   flush-pkt
> want = PKT-LINE("want" SP obj-id)
> 
>
> The client may send one or more SHA-1s for which it wants blobs, then 
> a flush-pkt.

I know we're considering server behavior here, but how large do you
generally expect these blob-want requests to be? I ask because we took an
initial approach very similar to this; however, we had a hard time being
clever about figuring out what set of blobs to request for those clients
that didn't want the entire set, and ended up falling back to single-blob
requests.

Obviously, this could be due to the nature of our
filesystem-virtualization-based client, but I also suspect that the teams
attacking this problem are more often than not dealing with very large blob
objects, so the cost of a round-trip becomes lower relative to sending the
object content itself.

>
> The server will then reply:
>
> 
> server-reply = flush-pkt | PKT-LINE("ERR" SP message)
> 
>
> If there was no error, the server will then send them in a packfile, 
> formatted as described in "Packfile Data" in pack-protocol.txt with 
> "side-band-64k" enabled.
> Any server that supports "partial clone" will also support this, and 
> the client will automatically assume this. (How a client discovers 
> "partial clone" is not covered by this proposal.)

Along the same lines as above, this is where we started and it worked well for
low-volume requests. However, when we started ramping up the load,
`pack-objects` operating on a very large packed repository (~150 GiB) became
very computationally expensive, even with `--compression=1 --depth=0 
--window=0`.

Being a bit more clever about packing objects (e.g. splitting blobs out
from commits and trees) improved this a bit, but we still hit a bottleneck
from what appeared to be a large number of memory-mapping operations on a
~140GiB packfile of blobs.

Each `pack-objects` process would consume approximately one CPU core for
the duration of the request. It's possible that further splitting of these
large blob packs would have improved performance in some scenarios, but
that would increase the number of pack-index lookups necessary to find a
single object.

>
> The server will perform reachability checks on requested blobs through 
> the equivalent of "git rev-list --use-bitmap-index" (like "git 
> upload-pack" when using the allowreachablesha1inwant option), unless 
> configured to suppress reachability checks through a config option.
> The server administrator is highly recommended to regularly regenerate 
> the bitmap (or suppress reachability checks).
>
> === Endpoint support for forward compatibility
>
> This "server" endpoint requires that the first line be understood, but 
> will ignore any other lines starting with words that it does not 
> understand. This allows new "commands" to be added (distinguished by 
> their first lines) and existing commands to be "upgraded" 

Re: Proposal for "fetch-any-blob Git protocol" and server design

2017-03-28 Thread Stefan Beller
+cc Ben Peart, who sent
"[RFC] Add support for downloading blobs on demand" to the list recently.
This proposal here seems like it has the same goal, so maybe your review
could go a long way here?

Thanks,
Stefan

On Tue, Mar 14, 2017 at 3:57 PM, Jonathan Tan  wrote:
> As described in "Background" below, there have been at least 2 patch sets to
> support "partial clones" and on-demand blob fetches, where the server part
> that supports on-demand blob fetches was treated at least in outline. Here
> is a proposal treating that server part in detail.
>
> == Background
>
> The desire for Git to support (i) missing blobs and (ii) fetching them as
> needed from a remote repository has surfaced on the mailing list a few
> times, most recently in the form of RFC patch sets [1] [2].
>
> A local repository that supports (i) will be created by a "partial clone",
> that is, a clone with some special parameters (exact parameters are still
> being discussed) that does not download all blobs normally downloaded. Such
> a repository should support (ii), which is what this proposal describes.
>
> == Design
>
> A new endpoint "server" is created. The client will send a message in the
> following format:
>
> 
> fbp-request = PKT-LINE("fetch-blob-pack")
>   1*want
>   flush-pkt
> want = PKT-LINE("want" SP obj-id)
> 
>
> The client may send one or more SHA-1s for which it wants blobs, then a
> flush-pkt.
>
> The server will then reply:
>
> 
> server-reply = flush-pkt | PKT-LINE("ERR" SP message)
> 
>
> If there was no error, the server will then send them in a packfile,
> formatted as described in "Packfile Data" in pack-protocol.txt with
> "side-band-64k" enabled.
>
> Any server that supports "partial clone" will also support this, and the
> client will automatically assume this. (How a client discovers "partial
> clone" is not covered by this proposal.)
>
> The server will perform reachability checks on requested blobs through the
> equivalent of "git rev-list --use-bitmap-index" (like "git upload-pack" when
> using the allowreachablesha1inwant option), unless configured to suppress
> reachability checks through a config option. The server administrator is
> highly recommended to regularly regenerate the bitmap (or suppress
> reachability checks).
>
> === Endpoint support for forward compatibility
>
> This "server" endpoint requires that the first line be understood, but will
> ignore any other lines starting with words that it does not understand. This
> allows new "commands" to be added (distinguished by their first lines) and
> existing commands to be "upgraded" with backwards compatibility.
>
> === Related improvements possible with new endpoint
>
> Previous protocol upgrade suggestions have had to face the difficulty of
> allowing updated clients to discover the server support while not slowing
> down (for example, through extra network round-trips) any client, whether
> non-updated or updated. The introduction of "partial clone" allows clients
> to rely on the guarantee that any server that supports "partial clone"
> supports "fetch-blob-pack", and we can extend the guarantee to other
> protocol upgrades that such repos would want.
>
> One such upgrade is "ref-in-want" [3]. The full details can be obtained from
> that email thread, but to summarize, the patch set eliminates the need for
> the initial ref advertisement and allows communication in ref name globs,
> making it much easier for multiple load-balanced servers to serve large
> repos to clients - this is something that would greatly benefit the Android
> project, for example, and possibly many others.
>
> Bundling support for "ref-in-want" with "fetch-blob-pack" simplifies matters
> for the client in that a client needs to only handle one "version" of server
> (a server that supports both). If "ref-in-want" were added later, instead of
> now, clients would need to be able to handle two "versions" (one with only
> "fetch-blob-pack" and one with both "fetch-blob-pack" and "ref-in-want").
>
> As for its implementation, that email thread already contains a patch set
> that makes it work with the existing "upload-pack" endpoint; I can update
> that patch set to use the proposed "server" endpoint (with a
> "fetch-commit-pack" message) if need be.
>
> == Client behavior
>
> This proposal is concerned with server behavior only, but it is useful to
> envision how the client would use this to ensure that the server behavior is
> useful.
>
> === Indication to use the proposed endpoint
>
> The client will probably already record that at least one of its remotes
> (the one that it successfully performed a "partial clone" from) supports
> this new endpoint (if not, it can’t determine whether a missing blob was
> caused by repo corruption or by the "partial clone"). This knowledge can be
> used both to know that the server supports "fetch-blob-pack" and
> "fetch-commit-pack" (for the latter, the client can fall back to
> "fetch-pack"/

Re: Proposal for "fetch-any-blob Git protocol" and server design

2017-03-16 Thread Jonathan Tan

On 03/16/2017 02:17 PM, Junio C Hamano wrote:

Yeah, the example was solely to see how the system was to be
extended, as one of the selling points of the proposal was:

> === Endpoint support for forward compatibility
>
> This "server" endpoint requires that the first line be understood, but
> will ignore any other lines starting with words that it does not
> understand. This allows new "commands" to be added (distinguished by
> their first lines) and existing commands to be "upgraded" with
> backwards compatibility.






And the lack of "capability negotiation" is substituted by "assume
the better server, fall back to a lower common denominator by detecting
errors"?


Yes. I probably should have mentioned that this "forward compatibility" 
is limited - it does not include any potential new feature intended to 
reduce the size of the request. I was thinking more of being able to 
extend this to, for example, add "hint" lines indicating that certain 
blobs come from certain commits, or add "have" lines to present blobs 
that might be good delta bases (and even if the server doesn't understand 
these lines, its output is still correct).
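
In other words, the server-side parsing would be roughly the following (a
sketch, not the actual implementation):

  def parse_fbp_request(lines):
      # lines: decoded pkt-line payloads, up to (not including) the flush-pkt.
      if not lines or lines[0] != "fetch-blob-pack":
          raise ValueError("ERR unrecognized command")
      wants = []
      for line in lines[1:]:
          keyword, _, arg = line.partition(" ")
          if keyword == "want":
              wants.append(arg)
          # Unknown keywords (e.g. a future "hint" or "have") are ignored,
          # which is what makes backwards-compatible upgrades possible.
      if not wants:
          raise ValueError("ERR no want lines")
      return wants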


Re: Proposal for "fetch-any-blob Git protocol" and server design

2017-03-16 Thread Junio C Hamano
Jonathan Tan  writes:

> On 03/15/2017 10:59 AM, Junio C Hamano wrote:
>> ...
>> but I am wondering how you would extend the proposed system to do
>> so.  Would you add "fetch-size-limited-blob-in-tree-pack" that runs
>> parallel to "fetch-blob-pack" request?  Would you add a new type of
>> request packet "want-blob-with-expression" for fbp-request, which is
>> protected by some "protocol capability" exchange?
>>
>> If the former, how does a client discover if a particular server
>> already supports the new "fetch-size-limited-blob-in-tree-pack"
>> request, so that it does not have to send a bunch of "want" request
>> by enumerating the blobs itself?  If the latter, how does a client
>> discover if a particular server's "fetch-blob-pack" already supports
>> the new "want-blob-with-expression" request packet?
>
> I'm not sure if that use case is something we need to worry about (if
> you're downloading x * 10MB, uploading x * 50B shouldn't be a problem,
> I think), but if we want to handle that use case in the future, I
> agree that extending this system would be difficult.

Yeah, the example was solely to see how the system was to be
extended, as one of the selling points of the proposal was:

> === Endpoint support for forward compatibility
>
> This "server" endpoint requires that the first line be understood, but
> will ignore any other lines starting with words that it does not
> understand. This allows new "commands" to be added (distinguished by
> their first lines) and existing commands to be "upgraded" with
> backwards compatibility.

> The best way I can think of right now is for the client to send a
> fetch-blob-pack request with no "want" lines and at least one
> "want-tree" line, ...

So it is not by adding a new type of "request" that sits next to the
"fetch-blob-pack" request, but by adding a new way to drive that
existing "fetch-blob-pack" request.

> and then if there is an error (which will happen if
> the server is old, and therefore sees that there is not at least one
> "want" line), to retry with the "want" lines. This allows us to add
> alternative ways of specifying blobs later (if we want to), but also
> means that upgrading a client without upgrading the corresponding
> server incurs a round-trip penalty.

And the lack of "capability negotiation" is substituted by "assume
the better server, fall back to a lower common denominator by detecting
errors"?

> Alternatively we could add rudimentary support for trees now and add
> filter-by-size later ...

I am not particularly interested in "blobs in this tree" request.
It was merely an example to make it easier to discuss the main
point, which is the bigger picture that the proposal was painting
around "forward compatibility".

Thanks.


Re: Proposal for "fetch-any-blob Git protocol" and server design

2017-03-16 Thread Jonathan Tan

On 03/15/2017 10:59 AM, Junio C Hamano wrote:

By "SHA-1s for which it wants blobs", you mean that "want" only
allows one exact blob object name?  I think it is necessary to
support that mode of operation as a base case, and it is a good
starting point.

When you know

 - you have a "partial" clone that initially asked to contain only
   blobs that are smaller than 10MB, and

 - you are now trying to do a "git checkout v1.0 -- this/directory"
   so that the directory is fully populated

instead of enumerating all the missing blobs from the output of
"ls-tree -r v1.0 this/directory" on separate "want" requests, you
may want to say "I want all the blobs that are not smaller than 10MB
in this tree object $(git rev-parse v1.0:this/directory)".

I am not saying that you should add something like this right away,
but I am wondering how you would extend the proposed system to do
so.  Would you add "fetch-size-limited-blob-in-tree-pack" that runs
parallel to "fetch-blob-pack" request?  Would you add a new type of
request packet "want-blob-with-expression" for fbp-request, which is
protected by some "protocol capability" exchange?

If the former, how does a client discover if a particular server
already supports the new "fetch-size-limited-blob-in-tree-pack"
request, so that it does not have to send a bunch of "want" request
by enumerating the blobs itself?  If the latter, how does a client
discover if a particular server's "fetch-blob-pack" already supports
the new "want-blob-with-expression" request packet?


I'm not sure if that use case is something we need to worry about (if 
you're downloading x * 10MB, uploading x * 50B shouldn't be a problem, I 
think), but if we want to handle that use case in the future, I agree 
that extending this system would be difficult.


The best way I can think of right now is for the client to send a 
fetch-blob-pack request with no "want" lines and at least one 
"want-tree" line, and then if there is an error (which will happen if 
the server is old, and therefore sees that there is not at least one 
"want" line), to retry with the "want" lines. This allows us to add 
alternative ways of specifying blobs later (if we want to), but also 
means that upgrading a client without upgrading the corresponding server 
incurs a round-trip penalty.
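
For illustration, the two request shapes could be built like this (a
sketch; "want-tree" is the hypothetical new line, and pkt-line framing is a
four-hex-digit length that includes itself, with "0000" as the flush-pkt):

  def pkt_line(s):
      data = s.encode() + b"\n"
      return ("%04x" % (len(data) + 4)).encode() + data

  def fbp_request(want_blobs=(), want_trees=()):
      out = [pkt_line("fetch-blob-pack")]
      out += [pkt_line("want " + oid) for oid in want_blobs]
      out += [pkt_line("want-tree " + oid) for oid in want_trees]  # hypothetical
      out.append(b"0000")                                          # flush-pkt
      return b"".join(out)

  # New client, new server: fbp_request(want_trees=[tree_oid])
  # After an ERR from an old server, retry: fbp_request(want_blobs=blob_oids)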


Alternatively we could add rudimentary support for trees now and add 
filter-by-size later (so that such requests made to old servers will 
download extra blobs, but at least it works), but it still doesn't solve 
the general problem of specifying blobs by some rule other than their own 
SHA-1s or their trees' SHA-1s.




Re: Proposal for "fetch-any-blob Git protocol" and server design

2017-03-15 Thread Junio C Hamano
Jonathan Tan  writes:

> == Design
>
> A new endpoint "server" is created. The client will send a message in
> the following format:
>
> 
> fbp-request = PKT-LINE("fetch-blob-pack")
>   1*want
>   flush-pkt
> want = PKT-LINE("want" SP obj-id)
> 
>
> The client may send one or more SHA-1s for which it wants blobs, then
> a flush-pkt.

By "SHA-1s for which it wants blobs", you mean that "want" only
allows one exact blob object name?  I think it is necessary to
support that mode of operation as a base case, and it is a good
starting point.

When you know

 - you have a "partial" clone that initially asked to contain only
   blobs that are smaller than 10MB, and

 - you are now trying to do a "git checkout v1.0 -- this/directory"
   so that the directory is fully populated

instead of enumerating all the missing blobs from the output of
"ls-tree -r v1.0 this/directory" on separate "want" requests, you
may want to say "I want all the blobs that are not smaller than 10MB
in this tree object $(git rev-parse v1.0:this/directory)".
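
(For concreteness, that client-side enumeration is roughly the sketch
below - one "want" per blob OID, with the filtering down to the blobs that
are actually missing omitted - which is exactly the chatter a tree-level
request would avoid.)

  import subprocess

  def blob_oids_under(treeish):
      # ls-tree -r prints "<mode> SP <type> SP <oid> TAB <path>" per entry.
      out = subprocess.run(["git", "ls-tree", "-r", treeish],
                           capture_output=True, text=True, check=True).stdout
      oids = []
      for line in out.splitlines():
          mode, otype, oid = line.split("\t", 1)[0].split()
          if otype == "blob":
              oids.append(oid)
      return oids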

I am not saying that you should add something like this right away,
but I am wondering how you would extend the proposed system to do
so.  Would you add "fetch-size-limited-blob-in-tree-pack" that runs
parallel to "fetch-blob-pack" request?  Would you add a new type of
request packet "want-blob-with-expression" for fbp-request, which is
protected by some "protocol capability" exchange?

If the former, how does a client discover if a particular server
already supports the new "fetch-size-limited-blob-in-tree-pack"
request, so that it does not have to send a bunch of "want" request
by enumerating the blobs itself?  If the latter, how does a client
discover if a particular server's "fetch-blob-pack" already supports
the new "want-blob-with-expression" request packet?

> === Endpoint support for forward compatibility
>
> This "server" endpoint requires that the first line be understood, but
> will ignore any other lines starting with words that it does not
> understand. This allows new "commands" to be added (distinguished by
> their first lines) and existing commands to be "upgraded" with
> backwards compatibility.