Re: [WIP RFC 2/5] Documentation: add Packfile URIs design doc

2018-12-06 Thread Jonathan Tan
> > +This feature allows servers to serve part of their packfile response as 
> > URIs.
> > +This allows server designs that improve scalability in bandwidth and CPU 
> > usage
> > +(for example, by serving some data through a CDN), and (in the future) 
> > provides
> > +some measure of resumability to clients.
> 
> Without reading the remainder, this makes readers anticipate a few
> good things ;-)
> 
>  - "part of", so pre-generated constant material can be given from
>CDN and then followed-up by "filling the gaps" small packfile,
>perhaps?

Yes :-)

>  - The "part of" transmission may not bring the repository up to
>date wrt to the "want" objects; would this feature involve "you
>asked history up to these commits, but with this pack-uri, you'll
>be getting history up to these (somewhat stale) commits"?

It could be, but not necessarily. In my current WIP implementation, for
example, pack URIs don't give you any commits at all (and thus, no
history) - only blobs. Quite a few people first think of the "stale
clone then top-up" case, though - I wonder if it would be a good idea to
give the blob example in this paragraph in order to put people in the
right frame of mind.

> > +If the client replies with the following arguments:
> > +
> > + * packfile-uris
> > + * thin-pack
> > + * ofs-delta
> 
> "with the following" meaning "with all of the following", or "with
> any of the following"?  Is there a reason why the server side must
> require that the client understands and is willing to accept a
> thin-pack when wanting to use packfile-uris?  The same question for
> the ofs-delta.

"All of the following", but from your later comments, we probably don't
need this section anyway.

> My recommendation is to drop the mention of "thin" and "ofs" from
> the above list, and also from the following paragraph.  The "it MAY
> send" will serve as a practical escape clause to allow a server/CDN
> implementation that *ALWAYS* prepares pregenerated material that can
> only be digested by clients that supports thin and ofs.  Such a server
> can send packfile-URIs only when all of the three are given by the
> client and be compliant.  And such an update to the proposed document
> would allow a more diskful server to prepare both thin and non-thin
> pregenerated packs and choose which one to give to the client depending
> on the capability.

That is true - we can just let the server decide. I'll update the patch
accordingly, and state that the URIs should point to packs with features
like thin-pack and ofs-delta only if the client has declared that it
supports them.

> > +Clients then should understand that the returned packfile could be 
> > incomplete,
> > +and that it needs to download all the given URIs before the fetch or clone 
> > is
> > +complete. Each URI should point to a Git packfile (which may be a thin 
> > pack and
> > +which may contain offset deltas).
> 
> weaken or remove the (parenthetical comment) in the last sentence,
> and replace the beginning of the section with something like
> 
>   If the client replies with 'packfile-uris', when the server
>   sends the packfile, it MAY send a `packfile-uris` section...
> 
> You may steal what I wrote in the above response to help the
> server-side folks to decide how to actually implement the "it MAY
> send a packfile-uris" part in the document.

OK, will do.

> OK, this comes back to what I alluded to at the beginning.  We could
> respond to a full-clone request by feeding a series of packfile-uris
> and some ref information, perhaps like this:
> 
>   * Grab this packfile and update your remote-tracking refs
>   and tags to these values; you'd be as if you cloned the
>   project when it was at v1.0.
> 
>   * When you are done with the above, grab this packfile and
>   update your remote-tracking refs and tags to these values;
>   you'd be as if you cloned the project when it was at v2.0.
> 
>   * When you are done with the above, grab this packfile and
>   update your remote-tracking refs and tags to these values;
>   you'd be as if you cloned the project when it was at v3.0.
> 
>   ...
> 
>   * When you are done with the above, here is the remaining
>   packdata to bring you fully up to date with your original
>   "want"s.
> 
> and before fully reading the proposal, I anticipated that it was
> what you were going to describe.  The major difference is "up to the
> packdata given to you so far, you'd be as if you fetched these" ref
> information, which would allow you to be interrupted and then simply
> resume, without having to remember the set of packfile-uris yet to
> be processed across a fetch/clone failure.  If you sucessfully fetch
> packfile for ..v1.0, you can update the remote-tracking refs to
> match as if you fetched back when that was the most recent state of
> the project, and then if you failed while transferring packfile for
> v1.0..v2.0, 

Re: [WIP RFC 2/5] Documentation: add Packfile URIs design doc

2018-12-04 Thread Junio C Hamano
Junio C Hamano  writes:

> So, this is OK, but
>
>> +Clients then should understand that the returned packfile could be 
>> incomplete,
>> +and that it needs to download all the given URIs before the fetch or clone 
>> is
>> +complete. Each URI should point to a Git packfile (which may be a thin pack 
>> and
>> +which may contain offset deltas).
>
> weaken or remove the (parenthetical comment) in the last sentence,
> and replace the beginning of the section with something like
>
>   If the client replies with 'packfile-uris', when the server
>   sends the packfile, it MAY send a `packfile-uris` section...
>
> You may steal what I wrote in the above response to help the
> server-side folks to decide how to actually implement the "it MAY
> send a packfile-uris" part in the document.

By the way, I do agree with the practical consideration the design
you described makes.  For a pregenerated pack that brings you from
v1.0 to v2.0, "thin" would roughly save the transfer of one full
checkout (compressed, of course), and "ofs" would also save several
bytes per object.  Compared to a single pack that delivers everything
the fetcher wants, concatenation of packs without "thin" to transfer
the same set of objects would cost quite a lot more.

And I do not think we should care too much about fetchers that lack
either thin or ofs, so I'd imagine that any client that ask for
packfile-uris would also advertise thin and ofs as well, so in
practice, a request with packfile-uris that lack thin or ofs would
be pretty rare and requiring all three and requiring only one would
not make much practical difference.  It's just that I think singling
out these two capabilities as hard requirements at the protocol
level is wrong.


Re: [WIP RFC 2/5] Documentation: add Packfile URIs design doc

2018-12-04 Thread Junio C Hamano
Jonathan Tan  writes:

> +This feature allows servers to serve part of their packfile response as URIs.
> +This allows server designs that improve scalability in bandwidth and CPU 
> usage
> +(for example, by serving some data through a CDN), and (in the future) 
> provides
> +some measure of resumability to clients.

Without reading the remainder, this makes readers anticipate a few
good things ;-)

 - "part of", so pre-generated constant material can be given from
   CDN and then followed-up by "filling the gaps" small packfile,
   perhaps?

 - The "part of" transmission may not bring the repository up to
   date wrt to the "want" objects; would this feature involve "you
   asked history up to these commits, but with this pack-uri, you'll
   be getting history up to these (somewhat stale) commits"?

Anyway, let's read on.

> +This feature is available only in protocol version 2.
> +
> +Protocol
> +
> +
> +The server advertises `packfile-uris`.
> +
> +If the client replies with the following arguments:
> +
> + * packfile-uris
> + * thin-pack
> + * ofs-delta

"with the following" meaning "with all of the following", or "with
any of the following"?  Is there a reason why the server side must
require that the client understands and is willing to accept a
thin-pack when wanting to use packfile-uris?  The same question for
the ofs-delta.

When the pregenerated constant material the server plans to hand out
the uris for was prepared by using ofs-delta encoding, the server
cannot give the uri to it when the client does not want ofs-delta
encoded packfile, but it feels somewhat strange that we require the
most capable client at the protocol level.  After all, the server
side could prepare one with ofs-delta and another without ofs-delta
and depending on what the client is capable of, hand out different
URIs, if it wanted to.

The reason why I care is because thin and ofs will *NOT* stay
forever be the only optional features of the pack format.  We may
invent yet another such optional 'frotz' feature, which may greatly
help the efficiency of the packfile encoding, hence it may be
preferrable to always generate a CDN packfile with that feature, in
addition to thin and ofs.  Would we add 'frotz' to the above list in
the documentation, then?  What would happen to existing servers and
clients written before that time then?

My recommendation is to drop the mention of "thin" and "ofs" from
the above list, and also from the following paragraph.  The "it MAY
send" will serve as a practical escape clause to allow a server/CDN
implementation that *ALWAYS* prepares pregenerated material that can
only be digested by clients that supports thin and ofs.  Such a server
can send packfile-URIs only when all of the three are given by the
client and be compliant.  And such an update to the proposed document
would allow a more diskful server to prepare both thin and non-thin
pregenerated packs and choose which one to give to the client depending
on the capability.

> +when the server sends the packfile, it MAY send a `packfile-uris` section
> +directly before the `packfile` section (right after `wanted-refs` if it is
> +sent) containing HTTP(S) URIs. See protocol-v2.txt for the documentation of
> +this section.

So, this is OK, but

> +Clients then should understand that the returned packfile could be 
> incomplete,
> +and that it needs to download all the given URIs before the fetch or clone is
> +complete. Each URI should point to a Git packfile (which may be a thin pack 
> and
> +which may contain offset deltas).

weaken or remove the (parenthetical comment) in the last sentence,
and replace the beginning of the section with something like

If the client replies with 'packfile-uris', when the server
sends the packfile, it MAY send a `packfile-uris` section...

You may steal what I wrote in the above response to help the
server-side folks to decide how to actually implement the "it MAY
send a packfile-uris" part in the document.

> +Server design
> +-
> +
> +The server can be trivially made compatible with the proposed protocol by
> +having it advertise `packfile-uris`, tolerating the client sending
> +`packfile-uris`, and never sending any `packfile-uris` section. But we should
> +include some sort of non-trivial implementation in the Minimum Viable 
> Product,
> +at least so that we can test the client.
> +
> +This is the implementation: a feature, marked experimental, that allows the
> +server to be configured by one or more `uploadpack.blobPackfileUri=
> +` entries. Whenever the list of objects to be sent is assembled, a blob
> +with the given sha1 can be replaced by the given URI. This allows, for 
> example,
> +servers to delegate serving of large blobs to CDNs.

;-)

> +Client design
> +-
> +
> +While fetching, the client needs to remember the list of URIs and cannot
> +declare that the fetch is complete until all URIs have been downloaded as
> +packfiles.
> +
> +The division 

Re: [WIP RFC 2/5] Documentation: add Packfile URIs design doc

2018-12-04 Thread Jonathan Tan
> Some thoughts here:
> 
> First, I'd like to see a section (and a bit in the implementation)
> requiring HTTPS if the original protocol is secure (SSH or HTTPS).
> Allowing the server to downgrade to HTTP, even by accident, would be a
> security problem.
> 
> Second, this feature likely should be opt-in for SSH. One issue I've
> seen repeatedly is that people don't want to use HTTPS to fetch things
> when they're using SSH for Git. Many people in corporate environments
> have proxies that break HTTP for non-browser use cases[0], and using SSH
> is the only way that they can make a functional Git connection.

Good points about SSH support and the client needing to control which
protocols the server will send URIs for. I'll include a line in the
client request in which the client can specify which protocols it is OK
with.

> Third, I think the server needs to be required to both support Range
> headers and never change the content of a URI, so that we can have
> resumable clone implicit in this design. There are some places in the
> world where connections are poor and fetching even the initial packfile
> at once might be a problem. (I've seen such questions on Stack
> Overflow, for example.)

Good points. I'll add these in the next revision.

> Having said that, I think overall this is a good idea and I'm glad to
> see a proposal for it.

Thanks, and thanks for your comments too.


Re: [WIP RFC 2/5] Documentation: add Packfile URIs design doc

2018-12-03 Thread brian m. carlson
On Mon, Dec 03, 2018 at 03:37:35PM -0800, Jonathan Tan wrote:
> Signed-off-by: Jonathan Tan 
> ---
>  Documentation/technical/packfile-uri.txt | 83 
>  Documentation/technical/protocol-v2.txt  |  6 +-
>  2 files changed, 88 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/technical/packfile-uri.txt
> 
> diff --git a/Documentation/technical/packfile-uri.txt 
> b/Documentation/technical/packfile-uri.txt
> new file mode 100644
> index 00..6535801486
> --- /dev/null
> +++ b/Documentation/technical/packfile-uri.txt
> @@ -0,0 +1,83 @@
> +Packfile URIs
> +=
> +
> +This feature allows servers to serve part of their packfile response as URIs.
> +This allows server designs that improve scalability in bandwidth and CPU 
> usage
> +(for example, by serving some data through a CDN), and (in the future) 
> provides
> +some measure of resumability to clients.
> +
> +This feature is available only in protocol version 2.
> +
> +Protocol
> +
> +
> +The server advertises `packfile-uris`.
> +
> +If the client replies with the following arguments:
> +
> + * packfile-uris
> + * thin-pack
> + * ofs-delta
> +
> +when the server sends the packfile, it MAY send a `packfile-uris` section
> +directly before the `packfile` section (right after `wanted-refs` if it is
> +sent) containing HTTP(S) URIs. See protocol-v2.txt for the documentation of
> +this section.
> +
> +Clients then should understand that the returned packfile could be 
> incomplete,
> +and that it needs to download all the given URIs before the fetch or clone is
> +complete. Each URI should point to a Git packfile (which may be a thin pack 
> and
> +which may contain offset deltas).


Some thoughts here:

First, I'd like to see a section (and a bit in the implementation)
requiring HTTPS if the original protocol is secure (SSH or HTTPS).
Allowing the server to downgrade to HTTP, even by accident, would be a
security problem.

Second, this feature likely should be opt-in for SSH. One issue I've
seen repeatedly is that people don't want to use HTTPS to fetch things
when they're using SSH for Git. Many people in corporate environments
have proxies that break HTTP for non-browser use cases[0], and using SSH
is the only way that they can make a functional Git connection.

Third, I think the server needs to be required to both support Range
headers and never change the content of a URI, so that we can have
resumable clone implicit in this design. There are some places in the
world where connections are poor and fetching even the initial packfile
at once might be a problem. (I've seen such questions on Stack
Overflow, for example.)

Having said that, I think overall this is a good idea and I'm glad to
see a proposal for it.

[0] For example, a naughty-word filter may corrupt or block certain byte
sequences that occur incidentally in the pack stream.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204


signature.asc
Description: PGP signature


Re: [WIP RFC 2/5] Documentation: add Packfile URIs design doc

2018-12-03 Thread Stefan Beller
Thanks for bringing this design to the list!

> diff --git a/Documentation/technical/protocol-v2.txt 
> b/Documentation/technical/protocol-v2.txt
> index 345c00e08c..2cb1c41742 100644
> --- a/Documentation/technical/protocol-v2.txt
> +++ b/Documentation/technical/protocol-v2.txt
> @@ -313,7 +313,8 @@ header. Most sections are sent only when the packfile is 
> sent.
>
>  output = acknowledgements flush-pkt |
>  [acknowledgments delim-pkt] [shallow-info delim-pkt]
> -[wanted-refs delim-pkt] packfile flush-pkt
> +[wanted-refs delim-pkt] [packfile-uris delim-pkt]
> +packfile flush-pkt

While this is an RFC and incomplete, we'd need to remember to
add packfile-uris to the capabilities list above, stating that it requires
thin-pack and ofs-delta to be sent, and what to expect from it.

The mention of  --no-packfile-urls in the Client design above
seems to imply we'd want to turn it on by default, which I thought
was not the usual stance how we introduce new things.

An odd way of disabling it would be --no-thin-pack, hoping the
client side implementation abides by the implied requirements.

>  acknowledgments = PKT-LINE("acknowledgments" LF)
>   (nak | *ack)
> @@ -331,6 +332,9 @@ header. Most sections are sent only when the packfile is 
> sent.
>   *PKT-LINE(wanted-ref LF)
>  wanted-ref = obj-id SP refname
>
> +packfile-uris = PKT-LINE("packfile-uris" LF) *packfile-uri
> +packfile-uri = PKT-LINE("uri" SP *%x20-ff LF)

Is the *%x20-ff a fancy way of saying obj-id?

While the server is configured with pairs of (oid URL),
we would not need to send the exact oid to the client
as that is what the client can figure out on its own by reading
the downloaded pack.

Instead we could send an integrity hash (i.e. the packfile
downloaded from "uri" is expected to hash to $oid here)

Thanks,
Stefan


[WIP RFC 2/5] Documentation: add Packfile URIs design doc

2018-12-03 Thread Jonathan Tan
Signed-off-by: Jonathan Tan 
---
 Documentation/technical/packfile-uri.txt | 83 
 Documentation/technical/protocol-v2.txt  |  6 +-
 2 files changed, 88 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/technical/packfile-uri.txt

diff --git a/Documentation/technical/packfile-uri.txt 
b/Documentation/technical/packfile-uri.txt
new file mode 100644
index 00..6535801486
--- /dev/null
+++ b/Documentation/technical/packfile-uri.txt
@@ -0,0 +1,83 @@
+Packfile URIs
+=
+
+This feature allows servers to serve part of their packfile response as URIs.
+This allows server designs that improve scalability in bandwidth and CPU usage
+(for example, by serving some data through a CDN), and (in the future) provides
+some measure of resumability to clients.
+
+This feature is available only in protocol version 2.
+
+Protocol
+
+
+The server advertises `packfile-uris`.
+
+If the client replies with the following arguments:
+
+ * packfile-uris
+ * thin-pack
+ * ofs-delta
+
+when the server sends the packfile, it MAY send a `packfile-uris` section
+directly before the `packfile` section (right after `wanted-refs` if it is
+sent) containing HTTP(S) URIs. See protocol-v2.txt for the documentation of
+this section.
+
+Clients then should understand that the returned packfile could be incomplete,
+and that it needs to download all the given URIs before the fetch or clone is
+complete. Each URI should point to a Git packfile (which may be a thin pack and
+which may contain offset deltas).
+
+Server design
+-
+
+The server can be trivially made compatible with the proposed protocol by
+having it advertise `packfile-uris`, tolerating the client sending
+`packfile-uris`, and never sending any `packfile-uris` section. But we should
+include some sort of non-trivial implementation in the Minimum Viable Product,
+at least so that we can test the client.
+
+This is the implementation: a feature, marked experimental, that allows the
+server to be configured by one or more `uploadpack.blobPackfileUri=
+` entries. Whenever the list of objects to be sent is assembled, a blob
+with the given sha1 can be replaced by the given URI. This allows, for example,
+servers to delegate serving of large blobs to CDNs.
+
+Client design
+-
+
+While fetching, the client needs to remember the list of URIs and cannot
+declare that the fetch is complete until all URIs have been downloaded as
+packfiles.
+
+The division of work (initial fetch + additional URIs) introduces convenient
+points for resumption of an interrupted clone - such resumption can be done
+after the Minimum Viable Product (see "Future work").
+
+The client can inhibit this feature (i.e. refrain from sending the
+`packfile-urls` parameter) by passing --no-packfile-urls to `git fetch`.
+
+Future work
+---
+
+The protocol design allows some evolution of the server and client without any
+need for protocol changes, so only a small-scoped design is included here to
+form the MVP. For example, the following can be done:
+
+ * On the server, a long-running process that takes in entire requests and
+   outputs a list of URIs and the corresponding inclusion and exclusion sets of
+   objects. This allows, e.g., signed URIs to be used and packfiles for common
+   requests to be cached.
+ * On the client, resumption of clone. If a clone is interrupted, information
+   could be recorded in the repository's config and a "clone-resume" command
+   can resume the clone in progress. (Resumption of subsequent fetches is more
+   difficult because that must deal with the user wanting to use the repository
+   even after the fetch was interrupted.)
+
+There are some possible features that will require a change in protocol:
+
+ * Additional HTTP headers (e.g. authentication)
+ * Byte range support
+ * Different file formats referenced by URIs (e.g. raw object)
+
diff --git a/Documentation/technical/protocol-v2.txt 
b/Documentation/technical/protocol-v2.txt
index 345c00e08c..2cb1c41742 100644
--- a/Documentation/technical/protocol-v2.txt
+++ b/Documentation/technical/protocol-v2.txt
@@ -313,7 +313,8 @@ header. Most sections are sent only when the packfile is 
sent.
 
 output = acknowledgements flush-pkt |
 [acknowledgments delim-pkt] [shallow-info delim-pkt]
-[wanted-refs delim-pkt] packfile flush-pkt
+[wanted-refs delim-pkt] [packfile-uris delim-pkt]
+packfile flush-pkt
 
 acknowledgments = PKT-LINE("acknowledgments" LF)
  (nak | *ack)
@@ -331,6 +332,9 @@ header. Most sections are sent only when the packfile is 
sent.
  *PKT-LINE(wanted-ref LF)
 wanted-ref = obj-id SP refname
 
+packfile-uris = PKT-LINE("packfile-uris" LF) *packfile-uri
+packfile-uri = PKT-LINE("uri" SP *%x20-ff LF)
+
 packfile = PKT-LINE("packfile" LF)
   *PKT-LINE(%x01-03 *%x00-ff)
 
-- 
2.19.0.271.gfe8321ec05.dirty