Re: Proposed approaches to supporting HTTP remotes in "git archive"

2018-07-29 Thread René Scharfe
Am 28.07.2018 um 00:32 schrieb Junio C Hamano:
> Josh Steadmon  writes:
> 
>> # Supporting HTTP remotes in "git archive"
>>
>> We would like to allow remote archiving from HTTP servers. There are a
>> few possible implementations to be discussed:
>>
>> ## Shallow clone to temporary repo
>>
>> This approach builds on existing endpoints. Clients will connect to the
>> remote server's git-upload-pack service and fetch a shallow clone of the
>> requested commit into a temporary local repo. The write_archive()
>> function is then called on the local clone to write out the requested
>> archive.

A prototype would require just a few lines of shell script, I guess.
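Such a prototype might look roughly like this (a sketch only; the helper name is mine, and it assumes the requested ref is a branch or tag name that the server allows to be shallow-cloned):

```shell
# Sketch of the shallow-clone-then-archive approach: fetch a one-commit
# shallow clone into a throwaway bare repository, then run the normal
# local "git archive" on it.  Helper name and output format are
# illustrative, not a proposed interface.
archive_via_shallow_clone() {
	remote=$1 ref=$2 out=$3
	tmp=$(mktemp -d) || return 1
	git clone -q --depth=1 --bare --branch "$ref" "$remote" "$tmp/clone.git" &&
		git -C "$tmp/clone.git" archive --format=tar.gz "$ref" >"$out"
	status=$?
	rm -rf "$tmp"
	return $status
}
```

Usage would be something like `archive_via_shallow_clone https://example.com/repo.git master repo.tar.gz` (URL illustrative).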

A downside that was only stated implicitly: This method needs temporary
disk space for the clone, while the existing archive modes only ever
write out the resulting file.  I guess the required space is in the same
order as the compressed archive.  This shouldn't be a problem if we
assume the user would eventually want to extract its contents, right?

>> ## Summary
>>
>> Personally, I lean towards the first approach. It could give us an
>> opportunity to remove server-side complexity; there is no reason that
>> the shallow-clone approach must be restricted to the HTTP transport, and
>> we could re-implement other transports using this method.  Additionally,
>> it would allow clients to pull archives from remotes that would not
>> otherwise support it.
> 
> I consider the first one (i.e. make a shallow clone and tar it up
> locally) a hack that does *not* belong to "git archive --remote"
> command, especially when it is only done to "http remotes".  The
> only reason HTTP remotes are special is because there is no ready
> "http-backend" equivalent that passes the "git archive" traffic over
> smart-http transport, unlike the one that exists for "git
> upload-pack".
> 
> It however still _is_ attractive to drive such a hack from "git
> archive" at the UI level, as the end users do not care how ugly the
> hack is ;-)  As you mentioned, the approach would work for any
> transport that allows one-commit shallow clone, so it might become
> more palatable if it is designed as a different _mode_ of operation
> of "git archive" that is orthogonal to the underlying transport,
> i.e.
> 
>   $ git archive --remote= --shallow-clone-then-local-archive-hack master
> 
> or
> 
>   $ git config archive..useShallowCloneThenLocalArchiveHack true
>   $ git archive --remote= master

Archive-via-clone would also work with full clones (if shallow ones are
not available), but that would be wasteful and a bit cruel, of course.

Anyway, I think we should find a better (shorter) name for that option;
that could turn out to be the hardest part. :)

> It might turn out that it may work better than the native "git
> archive" access against servers that offer both shallow clone
> and native archive access.  I doubt a single-commit shallow clone
> would benefit from reusing of premade deltas and compressed bases
> streamed straight out of packfiles from the server side that much,
> but you'd never know until you measure ;-)

It could benefit from GIT_ALTERNATE_OBJECT_DIRECTORIES, but I guess
typical users of git archive --remote won't have any good ones lying
around.
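For reference, the alternates mechanism works like this: objects stored in one repository become readable from another without being copied. A minimal illustration (the helper name and the repository layout are mine):

```shell
# Read an object that physically lives in a "donor" repository from a
# different repository, by pointing GIT_ALTERNATE_OBJECT_DIRECTORIES at
# the donor's object store.  Helper name is illustrative.
cat_borrowed_object() {
	donor=$1 repo=$2 oid=$3
	GIT_ALTERNATE_OBJECT_DIRECTORIES="$donor/.git/objects" \
		git -C "$repo" cat-file -t "$oid"
}
```

A temporary clone with such an alternate configured would not need to re-download objects it can already reach locally.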

René


Re: Proposed approaches to supporting HTTP remotes in "git archive"

2018-07-28 Thread brian m. carlson
On Fri, Jul 27, 2018 at 02:47:00PM -0700, Josh Steadmon wrote:
> ## Use git-upload-archive
> 
> This approach requires adding support for the git-upload-archive
> endpoint to the HTTP backend. Clients will connect to the remote
> server's git-upload-archive service and the server will generate the
> archive which is then delivered to the client.
> 
> ### Benefits
> 
> * Matches existing "git archive" behavior for other remotes.
> 
> * Requires less bandwidth to send a compressed archive than a shallow
>   clone.
> 
> * Resulting archive does not depend in any way on the client
>   implementation.
> 
> ### Drawbacks
> 
> * Implementation is more complicated; it will require changes to (at
>   least) builtin/archive.c, http-backend.c, and
>   builtin/upload-archive.c.
> 
> * Generates more CPU load on the server when compressing archives. This
>   is potentially a DoS vector.
> 
> * Does not allow archiving from servers that don't support the
>   git-upload-archive service.

I happen to like this option because it has the potential to be driven
by a non-git client (e.g. a curl invocation).  That would be enormously
valuable, especially in cases where authentication isn't desired or an
SSH key isn't a good form of authentication.

I'm not really worried about the DoS vector because an implementation is
almost certainly going to support both SSH and HTTPS or neither, and the
DoS potential is the same either way.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204




Re: Proposed approaches to supporting HTTP remotes in "git archive"

2018-07-27 Thread Junio C Hamano
Josh Steadmon  writes:

> # Supporting HTTP remotes in "git archive"
>
> We would like to allow remote archiving from HTTP servers. There are a
> few possible implementations to be discussed:
>
> ## Shallow clone to temporary repo
>
> This approach builds on existing endpoints. Clients will connect to the
> remote server's git-upload-pack service and fetch a shallow clone of the
> requested commit into a temporary local repo. The write_archive()
> function is then called on the local clone to write out the requested
> archive.
>
> ...
>
> ## Summary
>
> Personally, I lean towards the first approach. It could give us an
> opportunity to remove server-side complexity; there is no reason that
> the shallow-clone approach must be restricted to the HTTP transport, and
> we could re-implement other transports using this method.  Additionally,
> it would allow clients to pull archives from remotes that would not
> otherwise support it.

I consider the first one (i.e. make a shallow clone and tar it up
locally) a hack that does *not* belong to "git archive --remote"
command, especially when it is only done to "http remotes".  The
only reason HTTP remotes are special is because there is no ready
"http-backend" equivalent that passes the "git archive" traffic over
smart-http transport, unlike the one that exists for "git
upload-pack".

It however still _is_ attractive to drive such a hack from "git
archive" at the UI level, as the end users do not care how ugly the
hack is ;-)  As you mentioned, the approach would work for any
transport that allows one-commit shallow clone, so it might become
more palatable if it is designed as a different _mode_ of operation
of "git archive" that is orthogonal to the underlying transport,
i.e.

  $ git archive --remote= --shallow-clone-then-local-archive-hack master

or

  $ git config archive..useShallowCloneThenLocalArchiveHack true
  $ git archive --remote= master

It might turn out that it may work better than the native "git
archive" access against servers that offer both shallow clone
and native archive access.  I doubt a single-commit shallow clone
would benefit from reusing of premade deltas and compressed bases
streamed straight out of packfiles from the server side that much,
but you'd never know until you measure ;-)
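One crude way to start measuring is to compare the on-disk size of a one-commit shallow clone with the size of a compressed archive of the same ref, as a first-order proxy for transfer cost (a sketch; the helper name is mine):

```shell
# Print the size of a one-commit bare shallow clone next to the size of
# a gzipped archive of the same ref.  A rough proxy for the bandwidth
# each approach would need; helper name is illustrative.
compare_transfer_sizes() {
	remote=$1 ref=$2
	tmp=$(mktemp -d) || return 1
	git clone -q --depth=1 --bare --branch "$ref" "$remote" "$tmp/clone.git" &&
		git -C "$tmp/clone.git" archive --format=tar.gz "$ref" >"$tmp/a.tar.gz" &&
		printf 'shallow clone: %s KiB, archive: %s KiB\n' \
			"$(du -sk "$tmp/clone.git" | cut -f1)" \
			"$(du -k "$tmp/a.tar.gz" | cut -f1)"
	status=$?
	rm -rf "$tmp"
	return $status
}
```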




Re: Proposed approaches to supporting HTTP remotes in "git archive"

2018-07-27 Thread Jonathan Nieder
(just cc-ing René Scharfe, archive expert; Peff; Dscho; Franck Bui-Huu
to see how his creation is evolving.

Using the correct address for René this time. Sorry for the noise.)

Josh Steadmon wrote:

> # Supporting HTTP remotes in "git archive"
>
> We would like to allow remote archiving from HTTP servers. There are a
> few possible implementations to be discussed:
>
> ## Shallow clone to temporary repo
>
> This approach builds on existing endpoints. Clients will connect to the
> remote server's git-upload-pack service and fetch a shallow clone of the
> requested commit into a temporary local repo. The write_archive()
> function is then called on the local clone to write out the requested
> archive.
>
> ### Benefits
>
> * This can be implemented entirely in builtin/archive.c. No new service
>   endpoints or server code are required.
>
> * The archive is generated and compressed on the client side. This
>   reduces CPU load on the server (for compressed archives) which would
>   otherwise be a potential DoS vector.
>
> * This provides a git-native way to archive any HTTP servers that
>   support the git-upload-pack service; some providers (including GitHub)
>   do not currently allow the git-upload-archive service.
>
> ### Drawbacks
>
> * Archives generated remotely may not be bit-for-bit identical compared
>   to those generated locally, if the versions of git used on the client
>   and on the server differ.
>
> * This requires higher bandwidth compared to transferring a compressed
>   archive generated on the server.
>
>
> ## Use git-upload-archive
>
> This approach requires adding support for the git-upload-archive
> endpoint to the HTTP backend. Clients will connect to the remote
> server's git-upload-archive service and the server will generate the
> archive which is then delivered to the client.
>
> ### Benefits
>
> * Matches existing "git archive" behavior for other remotes.
>
> * Requires less bandwidth to send a compressed archive than a shallow
>   clone.
>
> * Resulting archive does not depend in any way on the client
>   implementation.
>
> ### Drawbacks
>
> * Implementation is more complicated; it will require changes to (at
>   least) builtin/archive.c, http-backend.c, and
>   builtin/upload-archive.c.
>
> * Generates more CPU load on the server when compressing archives. This
>   is potentially a DoS vector.
>
> * Does not allow archiving from servers that don't support the
>   git-upload-archive service.
>
>
> ## Add a new protocol v2 "archive" command
>
> I am still a bit hazy on the exact details of this approach, please
> forgive any inaccuracies (I'm a new contributor and haven't examined
> custom v2 commands in much detail yet).
>
> This approach builds off the existing v2 upload-pack endpoint. The
> client will issue an archive command (with options to select particular
> paths or a tree-ish). The server will generate the archive and deliver
> it to the client.
>
> ### Benefits
>
> * Requires less bandwidth to send a compressed archive than a shallow
>   clone.
>
> * Resulting archive does not depend in any way on the client
>   implementation.
>
> ### Drawbacks
>
> * Generates more CPU load on the server when compressing archives. This
>   is potentially a DoS vector.
>
> * Servers must support the v2 protocol (although the client could
>   potentially fall back to some other supported remote archive
>   functionality).
>
> ### Unknowns
>
> * I am not clear on the relative complexity of this approach compared to
>   the others, and would appreciate any guidance offered.
>
>
> ## Summary
>
> Personally, I lean towards the first approach. It could give us an
> opportunity to remove server-side complexity; there is no reason that
> the shallow-clone approach must be restricted to the HTTP transport, and
> we could re-implement other transports using this method.  Additionally,
> it would allow clients to pull archives from remotes that would not
> otherwise support it.
>
> That said, I am happy to work on whichever approach the community deems
> most worthwhile.

