Re: Git Merge contributor summit notes

2018-04-07 Thread Jakub Narebski
Brandon Williams  writes:
> On 03/26, Jeff Hostetler wrote:

[...]
>> All of these cases could be eliminated if the type/size were available
>> in the OID.
>> 
>> Just a thought.  While we are converting to a new hash it seems like
>> this would be a good time to at least discuss it.
>
> Echoing what Stefan said.  I don't think it's a good idea to embed this
> sort of data into the OID.  There are a lot of reasons, but one of them
> is that it would gate access to this data on completing the hash
> transition (which could very well still be years away).
>
> I think that a much better approach would be to create a metadata
> structure (much like the commit graph that stolee has been working on)
> which can store this data alongside the objects (but not in the
> packfiles themselves).  It could be a stacking structure which is
> periodically coalesced, and we could add a wire feature to fetch this
> metadata from the server upon fetching objects.

Well, from what I remember the type of the object is available in the
bitmap file for a packfile (if one enables creating them).  There are
four compressed bit vectors, one for each type, with the i-th bit set
to 1 if the i-th object in the packfile is of the given type.
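
A rough sketch of that layout (illustrative only: the real .bitmap file
stores the four vectors EWAH-compressed and indexes objects in pack
order, and the type and struct names below are made up):

#include <stddef.h>

enum example_type { TYPE_NONE, TYPE_COMMIT, TYPE_TREE, TYPE_BLOB, TYPE_TAG };

struct type_bitmaps {
	const unsigned char *commits;
	const unsigned char *trees;
	const unsigned char *blobs;
	const unsigned char *tags;
	size_t nr_objects;		/* number of objects in the pack */
};

static int bit_set(const unsigned char *vec, size_t i)
{
	return (vec[i / 8] >> (i % 8)) & 1;
}

/* Type of the i-th object in pack order, without opening the object. */
static enum example_type type_of_nth_object(const struct type_bitmaps *b,
					    size_t i)
{
	if (i >= b->nr_objects)
		return TYPE_NONE;
	if (bit_set(b->commits, i))
		return TYPE_COMMIT;
	if (bit_set(b->trees, i))
		return TYPE_TREE;
	if (bit_set(b->blobs, i))
		return TYPE_BLOB;
	if (bit_set(b->tags, i))
		return TYPE_TAG;
	return TYPE_NONE;
}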

Just FYI.
--
Jakub Narębski


Re: Including object type and size in object id (Re: Git Merge contributor summit notes)

2018-03-26 Thread Junio C Hamano
Jonathan Nieder  writes:

> This implies a limit on the object size (e.g. 5 bytes in your
> example).  What happens when someone wants to encode an object larger
> than that limit?
>
> This also decreases the number of bits available for the hash, but
> that shouldn't be a big issue.

I actually thought that the latter "downside" makes the object name
a tad larger.

But let's not go there, really.

"X is handy if we can get it on the surface without looking into it"
will grow.  Somebody may want to have the generation number of a
commit in the commit object name.  Yet another somebody may want to
be able to quickly learn the object name for the top-level tree from
the commit object name alone.  We need to stop somewhere, and as
already suggested in the thread(s), having auxiliary look-up table
is a better way to go, encoding nothing in the name, as we are going
to need such a look-up table because it is unrealistic to encode
everything we would want in the name anyway.
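
For concreteness, a minimal sketch of such an auxiliary look-up table
(names and field widths are made up for illustration; a real format,
like the commit-graph file, would also need headers, versioning,
fanout and so on):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASH_RAWSZ 32			/* a NewHash-sized name, say */

struct obj_meta {
	unsigned char oid[HASH_RAWSZ];
	uint8_t type;			/* commit/tree/blob/tag */
	uint64_t size;			/* raw uncompressed size */
};

/*
 * Records are sorted by object name; return the entry for `oid`, or
 * NULL if the table has no entry (then fall back to opening the object).
 */
static const struct obj_meta *lookup_meta(const struct obj_meta *table,
					  size_t nr,
					  const unsigned char *oid)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(oid, table[mid].oid, HASH_RAWSZ);

		if (!cmp)
			return &table[mid];
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return NULL;
}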



Re: Including object type and size in object id (Re: Git Merge contributor summit notes)

2018-03-26 Thread Jeff Hostetler



On 3/26/2018 5:00 PM, Jonathan Nieder wrote:

Jeff Hostetler wrote:
[long quote snipped]


While we are converting to a new hash function, it would be nice
if we could add a couple of fields to the end of the OID:  the object
type and the raw uncompressed object size.

It would be nice if we could extend the OID to include 6 bytes of data
(4 or 8 bits for the type and the rest for the raw object size), and
just say that an OID is a {hash,type,size} tuple.

There are lots of places where we open an object to see what type it is
or how big it is.  This requires uncompressing/undeltafying the object
(or at least decoding enough to get the header).  In the case of missing
objects (partial clone or a gvfs-like projection) it requires either
dynamically fetching the object or asking an object-size-server for the
data.

All of these cases could be eliminated if the type/size were available
in the OID.


This implies a limit on the object size (e.g. 5 bytes in your
example).  What happens when someone wants to encode an object larger
than that limit?


I could suggest adding a full uint64 to the tail end of the hash, but
we don't currently handle blobs/objects larger than 4GB anyway, right?

5 bytes for the size is just a compromise -- 1TB blobs would be
terrible to think about...
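
To put numbers on that compromise (a back-of-the-envelope sketch, not a
proposed encoding): with a 6-byte tail, an 8-bit type leaves 40 bits of
size, i.e. a 1 TiB cap, while a 4-bit type leaves 44 bits, i.e. 16 TiB.

#include <stdint.h>

#define TAIL_BITS 48			/* the 6 extra bytes being discussed */
#define TYPE_BITS 8			/* or 4, if four object types are enough */
#define SIZE_BITS (TAIL_BITS - TYPE_BITS)
#define MAX_ENCODABLE_SIZE ((UINT64_C(1) << SIZE_BITS) - 1)	/* ~1 TiB at 40 bits */

static uint64_t pack_tail(uint8_t type, uint64_t size)
{
	if (size > MAX_ENCODABLE_SIZE)
		size = MAX_ENCODABLE_SIZE;	/* or refuse to create such an object */
	return ((uint64_t)type << SIZE_BITS) | size;
}

static uint8_t tail_type(uint64_t tail)
{
	return (uint8_t)(tail >> SIZE_BITS);
}

static uint64_t tail_size(uint64_t tail)
{
	return tail & MAX_ENCODABLE_SIZE;
}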
 


This also decreases the number of bits available for the hash, but
that shouldn't be a big issue.


I was suggesting extending the OIDs by 6 bytes while we are changing
the hash function.


Aside from those two, I don't see any downsides.  It would mean that
tree objects contain information about the sizes of blobs contained
there, which helps with virtual file systems.  It's also possible to
do that without putting the size in the object id, but maybe having it
in the object id is simpler.

Will think more about this.

Thanks for the idea,
Jonathan



Thanks
Jeff



Re: Per-object encryption (Re: Git Merge contributor summit notes)

2018-03-26 Thread Ævar Arnfjörð Bjarmason

On Mon, Mar 26 2018, Jonathan Nieder wrote:

> Hi Ævar,
>
> Ævar Arnfjörð Bjarmason wrote:
>
>> It occurred to me recently that once we have such a layer it could be
>> (ab)used with some relatively minor changes to do any arbitrary
>> local-to-remote object content translation, unless I've missed something
>> (but I just re-read hash-function-transition.txt now...).
>>
>> E.g. having a SHA-1 (or NewHash) local repo, but interfacing with a
>> remote server so that you upload a GPG encrypted version of all your
>> blobs, and have your trees reference those blobs.
>
> Interesting!
>
> To be clear, this would only work with deterministic encryption.
> Normal GPG encryption would not have the round-tripping properties
> required by the design.

Right, sorry. I was being lazy. For simplicity let's say rot13 or some
other deterministic algorithm.

> If I understand correctly, it also requires both sides of the
> connection to have access to the encryption key.  Otherwise they
> cannot perform ordinary operations like revision walks.  So I'm not
> seeing a huge advantage over ordinary transport-layer encryption.
>
> That said, it's an interesting idea --- thanks for that.  I'm changing
> the subject line since otherwise there's no way I'll find this again. :)

In the specific implementation I have in mind, only one side would have
the key; we'd encrypt just up to the point where the repository would
still pass fsck.  But of course once we had that facility we could do
any arbitrary translation.

I.e. consider the latest commit in git.git:

commit 90bbd502d54fe920356fa9278055dc9c9bfe9a56
tree 5539308dc384fd11055be9d6a0cc1cce7d495150
parent 085f5f95a2723e8f9f4d037c01db5b786355ba49
parent d32eb83c1db7d0a8bb54fe743c6d1dd674d372c5
author Junio C Hamano  1521754611 -0700
committer Junio C Hamano  1521754611 -0700

Sync with Git 2.16.3

With rot13 "encryption" it would be:

commit 
tree 
parent 
parent 
author Whavb P Unznab  1521754611 -0700
committer Whavb P Unznab  1521754611 -0700

Flap jvgu Tvg 2.16.3

And an ls-tree on that tree hash would, instead of README.md, give you:

100644 blob  ERNQZR.zq

And inspecting that blob would give you:

# Rot13'd "Hello, World!"
Uryyb, Jbeyq!

So obviously for the encryption use-case such a repo would leak a lot of
info compared to just uploading the fast-export version of it
periodically as one big encrypted blob to store somewhere, but the
advantage would be:

 * It's better than existing "just munge the blobs" encryption solutions
   bolted on top of git, because at least you encrypt the commit
   message, author names & filenames.

 * Since it would be a valid repo even without the key, you could use
   git hosting solutions for it, similar to checking in encrypted blobs
   in existing git repos.

 * As noted, it could be a permanent stress test on the SHA-1<->NewHash
   codepath.

   I can't think of a reason why, once we have that, we couldn't add
   the equivalent of clean/smudge filters.

   We need to unpack & repack & re-hash all the stuff we send over the
   wire anyway, so we can munge it as it goes in/out as long as the same
   input values always yield the same output values.
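
A sketch of the kind of deterministic content munging this relies on,
using the rot13 stand-in from the example above.  The only property the
translation layer needs is that the mapping is a bijection, so it
round-trips and the same input always hashes to the same translated
name:

#include <stddef.h>

static char rot13_char(char c)
{
	if (c >= 'a' && c <= 'z')
		return 'a' + (c - 'a' + 13) % 26;
	if (c >= 'A' && c <= 'Z')
		return 'A' + (c - 'A' + 13) % 26;
	return c;		/* digits, punctuation, binary bytes pass through */
}

/* rot13 is its own inverse, so one function covers both directions. */
static void rot13_buf(char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		buf[i] = rot13_char(buf[i]);
}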


Including object type and size in object id (Re: Git Merge contributor summit notes)

2018-03-26 Thread Jonathan Nieder
(administrivia: please omit parts of the text you are replying to that
 are not relevant to the reply.  This makes it easier to see what you're
 replying to, especially in mail readers that don't hide quoted text by
 default.)
Hi Jeff,

Jeff Hostetler wrote:
[long quote snipped]

> While we are converting to a new hash function, it would be nice
> if we could add a couple of fields to the end of the OID:  the object
> type and the raw uncompressed object size.
>
> It would be nice if we could extend the OID to include 6 bytes of data
> (4 or 8 bits for the type and the rest for the raw object size), and
> just say that an OID is a {hash,type,size} tuple.
>
> There are lots of places where we open an object to see what type it is
> or how big it is.  This requires uncompressing/undeltafying the object
> (or at least decoding enough to get the header).  In the case of missing
> objects (partial clone or a gvfs-like projection) it requires either
> dynamically fetching the object or asking an object-size-server for the
> data.
>
> All of these cases could be eliminated if the type/size were available
> in the OID.

This implies a limit on the object size (e.g. 5 bytes in your
example).  What happens when someone wants to encode an object larger
than that limit?

This also decreases the number of bits available for the hash, but
that shouldn't be a big issue.

Aside from those two, I don't see any downsides.  It would mean that
tree objects contain information about the sizes of blobs contained
there, which helps with virtual file systems.  It's also possible to
do that without putting the size in the object id, but maybe having it
in the object id is simpler.

Will think more about this.

Thanks for the idea,
Jonathan


Per-object encryption (Re: Git Merge contributor summit notes)

2018-03-26 Thread Jonathan Nieder
Hi Ævar,

Ævar Arnfjörð Bjarmason wrote:

> It occurred to me recently that once we have such a layer it could be
> (ab)used with some relatively minor changes to do any arbitrary
> local-to-remote object content translation, unless I've missed something
> (but I just re-read hash-function-transition.txt now...).
>
> E.g. having a SHA-1 (or NewHash) local repo, but interfacing with a
> remote server so that you upload a GPG encrypted version of all your
> blobs, and have your trees reference those blobs.

Interesting!

To be clear, this would only work with deterministic encryption.
Normal GPG encryption would not have the round-tripping properties
required by the design.

If I understand correctly, it also requires both sides of the
connection to have access to the encryption key.  Otherwise they
cannot perform ordinary operations like revision walks.  So I'm not
seeing a huge advantage over ordinary transport-layer encryption.

That said, it's an interesting idea --- thanks for that.  I'm changing
the subject line since otherwise there's no way I'll find this again. :)

Jonathan


Re: Git Merge contributor summit notes

2018-03-26 Thread Jeff Hostetler



On 3/26/2018 1:56 PM, Stefan Beller wrote:

On Mon, Mar 26, 2018 at 10:33 AM Jeff Hostetler wrote:




On 3/25/2018 6:58 PM, Ævar Arnfjörð Bjarmason wrote:


On Sat, Mar 10 2018, Alex Vandiver wrote:


New hash (Stefan, etc)
--
   - discussed on the mailing list
   - actual plan checked in to
     Documentation/technical/hash-function-transition.txt
   - lots of work renaming
   - any actual work with the transition plan?
   - local conversion first; fetch/push have translation table
   - like git-svn
   - also modified pack and index format to have lookup/translation
     efficiently
   - brian's series to eliminate SHA1 strings from the codebase
   - testsuite is not working well because hardcoded SHA1 values
   - flip a bit in the sha1 computation and see what breaks in the
     testsuite
   - will also need a way to do the conversion itself; traverse and
     write out new version
   - without that, can start new repos, but not work on old ones
   - on-disk formats will need to change -- something to keep in mind
     with new index work
   - documentation describes packfile and index formats
   - what time frame are we talking?
   - public perception question
   - signing commits doesn't help (just signs commit object) unless you
     "recursive sign"
   - switched to SHA1dc; we detect and reject known collision technique
   - do it now because it takes too long if we start when the collision
     drops
   - always call it "new hash" to reduce bikeshedding
   - is translation table a backdoor? has it been reviewed by crypto
     folks?
     - no, but everything gets translated
   - meant to avoid a flag day for entire repositories
   - linus can decide to upgrade to newhash; if pushes to server that
     is not newhash aware, that's fine
   - will need a wire protocol change
   - v2 might add a capability for newhash
   - "now that you mention md5, it's a good idea"
   - can use md5 to test the conversion
   - is there a technical reason for why not /n/ hashes?
   - the slow step goes away as people converge to the new hash
   - beneficial to make up some fake hash function for testing
   - is there a plan on how we decide which hash function?
   - trust junio to merge commits when appropriate
   - conservancy committee explicitly does not make code decisions
   - waiting will just give better data
   - some hash functions are in silicon (e.g. microsoft cares)
   - any movement in libgit2 / jgit?
     - basic stuff for libgit2; same testsuite problems
     - no work in jgit
   - most optimistic forecast?
     - could be done in 1-2y
   - submodules with one hash function?
     - unable to convert project unless all submodules are converted
     - OO-ing is not a prereq


Late reply, but one thing I brought up at the time is that we'll want to
keep this code around even after the NewHash migration at least for
testing purposes, should we ever need to move to NewNewHash.

It occurred to me recently that once we have such a layer it could be
(ab)used with some relatively minor changes to do any arbitrary
local-to-remote object content translation, unless I've missed something
(but I just re-read hash-function-transition.txt now...).

E.g. having a SHA-1 (or NewHash) local repo, but interfacing with a
remote server so that you upload a GPG encrypted version of all your
blobs, and have your trees reference those blobs.

Because we'd be doing arbitrary translations for all of
commits/trees/blobs this could go further than other bolted-on
encryption solutions for Git. E.g. paths in trees could be encrypted
too, as well as all the content of the commit object that isn't parent
info & the like (but that would have different hashes).

Basically clean/smudge filters on steroids, but for every object in the
repo. Anyone who got a hold of it would still see the shape of the repo
& approximate content size, but other than that it wouldn't be more info
than they'd get via `fast-export --anonymize` now.

I mainly find it interesting because it presents an intersection between a
feature we might want to offer anyway, and something that would stress
the hash transition codepath going forward, to make sure it hasn't all
bitrotted by the time we'll need NewHash->NewNewHash.

Git hosting providers would hate it, but they should probably be
charging users by how much Michael Haggerty's git-sizer tool hates their
repo anyway :)




While we are converting to a new hash function, it would be nice
if we could add a couple of fields to the end of the OID:  the object
type and the raw uncompressed object size.


This would allow crafting invalid OIDs, i.e. the correct hash value with
the wrong object type. (This is a different kind of "invalid" than
today, where we either have or do not have the object named by the
hash value. If we don't have it, it may just be unknown to us, but not
"wrong".)


An invalid OID (such as a wrong object type) could be detected as soon
as we open the object and read the header.

Re: Git Merge contributor summit notes

2018-03-26 Thread Brandon Williams
On 03/26, Jeff Hostetler wrote:
> 
> 
> On 3/25/2018 6:58 PM, Ævar Arnfjörð Bjarmason wrote:
> > 
> > On Sat, Mar 10 2018, Alex Vandiver wrote:
> > 
> > > New hash (Stefan, etc)
> > > --
> > >   - discussed on the mailing list
> > >   - actual plan checked in to 
> > > Documentation/technical/hash-function-transition.txt
> > >   - lots of work renaming
> > >   - any actual work with the transition plan?
> > >   - local conversion first; fetch/push have translation table
> > >   - like git-svn
> > >   - also modified pack and index format to have lookup/translation 
> > > efficiently
> > >   - brian's series to eliminate SHA1 strings from the codebase
> > >   - testsuite is not working well because hardcoded SHA1 values
> > >   - flip a bit in the sha1 computation and see what breaks in the 
> > > testsuite
> > >   - will also need a way to do the conversion itself; traverse and write 
> > > out new version
> > >   - without that, can start new repos, but not work on old ones
> > >   - on-disk formats will need to change -- something to keep in mind with 
> > > new index work
> > >   - documentation describes packfile and index formats
> > >   - what time frame are we talking?
> > >   - public perception question
> > >   - signing commits doesn't help (just signs commit object) unless you 
> > > "recursive sign"
> > >   - switched to SHA1dc; we detect and reject known collision technique
> > >   - do it now because it takes too long if we start when the collision 
> > > drops
> > >   - always call it "new hash" to reduce bikeshedding
> > >   - is translation table a backdoor? has it been reviewed by crypto folks?
> > > - no, but everything gets translated
> > >   - meant to avoid a flag day for entire repositories
> > >   - linus can decide to upgrade to newhash; if pushes to server that is 
> > > not newhash aware, that's fine
> > >   - will need a wire protocol change
> > >   - v2 might add a capability for newhash
> > >   - "now that you mention md5, it's a good idea"
> > >   - can use md5 to test the conversion
> > >   - is there a technical reason for why not /n/ hashes?
> > >   - the slow step goes away as people converge to the new hash
> > >   - beneficial to make up some fake hash function for testing
> > >   - is there a plan on how we decide which hash function?
> > >   - trust junio to merge commits when appropriate
> > >   - conservancy committee explicitly does not make code decisions
> > >   - waiting will just give better data
> > >   - some hash functions are in silicon (e.g. microsoft cares)
> > >   - any movement in libgit2 / jgit?
> > > - basic stuff for libgit2; same testsuite problems
> > > - no work in jgit
> > >   - most optimistic forecast?
> > > - could be done in 1-2y
> > >   - submodules with one hash function?
> > > - unable to convert project unless all submodules are converted
> > > - OO-ing is not a prereq
> > 
> > Late reply, but one thing I brought up at the time is that we'll want to
> > keep this code around even after the NewHash migration at least for
> > testing purposes, should we ever need to move to NewNewHash.
> > 
> > It occurred to me recently that once we have such a layer it could be
> > (ab)used with some relatively minor changes to do any arbitrary
> > local-to-remote object content translation, unless I've missed something
> > (but I just re-read hash-function-transition.txt now...).
> > 
> > E.g. having a SHA-1 (or NewHash) local repo, but interfacing with a
> > remote server so that you upload a GPG encrypted version of all your
> > blobs, and have your trees reference those blobs.
> > 
> > Because we'd be doing arbitrary translations for all of
> > commits/trees/blobs this could go further than other bolted-on
> > encryption solutions for Git. E.g. paths in trees could be encrypted
> > too, as well as all the content of the commit object that isn't parent
> > info & the like (but that would have different hashes).
> > 
> > Basically clean/smudge filters on steroids, but for every object in the
> > repo. Anyone who got a hold of it would still see the shape of the repo
> > & approximate content size, but other than that it wouldn't be more info
> > than they'd get via `fast-export --anonymize` now.
> > 
> > I mainly find it interesting because presents an intersection between a
> > feature we might want to offer anyway, and something that would stress
> > the hash transition codepath going forward, to make sure it hasn't all
> > bitrotted by the time we'll need NewHash->NewNewHash.
> > 
> > Git hosting providers would hate it, but they should probably be
> > charging users by how much Michael Haggerty's git-sizer tool hates their
> > repo anyway :)
> > 
> 
> While we are converting to a new hash function, it would be nice
> if we could add a couple of fields to the end of the OID:  the object
> type and the raw uncompressed object size.
> 
> It would be nice if we could extend the OID to include 6 bytes of data
> (4 or 8 bits for the type and the rest for the raw object size), and
> just say that an OID is a {hash,type,size} tuple.

Re: Git Merge contributor summit notes

2018-03-26 Thread Stefan Beller
On Mon, Mar 26, 2018 at 10:33 AM Jeff Hostetler wrote:



> On 3/25/2018 6:58 PM, Ævar Arnfjörð Bjarmason wrote:
> >
> > On Sat, Mar 10 2018, Alex Vandiver wrote:
> >
> >> New hash (Stefan, etc)
> >> --
> >>   - discussed on the mailing list
> >>   - actual plan checked in to
> >>     Documentation/technical/hash-function-transition.txt
> >>   - lots of work renaming
> >>   - any actual work with the transition plan?
> >>   - local conversion first; fetch/push have translation table
> >>   - like git-svn
> >>   - also modified pack and index format to have lookup/translation
> >>     efficiently
> >>   - brian's series to eliminate SHA1 strings from the codebase
> >>   - testsuite is not working well because hardcoded SHA1 values
> >>   - flip a bit in the sha1 computation and see what breaks in the
> >>     testsuite
> >>   - will also need a way to do the conversion itself; traverse and
> >>     write out new version
> >>   - without that, can start new repos, but not work on old ones
> >>   - on-disk formats will need to change -- something to keep in mind
> >>     with new index work
> >>   - documentation describes packfile and index formats
> >>   - what time frame are we talking?
> >>   - public perception question
> >>   - signing commits doesn't help (just signs commit object) unless you
> >>     "recursive sign"
> >>   - switched to SHA1dc; we detect and reject known collision technique
> >>   - do it now because it takes too long if we start when the collision
> >>     drops
> >>   - always call it "new hash" to reduce bikeshedding
> >>   - is translation table a backdoor? has it been reviewed by crypto
> >>     folks?
> >>     - no, but everything gets translated
> >>   - meant to avoid a flag day for entire repositories
> >>   - linus can decide to upgrade to newhash; if pushes to server that
> >>     is not newhash aware, that's fine
> >>   - will need a wire protocol change
> >>   - v2 might add a capability for newhash
> >>   - "now that you mention md5, it's a good idea"
> >>   - can use md5 to test the conversion
> >>   - is there a technical reason for why not /n/ hashes?
> >>   - the slow step goes away as people converge to the new hash
> >>   - beneficial to make up some fake hash function for testing
> >>   - is there a plan on how we decide which hash function?
> >>   - trust junio to merge commits when appropriate
> >>   - conservancy committee explicitly does not make code decisions
> >>   - waiting will just give better data
> >>   - some hash functions are in silicon (e.g. microsoft cares)
> >>   - any movement in libgit2 / jgit?
> >>     - basic stuff for libgit2; same testsuite problems
> >>     - no work in jgit
> >>   - most optimistic forecast?
> >>     - could be done in 1-2y
> >>   - submodules with one hash function?
> >>     - unable to convert project unless all submodules are converted
> >>     - OO-ing is not a prereq
> >
> > Late reply, but one thing I brought up at the time is that we'll want to
> > keep this code around even after the NewHash migration at least for
> > testing purposes, should we ever need to move to NewNewHash.
> >
> > It occurred to me recently that once we have such a layer it could be
> > (ab)used with some relatively minor changes to do any arbitrary
> > local-to-remote object content translation, unless I've missed something
> > (but I just re-read hash-function-transition.txt now...).
> >
> > E.g. having a SHA-1 (or NewHash) local repo, but interfacing with a
> > remote server so that you upload a GPG encrypted version of all your
> > blobs, and have your trees reference those blobs.
> >
> > Because we'd be doing arbitrary translations for all of
> > commits/trees/blobs this could go further than other bolted-on
> > encryption solutions for Git. E.g. paths in trees could be encrypted
> > too, as well as all the content of the commit object that isn't parent
> > info & the like (but that would have different hashes).
> >
> > Basically clean/smudge filters on steroids, but for every object in the
> > repo. Anyone who got a hold of it would still see the shape of the repo
> > & approximate content size, but other than that it wouldn't be more info
> > than they'd get via `fast-export --anonymize` now.
> >
> > I mainly find it interesting because presents an intersection between a
> > feature we might want to offer anyway, and something that would stress
> > the hash transition codepath going forward, to make sure it hasn't all
> > bitrotted by the time we'll need NewHash->NewNewHash.
> >
> > Git hosting providers would hate it, but they should probably be
> > charging users by how much Michael Haggerty's git-sizer tool hates their
> > repo anyway :)
> >

> While we are converting to a new hash function, it would be nice
> if we could add a couple of fields to the end of the OID:  the object
> type and the raw uncompressed object size.

This would allow crafting invalid OIDs, i.e. the correct hash value with
the wrong object type. (This is a different kind of "invalid" than
today, where we either have or do not have the object named by the
hash value. If we don't have it, it may just be unknown to us, but not
"wrong".)

Re: Git Merge contributor summit notes

2018-03-26 Thread Jeff Hostetler



On 3/25/2018 6:58 PM, Ævar Arnfjörð Bjarmason wrote:


On Sat, Mar 10 2018, Alex Vandiver wrote:


New hash (Stefan, etc)
--
  - discussed on the mailing list
  - actual plan checked in to 
Documentation/technical/hash-function-transition.txt
  - lots of work renaming
  - any actual work with the transition plan?
  - local conversion first; fetch/push have translation table
  - like git-svn
  - also modified pack and index format to have lookup/translation efficiently
  - brian's series to eliminate SHA1 strings from the codebase
  - testsuite is not working well because hardcoded SHA1 values
  - flip a bit in the sha1 computation and see what breaks in the testsuite
  - will also need a way to do the conversion itself; traverse and write out 
new version
  - without that, can start new repos, but not work on old ones
  - on-disk formats will need to change -- something to keep in mind with new 
index work
  - documentation describes packfile and index formats
  - what time frame are we talking?
  - public perception question
  - signing commits doesn't help (just signs commit object) unless you "recursive 
sign"
  - switched to SHA1dc; we detect and reject known collision technique
  - do it now because it takes too long if we start when the collision drops
  - always call it "new hash" to reduce bikeshedding
  - is translation table a backdoor? has it been reviewed by crypto folks?
- no, but everything gets translated
  - meant to avoid a flag day for entire repositories
  - linus can decide to upgrade to newhash; if pushes to server that is not 
newhash aware, that's fine
  - will need a wire protocol change
  - v2 might add a capability for newhash
  - "now that you mention md5, it's a good idea"
  - can use md5 to test the conversion
  - is there a technical reason for why not /n/ hashes?
  - the slow step goes away as people converge to the new hash
  - beneficial to make up some fake hash function for testing
  - is there a plan on how we decide which hash function?
  - trust junio to merge commits when appropriate
  - conservancy committee explicitly does not make code decisions
  - waiting will just give better data
  - some hash functions are in silicon (e.g. microsoft cares)
  - any movement in libgit2 / jgit?
- basic stuff for libgit2; same testsuite problems
- no work in jgit
  - most optimistic forecast?
- could be done in 1-2y
  - submodules with one hash function?
- unable to convert project unless all submodules are converted
- OO-ing is not a prereq


Late reply, but one thing I brought up at the time is that we'll want to
keep this code around even after the NewHash migration at least for
testing purposes, should we ever need to move to NewNewHash.

It occurred to me recently that once we have such a layer it could be
(ab)used with some relatively minor changes to do any arbitrary
local-to-remote object content translation, unless I've missed something
(but I just re-read hash-function-transition.txt now...).

E.g. having a SHA-1 (or NewHash) local repo, but interfacing with a
remote server so that you upload a GPG encrypted version of all your
blobs, and have your trees reference those blobs.

Because we'd be doing arbitrary translations for all of
commits/trees/blobs this could go further than other bolted-on
encryption solutions for Git. E.g. paths in trees could be encrypted
too, as well as all the content of the commit object that isn't parent
info & the like (but that would have different hashes).

Basically clean/smudge filters on steroids, but for every object in the
repo. Anyone who got a hold of it would still see the shape of the repo
& approximate content size, but other than that it wouldn't be more info
than they'd get via `fast-export --anonymize` now.

I mainly find it interesting because it presents an intersection between a
feature we might want to offer anyway, and something that would stress
the hash transition codepath going forward, to make sure it hasn't all
bitrotted by the time we'll need NewHash->NewNewHash.

Git hosting providers would hate it, but they should probably be
charging users by how much Michael Haggerty's git-sizer tool hates their
repo anyway :)



While we are converting to a new hash function, it would be nice
if we could add a couple of fields to the end of the OID:  the object
type and the raw uncompressed object size.

It would be nice if we could extend the OID to include 6 bytes of data
(4 or 8 bits for the type and the rest for the raw object size), and
just say that an OID is a {hash,type,size} tuple.

There are lots of places where we open an object to see what type it is
or how big it is.  This requires uncompressing/undeltafying the object
(or at least decoding enough to get the header).  In the case of missing
objects (partial clone or a gvfs-like projection) it requires either
dynamically fetching the object or asking an object-size-server for the
data.

All of these cases could be eliminated if the type/size were available
in the OID.
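
To make that concrete, a minimal sketch of the accessors such an
extended OID would enable (hypothetical struct and helper names, not
git's actual object_id API), answering type and size queries without
touching the object store at all:

#include <stdint.h>

#define NEWHASH_RAWSZ 32		/* hypothetical NewHash digest length */

struct oid_with_meta {
	unsigned char hash[NEWHASH_RAWSZ];
	unsigned char tail[6];		/* 1 byte of type + 5 bytes of size */
};

static int oid_type(const struct oid_with_meta *oid)
{
	return oid->tail[0];
}

static uint64_t oid_size(const struct oid_with_meta *oid)
{
	uint64_t size = 0;
	int i;

	/* 40-bit big-endian size; no object lookup or inflate needed */
	for (i = 1; i < 6; i++)
		size = (size << 8) | oid->tail[i];
	return size;
}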

Re: Git Merge contributor summit notes

2018-03-25 Thread Ævar Arnfjörð Bjarmason

On Sat, Mar 10 2018, Alex Vandiver wrote:

> New hash (Stefan, etc)
> --
>  - discussed on the mailing list
>  - actual plan checked in to 
> Documentation/technical/hash-function-transition.txt
>  - lots of work renaming
>  - any actual work with the transition plan?
>  - local conversion first; fetch/push have translation table
>  - like git-svn
>  - also modified pack and index format to have lookup/translation efficiently
>  - brian's series to eliminate SHA1 strings from the codebase
>  - testsuite is not working well because hardcoded SHA1 values
>  - flip a bit in the sha1 computation and see what breaks in the testsuite
>  - will also need a way to do the conversion itself; traverse and write out 
> new version
>  - without that, can start new repos, but not work on old ones
>  - on-disk formats will need to change -- something to keep in mind with new 
> index work
>  - documentation describes packfile and index formats
>  - what time frame are we talking?
>  - public perception question
>  - signing commits doesn't help (just signs commit object) unless you 
> "recursive sign"
>  - switched to SHA1dc; we detect and reject known collision technique
>  - do it now because it takes too long if we start when the collision drops
>  - always call it "new hash" to reduce bikeshedding
>  - is translation table a backdoor? has it been reviewed by crypto folks?
>- no, but everything gets translated
>  - meant to avoid a flag day for entire repositories
>  - linus can decide to upgrade to newhash; if pushes to server that is not 
> newhash aware, that's fine
>  - will need a wire protocol change
>  - v2 might add a capability for newhash
>  - "now that you mention md5, it's a good idea"
>  - can use md5 to test the conversion
>  - is there a technical reason for why not /n/ hashes?
>  - the slow step goes away as people converge to the new hash
>  - beneficial to make up some fake hash function for testing
>  - is there a plan on how we decide which hash function?
>  - trust junio to merge commits when appropriate
>  - conservancy committee explicitly does not make code decisions
>  - waiting will just give better data
>  - some hash functions are in silicon (e.g. microsoft cares)
>  - any movement in libgit2 / jgit?
>- basic stuff for libgit2; same testsuite problems
>- no work in jgit
>  - most optimistic forecast?
>- could be done in 1-2y
>  - submodules with one hash function?
>- unable to convert project unless all submodules are converted
>- OO-ing is not a prereq

Late reply, but one thing I brought up at the time is that we'll want to
keep this code around even after the NewHash migration at least for
testing purposes, should we ever need to move to NewNewHash.

It occurred to me recently that once we have such a layer it could be
(ab)used with some relatively minor changes to do any arbitrary
local-to-remote object content translation, unless I've missed something
(but I just re-read hash-function-transition.txt now...).

E.g. having a SHA-1 (or NewHash) local repo, but interfacing with a
remote server so that you upload a GPG encrypted version of all your
blobs, and have your trees reference those blobs.

Because we'd be doing arbitrary translations for all of
commits/trees/blobs this could go further than other bolted-on
encryption solutions for Git. E.g. paths in trees could be encrypted
too, as well as all the content of the commit object that isn't parent
info & the like (but that would have different hashes).

Basically clean/smudge filters on steroids, but for every object in the
repo. Anyone who got a hold of it would still see the shape of the repo
& approximate content size, but other than that it wouldn't be more info
than they'd get via `fast-export --anonymize` now.

I mainly find it interesting because it presents an intersection between a
feature we might want to offer anyway, and something that would stress
the hash transition codepath going forward, to make sure it hasn't all
bitrotted by the time we'll need NewHash->NewNewHash.

Git hosting providers would hate it, but they should probably be
charging users by how much Michael Haggerty's git-sizer tool hates their
repo anyway :)


Re: Git Merge contributor summit notes

2018-03-12 Thread Brandon Williams
On 03/12, Jeff King wrote:
> On Sat, Mar 10, 2018 at 02:01:14PM +0100, Ævar Arnfjörð Bjarmason wrote:
> 
> > >  - (peff) Time to deprecate the git anonymous protocol?
> > [...]
> > 
> > I think the conclusion was that nobody cares about the git:// protocol,
> > but people do care about it being super easy to spin up a server, and
> > currently it's easiest to spin up git://, but we could also ship with
> > some git-daemon mode that had a stand-alone webserver (or ssh server) to
> > get around that.
> 
> I don't think keeping support for git:// is too onerous at this point
> (especially because it should make the jump to protocol v2 with the
> rest). But it really is a pretty dated protocol, lacking any kind of
> useful security properties (yes, I know, if we're all verifying signed
> tags it's great, but realistically people are fetching the tip of master
> over a hijack-able TCP connection and running arbitrary code on the
> result). It might be nice if it went away completely so we don't have to
> warn people off of it.
> 
> The only thing git:// really has going over git-over-http right now is
> that it doesn't suffer from the stateless-rpc overhead. But if we unify
> that behavior in v2, then any advantage goes away.

It's still my intention to unify this behavior in v2 but then begin
working on improving negotiation as a whole (once v2 is in) so that we
can hopefully get rid of the nasty corner cases that exist in http://.
Since v2 will be hidden behind a config anyway, it may be prudent to
wait until negotiation gets better before we entertain making v2 the
default (we'll also need to wait for hosting providers to begin
supporting it).

> 
> I do agree we should have _something_ that is easy to spin up. But it
> would be wonderful if git-over-http could become that, and we could just
> deprecate git://. I suppose it's possible people build clients without
> curl, but I suspect that's an extreme minority these days (most third
> party hosters don't seem to offer git:// at all).
> 
> -Peff

-- 
Brandon Williams


Re: Git Merge contributor summit notes

2018-03-12 Thread Jeff King
On Sat, Mar 10, 2018 at 02:01:14PM +0100, Ævar Arnfjörð Bjarmason wrote:

> >  - (peff) Time to deprecate the git anonymous protocol?
> [...]
> 
> I think the conclusion was that nobody cares about the git:// protocol,
> but people do care about it being super easy to spin up a server, and
> currently it's easiest to spin up git://, but we could also ship with
> some git-daemon mode that had a stand-alone webserver (or ssh server) to
> get around that.

I don't think keeping support for git:// is too onerous at this point
(especially because it should make the jump to protocol v2 with the
rest). But it really is a pretty dated protocol, lacking any kind of
useful security properties (yes, I know, if we're all verifying signed
tags it's great, but realistically people are fetching the tip of master
over a hijack-able TCP connection and running arbitrary code on the
result). It might be nice if it went away completely so we don't have to
warn people off of it.

The only thing git:// really has going over git-over-http right now is
that it doesn't suffer from the stateless-rpc overhead. But if we unify
that behavior in v2, then any advantage goes away.

I do agree we should have _something_ that is easy to spin up. But it
would be wonderful if git-over-http could become that, and we could just
deprecate git://. I suppose it's possible people build clients without
curl, but I suspect that's an extreme minority these days (most third
party hosters don't seem to offer git:// at all).

-Peff


Re: Git Merge contributor summit notes

2018-03-12 Thread Jeff King
On Fri, Mar 09, 2018 at 04:06:18PM -0800, Alex Vandiver wrote:

> It was great to meet some of you in person!  Some notes from the
> Contributor Summit at Git Merge are below.  Taken in haste, so
> my apologies if there are any mis-statements.

Thanks very much for these notes!

I think in future years we should do a better job of making sure we have
an official note-taker so that this stuff makes it onto the list. I was
very happy when you announced part-way through the summit that you had
already been taking notes. :)

>   "Does anyone think there's a compelling reason for git to exist?"
> - peff

Heh, those words did indeed escape my mouth.

Your notes look accurate overall from a brief skim. I'm still
recovering from the trip, but I may try to follow up and expand on a few
areas where I have thoughts. And I'd encourage others to do the same as
a way of bridging the discussion back to the list.

-Peff


Re: Git Merge contributor summit notes

2018-03-10 Thread Junio C Hamano
Ævar Arnfjörð Bjarmason  writes:

> On Sat, Mar 10 2018, Alex Vandiver jotted:
>
>> It was great to meet some of you in person!  Some notes from the
>> Contributor Summit at Git Merge are below.  Taken in haste, so
>> my apologies if there are any mis-statements.
>
> Thanks a lot for taking these notes. I've read them over and they're all
> accurate per my wetware recollection. Adding some things I remember
> about various discussions below where I think it may help to clarify
> things a bit.
>
>>  - Alex

Thanks, both, for sharing.


Re: Git Merge contributor summit notes

2018-03-10 Thread Ævar Arnfjörð Bjarmason

On Sat, Mar 10 2018, Alex Vandiver jotted:

> It was great to meet some of you in person!  Some notes from the
> Contributor Summit at Git Merge are below.  Taken in haste, so
> my apologies if there are any mis-statements.

Thanks a lot for taking these notes. I've read them over and they're all
accurate per my wetware recollection. Adding some things I remember
about various discussions below where I think it may help to clarify
things a bit.

>  - Alex
>
> 
>
>
>   "Does anyone think there's a compelling reason for git to exist?"
> - peff
>
>
> Partial clone (Jeff Hostetler / Jonathan Tan)
> -
>  - Request that the server not send everything
>  - Motivated by getting Windows into git
>  - Also by not having to fetch large blobs that are in-tree
>  - Allows client to request a clone that excludes some set of objects, with 
> incomplete packfiles
>  - Decoration on objects that include promise for later on-demand backfill
>  - In `master`, have a way of:
>- omitting all blobs
>- omitting large blobs
>- sparse checkout specification stored on server
>  - Hook in read_object to fetch objects in bulk
>
>  - Future work:
>- A way to fetch blobsizes for virtual checkouts
>- Give me new blobs that this tree references relative to now
>- Omit some subset of trees
>- Modify other commits to exclude omitted blobs
>- Protocol v2 may have better verbs for sparse specification, etc
>
> Questions:
>  - Reference server implementation?
>- In git itself
>- VSTS does not support
>  - What happens if a commit becomes unreachable?  Does promise still apply?
>- Probably yes?
>- If the promise is broken, probably crashes
>- Can differentiate between promise that was made, and one that wasn't
>=> Demanding commitment from server to never GC seems like a strong promise
>  - Interactions with external object db
>- promises include bulk fetches, as opposed to external db, which is 
> one-at-a-time
>- dry-run semantics to determine which objects will be needed
>- very important for small objects, like commits/trees (which is not in 
> `master`, only blobs)
>- perhaps for protocol V2
>  - server has to promise more, requires some level of online operation
>- annotate that only some refs are forever?
>- requires enabling the "fetch any SHA" flags
>- rebasing might require now-missing objects?
>  - No, to build on them you must have fetched them
>  - Well, building on someone else's work may mean you don't have all of 
> them
>- server is less aggressive about GC'ing by keeping "weak references" when 
> there are promises?
>- hosting requires that you be able to forcibly remove information
>  - being able to know where a reference came from?
>- as being able to know why an object was needed, for more advanced logic
>  - Does `git grep` attempt to fetch blobs that are deferred?
>- will always attempt to fetch
>- one fetch per object, even!
>- might not be true for sparse checkouts
>- Maybe limit to skipping "binary files"?
>- Currently sparse checkout grep "works" because grep defaults to looking 
> at the index, not the commit
>- Does the above behavior for grepping revisions
>- Don't yet have a flag to exclude grep on non-fetched objects
>- Should `git grep -L` die if it can't fetch the file?
>- Need a config option for "should we die, or try to move on"?
>  - What's the endgame?  Only a few codepaths that are aware, or threaded 
> through everywhere?
>- Fallback to fetch on demand means there's an almost-reasonable fallback
>- Better prediction with bulk fetching
>- Are most commands going to _need_ to be sensitive to it?
>- GVFS has a caching server in the building
>- A few git commands have been disabled (see recent mail from Stolee); 
> those are likely candidates for code that needs to be aware of de-hydrated 
> objects
>  - Is there an API to know what objects are actually local?
>- No external API
>- GVFS has a REST API
>  - Some way to later ask about files?
>- "virtualized filesystem"?
>- hook to say "focus on this world of files"
>- GVFS writes out your index currently
>  - Will this always require turning off reachability checks?
>- Possibly
>  - Shallow clones, instead of partial?
>- Don't download the history, just the objects
>- More of a protocol V2 property
>- Having all of the trees/commits make this reasonable
>  - GVFS vs this?
>- GVFS was a first pass
>- Now trying to mainstream productize that
>- Goal is to remove features from GVFS and replace with this

As I understood it Microsoft deploys this in a mode where they're not
vulnerable to the caveats noted above, i.e. the server serving this up
only has branches that are fast-forwarded (and never deleted).

However, if you were to build 

Git Merge contributor summit notes

2018-03-09 Thread Alex Vandiver
It was great to meet some of you in person!  Some notes from the
Contributor Summit at Git Merge are below.  Taken in haste, so
my apologies if there are any mis-statements.

 - Alex




  "Does anyone think there's a compelling reason for git to exist?"
- peff


Partial clone (Jeff Hostetler / Jonathan Tan)
-
 - Request that the server not send everything
 - Motivated by getting Windows into git
 - Also by not having to fetch large blobs that are in-tree
 - Allows client to request a clone that excludes some set of objects, with 
incomplete packfiles
 - Decoration on objects that include promise for later on-demand backfill
 - In `master`, have a way of:
   - omitting all blobs
   - omitting large blobs
   - sparse checkout specification stored on server
 - Hook in read_object to fetch objects in bulk

 - Future work:
   - A way to fetch blobsizes for virtual checkouts
   - Give me new blobs that this tree references relative to now
   - Omit some subset of trees
   - Modify other commits to exclude omitted blobs
   - Protocol v2 may have better verbs for sparse specification, etc

Questions:
 - Reference server implementation?
   - In git itself
   - VSTS does not support
 - What happens if a commit becomes unreachable?  Does promise still apply?
   - Probably yes?
   - If the promise is broken, probably crashes
   - Can differentiate between promise that was made, and one that wasn't
   => Demanding commitment from server to never GC seems like a strong promise
 - Interactions with external object db
   - promises include bulk fetches, as opposed to external db, which is 
one-at-a-time
   - dry-run semantics to determine which objects will be needed
   - very important for small objects, like commits/trees (which is not in 
`master`, only blobs)
   - perhaps for protocol V2
 - server has to promise more, requires some level of online operation
   - annotate that only some refs are forever?
   - requires enabling the "fetch any SHA" flags
   - rebasing might require now-missing objects?
 - No, to build on them you must have fetched them
 - Well, building on someone else's work may mean you don't have all of them
   - server is less aggressive about GC'ing by keeping "weak references" when 
there are promises?
   - hosting requires that you be able to forcibly remove information
 - being able to know where a reference came from?
   - as being able to know why an object was needed, for more advanced logic
 - Does `git grep` attempt to fetch blobs that are deferred?
   - will always attempt to fetch
   - one fetch per object, even!
   - might not be true for sparse checkouts
   - Maybe limit to skipping "binary files"?
   - Currently sparse checkout grep "works" because grep defaults to looking at 
the index, not the commit
   - Does the above behavior for grepping revisions
   - Don't yet have a flag to exclude grep on non-fetched objects
   - Should `git grep -L` die if it can't fetch the file?
   - Need a config option for "should we die, or try to move on"?
 - What's the endgame?  Only a few codepaths that are aware, or threaded 
through everywhere?
   - Fallback to fetch on demand means there's an almost-reasonable fallback
   - Better prediction with bulk fetching
   - Are most commands going to _need_ to be sensitive to it?
   - GVFS has a caching server in the building
   - A few git commands have been disabled (see recent mail from Stolee); those 
are likely candidates for code that needs to be aware of de-hydrated objects
 - Is there an API to know what objects are actually local?
   - No external API
   - GVFS has a REST API
 - Some way to later ask about files?
   - "virtualized filesystem"?
   - hook to say "focus on this world of files"
   - GVFS writes out your index currently
 - Will this always require turning off reachability checks?
   - Possibly
 - Shallow clones, instead of partial?
   - Don't download the history, just the objects
   - More of a protocol V2 property
   - Having all of the trees/commits make this reasonable
 - GVFS vs this?
   - GVFS was a first pass
   - Now trying to mainstream productize that
   - Goal is to remove features from GVFS and replace with this

Protocol V2 (Brandon)

 - Main problem is that forward compatibility negotiation wasn't possible
 - Found a way to sneak in the V2 negotiation via side-channel in all transports
 - "environment variable" GIT_PROTOCOL which server can detect
 - Ability to transmit and ignore, or not transmit, means forward/backward 
compat
 - HTTP header / environment variable
 - ... so now what?
 - Keep as similar as possible, but more layout changes to remove bad 
characteristics
 - Like fixing flush semantics
 - Remove ref advertisement (250M of refs every fetch from Android!)
 - Capabilities are currently in first packet, 1K limit
 - First response is capabilities from the server,
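
As a rough sketch of the GIT_PROTOCOL side-channel described in these
notes (illustrative only, not git's actual parsing code): a server that
does not receive or does not understand the variable simply keeps
speaking the old protocol, which is what makes the negotiation forward
and backward compatible.

#include <stdlib.h>
#include <string.h>

static int requested_protocol_version(void)
{
	const char *p = getenv("GIT_PROTOCOL");	/* e.g. "version=2" */

	if (!p)
		return 0;		/* old client: speak v0 as before */
	if (!strncmp(p, "version=", 8))
		return atoi(p + 8);
	return 0;			/* unrecognized token: ignore it */
}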