Re: Transition plan for git to move to a new hash function

2017-03-05 Thread brian m. carlson
On Sun, Mar 05, 2017 at 01:45:46PM +, Ian Jackson wrote:
> brian m. carlson writes ("Re: Transition plan for git to move to a new hash 
> function"):
> > Instead, I was referring to areas like the notes code.  It has extensive
> > use of the last byte as a type of lookup table key.  It's very dependent
> > on having exactly one hash, since it will always want to use the last
> > byte.
> 
> You mean note_tree_search ?  (My tree here may be a bit out of date.)
> This doesn't seem difficult to fix.  The nontrivial changes would be
> mostly confined to SUBTREE_SHA1_PREFIXCMP and GET_NIBBLE.
> 
> It's true that like most of git there's a lot of hardcoded `sha1'.

I'm talking about the entire notes.c file.  There are several different
uses of "19" in there, and they compose at least two separate concepts.
My object-id-part9 series tries to split those out into logical
constants.

This code is not going to handle repositories with different-length
objects well, which I believe was your initial proposal.  I originally
thought that mixed-hash repositories would be viable as well, but I no
longer do.

> Are you arguing in favour of "replace git with git2 by simply
> s/20/64/g; s/sha1/blake/g" ?  This seems to me to be a poor idea.
> Takeup of the new `git2' would be very slow because of the pain
> involved.

I'm arguing that the same binary ought to be able to handle both SHA-1
and the new hash.  I'm also arguing that a given object have exactly one
hash and that we not mix hashes in the same object.  A repository will
be composed of one type of object, and if that's the new hash, a lookup
table will be used to translate SHA-1.  We can synthesize the old
objects, should we need them.

That allows people to use the SHA-1 hashes (in my view, with a prefix,
such as "sha1:") in repositories using the new hash.  It also allows
verifying old tags and commits if need be.

What I *would* like to see is an extension to the tag and commit objects
which names the hash that was used to make them.  That makes it easy to
determine which object the signature should be verified over, as it will
verify over only one of them.

> [1] I've heard suggestions here that instead we should expect users to
> "git1 fast-export", which you would presumably feed into "git2
> fast-import".  But what is `git1' here ?  Is it the current git
> codebase frozen in time ?  I don't think it can be.  With this
> conversion strategy, we will need to maintain git1 for decades.  It
> will need portability fixes, security fixes, fixes for new hostile
> compiler optimisations, and so on.  The difficulty of conversion means
> there will be pressure to backport new features from `git2' to `git1'.
> (Also this approach means that all signatures are definitively lost
> during the conversion process.)

I'm proposing we have a git hash-convert (the name doesn't matter that
much) that converts in place.  It rebuilds the objects and builds a
lookup table.  Since the contents of git objects are deterministic, this
makes it possible for each individual user to make the transition in
place.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204


signature.asc
Description: PGP signature


Re: Transition plan for git to move to a new hash function

2017-03-05 Thread Ian Jackson
brian m. carlson writes ("Re: Transition plan for git to move to a new hash 
function"):
> Instead, I was referring to areas like the notes code.  It has extensive
> use of the last byte as a type of lookup table key.  It's very dependent
> on having exactly one hash, since it will always want to use the last
> byte.

You mean note_tree_search ?  (My tree here may be a bit out of date.)
This doesn't seem difficult to fix.  The nontrivial changes would be
mostly confined to SUBTREE_SHA1_PREFIXCMP and GET_NIBBLE.

It's true that like most of git there's a lot of hardcoded `sha1'.


Are you arguing in favour of "replace git with git2 by simply
s/20/64/g; s/sha1/blake/g" ?  This seems to me to be a poor idea.
Takeup of the new `git2' would be very slow because of the pain
involved.

Any sensible method of moving to a new hash that isn't "make a
completely incompatible new version of git" is going to involve
teaching the code we have in git right now to handle new hashes as
well as sha1 hashes.

Even if the plan is to try to convert old data, rather than keep it
and be able to refer to it from new data, something will have to be
able to parse old packfiles, old commits, old tags, old notes,
etc. etc. etc.  Either that's going to be some separate conversion
utility, or it has to be the same code in git that's there already.[1]

The ability to handle both old-format and new-format data can be
achieved in the code by doing away with the hardcoded sha1s, so that
instead the hash is an abstract data type with operations like
"initialise", "compare", "get a nybble", etc.  We've already seen
patches going in this direction.

[1] I've heard suggestions here that instead we should expect users to
"git1 fast-export", which you would presumably feed into "git2
fast-import".  But what is `git1' here ?  Is it the current git
codebase frozen in time ?  I don't think it can be.  With this
conversion strategy, we will need to maintain git1 for decades.  It
will need portability fixes, security fixes, fixes for new hostile
compiler optimisations, and so on.  The difficulty of conversion means
there will be pressure to backport new features from `git2' to `git1'.
(Also this approach means that all signatures are definitively lost
during the conversion process.)

So if we want to provide both `git1' and `git2', it's still better to
compile `git' and `git2' from the same codebase.  But if we do that,
the resulting ifdeffery and/or other hash abstractions are most of the
work to be hash-agile.  It's just the difference between a
compile-time and runtime switch.

I think the incompatibile approach is much more work in the medium and
long term - and it leads to a longer transition period.


Bear in mind that our objective is not to minimise the time until the
new version of git is available.  Our objective is to minimise the
time until (most) people are using it.  An approach which takes longer
for the git community to develop, but which is easier to deploy, can
easily be better.

Or maybe the objective is to minimise overall effort.  In which case
more work on git, for an easier transition for all the users, seems
like a no-brainer.  I think this is arguably true even from the point
of view of effort amongst the community of git contributors.  git
contributors start out as git users - and if git's users are all busy
struggling with a difficult transition, they will have less time to
improve other stuff and will tend less to get involved upstream.  (And
they may be less inclined to feel that the git upstream developers
understand their needs well.)

The better alternative is to adopt a plan that has a clear and
straightforward transition for users, and ask git users to help with
implementation.

I think many git users, including sophisticated users and competent
organisations, are concerned about sha1.  Currently most of those
users will find it difficult to help, because it's not clear to them
what needs to be done.

Thanks,
Ian.


Re: Transition plan for git to move to a new hash function

2017-03-04 Thread brian m. carlson
On Thu, Mar 02, 2017 at 06:13:27PM +, Ian Jackson wrote:
> brian m. carlson writes ("Re: Transition plan for git to move to a new hash 
> function"):
> > On Mon, Feb 27, 2017 at 01:00:01PM +, Ian Jackson wrote:
> > > Objects of one hash may refer to objects named by a different hash
> > > function to their own.  Preference rules arrange that normally, new
> > > hash objects refer to other new hash objects.
> > 
> > The existing codebase isn't really intended with that in mind.
> 
> Yes.  I've seen the attempts to start to replace char* with a hash
> struct.

My comment actually has nothing to do with the way struct object_id is
set up.  That actually can be trivially extended with a byte or two of
type.

Instead, I was referring to areas like the notes code.  It has extensive
use of the last byte as a type of lookup table key.  It's very dependent
on having exactly one hash, since it will always want to use the last
byte.

There are other, more subtle areas of the code that just don't handle
multiple hashes well.  Ideally we would remedy this, but I think
everyone is very eager to move away from SHA-1, and since nobody has
stepped up to volunteer to do that work, we should probably adopt a
solution that doesn't involve doing that.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204


signature.asc
Description: PGP signature


Re: Transition plan for git to move to a new hash function

2017-03-02 Thread Ian Jackson
brian m. carlson writes ("Re: Transition plan for git to move to a new hash 
function"):
> On Mon, Feb 27, 2017 at 01:00:01PM +, Ian Jackson wrote:
> > Objects of one hash may refer to objects named by a different hash
> > function to their own.  Preference rules arrange that normally, new
> > hash objects refer to other new hash objects.
> 
> The existing codebase isn't really intended with that in mind.

Yes.  I've seen the attempts to start to replace char* with a hash
struct.

> I like Peff's suggested approach in which we essentially rewrite history
> under the hood, but have a lookup table which looks up the old hash
> based on the new hash.  That allows us to refer to old objects, but not
> have to share serialized data that mentions both hashes.

I think this means that the when a project converts, every copy of the
history must be rewritten (separately).  Also, this leaves the whole
system lacking in algorithm agililty.  Meaning we may have to do all
of this again some time.

I also think that we need to distinguish old hashes from new hashes in
the command line interface etc.  Otherwise there is a possible
ambiguity.

> > The object name textual syntax is extended.  The new syntax may be
> > used in all textual git objects and protocols (commits, tags, command
> > lines, etc.).
> > 
> > We declare that the object name syntax is henceforth
> >   [A-Z]+[0-9a-z]+ | [0-9a-f]+
> > and that names [A-Z].* are deprecated as ref name components.
> 
> I'd simply say that we have data always be in the new format if it's
> available, and tag the old SHA-1 versions instead.  Otherwise, as Peff
> pointed out, we're going to be stuck typing a bunch of identical stuff
> every time.  Again, this encourages migration.

The hash identifier is only one character.  Object names are not
normally typed very much anyway.

If you say we must decorate old hashes, then all existing data
everywhere in the world which refers to any git objects by object name
will become invalid.  I don't mean just data in git here.  I mean CI
systems, mailing list archives, commit messages (perhaps in other
version control systems), test cases, and so on.

Ian.


Re: Transition plan for git to move to a new hash function

2017-02-28 Thread brian m. carlson
On Mon, Feb 27, 2017 at 01:00:01PM +, Ian Jackson wrote:
> I said I was working on a transition plan.  Here it is.  This is
> obviously a draft for review, and I have no official status in the git
> project.  But I have extensive experience of protocol compatibility
> engineering, and I hope this will be helpful.
> 
> Ian.
> 
> 
> Subject: Transition plan for git to move to a new hash function
> 
> 
> BASIC PRINCIPLES
> 
> 
> We run multiple hashes in parallel.  Each object is named by exactly
> one hash.  We define that objects with identical content, but named by
> different hash functions, are different objects.

I think this is fine.

> Objects of one hash may refer to objects named by a different hash
> function to their own.  Preference rules arrange that normally, new
> hash objects refer to other new hash objects.

The existing codebase isn't really intended with that in mind.

It's not that I am arguing against this because I think it's a bad idea,
I'm arguing against it because as a contributor, I'm doubtful that this
is easily achievable given the state of the codebase.

> The intention is that for most projects, the existing SHA-1 based
> history will be retained and a new history built on top of it.
> (Rewriting is also possible but means a per-project hard switch.)

I like Peff's suggested approach in which we essentially rewrite history
under the hood, but have a lookup table which looks up the old hash
based on the new hash.  That allows us to refer to old objects, but not
have to share serialized data that mentions both hashes.

Obviously only the SHA-1 versions of old tags and commits will be able
to be validated, but that shouldn't be an issue.  We can hook that code
into a conversion routine that can handle on-the-fly object conversion.

We also can implement (optionally disabled) fallback functionality to
look up old SHA-1 hash names based on the new hash.

> We extend the textual object name syntax to explicitly name the hash
> used.  Every program that invokes git or speaks git protocols will
> need to understand the extended object name syntax.
> 
> Packfiles need to be extended to be able to contain objects named by
> new hash functions.  Blob objects with identical contents but named by
> different hash functions would ideally share storage.
> 
> Safety catches preferent accidental incorporation into a project of
> incompatibly-new objects, or additional deprecatedly-old objects.
> This allows for incremental deployment.

We have a compatibility mechanism already in place: if the
repositoryFormatVersion option is set to 1, but an unknown extension
flag is set, Git will bail out.

For network protocols, we have the server offer a hash=foo extension,
and make the client echo it back, and either bail or convert on the fly.
This makes it fast for new clients, and slow for old clients, which
encourages migration.

We could also store old-style packs for easy fetch by clients.

> TEXTUAL SYNTAX
> ==
> 
> The object name textual syntax is extended.  The new syntax may be
> used in all textual git objects and protocols (commits, tags, command
> lines, etc.).
> 
> We declare that the object name syntax is henceforth
>   [A-Z]+[0-9a-z]+ | [0-9a-f]+
> and that names [A-Z].* are deprecated as ref name components.

I'd simply say that we have data always be in the new format if it's
available, and tag the old SHA-1 versions instead.  Otherwise, as Peff
pointed out, we're going to be stuck typing a bunch of identical stuff
every time.  Again, this encourages migration.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204


signature.asc
Description: PGP signature


Re: Transition plan for git to move to a new hash function

2017-02-27 Thread Tony Finch
Ian Jackson  wrote:

A few questions and one or two suggestions...

> TEXTUAL SYNTAX
> ==
>
> We also reserve the following syntax for private experiments:
>   E[A-Z]+[0-9a-z]+
> We declare that public releases of git will never accept such
> object names.

Instead of this I would suggest that experimental hash names should have
multi-character prefixes and an easy registration process - rationale:
https://tools.ietf.org/html/rfc6648

> A single object may refer to other objects the hash function which
> names the object itself, or by other hash functions, in any
> combination.

If I understand it correctly, this freedom is greatly restricted later on
in this document, depending on the object type in question. If so, it's
probably worth saying so at this point.

> Commits
> ---
>
> The hash function naming an origin commit is controlled by the hint
> left in .git for the ref named by HEAD (or for HEAD itself, if HEAD is
> detached) by git checkout --orphan or git init.

This confused me for a while - I think you mean "root commit"?

> TRANSITION PLAN
> ===
>
> Y4: BLAKE by default for new projects.
>
> When creating a new working tree, it starts using BLAKE.
>
> Servers which have been updated will accept BLAKE.

Why not allow newhash pushes before making it the default for new
projects? Wouldn't it make sense to get the server side ready some time
before projects start actively using new hashes?

Or is the idea that newhash upgrade is driven from the server?

What's the upgrade process for send-email patch exchange?

Tony.
-- 
f.anthony.n.finch    http://dotat.at/  -  I xn--zr8h punycode
Fair Isle: Southwest 6 to gale 8, backing east 5 or 6, backing north 6 to gale
8 later. Rough or very rough. Rain or showers. Moderate or good.