Re: Transition plan for git to move to a new hash function
On Sun, Mar 05, 2017 at 01:45:46PM +, Ian Jackson wrote:
> brian m. carlson writes ("Re: Transition plan for git to move to a new hash
> function"):
> > Instead, I was referring to areas like the notes code. It has extensive
> > use of the last byte as a type of lookup table key. It's very dependent
> > on having exactly one hash, since it will always want to use the last
> > byte.
>
> You mean note_tree_search ? (My tree here may be a bit out of date.)
> This doesn't seem difficult to fix. The nontrivial changes would be
> mostly confined to SUBTREE_SHA1_PREFIXCMP and GET_NIBBLE.
>
> It's true that like most of git there's a lot of hardcoded `sha1'.

I'm talking about the entire notes.c file. There are several different uses of "19" in there, and they compose at least two separate concepts. My object-id-part9 series tries to split those out into logical constants.

This code is not going to handle repositories with different-length objects well, which I believe was your initial proposal. I originally thought that mixed-hash repositories would be viable as well, but I no longer do.

> Are you arguing in favour of "replace git with git2 by simply
> s/20/64/g; s/sha1/blake/g" ? This seems to me to be a poor idea.
> Takeup of the new `git2' would be very slow because of the pain
> involved.

I'm arguing that the same binary ought to be able to handle both SHA-1 and the new hash. I'm also arguing that a given object have exactly one hash and that we not mix hashes in the same object. A repository will be composed of one type of object, and if that's the new hash, a lookup table will be used to translate SHA-1. We can synthesize the old objects, should we need them.

That allows people to use the SHA-1 hashes (in my view, with a prefix, such as "sha1:") in repositories using the new hash. It also allows verifying old tags and commits if need be.

What I *would* like to see is an extension to the tag and commit objects which names the hash that was used to make them. That makes it easy to determine which object the signature should be verified over, as it will verify over only one of them.

> [1] I've heard suggestions here that instead we should expect users to
> "git1 fast-export", which you would presumably feed into "git2
> fast-import". But what is `git1' here ? Is it the current git
> codebase frozen in time ? I don't think it can be. With this
> conversion strategy, we will need to maintain git1 for decades. It
> will need portability fixes, security fixes, fixes for new hostile
> compiler optimisations, and so on. The difficulty of conversion means
> there will be pressure to backport new features from `git2' to `git1'.
> (Also this approach means that all signatures are definitively lost
> during the conversion process.)

I'm proposing we have a git hash-convert (the name doesn't matter that much) that converts in place. It rebuilds the objects and builds a lookup table. Since the contents of git objects are deterministic, this makes it possible for each individual user to make the transition in place.

--
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204
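As an illustration only, the lookup-table idea in this message could be sketched in Python as follows. This is not git code: the class and function names are hypothetical, BLAKE2b truncated to 32 bytes is merely a stand-in for whatever new hash is chosen, and only blobs are handled (trees, commits and tags embed the names of other objects, so a real converter would have to rewrite their contents before hashing).

```python
import hashlib


def _framed(obj_type: str, content: bytes) -> bytes:
    # Git hashes "<type> <size>\0" followed by the object contents.
    return f"{obj_type} {len(content)}\0".encode() + content


def sha1_name(obj_type: str, content: bytes) -> str:
    return hashlib.sha1(_framed(obj_type, content)).hexdigest()


def new_name(obj_type: str, content: bytes) -> str:
    # Stand-in for the new hash (assumption: BLAKE2b, 32-byte digest).
    return hashlib.blake2b(_framed(obj_type, content), digest_size=32).hexdigest()


class TranslationTable:
    """Hypothetical table mapping old SHA-1 names to new-hash names."""

    def __init__(self) -> None:
        self._sha1_to_new: dict[str, str] = {}

    def add(self, obj_type: str, content: bytes) -> str:
        # Object contents are deterministic, so both names can be
        # recomputed locally by every user who converts.
        new = new_name(obj_type, content)
        self._sha1_to_new[sha1_name(obj_type, content)] = new
        return new

    def resolve(self, name: str) -> str:
        # "sha1:<hex>" is translated; anything else is taken as a new name.
        if name.startswith("sha1:"):
            return self._sha1_to_new[name[len("sha1:"):]]
        return name


table = TranslationTable()
new = table.add("blob", b"hello\n")
print(table.resolve("sha1:ce013625030ba8dba906f756967f9e9ca394464a") == new)
```

The point of the sketch is that the repository itself only ever stores new-hash names; the SHA-1 names survive purely as keys in the table.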
Re: Transition plan for git to move to a new hash function
brian m. carlson writes ("Re: Transition plan for git to move to a new hash function"):
> Instead, I was referring to areas like the notes code. It has extensive
> use of the last byte as a type of lookup table key. It's very dependent
> on having exactly one hash, since it will always want to use the last
> byte.

You mean note_tree_search ? (My tree here may be a bit out of date.) This doesn't seem difficult to fix. The nontrivial changes would be mostly confined to SUBTREE_SHA1_PREFIXCMP and GET_NIBBLE.

It's true that like most of git there's a lot of hardcoded `sha1'.

Are you arguing in favour of "replace git with git2 by simply s/20/64/g; s/sha1/blake/g" ? This seems to me to be a poor idea. Takeup of the new `git2' would be very slow because of the pain involved.

Any sensible method of moving to a new hash that isn't "make a completely incompatible new version of git" is going to involve teaching the code we have in git right now to handle new hashes as well as sha1 hashes.

Even if the plan is to try to convert old data, rather than keep it and be able to refer to it from new data, something will have to be able to parse old packfiles, old commits, old tags, old notes, etc. etc. etc. Either that's going to be some separate conversion utility, or it has to be the same code in git that's there already.[1]

The ability to handle both old-format and new-format data can be achieved in the code by doing away with the hardcoded sha1s, so that instead the hash is an abstract data type with operations like "initialise", "compare", "get a nybble", etc. We've already seen patches going in this direction.

[1] I've heard suggestions here that instead we should expect users to "git1 fast-export", which you would presumably feed into "git2 fast-import". But what is `git1' here ? Is it the current git codebase frozen in time ? I don't think it can be. With this conversion strategy, we will need to maintain git1 for decades. It will need portability fixes, security fixes, fixes for new hostile compiler optimisations, and so on. The difficulty of conversion means there will be pressure to backport new features from `git2' to `git1'. (Also this approach means that all signatures are definitively lost during the conversion process.)

So if we want to provide both `git1' and `git2', it's still better to compile `git' and `git2' from the same codebase. But if we do that, the resulting ifdeffery and/or other hash abstractions are most of the work to be hash-agile. It's just the difference between a compile-time and runtime switch.

I think the incompatible approach is much more work in the medium and long term - and it leads to a longer transition period.

Bear in mind that our objective is not to minimise the time until the new version of git is available. Our objective is to minimise the time until (most) people are using it. An approach which takes longer for the git community to develop, but which is easier to deploy, can easily be better.

Or maybe the objective is to minimise overall effort. In which case more work on git, for an easier transition for all the users, seems like a no-brainer. I think this is arguably true even from the point of view of effort amongst the community of git contributors. git contributors start out as git users - and if git's users are all busy struggling with a difficult transition, they will have less time to improve other stuff and will tend less to get involved upstream. (And they may be less inclined to feel that the git upstream developers understand their needs well.)

The better alternative is to adopt a plan that has a clear and straightforward transition for users, and ask git users to help with implementation. I think many git users, including sophisticated users and competent organisations, are concerned about sha1. Currently most of those users will find it difficult to help, because it's not clear to them what needs to be done.

Thanks,
Ian.
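To make the "abstract data type" suggestion concrete, here is a minimal Python sketch of a hash value that carries its own algorithm and exposes the operations named above ("initialise", "compare", "get a nybble"). The class name, field names and algorithm labels are assumptions made for illustration; git's real struct object_id is C and looks different.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class ObjectId:
    """Hypothetical hash value that knows which algorithm produced it."""

    algo: str   # e.g. "sha1" or the label of a new hash (assumed labels)
    raw: bytes  # digest bytes; length depends on the algorithm

    @classmethod
    def from_hex(cls, algo: str, hex_name: str) -> "ObjectId":
        # "initialise": build the value from its textual form.
        return cls(algo, bytes.fromhex(hex_name))

    def nibble(self, index: int) -> int:
        # "get a nybble": the index-th hex digit, with no assumption
        # that the digest is 20 bytes long.
        byte = self.raw[index // 2]
        return byte >> 4 if index % 2 == 0 else byte & 0x0F

    def last_byte(self) -> int:
        # Derived from the actual length rather than a hardcoded offset.
        return self.raw[-1]


a = ObjectId.from_hex("sha1", "ce013625030ba8dba906f756967f9e9ca394464a")
b = ObjectId.from_hex("sha1", "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")
print(a < b, a.nibble(0), a.last_byte())
```

"compare" comes from the dataclass ordering: values are ordered by algorithm first and digest bytes second, so two names made by different hash functions never compare equal. Callers that go through nibble() and last_byte() do not need to know the digest length.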
Re: Transition plan for git to move to a new hash function
On Thu, Mar 02, 2017 at 06:13:27PM +, Ian Jackson wrote:
> brian m. carlson writes ("Re: Transition plan for git to move to a new hash
> function"):
> > On Mon, Feb 27, 2017 at 01:00:01PM +, Ian Jackson wrote:
> > > Objects of one hash may refer to objects named by a different hash
> > > function to their own. Preference rules arrange that normally, new
> > > hash objects refer to other new hash objects.
> >
> > The existing codebase isn't really intended with that in mind.
>
> Yes. I've seen the attempts to start to replace char* with a hash
> struct.

My comment actually has nothing to do with the way struct object_id is set up. That actually can be trivially extended with a byte or two of type.

Instead, I was referring to areas like the notes code. It has extensive use of the last byte as a type of lookup table key. It's very dependent on having exactly one hash, since it will always want to use the last byte.

There are other, more subtle areas of the code that just don't handle multiple hashes well. Ideally we would remedy this, but I think everyone is very eager to move away from SHA-1, and since nobody has stepped up to volunteer to do that work, we should probably adopt a solution that doesn't involve doing that.

--
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204
Re: Transition plan for git to move to a new hash function
brian m. carlson writes ("Re: Transition plan for git to move to a new hash function"):
> On Mon, Feb 27, 2017 at 01:00:01PM +, Ian Jackson wrote:
> > Objects of one hash may refer to objects named by a different hash
> > function to their own. Preference rules arrange that normally, new
> > hash objects refer to other new hash objects.
>
> The existing codebase isn't really intended with that in mind.

Yes. I've seen the attempts to start to replace char* with a hash struct.

> I like Peff's suggested approach in which we essentially rewrite history
> under the hood, but have a lookup table which looks up the old hash
> based on the new hash. That allows us to refer to old objects, but not
> have to share serialized data that mentions both hashes.

I think this means that when a project converts, every copy of the history must be rewritten (separately). Also, this leaves the whole system lacking in algorithm agility. Meaning we may have to do all of this again some time.

I also think that we need to distinguish old hashes from new hashes in the command line interface etc. Otherwise there is a possible ambiguity.

> > The object name textual syntax is extended. The new syntax may be
> > used in all textual git objects and protocols (commits, tags, command
> > lines, etc.).
> >
> > We declare that the object name syntax is henceforth
> > [A-Z]+[0-9a-z]+ | [0-9a-f]+
> > and that names [A-Z].* are deprecated as ref name components.
>
> I'd simply say that we have data always be in the new format if it's
> available, and tag the old SHA-1 versions instead. Otherwise, as Peff
> pointed out, we're going to be stuck typing a bunch of identical stuff
> every time. Again, this encourages migration.

The hash identifier is only one character. Object names are not normally typed very much anyway.

If you say we must decorate old hashes, then all existing data everywhere in the world which refers to any git objects by object name will become invalid. I don't mean just data in git here. I mean CI systems, mailing list archives, commit messages (perhaps in other version control systems), test cases, and so on.

Ian.
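For readers who want to see the proposed syntax in action, the following Python sketch parses names of the form `[A-Z]+[0-9a-z]+ | [0-9a-f]+`. It assumes, as argued in this message, that an undecorated lowercase-hex name continues to mean SHA-1; the prefix letter "H" in the usage line is purely hypothetical, and the function name is not taken from any git source.

```python
import re

# Either an uppercase hash identifier followed by [0-9a-z]+ digits,
# or a bare lowercase-hex name (taken here to be SHA-1).
_OBJECT_NAME = re.compile(
    r"(?P<prefix>[A-Z]+)(?P<digits>[0-9a-z]+)|(?P<sha1>[0-9a-f]+)"
)


def parse_object_name(name: str) -> tuple[str, str]:
    """Return (hash identifier, digits) or raise ValueError."""
    match = _OBJECT_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid object name: {name!r}")
    if match.group("sha1") is not None:
        # Undecorated names stay valid, so existing references in
        # archives, CI systems and commit messages keep working.
        return ("sha1", match.group("sha1"))
    return (match.group("prefix"), match.group("digits"))


print(parse_object_name("ce013625030ba8dba906f756967f9e9ca394464a"))
print(parse_object_name("H0123abcxyz"))  # "H" is a made-up identifier
```

Because the uppercase prefix can never occur in a lowercase-hex SHA-1 name, the two alternatives cannot be confused, which is what removes the ambiguity mentioned above.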
Re: Transition plan for git to move to a new hash function
On Mon, Feb 27, 2017 at 01:00:01PM +, Ian Jackson wrote:
> I said I was working on a transition plan. Here it is. This is
> obviously a draft for review, and I have no official status in the git
> project. But I have extensive experience of protocol compatibility
> engineering, and I hope this will be helpful.
>
> Ian.
>
> Subject: Transition plan for git to move to a new hash function
>
> BASIC PRINCIPLES
>
> We run multiple hashes in parallel. Each object is named by exactly
> one hash. We define that objects with identical content, but named by
> different hash functions, are different objects.

I think this is fine.

> Objects of one hash may refer to objects named by a different hash
> function to their own. Preference rules arrange that normally, new
> hash objects refer to other new hash objects.

The existing codebase isn't really intended with that in mind. It's not that I am arguing against this because I think it's a bad idea, I'm arguing against it because as a contributor, I'm doubtful that this is easily achievable given the state of the codebase.

> The intention is that for most projects, the existing SHA-1 based
> history will be retained and a new history built on top of it.
> (Rewriting is also possible but means a per-project hard switch.)

I like Peff's suggested approach in which we essentially rewrite history under the hood, but have a lookup table which looks up the old hash based on the new hash. That allows us to refer to old objects, but not have to share serialized data that mentions both hashes.

Obviously only the SHA-1 versions of old tags and commits will be able to be validated, but that shouldn't be an issue. We can hook that code into a conversion routine that can handle on-the-fly object conversion. We also can implement (optionally disabled) fallback functionality to look up old SHA-1 hash names based on the new hash.

> We extend the textual object name syntax to explicitly name the hash
> used. Every program that invokes git or speaks git protocols will
> need to understand the extended object name syntax.
>
> Packfiles need to be extended to be able to contain objects named by
> new hash functions. Blob objects with identical contents but named by
> different hash functions would ideally share storage.
>
> Safety catches prevent accidental incorporation into a project of
> incompatibly-new objects, or additional deprecatedly-old objects.
> This allows for incremental deployment.

We have a compatibility mechanism already in place: if the repositoryFormatVersion option is set to 1, but an unknown extension flag is set, Git will bail out.

For network protocols, we have the server offer a hash=foo extension, and make the client echo it back, and either bail or convert on the fly. This makes it fast for new clients, and slow for old clients, which encourages migration. We could also store old-style packs for easy fetch by clients.

> TEXTUAL SYNTAX
> ==
>
> The object name textual syntax is extended. The new syntax may be
> used in all textual git objects and protocols (commits, tags, command
> lines, etc.).
>
> We declare that the object name syntax is henceforth
> [A-Z]+[0-9a-z]+ | [0-9a-f]+
> and that names [A-Z].* are deprecated as ref name components.

I'd simply say that we have data always be in the new format if it's available, and tag the old SHA-1 versions instead. Otherwise, as Peff pointed out, we're going to be stuck typing a bunch of identical stuff every time. Again, this encourages migration.

--
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204
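The two safety mechanisms described in this message can be modelled roughly as below. This is a Python sketch under stated assumptions, not git's implementation: the extension name "examplehash", the exception classes and both function names are hypothetical, and treating version 0 as "extensions are not consulted" is an assumption of the sketch.

```python
class UnsupportedRepository(Exception):
    pass


class NegotiationFailed(Exception):
    pass


# Hypothetical: extensions understood by this imaginary client.
UNDERSTOOD_EXTENSIONS = frozenset({"examplehash"})


def check_repository_format(version, extensions, understood=UNDERSTOOD_EXTENSIONS):
    """Bail out on a version-1 repository that sets an unknown extension."""
    if version > 1:
        raise UnsupportedRepository(f"unknown repository format version {version}")
    if version == 1:
        unknown = sorted(set(extensions) - set(understood))
        if unknown:
            raise UnsupportedRepository("unknown extensions: " + ", ".join(unknown))
    # Version 0: extensions are not consulted in this sketch.


def serve_mode(server_hash, client_capabilities, can_convert=True):
    """Decide how a server offering hash=<server_hash> talks to a client."""
    if f"hash={server_hash}" in client_capabilities:
        return "native"   # client echoed the extension: fast path
    if can_convert:
        return "convert"  # old client: convert on the fly (slow path)
    raise NegotiationFailed("client did not echo hash=" + server_hash)


check_repository_format(1, {"examplehash"})
print(serve_mode("foo", ["hash=foo"]), serve_mode("foo", []))
```

The asymmetry is deliberate in the proposal: a client that echoes the extension gets the cheap path, while an old client still works but pays for conversion.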
Re: Transition plan for git to move to a new hash function
Ian Jackson wrote:

A few questions and one or two suggestions...

> TEXTUAL SYNTAX
> ==
>
> We also reserve the following syntax for private experiments:
> E[A-Z]+[0-9a-z]+
> We declare that public releases of git will never accept such
> object names.

Instead of this I would suggest that experimental hash names should have multi-character prefixes and an easy registration process - rationale: https://tools.ietf.org/html/rfc6648

> A single object may refer to other objects by the hash function which
> names the object itself, or by other hash functions, in any
> combination.

If I understand it correctly, this freedom is greatly restricted later on in this document, depending on the object type in question. If so, it's probably worth saying so at this point.

> Commits
> ---
>
> The hash function naming an origin commit is controlled by the hint
> left in .git for the ref named by HEAD (or for HEAD itself, if HEAD is
> detached) by git checkout --orphan or git init.

This confused me for a while - I think you mean "root commit"?

> TRANSITION PLAN
> ===
>
> Y4: BLAKE by default for new projects.
>
> When creating a new working tree, it starts using BLAKE.
>
> Servers which have been updated will accept BLAKE.

Why not allow newhash pushes before making it the default for new projects? Wouldn't it make sense to get the server side ready some time before projects start actively using new hashes? Or is the idea that newhash upgrade is driven from the server?

What's the upgrade process for send-email patch exchange?

Tony.
--
f.anthony.n.finch http://dotat.at/ - I xn--zr8h punycode
Fair Isle: Southwest 6 to gale 8, backing east 5 or 6, backing north 6 to gale 8 later. Rough or very rough. Rain or showers. Moderate or good.