RE: [Question] Signature calculation ignoring parts of binary files
On September 13, 2018 1:52 PM, Junio C Hamano wrote:
> Junio C Hamano writes:
>
> > "Randall S. Becker" writes:
> >
> >> The scenario is slightly different.
> >> 1. Person A gives me a new binary file-1 with fingerprint A1. This
> >>    goes into git unchanged.
> >> 2. Person B gives me binary file-2 with fingerprint B2. This does not
> >>    go into git yet.
> >> 3. We attempt a git diff between the committed file-1 and uncommitted
> >>    file-2 using a textconv implementation that strips what we don't
> >>    need to compare.
> >> 4. If file-1 and file-2 have no difference when textconv is used,
> >>    file-2 is not added and not committed. It is discarded with
> >>    impunity, never to be seen again, although we might whine a lot at
> >>    the user for attempting to put file-2 in - but that's not git's
> >>    issue.
> >
> > You are forgetting that Git is a distributed version control system,
> > aren't you? Person A and B can introduce their "moral equivalent but
> > bytewise different" copies to their repository under the same object
> > name, and you can pull from them--what happens?
> >
> > It is fundamental that one object name given to Git identifies one
> > specific byte sequence contained in an object uniquely. Once you
> > break that, you no longer have Git.
>
> Having said all that, if you want to keep the original with frills but
> somehow link these bytewise different things that reduce to the same
> essence (e.g. when passed thru a filter like textconv), I suspect a
> better approach might be to store both the "original" and the result of
> passing the "original" through the filter in the object database. In
> the above example, you'll get two "original" objects from person A and
> person B, plus one "canonical" object that is bytewise different from
> either of these two originals, but is what they reduce to when you use
> the filter on them.
>
> Then you record the fact that to derive the "essence" object, you can
> reduce either person A's or person B's "original" through the filter,
> perhaps by using "git notes" attached to the "essence" object,
> recording the object names of these originals (the reason for using
> notes in this direction is that you can mechanically determine which
> "essence" object any given "original" object reduces to---it is just a
> matter of passing it through the filter. But there can be more than
> one "original" that reduces to the same "essence").

I like that idea. It turns the reduced object into a contract.

Thanks.
Re: [Question] Signature calculation ignoring parts of binary files
Junio C Hamano writes:

> "Randall S. Becker" writes:
>
>> The scenario is slightly different.
>> 1. Person A gives me a new binary file-1 with fingerprint A1. This
>>    goes into git unchanged.
>> 2. Person B gives me binary file-2 with fingerprint B2. This does not
>>    go into git yet.
>> 3. We attempt a git diff between the committed file-1 and uncommitted
>>    file-2 using a textconv implementation that strips what we don't
>>    need to compare.
>> 4. If file-1 and file-2 have no difference when textconv is used,
>>    file-2 is not added and not committed. It is discarded with
>>    impunity, never to be seen again, although we might whine a lot at
>>    the user for attempting to put file-2 in - but that's not git's
>>    issue.
>
> You are forgetting that Git is a distributed version control system,
> aren't you? Person A and B can introduce their "moral equivalent
> but bytewise different" copies to their repository under the same
> object name, and you can pull from them--what happens?
>
> It is fundamental that one object name given to Git identifies one
> specific byte sequence contained in an object uniquely. Once you
> break that, you no longer have Git.

Having said all that, if you want to keep the original with frills but
somehow link these bytewise different things that reduce to the same
essence (e.g. when passed thru a filter like textconv), I suspect a
better approach might be to store both the "original" and the result of
passing the "original" through the filter in the object database. In
the above example, you'll get two "original" objects from person A and
person B, plus one "canonical" object that is bytewise different from
either of these two originals, but is what they reduce to when you use
the filter on them.

Then you record the fact that to derive the "essence" object, you can
reduce either person A's or person B's "original" through the filter,
perhaps by using "git notes" attached to the "essence" object,
recording the object names of these originals (the reason for using
notes in this direction is that you can mechanically determine which
"essence" object any given "original" object reduces to---it is just a
matter of passing it through the filter. But there can be more than
one "original" that reduces to the same "essence").
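The bookkeeping described above can be modelled in a few lines. This is only an in-memory sketch, not Git itself: the `objects` dict stands in for the object database, the `notes` dict stands in for notes attached to the "essence" object, and `reduce_to_essence` is a hypothetical filter for a toy file format in which the mutable metadata is a line starting with `Creator:`. Object names are computed the way Git names blobs in a SHA-1 repository.

```python
import hashlib
from collections import defaultdict


def blob_name(data: bytes) -> str:
    """Name Git gives a blob with this content (SHA-1 object format)."""
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()


def reduce_to_essence(data: bytes) -> bytes:
    """Hypothetical filter: drop every line that starts with b'Creator:'."""
    kept = [ln for ln in data.split(b"\n") if not ln.startswith(b"Creator:")]
    return b"\n".join(kept)


objects = {}               # object name -> bytes (stand-in for the object database)
notes = defaultdict(list)  # essence name -> names of originals (stand-in for notes)


def store(original: bytes) -> str:
    """Store the original and its essence; return the essence's name."""
    original_name = blob_name(original)
    essence = reduce_to_essence(original)
    essence_name = blob_name(essence)
    objects[original_name] = original  # kept byte-for-byte
    objects[essence_name] = essence
    if original_name not in notes[essence_name]:
        notes[essence_name].append(original_name)
    return essence_name


from_a = b"Creator: A\nimage-bytes"
from_b = b"Creator: B\nimage-bytes"
essence_a = store(from_a)
essence_b = store(from_b)

print(essence_a == essence_b)  # do both originals reduce to one essence?
print(len(objects))            # two originals plus one essence
print(notes[essence_a])        # originals recorded against the essence
```

Going from an original to its essence needs no lookup (run the filter), while going from the essence back to its originals is one-to-many, which is why the mapping is recorded on the essence side.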
RE: [Question] Signature calculation ignoring parts of binary files
On September 13, 2018 11:03 AM, Junio C Hamano wrote:
> "Randall S. Becker" writes:
>
> > The scenario is slightly different.
> > 1. Person A gives me a new binary file-1 with fingerprint A1. This
> >    goes into git unchanged.
> > 2. Person B gives me binary file-2 with fingerprint B2. This does not
> >    go into git yet.
> > 3. We attempt a git diff between the committed file-1 and uncommitted
> >    file-2 using a textconv implementation that strips what we don't
> >    need to compare.
> > 4. If file-1 and file-2 have no difference when textconv is used,
> >    file-2 is not added and not committed. It is discarded with
> >    impunity, never to be seen again, although we might whine a lot at
> >    the user for attempting to put file-2 in - but that's not git's
> >    issue.
>
> You are forgetting that Git is a distributed version control system,
> aren't you? Person A and B can introduce their "moral equivalent but
> bytewise different" copies to their repository under the same object
> name, and you can pull from them--what happens?
>
> It is fundamental that one object name given to Git identifies one
> specific byte sequence contained in an object uniquely. Once you
> break that, you no longer have Git.

At that point I have a morally questionable situation, agreed. However,
both are permitted to exist in the underlying tree without conflict in
git - which I do consider a legitimately possible situation that will
not break the application at all - although there is a semantic conflict
in the application (not in git) that requires human decision to resolve.
The fact that both objects can exist in git with different fingerprints
is a good thing because it provides immutable evidence and ownership of
someone bypassing the intent of the application.

So, instead of using textconv, I shall implement this rule in the
application rather than trying to configure git to do it. If two
conflicting objects enter the commit history, the application will have
the responsibility to resolve the semantic/legal conflict.

Thanks,
Randall
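That both copies can coexist under different fingerprints follows from how Git names a blob: the name is a hash over a short header plus the exact bytes of the content. The sketch below assumes a repository using the SHA-1 object format (a SHA-256 repository would hash differently), and the two sample byte strings are made-up stand-ins for file-1 and file-2.

```python
import hashlib


def blob_name(data: bytes) -> str:
    """Object name of a blob: SHA-1 over 'blob <size>\\0' followed by the content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


# Made-up contents: same "image", different Creator metadata.
file_1 = b"Creator: A\nimage-bytes"
file_2 = b"Creator: B\nimage-bytes"

print(blob_name(file_1))
print(blob_name(file_2))
print(blob_name(file_1) != blob_name(file_2))  # different bytes, different names
```

Because every byte feeds the hash, there is no way to make the two files share an object name without also making their stored content identical.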
Re: [Question] Signature calculation ignoring parts of binary files
"Randall S. Becker" writes: > The scenario is slightly different. > 1. Person A gives me a new binary file-1 with fingerprint A1. This goes into > git unchanged. > 2. Person B gives me binary file-2 with fingerprint B2. This does not go > into git yet. > 3. We attempt a git diff between the committed file-1 and uncommitted file-2 > using a textconv implementation that strips what we don't need to compare. > 4. If file-1 and file-2 have no difference when textconv is used, file-2 is > not added and not committed. It is discarded with impunity, never to be seen > again, although we might whine a lot at the user for attempting to put > file-2 in - but that's not git's issue. You are forgetting that Git is a distributed version control system, aren't you? Person A and B can introduce their "moral equivalent but bytewise different" copies to their repository under the same object name, and you can pull from them--what happens? It is fundamental that one object name given to Git identifies one specific byte sequence contained in an object uniquely. Once you broke that, you no longer have Git.
RE: [Question] Signature calculation ignoring parts of binary files
On September 12, 2018 7:00 PM, Junio C Hamano wrote:
> "Randall S. Becker" writes:
>
> >> author is important to our process. My objective is to keep the
> >> original file 100% exact as supplied and then ignore any changes to
> >> the metadata that I don't care about (like Creator) if the remainder
> >> of the file is the same.
>
> That will *not* work. If person A gave you a version of original, which
> hashes to X after you strip the cruft you do not care about, you would
> register that original with person A's fingerprint on it under the name
> of X. What happens when person B gives you another version, which is not
> byte-for-byte identical to the one you got earlier from person A, but
> does hash to the same X after you strip the cruft? If you are going to
> store it in Git, and if by SHA-1 you are calling what we perceive as
> "object name" in Git land, you must store that one with person B's
> fingerprint on it also under the name of X. Now which version will you
> get from Git when you ask it to give you the object that hashes to X?

The scenario is slightly different.
1. Person A gives me a new binary file-1 with fingerprint A1. This goes
   into git unchanged.
2. Person B gives me binary file-2 with fingerprint B2. This does not go
   into git yet.
3. We attempt a git diff between the committed file-1 and uncommitted
   file-2 using a textconv implementation that strips what we don't need
   to compare.
4. If file-1 and file-2 have no difference when textconv is used, file-2
   is not added and not committed. It is discarded with impunity, never
   to be seen again, although we might whine a lot at the user for
   attempting to put file-2 in - but that's not git's issue.
5. If file-1 and file-2 have differences when textconv is used, file-2
   is committed with fingerprint B2.
6. Even if an error is made by the user and they commit file-2 with B2
   regardless of textconv, there will be a human who complains about it,
   but git has two unambiguous fingerprints that happen to have no diffs
   after textconv is applied.

My original hope was that textconv could be used to influence the
fingerprint, but I do not think that is the case, so I went with an
alternative. In the application, I am not allowed to strip any cruft off
file-1 when it is stored - it must be byte-for-byte the original file.
This application is marginally related to a DRM-like situation where we
only care about the original image provided by a user, but any copies
that are provided by another user with modified metadata will be
disallowed from the repository.

Does that make more sense?

Cheers,
Randall
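Steps 3 to 5 above amount to a small decision rule that the commit script could apply before calling git add. A rough sketch follows, with everything in it assumed for illustration: `textconv` is a hypothetical converter for a toy format whose only ignorable metadata is a line starting with `Creator:`, and `decide` is a made-up helper name. The stored file is never modified; only the comparison uses the converted form.

```python
def textconv(data: bytes) -> bytes:
    """Hypothetical converter: keep everything except 'Creator:' lines."""
    kept = [ln for ln in data.split(b"\n") if not ln.startswith(b"Creator:")]
    return b"\n".join(kept)


def decide(committed: bytes, candidate: bytes) -> str:
    """Return 'reject' (step 4) or 'commit' (step 5) for the candidate file."""
    if textconv(committed) == textconv(candidate):
        # No difference once ignorable metadata is stripped: do not add it.
        return "reject"
    # Relevant content differs: the candidate goes in byte-for-byte.
    return "commit"


file_1 = b"Creator: A\nimage-bytes"        # already committed, unchanged
file_2 = b"Creator: B\nimage-bytes"        # same image, new Creator
file_3 = b"Creator: B\nother-image-bytes"  # genuinely different image

print(decide(file_1, file_2))
print(decide(file_1, file_3))
```

Keeping the rule in the script rather than in Git matches step 6: if someone bypasses the check, both files still land in the history under their own distinct fingerprints.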
Re: [Question] Signature calculation ignoring parts of binary files
"Randall S. Becker" writes: >> author is important to our process. My objective is to keep the original file >> 100% exact as supplied and then ignore any changes to the metadata that I >> don't care about (like Creator) if the remainder of the file is the same. That will *not* work. If person A gave you a version of original, which hashes to X after you strip the cruft you do not care about, you would register that original with person A's fingerprint on under the name of X. What happens when person B gives you another version, which is not byte-for-byte identical to the one you got earlier from person A, but does hash to the same X after you strip the cruft? If you are going to store it in Git, and if by SHA-1 you are calling what we perceive as "object name" in Git land, you must store that one with person B's fingerprint on it also under the name of X. Now which version will you get from Git when you ask it to give you the object that hashes to X?
RE: [Question] Signature calculation ignoring parts of binary files
On September 12, 2018 4:54 PM, I wrote:
> On September 12, 2018 4:48 PM, Johannes Sixt wrote:
> > On 12.09.18 at 21:16, Randall S. Becker wrote:
> > > I feel really bad asking this, and I should know the answer, and yet.
> > >
> > > I have a binary file that needs to go into a repo intact (unchanged).
> > > I also have a program that interprets the contents, like a textconv,
> > > that can output the relevant portions of the file in whatever format
> > > I like - used for diff typically, dumps in 1K chunks by file section.
> > > What I'm looking for is to have the SHA1 signature calculated with
> > > just the relevant portions of the file so that two actually
> > > different files will be considered the same by git during a commit
> > > or status. In real terms, I'm trying to ignore the Creator metadata
> > > of a JPG because it is mutable and irrelevant to my repo contents.
> > >
> > > I'm sorry to ask, but I thought this was in .gitattributes but I
> > > can't confirm the SHA1 behaviour.
> >
> > You are looking for a clean filter. See the 'filter' attribute in
> > gitattributes(5).
> > Your clean filter program or script should strip the unwanted metadata
> > or set it to a constant known-good value.
> >
> > (You shouldn't need a smudge filter.)
> >
> > -- Hannes
>
> Thanks Hannes. I thought about the clean filter, but I don't actually
> want to modify the file when going into git, just for SHA calculation.
> I need to be able to keep some origin metadata that might change with
> subsequent copies, so just cleaning the origin is not going to work -
> actually knowing the original author is important to our process. My
> objective is to keep the original file 100% exact as supplied and then
> ignore any changes to the metadata that I don't care about (like
> Creator) if the remainder of the file is the same.

I had a thought that might be workable, opinions are welcome on this.
The commit of my rather weird project is done by a script so I have
flexibility in my approach. What I could do is set up a diff textconv
configuration so that the text diff of the two JPG files will show no
differences if the immutable fields and the image are the same. I can
then trigger a git add and git commit for only those files where git
diff reports differences. That way the actual original file is stored in
git with 100% fidelity (no cleaning). It's not as elegant as I'd like,
but it does solve what I'm trying to do.

Does this sound reasonable and/or is there a better way?

Cheers,
Randall

-- Brief whoami:
 NonStop developer since approximately 2112884442
 UNIX developer since approximately 421664400
-- In my real life, I talk too much.
RE: [Question] Signature calculation ignoring parts of binary files
> -----Original Message-----
> From: git-ow...@vger.kernel.org On Behalf Of Johannes Sixt
> Sent: September 12, 2018 4:48 PM
> To: Randall S. Becker
> Cc: git@vger.kernel.org
> Subject: Re: [Question] Signature calculation ignoring parts of binary files
>
> On 12.09.18 at 21:16, Randall S. Becker wrote:
> > I feel really bad asking this, and I should know the answer, and yet.
> >
> > I have a binary file that needs to go into a repo intact (unchanged).
> > I also have a program that interprets the contents, like a textconv,
> > that can output the relevant portions of the file in whatever format I
> > like - used for diff typically, dumps in 1K chunks by file section.
> > What I'm looking for is to have the SHA1 signature calculated with
> > just the relevant portions of the file so that two actually different
> > files will be considered the same by git during a commit or status. In
> > real terms, I'm trying to ignore the Creator metadata of a JPG because
> > it is mutable and irrelevant to my repo contents.
> >
> > I'm sorry to ask, but I thought this was in .gitattributes but I can't
> > confirm the SHA1 behaviour.
>
> You are looking for a clean filter. See the 'filter' attribute in
> gitattributes(5).
> Your clean filter program or script should strip the unwanted metadata
> or set it to a constant known-good value.
>
> (You shouldn't need a smudge filter.)
>
> -- Hannes

Thanks Hannes. I thought about the clean filter, but I don't actually
want to modify the file when going into git, just for SHA calculation. I
need to be able to keep some origin metadata that might change with
subsequent copies, so just cleaning the origin is not going to work -
actually knowing the original author is important to our process. My
objective is to keep the original file 100% exact as supplied and then
ignore any changes to the metadata that I don't care about (like
Creator) if the remainder of the file is the same.

Regards,
Randall
Re: [Question] Signature calculation ignoring parts of binary files
On 12.09.18 at 21:16, Randall S. Becker wrote:
> I feel really bad asking this, and I should know the answer, and yet.
>
> I have a binary file that needs to go into a repo intact (unchanged).
> I also have a program that interprets the contents, like a textconv,
> that can output the relevant portions of the file in whatever format I
> like - used for diff typically, dumps in 1K chunks by file section.
> What I'm looking for is to have the SHA1 signature calculated with
> just the relevant portions of the file so that two actually different
> files will be considered the same by git during a commit or status. In
> real terms, I'm trying to ignore the Creator metadata of a JPG because
> it is mutable and irrelevant to my repo contents.
>
> I'm sorry to ask, but I thought this was in .gitattributes but I can't
> confirm the SHA1 behaviour.

You are looking for a clean filter. See the 'filter' attribute in
gitattributes(5).
Your clean filter program or script should strip the unwanted metadata
or set it to a constant known-good value.

(You shouldn't need a smudge filter.)

-- Hannes
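For illustration, the core of such a clean filter might look like the sketch below. It rests on several assumptions that are not from this thread: that the unwanted Creator data sits in the JPEG's APP1 (EXIF/XMP) or APP13 (IPTC) segments, that dropping those segments whole is acceptable, and that the file has the usual marker/length layout before the start-of-scan marker. `strip_metadata` and `segment` are hypothetical helper names. A real filter would read the file from standard input and write the result to standard output, and would be registered through the `filter` attribute and a `filter.<driver>.clean` setting.

```python
SOI = b"\xff\xd8"          # start-of-image marker
SOS = 0xDA                 # start-of-scan: compressed image data follows
DROP = {0xE1, 0xED}        # APP1 and APP13, assumed to hold the Creator data


def strip_metadata(data: bytes) -> bytes:
    """Return the JPEG bytes with APP1/APP13 segments removed."""
    if not data.startswith(SOI):
        return data        # not a JPEG: pass through unchanged
    out = bytearray(SOI)
    pos = 2
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        if marker == SOS:
            break          # stop parsing; the rest is copied verbatim
        # The 2-byte big-endian length counts itself but not the marker.
        length = int.from_bytes(data[pos + 2:pos + 4], "big")
        end = pos + 2 + length
        if marker not in DROP:
            out += data[pos:end]
        pos = end
    out += data[pos:]
    return bytes(out)


def segment(marker: int, payload: bytes) -> bytes:
    """Build one marker segment (used only for the synthetic demo below)."""
    return bytes([0xFF, marker]) + (len(payload) + 2).to_bytes(2, "big") + payload


# Synthetic, non-decodable "JPEGs" that differ only in their APP1 payload.
body = segment(0xE0, b"JFIF\x00") + segment(SOS, b"\x00") + b"pixels" + b"\xff\xd9"
file_a = SOI + segment(0xE1, b"Creator=A") + body
file_b = SOI + segment(0xE1, b"Creator=B") + body

print(strip_metadata(file_a) == strip_metadata(file_b))
```

Note that a clean filter changes what is stored: Git would keep the stripped bytes, not the file as supplied.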