Re: [PATCH v3 2/3] merge: Add merge.renames config setting
Hi Eckhard, On Mon, Apr 30, 2018 at 1:03 AM, Eckhard Maaß wrote: > On Fri, Apr 27, 2018 at 01:23:20PM -0700, Elijah Newren wrote: >> I doubt it has ever been discussed before this thread. But, if you're >> curious, I'll try to dump a few thoughts. > > Thank you, I try to dump some of mine, too. Maybe let me first stress > that for me copy detection without --find-copies-harder is much more a > "find content extracted" (like methods being factored out). In a way > this is nearer to a rename than to a real copy. Ooh, if you wanted to detect movement of code between files (as blame does, I think searches for PICKAXE_BLAME_MOVE would point you in the right direction) and then try to use that during merge to allow applied changes to move with the code, that would be awesome. Expensive, and might be a lot of work to wire it all up, but it'd be very interesting. I was only discussing DIFF_DETECT_COPY in my previous email, which was all about duplicating entire files; that's something I don't see utility in right now for resolving merges. > I admit that a "real" copy would get unnoticed that way. But the > semantics of such a copy isn't too clear for me either - did I copy the > other part to make it independent of the other or did I just employ a > copy and paste tactic? The former does not want the changes, the later > does. But I am happy catering to the former here. Right, if you have to assume that the copy was made to make the code independent, then there's no value for merge resolution to having detected the copy in the first place. That has the advantage of side-stepping the possible new edge and corner cases I mentioned in the rest of my email, but it means we shouldn't even spend time detecting copies -- whether whole file (via DIFF_DETECT_COPY) or individual lines (via PICKAXE_BLAME_COPY and variants). Elijah
Re: [PATCH v3 2/3] merge: Add merge.renames config setting
On Fri, Apr 27, 2018 at 01:23:20PM -0700, Elijah Newren wrote: > I doubt it has ever been discussed before this thread. But, if you're > curious, I'll try to dump a few thoughts. Thank you, I try to dump some of mine, too. Maybe let me first stress that for me copy detection without --find-copies-harder is much more a "find content extracted" (like methods being factored out). In a way this is nearer to a rename than to a real copy. > [...] Let's say we have branches > A and B, and: >A: modifies file z >B: copies z to y > > Should the modifications to z done in A propagate to both z and y? If > not, what good is copy detection? If so, then there are several > ramifications... If one just assumes the most likely outcome is that something from z wad factored out to y, it might just be sufficient to see whether the modifications of the two branches apply cleanly - if A touched the parts of B that have been factored out there would be a normal merge conflict (where one could be nice and give a hint that some content was copied to y on the B branch), if A did not touched the parts touched (or moved) by B, then there is no problem. If A exactly deleted the content moved by B, there will be no conflict - but this is seems to be strange anyway. I admit that a "real" copy would get unnoticed that way. But the semantics of such a copy isn't too clear for me either - did I copy the other part to make it independent of the other or did I just employ a copy and paste tactic? The former does not want the changes, the later does. But I am happy catering to the former here. To sum up: - fail as before for conflicting merges, but give a hint that one has copied to quicken up resolution. > - If B not only copied z but also first modified it, then do we have > potential conflicts with both z and y -- possibly the exact same > conflicts, making the user resolve them repeatedly? With the above suggestion, if there are conflicts, you fail and give a hint. > - What if A copied z to x? Do changes to z propagate to all three of > z and x and y? Do changes to either x or y affect z? Do they > affect each other? A copy on branch to x and one another to y seems strange even if z merges cleanly. Did both sides try to factor the same thing out to different files? Or did they try to make something independent, but managed to make it to different files? For this I would be inclined to just suggest fail with a copy/copy(somewhere else). But this is a real corner case after all. Has anyone seen just thing in practice? > - If A deleted z, does that give us a copy/delete conflict for y? Do > we also have to worry about copy/add conflicts? copy/add/delete? > rename/copy (multiple variants)? copy/copy? We do have the modified/deleted conflict where we could hint that content also has been copied and then not try to do more. > - Extra degrees of freedom may mean new conflict types: > > - The extra degrees of freedom from renames introduced multiple new > conflict types (e.g. rename/add, rename/rename(1to2), > rename/rename(2to1)). For renaming one side and coping the other, I would think doing the same as above is sensible enough: if there are conflicts one can give an additional hint of the one part having been copied, but not change the kind of conflicts much. > The more I think about it, the more I think that attempting to detect > copies in a merge algorithm just doesn't make sense. Anything I can > think of that someone might attempt to use detected copies for would > just surprise users in a bad way... Hm, it didn't sound like that. Would you think that users would be surprised by my suggestions? Or are they all too corner casey to be worth implementing anyway? Greetings, Eckhard
Re: [PATCH v3 2/3] merge: Add merge.renames config setting
On Fri, Apr 27, 2018 at 11:37 AM, Eckhard Maaß wrote: > On Fri, Apr 27, 2018 at 11:23:56AM +0900, Junio C Hamano wrote: >> I think demoting from copy to rename-only is a good idea, at least >> for now, because I do not believe we have figured out what we want >> to happen when we detect copied files are involved in a merge. > > Does anyone know some threads concerning that topic? I tried to search > for it, but somehow "merge copy detection" did not find me useful > threads so far. I would be interested in that topic anyway, so I would > like to know what the ideas are that floated so far. > I doubt it has ever been discussed before this thread. But, if you're curious, I'll try to dump a few thoughts. Let's say we have branches A and B, and: A: modifies file z B: copies z to y Should the modifications to z done in A propagate to both z and y? If not, what good is copy detection? If so, then there are several ramifications... - If B not only copied z but also first modified it, then do we have potential conflicts with both z and y -- possibly the exact same conflicts, making the user resolve them repeatedly? - What if A copied z to x? Do changes to z propagate to all three of z and x and y? Do changes to either x or y affect z? Do they affect each other? - If A deleted z, does that give us a copy/delete conflict for y? Do we also have to worry about copy/add conflicts? copy/add/delete? rename/copy (multiple variants)? copy/copy? - Extra degrees of freedom may mean new conflict types: - The extra degrees of freedom from renames introduced multiple new conflict types (e.g. rename/add, rename/rename(1to2), rename/rename(2to1)). - Directory rename detection added more (rename/rename/rename(1to3), rename/rename(Nto1), add/add/add, directory/file/file, n-fold transitive rename, etc.), -- and forced us to specify various rules to limit the possibility space so that conflicts could be representable in the index and understandable by the user. - I suspect adding copies would add new types of conflicts we haven't thought of yet. The more I think about it, the more I think that attempting to detect copies in a merge algorithm just doesn't make sense. Anything I can think of that someone might attempt to use detected copies for would just surprise users in a bad way...and it'd open up a big can of edge and corner case problems as well.
Re: [PATCH v3 2/3] merge: Add merge.renames config setting
On Fri, Apr 27, 2018 at 11:23:56AM +0900, Junio C Hamano wrote: > I think demoting from copy to rename-only is a good idea, at least > for now, because I do not believe we have figured out what we want > to happen when we detect copied files are involved in a merge. Does anyone know some threads concerning that topic? I tried to search for it, but somehow "merge copy detection" did not find me useful threads so far. I would be interested in that topic anyway, so I would like to know what the ideas are that floated so far. Greetings, Eckhard
Re: [PATCH v3 2/3] merge: Add merge.renames config setting
Hi Dsco, On Fri, Apr 27, 2018 at 12:23 AM, Johannes Schindelin wrote: > Hi, > > Guys, you argued long and hard that one config setting (diff.renames) > should magically imply another one (merge.renames), on the basis that they > essentially do the same. I apologize for any frustration; I've probably caused a good deal of it by repeatedly catching or noticing things after one review and then bringing them up later. I just didn't catch it all at first. Your restatement of the basis of the argument doesn't match my rationale, though. My reasoning was that (1) the configuration options exist to allow the user to specify how much of a performance cost they're willing to pay, (2) the two options are separate because some users might want to configure one more loosely or tightly than the other, but (3) many users will want to just specify once how much they are willing to pay without being forced to repeat themselves for different parts of git (diff, merge, possible future commands). I'll add to my former basis by stating that I think the diffstat at the end of the merge should be controlled by these variables, even if that's not implemented as part of Ben's series. And both of those configuration options clearly feed into whether that diffstat should be doing rename or copy or no detection. While they could be kept separate options, forcing us to somehow decide which one overrides which, I think it's much more natural if we already have merge.renames inheriting from and overriding diff.renames. > And now you are suggesting that they *cannot* be essentially the same? Are > you agreeing on such a violation of the Law of Least Surprise? > > Please, make up your mind. Inheriting merge.renames from diff.renames is > "magic" enough to be puzzling. Introducing that auto-demoting is just > simply creating a bad user experience. Please don't. >From my view, resolve and octopus merges auto-demote diff.renames and merge.renames to false. In fact, they optimize the work involved in that demotion by not even checking the value. I think demoting "copy" to "true" in the recursive merge machinery is no more surprising than that is. You can find more details to this argument in the portion of my email that you responded to but snipped out. But to add to that argument, if auto-demotion is a violation of the Law of Least Surprise, as you claim, then naming this option merge.renames is also a violation of the Law of Least Surprise. You would instead need to rename it to something like merge.recursive.renames. (And then when a new merge strategy comes along that also does renames, we'll need to add a merge.ort.renames.) > It is fine, of course, to admit at this stage that the whole magic idea of > letting merge.renames inherit from diff.renames was only half thought > through (we're in reviewing stage right now for the purpose of weighing > alternative approaches and trying to come up with the best solution after > all) and that we should go with Ben's original approach, which has the > rather huge and dramatic advantage of being very, very clear and easy to > understand to the user. We could even document that (and why) the 'copy' > value in merge.renames is not allowed for the time being. It was only half thought through, yes, at least by me. But the more I think through it, the more I like the inheritance personally. I see no problem with the demotion and think the inheritance has the advantage of being easier to understand, because I see your proposal as causing questions like: - "Why does merge.renameLimit inherit from diff.renameLimit but merge.renames doesn't from diff.renames?" - "Why can't I just specify in one place that I {am, am not} willing to pay for {full, both copy and} rename detection wherever it makes sense?" - "How do I control the detection for the diffstat at the end of the merge?" Hope that helps, Elijah
Re: [PATCH v3 2/3] merge: Add merge.renames config setting
Hi, On Thu, 26 Apr 2018, Elijah Newren wrote: > On Thu, Apr 26, 2018 at 7:23 PM, Junio C Hamano wrote: > > Ben Peart writes: > > > >> Color me puzzled. :) The consensus was that the default value for > >> merge.renames come from diff.renames. diff.renames supports copy > >> detection which means that merge.renames will inherit that value. My > >> assumption was that is what was intended so when I reimplemented it, I > >> fully implemented it that way. > >> > >> Are you now requesting to only use diff.renames as the default if the > >> value is true or false but not if it is copy? What should happen if > >> diff.renames is actually set to copy? Should merge silently change > >> that to true, display a warning, error out, or something else? Do you > >> have some other behavior for how to handle copy being inherited from > >> diff.renames you'd like to see? > >> > >> Can you write the documentation that clearly explains the exact > >> behavior you want? That would kill two birds with one stone... :) > > > > I think demoting from copy to rename-only is a good idea, at least > > for now, because I do not believe we have figured out what we want > > to happen when we detect copied files are involved in a merge. > > > > But I am not sure if we even want to fail merge.renames=copy as an > > invalid configuration. So my gut feeling of the best solution to > > the above is to do something like: > > > > - whether the configuration comes from diff.renames or > >merge.renames, turn *.renames=copy to true inside the merge > >recursive machinery. > > > > - document the fact in "git merge-recursive" documentation (or "git > >merge" documentation) to say "_currently_ asking for rename > >detection to find copies and renames will do the same > >thing---copies are ignored", impliying "this might change in the > >future", in the BUGS section. > > Yes, I agree. Guys, you argued long and hard that one config setting (diff.renames) should magically imply another one (merge.renames), on the basis that they essentially do the same. And now you are suggesting that they *cannot* be essentially the same? Are you agreeing on such a violation of the Law of Least Surprise? Please, make up your mind. Inheriting merge.renames from diff.renames is "magic" enough to be puzzling. Introducing that auto-demoting is just simply creating a bad user experience. Please don't. It is fine, of course, to admit at this stage that the whole magic idea of letting merge.renames inherit from diff.renames was only half thought through (we're in reviewing stage right now for the purpose of weighing alternative approaches and trying to come up with the best solution after all) and that we should go with Ben's original approach, which has the rather huge and dramatic advantage of being very, very clear and easy to understand to the user. We could even document that (and why) the 'copy' value in merge.renames is not allowed for the time being. Ciao, Dscho
Re: [PATCH v3 2/3] merge: Add merge.renames config setting
On Thu, Apr 26, 2018 at 5:54 PM, Ben Peart wrote: > On 4/26/2018 6:52 PM, Elijah Newren wrote: >> On Thu, Apr 26, 2018 at 1:52 PM, Ben Peart >> wrote: >> >>> diff --git a/merge-recursive.h b/merge-recursive.h >>> index 80d69d1401..0c5f7eff98 100644 >>> --- a/merge-recursive.h >>> +++ b/merge-recursive.h >>> @@ -17,7 +17,8 @@ struct merge_options { >>> unsigned renormalize : 1; >>> long xdl_opts; >>> int verbosity; >>> - int detect_rename; >>> + int diff_detect_rename; >>> + int merge_detect_rename; >>> int diff_rename_limit; >>> int merge_rename_limit; >>> int rename_score; >>> @@ -28,6 +29,11 @@ struct merge_options { >>> struct hashmap current_file_dir_set; >>> struct string_list df_conflict_file_set; >>> }; >>> +inline int merge_detect_rename(struct merge_options *o) >>> +{ >>> + return o->merge_detect_rename >= 0 ? o->merge_detect_rename : >>> + o->diff_detect_rename >= 0 ? o->diff_detect_rename : 1; >>> +} >> >> >> Why did you split o->detect_rename into two fields? You then >> recombine them in merge_detect_rename(), and after initial setup only >> ever access them through that function. Having two fields worries me >> that people will accidentally introduce bugs by using one of them >> instead of the merge_detect_rename() function. Is there a reason you >> decided against having the initial setup just set a single value and >> then use it directly? >> > The setup of this value is split into 3 places that may or may not all get > called. The initial values, the values that come from the config settings > and then any values passed on the command line. > > Because the merge value can now inherit from the diff value, you only know > the final value after you have received all possible inputs. That makes it > necessary to be a calculated value. > > If you look at diff_rename_limit/merge_rename_limit, detect_rename follow > the same pattern for the same reasons. It turns out detect_rename was a > little more complex because it is used in 3 different locations (vs just > one) which is why I wrapped the inheritance logic into the helper function > merge_detect_rename(). Ah, you're following the precedent set by diff_rename_limit/merge_rename_limit; that makes sense. Thanks for the explanation. I believe another possibility here is that for both the {merge,diff}_rename_limit pair of variables and the {diff,merge}_renames pair of variables, since the code parses all inputs before ever using the result, we could calculate the result once and store it rather than storing the constituent pieces of the calculation. That would also prevent people from trying to use one of the pieces of the calculation instead of treating it as a coherent whole. However, while I would have preferred that the rename_limit pair of variables also went away in favor of just one field which is updated as it parses each input option, what you have is fine for this series.
Re: [PATCH v3 2/3] merge: Add merge.renames config setting
On Thu, Apr 26, 2018 at 7:23 PM, Junio C Hamano wrote: > Ben Peart writes: > >> Color me puzzled. :) The consensus was that the default value for >> merge.renames come from diff.renames. diff.renames supports copy >> detection which means that merge.renames will inherit that value. My >> assumption was that is what was intended so when I reimplemented it, I >> fully implemented it that way. >> >> Are you now requesting to only use diff.renames as the default if the >> value is true or false but not if it is copy? What should happen if >> diff.renames is actually set to copy? Should merge silently change >> that to true, display a warning, error out, or something else? Do you >> have some other behavior for how to handle copy being inherited from >> diff.renames you'd like to see? >> >> Can you write the documentation that clearly explains the exact >> behavior you want? That would kill two birds with one stone... :) > > I think demoting from copy to rename-only is a good idea, at least > for now, because I do not believe we have figured out what we want > to happen when we detect copied files are involved in a merge. > > But I am not sure if we even want to fail merge.renames=copy as an > invalid configuration. So my gut feeling of the best solution to > the above is to do something like: > > - whether the configuration comes from diff.renames or >merge.renames, turn *.renames=copy to true inside the merge >recursive machinery. > > - document the fact in "git merge-recursive" documentation (or "git >merge" documentation) to say "_currently_ asking for rename >detection to find copies and renames will do the same >thing---copies are ignored", impliying "this might change in the >future", in the BUGS section. Yes, I agree. One more thing: - It may be best to avoid advertising "copies" as a vaild option for merge.renames since it doesn't have any current practical use anywhere. (Remove the sentence 'If set to "copies" or "copy", Git will detect copies, as well.' from the documentation) My rationale for translating "copy" to "true" is a little different than Junio's, though: 1) The reason we have configuration options around renames and copies is primarily because they are expensive to compute. So we let some users specify that they don't want them, other users are willing to pay for rename detection, and others are willing to pay for both rename and copy detection. 2) If rename/copy detection were cheap, every part of git would just compute whatever level of detection was relevant and use it. 3) The resolve and octopus merge strategies ignores diff.renames and merge.renames, because they don't have logic to use any rename information. diff and log can use both renames and copies. And the recursive merge machinery is code which can use renames but not copies. 4) Therefore, translating from "copy" to "true" inside the merge recursive machinery is fine and not an error because we are using as much detection information as is relevant to the algorithm and which the user is willing to pay for. To throw one more wrinkle in here, merge.renames could actually be set to "copy" and make sense, because we compute diffs multiple times. Twice within the recursive merge machinery (for which we'd want to translate "copy" to "true"), and once for the diffstat at the end (which comes from builtin/merge.c, and for which it could make sense to detect copies). (Kind of curious whether Junio agrees with my rationale or thinks I'm out in left field with it...)
Re: [PATCH v3 2/3] merge: Add merge.renames config setting
Ben Peart writes: > Color me puzzled. :) The consensus was that the default value for > merge.renames come from diff.renames. diff.renames supports copy > detection which means that merge.renames will inherit that value. My > assumption was that is what was intended so when I reimplemented it, I > fully implemented it that way. > > Are you now requesting to only use diff.renames as the default if the > value is true or false but not if it is copy? What should happen if > diff.renames is actually set to copy? Should merge silently change > that to true, display a warning, error out, or something else? Do you > have some other behavior for how to handle copy being inherited from > diff.renames you'd like to see? > > Can you write the documentation that clearly explains the exact > behavior you want? That would kill two birds with one stone... :) I think demoting from copy to rename-only is a good idea, at least for now, because I do not believe we have figured out what we want to happen when we detect copied files are involved in a merge. But I am not sure if we even want to fail merge.renames=copy as an invalid configuration. So my gut feeling of the best solution to the above is to do something like: - whether the configuration comes from diff.renames or merge.renames, turn *.renames=copy to true inside the merge recursive machinery. - document the fact in "git merge-recursive" documentation (or "git merge" documentation) to say "_currently_ asking for rename detection to find copies and renames will do the same thing---copies are ignored", impliying "this might change in the future", in the BUGS section.
Re: [PATCH v3 2/3] merge: Add merge.renames config setting
On 4/26/2018 6:52 PM, Elijah Newren wrote: On Thu, Apr 26, 2018 at 1:52 PM, Ben Peart wrote: +merge.renames:: + Whether and how Git detects renames. If set to "false", + rename detection is disabled. If set to "true", basic rename + detection is enabled. If set to "copies" or "copy", Git will + detect copies, as well. Defaults to the value of diff.renames. + We shouldn't allow users to force copy detection on for merges The diff side of the code will detect them correctly but the code in merge-recursive will mishandle the copy pairs. I think fixing it is somewhere between big can of worms and it's-a-configuration-that-doesn't-even-make-sense, but it's been a while since I thought about it. Color me puzzled. :) The consensus was that the default value for merge.renames come from diff.renames. diff.renames supports copy detection which means that merge.renames will inherit that value. My assumption was that is what was intended so when I reimplemented it, I fully implemented it that way. Are you now requesting to only use diff.renames as the default if the value is true or false but not if it is copy? What should happen if diff.renames is actually set to copy? Should merge silently change that to true, display a warning, error out, or something else? Do you have some other behavior for how to handle copy being inherited from diff.renames you'd like to see? Can you write the documentation that clearly explains the exact behavior you want? That would kill two birds with one stone... :) diff --git a/merge-recursive.h b/merge-recursive.h index 80d69d1401..0c5f7eff98 100644 --- a/merge-recursive.h +++ b/merge-recursive.h @@ -17,7 +17,8 @@ struct merge_options { unsigned renormalize : 1; long xdl_opts; int verbosity; - int detect_rename; + int diff_detect_rename; + int merge_detect_rename; int diff_rename_limit; int merge_rename_limit; int rename_score; @@ -28,6 +29,11 @@ struct merge_options { struct hashmap current_file_dir_set; struct string_list df_conflict_file_set; }; +inline int merge_detect_rename(struct merge_options *o) +{ + return o->merge_detect_rename >= 0 ? o->merge_detect_rename : + o->diff_detect_rename >= 0 ? o->diff_detect_rename : 1; +} Why did you split o->detect_rename into two fields? You then recombine them in merge_detect_rename(), and after initial setup only ever access them through that function. Having two fields worries me that people will accidentally introduce bugs by using one of them instead of the merge_detect_rename() function. Is there a reason you decided against having the initial setup just set a single value and then use it directly? The setup of this value is split into 3 places that may or may not all get called. The initial values, the values that come from the config settings and then any values passed on the command line. Because the merge value can now inherit from the diff value, you only know the final value after you have received all possible inputs. That makes it necessary to be a calculated value. If you look at diff_rename_limit/merge_rename_limit, detect_rename follow the same pattern for the same reasons. It turns out detect_rename was a little more complex because it is used in 3 different locations (vs just one) which is why I wrapped the inheritance logic into the helper function merge_detect_rename().
Re: [PATCH v3 2/3] merge: Add merge.renames config setting
On Thu, Apr 26, 2018 at 1:52 PM, Ben Peart wrote: > +merge.renames:: > + Whether and how Git detects renames. If set to "false", > + rename detection is disabled. If set to "true", basic rename > + detection is enabled. If set to "copies" or "copy", Git will > + detect copies, as well. Defaults to the value of diff.renames. > + We shouldn't allow users to force copy detection on for merges The diff side of the code will detect them correctly but the code in merge-recursive will mishandle the copy pairs. I think fixing it is somewhere between big can of worms and it's-a-configuration-that-doesn't-even-make-sense, but it's been a while since I thought about it. > diff --git a/merge-recursive.h b/merge-recursive.h > index 80d69d1401..0c5f7eff98 100644 > --- a/merge-recursive.h > +++ b/merge-recursive.h > @@ -17,7 +17,8 @@ struct merge_options { > unsigned renormalize : 1; > long xdl_opts; > int verbosity; > - int detect_rename; > + int diff_detect_rename; > + int merge_detect_rename; > int diff_rename_limit; > int merge_rename_limit; > int rename_score; > @@ -28,6 +29,11 @@ struct merge_options { > struct hashmap current_file_dir_set; > struct string_list df_conflict_file_set; > }; > +inline int merge_detect_rename(struct merge_options *o) > +{ > + return o->merge_detect_rename >= 0 ? o->merge_detect_rename : > + o->diff_detect_rename >= 0 ? o->diff_detect_rename : 1; > +} Why did you split o->detect_rename into two fields? You then recombine them in merge_detect_rename(), and after initial setup only ever access them through that function. Having two fields worries me that people will accidentally introduce bugs by using one of them instead of the merge_detect_rename() function. Is there a reason you decided against having the initial setup just set a single value and then use it directly?