[PATCH v2 0/7] Better heuristics make prettier diffs

Michael Haggerty Mon, 22 Aug 2016 04:24:55 -0700

This is v2 of a patch series to improve the heuristics that `git diff`
and related commands use to position ambiguous blocks of added/deleted
lines. Thanks to Stefan, Jacob, Peff, Junio, and Jakub for their
comments about v1 [1]. This patch series is also available from my
GitHub account [2] as branch `diff-indent-heuristic`.


This version started out as a light touch-up of v1, but as I was
working I realized that it needed more changes. Among other things,
the final heuristic is now improved relative to the one proposed in
v1. Summary of changes:

* I changed the `--compaction-heuristic` code such that if a group of
  added/deleted lines can be aligned with a group of deleted/added
  lines in the other file, then that should be done rather than try to
  slide to a blank line. In the existing code, the compaction
  heuristic was attempted, incorrectly, *after* trying to align
  groups, which doesn't make sense.

* I changed `recs_match()` similarly to `is_blank_line()`: now it
  takes two `xrecord_t *` as arguments rather than an array of
  `xrecord_t` and two indexes.

* I refactored the old `xdl_change_compact()` more thoroughly. All of
  the index manipulation was pretty confusing, and I was having
  trouble convincing myself that even the old code was working
  correctly. So I introduced a `struct group`, which is like a cursor
  that keeps track of a (possibly empty) group of changed lines in the
  old or new version of a file. I added functions to initialize a
  group to the first change in a file, to move to the next or previous
  groups, and to slide a group forward or backward if possible (i.e.,
  if the group is a "slider").

* In the course of that refactoring, I found (and fixed) another bug
  in the `--compaction-heuristic` code: it didn't notice if the top
  possible position of a slider had a blank line as its last line. See
  the commit message of patch [5/5] for more details.

* I realized that the behavior of the indent heuristic from v1,
  because it multiplied weighting factors times indent widths, would
  behave differently if a project changed its convention for how many
  spaces to use per level of indentation. That doesn't make sense, so
  I changed the handling of indentation:

  Now, the sum of the indentation width at the top and bottom of the
  slider are added, *but those sums are only compared* between slider
  positions, and the weight is multipled by -1, 0, or +1 depending on
  whether the first indent sum is less than, equal, or greater than
  the other. This should make the behavior relatively independent of
  the project's indentation convention, and thus make the heuristic
  work more consistently across projects.

  The resulting heuristic works significantly better than the one
  proposed in v1:

  | repository            | count |      Git 2.9.0 |             v1 |           
  v2 |
  | --------------------- | ----- | -------------- | -------------- | 
-------------- |
  | afnetworking          |   109 |    89  (81.7%) |     3   (2.8%) |     2   
(1.8%) |
  | alamofire             |    30 |    18  (60.0%) |     2   (6.7%) |     0   
(0.0%) |
  | angular               |   184 |   127  (69.0%) |    15   (8.2%) |     5   
(2.7%) |
  | animate               |   313 |     2   (0.6%) |     2   (0.6%) |     2   
(0.6%) |
  | ant                   |   380 |   356  (93.7%) |    22   (5.8%) |    15   
(3.9%) |
  | bugzilla              |   306 |   263  (85.9%) |    24   (7.8%) |    15   
(4.9%) |
  | corefx                |   126 |    91  (72.2%) |    15  (11.9%) |     6   
(4.8%) |
  | couchdb               |    78 |    44  (56.4%) |    17  (21.8%) |     6   
(7.7%) |
  | cpython               |   937 |   158  (16.9%) |    26   (2.8%) |     5   
(0.5%) |
  | discourse             |   160 |    95  (59.4%) |    24  (15.0%) |    13   
(8.1%) |
  | docker                |   307 |   194  (63.2%) |    16   (5.2%) |     8   
(2.6%) |
  | electron              |   163 |   132  (81.0%) |    15   (9.2%) |     6   
(3.7%) |
  | git                   |   536 |   470  (87.7%) |    18   (3.4%) |    16   
(3.0%) |
  | gitflow               |   127 |     0   (0.0%) |     0   (0.0%) |     0   
(0.0%) |
  | ionic                 |   133 |    89  (66.9%) |     6   (4.5%) |     1   
(0.8%) |
  | ipython               |   482 |   362  (75.1%) |    45   (9.3%) |    11   
(2.3%) |
  | junit                 |   161 |   147  (91.3%) |     4   (2.5%) |     1   
(0.6%) |
  | lighttable            |    15 |     5  (33.3%) |     2  (13.3%) |     0   
(0.0%) |
  | magit                 |    88 |    75  (85.2%) |     5   (5.7%) |     0   
(0.0%) |
  | neural-style          |    28 |     0   (0.0%) |     0   (0.0%) |     0   
(0.0%) |
  | nodejs                |   781 |   649  (83.1%) |    98  (12.5%) |     5   
(0.6%) |
  | phpmyadmin            |   491 |   481  (98.0%) |     7   (1.4%) |     2   
(0.4%) |
  | react-native          |   168 |   130  (77.4%) |     5   (3.0%) |     0   
(0.0%) |
  | rust                  |   171 |   128  (74.9%) |    17   (9.9%) |    14   
(8.2%) |
  | spark                 |   186 |   149  (80.1%) |    17   (9.1%) |     2   
(1.1%) |
  | tensorflow            |   115 |    66  (57.4%) |     7   (6.1%) |     5   
(4.3%) |
  | test-more             |    19 |    15  (78.9%) |     1   (5.3%) |     1   
(5.3%) |
  | test-unit             |    51 |    34  (66.7%) |     8  (15.7%) |     2   
(3.9%) |
  | xmonad                |    23 |    22  (95.7%) |     1   (4.3%) |     1   
(4.3%) |
  | --------------------- | ----- | -------------- | -------------- | 
-------------- |
  | totals                |  6668 |  4391  (65.9%) |   422   (6.3%) |   144   
(2.2%) |

* I noticed that most of the "bonus" weights were actually negative,
  so calling them "bonuses" was misleading. Therefore, I negated the
  coefficients and now call them "penalties"

* I noticed that `git blame` was parsing diff options like
  `--indent-heuristic` from the command line, but wasn't using the
  values. That is fixed in a new patch [07/07].

* I redid all of the analysis and training with a bigger corpus. To
  check whether I was overtraining the heuristic, I also did the
  following experiment: I optimized the weights against a training set
  consisting of only some of the repositories, then tested it against
  the rest of the corpus. See the full results in the commit message
  for patch [06/06].

* I added a couple of smoke tests.

The companion project [3] now provides an easy way to replicate and
extend these results, if anybody is interested.

The new heuristic still has to be enabled via the `--indent-heuristic`
command-line parameter or the `diff.indentHeuristic` configuration
setting. Before we release this, we should decide what the final UI
should look like and make it so. If we decide to replace the existing
compaction heuristic with this one and/or turn this heuristic on for
all users by default, those steps might look like branch
`indent-heuristic-default` in my GitHub repository [2].

Michael

[1] http://public-inbox.org/git/cover.1470259583.git.mhag...@alum.mit.edu/t/#u
[2] https://github.com/mhagger/git
[3] https://github.com/mhagger/diff-slider-tools

Michael Haggerty (7):
  xdl_change_compact(): fix compaction heuristic to adjust ixo
  xdl_change_compact(): only use heuristic if group can't be matched
  is_blank_line(): take a single xrecord_t as argument
  recs_match(): take two xrecord_t pointers as arguments
  xdl_change_compact(): introduce the concept of a change group
  diff: improve positioning of add/delete blocks in diffs
  blame: actually use the diff opts parsed from the command line

 Documentation/diff-config.txt  |   7 +-
 Documentation/diff-options.txt |   9 +-
 builtin/blame.c                |  11 +-
 diff.c                         |  23 +-
 git-add--interactive.perl      |   5 +-
 t/t4059-diff-indent.sh         | 160 +++++++++++
 xdiff/xdiff.h                  |   1 +
 xdiff/xdiffi.c                 | 635 ++++++++++++++++++++++++++++++++++-------
 8 files changed, 736 insertions(+), 115 deletions(-)
 create mode 100755 t/t4059-diff-indent.sh

-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 0/7] Better heuristics make prettier diffs

Reply via email to