Re: [PATCH 02/18] Add a new builtin: branch-diff

Johannes Schindelin Thu, 03 May 2018 14:50:31 -0700

Hi Stefan,

On Thu, 3 May 2018, Stefan Beller wrote:

> On Thu, May 3, 2018 at 1:42 PM, Johannes Schindelin
> <johannes.schinde...@gmx.de> wrote:
> 
> >> Speaking of colors, for origin/sb/blame-color Junio hinted at re-using
> >> cyan for "uninteresting" parts to deliver a consistent color scheme for
> >> Git. Eventually he dreams of having 2 layers of indirection IIUC, with
> >>     "uninteresting" -> cyan
> >>     "repeated lines in blame" -> uninteresting
> >>
> >> Maybe we can fit the coloring of this tool in this scheme, too?
> >
> > Sure. So you mean I should use cyan for... what part of the colored
> > output? ;-)
> 
> It is just a FYI heads up, not an actionable bikeshed painting plan. ;)

Oh, I did not understand it as bike-shedding at all. I had hoped that you
ran `git branch-diff --dual-color` on something interesting and found e.g.
the yellow color inherited from DIFF_COMMIT to be the wrong color for
unchanged commit pairs.

So please: as soon as you have a concrete suggestion where to use cyan
(and preferably even a DIFF_* constant to feed to diff_get_color_opt()), I
will be more than interested.

> >> Do we need to dynamic of a floating point, or would a rather small range
> >> suffice here? (Also see rename detection settings, that take percents as
> >> integers)
> >
> > I guess you are right, and we do not need floats. It was just very, very
> > convenient to do that instead of using integers because
> >
> > - I already had the Jonker-Volgenant implementation "lying around" from my
> >   previous life as an image processing expert, using doubles (but it was
> >   in Java, not in C, so I quickly converted it for branch-diff).
> >
> > - I was actually not paying attention whether divisions are a thing in the
> >   algorithm. From a cursory glance, it would appear that we are never
> >   dividing in hungarian.c, so theoretically integers should be fine.
> >
> > - using doubles neatly side-steps the overflow problem. If I use integers
> >   instead, I always will have to worry what to do if, say, adding
> >   `INT_MAX` to `INT_MAX`.
> >
> > I am particularly worried about that last thing: it could easily lead to
> > incorrect results if we blindly, say, pretend that `INT_MAX + INT_MAX ==
> > INT_MAX` for the purpose of avoiding overflows.
> >
> > If, however, I misunderstood and you are only concerned about using
> > *double-precision* floating point numbers, and would suggest using `float`
> > typed variables instead, that would be totally cool with me.
> 
> So by being worried about INT_MAX occurring, you are implying that
> we have to worry about a large range of values, so maybe floating points
> are the best choice here.

I am not really worried about a large range of values, I am worried about
a use case where we use the maximal value as an "impossible, must avoid at
all cost" value. See this line in hungarian.c:

                        u2 = DBL_MAX;

It does not seem as if any arithmetic is done on u2 after that
(theoretically, it should not survive the loop that comes after it and
tries to replace u2 with any smaller value it finds, but what if that loop
does not even run because column_count == 1? Sure, it is a pathological
case, but even those should have correct results).

But actually, yes, there *is* arithmetic performed on u2:

                        if (u1 < u2)
                                v[j1] -= u2 - u1;

So in the pathological case where we try to find the best assignment of a
single column to an arbitrary number of rows (where simply the row with
a smallest cost should be picked), we try to subtract from v[j1] an
insanely large value. As a consequence, v[j1] will be close to INT_MIN if
we were to switch to integers, and who is to say that the next time we get
to this part, j1 will be different? If it is the same, and we hit the same
u2, then we might end up subtracting something close to INT_MAX from
INT_MIN, which will definitely overflow and the computation will be
incorrect.

*That* is what I am worried about: overflowing integer arithmetic. IIRC if
you subtract DBL_MAX from -DBL_MAX, you still end up with -DBL_MAX. So in
that respect, using floating point numbers here is safer.

> Looking through that algorithm the costs seem to be integers only
> measuring number of lines, so I would not be too worried about running
> into INT_MAX problems except for the costs that are assigned INT_MAX
> explicitly.
> 
> I was more asking, if floating point is the right tool for the job.

I think I would have to spend some real quality time with the code in
hungarian.c to turn it into using integer costs instead of floating point
numbers, to ensure that the arithmetic is done in a way that is consistent
with the algorithm, even if we cannot represent the arithmetic faithfully
with limited-range integers.

I'll think about the best way forward.

Ciao,
Dscho

Re: [PATCH 02/18] Add a new builtin: branch-diff

Reply via email to