On Mon, 5 Sep 2016, Junio C Hamano wrote:
> Johannes Schindelin <johannes.schinde...@gmx.de> writes:
> > This patch series addresses a problem where `git diff` is called using
> > `-G` or `-S --pickaxe-regex` on new-born files that are configured
> > without user diff drivers, and that hence get mmap()ed into memory.
> Good spotting. This has been with us almost forever; I do not think
> the original pickaxe had it, but I am sure it is broken after the
> "--pickaxe-regex" enhancement.
Agreed, regexec() is the call where it segfaults.
> I am somehow surprised that this is a problem on Windows, though.
> Wouldn't we be at least running CRLF conversions, and causing diff
> or grep machinery to work on a NUL-terminated buffer?
It is true that the CR/LF conversion hides this problem. In fact, in the
case reported to me, it turned out that the segfault happened only
recently, when the repository was switched from LF line endings to CR/LF
That switch is unfortunately required: it saves *tons* of time because the
regular CR/LF conversion just takes too much time. It was worse before the
repository defined the .gitattributes: the auto-detection contributed to
the time spent by Git.
So yes: the CR/LF conversion hid the bug, but no, we cannot re-introduce
the CR/LF conversion into said repository.
> The convesion code would have to look at mmap'ed memory but I do not
> think it assumes NUL-termination. Perhaps people on Windows do not
> usually use straight-through and that is why this was discovered after
> many years, or something? In any case, that is a digression.
Indeed, it is.
> > Windows (the bug does not trigger a segmentation fault for me on
> > Linux, strangely enough, but it is really a problem on Windows).
> I think it is an issue on all platforms that lets us use mmap().
> When the size of a file is multiple of pagesize, the byte past the
> end of the file can very well fall on an unmapped address.
> > So at least I have a workaround in place. Ideally, though, we would
> > NUL-terminate the buffers only when needed, or somehow call regexec()
> > on ptr/size parameters instead of passing a supposedly NUL-terminated
> > string to it?
> I see two reasonable approaches.
> * We may want to revisit the performance numbers to see if using
> mmap() to read from the working tree files still buys us much.
> If not, we should stop borrowing from the working tree using
> mmap(); instead just slurp in and NUL-terminate it.
I would like to warn against putting too much stock into such a test,
unless it is performed on Linux, MacOSX, Windows and various BSDs. That
would make it hard, of course, to come up with a definitive result, but we
simply should not make the mistake of over-optimizing for one platform.
We used to, of course, and look how much performance it costs e.g. on
> * We could use <ptr,len> variant of regexp engine as you proposed,
> which I think is a preferrable solution. Do people know of a
> widely accepted implementation that we can throw into compat/ as
> fallback that is compatible with GPLv2?
As suggested by Peff, there is a compat/regex/, and I will spout my
thoughts in a reply to his mail.