On Thu, Sep 08, 2016 at 09:29:58AM +0200, Johannes Schindelin wrote:

> sorry for the late answer, I was really busy trying to come up with a new
> and improved version of the patch series, and while hunting a bug I
> introduced got bogged down with other tasks.

No problem. I am not in a hurry.

> > I always assumed the _point_ of re_search taking a ptr/len pair was
> > exactly to handle this case. The documentation[1] says:
> > 
> >    `string` is the string you want to match; it can contain newline and
> >    null characters. `size` is the length of that string.
> > 
> > Which seems pretty definitive to me (that's for re_match(), but
> > re_search() is defined in the docs in terms of re_match()).
>
> Right. The problem is: I *really* want to avoid using GNU-isms.

I don't think GNU-isms are a problem if we wrap them to give a nice
interface, and if we rely on having compat/regex. But if you mean "I do
not want to rely on using compat/regex everywhere", then OK. I can see
arguments both for and against using a consistent regex library, but I
do not care that much either way myself.

> > We can contain this to the existing compat/regex/regexec.c, and just
> > provide a wrapper that is similar to regexec but takes a ptr/len pair.
>
> But we can do even better than that: we can provide a wrapper that uses
> REG_STARTEND where available (which is really the majority of platforms we
> care about: Linux, MacOSX, Windows, and even the *BSDs). Where it is not
> available, we simply malloc(), memcpy() and append a NUL.

Doesn't that make things much _worse_ for people on systems without
REG_STARTEND? If we imagine that most regexec calls would operate on a
NUL-terminated buffer, then they are now paying the extra malloc and
copy for each call to regexec_buf(), even if the buffer was already
NUL-terminated (because they have no idea whether it was or not).
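To make the trade-off concrete, here is a rough sketch of the kind of
wrapper being discussed. The exact signature is my assumption (it just
mirrors regexec() with an added size), not something taken from the
patch series. The #else branch is the malloc-and-copy fallback, and it
runs on every call on platforms without REG_STARTEND:

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sketch only: match against buf[0..size) without requiring a
 * terminating NUL. The signature is an assumption for illustration.
 */
static int regexec_buf(const regex_t *preg, const char *buf, size_t size,
		       size_t nmatch, regmatch_t pmatch[], int eflags)
{
#ifdef REG_STARTEND
	/* REG_STARTEND reads the range to search from pmatch[0]. */
	assert(nmatch > 0 && pmatch);
	pmatch[0].rm_so = 0;
	pmatch[0].rm_eo = (regoff_t)size;
	return regexec(preg, buf, nmatch, pmatch, eflags | REG_STARTEND);
#else
	/* Fallback: pay for an allocation and a copy on every call. */
	char *copy = malloc(size + 1);
	int ret;

	if (!copy)
		return REG_ESPACE;
	memcpy(copy, buf, size);
	copy[size] = '\0';
	ret = regexec(preg, copy, nmatch, pmatch, eflags);
	free(copy);
	return ret;
#endif
}
```

Note that the fallback cannot tell whether buf was already
NUL-terminated, so it has to copy unconditionally.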

I think I'd rather just have:

  #ifndef REG_STARTEND
  #error "Your regex library sucks. Compile with NO_REGEX=NeedsStartEnd"
  #endif

(or you could just use REG_STARTEND and let the compiler complain, but
then the user has to figure out the right knob to twiddle).

One other question about REG_STARTEND is: what does it do with NULs
inside the buffer? Certainly glibc (and our compat/regex) treat it as a
buffer with a particular length and ignore embedded NULs, as we want.
But the NetBSD documentation says only:

     REG_STARTEND   The string is considered to start at string +
                    pmatch[0].rm_so and to have a terminating NUL
                    located at string + pmatch[0].rm_eo (there need not
                    actually be a NUL at that location), 

Besides avoiding a segfault, one of the benefits of regexec_buf() is
that we will now find pickaxe-regex strings inside mixed binary/text
files. But it's not clear to me that NetBSD's implementation does this.
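One way to check a given platform would be a small probe along these
lines. This is only a sketch; the function name and the test string are
made up for illustration. It asks regexec() to search a buffer with an
embedded NUL and reports whether a match located after the NUL is found:

```c
#include <regex.h>

/*
 * Hypothetical probe. Returns 1 if a match after an embedded NUL is
 * found, 0 if it is not, and -1 if REG_STARTEND is unavailable or the
 * pattern fails to compile.
 */
static int startend_sees_past_nul(void)
{
#ifdef REG_STARTEND
	/* sizeof(buf) counts the implicit trailing NUL, hence the "- 1". */
	static const char buf[] = "binary\0needle";
	regex_t re;
	regmatch_t m[1];
	int ret;

	if (regcomp(&re, "needle", REG_EXTENDED))
		return -1;
	m[0].rm_so = 0;
	m[0].rm_eo = (regoff_t)(sizeof(buf) - 1);
	ret = regexec(&re, buf, 1, m, REG_STARTEND);
	regfree(&re);
	return ret == 0;
#else
	return -1;
#endif
}
```

A result of 0 would mean the implementation stops at the first NUL
even though REG_STARTEND gave it a longer range.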

I guess we can assume it is fine (it is certainly no _worse_ than the
current behavior), and if people's platforms do not handle it, they can
build with NO_REGEX.

