Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

Gilles Darold Tue, 03 Aug 2021 02:45:32 -0700

Le 02/08/2021 à 23:22, Gilles Darold a écrit :

Le 02/08/2021 à 01:21, Tom Lane a écrit :

Gilles Darold <[email protected]> writes:

[ v5-0001-regexp-foo-functions.patch ]

I've gone through this whole patch now, and found quite a lot that I did
not like.  In no particular order:


* Wrapping parentheses around the user's regexp doesn't work.  It can
turn an invalid regexp into a valid one: for example 'a)(b' should draw
a syntax error.  With this patch, no error would be thrown, but the
"outer" parens wouldn't do what you expected.  Worse, it can turn a
valid regexp into an invalid one: the metasyntax options described in
9.7.3.4 only work at the start of the regexp.  So we have to handle
whole-regexp cases honestly rather than trying to turn them into an
instance of the parenthesized-subexpression case.

* You did a lot of things quite inefficiently, apparently to avoid
touching any existing code.  I think it's better to extend
setup_regexp_matches() and replace_text_regexp() a little bit so that
they can support the behaviors these new functions need.  In both of
them, it's absolutely trivial to allow a search start position to be
passed in; and it doesn't take much to teach replace_text_regexp()
to replace only the N'th match.

* Speaking of N'th, there is not much of anything that I like
about Oracle's terminology for the function arguments, and I don't
think we ought to adopt it.  If we're documenting the functions as
processing the "N'th match", it seems to me to be natural to call
the parameter "N" not "occurrence".  Speaking of the "occurrence'th
occurrence" is just silly, not to mention long and easy to misspell.
Likewise, "position" is a horribly vague term for the search start
position; it could be interpreted to mean several other things.
"start" seems much better.  "return_opt" is likewise awfully unclear.
I went with "endoption" below, though I could be talked into something
else.  The only one of Oracle's choices that I like is "subexpr" for
subexpression number ... but you went with DB2's rather vague "group"
instead.  I don't want to use their "capture group" terminology,
because that appears nowhere else in our documentation.  Our existing
terminology is "parenthesized subexpression", which seems fine to me
(and also agrees with Oracle's docs).

* I spent a lot of time on the docs too.  A lot of the syntax specs
were wrong (where you put the brackets matters), many of the examples
seemed confusingly overcomplicated, and the text explanations needed
copy-editing.

* Also, the regression tests seemed misguided.  This patch is not
responsible for testing the regexp engine as such; we have tests
elsewhere that do that.  So I don't think we need complex regexps
here.  We just need to verify that the parameters of these functions
act properly, and check their error cases.  That can be done much
more quickly and straightforwardly than what you had.


So here's a revised version that I like better.  I think this
is pretty nearly committable, aside from the question of whether
a too-large subexpression number should be an error or not.


Thanks a lot for the patch improvement and the guidance. I have read the
patch and I agree with your choices I think I was too much trying to
mimic the oraclisms. I don't think we should take care of the too-large
subexpression number, the regexp writer should always test its regular
expression and also this will not prevent him to chose the wrong capture
group number but just a non existing one.

Actually I just found that the regexp_like() function doesn't supportthe start parameter which is something we should support. I saw thatOracle do not support it but DB2 does and I think we should also supportit. I will post a new version of the patch once it is done.



Best regards,

--
Gilles Darold

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

Reply via email to