Thank you Stuart for the analysis!

Please see my comments inline.

On 9/9/19 4:39 PM, Stuart Marks wrote:


On 9/5/19 1:43 PM, Ivan Gerasimov wrote:
Personally, I don't have a strong preference here.

Compatibility properties are meant to be temporary anyway.

Maybe if we have a single option that will control several different aspects of behavior, it will be harder to get rid of it?

Partially, because it will be tempting to reuse it for other similar changes, should they be needed.

OK, let's take an inventory of what behavior changes are being contemplated for regexes:

JDK-8230675 restrict IDs for control chars
JDK-xxxxxxx allow case-insensitive IDs for control chars *NOTE*
JDK-8225021 Treat ambiguous embedded flags as parse syntax errors
JDK-8214245 Case insensitive matching doesn't work correctly for some character classes

I quickly searched JBS and found several more bug reports and enhancement requests that, if implemented, may result in behavior changes.

Here's a (presumably incomplete) list:

JDK-8218146  $ matches before end of line, even without MULTILINE mode
JDK-8217977  Matcher matching trailing high surrogate reports false for requireEnd()
JDK-8217501  Matcher.hitEnd returns false for incomplete surrogate pairs
JDK-8217496  Matcher.group() can return null after usePattern
JDK-8216332  Grapheme regex does not work with emoji sequences
JDK-8199594  Regex Pattern class improperly ignores spaces in character classes
JDK-8187083  Regex: Capturing groups inside a lookahead aren't backtracked
JDK-8187082  Regex: Nested capturing groups under lazy repetition aren't backtracked
JDK-8183391  Regex: End of line found more than once for non-multiline regex
JDK-8179668  Valid regex patterns match the latter half of complete surrogate pairs
JDK-8029966  Broken supplementary character support in regex
JDK-6919621  Matcher find returns wrong result in java 1.6 for certain patterns

All of them are low priority, so I don't anticipate active work on these bugs in the near future. However, at least some of them, if fixed, would make the Java regex engine better, so it probably wouldn't make sense to abandon these requests just for compatibility reasons.

*NOTE* this was part of the original JDK-8230675 proposal, but you removed it after discussion. I don't know if we decided never to do this, or whether we're merely considering it separately. It seemed to me that there was a possibility that we'd do this in the future.

I was thinking of filing an enhancement request with the fix version set to TBD, so we can return to this proposal in some future release, if desirable.


Are these all the behavior changes being contemplated, or is this simply the set that we happen to have stumbled across recently? Put another way, if we decided to do some further analysis of regexes, would we run across other issues where we might say, "Yeah, we ought to fix that, but that would be a potentially incompatible behavior change, so we need to add another property"?

In practice, such properties are only removed after a very long time, or perhaps even "never." It's not like this change would be added in this release (JDK 14), with backward compatibility support removed in a year (say, JDK 16) along with the property. The property, and the backward compatibility mode, would stick around in the code for many years.

What I want to avoid doing is to introduce behavior changes -- and properties to control them -- in a piecemeal fashion. It looks like we might have three or four candidates already. Would we want to live with three or four properties? If we did this and continued with additional changes, we might end up with six or eight or ten properties over time.

I'd like to see us look ahead a bit and take stock of what changes we're contemplating, and then decide how to deal with compatibility and migration based on that. I'd like to avoid making individual changes (and adding properties) one at a time, with decisions made in isolation, because that will lead to a proliferation of properties.

So, there are two alternatives on the table at this time:
1) A single compatibility property to revert to the old behavior. The property is reused for each of the bugs listed above, so with each fix a portion of the revert logic is added to the property.

PROS:  Easy to implement and maintain.
CONS:  Over time, it can become hard to track what exactly the property controls, so it may be hard to use.  For example, if a user turns on this property to revert a single aspect of behavior, they will also get all the other old-behavior oddities.

2) Individual properties for every change of behavior.

PROS:  If needed, the behavior can be controlled at a fine-grained level.  It is easier to understand what the expected behavior would be for any given set of properties.
CONS:  Complex to maintain.  For the majority of cases it would be overkill.  Also, it can greatly increase the amount of testing (naively up to 2^{# of properties} combinations).

One possible compromise might be to introduce one umbrella property plus a set of individual properties as desired.  All of this can be plugged into a single string property, of course:

jdk.util.regex.mode=strict  # default
jdk.util.regex.mode=compatibility  # turns on all compatibility properties at once
jdk.util.regex.mode=restrictCntrlCharIds=yes,rejectAmbiguousEmbeddedFlags=no  # fine-grained settings

If the changes are implemented carefully, so that the individual properties are "orthogonal", then we wouldn't need to test all possible combinations, but only the two opposite modes: strict and compatibility.
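
To make the idea a bit more concrete, here is a rough sketch of how such a string property might be parsed. It is only an illustration: the class RegexMode, the Feature enum and the parse method are hypothetical names, the two flag names are just the ones from the example above, and I am assuming that any flag not mentioned in a fine-grained setting stays strict.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper, not part of java.util.regex.
class RegexMode {
    // Each constant names one new (strict) behavior that can be reverted.
    enum Feature { restrictCntrlCharIds, rejectAmbiguousEmbeddedFlags }

    // Returns the set of strict behaviors enabled by the given property value.
    static Set<Feature> parse(String value) {
        if (value == null || value.isEmpty() || value.equals("strict")) {
            return EnumSet.allOf(Feature.class);      // default: everything strict
        }
        if (value.equals("compatibility")) {
            return EnumSet.noneOf(Feature.class);     // everything reverted
        }
        // Fine-grained form: name=yes|no[,name=yes|no...]
        // Assumption: flags that are not mentioned stay strict.
        Set<Feature> enabled = EnumSet.allOf(Feature.class);
        for (String item : value.split(",")) {
            String[] kv = item.trim().split("=", 2);
            if (kv.length != 2) {
                throw new IllegalArgumentException("Bad setting: " + item);
            }
            // valueOf throws IllegalArgumentException for unknown names
            Feature feature = Feature.valueOf(kv[0].trim());
            switch (kv[1].trim()) {
                case "yes": enabled.add(feature); break;
                case "no":  enabled.remove(feature); break;
                default: throw new IllegalArgumentException("Bad value: " + item);
            }
        }
        return enabled;
    }

    public static void main(String[] args) {
        String value = System.getProperty("jdk.util.regex.mode", "strict");
        System.out.println(parse(value));
    }
}
```

With something like this, each fix would consult only its own flag, which is what would keep the individual settings independent of each other.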

Do you think it's a viable approach?

--
With kind regards,
Ivan Gerasimov
