There's a strong expectation that ctrl-A and ctrl-a both map to \u0001, so I support Ivan's initiative. I'm surprised java regex gets this wrong. Might need a transitional system property.
What's the best bit-twiddle? Untested: if ((c ^= 0x40) < 0x20) return c; if ((c ^=0x20) <= 26 && c > 0) return c; 0x40 is more readable than 64. Does ctrol-? get mapped to 0x7f ? On Thu, Aug 29, 2019 at 8:15 PM Ivan Gerasimov <ivan.gerasi...@oracle.com> wrote: > Hi Stuart! > > Thank you for your comments on the proposal! > I totally agree with you that the fixes that result in changing the > behavior have to be carefully planned and well thought. > Please see my answers inline. > > On 8/29/19 6:15 PM, Stuart Marks wrote: > > Hi Ivan, > > > > This change certainly makes regex patterns more rigorous, but I'm > > concerned about the compatibility. This is a spec change and also a > > behavior change. While the current behavior might not strictly be > > correct, it does have some characteristics that applications might be > > depending on -- perhaps even by accident. If this change is made, it > > might cause subtle issues in applications that would be quite > > difficult to diagnose. > > > There are two types of changes in the proposal: > First, make \cx construct more restrictive w.r.t. possible values of x. > Second, make it case-insensitive, so \cz will mean the same as \cZ. > > I agree that the later type of changes can potentially cause hard to > diagnose failures in the existing applications. > The reason for this later change was an attempt to make Java regexp more > Perl-compatible (and Perl does exactly this: Treats \cx with a > lower-case x as \cX with its upper-case counterpart). > > The former type of change, on the other hand, might actually be useful > for the existing applications, as it may allow to see (otherwise hard to > find) bugs in the code. > > For example, if some existing application has a regexp, which contains > "\\c\t", this is most likely a programming error (or a nasty hack to > code a char 'I'), so it would be beneficial to report it via throwing a > PatternSyntaxException. > > > > Examples of changes I'm concerned about are: > > > > pattern \ca currently matches '!' would now match \u0001 > > pattern \cÀ currently matches \u0080 would now throw exception > > pattern \c0 currently matches 'p' would now throw exception > > > > and so forth. That is, using \c with characters in the range [a-z] > > would now match different characters from before, and using \c with > > characters outside the set that correspond to C0 control characters > > would now throw an exception whereas before they would matching > > something that was predictable, if in some sense incorrect. > > > > There are some ways to mitigate the incompatibility, for example, by > > adding a system property, or by adding a Pattern flag that explicitly > > enables this behavior, though I'm not sure that either is worthwhile. > > Maybe there are less intrusive ways that I haven't thought of. > > > I think it would make sense to first separate the changes mentioned > above, so it will be easier to reason about them. > > While I think that making lower-case \\cx a synonym for the upper-case > \\cX is a good thing to have (mainly for additional compatibility with > Perl-style regexps), it's supposed be deferred to a later time. > > > > > The current behavior seems to be have been established around 1999 > > (JDK 1.3?) so it's been around a long time, plenty of time for > > applications to have formed inadvertent dependencies on this behavior. > > An alternative would be simply to document the current behavior, even > > though it's arguably incorrect. > > > > Is there some benefit to this change, for example, does it enable one > > to write an application that wasn't possible before because of this bug? > > > In my opinion, the benefits are: > 1) Fail-early approach, which will let the developers catch the bugs > earlier, > 2) Better compatibility with Perl-style regexps, so that a wider class > of regular expressions can be used across different platforms, > 3) Well defined strict rules will allow IDEs to implement additional > edit-time checks of regexp, which will again help developers. > > Please find the updated webrev here: > http://cr.openjdk.java.net/~igerasim/8230365/01/webrev/ > > With kind regards, > Ivan > > > > s'marks > > > > > > On 8/29/19 4:39 PM, Ivan Gerasimov wrote: > >> Hello! > >> > >> In a regular expression pattern a sequence of the form \\cx is > >> allowed to specify a control character that corresponds to the name > >> char x. > >> > >> Current implementation has a few issues with that: > >> 1) It allows x to be just any character, including non-printable ones; > >> 2) The produced regexp may correspond to a non-control characters; > >> 3) The expression is case-sensitive, so, for example \\cA differs > >> from \\ca, while they both have to produce ctrl-A. > >> > >> It is proposed to make parsing more strict and reject invalid values > >> of x, and also clarify the documentation to explicitly list valid > >> values of x. > >> > >> If we agree on this proposal, then a CSR will probably need to be > >> filed to capture the changes in the regexp parsing. > >> > >> Would you please help review the fix? > >> > >> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8230365 > >> WEBREV: http://cr.openjdk.java.net/~igerasim/8230365/00/webrev/ > >> > > > -- > With kind regards, > Ivan Gerasimov > >