Hi Timo, Sherman,
Thanks for looking at this.
Sherman wrote:
This might practically put the api itself almost useless? it might be an easy
task to spot
whether or not it's a 0-width-match-possible regex when the regex is simple,
but it gets
harder and harder, if not impossible when the regex gets complicated,
especially consider
the possible use scenario that the use site is embedded deeply inside a
library implementation.
Well, not "useless", but perhaps less useful than one might like. :-)
I think this is potentially surprising behavior, which is why I at least wanted
to add the note. It's not clear to me whether we should try to fix this by
changing Scanner though.
Essentially, findAll() is defined in terms of findWithinHorizon(pattern, 0). So
if one were to write a loop like so:
String str;
while ((str = scanner.findWithinHorizon(pattern, 0)) != null) {
System.out.println(str);
}
then this loop would have the same problem if pattern were to match zero
characters.
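To make that concrete without actually hanging, here is a minimal sketch of the same loop with an iteration cap added. The class name, the boundedFind helper, the cap, and the "x*" pattern are all made up for illustration; only Scanner.findWithinHorizon() comes from the discussion above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class ZeroWidthHorizonDemo {
    // Hypothetical helper: runs the findWithinHorizon() loop from above,
    // but stops after maxCalls results so a zero-width pattern cannot spin forever.
    static List<String> boundedFind(String input, Pattern pattern, int maxCalls) {
        List<String> results = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            String str;
            while (results.size() < maxCalls
                    && (str = scanner.findWithinHorizon(pattern, 0)) != null) {
                results.add(str);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // "x*" can match zero characters at the start of "abc", so the
        // scanner never moves and every call returns the empty string.
        List<String> results = boundedFind("abc", Pattern.compile("x*"), 5);
        System.out.println(results.size() + " results, all empty: "
                + results.stream().allMatch(String::isEmpty));
    }
}
```

Without the cap, the loop would keep returning the empty string and never reach null.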
The alternative is to "fix" it, maybe as what Matcher.find() does, if the
previous match is
zero-width-match (the first==last), we step one to the next cursor before next
try. I know
Interesting, I didn't know Matcher.find() advances the cursor like this. But
Scanner.findWithinHorizon() apparently doesn't, so that's why an infinite loop
can occur.
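For comparison, a small sketch of the Matcher side. The class name, the allMatches helper, and the "a*" / "baab" inputs are illustrative assumptions; the point is only that a plain Matcher.find() loop terminates even though the pattern can match zero characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherZeroWidthDemo {
    // Hypothetical helper: collects "start-end:text" for every match found.
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        // After a zero-width match, find() starts the next search one
        // character further along, so this loop always reaches the end.
        while (m.find()) {
            out.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // "a*" matches the empty string wherever there is no 'a'.
        System.out.println(allMatches("a*", "baab"));
    }
}
```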
Scanner.findPatternInBuffer() is setting new region set every time it is
invoked which makes
it complicated, but I would assume it might be still worth a trying? for
example, utilize the
"hasNextResult"/matcher.end(). I'm not sure without looking into the code, does
while (hasNext(pattern)) {
next(pattern);
}
have the same issue, when pattern matches 0-width?
No, this doesn't have the problem, because hasNext(pat) and next(pat) match
delimited tokens. Each call to next() implicitly advances past the next
delimiter to reach the subsequent token, if any.
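A quick sketch of that token-based loop, assuming the default whitespace delimiter. The class name, the tokens helper, and the "\\w*" pattern are illustrative only. Even though "\\w*" can match the empty string, the loop finishes because each next() consumes a whole delimited token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class TokenLoopDemo {
    // Hypothetical helper: the hasNext(pattern)/next(pattern) loop from above.
    static List<String> tokens(String input, Pattern pattern) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(input)) {
            // The pattern is matched against each complete token, and
            // next() moves past the token, so the scanner always advances.
            while (sc.hasNext(pattern)) {
                out.add(sc.next(pattern));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("ab cd ef", Pattern.compile("\\w*")));
    }
}
```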
On 3/30/17 8:56 AM, Timo Kinnunen wrote:
I guess this somewhat contrived example also wouldn’t work?
String s = "\\b\\w+|\\G|\\B";
String t = "Matcher m = Pattern.compile(s).matcher(t);\n";
Matcher m = Pattern.compile(s).matcher(t);
while(m.find()) {
System.out.println("'" + m.group() + "'");
}
Right, so if you rewrote this loop to use Scanner.findWithinHorizon() instead of
Matcher,
Scanner sc = new Scanner(t);
String str;
while ((str = sc.findWithinHorizon(s, 0)) != null) {
System.out.println("'" + str + "'");
}
you'd get an infinite loop with str being continually assigned the empty string.
As Sherman mentioned, Matcher.find() advances the cursor after a zero-width
match, which avoids this problem.
* * *
This didn't come up in the code review thread, which was mostly about concurrent
modification and late-binding of the spliterator:
http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-September/035034.html
I remember noting this phenomenon a while back, which is why I had filed the bug
to add a note. I seem to remember discussing it, though it might have been
in a meeting or in a hallway conversation.
This bug (JDK-8150488) does note that an infinite stream might be unexpected or
surprising, but it's not a fatal problem. It can be terminated with limit(). It
can also be terminated with takeWhile(), also added in JDK 9. Maybe I could
mention these in the API note.
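Roughly what those two workarounds might look like, as a sketch only. The class name, both helpers, and the sample patterns are made up for illustration, and it needs JDK 9 or later for findAll() and takeWhile(). The capped count is deliberately not pinned to an exact value, since it depends on how the running JDK handles zero-width matches in findAll().

```java
import java.util.List;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class FindAllTerminationDemo {
    // Hypothetical helper: caps the stream with limit(), so it ends
    // whether or not the underlying stream of matches is infinite.
    static long cappedCount(String input, Pattern pattern, long cap) {
        try (Scanner sc = new Scanner(input)) {
            return sc.findAll(pattern).limit(cap).count();
        }
    }

    // Hypothetical helper: stops at the first empty match using takeWhile().
    static List<String> untilEmpty(String input, Pattern pattern) {
        try (Scanner sc = new Scanner(input)) {
            return sc.findAll(pattern)
                     .map(MatchResult::group)
                     .takeWhile(s -> !s.isEmpty())
                     .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        // At most 5 matches are consumed, however many the stream could produce.
        System.out.println(cappedCount("abc", Pattern.compile("x*"), 5));
        // "[a-z]*" matches "abc" and then the empty string at the space.
        System.out.println(untilEmpty("abc 123", Pattern.compile("[a-z]*")));
    }
}
```

Note that takeWhile() on emptiness also drops any non-empty matches that would have come after the first empty one, so it is a way to stop the stream, not a general fix.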
I guess we could also consider changing the implicit findWithinHorizon() loop
that findAll() does, perhaps by having it terminate on a zero-width match. Or we
could even change findWithinHorizon's behavior if it gets a zero-width match,
similar to what Matcher.find() does. But I'm quite reluctant to start making such
changes at this point.
s'marks
// Outputs:
// 'Matcher'
// ''
// 'm'
// ''
// ''
// ''
// 'Pattern'
// ''
// 'compile'
// ''
// 's'
// ''
// ''
// 'matcher'
// ''
// 't'
// ''
// ''
// ''
// ''
From: Xueming Shen
Sent: Thursday, March 30, 2017 05:41
To: [email protected]
Subject: Re: JDK 9 RFR(s): 8150488: add note to Scanner.findAll() regarding possible infinite streams
On 3/29/17, 5:56 PM, Stuart Marks wrote:
Hi all,
Please review these non-normative textual additions to the
Scanner.findAll() method docs. These methods were added earlier in JDK
9; there's a small pitfall if the regex can match zero characters.
Stuart,
This might practically put the api itself almost useless? it might be an
easy task to spot
whether or not it's a 0-width-match-possible regex when the regex is
simple, but it gets
harder and harder, if not impossible when the regex gets complicated,
especially consider
the possible use scenario that the use site is embedded deeply inside a
library implementation.
The alternative is to "fix" it, maybe as what Matcher.find() does, if
the previous match is
zero-width-match (the first==last), we step one to the next cursor before
next try. I know
Scanner.findPatternInBuffer() is setting new region set every time it is
invoked which makes
it complicated, but I would assume it might be still worth a trying? for
example, utilize the
"hasNextResult"/matcher.end(). I'm not sure without looking into the
code, does
while (hasNext(pattern)) {
next(pattern);
}
have the same issue, when pattern matches 0-width?
Thanks!
-Sherman
Thanks,
s'marks
# HG changeset patch
# User smarks
# Date 1490749958 25200
# Tue Mar 28 18:12:38 2017 -0700
# Node ID 6b43c4698752779793d58813f46d3687c17dde75
# Parent fb54b256d751ae3191e9cef42ff9f5630931f047
8150488: add note to Scanner.findAll() regarding possible infinite streams
Reviewed-by: XXX
diff -r fb54b256d751 -r 6b43c4698752 src/java.base/share/classes/java/util/Scanner.java
--- a/src/java.base/share/classes/java/util/Scanner.java Mon Mar 27 15:12:01 2017 -0700
+++ b/src/java.base/share/classes/java/util/Scanner.java Tue Mar 28 18:12:38 2017 -0700
@@ -2808,6 +2808,10 @@
* }
* }</pre>
*
+ * <p>The pattern must always match at least one character. If the pattern
+ * can match zero characters, the result will be an infinite stream
+ * of empty matches.
+ *
* @param pattern the pattern to be matched
* @return a sequential stream of match results
* @throws NullPointerException if pattern is null
@@ -2829,6 +2833,11 @@
* scanner.findAll(Pattern.compile(patString))
* }</pre>
*
+ * @apiNote
+ * The pattern must always match at least one character. If the pattern
+ * can match zero characters, the result will be an infinite stream
+ * of empty matches.
+ *
* @param patString the pattern string
* @return a sequential stream of match results
* @throws NullPointerException if patString is null