On 24 Nov 2005, at 20:26, Erik Hatcher wrote:
There are some older regex implementations in Java, but I
have no idea about the licences and the availability.
Doesn't Apache have one somewhere?
Two actually! ORO and Regexp. Here's ORO - <http://jakarta.apache.org/oro/>
(link to Regexp from there)
I'll dig into those soon and see what useful goodies lurk within.
From perusing the API via Javadocs, Regexp mentioned just what we
need, but I didn't see the same sort of thing with ORO. So I pulled
down Jakarta Regexp and dropped it in. I had to add a getter for the
package-private internal "prefix" field of REProgram, but once I did
that, here are some passing tests...
    assertEquals(1, getPrefix("a[bc]*"));
    assertEquals(2, getPrefix("a\\$[bc]*"));
    assertEquals(0, getPrefix("r?over"));

    private int getPrefix(String expression) {
      REProgram program = new RECompiler().compile(expression);
      char[] prefix = program.getPrefix();
      return prefix == null ? 0 : prefix.length;
    }
Quite promising! The REProgram has the full parse tree as
"instructions", so it'd be possible to use this for clever rotation
also, I believe. I'm sure Regexp doesn't support the full Perl5
syntax that Java's regex package does, but it seems to be good enough
for the basic regex syntax.
A couple of issues... 1) to use this additional library, (Span)RegexQuery
should be pulled into contrib/regex, 2) It'd be a little awkward to use
Jakarta Regexp to determine the prefix (and potentially for rotation
logic) and then use JDK regex for the actual matching. I have no data
to say which has faster matching, or other pros/cons, just that the two
engines could potentially disagree on what matches. I'm
inclined to swap completely to Jakarta Regexp for matching as well,
at least for the time being in order to keep things in sync and
benefit from more clever term enumeration. The time saved in term
enumeration seems likely to more than make up for matching speed
differences.
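
To make the term enumeration point concrete, here is a minimal sketch of how a known literal prefix narrows the scan. It rests on assumptions of mine: a sorted in-memory set stands in for the index's term dictionary, JDK regex does the matching purely to keep the example self-contained, and the class and method names (PrefixEnumerationSketch, matchingTerms) are hypothetical, not anything in Lucene or Jakarta Regexp.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Pattern;

public class PrefixEnumerationSketch {

    // Hypothetical helper: collect the terms that fully match the pattern,
    // visiting only the terms that start with the known literal prefix.
    static List<String> matchingTerms(SortedSet<String> terms, String prefix, Pattern pattern) {
        List<String> hits = new ArrayList<>();
        // tailSet seeks to the first term >= prefix and skips everything before it.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can carry the prefix
            }
            if (pattern.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumed stand-in for a term dictionary; real terms would come from the index.
        SortedSet<String> terms = new TreeSet<>(
                Arrays.asList("aardvark", "ab", "abc", "abd", "b", "rover"));

        // Prefix "a": the scan stops at "b" and never looks at "rover".
        System.out.println(matchingTerms(terms, "a", Pattern.compile("a[bc]*")));

        // Empty prefix (as with "r?over"): every term has to be examined.
        System.out.println(matchingTerms(terms, "", Pattern.compile("r?over")));
    }
}
```

With an empty prefix the loop degenerates to a full scan, which is why a pattern like "r?over" gains nothing from this and would need something like rotation instead.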
Thoughts?
Erik