On 24 Nov 2005, at 20:26, Erik Hatcher wrote:
There are some older regex implementations in Java, but I
have no idea about the licences and the availability.
Doesn't Apache have one somewhere?

Two actually! ORO and Regexp. Here's ORO - <http://jakarta.apache.org/oro/> (link to Regexp from there)

I'll dig into those soon and see what useful goodies lurk within.

From perusing the API via Javadocs, Regexp appeared to offer just what we need, but I didn't see the same sort of thing with ORO. So I pulled down Jakarta Regexp and dropped it in. I had to add a getter for a package-protected internal "prefix" field to REProgram, but once I did that, here are some passing tests...

    assertEquals(1, getPrefix("a[bc]*"));
    assertEquals(2, getPrefix("a\\$[bc]*"));
    assertEquals(0, getPrefix("r?over"));


  private int getPrefix(String expression) {
    REProgram program = new RECompiler().compile(expression);
    char[] prefix = program.getPrefix();
    return prefix == null ? 0 : prefix.length;
  }

Quite promising! The REProgram has the full parse tree as "instructions", so it'd be possible to use this for clever rotation also, I believe. I'm sure Regexp doesn't support the full Perl5 syntax that Java's regex package does, but it seems to be good enough for the basic regex syntax.

A couple of issues... 1) To use this additional library, (Span) RegexQuery should be pulled into contrib/regex. 2) It'd be a little awkward to use Jakarta Regexp to determine the prefix (and potentially for rotation logic) and then use JDK regex for the actual matching. I have no data to say which has faster matching, or on any other pros/cons, just that the two engines could potentially disagree on what a pattern matches. I'm inclined to swap completely to Jakarta Regexp for matching as well, at least for the time being, in order to keep things in sync and benefit from more clever term enumeration. The time saved in term enumeration seems likely to more than make up for matching speed differences.
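
To make the term enumeration point concrete, here's a rough sketch of why a known literal prefix helps. It is not the actual RegexQuery code: the class name, the matchingTerms helper, and the TreeSet standing in for the sorted term dictionary are all made up for illustration, and it uses JDK regex for the match step only to stay self-contained. The idea is simply to seek to the prefix and stop at the first term that no longer shares it, rather than testing every term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Pattern;

public class PrefixEnumSketch {

  // Hypothetical helper (not part of Lucene or Jakarta Regexp): walks a sorted
  // term set starting at the literal prefix and runs the regex only on terms
  // that share that prefix.
  public static List<String> matchingTerms(SortedSet<String> terms,
                                           String prefix,
                                           Pattern pattern) {
    List<String> hits = new ArrayList<String>();
    // tailSet() skips everything that sorts before the prefix.
    for (String term : terms.tailSet(prefix)) {
      if (!term.startsWith(prefix)) {
        break; // sorted order: no later term can share the prefix
      }
      if (pattern.matcher(term).matches()) {
        hits.add(term);
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    // Stand-in for a term dictionary; the terms are invented for the example.
    SortedSet<String> terms = new TreeSet<String>(
        Arrays.asList("apple", "ab", "abcb", "abd", "b", "rover"));
    // "a" is the literal prefix of "a[bc]*", as in the first test above.
    System.out.println(matchingTerms(terms, "a", Pattern.compile("a[bc]*")));
  }
}
```

With an empty prefix (as for "r?over") the loop degrades to a full scan of the terms, which is the case the prefix extraction is meant to avoid wherever it can.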

Thoughts?

        Erik

