On Thursday 24 November 2005 10:25, Erik Hatcher wrote: > > On 24 Nov 2005, at 03:17, Paul Elschot wrote: > >> I must admit that I haven't used the surround parser. For my custom > >> parser (a legacy syntax that no one here would want), I take any term > >> that has an *, ?, or [...] syntax as a regex term. > > > > I had another look at the javadocs of java regex package. > > The normal brackets in a regex are not needed for queries, so they > > can be left as they are. > > I don't understand what you mean about brackets not being needed. > Brackets certainly factor in greatly into regular expressions: > > <http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html> > > What am I missing?
Capturing groups and special contexts need normal brackets (). Capturing groups are used for replacements, and I don't see a use for that in a query language. Special constructs with () brackets are used for non capturing groups, match flags, and lookahead/lookbehind. Would you know a use for these in a query language? > >> There are still some TODO's with the (Span)RegexQuery - such as being > >> wise about the prefix length. Right now it is not wise enough. I've > >> spent some time looking for a regex parser that could parse a regex > >> expression into an AST so that it could be used for determining the > >> last static character to start term enumeration. This would also > >> come in very handy in being able to rotate a regular expression > >> string to maximize the static prefix when indexing with an analyzer > >> that rotates terms. If anyone has suggestions/pointers to how this > >> could be accomplished, it'd be most appreciated! > > > > I think I'll simply treat each term as a potential regex and > > use alphanumeric characters for the prefix. I'll try and leave > > parsing of the regex to the java regex package as much as possible. > > Simply using alphanumeric to determine the prefix doesn't work in all > cases though. Here's an example I just added, commented out, to > TestRegexQuery: > > assertEquals(1, regexQueryNrHits("r?over")); > > The 'r' is optional, yet it is chosen as a prefix and thus no > documents match even though "over" does match that regex. The quick You're right, I missed the suffix meaning of ? . > fix for this is, like FuzzyQuery, to enumerate all terms blindly. > Another option is to allow the user of (Span)RegexQuery to provide > the prefix, at least pushing the burden back a bit. > > I still think a regex parser would be mighty useful for this issue, > as well as the next... > > > Rotating from the suffix should also be straightforward for > > alphanumerical chars. > > It's not as trivial as this, if you want the rotated prefixes to be > maximized. Take, for example, the regex expression > > ThisContainsTwo[abc]RegexPieces.*Total > > Using rotation to maximize the prefix and speeding up term > enumeration a calculation of of the maximum non-regex piece is > needed, including a calculation on whether the head and tail combined > make a larger prefix. For example, using '$' to denote the end of > the string, the rotated version of this should be: > > Total$ThisContainsTwo[abc]RegexPieces.* > > With a regex parse tree, it should be possible to be wise about what > is a static prefix and to compute the size of all the static pieces > allowing for clever rotation to make regex queries as efficient as > possible. Now where is that regex parser grammar?! :/ I also missed things like \u2014, which only add to the problem. There are some older regex implementations in java, but I have no idea about the licences and the availabiility. Doesn't apache have one somewhere? Without a regex parser a prefix regex operator like PRE(prefix,regex) in the surround parser would just do what is needed, except for the comma and the brackets, and a regular expression can work around these by using \xhh or \0nn . Btw. $ also has a special meaning in regexes. Regards, Paul Elschot --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]