Looks like interesting stuff Mark, but why did you make everything so
configurable (syntax-wise)? IMO, there is a lot of value to
standards, and doing things like changing the precedence of operators
isn't necessarily a good thing :-)
I made it so configurable because I needed to implement a certain query
language at work, but I think that the language is not that great. I
don't like most of the choices in it. I needed something though, and it
was going to require a lot of work...not only did we need arbitrary
mixing of boolean and proximity operators, but we needed the sentence
and paragraph proximity as well as the thesaurus expansion. We also have
many people who ask for one offs that only apply to their setup, like
NEAR being an operator that is really within 10. All of this was not
something I could guarantee that I could do (I just entered the
workforce), and I certainly didn't have time at work with everything
else I needed to do for this project I am working on. I wasn't going to
put so much free time into a parser that I did not like though. So I
made it very configurable so that it could be configured into the parser
I needed while still being the parser I wanted.
Did you ever get a chance to look at Paul's surround language? (I've
never had the chance to dive into it myself)
I have looked into Paul's parser and it is a very nice piece of work.
Unfortunately, I needed to duplicate a very specific syntax. Also,
Paul's parser would not give me sentence and paragraph proximity or
<i>arbitrary</i> connecting of boolean and prox operators. That brings
me back to why this is so configurable: a big reason is to be able to
simulate a syntax that a customer may be familiar with and want
retained. I think that order of operations should be standard too, but I
see no problem with the standard at someones site being different than
the standard I use for another site. Some people may want/need proximity
to bind tighter than ANDNOR, while others might need/want the reverse.
Being too configurable has it's draw backs, but I am attempting to
create an alternative parser, not a QueryParser replacement. Choose the
best weapon for the job ;)
Query-time thesaurus expansion / General token to query expansion :
Takes advantage of a general find/replace feature, "expand" might map to
"(expander | expanded)" ... or any other valid syntax.
The QueryParser does this instead of TokenFilters?
Is it based on static configuration?
I do not use TokenFilters as it does not fit my requirements (I think).
Right now, a hashmap is used to map a token to replacement syntax. A
queryparser is generated from a parserfactory. The parserfactory takes a
configuration class. When you get a queryparser from the factory you can
choose to inherit the config from the factory, or you can just set the
options and configuration directly on the parser. I did this because I
have a need for a base configuration to a common syntax that individual
accounts than want to be able to tweak to their needs.
The queryparser is a two pass system. The first pass does not
tokenize...it does query expansion and preps the suggested query (the
suggested query must be suggested in the syntax the query was typed in,
and without expansion). I had worried about speed when I made the 2 pass
decision, but it has allowed me great flexibility, and with my testing
so far I have had 0 speed problems.
By the way, I have recently tested the paragraph/sentence proximity
searching (mark within 4 sentences of dog) on a 300k doc index (docs
8-20k) and the perceived speed was as fast as a normal one or two work
boolean search (not a very scientific test :))
A problem with the paragraph/sentence proximity search right now is that
if there is only 1 doc in the index the proximity search will wrap. I am
sure this can be fixed.
- Mark
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]