Looks like interesting stuff Mark, but why did you make everything so
configurable (syntax-wise)?  IMO, there is a lot of value to
standards, and doing things like changing the precedence of operators
isn't necessarily a good thing :-)

I made it so configurable because I needed to implement a certain query language at work, but I think that the language is not that great. I don't like most of the choices in it. I needed something though, and it was going to require a lot of work...not only did we need arbitrary mixing of boolean and proximity operators, but we needed the sentence and paragraph proximity as well as the thesaurus expansion. We also have many people who ask for one offs that only apply to their setup, like NEAR being an operator that is really within 10. All of this was not something I could guarantee that I could do (I just entered the workforce), and I certainly didn't have time at work with everything else I needed to do for this project I am working on. I wasn't going to put so much free time into a parser that I did not like though. So I made it very configurable so that it could be configured into the parser I needed while still being the parser I wanted.

Did you ever get a chance to look at Paul's surround language? (I've
never had the chance to dive into it myself)

I have looked into Paul's parser and it is a very nice piece of work. Unfortunately, I needed to duplicate a very specific syntax. Also, Paul's parser would not give me sentence and paragraph proximity or <i>arbitrary</i> connecting of boolean and prox operators. That brings me back to why this is so configurable: a big reason is to be able to simulate a syntax that a customer may be familiar with and want retained. I think that order of operations should be standard too, but I see no problem with the standard at someones site being different than the standard I use for another site. Some people may want/need proximity to bind tighter than ANDNOR, while others might need/want the reverse. Being too configurable has it's draw backs, but I am attempting to create an alternative parser, not a QueryParser replacement. Choose the best weapon for the job ;)

Query-time thesaurus expansion / General token to query expansion :
Takes advantage of a general find/replace feature, "expand" might map to
"(expander | expanded)" ... or any other valid syntax.

The QueryParser does this instead of TokenFilters?
Is it based on static configuration?

I do not use TokenFilters as it does not fit my requirements (I think). Right now, a hashmap is used to map a token to replacement syntax. A queryparser is generated from a parserfactory. The parserfactory takes a configuration class. When you get a queryparser from the factory you can choose to inherit the config from the factory, or you can just set the options and configuration directly on the parser. I did this because I have a need for a base configuration to a common syntax that individual accounts than want to be able to tweak to their needs.

The queryparser is a two pass system. The first pass does not tokenize...it does query expansion and preps the suggested query (the suggested query must be suggested in the syntax the query was typed in, and without expansion). I had worried about speed when I made the 2 pass decision, but it has allowed me great flexibility, and with my testing so far I have had 0 speed problems.

By the way, I have recently tested the paragraph/sentence proximity searching (mark within 4 sentences of dog) on a 300k doc index (docs 8-20k) and the perceived speed was as fast as a normal one or two work boolean search (not a very scientific test :))

A problem with the paragraph/sentence proximity search right now is that if there is only 1 doc in the index the proximity search will wrap. I am sure this can be fixed.

- Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to