Re: New Lucene QueryParser

Mark Miller Wed, 03 Jan 2007 09:32:20 -0800


Looks like interesting stuff Mark, but why did you make everything so
configurable (syntax-wise)?  IMO, there is a lot of value to
standards, and doing things like changing the precedence of operators
isn't necessarily a good thing :-)

I made it so configurable because I needed to implement a certain query language at work, but I think that the language is not that great. I don't like most of the choices in it. I needed something though, and it was going to require a lot of work... not only did we need arbitrary mixing of boolean and proximity operators, but we needed the sentence and paragraph proximity as well as the thesaurus expansion. We also have many people who ask for one-offs that only apply to their setup, like NEAR being an operator that is really within 10. All of this was not something I could guarantee that I could do (I just entered the workforce), and I certainly didn't have time at work with everything else I needed to do for this project I am working on. I wasn't going to put so much free time into a parser that I did not like though. So I made it very configurable so that it could be configured into the parser I needed while still being the parser I wanted.

Did you ever get a chance to look at Paul's surround language? (I've
never had the chance to dive into it myself)

I have looked into Paul's parser and it is a very nice piece of work. Unfortunately, I needed to duplicate a very specific syntax. Also, Paul's parser would not give me sentence and paragraph proximity or *arbitrary* connecting of boolean and prox operators. That brings me back to why this is so configurable: a big reason is to be able to simulate a syntax that a customer may be familiar with and want retained. I think that order of operations should be standard too, but I see no problem with the standard at someone's site being different than the standard I use for another site. Some people may want/need proximity to bind tighter than ANDNOR, while others might need/want the reverse. Being too configurable has its drawbacks, but I am attempting to create an alternative parser, not a QueryParser replacement. Choose the best weapon for the job ;)
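To make the precedence point concrete, here is a minimal sketch of a per-site precedence table driving a small precedence-climbing grouper. It is only an illustration: the class PrecedenceSketch, its methods, and the numeric levels are hypothetical and are not taken from the parser discussed in this thread. It assumes a well-formed, whitespace-separated query with no parentheses.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the API of the parser described in this thread.
class PrecedenceSketch {
    // Operator name -> binding level; a higher number binds tighter.
    private final Map<String, Integer> precedence = new HashMap<>();
    private String[] tokens;
    private int pos;

    // Each site can install its own levels for the same operators.
    PrecedenceSketch set(String operator, int level) {
        precedence.put(operator, level);
        return this;
    }

    // Returns the query fully parenthesized, to show how the operators grouped.
    // Assumes well-formed input of the form: term (operator term)*
    String group(String query) {
        tokens = query.trim().split("\\s+");
        pos = 0;
        return parse(0);
    }

    private String parse(int minLevel) {
        String left = tokens[pos++];
        while (pos < tokens.length) {
            String op = tokens[pos];
            Integer level = precedence.get(op);
            // Stop at unknown tokens or at operators that bind more loosely
            // than the caller allows.
            if (level == null || level < minLevel) {
                break;
            }
            pos++;
            // level + 1 makes operators of equal level left-associative.
            String right = parse(level + 1);
            left = "(" + left + " " + op + " " + right + ")";
        }
        return left;
    }
}
```

With NEAR set above AND, "a AND b NEAR c" groups as (a AND (b NEAR c)); swapping the two levels groups the same input as ((a AND b) NEAR c), which is the kind of per-site difference described above.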

Query-time thesaurus expansion / General token to query expansion :
Takes advantage of a general find/replace feature, "expand" might map to
"(expander | expanded)" ... or any other valid syntax.


The QueryParser does this instead of TokenFilters?
Is it based on static configuration?

I do not use TokenFilters as they do not fit my requirements (I think). Right now, a hashmap is used to map a token to replacement syntax. A queryparser is generated from a parserfactory. The parserfactory takes a configuration class. When you get a queryparser from the factory you can choose to inherit the config from the factory, or you can just set the options and configuration directly on the parser. I did this because I have a need for a base configuration to a common syntax that individual accounts then want to be able to tweak to their needs.
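A rough sketch of that arrangement, for illustration only: a map from token to replacement syntax held in a configuration object, and a factory whose parsers either start from a copy of the base configuration or receive their own. All names here (ExpansionConfig, SketchParser, SketchParserFactory) are hypothetical stand-ins, and copying the base config is just one assumed way to implement "inherit".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical names throughout; not the API of the parser described in this thread.
class ExpansionConfig {
    // Token -> replacement syntax, e.g. "expand" -> "(expander | expanded)".
    private final Map<String, String> expansions = new HashMap<>();

    ExpansionConfig() {}

    // Copy constructor, so a parser can tweak its settings without touching the base.
    ExpansionConfig(ExpansionConfig base) {
        expansions.putAll(base.expansions);
    }

    void addExpansion(String token, String replacementSyntax) {
        expansions.put(token, replacementSyntax);
    }

    // Tokens without a mapping are returned unchanged.
    String expand(String token) {
        return expansions.getOrDefault(token, token);
    }
}

class SketchParser {
    private final ExpansionConfig config;

    SketchParser(ExpansionConfig config) {
        this.config = config;
    }

    ExpansionConfig config() {
        return config;
    }

    // Splits on whitespace only and substitutes each mapped token.
    String expandQuery(String query) {
        StringBuilder out = new StringBuilder();
        for (String token : query.trim().split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(config.expand(token));
        }
        return out.toString();
    }
}

class SketchParserFactory {
    private final ExpansionConfig baseConfig;

    SketchParserFactory(ExpansionConfig baseConfig) {
        this.baseConfig = baseConfig;
    }

    // Inherit: the parser starts from a private copy of the factory's base config.
    SketchParser newParser() {
        return new SketchParser(new ExpansionConfig(baseConfig));
    }

    // Or configure the parser directly with its own settings.
    SketchParser newParser(ExpansionConfig own) {
        return new SketchParser(own);
    }
}
```

Because each inherited parser holds a copy, an account-specific tweak made through config().addExpansion(...) stays local to that parser and does not leak into the shared base configuration.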

The queryparser is a two pass system. The first pass does not tokenize... it does query expansion and preps the suggested query (the suggested query must be suggested in the syntax the query was typed in, and without expansion). I had worried about speed when I made the 2 pass decision, but it has allowed me great flexibility, and with my testing so far I have had 0 speed problems.
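The two passes might be sketched roughly as follows. This is an assumption-heavy illustration, not the actual implementation: the names FirstPassResult and TwoPassSketch are hypothetical, the "suggested query" is assumed here to be a spelling-style suggestion driven by a simple corrections map, and the second-pass tokenizer only knows about parentheses and "|".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the real parser's passes are not shown in this thread.
class FirstPassResult {
    final String expandedQuery;   // input to the second pass
    final String suggestedQuery;  // shown to the user: typed syntax, no expansion

    FirstPassResult(String expandedQuery, String suggestedQuery) {
        this.expandedQuery = expandedQuery;
        this.suggestedQuery = suggestedQuery;
    }
}

class TwoPassSketch {
    // Pass 1: no real tokenizing, just whitespace-separated substitution.
    // The corrections map is an assumed stand-in for whatever produces suggestions.
    static FirstPassResult firstPass(String raw,
                                     Map<String, String> expansions,
                                     Map<String, String> corrections) {
        StringBuilder expanded = new StringBuilder();
        StringBuilder suggested = new StringBuilder();
        for (String word : raw.trim().split("\\s+")) {
            if (expanded.length() > 0) {
                expanded.append(' ');
                suggested.append(' ');
            }
            expanded.append(expansions.getOrDefault(word, word));
            suggested.append(corrections.getOrDefault(word, word));
        }
        return new FirstPassResult(expanded.toString(), suggested.toString());
    }

    // Pass 2: tokenize the expanded text; '(', ')' and '|' become their own tokens.
    static List<String> secondPass(String expanded) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (char c : expanded.toCharArray()) {
            boolean symbol = c == '(' || c == ')' || c == '|';
            if (symbol || Character.isWhitespace(c)) {
                if (word.length() > 0) {
                    tokens.add(word.toString());
                    word.setLength(0);
                }
                if (symbol) {
                    tokens.add(String.valueOf(c));
                }
            } else {
                word.append(c);
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }
}
```

Keeping the suggestion separate from the expanded string is what lets the suggestion stay in the syntax the user typed, while only the expanded form is handed to the tokenizing pass.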

By the way, I have recently tested the paragraph/sentence proximity searching (mark within 4 sentences of dog) on a 300k doc index (docs 8-20k) and the perceived speed was as fast as a normal one or two word boolean search (not a very scientific test :))

A problem with the paragraph/sentence proximity search right now is that if there is only 1 doc in the index the proximity search will wrap. I am sure this can be fixed.


- Mark
