On Friday 02 December 2005 16:03, mark harwood wrote:
> There seems to be a growing gap between Lucene
> functionality and the query language offered by
> QueryParser (eg no support for regex queries, span
> queries, "more like this", filter queries,
> minNumShouldMatch etc etc).
>
> Closing this gap is hard when:
> a) The availability of Javacc+Lucene skills is a
> bottleneck
> b) The syntax of the query language makes it difficult
> to add new features eg rapidly running out of "special
> characters"
>
> I don't think extending the existing query
> parser/language is necessarily useful and I see it
> being used purely to support the classic "simple
> search engine" syntax.
>
> Unfortunately the fall-back position for applications
> which require more complex queries is to "just write
> some Java code to instantiate the Query objects
> programmatically." This is OK but I think there is
> value in having an advanced search syntax capable of
> supporting the latest Lucene features and expressed in
> XML. It's worth considering why it's useful to have a
> String-representable form for queries:
> 1) Queries can be stored eg in audit logs or "saved
> queries" used for tasks like auto-categorization
> 2) Clients built in languages other than Java can
> issue queries to a Lucene server
> 3) I can decouple a request from the code that
> implements the query when distributing software e.g my
> applet may not want Lucene dragging down to the client
>
> Currently we cannot easily do the above for any
> "complex" queries because they are not easily
> persisted (yes, we could serialize Query objects but
> that seems messy and does not solve points 2 and 3).
>
> We can potentially use XML in the same way ANT does
> i.e. a declarative way of invoking an extensible list
> of Java-implemented features. A query interpreter is
> used to instantiate the configured Java Query objects
> and populates them with settings from the XML in a
> generic fashion (using reflection) eg:
> ....
>   <MoreLikeThis minNumberShouldMatch="3" maxQueryTerms="30">
>     <text>
>       Lorem ipsum dolor sit amet, consectetuer adipiscing
>       elit. Morbi eget ante blandit quam faucibus posuere.
>       Vivamus porta, elit fringilla venenatis consequat,
>       neque lectus gravida dolor, sed cursus nunc elit non
>       lorem. Nullam congue orci id eros. Nunc aliquet
>       posuere enim.
>     </text>
>   </MoreLikeThis>
> </BooleanClause>
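To make the reflection idea in the quoted message concrete, here is a minimal sketch using only the JDK. It assumes a setter-per-attribute convention, which the message does not specify. `MoreLikeThisSpec` and `XmlQueryInterpreter` are hypothetical names invented for this illustration; they are not Lucene classes, and a real interpreter would build actual Query objects.

```java
import java.io.ByteArrayInputStream;
import java.lang.reflect.Method;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical stand-in for a configurable query object; NOT a Lucene class. */
class MoreLikeThisSpec {
    int minNumberShouldMatch;
    int maxQueryTerms;
    String text;

    public void setMinNumberShouldMatch(int n) { minNumberShouldMatch = n; }
    public void setMaxQueryTerms(int n) { maxQueryTerms = n; }
    public void setText(String t) { text = t; }
}

/** Hypothetical interpreter: copies XML attributes and child elements onto a bean. */
class XmlQueryInterpreter {

    /** Parses one XML element and populates a new instance of 'type' via its setters. */
    static <T> T build(String xml, Class<T> type) throws Exception {
        Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        T target = type.getDeclaredConstructor().newInstance();

        // Each attribute, e.g. maxQueryTerms="30", maps to setMaxQueryTerms(30).
        NamedNodeMap attrs = root.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i);
            apply(target, a.getNodeName(), a.getNodeValue());
        }
        // Each child element, e.g. <text>...</text>, maps to setText("...").
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node c = children.item(i);
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                apply(target, c.getNodeName(), c.getTextContent().trim());
            }
        }
        return target;
    }

    /** Finds a one-argument setter by name; only int and String parameters are handled. */
    private static void apply(Object target, String name, String value) throws Exception {
        String setter = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
        for (Method m : target.getClass().getMethods()) {
            if (m.getName().equals(setter) && m.getParameterCount() == 1) {
                Class<?> p = m.getParameterTypes()[0];
                Object arg = (p == int.class) ? (Object) Integer.valueOf(value) : value;
                m.invoke(target, arg);
                return;
            }
        }
        throw new IllegalArgumentException("No setter for '" + name + "'");
    }
}
```

Calling `XmlQueryInterpreter.build(xml, MoreLikeThisSpec.class)` on the `<MoreLikeThis>` element would fill in the two integer settings and the text. Nested elements such as the enclosing clause are left out of this sketch.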
Quidquid id est ... Do we have a Latin analyzer?

> Do people feel this would be a worthwhile endeavour?
> I'm not sure if enough people feel pain around the
> points 1-3 outlined above to make it worth pursuing.

There are at least two more issues:

Some queries can be nested inside others, but some nesting
combinations cannot be searched. For example, it is not possible
to have a BooleanQuery inside a PhraseQuery. How should these
be dealt with?

XML is not readable or writable by most of the people who could
make good use of the extra power in the gap left open by the
default query language. See also:
http://ciir.cs.umass.edu/irdemo/inqinfo/inqueryhelp.html
Do you want to decouple (as above) at the human interface?

There is also the contrib/surround query language.
This language avoids special characters by using prefix operators.
Adding prefix operators like this is straightforward:

moreLikeThis(3, 30, termList(Lorem ipsum dolor sit amet))

For practical use, this could be simplified to:

mlt(3, 30, (Lorem ipsum dolor sit amet))

Such additions are a bit of work, but the query possibilities
of Lucene do not change that fast.
Adding infix operators, which sit between their arguments,
is a bit more involved.

Regards,
Paul Elschot
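As a rough illustration of why prefix operators are easy to add, here is a small recursive-descent sketch that parses the two example forms in the reply into a tree. The grammar is an assumption made up for this example and is not the actual contrib/surround grammar; `PrefixQueryParser` and `Node` are hypothetical names, and bare parentheses are treated as shorthand for `termList`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical parser for a prefix-operator syntax such as mlt(3, 30, (a b c)). */
class PrefixQueryParser {

    /** Parse tree node: an operator with arguments, or a leaf token with none. */
    static final class Node {
        final String op;
        final List<Node> args;

        Node(String op, List<Node> args) {
            this.op = op;
            this.args = args;
        }

        @Override
        public String toString() {
            if (args.isEmpty()) return op;
            StringBuilder sb = new StringBuilder(op).append('(');
            for (int i = 0; i < args.size(); i++) {
                if (i > 0) sb.append(", ");
                sb.append(args.get(i));
            }
            return sb.append(')').toString();
        }
    }

    private final String src;
    private int pos = 0;

    private PrefixQueryParser(String src) { this.src = src; }

    static Node parse(String query) {
        PrefixQueryParser p = new PrefixQueryParser(query);
        Node n = p.expr();
        p.skipSpaces();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("Unexpected input at position " + p.pos);
        }
        return n;
    }

    // expr := '(' args ')'  |  token '(' args ')'  |  token
    private Node expr() {
        skipSpaces();
        if (peek() == '(') {              // bare parentheses: shorthand for termList
            pos++;
            return new Node("termList", argList());
        }
        String name = token();
        if (peek() == '(') {              // operator name directly followed by '('
            pos++;
            return new Node(name, argList());
        }
        return new Node(name, Collections.<Node>emptyList());
    }

    // args: expressions separated by commas and/or whitespace, up to the closing ')'
    private List<Node> argList() {
        List<Node> args = new ArrayList<>();
        skipSeparators();
        while (peek() != ')') {
            args.add(expr());             // fails in token() if input ends before ')'
            skipSeparators();
        }
        pos++;                            // consume ')'
        return args;
    }

    private String token() {
        int start = pos;
        while (pos < src.length()
                && !Character.isWhitespace(src.charAt(pos))
                && "(),".indexOf(src.charAt(pos)) < 0) {
            pos++;
        }
        if (pos == start) {
            throw new IllegalArgumentException("Expected a token at position " + pos);
        }
        return src.substring(start, pos);
    }

    private char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    private void skipSeparators() {
        while (pos < src.length()
                && (Character.isWhitespace(src.charAt(pos)) || src.charAt(pos) == ',')) {
            pos++;
        }
    }
}
```

In this sketch a new prefix operator needs no grammar change at all: any name followed by a parenthesised argument list parses, and mapping names such as `mlt` onto query objects would be a separate step. Infix operators would need precedence handling on top of this, which is the extra work mentioned above.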