This sort of fine distinction probably requires user feedback. If the idioms are highly distinctive, then a learning system that is highly resistant to over-fitting could be used to learn a query that includes phrasal components such as "not for sale".
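To make the idea of learned phrasal components concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the function names (ngrams, train, post_score) and the tiny judged sample are hypothetical, and the learner is a plain perceptron chosen for brevity, which is not especially resistant to over-fitting. A more robust learner would take its place in practice, and the resulting weights would be applied as a re-scoring pass over documents already retrieved by the search engine.

```python
import re
from collections import defaultdict

def ngrams(text, max_n=3):
    """Return the set of word n-grams (n = 1..max_n) of the lower-cased text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)}

def train(judged, epochs=10):
    """Plain perceptron over n-gram features.

    judged is a list of (text, is_relevant) pairs from user feedback.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, relevant in judged:
            feats = ngrams(text)
            score = sum(weights[f] for f in feats)
            target = 1 if relevant else -1
            if score * target <= 0:  # misclassified (or undecided): update
                for f in feats:
                    weights[f] += target
    return weights

def post_score(weights, text):
    """Sum the learned weights of the n-grams present in the text."""
    return sum(weights.get(f, 0.0) for f in ngrams(text))

# Hypothetical user judgments on a retrieved sample.
judged = [
    ("Apartment for sale", True),
    ("Apartment not for sale", False),
    ("Sale! Apartment for rent", False),
    ("House for sale, good price", True),
]
weights = train(judged)

for doc in ("Studio for sale", "Studio not for sale"):
    print(doc, post_score(weights, doc))
```

With this sample, phrases such as "not for sale" and "for rent" end up with negative weight while "for sale" ends up positive, so an unseen document containing "not for sale" is pushed down even though it also contains "for sale".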
If you have to find more flexible phrases that are based on synonymic substitutions, then you should look at techniques like random indexing, LSA, or LDA so that you can express the phrases you extract from training documents in terms of more general semantic components. Sparse random indexing is probably the easiest to apply to a term-based retrieval system such as Lucene.

Here is one effective learning system:

http://www.aclweb.org/anthology/P/P08/P08-2059.pdf
http://www.cs.jhu.edu/~mdredze/publications/icml_variance.pdf

To summarize, what I would recommend is something like this:

step 0: create a Lucene index with positional and, optionally, semantic information such as from sparse random indexing
step 1: take user input to retrieve a sample set of documents
step 2: let the user judge some of these documents as relevant or not
step 3: extract possible features such as terms, phrases, semantic phrases and so on from the sample documents
step 4: run the learning algorithm on the judged documents
step 5: repeat starting at step 1, but now with an augmented query that includes a post-scoring phase

On Mon, Oct 26, 2009 at 7:29 AM, poeta simbolista <[email protected]> wrote:

> Hi,
>
> Imagine you have a text:
> "Apartment not for sale".
> and another
> "Sale! Apartment for rent"
> Search query: "Apartment for sale".
> The above search query will return the texts above highly scored. I would
> like to know how I could tackle the following issue better with Lucene. My
> ideas:
> - recognise certain sets "Not for sale" as different from "for sale". That
> is, invalidate "for sale" if it comes preceded by "not". How could I do
> this?
> - Recognise sale only if preceded by "for", since the second meaning
> (bargain vs. something for sale) is tricky.
> - transcript "sale" as "for sale", grouped in the query (produce "-sale
> +(for sale)" ). Wouldn't that query invalidate those with the "sale" term?
> How to achieve this with Lucene otherwise?
>
> Should this be tackled only by preprocessing the data before it makes it to
> the index? Ideally I would like to preserve the original text on the
> index.
>
> Thanks a lot in advance
> Diego
> --
> View this message in context:
> http://www.nabble.com/Solution-for-unwanted-ngrams-tp26060874p26060874.html
> Sent from the Lucene - General mailing list archive at Nabble.com.

--
Ted Dunning, CTO
DeepDyve
