Re: [Resent] Document boosting based on .. semantics?

Grant Ingersoll Wed, 20 Feb 2008 05:43:12 -0800


On Feb 20, 2008, at 2:51 AM, Markus Fischer wrote:

Hi,
[Resent: guess I sent the first before I completed my subscription, just in case it comes up twice ...]
the subject may be a bit weird but I couldn't find a better way to describe a problem I'm trying to solve.
If I'm not mistaken, one factor of scoring is the distance of the word within the document and the length of the document. I'm titling my problem as the cauliflower-problem, but it's related to any "type of" problem.
When searching for "cauliflower", I get x hits. Now documents in the top range (pos < 10) are most likely selected by the user. Unfortunately, although "cauliflower" is in the documents and its word position is like around 3000 characters in the document, the document itself has nothing much to do with "cauliflower" or "vegetables" or "eating", etc.
Unfortunately, the most relevant documents come much later in the index (pos >= 10) because the "cauliflower" word is positioned like around 5000 characters within the document.
Based on the relation on the content, these later documents are much more appropriate to the search term, because they also deal with "vegetables" and "eating", etc.
I'm stuck here how I can signal Lucene to boost those later documents, because frankly I don't know on what. I would probably have to tag the relation of the document (is-about vegetables, is-about eating) and also detect that the searched term is-a vegetable. This gets even more complex with non-single-term queries.

Well, this is a classic problem in IR. The question is, how do you know when a user types "cauliflower" that they really are interested in "vegetables" and "eating" and not the other document? There really is nothing in that query, by itself, that gives you or Lucene that information. Your top hit has the term and has other factors, such as document length etc. that make it the top result. That is not to say there is nothing you can do, just that it is hard and can be brittle.

Some suggestions (and the use of cauliflower and vegetables, etc. is figurative, not literal):
1. If you know cauliflower should be related to vegetables and eating, add those as synonyms to your query terms. This can be hard to generalize.
2. If you have some user profile that suggests that user is interested in vegetables/eating over other things, then you could incorporate that.
3. If you see that most of your users like the vegetable documents for the query cauliflower by doing some log analysis, then you could use popularity of the document as a factor in your scoring (see FunctionQuery capability in Lucene).
4. You could try to do some fancy-schmancy reasoning using WordNet, hypernyms/hyponyms.
5. You could use MoreLikeThis to allow the user to choose the vegetable result and say "Give me more documents like this".
6. Last, but certainly not least, if you want the user to get a certain document as #1 or #2, then make the document #1 or #2. You don't need search for this. It's called editorial boosting. Again, hard to generalize, but sometimes you just need a document to be #1 and trying to tune the various knobs a search engine gives you is going to break a whole lot of other things.
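
To make suggestions 3 and 6 a bit more concrete, here is a rough sketch in plain Java. It is not Lucene API: the class name, the Hit fields, the sample ids and the log-damped blending formula are all assumptions made up for illustration. It re-orders hits after the search has run, which is only one of several places such a factor could be applied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class PopularityRerank {
    /** One search hit: id, the engine's text score, and clicks mined from query logs (hypothetical fields). */
    static final class Hit {
        final String id;
        final double textScore;
        final long clicks;

        Hit(String id, double textScore, long clicks) {
            this.id = id;
            this.textScore = textScore;
            this.clicks = clicks;
        }
    }

    /** Text score scaled by a log-damped click count, so popularity nudges the order without swamping it. */
    static double blended(Hit h) {
        return h.textScore * (1.0 + Math.log1p(h.clicks));
    }

    /** Editorially pinned ids come first, in the given order; the rest follow by blended score, descending. */
    static List<String> rerank(List<Hit> hits, List<String> pinned) {
        List<String> out = new ArrayList<>();
        for (String id : pinned) {
            // Only pin documents that actually matched the query.
            if (hits.stream().anyMatch(h -> h.id.equals(id))) {
                out.add(id);
            }
        }
        hits.stream()
            .filter(h -> !pinned.contains(h.id))
            .sorted((a, b) -> Double.compare(blended(b), blended(a)))
            .forEach(h -> out.add(h.id));
        return out;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
            new Hit("recipe-collection", 0.9, 2),
            new Hit("vegetable-guide", 0.5, 400),
            new Hit("press-release", 0.7, 0));
        System.out.println(rerank(hits, Collections.<String>emptyList()));
        System.out.println(rerank(hits, Arrays.asList("press-release")));
    }
}
```

The pinned list is the editorial part: it bypasses scoring entirely, which is why it does not disturb ranking for any other query.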

Also, what kind of queries are you using such that you get the first 3K characters in your query? Are you using some type of SpanQuery or are you just referring to the effects of length normalization? You might also try using a different Similarity implementation that doesn't punish longer documents as much.
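
To illustrate the length normalization point: as far as I know, DefaultSimilarity computes the length norm of a field as 1/sqrt(numTerms), so a 100-term document gets ten times the norm of a 10,000-term one. The sketch below is plain Java, not a Similarity subclass, and plateauNorm is a hypothetical alternative invented here to show what "punishing longer documents less" could look like.

```java
final class LengthNormSketch {
    /** The 1/sqrt(numTerms) factor that DefaultSimilarity is understood to use. */
    static double defaultNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /**
     * Hypothetical gentler norm: every document up to `plateau` terms gets the
     * same value, and only documents longer than that decay as 1/sqrt(numTerms).
     */
    static double plateauNorm(int numTerms, int plateau) {
        return 1.0 / Math.sqrt(Math.max(numTerms, plateau));
    }

    public static void main(String[] args) {
        int[] lengths = {100, 1000, 10000};
        for (int n : lengths) {
            System.out.println(n + " terms: default=" + defaultNorm(n)
                + " plateau=" + plateauNorm(n, 1000));
        }
    }
}
```

Keep in mind that norms are computed and stored at index time, so trying a different length norm in Lucene means re-indexing, not just swapping the Similarity on the searcher.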

Finally, the explain() method may help you better understand the factors that go into why your documents score the way they do.

On a related topic, I'm also searching for a way to suggest alternate spellings of words to the user, when we find a word which is used very infrequently in the index or is not in the index at all. I'm Austrian based; when I e.g. search for "retthich" (misspelled "rettich", which is radish), Google suggests the properly spelled word to me. I'm searching for a way to figure out how to accomplish this, but again this may be off-topic for Lucene and/or I should properly start a separate thread ...

Search the archives here for "Google suggest" or "suggestions", "ajax suggestions", etc. There are a couple of implementations out there for Lucene/Solr I believe that use the TermEnum class to go and get suggestions. I believe there might even be some patches in JIRA.
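
As a rough idea of what such a suggester does, here is a minimal sketch in plain Java. It is not one of the implementations mentioned above: the class, the rareBelow and maxDistance parameters and the sample frequencies are assumptions for illustration. The map stands in for the terms and document frequencies you would collect by walking the index (for example with TermEnum), and candidates are compared with a plain Levenshtein edit distance.

```java
import java.util.HashMap;
import java.util.Map;

final class SpellSuggestSketch {
    /** Levenshtein distance (insert, delete, substitute each cost 1), two-row dynamic programming. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Returns a suggestion only when `word` occurs in fewer than `rareBelow` documents.
     * Picks the closest indexed term within `maxDistance` edits that is more frequent
     * than `word`, breaking ties by higher document frequency; null if none qualifies.
     */
    static String suggest(String word, Map<String, Integer> termDocFreq, int maxDistance, int rareBelow) {
        int wordFreq = termDocFreq.getOrDefault(word, 0);
        if (wordFreq >= rareBelow) {
            return null; // the word is common enough, leave the query alone
        }
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        int bestFreq = 0;
        for (Map.Entry<String, Integer> e : termDocFreq.entrySet()) {
            String term = e.getKey();
            int freq = e.getValue();
            if (term.equals(word) || freq <= wordFreq) {
                continue;
            }
            int d = editDistance(word, term);
            if (d > maxDistance) {
                continue;
            }
            if (d < bestDist || (d == bestDist && freq > bestFreq)) {
                best = term;
                bestDist = d;
                bestFreq = freq;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> terms = new HashMap<>();
        terms.put("rettich", 42); // made-up document frequencies
        terms.put("rettung", 17);
        terms.put("reich", 80);
        System.out.println(suggest("retthich", terms, 2, 5));
    }
}
```

Scanning every term like this is fine for a small vocabulary; for a large index you would want to narrow the candidates first, for example by n-gram lookup.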

Does anyone have advice on how to approach this kind of problem? Is it appropriate/can it be solved with Lucene? Am I right here on this list anyway? :)
thanks for any feedback,
- Markus


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ




