Hi Nadav,
Thanks for suggestions.
As for improving multi-word queries, Doug Cutting recently posted a link to
his presentation,
http://www.haifa.ibm.com/Workshops/ir2005/papers/DougCutting-Haifa05.pdf,
just scroll down to Nutch N-Grams there, and you'll see the answer.
Basically, "Buffy the Vampire" is indexed as buffy, buffy-the+0, the,
the-vampire+0, vampire; but that's only for common terms like "the", "www"
etc. I wander if it's possible to use the same technique to store phrases...
Maxym
==================================
Maxym Mykhalchuk
(+39) 320 8593170
PhD student at University of Trento, ITALY
==================================
----- Original Message -----
From: "Nadav Har'El" <[EMAIL PROTECTED]>
To: <java-user@lucene.apache.org>
Sent: Tuesday, April 11, 2006 10:33 AM
Subject: Re: Small field indexing and ranking
"Maxym Mykhalchuk" <[EMAIL PROTECTED]> wrote on 10/04/2006 09:46:16 PM:
Here's the issue: All my "documents" will be having a few (2-3:
title, short description) short fields. You see, it's rare that the
same word is repeated several times in a title, so will Lucene be
able to give me a decent ranking, or will it be able to tell me "oh,
yes, this term is in the following 300 titles".
On what I've read on the topic so far, it seems that inverted
indexes do work good on big texts, as they are able to exploit the
repetition of words to do ranking.
Lucene is no psychic. If you're looking for "dog", and the document
contains two short documents, actually titles:
"Sparky the Fire Dog"
and "Dog Hause Home Page"
(just two silly titles from Google's top 10 results for "dog"...)
Then there's hardly any way for Lucene to determine which document
should be ranked higher.
For single word queries in a situation like this, you might want
to help Lucene learn the "good" ranking. One way is to use
Document.setBoost() (or Field.setBoost) to pre-determine which
document is more "important" regardless of its text (e.g.,
using some sort of link analysis, or whatever trick that is
applicable in your situation). Another way is to override
Lucene's relevance ranking with some other type of sorting
(see the Sort class) - for example, to sort all the matching
results by date, to get the newer matching results first.
In many applications, you might want to let your users control
this sort order; For example, in a shopping site (where product
names are the very short "documents"), you might want to let
the user sort the results by price, by popularity, by release
date, by users' ranking, and so on.
For multi-word queries, it is actually possible to improve
on Lucene's standard ranking. For example, let's say you
have the two titles
"Hot Dog on a Stick"
"Your Dog in Hot Weather"
And get a query "hot dog" (without quotation marks).
Using QueryParser, Lucene will normally rank the two titles
more or less the same. However, the first one is probably
much better because the words "hot" and "dog", don't just
appear there, they actually appear very close, and in this
case even in order.
This sort of proximity-influenced scoring is missing from
Lucene's QueryParser, and I've been wondering recently
on how it is best to add it, and whether it is possible to
easily do it with existing Lucene machinary, like the
SpanQuery class. Has anyone ever tried to do something
like this before, and can tell us their experience?
Good Luck,
Nadav.
--
Nadav Har'El
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]