Hi, Ian,

Thanks for the information. It would be really helpful to have some
documentation, perhaps on the wiki, about the retrieval algorithm and
how to hack it. Even a few paragraphs would be enough to get
started.

Thanks,

Jian

On Wed, 26 Jan 2005 12:40:54 -0500, Ian Soboroff <[EMAIL PROTECTED]> wrote:
> jian chen <[EMAIL PROTECTED]> writes:
> 
> > Just to continue this discussion: I think right now Lucene's retrieval
> > algorithm is based purely on the Vector Space Model, which is simple and
> > efficient.
> 
> As I understand it, it's indeed a tf-idf vector space approach, except
> that the queries are structured. The tf-idf weights are totaled as a
> straight cosine among siblings of a BooleanQuery, but other query
> nodes may do things differently. For example, I haven't read the code,
> but I assume PhraseQueries require all terms to be present and
> adjacent to contribute to the score.
> 
> There is also a document-specific boost factor in the equation which
> is essentially a hook for document-level signals such as recency,
> PageRank, etc.
> 
> You can tweak this by defining custom Similarity classes which can say
> what the tf, idf, norm, and boost mean.  You can also affect the
> term normalization at the query end in BooleanScorer (I think? through
> the sumOfSquares method?).
> 
> We've implemented something kind of like the Similarity class but
> based on a model which describes a larger family of "similarity
> functions".  (For the curious or similarly IR-geeky, it's from Justin
> Zobel's paper from a few years ago in SIGIR Forum.)  Essentially I
> need more general hooks than the Lucene Similarity provides.  I think
> those hooks might exist, but I'm not sure I know which classes they're
> in.
> 
> I'm also interested in things like relevance feedback which can affect
> term weights as well as adding terms to the query... just how many
> places in the code do I have to subclass or change?
> 
> It's clear that if I'm interested in a completely different model like
> language modeling, the IndexReader is the way to go.  In which case,
> what parts of the Lucene class structure should I adapt to maintain
> the incremental-results-return, inverted list skips, and other
> features which make the inverted search fast?
> 
> Ian
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
>
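
To make the quoted description a bit more concrete, here is a minimal,
library-independent sketch of tf-idf scoring with pluggable tf, idf, norm,
and boost hooks. It is not Lucene's code or its exact formula. The class
SketchSimilarity, the function score, and all parameter names are
hypothetical and chosen only for illustration, and the sample documents are
made up.

```python
import math
from collections import Counter


class SketchSimilarity:
    """Illustrative scoring hooks; the names are hypothetical, not Lucene's API."""

    def tf(self, freq):
        # Dampen raw term frequency so repeated terms help, but sublinearly.
        return math.sqrt(freq)

    def idf(self, doc_freq, num_docs):
        # Rarer terms weigh more; the +1 in the denominator avoids division by zero.
        return math.log(num_docs / (doc_freq + 1)) + 1.0

    def length_norm(self, num_terms):
        # Penalize long documents so they do not win on length alone.
        return 1.0 / math.sqrt(num_terms)


def score(query_terms, doc_terms, doc_freqs, num_docs, sim, boost=1.0):
    """Sum tf * idf over matching query terms, then apply norm and boost."""
    if not doc_terms:
        return 0.0
    counts = Counter(doc_terms)
    total = 0.0
    for term in set(query_terms):
        freq = counts[term]  # Counter returns 0 for absent terms
        if freq:
            total += sim.tf(freq) * sim.idf(doc_freqs.get(term, 0), num_docs)
    return boost * sim.length_norm(len(doc_terms)) * total


class FlatTf(SketchSimilarity):
    """Variant hook: ignore how often a term occurs, only whether it occurs."""

    def tf(self, freq):
        return 1.0 if freq > 0 else 0.0


if __name__ == "__main__":
    docs = {
        "d1": ["lucene", "search", "engine"],
        "d2": ["lucene", "lucene", "index"],
        "d3": ["cooking", "recipes"],
    }
    # Document frequency: number of documents containing each term.
    doc_freqs = Counter(t for terms in docs.values() for t in set(terms))
    for name, terms in docs.items():
        print(name, score(["lucene"], terms, doc_freqs, len(docs), SketchSimilarity()))
```

Swapping FlatTf in for SketchSimilarity changes the ranking without touching
score, which is roughly the kind of hook being asked about above. The boost
argument plays the role of the document-specific factor (recency and so on).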
