Bucketing (was Re: Wikia search goes live today)

Otis Gospodnetic Tue, 08 Jan 2008 23:07:28 -0800

Sounds useful.  I suppose this means one would have a custom function for 
within-bucket reordering?  E.g. for a web search you might reorder based on 
URL length if you think shorter URLs are an indicator of higher quality.  It 
also sounds like something that can easily sit outside Lucene... or do you have 
something else in mind, such as a mechanism to pass a reordering function into 
Lucene?


Otis 

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Andrzej Bialecki <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, January 8, 2008 5:24:01 PM
Subject: Re: Wikia search goes live today

Ryan McKinley wrote:
> Andrzej Bialecki wrote:
>> Lukas Vlcek wrote:
>>> So starring will be accommodated only during the indexing phase. Does it
>>> mean it will be a pretty static value, not a dynamically changing
>>> variable... correct?
>>> In other words, if I add my stars to some document, it won't affect the
>>> scoring immediately but only after an indexing cycle. Correct?
>>
>> (I'm not involved in Wikia development). There are some ways to go 
>> about it even in the pure Lucene-land, so that the updates are fast 
>> without reindexing the main content. Hint: ParallelReader.
>>
> 
> in solr (1.3-dev) you can have an external value source with a function 
> query...

True, although function query tends to bring more overhead ...

While we're on the subject of complex scoring - I read an interesting 
paper (I don't have a link now), which discussed a so-called bucketed 
scoring. The idea is that if your basic scoring is good enough to ensure 
that the top-N results are highly relevant, then you can split these results 
into buckets of k documents (let's say 10 ;) ), and within each bucket 
apply an arbitrary re-ranking function, which is then very inexpensive to 
perform because of the limited number of documents.

Example: you have a large corpus of web pages, and you want home pages 
to appear first, even if they score somewhat lower - and it doesn't pay 
off to modify the base scoring, because of overfitting, i.e. the scoring 
would be good for home pages but poor for other relevant documents.
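
A rough sketch of how this could sit outside Lucene, in plain Java with no 
Lucene calls at all. It assumes you already have the top-N hits as a list 
sorted by base score. The names BucketRerank, rerank, isHomePage and 
HOME_PAGES_FIRST are made up for illustration, and "home page" is taken here 
to mean a URL whose path is empty or "/", which is only one possible 
heuristic and not something from the paper.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class BucketRerank {
    private BucketRerank() {}

    /**
     * Splits a list that is already sorted by base score into consecutive
     * buckets of k items and re-sorts each bucket with the given comparator.
     * Items never cross a bucket boundary, so the base ranking still decides
     * which bucket a hit lands in.
     */
    static <T> List<T> rerank(List<T> ranked, int k, Comparator<? super T> withinBucket) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        List<T> out = new ArrayList<>(ranked.size());
        for (int start = 0; start < ranked.size(); start += k) {
            int end = Math.min(start + k, ranked.size());
            List<T> bucket = new ArrayList<>(ranked.subList(start, end));
            bucket.sort(withinBucket); // stable sort: ties keep base-score order
            out.addAll(bucket);
        }
        return out;
    }

    /** Assumed heuristic: a home page is a URL with an empty or "/" path. */
    static boolean isHomePage(String url) {
        String path = URI.create(url).getPath();
        return path == null || path.isEmpty() || path.equals("/");
    }

    /** Orders home pages before everything else (false sorts before true). */
    static final Comparator<String> HOME_PAGES_FIRST =
            Comparator.comparing((String u) -> !isHomePage(u));
}
```

Calling BucketRerank.rerank(urls, 10, BucketRerank.HOME_PAGES_FIRST) would 
then float home pages to the top of each group of 10 without touching the 
base scoring. Swapping in Comparator.comparingInt(String::length) would give 
a shorter-URLs-first reordering instead.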


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




