[ http://issues.apache.org/jira/browse/NUTCH-92?page=comments#action_12453682 ]

Dogacan Güney commented on NUTCH-92:
------------------------------------

Here is my second attempt at this. DistributedSearch$Client now keeps a mapping
from server addresses to numDocs, and search() computes the total number of
documents across the live servers.
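
To illustrate the idea (this is a minimal sketch, not the code in the attached
patch): the client could keep a map from server address to the document count
that server last reported, and sum only the entries for servers currently
considered live. The class name LiveDocCounter and its methods are hypothetical
and do not exist in Nutch.

```java
import java.net.InetSocketAddress;
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, for illustration only; not part of Nutch or the patch.
class LiveDocCounter {
  // Document count last reported by each search server, keyed by address.
  private final Map<InetSocketAddress, Integer> numDocsByAddress =
      new ConcurrentHashMap<InetSocketAddress, Integer>();

  // Record (or refresh) the count a server reported.
  void update(InetSocketAddress server, int numDocs) {
    numDocsByAddress.put(server, numDocs);
  }

  // Sum the counts of the given live servers; servers with no recorded
  // count contribute nothing to the total.
  long totalDocs(Iterable<InetSocketAddress> liveServers) {
    long total = 0;
    for (InetSocketAddress server : liveServers) {
      Integer n = numDocsByAddress.get(server);
      if (n != null) {
        total += n;
      }
    }
    return total;
  }

  public static void main(String[] args) {
    LiveDocCounter counter = new LiveDocCounter();
    InetSocketAddress s1 = InetSocketAddress.createUnresolved("search1", 9999);
    InetSocketAddress s2 = InetSocketAddress.createUnresolved("search2", 9999);
    counter.update(s1, 1000);
    counter.update(s2, 250);
    // Only s1 is live here, so s2's documents are left out of the total.
    System.out.println(counter.totalDocs(Arrays.asList(s1)));
  }
}
```

Computing the total from live servers only keeps the global document count
consistent with the set of indexes that can actually return hits for the query.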

> DistributedSearch incorrectly scores results
> --------------------------------------------
>
>                 Key: NUTCH-92
>                 URL: http://issues.apache.org/jira/browse/NUTCH-92
>             Project: Nutch
>          Issue Type: Bug
>          Components: searcher
>    Affects Versions: 0.7, 0.8
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>         Attachments: distributed-idf-v2.patch, distributed-idf.patch
>
>
> When running search servers in a distributed setup, using 
> DistributedSearch$Server and Client, total scores are incorrectly calculated. 
> The symptoms are that scores differ depending on how segments are deployed to 
> Servers, i.e. if there is uneven distribution of terms in segment indexes 
> (due to segment size or content differences) then scores will differ 
> depending on how many and which segments are deployed on a particular Server. 
> This may rank less relevant results above more relevant ones.
> The underlying reason for this is that each IndexSearcher (which uses local 
> index on each Server) calculates scores based on the local IDFs of query 
> terms, and not the global IDFs from all indexes together. This means that 
> scores arriving from different Servers to the Client cannot be meaningfully 
> compared, unless all indexes have similar distribution of Terms and similar 
> numbers of documents in them. However, currently the Client mixes all scores 
> together, sorts them by absolute values and picks top hits. These absolute 
> values will change if segments are unevenly deployed to Servers.
> Currently the workaround is to deploy the same number of documents in 
> segments per Server, and to ensure that segments contain well-randomized 
> content so that term frequencies for common terms are very similar.
> The solution proposed here (the result of a discussion between ab and cutting;
> patches are coming) is to calculate global IDFs prior to running the query,
> and pre-boost query Terms with these global IDFs. This will require one more
> RPC call per query (this can be optimized later, e.g. through caching).
> Then the scores will become normalized according to the global IDFs, and 
> Client will be able to meaningfully compare them. Scores will also become 
> independent of the segment content or local number of documents per Server. 
> This will involve at least the following changes:
> * change NutchSimilarity.idf(Term, Searcher) to always return 1.0f. This 
> enables us to manipulate scores independently of local IDFs.
> * add a new method to Searcher interface, int[] getDocFreqs(Term[]), which 
> will return document frequencies for query terms.
> * modify getSegmentNames() so that it also returns the total number of
> documents in each segment, or implement this as a separate method (this will
> be called once during segment init).
> * in DistributedSearch$Client.search() first make a call to servers to return
> local document frequencies for the current query, and calculate global IDFs
> for each relevant Term in that query.
> * multiply the TermQuery boosts by idf(totalDocFreq, totalIndexedDocs), and 
> PhraseQuery boosts by the sum of the idf(totalDocFreqs, totalIndexedDocs) for 
> all of its terms
> This solution should be applicable with only minor changes to all branches, 
> but initially the patches will be relative to trunk/ .
> Comments, suggestions and review are welcome!
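
The boosting steps listed in the issue can be sketched roughly as follows. This
is an illustration under stated assumptions, not the patch itself: the class
GlobalIdf and its method names are hypothetical, and idf() is assumed to follow
the classic Lucene DefaultSimilarity formula, log(numDocs / (docFreq + 1)) + 1.
The per-server frequencies stand in for what the proposed getDocFreqs(Term[])
call would return.

```java
// Hypothetical sketch of the proposed global-IDF boost; not Nutch code.
class GlobalIdf {
  // Assumed formula: classic Lucene DefaultSimilarity idf.
  static float idf(long docFreq, long numDocs) {
    return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
  }

  // perServer[s][t] is the document frequency of query term t on server s.
  // Returns, for each term, the sum over all servers.
  static long[] totalDocFreqs(int[][] perServer, int numTerms) {
    long[] totals = new long[numTerms];
    for (int[] freqs : perServer) {
      for (int t = 0; t < numTerms; t++) {
        totals[t] += freqs[t];
      }
    }
    return totals;
  }

  // Multiplier for a TermQuery boost: idf(totalDocFreq, totalIndexedDocs).
  static float termBoost(long totalDocFreq, long totalIndexedDocs) {
    return idf(totalDocFreq, totalIndexedDocs);
  }

  // Multiplier for a PhraseQuery boost: sum of the global idf of its terms.
  static float phraseBoost(long[] totalDocFreqs, long totalIndexedDocs) {
    float sum = 0f;
    for (long df : totalDocFreqs) {
      sum += idf(df, totalIndexedDocs);
    }
    return sum;
  }

  public static void main(String[] args) {
    // Two servers, two query terms; the second server holds most matches.
    int[][] perServer = { {3, 0}, {6, 1} };
    long[] totals = totalDocFreqs(perServer, 2);
    long totalIndexedDocs = 10;
    System.out.println(termBoost(totals[0], totalIndexedDocs));
    System.out.println(phraseBoost(totals, totalIndexedDocs));
  }
}
```

Because every server applies the same multipliers, and the local idf is fixed
at 1.0f as proposed, the scores returned to the Client no longer depend on how
terms happen to be distributed across the servers.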


       
