[ http://issues.apache.org/jira/browse/NUTCH-92?page=comments#action_12453682 ]

Dogacan Güney commented on NUTCH-92:
------------------------------------

Here is my second attempt at this. DistributedSearch$Client now keeps a mapping from addresses to numDocs and, in search(), computes the total number of documents from the live servers.

> DistributedSearch incorrectly scores results
> --------------------------------------------
>
>                 Key: NUTCH-92
>                 URL: http://issues.apache.org/jira/browse/NUTCH-92
>             Project: Nutch
>          Issue Type: Bug
>          Components: searcher
>    Affects Versions: 0.7, 0.8
>            Reporter: Andrzej Bialecki
>         Assigned To: Andrzej Bialecki
>         Attachments: distributed-idf-v2.patch, distributed-idf.patch
>
>
> When running search servers in a distributed setup, using
> DistributedSearch$Server and Client, total scores are incorrectly calculated.
> The symptoms are that scores differ depending on how segments are deployed to
> Servers, i.e. if there is uneven distribution of terms in segment indexes
> (due to segment size or content differences) then scores will differ
> depending on how many and which segments are deployed on a particular Server.
> This may lead to prioritizing non-relevant results over more relevant ones.
> The underlying reason for this is that each IndexSearcher (which uses the local
> index on each Server) calculates scores based on the local IDFs of query
> terms, and not the global IDFs from all indexes together. This means that
> scores arriving from different Servers to the Client cannot be meaningfully
> compared, unless all indexes have similar distribution of Terms and similar
> numbers of documents in them. However, currently the Client mixes all scores
> together, sorts them by absolute values and picks top hits. These absolute
> values will change if segments are unevenly deployed to Servers.
> Currently the workaround is to deploy the same number of documents in
> segments per Server, and to ensure that segments contain well-randomized
> content so that term frequencies for common terms are very similar.
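
To make the problem described above concrete, here is a minimal sketch of how the same term can receive very different local IDFs on two servers. It assumes the classic Lucene DefaultSimilarity formula, idf = ln(numDocs / (docFreq + 1)) + 1; the class name LocalIdfDemo and all document counts are hypothetical and are not taken from the patch.

```java
public class LocalIdfDemo {
    // Assumption: the classic Lucene DefaultSimilarity formula,
    // idf = ln(numDocs / (docFreq + 1)) + 1. Nutch's own similarity may differ.
    public static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // Hypothetical figures: the same term matches 10 documents on each
        // server, but server A holds 1,000 documents and server B 100,000.
        float idfA = idf(10, 1_000);
        float idfB = idf(10, 100_000);

        // Global view: 20 matching documents out of 101,000 in total.
        float idfGlobal = idf(10 + 10, 1_000 + 100_000);

        System.out.println("server A local idf: " + idfA);
        System.out.println("server B local idf: " + idfB);
        System.out.println("global idf:         " + idfGlobal);
    }
}
```

With these numbers the local IDF on the large server is noticeably higher than on the small one, so hits for the same term are scored on different scales before the Client merges them.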
> The solution proposed here (as a result of discussion between ab and cutting,
> patches are coming) is to calculate global IDFs prior to running the query,
> and pre-boost query Terms with these global IDFs. This will require one more
> RPC call per query (this can be optimized later, e.g. through caching).
> Then the scores will become normalized according to the global IDFs, and
> Client will be able to meaningfully compare them. Scores will also become
> independent of the segment content or local number of documents per Server.
> This will involve at least the following changes:
> * change NutchSimilarity.idf(Term, Searcher) to always return 1.0f. This
> enables us to manipulate scores independently of local IDFs.
> * add a new method to Searcher interface, int[] getDocFreqs(Term[]), which
> will return document frequencies for query terms.
> * modify getSegmentNames() so that it also returns the total number of
> documents in each segment, or implement this as a separate method (this will
> be called once during segment init)
> * in DistributedSearch$Client.search() first make a call to servers to return
> local IDFs for the current query, and calculate global IDFs for each relevant
> Term in that query.
> * multiply the TermQuery boosts by idf(totalDocFreq, totalIndexedDocs), and
> PhraseQuery boosts by the sum of the idf(totalDocFreqs, totalIndexedDocs) for
> all of its terms
> This solution should be applicable with only minor changes to all branches,
> but initially the patches will be relative to trunk/ .
> Comments, suggestions and review are welcome!
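
The client-side arithmetic in the proposal above could look roughly like the following. This is only a sketch under stated assumptions, not the code from distributed-idf-v2.patch: the class GlobalIdfSketch, its method names and the server addresses are hypothetical, the per-server int[] arrays stand in for the results of the proposed getDocFreqs(Term[]) call, and the idf formula is the classic Lucene DefaultSimilarity one.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GlobalIdfSketch {
    // Assumption: classic Lucene DefaultSimilarity idf.
    public static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Sum numDocs over the servers that are currently live; addresses
    // without a known numDocs entry are skipped.
    public static int totalDocs(Map<String, Integer> numDocsByAddress,
                                List<String> liveAddresses) {
        int total = 0;
        for (String address : liveAddresses) {
            Integer numDocs = numDocsByAddress.get(address);
            if (numDocs != null) {
                total += numDocs;
            }
        }
        return total;
    }

    // Add up per-server document frequencies, one slot per query term.
    public static int[] sumDocFreqs(List<int[]> perServerDocFreqs, int numTerms) {
        int[] totals = new int[numTerms];
        for (int[] docFreqs : perServerDocFreqs) {
            for (int i = 0; i < numTerms; i++) {
                totals[i] += docFreqs[i];
            }
        }
        return totals;
    }

    // TermQuery boost factor: idf(totalDocFreq, totalIndexedDocs) per term.
    public static float[] termBoosts(int[] totalDocFreqs, int totalDocs) {
        float[] boosts = new float[totalDocFreqs.length];
        for (int i = 0; i < boosts.length; i++) {
            boosts[i] = idf(totalDocFreqs[i], totalDocs);
        }
        return boosts;
    }

    // PhraseQuery boost factor: sum of the global idfs of all its terms.
    public static float phraseBoost(int[] totalDocFreqs, int totalDocs) {
        float sum = 0f;
        for (int docFreq : totalDocFreqs) {
            sum += idf(docFreq, totalDocs);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical addresses and document counts.
        Map<String, Integer> numDocsByAddress = new HashMap<>();
        numDocsByAddress.put("serverA:9999", 1_000);
        numDocsByAddress.put("serverB:9999", 100_000);
        numDocsByAddress.put("serverC:9999", 500); // known, but not live
        List<String> live = Arrays.asList("serverA:9999", "serverB:9999");

        int total = totalDocs(numDocsByAddress, live);
        int[] docFreqs = sumDocFreqs(
            Arrays.asList(new int[] {10, 3}, new int[] {10, 7}), 2);

        System.out.println("total docs:   " + total);
        System.out.println("doc freqs:    " + Arrays.toString(docFreqs));
        System.out.println("term boosts:  " + Arrays.toString(termBoosts(docFreqs, total)));
        System.out.println("phrase boost: " + phraseBoost(docFreqs, total));
    }
}
```

Because the local idf is fixed at 1.0f on every server, multiplying the query boosts by these global factors would put scores from all servers on the same scale. Summing numDocs only over live servers keeps the total consistent with the document frequencies that actually come back.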
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers