Fairly new to Lucene/Nutch and search in general, so bear with me.
I'm using Lucene in an application and, although it isn't a concern yet, I want to understand the implications for scalability going forward. I was reading the Nutch case study in the Lucene in Action book, about how Nutch splits its indexes across many machines:

<snip>
"The Query Handler does some light processing of the query and forwards the search terms to a large set of Index Searcher machines."
"There are now many streams of search results that come back to the Query Handler. The Query Handler collates the results, finding the best ranking across all of them."
"The Query Handler asks each Index Searcher for only a small number of documents (usually 10)."
</snip>

What I don't follow is what the implications of splitting the indexes this way are for relevancy. Say the first 20 docs on Index Searcher machine A are highly relevant and the first 10 docs on Index Searcher machine B are not very relevant. If I understand correctly, the Query Handler only ever sees 10 docs from machine A and 10 docs from machine B, so docs 11-20 of the merged result will come from machine B and not be very relevant, even though machine A was holding better candidates that were never fetched?

I'm not sure I really see a way around this. I guess one of the critical things is how you choose to split your indexes? My impression is that Nutch partitions based on the URL of the content being indexed?
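To make the concern concrete, here's a rough Java sketch of what I understand the collation step to be doing. The names (TopKCollator, Hit, collate) are made up for illustration, not Nutch's actual API; it just merges the per-shard top-k lists by score:

import java.util.*;

// Sketch of the Query Handler's collation step: each Index Searcher
// returns its local top-k (doc, score) pairs, and the handler merges
// them into a single global top-k ordered by score.
public class TopKCollator {

    static final class Hit {
        final String shard;   // which Index Searcher the hit came from
        final String docId;
        final float score;
        Hit(String shard, String docId, float score) {
            this.shard = shard;
            this.docId = docId;
            this.score = score;
        }
        @Override public String toString() {
            return shard + "/" + docId + " (score=" + score + ")";
        }
    }

    /** Keep the k best hits overall from the per-shard top-k lists. */
    static List<Hit> collate(List<List<Hit>> perShardResults, int k) {
        // Min-heap ordered by score, so the weakest of the current k is on top.
        PriorityQueue<Hit> best =
                new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
        for (List<Hit> shardHits : perShardResults) {
            for (Hit h : shardHits) {
                best.offer(h);
                if (best.size() > k) best.poll(); // evict the lowest-scoring hit
            }
        }
        List<Hit> merged = new ArrayList<>(best);
        merged.sort((x, y) -> Float.compare(y.score, x.score)); // best first
        return merged;
    }

    public static void main(String[] args) {
        // Shard A's local top 10 all score high; shard B's all score low.
        List<Hit> a = new ArrayList<>();
        List<Hit> b = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            a.add(new Hit("A", "docA" + i, 0.90f - 0.01f * i));
            b.add(new Hit("B", "docB" + i, 0.30f - 0.01f * i));
        }
        // Positions 11-20 of the merged list are B's weak hits, because
        // A's equally strong docs 11-20 were never requested from shard A.
        collate(Arrays.asList(a, b), 20).forEach(System.out::println);
    }
}

If that sketch is right, then with two shards each returning their top 10 and the user asking for 20 results, positions 11-20 are necessarily filled by B's hits however weak they are, which is exactly what I'm asking about.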
Thanks for any insights,
Martin