Hi, a general comment: I'm pretty new to Nutch too, and I'm starting a blog to describe my research, trials and errors, evolution, etc. It's at http://nutch.wordpress.com

There are a few links in the introductory blog entry that might help. Also, the recrawl/merge blog entry talks about the segments.

--Kai

----- Original Message ----
From: Martin Bayly <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, July 18, 2007 10:55:14 AM
Subject: Newbie question about Nutch query architecture - multiple indexes

Fairly new to Lucene/Nutch and Search in general - so bear with me.

Using Lucene in an application and (although not a concern yet) want to understand implications for scalability going forward. I was reading in the Lucene in Action book Nutch case study, about how Nutch splits its indexes across many machines.

<snip>
"The Query Handler does some light processing of the query and forwards the search terms to a large set of Index Searcher machines."

"There are now many streams of search results that come back to the Query Handler. The Query Handler collates the results, finding the best ranking across all of them."

"The Query Handler asks each Index Searcher for only a small number of documents (usually 10)"
</snip>

What I don't follow is what are the implications of splitting the indexes in this way for relevancy?

Let's say the first 20 docs on Index Searcher machine A are highly relevant and the first 10 docs on Index Searcher machine B are not very relevant. But if I understand correctly, the user will see only 10 docs from machine A and 10 docs from machine B. i.e. docs 11-20 in the search result will not be very relevant?

Not sure I really see a way around this - I guess one of the critical things is how you choose to split your indexes? My impression is Nutch does this based on the URL of the content being indexed?

Thanks for any insights
Martin
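
For what it's worth, here is a minimal sketch of the collation step the quoted passage describes. It is not Nutch's actual code: `shard_top_k`, `collate`, and the `(score, doc_id)` tuples are hypothetical names made up for illustration, and it assumes that scores coming back from different Index Searchers are directly comparable. In this sketch the handler merges the per-machine lists by score instead of showing 10 from A followed by 10 from B, and to serve results 11-20 it asks every machine for its top 20 rather than its top 10.

```python
import heapq
from typing import Dict, List, Tuple

# A hit is (score, doc_id); a higher score means more relevant.
# Assumption: scores from different shards can be compared directly.
Hit = Tuple[float, str]


def shard_top_k(hits: List[Hit], k: int) -> List[Hit]:
    """What one hypothetical index searcher returns: its k best hits."""
    return heapq.nlargest(k, hits)


def collate(shards: Dict[str, List[Hit]], start: int, page_size: int) -> List[Hit]:
    """Hypothetical query handler: merge per-shard results, return one page.

    To serve ranks start .. start + page_size - 1, every shard is asked for
    its top (start + page_size) hits, because in the worst case all of the
    hits on this page live on a single shard.
    """
    k = start + page_size
    candidates: List[Hit] = []
    for hits in shards.values():
        candidates.extend(shard_top_k(hits, k))
    # Global ranking over the union of per-shard candidates, best first.
    return heapq.nlargest(k, candidates)[start:]


if __name__ == "__main__":
    # Machine A holds 20 highly relevant docs, machine B holds 10 weak ones.
    shards = {
        "A": [(1.0 - 0.01 * i, f"A{i}") for i in range(20)],
        "B": [(0.3 - 0.01 * i, f"B{i}") for i in range(10)],
    }
    for start in (0, 10):
        page = collate(shards, start=start, page_size=10)
        print(start, [doc_id for _, doc_id in page])
```

With the made-up data above, both the first and the second page come entirely from machine A, since all of its 20 docs outscore everything on machine B. The cost of this approach is that deeper pages require more hits from every machine, which may be one reason the per-machine request size is kept small for the first page.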
