Greetings list,

I am attempting to use Nutch's distributed search server mode and seeing some unexpected results. Searches take ages to execute -- I seem to have caused Nutch to perform the same search 16 times (I have 16 nodes).

Over the past week, I have been building my indexes:

$ bin/hadoop dfs -ls segments/
/user/nutch/segments/20060406061358     <dir> (~100k pages, first run)
/user/nutch/segments/20060411165547     <dir> (1M pages)
/user/nutch/segments/20060412214204     <dir> (2M pages)
/user/nutch/segments/20060413004057     <dir> (5M pages)


I then indexed them all into a single index. Here is my current DFS "du" listing:

/user/nutch/crawldb     6379003931
/user/nutch/indexes     12895240115 (index built from all above segs.)
/user/nutch/indexes_old 400611137   (index built from smallest seg.)
/user/nutch/linkdb      107951330
/user/nutch/segments    67746176573

I built my big index by running a command similar to:

$ bin/nutch index indexes crawldb linkdb segments/*

However, the wildcard did not work: either DFS doesn't support wildcards, or my local shell was expanding the pattern before Nutch ever saw it. I was forced to specify each segment manually.
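In hindsight, something like the following might have saved me the typing. It is only a rough sketch: segment_paths is a helper name I made up, and it assumes the "-ls" output looks like the listing above (path in the first column, "<dir>" in the second), possibly preceded by a header line.

```shell
# Hypothetical helper: read "bin/hadoop dfs -ls" output on stdin and print
# only the directory paths, one per line. Assumes the path is the first
# whitespace-separated field and "<dir>" is the second, as in my listing.
segment_paths() {
    awk '$2 == "<dir>" { print $1 }'
}

# Intended use (not run here; relies on segment paths containing no spaces):
#   bin/nutch index indexes crawldb linkdb \
#       $(bin/hadoop dfs -ls segments/ | segment_paths)
```

Any line whose second field is not "<dir>" is simply skipped, so a header line or stray annotations should do no harm.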

After building my index I proceeded to set up my distributed search servers per Stefan's excellent wiki, at http://wiki.media-style.com/display/nutchDocu/setup+multiple+search+sever. I was not able to follow the instructions literally, as my indexes and segments are in DFS while the document presumes a local filesystem installation. I was also not able to "partition" my indexes or segments by host, because I don't know how to do that.

When I examine Tomcat's catalina.out log, as well as the logs of the distributed search servers themselves, I see some odd behavior:

060415 011943 29 query request from 10.10.0.6
060415 011943 29 query: baby
060415 011943 29 searching for 20 raw hits
060415 011950 29 re-searching for 40 raw hits, query: baby -site:"1858.niengineering.co.uk" -site:"54.kometkarpets-southwest.co.uk"
060415 011958 29 found 2741775 raw hits
060415 011958 29 re-searching for 80 raw hits, query: baby -site:"1858.niengineering.co.uk" -site:"54.kometkarpets-southwest.co.uk" -site:"ffcembroidery.com" -site:"infobluebook.com" -site:"aaliyahlova.suddenlaunch.com"
060415 012006 29 found 2734890 raw hits
060415 012007 29 total hits: 2754135

I'm not sure why it is re-searching with a rewritten query that adds -site: exclusions. Huh? I don't see this behavior when there is one search server, instead of the 16 I am using now. As the timestamps show, this one query took about 24 seconds (01:19:43 to 01:20:07), which is unacceptably slow.

When I examine the search results I see many duplicates. Looking further, it seems that the results of performing the same search on all 16 nodes are being combined into one result set, duplicates and all. I can only assume that I need to somehow partition my index or segments, but I'm unsure how to do that.

I guess I need to take my master index and set of segments and split them into 16 equal parts, and copy (?) those to their respective nodes. It seems onerous and wasteful - I will be duplicating data that is already in DFS. Am I wrong?
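For a sense of scale, here is a back-of-the-envelope calculation of what each node would hold. It is only a sketch: the byte counts are copied from my "du" listing above, and it assumes a perfectly even 16-way split, which real data will not give.

```shell
# Sizes in bytes, copied from the DFS "du" listing earlier in this message.
index_bytes=12895240115
segments_bytes=67746176573
nodes=16

# Integer division; assumes an ideal, perfectly even split (an assumption).
per_node_index=$((index_bytes / nodes))
per_node_segments=$((segments_bytes / nodes))

echo "index per node:    $per_node_index bytes"
echo "segments per node: $per_node_segments bytes"
```

That works out to roughly 0.8 GB of index and 4.2 GB of segments per node, on top of the copies already sitting in DFS.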

Thanks to anyone who read this far ;)

-Shawn
