Using Nutch's distributed search server mode

Shawn Gervais Fri, 14 Apr 2006 23:57:41 -0700

Greetings list,

I am attempting to use Nutch's distributed search server mode and am seeing some unexpected results. Searches take ages to execute -- I seem to have caused Nutch to perform the same search 16 times (I have 16 nodes).


Over the past week, I have been building my indexes:

$ bin/hadoop dfs -ls segments/
/user/nutch/segments/20060406061358     <dir> (~100k pages, first run)
/user/nutch/segments/20060411165547     <dir> (1M pages)
/user/nutch/segments/20060412214204     <dir> (2M pages)
/user/nutch/segments/20060413004057     <dir> (5M pages)

I then indexed them all into a single index. Here is my current DFS "du" listing:


/user/nutch/crawldb     6379003931
/user/nutch/indexes     12895240115 (index built from all above segs.)
/user/nutch/indexes_old 400611137   (index built from smallest seg.)
/user/nutch/linkdb      107951330
/user/nutch/segments    67746176573

I built my big index by running a command similar to:

$ bin/nutch index indexes crawldb linkdb segments/*

However, as DFS doesn't seem to support wildcards, or my shell was usurping them, I was forced to specify each segment manually.
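A minimal sketch of one way to avoid typing each segment by hand, assuming the "dfs -ls" output has the path in the first column and "<dir>" in the second, as in the listing above. The helper name list_segment_dirs is hypothetical, not part of Nutch or Hadoop. The local shell can only expand segments/* against the local filesystem, so the paths are taken from the DFS listing instead.

```shell
# Hypothetical helper: read "hadoop dfs -ls"-style output on stdin and
# print only the paths of entries marked as directories.
# Assumes column 1 is the path and column 2 is the literal "<dir>".
list_segment_dirs() {
    awk '$2 == "<dir>" { print $1 }'
}

# Intended use (not run here, requires a working DFS):
#   bin/nutch index indexes crawldb linkdb \
#       $(bin/hadoop dfs -ls segments/ | list_segment_dirs)
```

The command substitution relies on the segment paths containing no whitespace, which holds for the timestamp-named segments listed above.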

After building my index I proceeded to set up my distributed search servers per Stefan's excellent wiki, at http://wiki.media-style.com/display/nutchDocu/setup+multiple+search+sever. I was not able to follow the instructions literally, as my indexes and segments are in DFS while the document presumes a local filesystem installation, and I was also not able to "partition" my indexes or segments by host. I don't know how to do that.

When I examine Tomcat's catalina.out log, as well as the logs of the distributed search servers themselves, I see some odd behavior:


060415 011943 29 query request from 10.10.0.6
060415 011943 29 query: baby
060415 011943 29 searching for 20 raw hits

060415 011950 29 re-searching for 40 raw hits, query: baby -site:"1858.niengineering.co.uk" -site:"54.kometkarpets-southwest.co.uk"

060415 011958 29 found 2741775 raw hits

060415 011958 29 re-searching for 80 raw hits, query: baby -site:"1858.niengineering.co.uk" -site:"54.kometkarpets-southwest.co.uk" -site:"ffcembroidery.com" -site:"infobluebook.com" -site:"aaliyahlova.suddenlaunch.com"

060415 012006 29 found 2734890 raw hits
060415 012007 29 total hits: 2754135

I'm not sure why it is re-searching using a refactored query. Huh? I don't see this behavior when there is one search server, instead of the 16 I am using now. As you can see, the query is unacceptably slow: about 24 seconds pass between the request and the final hit count.

When I examine the search results I see many duplicate results. Looking at it further, it seems that the results of performing the same search across all 16 nodes are being combined into one result set, duplicates and all. I can only assume that I need to somehow partition my index or segments, but I'm unsure how to do that.
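One way to check whether the duplicates really are the same hit returned once per node is to count repeated URLs in the merged result list. This is only a sketch: it assumes the result URLs have been saved one per line to a file (hits.txt is a hypothetical name), and count_duplicates is a hypothetical helper, not a Nutch tool.

```shell
# Hypothetical helper: read one URL per line on stdin and print
# "<count> <url>" for every URL that occurs more than once.
count_duplicates() {
    sort | uniq -c | awk '$1 > 1 { print $1, $2 }'
}

# Intended use (hits.txt is an assumed file of result URLs):
#   count_duplicates < hits.txt
```

If each server were searching an identical copy of the index, most URLs would be expected to show a count close to the number of servers that returned them.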

I guess I need to take my master index and set of segments and split them into 16 equal parts, and copy (?) those to their respective nodes. It seems onerous and wasteful - I will be duplicating data that is already in DFS. Am I wrong?


Thanks to anyone who read this far ;)

-Shawn
