Greetings list,
I am attempting to use Nutch's distributed search server mode and seeing
some unexpected results. Searches take ages to execute -- I seem to have
caused Nutch to perform the same search 16 times (I have 16 nodes).
Over the past week, I have been building my indexes:
$ bin/hadoop dfs -ls segments/
/user/nutch/segments/20060406061358 <dir> (~100k pages, first run)
/user/nutch/segments/20060411165547 <dir> (1M pages)
/user/nutch/segments/20060412214204 <dir> (2M pages)
/user/nutch/segments/20060413004057 <dir> (5M pages)
I then indexed them all into a single index. Here is my current DFS "du"
listing:
/user/nutch/crawldb       6379003931
/user/nutch/indexes      12895240115  (index built from all of the above segments)
/user/nutch/indexes_old    400611137  (index built from the smallest segment)
/user/nutch/linkdb         107951330
/user/nutch/segments     67746176573
I built my big index by running a command similar to:
$ bin/nutch index indexes crawldb linkdb segments/*
However, either DFS does not support wildcards or my shell was expanding
them against the local filesystem first, so I was forced to specify each
segment manually.
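The manual step can probably be avoided by generating the segment list
from the DFS listing itself. A minimal sketch in Python, assuming the
"dfs -ls" output looks like the listing above (path first, then <dir>);
the helper names segment_paths and index_command are my own and not part
of Nutch or Hadoop:

```python
import shlex

def segment_paths(ls_output):
    """Return the path column of every <dir> line in a `dfs -ls` listing."""
    paths = []
    for line in ls_output.splitlines():
        fields = line.split()
        # Assumed format: "<path> <dir> ..."; anything else (e.g. a header) is skipped.
        if len(fields) >= 2 and fields[1] == "<dir>":
            paths.append(fields[0])
    return paths

def index_command(ls_output):
    """Build the `bin/nutch index` command with each segment spelled out."""
    args = ["bin/nutch", "index", "indexes", "crawldb", "linkdb"]
    return " ".join(shlex.quote(a) for a in args + segment_paths(ls_output))

# Sample input copied from the listing above; in practice this text would
# come from running `bin/hadoop dfs -ls segments/`.
listing = """\
/user/nutch/segments/20060406061358 <dir>
/user/nutch/segments/20060411165547 <dir>
"""
print(index_command(listing))
```

This only prints the command so it can be inspected before running it.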
After building my index I proceeded to set up my distributed search
servers per Stefan's excellent wiki, at
http://wiki.media-style.com/display/nutchDocu/setup+multiple+search+sever.
I was not able to use the literal instructions, as my indexes and
segments are in DFS while the document presumes a local filesystem
installation, and I was also not able to "partition" my indexes or
segments by host. I don't know how to do that.
When I examine Tomcat's catalina.out log, as well as the logs of the
distributed search servers themselves, I see some odd behavior:
060415 011943 29 query request from 10.10.0.6
060415 011943 29 query: baby
060415 011943 29 searching for 20 raw hits
060415 011950 29 re-searching for 40 raw hits, query: baby
    -site:"1858.niengineering.co.uk" -site:"54.kometkarpets-southwest.co.uk"
060415 011958 29 found 2741775 raw hits
060415 011958 29 re-searching for 80 raw hits, query: baby
    -site:"1858.niengineering.co.uk" -site:"54.kometkarpets-southwest.co.uk"
    -site:"ffcembroidery.com" -site:"infobluebook.com"
    -site:"aaliyahlova.suddenlaunch.com"
060415 012006 29 found 2734890 raw hits
060415 012007 29 total hits: 2754135
I'm not sure why it is re-searching with a rewritten query that excludes
specific sites. I don't see this behavior when there is one search
server instead of the 16 I am using now. As you can see, the query is
unacceptably slow: about 24 seconds from request to total hits.
When I examine the search results I see many duplicates. Looking
further, it seems the results of performing the same search on all 16
nodes are being combined into one result set, duplicates and all. I can
only assume that I need to somehow partition my index or segments, but
I'm unsure how to do that.
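To make my guess concrete, here is a toy model in Python of what I think
is happening. It is not Nutch code: it assumes every server searches the
same full index and that the front end simply concatenates the
per-server hit lists. The function merge_hits and the example URLs are
made up for illustration:

```python
def merge_hits(per_server_hits):
    """Concatenate hit lists from all servers, as a naive front end would."""
    merged = []
    for hits in per_server_hits:
        merged.extend(hits)
    return merged

# Hypothetical hits for one query against the full index.
full_index_hits = ["http://a.example/", "http://b.example/"]

# Unpartitioned: all 16 servers hold the same index and return the same hits.
same_everywhere = merge_hits([full_index_hits] * 16)
print(len(same_everywhere), len(set(same_everywhere)))  # 32 total, 2 distinct

# Partitioned: each document lives on exactly one server.
partitioned = merge_hits([["http://a.example/"], ["http://b.example/"]] + [[]] * 14)
print(len(partitioned), len(set(partitioned)))  # 2 total, 2 distinct
```

If this model is right, every hit would show up once per server, which
matches the duplicates I am seeing.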
I guess I need to take my master index and set of segments and split
them into 16 equal parts, and copy (?) those to their respective nodes.
It seems onerous and wasteful - I will be duplicating data that is
already in DFS. Am I wrong?
Thanks to anyone who read this far ;)
-Shawn