> So, whether we do the crawling on multiple nodes or run it on a
> single node with the regular series of bin/generate, bin/fetcher,
> etc. commands, we finally end up with a single crawldb on the local
> system where Nutch is supposed to run. Am I right?
The "crawldb" is just where Nutch keeps track of the state of URLs it
knows about.
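For reference, the single-node cycle that builds up the crawldb looks
roughly like this (a sketch assuming Nutch 0.9-style commands; the
crawl/ and urls paths are just examples):

  # one-time: seed the crawldb from a directory of flat files of start URLs
  bin/nutch inject crawl/crawldb urls

  # then repeat this cycle, once per level of crawl depth:
  bin/nutch generate crawl/crawldb crawl/segments   # pick URLs due for fetching
  segment=`ls -d crawl/segments/* | tail -1`        # the newly created segment
  bin/nutch fetch $segment                          # download (and parse) the pages
  bin/nutch updatedb crawl/crawldb $segment         # fold results back into the crawldb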
The end result of a crawl is a series of "segments" that contain the
downloaded web page content, and (once you've indexed them) the
Lucene index for each segment.
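Indexing is its own set of steps; roughly (same caveats as above):

  bin/nutch invertlinks crawl/linkdb crawl/segments/*   # build the link database
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
  bin/nutch dedup crawl/indexes                         # delete duplicate pages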
At this point you have a choice: merge the segments and indexes down
to a single segment and index that you serve up with the regular Nutch
search bean, or keep multiple segments that live on multiple
"remote searcher" boxes.
Part of the confusion with Nutch is that it is both a crawler and a
search server, and the architecture/mechanism for running a
distributed crawl (Hadoop DFS/MapReduce) is completely different from
the architecture/mechanism for running a distributed search server.
-- Ken
On Dec 16, 2007 11:47 AM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
If you are talking about the clustering plugin, that is about grouping
(hopefully) related documents in the search results.
Running the crawler and other Nutch processes on multiple nodes is
Nutch running on Hadoop's MapReduce framework. Moving the final
indexes and databases to local file systems for searching is simply
best practice.
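E.g., after a distributed crawl finishes, something like (paths made up):

  # pull the finished crawl out of DFS onto the search box's local disk
  bin/hadoop dfs -copyToLocal crawl /local/nutch/crawl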
Dennis Kubes
Bent Hugh wrote:
> I am a little confused. In the Nutch wiki there are chapters on
> clustering. I have never tried them, though. So what is clustering
> about? Is it running the crawler on multiple nodes and creating a
> crawldb on multiple nodes? And then finally merging all of these on a
> local system and running the Nutch web GUI from that?
>
>
> On Dec 16, 2007 10:17 AM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
>> Technically you can, but the speed for most search applications would
>> be unacceptable. Searching of indexes is best done on local file
>> systems for speed.
>>
>> Dennis Kubes
>>
>>
>> hzhong wrote:
>>> Hello,
>>>
>>> Why can't we search on the Hadoop DFS?
>>>
>>> Thanks
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"