Bent Hugh wrote:
So, whether we do the crawling using multiple nodes or run it on a
single node with the regular bin/generate, bin/fetcher series of
commands, we finally end up with a single crawldb on the local system
where Nutch is supposed to run. Am I right?

Yes and no. The artifacts of Nutch are the crawldb, the segments (each of which contains a generated fetch list, fetched content, content text, and content metadata), the linkdb, and the indexes. If you are running everything on a single node then everything would be on the local file system.
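
For reference, the single-node cycle Bent mentions usually looks something like the following. This is only a sketch: the directory names (crawl/, urls/) are illustrative and the exact options vary by Nutch version, so check the usage output of each command.

  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments
  s=`ls -d crawl/segments/* | tail -1`        # newest segment
  bin/nutch fetch $s
  bin/nutch updatedb crawl/crawldb $s
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

You repeat generate/fetch/updatedb for each round, then invert links and build the index once you have the segments you want.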

We run over 100 nodes, so the way we do it is to keep a master crawldb and a master linkdb. We generate, fetch, update, index, and deploy in pieces, or as Google calls them, shards, of 1-2M pages each. The processing happens using MapReduce and DFS.

During each shard processing cycle we update both the master crawldb and the master linkdb. We use these master databases for generating new fetch lists and for scoring during indexing. Each shard eventually has its own segments and indexes, plus a portion of the master linkdb copied to a shard linkdb that serves only the URLs in that shard. These three are generated and stored on the DFS and then copied to a local file system where they can be searched. Each shard is deployed to a dedicated search server on its local file system. The number of shards grows linearly with the size of our index.
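
The copy from DFS down to the search server's disk is just an ordinary DFS-to-local copy. A sketch of that deploy step, with made-up shard paths, might look like:

  # pull the shard's indexes, segments, and shard linkdb out of DFS
  # onto the search server's local disk (all paths are illustrative)
  bin/hadoop dfs -copyToLocal /crawl/shards/shard-0042/indexes  /d01/search/shard-0042/indexes
  bin/hadoop dfs -copyToLocal /crawl/shards/shard-0042/segments /d01/search/shard-0042/segments
  bin/hadoop dfs -copyToLocal /crawl/shards/shard-0042/linkdb   /d01/search/shard-0042/linkdb

Once those three pieces are local, that directory can be served directly by a search server.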

Dennis Kubes



On Dec 16, 2007 11:47 AM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
If you are talking about the clustering plugin, that is about grouping
(hopefully) related documents in the search results.

Running the crawler and other Nutch processes on multiple nodes is Nutch
and Hadoop running the MapReduce paradigm.  Moving the final indexes and
databases to local file systems for searching is simply best practice.
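
When the indexes end up spread across several search servers, the usual pattern (a general Nutch facility, not something specific to the setup described above) is distributed search: each server exposes its local shard and the web front end fans queries out to all of them. Roughly, with an illustrative port and path:

  # on each search server, serve the locally deployed shard
  bin/nutch server 9999 /d01/search/shard-0042

  # on the web front end, conf/search-servers.txt lists one "host port" per line:
  #   searchhost01 9999
  #   searchhost02 9999

Again, check the usage output for your version; the exact syntax may differ.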

Dennis Kubes


Bent Hugh wrote:
I am a little confused. In the Nutch wiki there are chapters on
clustering. I have never tried them, though. So what is clustering
about? Is it running the crawler on multiple nodes and creating a
crawldb on multiple nodes? And then finally merging all of these on a
local system and running the Nutch web GUI from that?


On Dec 16, 2007 10:17 AM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
Technically you can.  The speed for most search applications would be
unacceptable.  Searching of indexes is best done on local file systems
for speed.
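
One knob worth knowing about here: the web app decides where to search via the searcher.dir property, so pointing it at a directory on the local disk rather than a path on DFS is how you keep searching local. A minimal nutch-site.xml fragment, with a made-up path, would be:

  <property>
    <name>searcher.dir</name>
    <value>/d01/search/shard-0042</value>
  </property>

If that directory instead contains a search-servers.txt file, the web app should fall back to distributed search against the servers listed there.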

Dennis Kubes


hzhong wrote:
Hello,

Why can't we search on the Hadoop DFS?

Thanks
