Yep, you're right - I realize now what I'm doing wrong. I need to fetch the
pages from the initial fetchlist first, then update the db with the set of
new pages to fetch, then generate a new fetchlist based on these new pages
and so on....
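Just so I have the loop straight, I think each depth level works out to
something like this (a rough sketch - I'm assuming the 0.7-era bin/nutch
command names, that 'db' and 'segments' are my directory names, and that
segment dirs sort by timestamp so 'ls | tail -1' picks up the newest one):

    # sketch of a single-machine crawl to depth 5
    depth=5
    for i in `seq 1 $depth`; do
      bin/nutch generate db segments     # emit a fetchlist into a new segment
      seg=`ls -d segments/* | tail -1`   # the segment just created
      bin/nutch fetch $seg               # fetch the pages in that fetchlist
      bin/nutch updatedb db $seg         # fold the new links back into the db
    done
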
The reason I wanted to set up multiple machines for fetching was that I
thought it would be faster, but based on your response below that's not
always the case. I'm also crawling different content types and I was hoping
to keep these content types in separate db's so I can analyze them
separately. I'm working on a project that involves crawling a set of pages
and then recrawling them periodically to track any changes. So I want to be
able to crawl a set of pages and then recrawl the same set at specific times
over a given time period, be it weeks or months. I was hoping to 'speed up'
the crawl process so I could do, e.g., a crawl a week, and I thought that
multiple machines were the way to go.
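If the FetchListTool options do what its usage text suggests, the recrawl
part might not need anything special - I could just regenerate fetchlists
from the same db at each recrawl time. A sketch, assuming -refetchonly and
-adddays behave the way I think they do:

    # regenerate a fetchlist of already-known pages that are due for
    # refetching (or will be within 7 days), without adding new pages
    bin/nutch generate db segments -refetchonly -adddays 7
    seg=`ls -d segments/* | tail -1`
    bin/nutch fetch $seg
    bin/nutch updatedb db $seg
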
I'll have to have another think about it.....
Thanks,
Karen
----- Original Message -----
From: "Andy Liu" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Wednesday, July 06, 2005 6:21 PM
Subject: Re: Distributed Crawl
From what you described in your last post, it seems that you're
generating all 10 segments right off the bat. If you look more
closely at the CrawlTool code, you'll see that for each iteration, in
addition to generating the fetchlist, it fetches the segment and updates
the webdb with the results of the fetch.
It sounds like CrawlTool may be sufficient for your use case. What's your
rationale behind using multiple machines to fetch? Using multiple
machines may not always speed up the crawling process, especially
since you're adding in extra time and complexity by generating
multiple segments and copying them back and forth between machines.
Another option is to try using NDFS, although I haven't used it in
months and I don't know how supported it is anymore. Or you can wait
until Doug releases his MapReduce code.
Andy
On 7/6/05, Karen Church <[EMAIL PROTECTED]> wrote:
Thanks for the reply, Andy. Based on what you've suggested I've been doing
some tests. I've:
1. created a db
2. injected a set of urls
3. used the FetchListTool with the -numFetchers option to specify how many
fetchlists I want to emit (exact commands sketched below).
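In concrete terms the commands were along these lines (a sketch - the
-urlfile flag is my reading of WebDBInjector's usage, and urls.txt is just
my seed file):

    bin/nutch admin db -create                       # 1. create a new webdb
    bin/nutch inject db -urlfile urls.txt            # 2. inject the seed urls
    bin/nutch generate db segments -numFetchers 2    # 3. emit 2 fetchlists
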
However, I'm a little confused about something. Let's say I want to crawl
to a depth of 5. Looking at the CrawlTool code, the FetchListTool is just
called X number of times where X is the desired crawl depth. I've taken a
similar approach, i.e. I've created a small application that calls the
FetchListTool X number of times using a for loop, where the bound on the
loop is the crawl depth I want, which in this case is 5.
For test purposes, I'm setting numFetchers to 2 and the crawl depth to
5. The result is 10 segment directories, each with its own fetchlist.
Example:
segment1-0
segment2-0
segment3-0
segment4-0
segment5-0
segment1-1
segment2-1
segment3-1
segment4-1
segment5-1
Am I on the right track with this approach???
So, to carry out the fetch process on multiple machines, do I just copy
these segments onto the various machines and fetch them using the
'Fetcher' class?
Apologies if my questions seem strange - I'm just a little confused about
whether or not I'm on the right track and how to proceed from here....
Thanks,
Karen
----- Original Message -----
From: "Andy Liu" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Wednesday, July 06, 2005 2:36 PM
Subject: Re: Distributed Crawl
You can emit multiple fetchlists using the -numFetchers option, copy
each segment to a different machine to fetch, copy the segments back,
and run updatedb on all the segments.
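In outline, something like this (a sketch - the hostnames, paths, and
seg-0/seg-1 segment names are placeholders, and I'm assuming you can
scp/ssh between the machines):

    # emits one segment directory per fetchlist
    bin/nutch generate db segments -numFetchers 2
    # ship one segment to each fetch machine:
    scp -r segments/seg-0 machine1:nutch/segments/
    scp -r segments/seg-1 machine2:nutch/segments/
    # on each machine, fetch its own segment:
    #   bin/nutch fetch segments/seg-0     (on machine1)
    #   bin/nutch fetch segments/seg-1     (on machine2)
    # copy the fetched segments back, then update the webdb once per segment:
    bin/nutch updatedb db segments/seg-0
    bin/nutch updatedb db segments/seg-1

Indexing can then run on that single machine as usual.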
Andy
On 7/6/05, Karen Church <[EMAIL PROTECTED]> wrote:
> Hi All,
>
> I was wondering if someone could point me in the right direction for
> carrying out a distributed crawl. Basically I want to split a crawl over
> a few machines. Is there a way of just 'fetching' the pages using multiple
> machines and then merging the results onto a single machine? Can I then
> run the Nutch indexing process over that single machine?
>
> Thanks
> Karen
>
>