Yep, you're right - I realize now what I'm doing wrong. I need to fetch the
pages from the initial fetchlist first, then update the db with the set of
new pages to fetch, then generate a new fetchlist based on these new pages
and so on....
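Just so I have the loop straight, I think each depth level works out to
something like this (a rough sketch - I'm assuming the 0.7-era bin/nutch
command names, that 'db' and 'segments' are my directory names, and that
segment dirs sort by timestamp so 'ls | tail -1' picks up the newest one):

    # sketch of a single-machine crawl to depth 5
    depth=5
    for i in `seq 1 $depth`; do
      bin/nutch generate db segments     # emit a fetchlist into a new segment
      seg=`ls -d segments/* | tail -1`   # the segment just created
      bin/nutch fetch $seg               # fetch the pages in that fetchlist
      bin/nutch updatedb db $seg         # fold the new links back into the db
    done
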
The reason I wanted to set up multiple machines for fetching was that I
thought it would be faster, but based on your response below that's not
always the case. I'm also crawling different content types and I was hoping
to keep these content types in separate db's so I can analyze them
separately. I'm working on a project that involves crawling a set of pages
and then recrawling them periodically to track any changes. So I want to be
able to crawl a set of pages and then recrawl the same set at specific times
over a given time period, be it weeks or months. I was hoping to 'speed up'
the crawl process so I could do, e.g., a crawl a week, and I thought that
multiple machines were the way to go.
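If the FetchListTool options do what its usage text suggests, the recrawl
part might not need anything special - I could just regenerate fetchlists
from the same db at each recrawl time. A sketch, assuming -refetchonly and
-adddays behave the way I think they do:

    # regenerate a fetchlist of already-known pages that are due for
    # refetching (or will be within 7 days), without adding new pages
    bin/nutch generate db segments -refetchonly -adddays 7
    seg=`ls -d segments/* | tail -1`
    bin/nutch fetch $seg
    bin/nutch updatedb db $seg
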
I'll have to have another think about it.....
Thanks,
Karen
----- Original Message -----
From: "Andy Liu" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Wednesday, July 06, 2005 6:21 PM
Subject: Re: Distributed Crawl
From what you described in your last post, it seems that you're
generating all 10 segments right off the bat. If you look more
closely at the CrawlTool code, you'll see that for each iteration, in
addition to generating the fetchlist, it fetches the segment and updates
the webdb with the results of the fetch.
It sounds like CrawlTool may be sufficient for your use case. What's your
rationale behind using multiple machines to fetch? Using multiple
machines may not always speed up the crawling process, especially
since you're adding in extra time and complexity by generating
multiple segments and copying them back and forth between machines.
Another option is to try using NDFS, although I haven't used it in
months and I don't know how supported it is anymore. Or you can wait
until Doug releases his MapReduce code.
Andy
On 7/6/05, Karen Church <[EMAIL PROTECTED]> wrote:
Thanks for the reply, Andy. Based on what you've suggested I've been doing
some tests. I've:
1. created a db
2. injected a set of urls
3. used the FetchListTool with the -numFetchers option to specify how many
fetchlists I want to emit (exact commands sketched below).
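In concrete terms the commands were along these lines (a sketch - the
-urlfile flag is my reading of WebDBInjector's usage, and urls.txt is just
my seed file):

    bin/nutch admin db -create                       # 1. create a new webdb
    bin/nutch inject db -urlfile urls.txt            # 2. inject the seed urls
    bin/nutch generate db segments -numFetchers 2    # 3. emit 2 fetchlists
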
However, I'm a little confused about something. Let's say I want to crawl
to a depth of 5. Looking at the CrawlTool code, the FetchListTool is just
called X number of times where X is the desired crawl depth. I've taken a
similar approach, i.e. I've created a small application that calls the
FetchListTool X number of times using a for loop, where the bound on the
loop is the crawl depth I want, which in this case is 5.
For test purposes, I'm setting numFetchers to 2 and the crawl depth to
5. The result is 10 segment directories, each with its own fetchlist.
Example:
segment1-0
segment2-0
segment3-0
segment4-0
segment5-0
segment1-1
segment2-1
segment3-1
segment4-1
segment5-1
Am I on the right track with this approach???
So, to carry out the fetch process on multiple machines, do I just copy
these segments onto the various machines and fetch them using the
'Fetcher' class?
Apologies if my questions seem strange - I'm just a little confused about
whether or not I'm on the right track and how to proceed from here....
Thanks,
Karen
----- Original Message -----
From: "Andy Liu" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Wednesday, July 06, 2005 2:36 PM
Subject: Re: Distributed Crawl
You can emit multiple fetchlists using the -numFetchers option, copy
each segment to a different machine to fetch, copy the segments back,
and run updatedb on all the segments.
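In outline, something like this (a sketch - the hostnames, paths, and
seg-0/seg-1 segment names are placeholders, and I'm assuming you can
scp/ssh between the machines):

    # emits one segment directory per fetchlist
    bin/nutch generate db segments -numFetchers 2
    # ship one segment to each fetch machine:
    scp -r segments/seg-0 machine1:nutch/segments/
    scp -r segments/seg-1 machine2:nutch/segments/
    # on each machine, fetch its own segment:
    #   bin/nutch fetch segments/seg-0     (on machine1)
    #   bin/nutch fetch segments/seg-1     (on machine2)
    # copy the fetched segments back, then update the webdb once per segment:
    bin/nutch updatedb db segments/seg-0
    bin/nutch updatedb db segments/seg-1

Indexing can then run on that single machine as usual.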
Andy
On 7/6/05, Karen Church <[EMAIL PROTECTED]> wrote:
> Hi All,
>
> I was wondering if someone could point me in the right direction for
> carrying out a distributed crawl. Basically I want to split a crawl over
> a few machines. Is there a way of just 'fetching' the pages using multiple
> machines and then merging the results onto a single machine? Can I then
> run the Nutch indexing process over that single machine?
>
> Thanks
> Karen
>
>