Re: what is needed to index for about 10000 domains

John Martyniak Tue, 03 Mar 2009 15:21:48 -0800

Well, the way that nutch works is that you would inject your list of domains into the DB, and that would be the starting point. Since nutch uses a crawler, it would grab those pages, determine if there are any links on those pages, and then add them to the DB. So the next time that you generated your urls to fetch, it would take your original list, plus the ones that it found, to generate the new segment.
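To make the cycle above concrete, here is a toy Python model of it. This is not Nutch code: the function name `crawl`, the `link_graph` dictionary, and the rule that already-fetched URLs are skipped when the next segment is generated are all assumptions made for illustration.

```python
def crawl(seeds, link_graph, rounds):
    """Toy model of an inject / generate / fetch / update cycle.

    seeds      -- list of starting URLs (the injected list)
    link_graph -- hypothetical dict mapping a URL to the links found on it
    rounds     -- maximum number of generate/fetch rounds to run
    Returns the list of segments, one list of URLs per round.
    """
    # inject: every seed starts out in the DB as "not yet fetched"
    db = {url: False for url in seeds}
    segments = []
    for _ in range(rounds):
        # generate: pick every URL in the DB that has not been fetched yet
        segment = sorted(url for url, fetched in db.items() if not fetched)
        if not segment:
            break  # nothing left to fetch
        for url in segment:
            # fetch: mark the page as fetched
            db[url] = True
            # update: add any newly discovered links to the DB
            for link in link_graph.get(url, []):
                db.setdefault(link, False)
        segments.append(segment)
    return segments


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["a", "d"]}
    for number, segment in enumerate(crawl(["a"], graph, 3), start=1):
        print(number, segment)
```

Each round's segment grows from whatever the previous round discovered, which is why an unfiltered crawl quickly leaves the original list of domains.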

If you wanted to limit it to only pages contained on your 10000 domains, you could use the regex-urlfilter.txt file in the conf directory to limit it to your list. But you would have to create a regular expression for each one.
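Writing 10000 expressions by hand is not practical, so a small script can generate them. The Python sketch below is one way to do it, under some assumptions: the helper names are made up, the rules follow the usual `+regex` / `-regex` line format of regex-urlfilter.txt, subdomains are meant to be included, and every URL has a slash after the host name. Check the output against your own conf file before using it.

```python
import re


def domain_to_filter_rule(domain):
    """Build one '+' accept rule covering a domain and its subdomains."""
    escaped = re.escape(domain.strip().lower())
    # Optional subdomain labels, then the escaped domain, then a slash.
    return r"+^https?://([a-z0-9-]+\.)*" + escaped + "/"


def build_url_filter(domains):
    """Return filter file text: one accept rule per domain, then reject the rest."""
    rules = [domain_to_filter_rule(d) for d in domains if d.strip()]
    rules.append("-.")  # anything that matched no accept rule is rejected
    return "\n".join(rules) + "\n"


if __name__ == "__main__":
    # Hypothetical input; in practice read the domains from your own list file.
    print(build_url_filter(["example.com", "example.org"]), end="")
```

The final `-.` line matters: without it, a filter file that ends with an accept-everything rule would still let the crawl follow links off your list.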

I am not familiar with the merge script on the wiki, but I have merged segments before and it did work. That was on Linux, but I don't think that should make a difference.


-John



On Mar 3, 2009, at 5:10 PM, [email protected] wrote:

Hi,
Thanks for the reply. I have only a list of those domains. I am not sure how many pages they have. Is a DSL connection sufficient to run nutch in my case? Did you run nutch for all of your pages at once, or separately for a given subset of them? Btw, yesterday I tried to use the merge shell script that we have on the wiki. It gave a lot of errors. I ran it on cygwin though.
Thanks.
A.







-----Original Message-----
From: John Martyniak <[email protected]>
To: [email protected]
Sent: Tue, 3 Mar 2009 1:44 pm
Subject: Re: what is needed to  index for about 10000 domains
I think that in order to answer that question, it is necessary to know how many total pages are being indexed.

I currently have ~3.5 million pages indexed, and the segment directories are around 45GB. The response time is relatively fast.

In the test site it is running on a dual-processor Dell 1850 with 3GB of RAM.

-John

On Mar 3, 2009, at 3:44 PM, [email protected] wrote:

Hello,

I use nutch-0.9 and need to index about 10000 domains. I want to know the minimum hardware and memory requirements.

Thanks in advance.
Alex.
