Hi,

Doug Cutting wrote a wiki page about the hardware requirements of Nutch a while
back; you can check it out:

http://wiki.apache.org/nutch/HardwareRequirements

good luck

yanky


2009/3/4 John Martyniak <[email protected]>

> Regarding the machine, you could run it on anything; it all depends on what
> kind of performance you want.  So yes, you could run it on the machine that
> you have, or on something like the Linux machine that I have.  And the DSL
> connection should be fine; you just need to make sure that it stays up the
> whole time, because if it doesn't, the fetch will start erroring out and you
> will have to re-fetch that whole segment, as there is no way to pick up from
> where you left off.
>
> The only reason that I merged the segments was that I had many of them, and
> I wanted to build one big segment before I started creating new ones.  Another
> advantage of merging is that you can use it to clear out unwanted URLs.
> For example, I had a bunch of .js files in there that I didn't want to have
> as part of the index, so I cleared them out.
>
> I used "bin/nutch mergesegs".
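>
> For what it's worth, a rough sketch of that call (the paths are made-up
> examples and this is from memory of the 0.9 syntax, so double-check against
> the usage message that "bin/nutch mergesegs" prints when run with no
> arguments):
>
>   # merge every segment under crawl/segments into one new segment;
>   # -filter applies the configured URL filters during the merge, which is
>   # one way to drop things like those .js URLs
>   bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments -filter
>
> Afterwards you swap the merged segment in for the old ones.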
>
> Regarding merging the other parts, I have never used that, but I don't think
> it is necessary unless you have multiple linkdbs, etc.; in my case I do
> not.
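>
> (If it ever does come up, there are separate merge commands for those parts
> too; something along these lines, if I remember the 0.9 command names right,
> with made-up example paths:
>
>   bin/nutch mergedb crawl/merged_crawldb crawl1/crawldb crawl2/crawldb
>   bin/nutch mergelinkdb crawl/merged_linkdb crawl1/linkdb crawl2/linkdb
>
> Only needed if you really have more than one crawldb/linkdb to combine.)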
>
> -John
>
> On Mar 3, 2009, at 7:14 PM, [email protected] wrote:
>
>
>> Hi,
>>
>> I will need to index all of the links in those domains, then. Do you think a
>> Linux box like yours with a DSL connection is OK for indexing the domains I
>> have?
>>
>> Why only the segments? I thought we needed to merge all of the subfolders
>> under the crawl folder. What did you use for merging them?
>>
>> Thanks.
>> A.
>>
>> -----Original Message-----
>> From: John Martyniak <[email protected]>
>> To: [email protected]
>> Sent: Tue, 3 Mar 2009 3:21 pm
>> Subject: Re: what is needed to  index for about 10000 domains
>>
>> Well, the way that Nutch works is that you inject your list of domains into
>> the crawl db, and that is the starting point.  Since Nutch uses a crawler, it
>> grabs those pages, determines whether there are any links on them, and adds
>> those links to the db.  So the next time you generate your URLs to fetch, it
>> takes your original list plus the ones that it found to generate the new
>> segment.
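>> Roughly, the cycle looks like this; the directory names are just example
>> paths, adjust them to your own layout:
>>
>>   bin/nutch inject crawl/crawldb urls           # urls/ holds the seed list of domains
>>   bin/nutch generate crawl/crawldb crawl/segments
>>   s=`ls -d crawl/segments/* | tail -1`          # the segment just generated
>>   bin/nutch fetch $s
>>   bin/nutch updatedb crawl/crawldb $s           # newly discovered links go into the db
>>
>> The next generate then picks up both your original seeds and the links that
>> were discovered above.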
>>
>> If you wanted to limit it to only pages contained on your 10000 domains, you
>> could use the regex-urlfilter.txt file in the conf directory to restrict it
>> to your list.  But you would have to create a regular expression for each
>> one.
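>> One way to avoid writing 10000 regexes by hand is to generate them from the
>> domain list.  A rough sketch, assuming the domains sit one per line in a
>> file called domains.txt (that file name is just an example):
>>
>>   # turn each domain into a line like  +^http://([a-z0-9-]+\.)*example\.com/
>>   sed 's/\./\\./g; s|^|+^http://([a-z0-9-]+\\.)*|; s|$|/|' domains.txt >> conf/regex-urlfilter.txt
>>
>> If your copy of regex-urlfilter.txt ends with a catch-all "+." line, change
>> it to "-." so that URLs outside your list get rejected.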
>>
>> I am not familiar with the merge script on the wiki, but I have merged
>> segments before and it did work.  That was on Linux, though I don't think
>> that should make a difference.
>>
>> -John
>>
>> On Mar 3, 2009, at 5:10 PM, [email protected] wrote:
>>
>>> Hi,
>>>
>>> Thanks for the reply. I have only the list of those domains; I am not sure
>>> how many pages they have. Is a DSL connection sufficient to run Nutch in my
>>> case? Did you run Nutch for all of your pages at once, or separately for a
>>> given subset of them? By the way, yesterday I tried to use the merge shell
>>> script that we have on the wiki. It gave a lot of errors; I ran it under
>>> Cygwin, though.
>>>
>>> Thanks.
>>> A.
>>>
>>> -----Original Message-----
>>> From: John Martyniak <[email protected]>
>>> To: [email protected]
>>> Sent: Tue, 3 Mar 2009 1:44 pm
>>> Subject: Re: what is needed to  index for about 10000 domains
>>>
>>> I think that in order to answer that question, it is necessary to know how
>>> many total pages are being indexed.
>>>
>>> I currently have ~3.5 million pages indexed, and the segment directories are
>>> around 45GB.  The response time is relatively fast.
>>>
>>> On the test site it is running on a dual-processor Dell 1850 with 3GB of
>>> RAM.
>>>
>>> -John
>>>
>>> On Mar 3, 2009, at 3:44 PM, [email protected] wrote:
>>>
>>>> Hello,
>>>>
>>>> I use nutch-0.9 and need to index about 10000 domains.  I want to know the
>>>> minimum requirements for hardware and memory.
>>>>
>>>> Thanks in advance.
>>>> Alex.
>>
>
