0.1 billion is pages, not URLs; sorry for that. It should be 4TB for 0.1 billion pages.
On 10/6/09, Gaurang Patel <gaurangtpa...@gmail.com> wrote:
> Hey Jack,
>
> *One concern:*
>
> I am not sure where I can get 0.1 billion page URLs. I am using the DMOZ
> Open Directory (which has around 3M URLs) to inject the crawldb.
>
> Please help.
>
> Regards,
> Gaurang
>
> 2009/10/4 Jack Yu <jackyu...@gmail.com>
>
>> 0.1 billion pages for 1.5TB
>>
>> On 10/5/09, Gaurang Patel <gaurangtpa...@gmail.com> wrote:
>> > All-
>> >
>> > I am new to using Nutch. Can anyone tell me the estimated size (I
>> > suppose, in TBs) required to store the crawled results for the whole
>> > web? I want an estimate of the storage requirements for my project,
>> > which uses the Nutch web crawler.
>> >
>> > Regards,
>> > Gaurang Patel
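For what it's worth, a quick back-of-envelope check of the figures quoted in this thread (1.5TB and 4TB for 0.1 billion pages). This is only a sketch, assuming TB means 10^12 bytes and that the figures cover stored page content alone, not Nutch's crawldb/linkdb/index overhead:

```python
# Sanity check: implied average size per crawled page.
# Assumptions: TB = 10^12 bytes; figures cover raw page content only.
pages = 0.1e9  # 0.1 billion pages

for total_tb in (1.5, 4.0):
    bytes_per_page = total_tb * 1e12 / pages
    print(f"{total_tb} TB / 0.1 billion pages = {bytes_per_page / 1e3:.0f} KB per page")
```

So 1.5TB works out to about 15 KB per page and 4TB to about 40 KB per page, which is why the two estimates differ: it comes down to how much of each page you assume is stored.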