Hi,
Would suggest starting out by looking at Common Crawl:
https://commoncrawl.org/
Regards,
Gora
Hi everyone,
I hope this meets you well. I am planning to crawl the entire web. I already
know how to set up Nutch, Solr, and a database. I need advice on how to crawl
the entire web. How many nodes should I have? What instances should I use on
say AWS Cloud? What should my setup be like? This
Thanks for the response, Markus. Disabling urlnormalizer-basic works.
On Tue, Jan 9, 2024 at 3:43 PM Markus Jelsma
wrote:
> Hello Steve,
>
> Having those spaces normalized/encoded is expected behaviour with
> urlnormalizer-basic active. I would recommend to keep it this way and have
> all URLs
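
For readers unfamiliar with the behaviour described in the quoted message, here is a rough illustration of what "spaces normalized/encoded" means for a URL. This is a minimal Python sketch and not Nutch's own (Java) implementation; the function name `encode_spaces` and the example.com URL are made up for illustration.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_spaces(url: str) -> str:
    """Percent-encode unsafe characters (such as spaces) in a URL's path.

    Illustrative only: this is not how urlnormalizer-basic is implemented.
    """
    parts = urlsplit(url)
    # Keep "/" as the path separator and "%" so existing escapes are not re-encoded.
    path = quote(parts.path, safe="/%")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    # Hypothetical URL with spaces in the path.
    print(encode_spaces("http://example.com/my docs/file name.html"))
```

With a normalizer of this kind active, the raw and the encoded spelling of a URL end up as the same string, which is presumably why keeping it enabled is the recommendation above.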
Hi Gora,
Thank you very much for your advice. That is exactly what I need.
Best wishes
Ridwan
From: Gora Mohanty
Sent: Wednesday, January 10, 2024 5:21:49 PM
To: user@nutch.apache.org
Subject: Re: Crawling the entire web
Hi,
Would suggest starting out by