Re: Crawling the entire web

2024-01-10 Thread Gora Mohanty
Hi, Would suggest starting out by looking at Common Crawl: https://commoncrawl.org/ Regards, Gora

Crawling the entire web

2024-01-10 Thread Ridwan Naibi
Hi everyone, I hope this finds you well. I am planning to crawl the entire web. I already know how to set up Nutch, Solr, and a database. I need advice on how to crawl the entire web. How many nodes should I have? Which instances should I use on, say, AWS? What should my setup look like? This

Re: nutch adds %20 in urls instead of spaces

2024-01-10 Thread Steve Cohen
Thanks for the response, Markus. Disabling urlnormalizer-basic works.
On Tue, Jan 9, 2024 at 3:43 PM Markus Jelsma wrote:
> Hello Steve,
>
> Having those spaces normalized/encoded is expected behaviour with
> urlnormalizer-basic active. I would recommend to keep it this way and have
> all URLs
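For readers unfamiliar with the encoding discussed in this thread, here is a minimal Python sketch of what percent-encoding a space as %20 looks like. It is not Nutch's urlnormalizer-basic code and makes no claim about how that plugin is implemented; the function name encode_spaces and the example URL are hypothetical, chosen only for illustration.

```python
from urllib.parse import quote, unquote

# Hypothetical helper for illustration only; not part of Nutch.
# Characters listed in `safe` are left untouched, so URL delimiters and
# already-encoded sequences such as "%20" are not encoded a second time.
def encode_spaces(url: str) -> str:
    return quote(url, safe=":/?#[]@!$&'()*+,;=%")

raw = "https://example.com/my file.pdf"   # assumed example URL
encoded = encode_spaces(raw)
print(encoded)            # the space is emitted as %20
print(unquote(encoded))   # decoding restores the original string
```

Because "%" is treated as safe, applying the helper to an already-encoded URL leaves it unchanged, which is the property one usually wants from a normalizer so that both spellings of a URL map to a single form.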

Re: Crawling the entire web

2024-01-10 Thread Ridwan Naibi
Hi Gora, Thank you very much for your advice. That is exactly what I need. Best wishes, Ridwan
From: Gora Mohanty
Sent: Wednesday, January 10, 2024 5:21:49 PM
To: user@nutch.apache.org
Subject: Re: Crawling the entire web
Hi, Would suggest starting out by