Nutch not crawling all URLs

2021-12-13 Thread Roseline Antai
Hi, I am working with Apache Nutch 1.18 and Solr. I have set up the system successfully, but I'm now having the problem that Nutch is refusing to crawl all the URLs. I am now at a loss as to what I should do to correct this problem. It fetches about half of the URLs in the seed.txt file. For

Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Greenholtz
I don't know how I joined this mailing list, but please take me off it; I have not used Nutch for a long time. Thanks! On Mon, Dec 13, 2021 at 7:03 AM Roseline Antai wrote: > Hi, > I am working with Apache Nutch 1.18 and Solr. I have set up the system successfully, but I’m now

RE: Nutch not crawling all URLs

2021-12-13 Thread Roseline Antai
Hi Lewis, Yes, these are public websites. Below are the 20 test URLs I've been trying to crawl. http://traivefinance.com http://www.ceibal.edu.uy http://www.talovstudio.com https://portaltelemedicina.com.br/en/telediagnostic-platform http://www.notco.com http://www.saiph.org

Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Nagel
Hi, (looping back to user@nutch - sorry, pressed the wrong reply button) > Some URLs were denied by robots.txt, > while a few failed with: Http code=403 Those are two ways to signal that these pages shouldn't be crawled; HTTP 403 means "Forbidden". > 3. I looked in CrawlDB and most URLs are in

Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Nagel
Hi Roseline, > 5,36405,0,http://www.notco.com What is the status for https://notco.com/, which is the final redirect target? Is the target page indexed? ~Sebastian
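
A minimal Python sketch (not Nutch code) of why the redirect target asked about above may matter: the seed http://www.notco.com and its redirect target https://notco.com/ sit on different host names, so a crawler that ignores external links on a per-host basis could treat the redirect as external. The helper names `host_of` and `is_external_by_host` are hypothetical, and whether Nutch actually drops this redirect depends on the crawl configuration, which the thread does not show.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercased host name of a URL (empty string if none)."""
    return (urlsplit(url).hostname or "").lower()


def is_external_by_host(source_url: str, target_url: str) -> bool:
    """True if the target URL is on a different host than the source URL."""
    return host_of(source_url) != host_of(target_url)


# Seed and redirect target as quoted in the thread.
seed = "http://www.notco.com"
redirect_target = "https://notco.com/"

# The hosts differ ("www.notco.com" vs "notco.com"), so a strict
# per-host comparison classifies the redirect as external.
print(is_external_by_host(seed, redirect_target))
```

A comparison by registered domain rather than by host would treat both URLs as the same site; this sketch deliberately shows only the stricter per-host case.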

Re: Nutch not crawling all URLs

2021-12-13 Thread lewis john mcgibbney
Hi Roseline, Looks like you are ignoring external URLs… that could be the problem right there. I encourage you to track the counters for the inject, generate and fetch phases to understand where records may be getting dropped. Are the seeds you are using public? If so please post your seed file so we can

Re: Nutch not crawling all URLs

2021-12-13 Thread Ayhan Koyun
Hi Sebastian, yes, that is what I mean. Do you think there is a way to learn more about how to crawl any website? >Hi Ayhan, >you mean?

Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Nagel
Hi Ayhan, you mean? https://stackoverflow.com/questions/69352136/nutch-does-not-crawl-sites-that-allows-all-crawler-by-robots-txt Sebastian On 12/13/21 20:59, Ayhan Koyun wrote: > Hi, > > as I wrote before, it seems that I am not the only one who can not crawl all > the seed.txt url's. I

RE: Nutch not crawling all URLs

2021-12-13 Thread Roseline Antai
Hi Lewis, I got a really weird reply back from what I sent, so I thought it better to resend the URLs. I'm unsure if you got the URLs in the first instance. I've sent them as a text file attachment as well. http://traivefinance.com http://www.ceibal.edu.uy http://www.talovstudio.com

Re: Nutch not crawling all URLs

2021-12-13 Thread Ayhan Koyun
Hi, as I wrote before, it seems that I am not the only one who cannot crawl all the URLs in seed.txt. I couldn't really find a solution. I collected 450 domains, and Nutch will not or cannot crawl approximately 200 of them. I want to know why this happens. Is there a solution to force crawling sites? It