RE: Nutch not crawling all URLs

2022-02-16 Thread Roseline Antai
ve Tika parser for mime-type application/javascript Regards, Roseline The University of Strathclyde is a charitable body, registered in Scotland, number SC015263. -Original Message- From: Roseline Antai Sent: 13 January 2022 17:02 To: user@nutch.apache.org; Sebastian Nagel Su

RE: Nutch not crawling all URLs

2022-01-13 Thread Roseline Antai
Thank you Sebastian. I will try these. Kind regards, Roseline -Original Message- From: Sebastian Nagel Sent: 13 January 2022 12:33 To: user@nutch.apache.org Subject: Re: Nutch not crawling all URLs Hi Roseline, > Does it work at all with Chrome? Yes. > It seems you need t

Re: Nutch not crawling all URLs

2022-01-13 Thread Sebastian Nagel
> processing. > > Kind regards, > Roseline > > > > > > -Original Message----- > From: Sebastian Nagel > Sent: 12 January 2022 16:12 > To: user@nutch.apache.org > Subject: Re: Nutch not crawling all URLs > > Hi Roseline, > >> the

RE: Nutch not crawling all URLs

2022-01-12 Thread Roseline Antai
16:12 To: user@nutch.apache.org Subject: Re: Nutch not crawling all URLs Hi Roseline, > the mail below went to my junk folder and I didn't see it. No problem. I hope you nevertheless enjoyed the holidays. And sorry for any delays but I want to emphasize that Nutch is a community proj

Re: Nutch not crawling all URLs

2022-01-12 Thread Sebastian Nagel
r > than it will be truncated; otherwise, no truncation at all. Do not > confuse this setting with the file.content.limit setting. > > > > db.ignore.external.links.mode > byHost > > > db.injector.overwrite > true > > > http.timeout

RE: Nutch not crawling all URLs

2022-01-12 Thread Roseline Antai
byHost db.injector.overwrite true http.timeout 5 The default network timeout, in milliseconds. Regards, Roseline -Original Message- From: Sebastian Nagel Sent: 13 December 2021 17:35 To: user@nutch.apache.org Subject: Re: Nutch not crawling all URLs CAUTION: This email originated o

RE: Nutch not crawling all URLs

2021-12-15 Thread Roseline Antai
Hi, Following on from my previous enquiry, I was told to send the URLs I was trying to crawl to be tried from your end. I sent these, but did not receive any confirmation of receipt. Can you please confirm if these have been received, and when I can look forward to getting some feedback? I

Re: Nutch not crawling all URLs

2021-12-13 Thread Ayhan Koyun
Hi Sebastian, yes, that is what I mean. Do you think there is a way to learn more about how to crawl any website? > Hi Ayhan, > you mean?

Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Nagel
Hi Ayhan, you mean? https://stackoverflow.com/questions/69352136/nutch-does-not-crawl-sites-that-allows-all-crawler-by-robots-txt Sebastian On 12/13/21 20:59, Ayhan Koyun wrote: > Hi, > > as I wrote before, it seems that I am not the only one who can not crawl all > the seed.txt url's. I

RE: Nutch not crawling all URLs

2021-12-13 Thread Roseline Antai
of Strathclyde, Glasgow, UK The University of Strathclyde is a charitable body, registered in Scotland, number SC015263. -Original Message- From: lewis john mcgibbney Sent: 13 December 2021 17:18 To: user@nutch.apache.org Subject: Re: Nutch not crawling all URLs CAUTION: This email

Re: Nutch not crawling all URLs

2021-12-13 Thread Ayhan Koyun
Hi, as I wrote before, it seems that I am not the only one who cannot crawl all the URLs in seed.txt. I couldn't really find a solution. I collected 450 domains, and Nutch will not or cannot crawl approximately 200 of them. I want to know why this happens; is there a way to force crawling of these sites? It

Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Nagel
Hi Roseline, > 5,36405,0,http://www.notco.com What is the status for https://notco.com/, which is the final redirect target? Is the target page indexed? ~Sebastian
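
A minimal sketch of why the redirect target may matter here. It assumes, as the configuration quoted elsewhere in this thread suggests, that external links are ignored and compared by host. This is not Nutch code: the helper names `host_of` and `is_external_by_host` are hypothetical and only illustrate a plain host-name comparison between the seed and its redirect target.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lower-cased host name of a URL (empty string if none)."""
    return (urlsplit(url).hostname or "").lower()


def is_external_by_host(source_url: str, target_url: str) -> bool:
    """Hypothetical check: True if the target is on a different host than the source."""
    return host_of(source_url) != host_of(target_url)


# Seed URL and final redirect target taken from the message above.
seed = "http://www.notco.com"
redirect_target = "https://notco.com/"

# www.notco.com and notco.com are different host names, so a strict
# by-host comparison would treat the redirect target as external.
print(is_external_by_host(seed, redirect_target))  # True
```

If a crawl really does compare by host, a seed that redirects from a `www.` host to the bare domain could be dropped at this point, which is one thing worth checking in the CrawlDb status of the target URL.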

RE: Nutch not crawling all URLs

2021-12-13 Thread Roseline Antai
, Glasgow, UK The University of Strathclyde is a charitable body, registered in Scotland, number SC015263. -Original Message- From: lewis john mcgibbney Sent: 13 December 2021 17:18 To: user@nutch.apache.org Subject: Re: Nutch not crawling all URLs CAUTION: This email originated

Re: Nutch not crawling all URLs

2021-12-13 Thread lewis john mcgibbney
Hi Roseline, Looks like you are ignoring external URLs… that could be the problem right there. I encourage you to track counters on inject, generate and fetch phases to understand where records may be being dropped. Are the seeds you are using public? If so please post your seed file so we can

Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Nagel
Antai > Research Fellow > Hunter Centre for Entrepreneurship > Strathclyde Business School > University of Strathclyde, Glasgow, UK > > > The University of Strathclyde is a charitable body, registered in Scotland, > number SC015263. > > > -Original Message-

Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Greenholtz
I don't know how I joined this mailing list but please take me off of this list, I have not used Nutch for a long time. Thanks! On Mon, Dec 13, 2021 at 7:03 AM Roseline Antai wrote: > Hi, > > > > I am working with Apache nutch 1.18 and Solr. I have set up the system > successfully, but I’m now