Hi,
I am working with Apache Nutch 1.18 and Solr. I have set up the system
successfully, but I'm now having the problem that Nutch refuses to crawl
all of the URLs, and I am at a loss as to how to correct this. It fetches
only about half of the URLs in the seed.txt file.
I don't know how I joined this mailing list, but please take me off this
list; I have not used Nutch for a long time.
Thanks!
On Mon, Dec 13, 2021 at 7:03 AM Roseline Antai wrote:
> Hi,
>
> I am working with Apache Nutch 1.18 and Solr. I have set up the system
> successfully, but I'm now
Hi Lewis,
Yes, there are public websites. Below are the 20 test URLs I've been trying to
crawl.
http://traivefinance.com
http://www.ceibal.edu.uy
http://www.talovstudio.com
https://portaltelemedicina.com.br/en/telediagnostic-platform
http://www.notco.com
http://www.saiph.org
Hi,
(looping back to user@nutch - sorry, pressed the wrong reply button)
> Some URLs were denied by robots.txt,
> while a few failed with: Http code=403
Those are two ways of signaling that these pages should not be crawled;
HTTP 403 means "Forbidden".
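A robots.txt decision is easy to reproduce outside of Nutch with Python's standard urllib.robotparser. A minimal sketch, with a made-up robots.txt body and the made-up agent name "mybot" (Nutch itself applies the rules for the agent configured in http.agent.name):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: everyone is barred from /private/,
# and "badbot" is barred from the whole site.
robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: badbot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("mybot", "http://www.example.com/"))           # True
print(rp.can_fetch("mybot", "http://www.example.com/private/x"))  # False
print(rp.can_fetch("badbot", "http://www.example.com/"))          # False
```

In practice you would point RobotFileParser at http://&lt;host&gt;/robots.txt via set_url() and read(); parsing the body directly just keeps the example self-contained.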
> 3. I looked in CrawlDB and most URLs are in
Hi Roseline,
> 5,36405,0,http://www.notco.com
What is the status for https://notco.com/, which is the final redirect
target?
Is the target page indexed?
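The distinction matters because the redirect source and the redirect target end up as separate CrawlDB entries (each inspectable with "bin/nutch readdb &lt;crawldb&gt; -url &lt;url&gt;"), so the status has to be checked for the final target. A small sketch of resolving a redirect chain; the mapping below is made up, with http://www.notco.com → https://notco.com/ mirroring the case above:

```python
def final_target(url, redirects, max_hops=5):
    """Follow a url through a redirect mapping to its final target,
    stopping on a loop or after max_hops redirects."""
    seen = {url}
    while url in redirects and max_hops > 0:
        url = redirects[url]
        if url in seen:  # redirect loop detected
            break
        seen.add(url)
        max_hops -= 1
    return url

redirects = {"http://www.notco.com": "https://notco.com/"}
print(final_target("http://www.notco.com", redirects))  # https://notco.com/
```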
~Sebastian
Hi Roseline,
Looks like you are ignoring external URLs… that could be the problem right
there.
I encourage you to track the counters of the inject, generate, and fetch
phases to understand where records are being dropped.
Are the seeds you are using public? If so, please post your seed file so we
can
Hi Sebastian,
yes, that is what I mean. Do you think there is a way to learn more about
how to crawl any website?
> Hi Ayhan,
> you mean?
Hi Ayhan,
you mean?
https://stackoverflow.com/questions/69352136/nutch-does-not-crawl-sites-that-allows-all-crawler-by-robots-txt
Sebastian
On 12/13/21 20:59, Ayhan Koyun wrote:
> Hi,
>
> as I wrote before, it seems that I am not the only one who can not crawl all
> the seed.txt url's. I
Hi Lewis,
I got a really weird reply back to what I sent, so I thought it better to
resend the URLs. I'm not sure whether you got them the first time. I've
also sent them as a text file attachment.
http://traivefinance.com
http://www.ceibal.edu.uy
http://www.talovstudio.com
Hi,
as I wrote before, it seems that I am not the only one who cannot crawl all
of the URLs in seed.txt. I couldn't really find a solution. I collected 450
domains, and Nutch will not or cannot crawl approximately 200 of them. I
want to know why this happens. Is there a way to force crawling of these
sites?