ve Tika parser for
mime-type application/javascript
Regards,
Roseline
The University of Strathclyde is a charitable body, registered in Scotland,
number SC015263.
-Original Message-
From: Roseline Antai
Sent: 13 January 2022 17:02
To: user@nutch.apache.org; Sebastian Nagel
Su
Thank you Sebastian.
I will try these.
Kind regards,
Roseline
-Original Message-
From: Sebastian Nagel
Sent: 13 January 2022 12:33
To: user@nutch.apache.org
Subject: Re: Nutch not crawling all URLs
Hi Roseline,
> Does it work at all with Chrome?
Yes.
> It seems you need t
t; processing.
>
> Kind regards,
> Roseline
>
> -Original Message-
> From: Sebastian Nagel
> Sent: 12 January 2022 16:12
> To: user@nutch.apache.org
> Subject: Re: Nutch not crawling all URLs
>
> Hi Roseline,
>
>> the
16:12
To: user@nutch.apache.org
Subject: Re: Nutch not crawling all URLs
Hi Roseline,
> the mail below went to my junk folder and I didn't see it.
No problem. I hope you nevertheless enjoyed the holidays.
And sorry for any delays but I want to emphasize that Nutch is a community
proj
r
> than it will be truncated; otherwise, no truncation at all. Do not
> confuse this setting with the file.content.limit setting.
>
>
>
> db.ignore.external.links.mode
> byHost
>
>
> db.injector.overwrite
> true
>
>
> http.timeout
byHost
db.injector.overwrite
true
http.timeout
5
The default network timeout, in milliseconds.
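
For readers unfamiliar with the layout, the settings above come from a Hadoop-style configuration file made of property/name/value entries. The sketch below rebuilds that shape and reads it back in Python. The property names and values are copied as they appear in the snippet above; the `SITE_XML` string and the `read_properties` helper are hypothetical names used only for illustration, not part of Nutch.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the flattened settings quoted above.
# Values are taken exactly as they appear in the message snippet.
SITE_XML = """
<configuration>
  <property>
    <name>db.ignore.external.links.mode</name>
    <value>byHost</value>
  </property>
  <property>
    <name>db.injector.overwrite</name>
    <value>true</value>
  </property>
  <property>
    <name>http.timeout</name>
    <value>5</value>
    <description>The default network timeout, in milliseconds.</description>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return a dict mapping property names to their string values."""
    root = ET.fromstring(xml_text.strip())
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


if __name__ == "__main__":
    for name, value in read_properties(SITE_XML).items():
        print(f"{name} = {value}")
```

Listing the effective values this way makes it easier to spot a setting whose unit or scope is not what was intended.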
Regards,
Roseline
-Original Message-
From: Sebastian Nagel
Sent: 13 December 2021 17:35
To: user@nutch.apache.org
Subject: Re: Nutch not crawling all URLs
CAUTION: This email originated o
Hi,
Following on from my previous enquiry, I was told to send the URLs I was trying
to crawl to be tried from your end. I sent these, but did not receive any
confirmation of receipt. Can you please confirm if these have been received,
and when I can look forward to getting some feedback?
I
Hi Sebastian,
Yes, that is what I mean. Do you think there is a way to learn more about
how to crawl any website?
>Hi Ayhan,
>you mean?
Hi Ayhan,
you mean?
https://stackoverflow.com/questions/69352136/nutch-does-not-crawl-sites-that-allows-all-crawler-by-robots-txt
Sebastian
On 12/13/21 20:59, Ayhan Koyun wrote:
> Hi,
>
> As I wrote before, it seems that I am not the only one who cannot crawl all
> the seed.txt URLs. I
of Strathclyde, Glasgow, UK
-Original Message-
From: lewis john mcgibbney
Sent: 13 December 2021 17:18
To: user@nutch.apache.org
Subject: Re: Nutch not crawling all URLs
Hi,
As I wrote before, it seems that I am not the only one who cannot crawl all
the seed.txt URLs. I couldn't really find a solution. I collected 450 domains,
and Nutch will not or cannot crawl approximately 200 of them. I want to
know why this happens. Is there a way to force crawling of these sites?
It
Hi Roseline,
> 5,36405,0,http://www.notco.com
What is the status for https://notco.com/ which is the final redirect
target?
Is the target page indexed?
~Sebastian
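
One possible reason the redirect target matters: the seed host (www.notco.com) and the redirect target host (notco.com) are different host names. With the db.ignore.external.links.mode value byHost quoted in this thread, a redirect between them may be treated as an external link. The following is a minimal Python sketch of such a host comparison. It is not Nutch's own code, and `host_of` and `is_external_by_host` are hypothetical helper names.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lower-cased host name of a URL (empty string if none)."""
    return (urlsplit(url).hostname or "").lower()


def is_external_by_host(from_url, to_url):
    """True when the two URLs have different hosts (illustrative check only)."""
    return host_of(from_url) != host_of(to_url)


if __name__ == "__main__":
    seed = "http://www.notco.com"
    target = "https://notco.com/"
    print(host_of(seed), host_of(target), is_external_by_host(seed, target))
```

If the check reports a host mismatch for a seed and its final redirect target, that seed is a candidate for the "dropped as external" explanation and is worth inspecting first.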
, Glasgow, UK
-Original Message-
From: lewis john mcgibbney
Sent: 13 December 2021 17:18
To: user@nutch.apache.org
Subject: Re: Nutch not crawling all URLs
Hi Roseline,
Looks like you are ignoring external URLs… that could be the problem right
there.
I encourage you to track counters on inject, generate and fetch phases to
understand where records may be being dropped.
Are the seeds you are using public? If so please post your seed file so we
can
Antai
> Research Fellow
> Hunter Centre for Entrepreneurship
> Strathclyde Business School
> University of Strathclyde, Glasgow, UK
>
>
> -Original Message-
I don't know how I joined this mailing list but please take me off of this
list, I have not used Nutch for a long time.
Thanks!
On Mon, Dec 13, 2021 at 7:03 AM Roseline Antai
wrote:
> Hi,
>
>
>
> I am working with Apache Nutch 1.18 and Solr. I have set up the system
> successfully, but I’m now