I don't know how I joined this mailing list, but please take me off of it; I have not used Nutch for a long time.
Thanks!

On Mon, Dec 13, 2021 at 7:03 AM Roseline Antai <roseline.an...@strath.ac.uk> wrote:

> Hi,
>
> I am working with Apache Nutch 1.18 and Solr. I have set up the system
> successfully, but I'm now having the problem that Nutch is refusing to
> crawl all the URLs. I am now at a loss as to what I should do to correct
> this problem. It fetches only about half of the URLs in the seed.txt file.
>
> For instance, when I inject 20 URLs, only 9 are fetched. I have made a
> number of changes based on the suggestions I saw on the Nutch forum, as
> well as on Stack Overflow, but nothing seems to work.
>
> This is what my nutch-site.xml file looks like:
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>   <property>
>     <name>http.agent.name</name>
>     <value>Nutch Crawler</value>
>   </property>
>   <property>
>     <name>http.agent.email</name>
>     <value>datalake.ng at gmail d</value>
>   </property>
>   <property>
>     <name>db.ignore.internal.links</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>db.ignore.external.links</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
>   </property>
>   <property>
>     <name>parser.skip.truncated</name>
>     <value>false</value>
>     <description>Boolean value for whether we should skip parsing for
>     truncated documents. By default this property is activated due to
>     extremely high levels of CPU which parsing can sometimes take.
>     </description>
>   </property>
>   <property>
>     <name>db.max.outlinks.per.page</name>
>     <value>-1</value>
>     <description>The maximum number of outlinks that we'll process for a
>     page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page
>     outlinks will be processed for a page; otherwise, all outlinks will be
>     processed.
>     </description>
>   </property>
>   <property>
>     <name>http.content.limit</name>
>     <value>-1</value>
>     <description>The length limit for downloaded content using the http://
>     protocol, in bytes. If this value is nonnegative (>=0), content longer
>     than it will be truncated; otherwise, no truncation at all. Do not
>     confuse this setting with the file.content.limit setting.
>     </description>
>   </property>
>   <property>
>     <name>db.ignore.external.links.mode</name>
>     <value>byDomain</value>
>   </property>
>   <property>
>     <name>db.injector.overwrite</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>http.timeout</name>
>     <value>50000</value>
>     <description>The default network timeout, in milliseconds.</description>
>   </property>
> </configuration>
>
> Other changes I have made include changing the following in
> nutch-default.xml:
>
> <property>
>   <name>http.redirect.max</name>
>   <value>2</value>
>   <description>The maximum number of redirects the fetcher will follow when
>   trying to fetch a page. If set to negative or 0, fetcher won't immediately
>   follow redirected URLs, instead it will record them for later fetching.
>   </description>
> </property>
>
> <property>
>   <name>ftp.timeout</name>
>   <value>100000</value>
> </property>
>
> <property>
>   <name>ftp.server.timeout</name>
>   <value>150000</value>
> </property>
>
> <property>
>   <name>fetcher.server.delay</name>
>   <value>65.0</value>
> </property>
>
> <property>
>   <name>fetcher.server.min.delay</name>
>   <value>25.0</value>
> </property>
>
> <property>
>   <name>fetcher.max.crawl.delay</name>
>   <value>70</value>
> </property>
>
> I also commented out the line below in the regex-urlfilter file:
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> Nothing seems to work.
>
> What is it that I'm not doing, or doing wrongly, here?
>
> Regards,
> Roseline
>
> Dr Roseline Antai
> Research Fellow
> Hunter Centre for Entrepreneurship
> Strathclyde Business School
> University of Strathclyde, Glasgow, UK
>
> The University of Strathclyde is a charitable body, registered in
> Scotland, number SC015263.
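The thread never shows *why* the missing 11 URLs were dropped, and the CrawlDB records that for each URL. A minimal diagnostic sketch with the standard Nutch 1.x command line, assuming a local install and crawl data under crawl/crawldb (the paths and the seed file name are placeholders, not taken from the original message):

```shell
# 1. Count URLs by status -- look for db_unfetched, db_gone and
#    db_redir_* buckets to see where the missing URLs went:
bin/nutch readdb crawl/crawldb -stats

# 2. Dump the CrawlDB to inspect per-URL status and protocol status
#    (e.g. robots denied, fetch exceptions, redirects):
bin/nutch readdb crawl/crawldb -dump crawldb-dump -format normal

# 3. Check whether the seed URLs survive the configured URL filter
#    chain; accepted URLs are echoed with '+', rejected with '-':
bin/nutch filterchecker -allCombined < urls/seed.txt
```

If step 3 rejects some seeds, the cause is in regex-urlfilter.txt (or another active filter plugin) rather than the fetcher settings; if step 1 shows a large db_redir_* count, note that http.redirect.max above was set to 2, so longer redirect chains are recorded but not followed in the same fetch cycle.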