Nutch not crawling all URLs

Roseline Antai Mon, 13 Dec 2021 04:02:59 -0800

Hi,

I am working with Apache Nutch 1.18 and Solr. I have set up the system 
successfully, but Nutch does not crawl all the URLs: it fetches only about 
half of the URLs in the seed.txt file. I am now at a loss as to what I should 
do to correct this problem.


For instance, when I inject 20 URLs, only 9 are fetched. I have made a number 
of changes based on the suggestions I saw on the Nutch forum, as well as on 
Stack Overflow, but nothing seems to work.

This is what my nutch-site.xml file looks like:


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>http.agent.name</name>
<value>Nutch Crawler</value>
</property>
<property>
<name>http.agent.email</name>
<value>datalake.ng at gmail d</value>
</property>
<property>
    <name>db.ignore.internal.links</name>
    <value>false</value>
</property>
<property>
    <name>db.ignore.external.links</name>
    <value>true</value>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
</property>
<property>
    <name>parser.skip.truncated</name>
    <value>false</value>
    <description>Boolean value for whether we should skip parsing for truncated
        documents. By default this property is activated due to extremely high
        levels of CPU which parsing can sometimes take.
    </description>
</property>
<property>
   <name>db.max.outlinks.per.page</name>
   <value>-1</value>
   <description>The maximum number of outlinks that we'll process for a page.
   If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
   will be processed for a page; otherwise, all outlinks will be processed.
   </description>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
<property>
  <name>db.ignore.external.links.mode</name>
  <value>byDomain</value>
</property>
<property>
  <name>db.injector.overwrite</name>
  <value>true</value>
</property>
<property>
  <name>http.timeout</name>
  <value>50000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>
</configuration>

Other changes I have made include changing the following in nutch-default.xml:

<property>
  <name>http.redirect.max</name>
  <value>2</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>
**************************************************************


<property>
  <name>ftp.timeout</name>
  <value>100000</value>
</property>

<property>
  <name>ftp.server.timeout</name>
  <value>150000</value>
</property>

*************************************************************


<property>
  <name>fetcher.server.delay</name>
  <value>65.0</value>
</property>

<property>
  <name>fetcher.server.min.delay</name>
  <value>25.0</value>
</property>

<property>
  <name>fetcher.max.crawl.delay</name>
  <value>70</value>
</property>
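
Since nutch-site.xml is the file intended for site-specific property overrides, 
the nutch-default.xml changes listed above could also be expressed as overrides 
there. The following is only a minimal sketch: it assumes the same values as 
above are kept unchanged, and that these properties are merged into the 
existing <configuration> element of nutch-site.xml rather than placed in a 
separate file.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Sketch only: values copied unchanged from the nutch-default.xml edits
       listed above. Assumption: in practice these properties are added to
       the existing nutch-site.xml next to the properties already there. -->
  <property>
    <name>http.redirect.max</name>
    <value>2</value>
  </property>
  <property>
    <name>ftp.timeout</name>
    <value>100000</value>
  </property>
  <property>
    <name>ftp.server.timeout</name>
    <value>150000</value>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>65.0</value>
  </property>
  <property>
    <name>fetcher.server.min.delay</name>
    <value>25.0</value>
  </property>
  <property>
    <name>fetcher.max.crawl.delay</name>
    <value>70</value>
  </property>
</configuration>
```

Keeping the overrides in nutch-site.xml leaves nutch-default.xml in its 
shipped state, which should make it easier to see exactly which settings 
differ from the defaults.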

I also commented out the line below in the regex-urlfilter.txt file:


# skip URLs containing certain characters as probable queries, etc.

-[?*!@=]

Nothing seems to work.

What is it that I'm not doing, or doing wrongly here?

Regards,
Roseline

Dr Roseline Antai
Research Fellow
Hunter Centre for Entrepreneurship
Strathclyde Business School
University of Strathclyde, Glasgow, UK

The University of Strathclyde is a charitable body, registered in Scotland, 
number SC015263.
