I don't know how I joined this mailing list, but please take me off of it; I have not used Nutch for a long time.
Thanks!

On Mon, Dec 13, 2021 at 7:03 AM Roseline Antai <roseline.an...@strath.ac.uk> wrote:

> Hi,
>
> I am working with Apache Nutch 1.18 and Solr. I have set up the system
> successfully, but I'm now having the problem that Nutch is refusing to
> crawl all the URLs. I am now at a loss as to what I should do to correct
> this problem. It fetches only about half of the URLs in the seed.txt file.
>
> For instance, when I inject 20 URLs, only 9 are fetched. I have made a
> number of changes based on the suggestions I saw on the Nutch forum, as
> well as on Stack Overflow, but nothing seems to work.
>
> This is what my nutch-site.xml file looks like:
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>   <property>
>     <name>http.agent.name</name>
>     <value>Nutch Crawler</value>
>   </property>
>   <property>
>     <name>http.agent.email</name>
>     <value>datalake.ng at gmail d</value>
>   </property>
>   <property>
>     <name>db.ignore.internal.links</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>db.ignore.external.links</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
>   </property>
>   <property>
>     <name>parser.skip.truncated</name>
>     <value>false</value>
>     <description>Boolean value for whether we should skip parsing for
>     truncated documents. By default this property is activated due to
>     extremely high levels of CPU which parsing can sometimes take.
>     </description>
>   </property>
>   <property>
>     <name>db.max.outlinks.per.page</name>
>     <value>-1</value>
>     <description>The maximum number of outlinks that we'll process for a
>     page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page
>     outlinks will be processed for a page; otherwise, all outlinks will be
>     processed.
>     </description>
>   </property>
>   <property>
>     <name>http.content.limit</name>
>     <value>-1</value>
>     <description>The length limit for downloaded content using the http://
>     protocol, in bytes. If this value is nonnegative (>=0), content longer
>     than it will be truncated; otherwise, no truncation at all. Do not
>     confuse this setting with the file.content.limit setting.
>     </description>
>   </property>
>   <property>
>     <name>db.ignore.external.links.mode</name>
>     <value>byDomain</value>
>   </property>
>   <property>
>     <name>db.injector.overwrite</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>http.timeout</name>
>     <value>50000</value>
>     <description>The default network timeout, in milliseconds.</description>
>   </property>
> </configuration>
>
> Other changes I have made include changing the following in
> nutch-default.xml:
>
> <property>
>   <name>http.redirect.max</name>
>   <value>2</value>
>   <description>The maximum number of redirects the fetcher will follow when
>   trying to fetch a page. If set to negative or 0, fetcher won't immediately
>   follow redirected URLs, instead it will record them for later fetching.
>   </description>
> </property>
>
> <property>
>   <name>ftp.timeout</name>
>   <value>100000</value>
> </property>
>
> <property>
>   <name>ftp.server.timeout</name>
>   <value>150000</value>
> </property>
>
> <property>
>   <name>fetcher.server.delay</name>
>   <value>65.0</value>
> </property>
>
> <property>
>   <name>fetcher.server.min.delay</name>
>   <value>25.0</value>
> </property>
>
> <property>
>   <name>fetcher.max.crawl.delay</name>
>   <value>70</value>
> </property>
>
> I also commented out the line below in the regex-urlfilter file:
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> Nothing seems to work.
>
> What is it that I'm not doing, or doing wrongly, here?
>
> Regards,
> Roseline
>
> Dr Roseline Antai
> Research Fellow
> Hunter Centre for Entrepreneurship
> Strathclyde Business School
> University of Strathclyde, Glasgow, UK
>
> The University of Strathclyde is a charitable body, registered in
> Scotland, number SC015263.
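The thread never shows *why* the missing 11 URLs were dropped, and the CrawlDB records that for each URL. A minimal diagnostic sketch with the standard Nutch 1.x command line, assuming a local install and crawl data under crawl/crawldb (the paths and the seed file name are placeholders, not taken from the original message):

```shell
# 1. Count URLs by status -- look for db_unfetched, db_gone and
#    db_redir_* buckets to see where the missing URLs went:
bin/nutch readdb crawl/crawldb -stats

# 2. Dump the CrawlDB to inspect per-URL status and protocol status
#    (e.g. robots denied, fetch exceptions, redirects):
bin/nutch readdb crawl/crawldb -dump crawldb-dump -format normal

# 3. Check whether the seed URLs survive the configured URL filter
#    chain; accepted URLs are echoed with '+', rejected with '-':
bin/nutch filterchecker -allCombined < urls/seed.txt
```

If step 3 rejects some seeds, the cause is in regex-urlfilter.txt (or another active filter plugin) rather than the fetcher settings; if step 1 shows a large db_redir_* count, note that http.redirect.max above was set to 2, so longer redirect chains are recorded but not followed in the same fetch cycle.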