Usama Tahir created NUTCH-2610:
----------------------------------
Summary: How to exclude specific domains from Nutch crawling
Key: NUTCH-2610
URL: https://issues.apache.org/jira/browse/NUTCH-2610
Project: Nutch
Issue Type: Bug
Components: injector, plugin
Affects Versions: 2.3.1
Environment: OS: Ubuntu 16.04
Reporter: Usama Tahir
I am using Nutch to crawl sites, and I want to apply a blacklist.
For example, if I add a domain to the blacklist, none of its documents should
appear in my crawl.
Can you guide me on how to do that?
I learned that this can be done in the regex-urlfilter.txt file, where we write
the domain prefixed with a - sign, like:
* -jang.com.pk
Is there a better way to do that?
Or is there a way to write all our blacklisted domains in a separate
file and include it from the regex-urlfilter.txt file, so that those domains
are excluded?
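
To make the separate-file idea concrete, here is a minimal sketch of a preprocessing step that turns a blacklist file into exclusion rules and places them ahead of the existing rules. This is not a built-in Nutch include mechanism. The blacklist format (one domain per line, with # for comments) and the function names domain_to_rule, build_rules and merge_filters are assumptions made up for illustration. It relies on the regex-urlfilter.txt convention that the first matching rule decides, so the generated rules go first.

```python
def domain_to_rule(domain):
    """Return a regex-urlfilter exclusion line for a domain and its subdomains.

    Assumes the domain contains only letters, digits, hyphens and dots.
    """
    escaped = domain.strip().lower().replace(".", r"\.")
    return rf"-^https?://([a-z0-9-]+\.)*{escaped}/"


def build_rules(lines):
    """Convert blacklist lines to rules, skipping blanks and # comments."""
    rules = []
    for line in lines:
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue
        rules.append(domain_to_rule(entry))
    return rules


def merge_filters(blacklist_lines, base_filter_text):
    """Put the generated exclusions before the existing filter rules."""
    return "\n".join(build_rules(blacklist_lines)) + "\n" + base_filter_text


if __name__ == "__main__":
    # Hypothetical inputs; in practice these would be read from files.
    blacklist = ["# blocked news sites", "jang.com.pk"]
    base = "# accept anything else\n+.\n"
    print(merge_filters(blacklist, base))
```

The merged text would then be written out as the regex-urlfilter.txt that the crawl uses, and regenerated whenever the blacklist changes.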
Please guide me as soon as possible.
TIA
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)