[jira] [Created] (NUTCH-2610) How to exclude specific domains from Nutch crawling

Usama Tahir (JIRA) Fri, 22 Jun 2018 04:36:22 -0700

Usama Tahir created NUTCH-2610:
----------------------------------

             Summary: How to exclude specific domains from Nutch crawling
                 Key: NUTCH-2610
                 URL: https://issues.apache.org/jira/browse/NUTCH-2610
             Project: Nutch
          Issue Type: Bug
          Components: injector, plugin
    Affects Versions: 2.3.1
         Environment: OS: Ubuntu 16.04
            Reporter: Usama Tahir



I am using Nutch to crawl sites, and I want to use a blacklisting concept.

For example, if I add a domain to the blacklist, none of its documents should 
be in my crawl.

Can you guide me on how to do that?

I came to know that it can be done in the regex-urlfilter.txt file, in which we 
write the domain with a - sign, like:
 * -jang.com.pk

Is there a better way to do that?

Or is there any way to write all our blacklisted domains into a separate 
file and include it in the regex-urlfilter.txt file to exclude those 
domains?
Please guide me as soon as possible.
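
To make the separate-file idea concrete, here is a minimal sketch in Python, not a Nutch feature. It assumes a hypothetical blacklist with one domain per line and turns each entry into a "-" rule for regex-urlfilter.txt. The function names (domain_to_rule, build_rules, is_blocked) and the rule shape are my own assumptions. is_blocked only checks the generated patterns locally with Python's re module; it is not how Nutch itself evaluates the file.

```python
import re
from typing import List


def domain_to_rule(domain: str) -> str:
    """Turn one domain into an exclusion rule covering it and its subdomains."""
    # Escape the dots so "." matches a literal dot, not any character.
    escaped = domain.strip().lower().replace(".", r"\.")
    # Leading "-" marks the pattern as an exclusion in regex-urlfilter.txt.
    return rf"-^https?://([a-z0-9-]+\.)*{escaped}(:[0-9]+)?/"


def build_rules(blacklist_text: str) -> List[str]:
    """Build rules from blacklist text, skipping blank lines and # comments."""
    rules = []
    for line in blacklist_text.splitlines():
        domain = line.strip()
        if not domain or domain.startswith("#"):
            continue
        rules.append(domain_to_rule(domain))
    return rules


def is_blocked(url: str, rules: List[str]) -> bool:
    """Local sanity check: does any exclusion pattern match the URL?"""
    # rule[1:] drops the leading "-" to get the bare regular expression.
    return any(re.search(rule[1:], url) for rule in rules)


if __name__ == "__main__":
    # Hypothetical blacklist content; in practice this would be read from a file.
    sample = "# blacklisted domains\njang.com.pk\n"
    for rule in build_rules(sample):
        print(rule)
```

The printed lines would then be pasted into regex-urlfilter.txt above the final "+." rule, because the first pattern that matches a URL decides whether it is accepted or rejected.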

TIA



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
