[jira] [Resolved] (NUTCH-2610) How to exclude specific domains from Nutch crawling

Sebastian Nagel (JIRA) Fri, 22 Jun 2018 06:36:12 -0700


     [ 
https://issues.apache.org/jira/browse/NUTCH-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sebastian Nagel resolved NUTCH-2610.
------------------------------------
    Resolution: Not A Problem

Please use the [Nutch user mailing 
list|http://nutch.apache.org/mailing_lists.html] for questions about how to 
configure Nutch. Thanks, [~usama_]!

> How to exclude specific domains from Nutch crawling
> ---------------------------------------------------
>
>                 Key: NUTCH-2610
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2610
>             Project: Nutch
>          Issue Type: Bug
>          Components: injector, plugin
>    Affects Versions: 2.3.1
>         Environment: OS: Ubuntu 16.04
>            Reporter: Usama Tahir
>            Priority: Major
>
> I am using Nutch to crawl sites, and I want to blacklist certain domains.
> For example, if I add a domain to the blacklist, none of its documents should 
> be in my crawl.
> Can you guide me on how to do that?
> I learned that it can be done in the regex-urlfilter.txt file, where we write 
> the domain with a - sign, like:
>  * -jang.com.pk
> Is there a better way to do that?
> Or is there a way to keep all our blacklisted domains in a separate 
> file and include it in the regex-urlfilter file to exclude those 
> domains?
> Please guide me as soon as possible.
> TIA
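
One possible way to keep the blacklist in a separate file is to generate the exclusion rules from it and paste them into regex-urlfilter.txt. The sketch below is only an illustration, not a Nutch feature: the helper names (blacklist_rules, merge_into_filter) are hypothetical, and it assumes that each rule in regex-urlfilter.txt is a regular expression prefixed with + or -, and that the first matching rule decides whether a URL is kept.

```python
import re


def blacklist_rules(domains):
    """Turn bare domain names into '-' (exclude) rules.

    Hypothetical helper: one rule per domain, also covering subdomains.
    Blank lines and lines starting with '#' are skipped.
    """
    rules = []
    for raw in domains:
        domain = raw.strip().lower()
        if not domain or domain.startswith("#"):
            continue
        # Escape the dots so "jang.com.pk" cannot match "jangXcom.pk".
        escaped = re.escape(domain)
        # Optional subdomain labels, then the domain, then port, path or end.
        rules.append(r"-^https?://([a-z0-9-]+\.)*" + escaped + r"(?:[:/]|$)")
    return rules


def merge_into_filter(rules, existing_lines):
    """Place the exclusions first, assuming the first matching rule wins."""
    return list(rules) + list(existing_lines)


if __name__ == "__main__":
    # In practice the domains would be read from a separate blacklist file.
    blacklist = ["jang.com.pk"]
    for line in merge_into_filter(blacklist_rules(blacklist), ["+."]):
        print(line)
```

The printed lines would then go into conf/regex-urlfilter.txt above the final catch-all +. rule. Compared with a bare -jang.com.pk, the anchored and escaped pattern avoids excluding unrelated URLs that merely contain a similar string.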



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
