Re: Url post filtering

Albinscode Fri, 26 Sep 2014 11:29:31 -0700

Hello Markus and thank you for the quick answer!
I didn't see that URLFilters were referenced directly in IndexerMapReduce.
That's really powerful! Thanks!
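
For anyone reading this thread later, here is a minimal Python sketch of how an index-time regex filter file decides which URLs are kept. It is not Nutch code. It assumes the usual regex-urlfilter.txt conventions: each rule starts with + or -, the first rule whose pattern is found in the URL wins, and a URL matching no rule is dropped. The host example.com, the hub page all-links.html, and the function names are hypothetical.

```python
import re

# Hypothetical index-time rules: drop the hub page, keep the rest of the site.
INDEX_RULES = r"""
# the page that only exists to provide outlinks
-^http://example\.com/all-links\.html$
# everything else on the site may be indexed
+^http://example\.com/
"""


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return True if the first matching rule is a '+' rule."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # assumed default: no matching rule means the URL is dropped


rules = parse_rules(INDEX_RULES)
for url in ("http://example.com/all-links.html",
            "http://example.com/docs/page1.html"):
    print(url, "->", "index" if accepts(rules, url) else "skip")
```

With a file like this kept separate from the crawl-time regex-urlfilter.txt, the hub page is still fetched and parsed for outlinks but is filtered out at the indexing step. In the nutch-default.xml I have seen, the property to override is urlfilter.regex.file, and the indexing job has to be run with URL filtering enabled; check both against your Nutch version.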


2014-09-26 13:17 GMT+02:00 Markus Jelsma <[email protected]>:
> Hi - you can use a different regex file at the indexing stage. See
> nutch-default.xml for the configuration directive, and use -Dparam=val to
> override the default regex-urlfilter.txt file at indexing time.
> Markus
>
>
>
> -----Original message-----
>> From:Albinscode <[email protected]>
>> Sent: Friday 26th September 2014 11:25
>> To: [email protected]
>> Subject: Url post filtering
>>
>> Hello everybody,
>>
>> I usually filter URLs before the fetch operation with a regex filter
>> to avoid crawling the whole World Wide Web.
>>
>> I've got a specific need: one main page lists all the URLs to crawl. I
>> want to crawl the main page to get its outlinks, but I don't want to index
>> this page. How can I proceed?
>>
>> I could implement this in my own plugin, but I want to be
>> sure nothing already exists for it, as ever ;)
>> A dirty solution would be to delete the main page's URL from the generated
>> Solr index with a JSON query, but yeah, this is really dirty ;)
>>
>> Hope I'm clear.
>>
