RE: Url post filtering

Markus Jelsma Fri, 26 Sep 2014 04:18:07 -0700

Hi - you can use a different regex file at the indexing stage. See 
nutch-default.xml for the configuration directive, and use -Dparam=val to 
override the default regex-urlfilter.txt file when indexing.
Markus
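
To illustrate the idea, here is a minimal Python sketch of how a separate, index-stage rule file could keep the main page out of the index while the crawl-stage file still lets it be fetched. It is not Nutch code. It only mimics the regex-urlfilter.txt semantics as I understand them: each rule is "+" or "-" followed by a regex, the first rule that matches anywhere in the URL decides, and a URL that matches no rule is rejected. The example.com URLs and the rule file contents are hypothetical. The property to override is, I believe, urlfilter.regex.file, but check nutch-default.xml to be sure.

```python
import re


def parse_rules(text):
    """Parse regex-urlfilter style rules into (allow, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return True if the first matching rule is a '+' rule.

    Assumption: an unanchored match anywhere in the URL counts,
    and a URL that matches no rule is rejected.
    """
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


# Hypothetical index-stage rule file: drop the main page, keep the rest.
INDEX_RULES = r"""
# do not index the page that only lists the outlinks
-^http://example\.com/main\.html$
# index everything else
+.
"""

if __name__ == "__main__":
    rules = parse_rules(INDEX_RULES)
    for url in ("http://example.com/main.html",
                "http://example.com/article/1.html"):
        print(url, "indexed" if accepts(rules, url) else "skipped")
```

The crawl-stage file would still accept the main page so its outlinks get fetched; only the file passed at indexing time carries the extra "-" rule.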

-----Original message-----
> From:Albinscode <[email protected]>
> Sent: Friday 26th September 2014 11:25
> To: [email protected]
> Subject: Url post filtering
> 
> Hello everybody,
> 
> I usually filter URLs before the fetch operation using regex-filter
> to avoid crawling the whole World Wide Web.
> 
> I have a specific need: one main page lists all the URLs to crawl. I
> want to crawl the main page to get its outlinks, but I don't want to index
> this page. How can I proceed?
> 
> I could implement this in my own plugin, but I want to be
> sure nothing like it already exists ;)
> A dirty solution would be to delete the main page's URL from the generated
> Solr index with a JSON query, but yeah, this is really dirty ;)
> 
> Hope I'm clear.
>
