[Nutch-general] Limiting crawl to specific list of URLS

Kevvin Sevvvin Wed, 29 Nov 2006 15:35:54 -0800

Hi Everybody,

I'm real new to Nutch. I've read through the documentation and many
months of mailing list archives, and I don't think this question has
been answered.


I have two tasks I would like Nutch to handle. First, I would like it
to crawl and index ONLY a specific set of URLs. This is a stronger
limitation than confining the crawl to specific sites (so
db.ignore.external.links is insufficient): it should not follow ANY
links on pages in the list of URLs.

Secondly, after creating the crawl and index of specific sites, I would
like to occasionally add SINGLE URLs to the index.

Is this possible? If so, is it trivially possible with something like
'-topN 0' (or should that be '-topN 1'?)? Or could I create a single
local web page with all the links on it and run the crawler with
'-depth 1'?

Apologies if this is an overasked or misguided question; if so, I'd
appreciate pointers to appropriate documentation or code so I can
figure it out on my own.

Thanks!
-k7

