hi k7,

Add the URLs you want to crawl to a text file inside a directory called urls (one URL per line). Then, in conf/regex-urlfilter.txt, add regular expressions for the URL patterns you want included or excluded.
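For example, assuming a seed file named urls/seeds.txt and placeholder hosts (adjust the names to taste):

    $ mkdir urls
    $ cat > urls/seeds.txt <<EOF
    http://www.example.com/page1.html
    http://www.example.org/page2.html
    EOF

In regex-urlfilter.txt, lines starting with '+' include and lines starting with '-' exclude, and the first matching pattern wins. Since you want ONLY those exact pages, anchor the patterns to the full URLs, e.g.:

    # accept exactly the seed URLs, nothing else
    +^http://www\.example\.com/page1\.html$
    +^http://www\.example\.org/page2\.html$
    # reject everything else (replaces the default '+.' catch-all)
    -.

(Depending on your Nutch version, the one-step "crawl" command may read conf/crawl-urlfilter.txt instead; the format is the same.)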
Hope this answers your question.

Fadzi

On Wed, 2006-11-29 at 15:34 -0800, Kevvin Sevvvin wrote:
> Hi everybody,
>
> I'm really new to Nutch. I've read through the documentation and many
> months of mailing-list archives, and I don't think this question has
> been answered.
>
> I have two tasks I would like Nutch to handle. First, I would like it
> to crawl and index ONLY a specific set of URLs. This is a stronger
> limitation than confining the crawl to specific sites (so
> db.ignore.external.links is insufficient): it should not follow ANY
> links on the pages in that list.
>
> Secondly, after creating the crawl and index of those specific URLs,
> I would like to occasionally add SINGLE URLs to the index.
>
> Is this possible? If so, is it trivially possible with something like
> '-topN 0' (or should that be '-topN 1'?)? Or could I create a single
> local web page with all the links on it and run the crawler with
> '-depth 1'?
>
> Apologies if this is an over-asked or misguided question; if so, I'd
> appreciate pointers to the appropriate documentation or code so I can
> figure it out on my own.
>
> Thanks!
> -k7
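P.S. On the two specific questions above: '-depth 1' is the right idea. With '-depth 1' Nutch runs a single generate/fetch cycle over the injected seeds and fetches no outlinks. '-topN 0' would generate nothing to fetch, so set -topN to at least the number of seed URLs. A sketch, with placeholder directory names (exact options can differ slightly between Nutch versions):

    # fetch and index ONLY the injected URLs, following no links
    $ bin/nutch crawl urls -dir crawl -depth 1 -topN 1000

To add single URLs later, you should be able to inject them into the existing crawldb and run one extra fetch cycle instead of re-crawling (first add a matching '+' line to the URL filter for each new URL, or it will be dropped at inject time):

    $ mkdir new_urls
    $ echo "http://www.example.com/new-page.html" > new_urls/seed.txt
    $ bin/nutch inject crawl/crawldb new_urls
    $ bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    $ s=`ls -d crawl/segments/* | tail -1`
    $ bin/nutch fetch $s
    $ bin/nutch updatedb crawl/crawldb $s

then re-run indexing over the new segment.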
