Hi Kranthi,

Are you doing an intranet crawl (using the "bin/nutch crawl" command) or a whole-web crawl (using the other bin/nutch sub-commands)? conf/crawl-urlfilter.txt is used only for the intranet crawl; otherwise you need to use conf/regex-urlfilter.txt.
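
As a rough sketch of the whole-web case, the rule from your crawl-urlfilter.txt could be carried over to conf/regex-urlfilter.txt along these lines. This assumes your regex-urlfilter.txt still ends with the stock catch-all "+." line; since the first matching rule decides whether a URL is kept, that line would have to become "-." for the restriction to take effect.

```
# conf/regex-urlfilter.txt (fragment; assumes the stock rules above this point are left unchanged)

# accept hosts in ibnlive.com
+^http://([a-z0-9]*\.)*ibnlive.com/

# skip everything else (the stock file is assumed to have "+." here)
-.
```
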
Another effective way of restricting a crawl to the domains from the seed list is to set the db.ignore.external.links property to true in conf/nutch-site.xml. conf/nutch-default.xml includes a description of this property.

Best,
Siddhartha

On Thu, Jun 26, 2008 at 11:31 PM, kranthi reddy <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I am trying to crawl a fixed domain, say IBNLIVE.COM.
>
> I have changed my conf/crawl-urlfilter.txt and added the line
>
> "+^http://([a-z0-9]*\.)*ibnlive.com/ "
>
> But I don't know what is going on. I get results like
>
> "fetching http://www.google-analytics.com/urchin.js
> fetching http://www.josh18.com/showstory.php?id=236481
> fetching http://www.cricketnext.com/news/gambhir-raina-make-merry-as-bowlers-struggle/32395-13.html"
>
> I have given it in the format specified on the wiki/Nutch site,
> but it doesn't seem to work.
>
> Could someone please help me out?
>
> Thanking you,
> kranthi reddy.b

--
http://www.grok.in
"Ignorance killed the cat, curiosity was framed."
