Re: Crawling a fixed domain

Siddhartha Reddy Thu, 26 Jun 2008 11:25:32 -0700

Hi Kranthi,

Are you doing an intranet crawl (using the "bin/nutch crawl" command) or a
whole-web crawl (using the various other sub-commands of bin/nutch, for
example)? conf/crawl-urlfilter.txt is used only in the intranet crawl; for a
whole-web crawl you need to use conf/regex-urlfilter.txt instead.


Another effective way of restricting a crawl to the domains from the seed
list is to set the db.ignore.external.links property to true in
conf/nutch-site.xml. conf/nutch-default.xml includes a description of this
property.
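
As a minimal sketch, the override in conf/nutch-site.xml could look like the
following. Only the property name and its value come from the paragraph above;
the surrounding <configuration> wrapper assumes the same layout that
conf/nutch-default.xml uses, so check it against your own copy.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: site-specific overrides of conf/nutch-default.xml -->
<configuration>
  <property>
    <!-- Assumed effect, per the description in nutch-default.xml:
         outlinks pointing to a different host than the page they were
         found on are not added to the crawl db. -->
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>
```

With this set, links such as the google-analytics.com and cricketnext.com ones
in the log below should no longer be queued, as long as the seed list contains
only ibnlive.com URLs.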

Best,
Siddhartha

On Thu, Jun 26, 2008 at 11:31 PM, kranthi reddy <[EMAIL PROTECTED]>
wrote:

> Hi ,
>
>  I am trying to crawl a fixed domain ... say IBNLIVE.COM ...
>
>  I have changed my conf/crawl-urlfilter.txt . I have added the line
>
>  "+^http://([a-z0-9]*\.)*ibnlive.com/ "
>
>
>   But I don't know what is going on ... I get results like
>
>  "fetching http://www.google-analytics.com/urchin.js
>   fetching http://www.josh18.com/showstory.php?id=236481
>   fetching http://www.cricketnext.com/news/gambhir-raina-make-merry-as-bowlers-struggle/32395-13.html
> "
>
>
>   I have given it in the format specified in the wiki/nutch site....
>   But it doesn't seem to work...
>
>  Someone please help me out...
>
> Thanking you
> kranthi reddy.b
>

-- 
http://www.grok.in
"Ignorance killed the cat, curiosity was framed."
