Re: Crawling a fixed domain

kranthi reddy Thu, 26 Jun 2008 11:48:27 -0700

Hi Siddhartha,

   I am doing a whole web crawl.


  Thanks for the response. I am now able to restrict the crawl to a
particular domain. For this I changed db.ignore.external.links to
true.
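
A rough model of what that property seems to do, going by its description (outlinks that lead to a different host than the page they came from are dropped). This is a Python sketch rather than Nutch's own Java; keep_internal_outlinks is a made-up name and the host comparison is an assumption about the behaviour, not code taken from Nutch.

```python
from urllib.parse import urlsplit


def keep_internal_outlinks(page_url, outlinks):
    """Return only the outlinks whose host equals the host of page_url.

    Hypothetical helper: it approximates db.ignore.external.links=true
    by comparing host names exactly (no sub-domain folding).
    """
    page_host = urlsplit(page_url).hostname
    return [link for link in outlinks if urlsplit(link).hostname == page_host]


if __name__ == "__main__":
    found = [
        "http://www.ibnlive.com/news/",
        "http://www.google-analytics.com/urchin.js",
    ]
    # Only the link on the same host as the source page survives.
    print(keep_internal_outlinks("http://www.ibnlive.com/", found))
```

Because the comparison is per host, a link from www.ibnlive.com to another sub-domain of ibnlive.com would also count as external in this model.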

But when I set the value back to false and instead
try changing conf/regex-urlfilter.txt by adding the line

     "+^http://([a-z0-9]*\.)*ibnlive.com/"

it doesn't work.
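
One possible explanation, offered as an assumption rather than something confirmed in this thread: the URL filter uses the first rule that matches, and the stock conf/regex-urlfilter.txt is believed to end with a catch-all "+." rule, so adding a "+" line on its own never rejects anything. The Python sketch below models first-match-wins filtering. parse_rules, accepts and both rule sets are illustrative, not Nutch code, and the dot in ibnlive.com is escaped here so that it matches only a literal dot.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule decides; a URL that matches no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


# Assumed rule sets: the first keeps a catch-all accept, the second rejects the rest.
PERMISSIVE = r"""
+^http://([a-z0-9]*\.)*ibnlive\.com/
+.
"""
RESTRICTED = r"""
+^http://([a-z0-9]*\.)*ibnlive\.com/
-.
"""

if __name__ == "__main__":
    for name, text in (("permissive", PERMISSIVE), ("restricted", RESTRICTED)):
        rules = parse_rules(text)
        for url in ("http://www.ibnlive.com/news/",
                    "http://www.google-analytics.com/urchin.js"):
            print(name, url, accepts(rules, url))
```

If this model is right, the thing to check in the real file is whether a trailing "+." is still present after the new line.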


Thank you
Kranthi Reddy.B

On Thu, Jun 26, 2008 at 11:54 PM, Siddhartha Reddy <[EMAIL PROTECTED]> wrote:

> Hi Kranthi,
>
> Are you doing an intranet crawl (using the "bin/nutch crawl" command) or a
> whole-web crawl (using the various other sub-commands of bin/nutch, for
> example)? conf/crawl-urlfilter.txt is used only in the intranet crawl; you
> need to use conf/regex-urlfilter.txt otherwise.
>
> Another effective way of restricting a crawl to the domains from the seed
> list is to set the db.ignore.external.links property to true in
> conf/nutch-site.xml. conf/nutch-default.xml includes a description of this
> property.
>
> Best,
> Siddhartha
>
> On Thu, Jun 26, 2008 at 11:31 PM, kranthi reddy <[EMAIL PROTECTED]>
> wrote:
>
> > Hi ,
> >
> >  I am trying to crawl a fixed domain ... say IBNLIVE.COM ...
> >
> >  I have changed my conf/crawl-urlfilter.txt . I have added the line
> >
> >  "+^http://([a-z0-9]*\.)*ibnlive.com/ "
> >
> >
> >   But I don't know what is going on ... I get results like
> >
> >  "fetching http://www.google-analytics.com/urchin.js
> >   fetching http://www.josh18.com/showstory.php?id=236481
> >   fetching http://www.cricketnext.com/news/gambhir-raina-make-merry-as-bowlers-struggle/32395-13.html"
> >
> >
> >   I have given it in the format specified in the wiki/nutch site....
> >   But it doesn't seem to work...
> >
> >  Someone please help me out...
> >
> > Thanking you
> > kranthi reddy.b
> >
>
> --
> http://www.grok.in
> "Ignorance killed the cat, curiosity was framed."
>
