Hi Siddhartha,
I am doing a whole-web crawl.
Thanks for the response. I am now able to restrict the crawl to a
particular domain. For this I set db.ignore.external.links to
true.
But when I set the value back to false and
try changing conf/regex-urlfilter.txt by adding the line
"+^http://([a-z0-9]*\.)*ibnlive.com/"
It doesn't work...
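
One thing that may be worth checking is the order of the rules in the file.
The sketch below is an illustration in Python, not Nutch's own code. It
assumes the filter tries rules top to bottom, that the first rule whose
pattern is found in the URL decides, and that a URL matching no rule is
rejected. It also assumes the rest of conf/regex-urlfilter.txt ends with a
catch-all "+." line; the thread does not show the file's contents. The
names parse_rules and accepts and the ibnlive.com test URL are hypothetical.

```python
import re


def parse_rules(text):
    """Parse lines of the form '+regex' or '-regex'; skip blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # A leading '+' means accept; anything else is treated as reject here.
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """The first rule whose pattern is found in the URL decides; no match rejects."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


DOMAIN_RULE = r"+^http://([a-z0-9]*\.)*ibnlive.com/"

# Assumed file contents: the domain rule followed by a catch-all line.
with_catch_all_accept = parse_rules(DOMAIN_RULE + "\n+.")
with_catch_all_reject = parse_rules(DOMAIN_RULE + "\n-.")

urls = [
    "http://www.ibnlive.com/news/",  # hypothetical URL on the target domain
    "http://www.google-analytics.com/urchin.js",
    "http://www.josh18.com/showstory.php?id=236481",
]
for url in urls:
    print(url,
          accepts(with_catch_all_accept, url),
          accepts(with_catch_all_reject, url))
```

Under these assumptions, a trailing "+." lets the off-domain URLs through
even though the domain rule is present, while a trailing "-." (or no
catch-all at all) keeps only the ibnlive.com URL.
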
Thank you
Kranthi Reddy.B
On Thu, Jun 26, 2008 at 11:54 PM, Siddhartha Reddy <[EMAIL PROTECTED]> wrote:
> Hi Kranthi,
>
> Are you doing an intranet crawl (using the "bin/nutch crawl" command) or a
> whole-web crawl (using the various other sub-commands of bin/nutch, for
> example)? conf/crawl-urlfilter.txt is used only in the intranet crawl; you
> need to use conf/regex-urlfilter.txt otherwise.
>
> Another effective way of restricting a crawl to the domains from the seed
> list is to set the db.ignore.external.links property to true in
> conf/nutch-site.xml. conf/nutch-default.xml includes a description of this
> property.
>
> Best,
> Siddhartha
>
> On Thu, Jun 26, 2008 at 11:31 PM, kranthi reddy <[EMAIL PROTECTED]>
> wrote:
>
> > Hi,
> >
> > I am trying to crawl a fixed domain, say IBNLIVE.COM.
> >
> > I have changed my conf/crawl-urlfilter.txt and added the line
> >
> > "+^http://([a-z0-9]*\.)*ibnlive.com/ "
> >
> >
> > But I don't know what is going on; I get results like
> >
> > "fetching http://www.google-analytics.com/urchin.js
> > fetching http://www.josh18.com/showstory.php?id=236481
> > fetching http://www.cricketnext.com/news/gambhir-raina-make-merry-as-bowlers-struggle/32395-13.html
> > "
> >
> >
> > I have given it in the format specified on the wiki/Nutch site,
> > but it doesn't seem to work.
> >
> > Someone please help me out.
> >
> > Thanking you
> > kranthi reddy.b
> >
>
> --
> http://www.grok.in
> "Ignorance killed the cat, curiosity was framed."
>