Hi, I have just tried to crawl the sites from my server - no problems, it works as expected.
I used the crawl command with the parameters from the Nutch how-to page:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50

Do you clean previously crawled data from the disk? The generator might not
produce links to re-fetch already fetched resources: there is a policy that
it will not recrawl recently crawled data until some time passes (a
configurable parameter; a configuration sketch for this follows at the end
of the thread). So the generator produces no more links to fetch.

Alexander

2008/11/12 Windflying <[EMAIL PROTECTED]>

> Hi Alex,
>
> Good day. Sorry to interrupt you again.
>
> I found two websites,
> http://svn.macosforge.org/repository/macports/
> http://svn.collab.net/repos/svn/
>
> When I use my Nutch to crawl them, I get:
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=0 - no more URLs to fetch.
>
> I have configured nutch-site.xml and crawl-urlfilter.txt.
> Since I can crawl http://svn.apache.org/repos/asf/lucene/nutch/, I assume
> my configuration is OK. Do you think so?
> I just want to make sure there is nothing more wrong with my Nutch
> configuration.
>
> Thanks.
>
> -----Original Message-----
> From: Alexander Aristov [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, 11 November 2008 11:07 PM
> To: [email protected]
> Subject: Re: Does anybody know how to let nutch crawl this kind of website?
>
> No, you do not. Forget about it then; Nutch should crawl such sites
> without any problems. So your problem is with something else.
>
> Alexander
>
> 2008/11/11 Windflying <[EMAIL PROTECTED]>
>
> > No, it is "404 Not Found" for http://svn.smartlabs.com/robots.txt.
> > Do I need to add one? Sorry for my silly questions.
> >
> > Thanks.
> >
> > -----Original Message-----
> > From: Alexander Aristov [mailto:[EMAIL PROTECTED]]
> > Sent: Tuesday, 11 November 2008 10:41 PM
> > To: [email protected]
> > Subject: Re: Does anybody know how to let nutch crawl this kind of
> > website?
> >
> > The robots.txt file is available at this address:
> >
> > http://your_host/robots.txt
> >
> > for example: http://svn.apache.org/robots.txt
> >
> > Check it; if the file is like the one you quoted, then it is not
> > surprising that Nutch doesn't crawl your svn.
> >
> > Alexander
> >
> > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> >
> > > I guess we don't have a robots.txt in svn. I only found this file in
> > > the folder /usr/share/Nagios/, containing:
> > > "User-agent: *
> > > Disallow: /"
> > >
> > > What is this file for?
> > >
> > > -----Original Message-----
> > > From: Alexander Aristov [mailto:[EMAIL PROTECTED]]
> > > Sent: Tuesday, 11 November 2008 4:50 PM
> > > To: [email protected]
> > > Subject: Re: Does anybody know how to let nutch crawl this kind of
> > > website?
> > >
> > > I don't know how to configure your svn and add XSLT. But if your svn
> > > can be viewed from a browser, then it should always be crawlable by
> > > Nutch. One note: does your svn have a robots.txt file? Nutch is polite
> > > to public resources and respects their rules. Check whether the file
> > > exists and allows robots.
> > >
> > > Are you doing intranet crawling or internet crawling? There are
> > > differences in the configuration.
> > >
> > > Alexander
> > >
> > > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> > >
> > > > Hi Alex,
> > > > Thanks for your reply. :)
> > > >
> > > > Yes, you are right. I just tried to search
> > > > http://svn.apache.org/repos/asf/lucene/nutch/, and it did work.
> > > >
> > > > But I still cannot search my own svn repository site:
> > > > Generator: 0 records selected for fetching, exiting...
> > > > Stopping at depth=0 - no more URLs to fetch.
> > > > Authentication is not a problem; I already use the https-client
> > > > plugin. Some resources stored in this svn repository are also
> > > > referenced by another intranet website, and they can all be searched
> > > > and indexed from that website.
> > > >
> > > > I am new here. What I was told is that in the case of my company's
> > > > svn, the xml files are just file/folder names; most of the useful
> > > > content in the svn is only referenced by the xml. What the XML
> > > > stylesheet does is turn the XML into HTML so that browsers can
> > > > follow the links.
> > > >
> > > > I guess there must be some difference between the Nutch svn and my
> > > > company's svn that I do not know about yet.
> > > >
> > > > Thanks & best regards.
> > > >
> > > > -----Original Message-----
> > > > From: Alexander Aristov [mailto:[EMAIL PROTECTED]]
> > > > Sent: Tuesday, 11 November 2008 3:33 PM
> > > > To: [email protected]
> > > > Subject: Re: Does anybody know how to let nutch crawl this kind of
> > > > website?
> > > >
> > > > This should work the same way as for other sites; folders are
> > > > regular links. If you are talking about parsing content (the files
> > > > in the repository), then you need the necessary parsers, for example
> > > > the text parser, the xml parser, and so on (a plugin.includes sketch
> > > > follows at the end of the thread).
> > > >
> > > > And you should give anonymous access to the svn, or configure Nutch
> > > > to sign in.
> > > >
> > > > Alexander
> > > >
> > > > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> > > >
> > > > > Hi all,
> > > > >
> > > > > My company intranet website is an svn repository, similar to
> > > > > http://svn.apache.org/repos/asf/lucene/nutch/ .
> > > > >
> > > > > Does anybody have an idea how to let Nutch search it?
> > > > >
> > > > > Thanks.
> > > > >
> > > > > Bryan
> > > >
> > > > --
> > > > Best Regards
> > > > Alexander Aristov
> > >
> > > --
> > > Best Regards
> > > Alexander Aristov
> >
> > --
> > Best Regards
> > Alexander Aristov
>
> --
> Best Regards
> Alexander Aristov

--
Best Regards
Alexander Aristov
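
For reference, the recrawl policy described at the top of this thread is
controlled by a fetch-interval property in nutch-site.xml. A minimal sketch,
assuming a Nutch 0.9-era configuration (the property name and units changed
in later releases; newer versions use db.fetch.interval.default, in seconds;
verify against the nutch-default.xml shipped with your release):

  <!-- nutch-site.xml: re-fetch pages after 1 day instead of the 30-day default -->
  <property>
    <name>db.default.fetch.interval</name>
    <value>1</value>
    <description>Default number of days between re-fetches of a page.</description>
  </property>

Alternatively, removing the crawl directory (rm -rf crawl) before re-running
bin/nutch crawl discards the old crawl db entirely, so every injected URL
becomes eligible for fetching again.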

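As a closing note, the two plugin points raised in the thread (the https
client and the content parsers) are both enabled through the plugin.includes
property in nutch-site.xml. A hedged sketch, assuming Nutch 0.9-era plugin
names (protocol-httpclient is the plugin that adds https and authentication
support; check the plugins directory of your release before copying this):

  <!-- nutch-site.xml: use protocol-httpclient for https/authenticated sites -->
  <!-- and enable the text/html parsers for the files in the repository -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

For intranet crawling, crawl-urlfilter.txt also needs an accept pattern for
the repository host, e.g. a line such as +^http://svn.example.com/ (the
hostname here is a placeholder); without it the generator filters out every
discovered link and reports "Generator: 0 records selected for fetching".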