No, you do not. Forget about it then; Nutch should crawl such sites without
any problems. So you have a problem with something else.
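
If you want to rule robots.txt out for certain, the rules can be checked
offline. Below is a minimal sketch in Python using the standard library's
urllib.robotparser; it is not how Nutch itself evaluates the file. The helper
is_allowed, the host svn.example.com and the agent name "Nutch" are
placeholders, and modelling a missing robots.txt as an empty rule list is an
assumption that follows the usual convention that an absent file allows
everything.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, agent="Nutch"):
    """Return True if `agent` may fetch `url` under the given robots.txt lines.

    `agent` is a placeholder; use whatever agent name your crawler sends.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse in-memory lines, no network access
    return parser.can_fetch(agent, url)


# Hypothetical URL standing in for the intranet svn host.
url = "http://svn.example.com/repos/"

# The rules from the Nagios file quoted further down the thread.
blocking_rules = ["User-agent: *", "Disallow: /"]
print(is_allowed(blocking_rules, url))  # prints False

# Assumption: a missing robots.txt (404) is modelled as "no rules at all".
print(is_allowed([], url))  # prints True
```

The first call shows why the "Disallow: /" file would block every crawler if it
were served from the svn host; the second shows that no rules means no
restriction.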

Alexander

2008/11/11 Windflying <[EMAIL PROTECTED]>

> No, it is "404 Not Found" for http://svn.smartlabs.com/robots.txt.
> Do I need to add one? Sorry for my silly questions.
>
> Thanks.
>
> -----Original Message-----
> From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, 11 November 2008 10:41 PM
> To: [email protected]
> Subject: Re: Does anybody know how to let nutch crawl this kind of website?
>
> The robots.txt file is available at this address:
>
> http://your_host/robots.txt
>
> for example : http://svn.apache.org/robots.txt
>
> Check it, and if the file is like you wrote then it's not surprising that
> Nutch doesn't crawl your svn.
>
> Alexander
>
>
> 2008/11/11 Windflying <[EMAIL PROTECTED]>
>
> > I guess we don't have robots.txt in svn. I only found this file in the
> > folder /usr/share/Nagios/, as follows:
> >   "User-agent: *
> >    Disallow: /"
> >
> > What's this file for?
> >
> > -----Original Message-----
> > From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, 11 November 2008 4:50 PM
> > To: [email protected]
> > Subject: Re: Does anybody know how to let nutch crawl this kind of website?
> >
> > I don't know how to configure your svn and add XSLT. But if your svn can be
> > viewed from a browser then it should always be crawlable by Nutch. One note:
> > does your svn have a robots.txt file? Nutch is polite to public resources
> > and respects their rules. Check whether the file exists and allows robots.
> >
> > Are you using intranet crawling or internet crawling? There are
> > differences in configuration.
> >
> > Alexander
> >
> > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> >
> > > Hi Alex,
> > > Thanks for your reply. :)
> > >
> > > Yes, you are right. I just tried to search
> > > http://svn.apache.org/repos/asf/lucene/nutch/, and it did work.
> > >
> > > But I still can not search my own svn repository site.
> > > Generator: 0 records selected for fetching, exiting...
> > > Stopping at depth=0 - no more URLs to fetch.
> > > Authentication is not a problem. I already used the https-client plugin.
> > > Some resources stored in this svn repository are also referenced by another
> > > intranet website, and they all can be searched and indexed from that
> > > website.
> > >
> > > I am new here. What I was told is that in the case of my company svn the xml
> > > files are just file/folder names; most of the useful stuff in the svn is
> > > just referenced by the xml. What the XML stylesheet does is turn the XML
> > > into HTML so that browsers can follow the links.
> > >
> > > I guess there must be some difference between the Nutch SVN and my company
> > > SVN, which I do not know about yet.
> > >
> > > Thanks & best regards,
> > >
> > > -----Original Message-----
> > > From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> > > Sent: Tuesday, 11 November 2008 3:33 PM
> > > To: [email protected]
> > > Subject: Re: Does anybody know how to let nutch crawl this kind of website?
> > >
> > > This should work in the same way as for other sites. Folders are regular
> > > links. If you are talking about parsing content (files in the repository)
> > > then you should have the necessary parsers, for example the text parser, xml
> > > parser ...
> > >
> > > And you should give anonymous access to svn or configure nutch to sign in.
> > >
> > > Alexander
> > >
> > > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> > >
> > > > Hi all,
> > > >
> > > > My company intranet website is a svn repository, similar to :
> > > > http://svn.apache.org/repos/asf/lucene/nutch/ .
> > > >
> > > > Does anybody have an idea on how to let nutch do search on it?
> > > >
> > > >
> > > >
> > > > Thanks.
> > > >
> > > >
> > > >
> > > > Bryan
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > Best Regards
> > > Alexander Aristov
> > >
> > >
> >
> >
> > --
> > Best Regards
> > Alexander Aristov
> >
> >
>
>
> --
> Best Regards
> Alexander Aristov
>
>


-- 
Best Regards
Alexander Aristov
