No, you do not. Forget about it, then: Nutch should crawl such sites without any problems. So the problem must lie somewhere else.
Alexander

2008/11/11 Windflying <[EMAIL PROTECTED]>

> No, it is "404 Not Found" for http://svn.smartlabs.com/robots.txt.
> Do I need to add one? Sorry for my silly questions.
>
> Thanks.
>
> -----Original Message-----
> From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, 11 November 2008 10:41 PM
> To: [email protected]
> Subject: Re: Does anybody know how to let nutch crawl this kind of website?
>
> The robots.txt file is available at this address:
>
> http://your_host/robots.txt
>
> for example: http://svn.apache.org/robots.txt
>
> Check it, and if the file is like the one you quoted, then it is not
> surprising that Nutch doesn't crawl your svn.
>
> Alexander
>
>
> 2008/11/11 Windflying <[EMAIL PROTECTED]>
>
> > I guess we don't have robots.txt in svn. I only found this file in the
> > folder /usr/share/Nagios/, as follows:
> > "User-agent: *
> > Disallow: /"
> >
> > What's this file for?
> >
> > -----Original Message-----
> > From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, 11 November 2008 4:50 PM
> > To: [email protected]
> > Subject: Re: Does anybody know how to let nutch crawl this kind of website?
> >
> > I don't know how to configure your svn and add XSLT. But if your svn can
> > be viewed from a browser, then it should always be crawled by Nutch. One
> > note: does your svn have a robots.txt file? Nutch is polite to public
> > resources and respects their rules. Check whether the file exists and
> > whether it allows robots.
> >
> > Are you using intranet crawling or internet crawling? There are
> > differences in configuration.
> >
> > Alexander
> >
> > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> >
> > > Hi Alex,
> > > Thanks for your reply. :)
> > >
> > > Yes, you are right. I just tried to search
> > > http://svn.apache.org/repos/asf/lucene/nutch/, and it did work.
> > >
> > > But I still cannot search my own svn repository site.
> > > Generator: 0 records selected for fetching, exiting...
> > > Stopping at depth=0 - no more URLs to fetch.
> > > Authentication is not a problem. I already used the https-client
> > > plugin.
> > > Some resources stored in this svn repository are also referenced by
> > > another intranet website, and they all can be searched and indexed
> > > from that website.
> > >
> > > I am new here. What I was told is that in the case of my company svn
> > > the xml files are just file/folder names; most of the useful stuff in
> > > the svn is just referenced by the xml. What the XML Stylesheet does is
> > > turn the XML into HTML so the browsers can follow the links.
> > >
> > > I guess there must be some difference between the Nutch SVN and my
> > > company SVN, which I do not know yet.
> > >
> > > Thanks & best regards,
> > >
> > > -----Original Message-----
> > > From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> > > Sent: Tuesday, 11 November 2008 3:33 PM
> > > To: [email protected]
> > > Subject: Re: Does anybody know how to let nutch crawl this kind of website?
> > >
> > > This should work in the same way as for other sites. Folders are
> > > regular links. If you are talking about parsing content (files in the
> > > repository), then you should have the necessary parsers, for example
> > > the text parser, xml parser ...
> > >
> > > And you should give anonymous access to svn or configure nutch to
> > > sign in.
> > >
> > > Alexander
> > >
> > > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> > >
> > > > Hi all,
> > > >
> > > > My company intranet website is a svn repository, similar to:
> > > > http://svn.apache.org/repos/asf/lucene/nutch/ .
> > > >
> > > > Does anybody have an idea on how to let nutch do search on it?
> > > >
> > > > Thanks.
> > > >
> > > > Bryan
> > >
> > > --
> > > Best Regards
> > > Alexander Aristov
> >
> > --
> > Best Regards
> > Alexander Aristov
>
> --
> Best Regards
> Alexander Aristov

--
Best Regards
Alexander Aristov
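
A side note on the robots.txt question discussed in this thread: the two cases that came up (a file containing "User-agent: *" / "Disallow: /", and no file at all) can be compared with a small sketch. It uses Python's standard-library robots.txt parser, not Nutch's own robots handling, so it only illustrates the convention. The host name svn.example.com, the agent name "Nutch", and the helper is_allowed are assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, agent="Nutch"):
    """Return True if the given robots.txt lines let `agent` fetch `url`.

    `agent` is a hypothetical crawler name chosen for this example.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules from memory, no network access
    return parser.can_fetch(agent, url)


# The rules quoted in the thread (found under /usr/share/Nagios/).
blocking_rules = ["User-agent: *", "Disallow: /"]

# No rules at all, standing in for a site with no robots.txt
# (the "404 Not Found" case from the thread).
no_rules = []

# Hypothetical URL, not the poster's real repository.
url = "http://svn.example.com/repos/project/"

blocked = is_allowed(blocking_rules, url)
open_site = is_allowed(no_rules, url)

print("with 'Disallow: /':", blocked)
print("with no rules:", open_site)
```

With the "Disallow: /" rules every path is refused to every agent, while an empty rule set permits everything. That matches the advice above: a missing robots.txt does not block a polite crawler, so the cause of "0 records selected for fetching" has to be looked for elsewhere.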
