Hi, I have just tried to crawl the sites from my server - no problems, it works as expected.
I used the crawl command with the parameters from the Nutch how-to page:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50

Do you clean previously crawled data from the disk? The generator might not
produce links to re-fetch already fetched resources: there is a policy that
it will not recrawl recently crawled data until some time passes (a
configurable parameter; a configuration sketch for this follows at the end
of the thread). So the generator produces no more links to fetch.

Alexander

2008/11/12 Windflying <[EMAIL PROTECTED]>

> Hi Alex,
>
> Good day. Sorry to interrupt you again.
>
> I found two websites,
> http://svn.macosforge.org/repository/macports/
> http://svn.collab.net/repos/svn/
>
> When I use my Nutch to crawl them, I get:
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=0 - no more URLs to fetch.
>
> I have configured nutch-site.xml and crawl-urlfilter.txt.
> Since I can crawl http://svn.apache.org/repos/asf/lucene/nutch/, I assume
> my configuration is OK. Do you think so?
> I just want to make sure there is nothing more wrong with my Nutch
> configuration.
>
> Thanks.
>
> -----Original Message-----
> From: Alexander Aristov [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, 11 November 2008 11:07 PM
> To: [email protected]
> Subject: Re: Does anybody know how to let nutch crawl this kind of website?
>
> No, you do not. Forget about it then; Nutch should crawl such sites
> without any problems. So your problem is with something else.
>
> Alexander
>
> 2008/11/11 Windflying <[EMAIL PROTECTED]>
>
> > No, it is "404 Not Found" for http://svn.smartlabs.com/robots.txt.
> > Do I need to add one? Sorry for my silly questions.
> >
> > Thanks.
> >
> > -----Original Message-----
> > From: Alexander Aristov [mailto:[EMAIL PROTECTED]]
> > Sent: Tuesday, 11 November 2008 10:41 PM
> > To: [email protected]
> > Subject: Re: Does anybody know how to let nutch crawl this kind of
> > website?
> >
> > The robots.txt file is available at this address:
> >
> > http://your_host/robots.txt
> >
> > for example: http://svn.apache.org/robots.txt
> >
> > Check it; if the file is like the one you quoted, then it is not
> > surprising that Nutch doesn't crawl your svn.
> >
> > Alexander
> >
> > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> >
> > > I guess we don't have a robots.txt in svn. I only found this file in
> > > the folder /usr/share/Nagios/, containing:
> > > "User-agent: *
> > > Disallow: /"
> > >
> > > What is this file for?
> > >
> > > -----Original Message-----
> > > From: Alexander Aristov [mailto:[EMAIL PROTECTED]]
> > > Sent: Tuesday, 11 November 2008 4:50 PM
> > > To: [email protected]
> > > Subject: Re: Does anybody know how to let nutch crawl this kind of
> > > website?
> > >
> > > I don't know how to configure your svn and add XSLT. But if your svn
> > > can be viewed from a browser, then it should always be crawlable by
> > > Nutch. One note: does your svn have a robots.txt file? Nutch is polite
> > > to public resources and respects their rules. Check whether the file
> > > exists and allows robots.
> > >
> > > Are you doing intranet crawling or internet crawling? There are
> > > differences in the configuration.
> > >
> > > Alexander
> > >
> > > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> > >
> > > > Hi Alex,
> > > > Thanks for your reply. :)
> > > >
> > > > Yes, you are right. I just tried to search
> > > > http://svn.apache.org/repos/asf/lucene/nutch/, and it did work.
> > > >
> > > > But I still cannot search my own svn repository site:
> > > > Generator: 0 records selected for fetching, exiting...
> > > > Stopping at depth=0 - no more URLs to fetch.
> > > > Authentication is not a problem; I already use the https-client
> > > > plugin. Some resources stored in this svn repository are also
> > > > referenced by another intranet website, and they can all be searched
> > > > and indexed from that website.
> > > >
> > > > I am new here. What I was told is that in the case of my company's
> > > > svn, the xml files are just file/folder names; most of the useful
> > > > content in the svn is only referenced by the xml. What the XML
> > > > stylesheet does is turn the XML into HTML so that browsers can
> > > > follow the links.
> > > >
> > > > I guess there must be some difference between the Nutch svn and my
> > > > company's svn that I do not know about yet.
> > > >
> > > > Thanks & best regards.
> > > >
> > > > -----Original Message-----
> > > > From: Alexander Aristov [mailto:[EMAIL PROTECTED]]
> > > > Sent: Tuesday, 11 November 2008 3:33 PM
> > > > To: [email protected]
> > > > Subject: Re: Does anybody know how to let nutch crawl this kind of
> > > > website?
> > > >
> > > > This should work the same way as for other sites; folders are
> > > > regular links. If you are talking about parsing content (the files
> > > > in the repository), then you need the necessary parsers, for example
> > > > the text parser, the xml parser, and so on (a plugin.includes sketch
> > > > follows at the end of the thread).
> > > >
> > > > And you should give anonymous access to the svn, or configure Nutch
> > > > to sign in.
> > > >
> > > > Alexander
> > > >
> > > > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> > > >
> > > > > Hi all,
> > > > >
> > > > > My company intranet website is an svn repository, similar to
> > > > > http://svn.apache.org/repos/asf/lucene/nutch/ .
> > > > >
> > > > > Does anybody have an idea how to let Nutch search it?
> > > > >
> > > > > Thanks.
> > > > >
> > > > > Bryan
> > > >
> > > > --
> > > > Best Regards
> > > > Alexander Aristov
> > >
> > > --
> > > Best Regards
> > > Alexander Aristov
> >
> > --
> > Best Regards
> > Alexander Aristov
>
> --
> Best Regards
> Alexander Aristov

--
Best Regards
Alexander Aristov
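
For reference, the recrawl policy described at the top of this thread is
controlled by a fetch-interval property in nutch-site.xml. A minimal sketch,
assuming a Nutch 0.9-era configuration (the property name and units changed
in later releases; newer versions use db.fetch.interval.default, in seconds;
verify against the nutch-default.xml shipped with your release):

  <!-- nutch-site.xml: re-fetch pages after 1 day instead of the 30-day default -->
  <property>
    <name>db.default.fetch.interval</name>
    <value>1</value>
    <description>Default number of days between re-fetches of a page.</description>
  </property>

Alternatively, removing the crawl directory (rm -rf crawl) before re-running
bin/nutch crawl discards the old crawl db entirely, so every injected URL
becomes eligible for fetching again.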

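As a closing note, the two plugin points raised in the thread (the https
client and the content parsers) are both enabled through the plugin.includes
property in nutch-site.xml. A hedged sketch, assuming Nutch 0.9-era plugin
names (protocol-httpclient is the plugin that adds https and authentication
support; check the plugins directory of your release before copying this):

  <!-- nutch-site.xml: use protocol-httpclient for https/authenticated sites -->
  <!-- and enable the text/html parsers for the files in the repository -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

For intranet crawling, crawl-urlfilter.txt also needs an accept pattern for
the repository host, e.g. a line such as +^http://svn.example.com/ (the
hostname here is a placeholder); without it the generator filters out every
discovered link and reports "Generator: 0 records selected for fetching".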