The crawl command is usually used for intranet crawling, which is your case. It is the simplest way to run Nutch.
The difference is explained here: http://lucene.apache.org/nutch/tutorial8.html

Alex

2008/11/11 Windflying <[EMAIL PROTECTED]>

> Hi Alex,
>
> Thanks for your notes. They do shed some light (the robots.txt and XSLT
> configuration), although I have no idea what those are so far.
>
> What's the difference in configuration for intranet and internet? I'm
> using the same set of configuration, and searching both our company
> websites and http://svn.apache.org works.
> I'm using the command /bin/nutch crawl. For internet crawling, do I need
> to use other commands?
>
> Thanks.
>
> -----Original Message-----
> From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, 11 November 2008 4:50 PM
> To: [email protected]
> Subject: Re: Does anybody know how to let nutch crawl this kind of website?
>
> I don't know how to configure your svn and add XSLT. But if your svn can be
> viewed from a browser then it should always be crawled by Nutch. One note:
> does your svn have a robots.txt file? Nutch is polite to public resources
> and respects their rules. Check whether the file exists and allows robots.
>
> Are you using intranet crawling or internet? There are differences in
> configuration.
>
> Alexander
>
> 2008/11/11 Windflying <[EMAIL PROTECTED]>
>
> > Hi Alex,
> > Thanks for your reply. :)
> >
> > Yes, you are right. I just tried to search
> > http://svn.apache.org/repos/asf/lucene/nutch/, and it did work.
> >
> > But I still cannot search my own svn repository site.
> > Generator: 0 records selected for fetching, exiting...
> > Stopping at depth=0 - no more URLs to fetch.
> > Authentication is not a problem. I already used the https-client plugin.
> > Some resources stored in this svn repository are also referenced by
> > another intranet website, and they all can be searched and indexed from
> > that website.
> >
> > I am new here. What I was told is that in the case of my company svn the
> > xml files are just file/folder names; most of the useful stuff in the svn
> > is just referenced by the xml. What the XML stylesheet does is turn the
> > XML into HTML so the browsers can follow the links.
> >
> > I guess there must be some difference between the Nutch SVN and my
> > company SVN, which I do not know yet.
> >
> > Thanks & best regards.
> >
> > -----Original Message-----
> > From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, 11 November 2008 3:33 PM
> > To: [email protected]
> > Subject: Re: Does anybody know how to let nutch crawl this kind of website?
> >
> > This should work in the same way as for other sites. Folders are regular
> > links. If you are talking about parsing content (files in the repository)
> > then you should have the necessary parsers, for example the text parser,
> > xml parser ...
> >
> > And you should give anonymous access to svn or configure nutch to sign in.
> >
> > Alexander
> >
> > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> >
> > > Hi all,
> > >
> > > My company intranet website is a svn repository, similar to:
> > > http://svn.apache.org/repos/asf/lucene/nutch/ .
> > >
> > > Does anybody have an idea on how to let nutch do search on it?
> > >
> > > Thanks.
> > >
> > > Bryan

--
Best Regards
Alexander Aristov
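
For the robots.txt check mentioned in the thread, here is a minimal sketch of how one might test whether a given URL is allowed for a crawler. It uses Python's standard urllib.robotparser rather than Nutch itself, so it only approximates what Nutch would decide. The agent name "MyNutchAgent", the host svn.example.com and the sample rules are all made up for illustration; substitute the agent name from your own Nutch configuration and your real robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, agent="MyNutchAgent"):
    """Return True if `agent` may fetch `url` under the given robots.txt text.

    `agent` is a hypothetical name; use the one your crawler really sends.
    """
    parser = RobotFileParser()
    # Parse the rules from a string so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical rules: everything is open except /repos/private/.
SAMPLE = """User-agent: *
Disallow: /repos/private/
"""

if __name__ == "__main__":
    for path in ("/repos/public/trunk/", "/repos/private/trunk/"):
        url = "http://svn.example.com" + path
        print(url, "allowed" if is_allowed(SAMPLE, url) else "blocked")
```

To check a live server instead of a pasted string, RobotFileParser also offers set_url() followed by read(), which downloads the file for you. If the start URL turns out to be blocked, that alone could explain a crawl that stops with no URLs to fetch, though it is only one possible cause.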
