Re: Does anybody know how to let nutch crawl this kind of website?

Alexander Aristov Tue, 11 Nov 2008 04:45:02 -0800

The crawl command is usually used for intranet crawling, which is your case.
It is the simplest way to run Nutch.


The difference is described here:
http://lucene.apache.org/nutch/tutorial8.html
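
One configuration difference worth checking, stated here as an assumption to verify against the tutorial rather than as fact: in Nutch releases of this period the one-step crawl command is understood to read its URL filter rules from conf/crawl-urlfilter.txt, while the step-by-step commands use conf/regex-urlfilter.txt. The Python sketch below only imitates the "+regex" / "-regex" rule style, where the first matching rule decides. The rules and the example.com host are hypothetical, and this is not Nutch code.

```python
import re

# Hypothetical rules in the "+regex" / "-regex" style. They are made up for
# illustration and not copied from any real Nutch configuration file.
INTRANET_RULES = [
    r"-\.(gif|jpg|png|css|zip)$",            # skip some binary/asset suffixes
    r"+^http://([a-z0-9]*\.)*example\.com/",  # accept hosts under example.com
    r"-.",                                    # reject everything else
]


def accepts(rules, url):
    """Apply the first rule whose pattern is found in url; reject if none match."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    for url in ("http://svn.example.com/repos/",
                "https://svn.example.com/repos/",
                "http://svn.apache.org/repos/asf/"):
        print(url, accepts(INTRANET_RULES, url))
```

If every rule rejects the seed URL (for instance an https:// URL tested against a rule written for http://), nothing is left to fetch. That is one possible, unconfirmed explanation for a crawl that stops without fetching anything.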

Alex


2008/11/11 Windflying <[EMAIL PROTECTED]>

> Hi Alex,
>
> Thanks for your notes. They do shed some light, especially the robots.txt and
> XSLT configuration, although I have no idea what those are so far.
>
> What's the difference in configuration for intranet and internet? Because I'm
> using the same set of configuration, and searching both our company
> websites and http://svn.apache.org works.
> I'm using the command /bin/nutch crawl. For internet crawling, do I need
> to use other commands?
>
> Thanks.
>
> -----Original Message-----
> From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, 11 November 2008 4:50 PM
> To: [email protected]
> Subject: Re: Does anybody know how to let nutch crawl this kind of website?
>
> I don't know how to configure your svn and add XSLT. But if your svn can be
> viewed from a browser then Nutch should be able to crawl it. One note:
> does your svn have a robots.txt file? Nutch is polite to public resources
> and respects their rules. Check whether the file exists and allows robots.
>
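A quick way to test a robots.txt file against a crawler name is Python's standard `urllib.robotparser`. This is a minimal sketch, not part of Nutch: the robots.txt contents, the agent name "MyNutchBot", and the svn.example.com host are all placeholders. In practice the agent name would be whatever is set for `http.agent.name`, and the file would be read from the root of the svn host.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; a real check would read the file
# from the root of the svn host instead.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /repos/private/
"""


def is_allowed(robots_txt, agent, url):
    """Return True if robots_txt lets the given agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # "MyNutchBot" and svn.example.com are placeholder values.
    for url in ("http://svn.example.com/repos/public/trunk/",
                "http://svn.example.com/repos/private/trunk/"):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, "MyNutchBot", url))
```

A `Disallow: /` line under `User-agent: *` would make every URL on the host come back as not allowed, which a polite crawler treats as "do not fetch".
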
> Are you using intranet crawling or internet? There are differences in
> configuration.
>
> Alexander
>
> 2008/11/11 Windflying <[EMAIL PROTECTED]>
>
> > Hi Alex,
> > Thanks for your reply. :)
> >
> > Yes, you are right. I just tried to search
> > http://svn.apache.org/repos/asf/lucene/nutch/, and it did work.
> >
> > But I still cannot search my own svn repository site.
> > Generator: 0 records selected for fetching, exiting...
> > Stopping at depth=0 - no more URLs to fetch.
> > Authentication is not a problem. I already used the https-client plugin.
> > Some resources stored in this svn repository are also referenced by another
> > intranet website, and they all can be searched and indexed from that
> > website.
> >
> > I am new here. What I was told is that in the case of my company svn the xml
> > files are just file/folder names; most of the useful stuff in the svn is
> > just referenced by the xml. What the XML Stylesheet does is turn the XML
> > into HTML so the browsers can follow the links.
> >
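To illustrate the point about XML plus a stylesheet: when a listing is served as raw XML, the links sit in attributes of custom elements rather than in HTML `<a href>` tags, so a link extractor that only understands HTML finds nothing to follow. The Python sketch below pulls the links out of such a listing directly. The element names (`svn`, `index`, `dir`, `file`), the stylesheet path, and the base URL are assumptions modeled on a typical Subversion XML index; the company's actual listing may look different.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Assumed shape of a directory listing served as XML with a stylesheet.
SAMPLE_LISTING = """<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="/svnindex.xsl"?>
<svn>
  <index path="/trunk">
    <dir name="docs" href="docs/"/>
    <file name="README.txt" href="README.txt"/>
  </index>
</svn>
"""


def extract_links(xml_text, base_url):
    """Collect absolute URLs from the href attributes of dir/file elements."""
    root = ET.fromstring(xml_text)
    links = []
    for element in root.iter():
        if element.tag in ("dir", "file") and "href" in element.attrib:
            links.append(urljoin(base_url, element.attrib["href"]))
    return links


if __name__ == "__main__":
    # The base URL is a placeholder for the listing's own address.
    for link in extract_links(SAMPLE_LISTING,
                              "http://svn.example.com/repos/project/trunk/"):
        print(link)
```

This is only a demonstration of where the links live in such a document. Getting a crawler to follow them would still need an XML-aware parser on the crawler side or an HTML view on the server side.
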
> > I guess there must be some difference between the Nutch SVN and my company
> > SVN, which I do not know yet.
> >
> > Thanks & best regards,
> >
> > -----Original Message-----
> > From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, 11 November 2008 3:33 PM
> > To: [email protected]
> > Subject: Re: Does anybody know how to let nutch crawl this kind of website?
> >
> > This should work in the same way as for other sites. Folders are regular
> > links. If you are talking about parsing content (files in the repository)
> > then you should have the necessary parsers, for example the text parser, xml
> > parser ...
> >
> > And you should give anonymous access to svn or configure nutch to sign in.
> >
> > Alexander
> >
> > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> >
> > > Hi all,
> > >
> > > My company intranet website is a svn repository, similar to :
> > > http://svn.apache.org/repos/asf/lucene/nutch/ .
> > >
> > > Does anybody have an idea on how to let nutch do search on it?
> > >
> > > Thanks.
> > >
> > > Bryan
> > >
> >
> >
> > --
> > Best Regards
> > Alexander Aristov
> >
> >
>
>
> --
> Best Regards
> Alexander Aristov
>
>


-- 
Best Regards
Alexander Aristov
