RE: Does anybody know how to let nutch crawl this kind of website?

Windflying Tue, 11 Nov 2008 04:23:09 -0800

I guess we don't have a robots.txt in svn. I only found this file in the
folder /usr/share/Nagios/, as follows:
   "User-agent: * 
    Disallow: /"


What's this file for?

-----Original Message-----
From: Alexander Aristov [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, 11 November 2008 4:50 PM
To: [email protected]
Subject: Re: Does anybody know how to let nutch crawl this kind of website?

I don't know how to configure your svn and add XSLT. But if your svn can be
viewed from a browser then Nutch should be able to crawl it. One note:
does your svn have a robots.txt file? Nutch is polite to public resources
and respects their rules. Check whether the file exists and allows robots.
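
A minimal sketch of that check, using Python's standard urllib.robotparser
rather than Nutch's own robots handling: it parses robots.txt rules and
reports whether a crawler may fetch a URL. The host svn.example.com, the
agent name "Nutch" and the helper is_allowed are assumptions made for
illustration; the rules are the two lines quoted in this thread.

```python
from urllib.robotparser import RobotFileParser

# The two rules quoted in this thread: every agent is barred from every path.
BLOCK_ALL = ["User-agent: *", "Disallow: /"]


def is_allowed(robots_lines, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt lines.

    Hypothetical helper for illustration; not part of Nutch.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse in-memory lines, no network access
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # svn.example.com is a placeholder host, not the real repository.
    url = "http://svn.example.com/repos/"
    print(is_allowed(BLOCK_ALL, "Nutch", url))
    # An empty Disallow line places no restriction on the crawler.
    print(is_allowed(["User-agent: *", "Disallow:"], "Nutch", url))
```

Under the "Disallow: /" rules the first call reports that fetching is not
allowed, so a polite crawler would select nothing from that host. Crawlers
normally read robots.txt only from the root of the web server, so a copy
sitting in an unrelated directory on disk would usually have no effect.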

Are you using intranet crawling or internet crawling? There are differences
in configuration.

Alexander

2008/11/11 Windflying <[EMAIL PROTECTED]>

> Hi Alex,
> Thanks for your reply. :)
>
> Yes, you are right. I just tried to search
> http://svn.apache.org/repos/asf/lucene/nutch/, and it did work.
>
> But I still can not search my own svn repository site.
> Generator: 0 records selected for fetching, exiting...
> Stopping at depth=0 - no more URLs to fetch.
> Authentication is not a problem. I already used the https-client plugin.
> Some resources stored in this svn repository are also referenced by
> another intranet website, and they all can be searched and indexed from
> that website.
>
> I am new here. What I was told is that in the case of my company svn the
> xml files are just file/folder names; most of the useful stuff in the svn
> is just referenced by the xml. What the XML stylesheet does is turn the
> XML into HTML so that browsers can follow the links.
>
> I guess there must be some difference between the Nutch SVN and my
> company SVN, which I do not know yet.
>
> Thanks & best regards,
>
> -----Original Message-----
> From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, 11 November 2008 3:33 PM
> To: [email protected]
> Subject: Re: Does anybody know how to let nutch crawl this kind of
> website?
>
> This should work in the same way as for other sites. Folders are regular
> links. If you are talking about parsing content (files in the repository)
> then you should have necessary parsers, for example the text parser, xml
> parser ...
>
> And you should give anonymous access to svn or configure nutch to sign
> in.
>
> Alexander
>
> 2008/11/11 Windflying <[EMAIL PROTECTED]>
>
> > Hi all,
> >
> > My company intranet website is a svn repository, similar to :
> > http://svn.apache.org/repos/asf/lucene/nutch/ .
> >
> > Does anybody have an idea on how to let nutch do search on it?
> >
> >
> >
> > Thanks.
> >
> >
> >
> > Bryan
> >
> >
> >
> >
> >
> >
>
>
> --
> Best Regards
> Alexander Aristov
>
>


-- 
Best Regards
Alexander Aristov
