[jira] Issue Comment Edited: (NUTCH-660) Does anybody know how to let nutch crawl this kind of website?

Bryan (JIRA) Tue, 11 Nov 2008 17:40:56 -0800

    [ https://issues.apache.org/jira/browse/NUTCH-660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646477#action_12646477 ]


windflying edited comment on NUTCH-660 at 11/11/08 5:38 PM:
-------------------------------------------------------

Sorry, the information above was not quite clear.
I just tried to crawl http://svn.apache.org/repos/asf/lucene/nutch/, and it 
did work.

But I still cannot crawl my own svn repository site.
I also found two other websites:
http://svn.macosforge.org/repository/macports/
http://svn.collab.net/repos/svn/

When I use my Nutch to crawl them, I get the same result as well:
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.


I am new here. What I was told is that, in the case of my company svn, the XML 
files just contain file and folder names; most of the useful content in the svn 
is only referenced by the XML. What the XML stylesheet does is turn the XML 
into HTML so that browsers can follow the links.
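
To illustrate the point about the stylesheet: a crawler that does not apply it 
sees only XML elements rather than HTML anchors, so it may find no links to 
follow. A minimal sketch in Python of reading the link targets straight from 
such an index is shown here. The sample XML, the element and attribute names 
(dir, file, href), and the svn.example.com URL are assumptions for 
illustration; they are not taken from the company repository.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical sample of an XML directory index; a real server may differ.
SAMPLE_INDEX = """<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="/svnindex.xsl"?>
<svn version="1.4.6">
  <index rev="42" path="/trunk">
    <updir/>
    <dir name="docs" href="docs/"/>
    <file name="README.txt" href="README.txt"/>
  </index>
</svn>
"""


def extract_links(xml_text, base_url):
    """Return absolute URLs for every <dir> and <file> entry in the index."""
    root = ET.fromstring(xml_text)
    links = []
    for entry in root.iter():
        # Only <dir> and <file> entries carry link targets in this sample.
        if entry.tag in ("dir", "file") and entry.get("href"):
            links.append(urljoin(base_url, entry.get("href")))
    return links


if __name__ == "__main__":
    # Placeholder base URL, not a real repository.
    base = "http://svn.example.com/repos/project/trunk/"
    for link in extract_links(SAMPLE_INDEX, base):
        print(link)
```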

I guess there must be some difference between the Nutch SVN and my company 
SVN that I do not know about yet.



I have configured nutch-site.xml and crawl-urlfilter.txt.
Since I can crawl http://svn.apache.org/repos/asf/lucene/nutch/ , I assume my 
configuration is OK. Do you think so?
I just want to make sure that no more work is needed on my Nutch configuration.
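
One way to double-check the URL filter part of the configuration is to run the 
seed URLs through the rules by hand. The sketch below is in Python and assumes 
the usual "+regex / -regex" rule format, where the first rule whose pattern is 
found in the URL decides and a URL that matches no rule is rejected. The rules 
and the example.com domain are hypothetical placeholders, not my actual 
crawl-urlfilter.txt, and Python regular expressions can differ from Java ones 
in corner cases.

```python
import re

# Hypothetical rules in the "+regex / -regex" style; NOT the real file.
RULES = r"""
# skip file:, ftp:, and mailto: URLs
-^(file|ftp|mailto):
# skip URLs containing characters that usually indicate queries
-[?*!@=]
# accept hosts under example.com (placeholder domain)
+^http://([a-z0-9]*\.)*example.com/
# reject everything else
-.
"""


def parse_rules(text):
    """Turn rule lines into (allow, compiled_regex) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First rule whose regex is found in the URL decides; no match rejects."""
    for allow, regex in rules:
        if regex.search(url):
            return allow
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in ("http://svn.example.com/repos/project/",
                "http://svn.collab.net/repos/svn/"):
        print(url, "accepted" if accepts(rules, url) else "rejected")
```

If a seed URL is rejected by a check like this, the generator would have 
nothing to select, which would be consistent with the "0 records selected for 
fetching" message; if it is accepted, the cause is more likely elsewhere.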

Thanks & best regards,



      was (Author: windflying):
    Sorry, the information above was not quite clear.
I just tried to crawl http://svn.apache.org/repos/asf/lucene/nutch/, and it 
did work.

But I still cannot crawl my own svn repository site.

Generator: 0 records selected for fetching, exiting...
Stopping at depth=0 - no more URLs to fetch.

Authentication is not a problem. I already used the https-client plugin. Some 
resources stored in this svn repository are also referenced by another intranet 
website, and they all can be searched and indexed from that website.

I am new here. What I was told is that, in the case of my company svn, the XML 
files just contain file and folder names; most of the useful content in the svn 
is only referenced by the XML. What the XML stylesheet does is turn the XML 
into HTML so that browsers can follow the links.

I guess there must be some difference between the Nutch SVN and my company 
SVN that I do not know about yet.

I also found two other websites:
http://svn.macosforge.org/repository/macports/
http://svn.collab.net/repos/svn/

When I use my Nutch to crawl them, I get the same result as well:
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.

I have configured nutch-site.xml and crawl-urlfilter.txt.
Since I can crawl http://svn.apache.org/repos/asf/lucene/nutch/ , I assume my 
configuration is OK. Do you think so?
I just want to make sure that no more work is needed on my Nutch configuration.

Thanks & best regards,


  
> Does anybody know how to let nutch crawl this kind of website?
> --------------------------------------------------------------
>
>                 Key: NUTCH-660
>                 URL: https://issues.apache.org/jira/browse/NUTCH-660
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.9.0
>         Environment: CentOs 5.2
> Tomcat 6.0.18
> Java 1.6.0_10
> Nutch 0.9
>            Reporter: Bryan
>            Priority: Critical
>
> My company intranet website is an svn repository, similar to 
> http://svn.apache.org/repos/asf/lucene/nutch/ .
> Does anybody have an idea how to let Nutch crawl and search it?
> Thanks.
> Bryan

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
