[ https://issues.apache.org/jira/browse/NUTCH-660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646477#action_12646477 ]
Bryan commented on NUTCH-660:
-----------------------------

The crawl log is as follows.

My internal company websites include several HTTP websites. Another one is an SVN repository served over HTTPS, with an XML structure that uses <dir> and <file> tags. The search in the HTTP websites is good. HTTPS is OK. We have some links in those HTTP websites which point to Word files under the SVN website, and they can be indexed. But Nutch does not search my SVN website. If I only search the SVN website, the result is always: 0 urls fetched.

My nutch-site.xml is as follows:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|msexcel|msword|mspowerpoint|pdf|zip|swf|rss)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

# skip file:, ftp:, & mailto: urls
-^(ftp|mailto):

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*smartlabs.com.au/

crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 6
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20081109182909
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl

Any help would be much appreciated. Thanks in advance.

> Does anybody know how to let nutch crawl this kind of website?
> --------------------------------------------------------------
>
> Key: NUTCH-660
> URL: https://issues.apache.org/jira/browse/NUTCH-660
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 0.9.0
> Environment: CentOs 5.2
> Tomcat 6.0.18
> Java 1.6.0_10
> Nutch 0.9
> Reporter: Bryan
> Priority: Critical
>
> My company intranet website is a svn repository, similar to :
> http://svn.apache.org/repos/asf/lucene/nutch/ .
> Does anybody have an idea on how to let nutch do search on it?
> Thanks.
> Bryan

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
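
One thing that may be worth checking in the comment above is how the two quoted URL filter rules treat an https:// seed. The sketch below is a small Python model of first-match-wins regex filtering, not Nutch's own code (Nutch evaluates these rules with Java regular expressions, which behave the same way for these particular patterns). The rules are copied from the comment; the two seed URLs and the widened rule are hypothetical and only illustrate the idea. It is offered as one possible explanation for "0 records selected for fetching", not a confirmed diagnosis.

```python
import re

# Rules copied from the comment: '-' rejects, '+' accepts.
RULES = [
    ("-", r"^(ftp|mailto):"),
    ("+", r"^http://([a-z0-9]*\.)*smartlabs.com.au/"),
]


def accepts(url, rules=RULES):
    """Model of first-match-wins filtering.

    The first rule whose pattern is found in the URL decides the outcome.
    A URL that matches no rule at all is treated as rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


# Hypothetical seed URLs (the real hostnames are not given in the comment).
http_seed = "http://intranet.smartlabs.com.au/"
https_seed = "https://svn.smartlabs.com.au/repos/"

# Hypothetical widened accept rule: 'https?' allows both schemes,
# and the dots in the domain are escaped.
WIDENED = [
    RULES[0],
    ("+", r"^https?://([a-z0-9]*\.)*smartlabs\.com\.au/"),
]

print("http seed, original rules: ", accepts(http_seed))
print("https seed, original rules:", accepts(https_seed))
print("https seed, widened rules: ", accepts(https_seed, WIDENED))
```

Under this model the original accept rule only matches URLs beginning with http://, so an https:// seed matches no rule and is dropped before the generator runs. If that is what happens here, an accept rule covering https would be the place to start.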