Hello, I'm having issues with a source that I'm trying to fetch.
Config: Nutch 0.9, Intranet mode, https

I have modified my nutch-site.xml file to include the protocol-httpclient plugin, so https should not be a problem. Indeed, when I check my logs I can see that the seed URL is fetched:

*************************
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-cmrinteract/segments/20080422104050
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-cmrinteract/segments/20080422104050
Fetcher: threads: 10
fetching https://www.cmrinteract.com/clintrial/search.asp
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl-cmrinteract/crawldb
CrawlDb update: segments: [crawl-cmrinteract/segments/20080422104050]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
*************************

Unfortunately, the fetcher seems to miss all the links on the page. The next round selects nothing to fetch:

*************************
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-cmrinteract/segments/20080422104059
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
*************************

The links in the source page should not cause any problem. An example:

<a href="viewform.asp?UnId=12544"><img src="images/viewbut.jpg" border="0" height="20" width="80"></a>

And, finally, my configuration in the crawl-urlfilter.txt file is, I think, right:

+^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/search.asp
+^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/viewform.asp

I'm really stuck... If you have any idea, please let me know.

David
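
One way to sanity-check the two accept rules outside Nutch is a small Java sketch like the one below. It assumes the regex URL filter accepts a URL when the pattern is found anywhere in it (unanchored find() semantics); the class name UrlFilterCheck and the helper methods are hypothetical, not part of Nutch. It resolves the example link against the seed URL and tests the result against both patterns.

```java
import java.net.URI;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: checks URLs against the two
// accept rules from crawl-urlfilter.txt (leading '+' removed).
class UrlFilterCheck {
    static final Pattern[] ACCEPT = {
        Pattern.compile("^https://([a-z0-9]*\\.)*cmrinteract.com/clintrial/search.asp"),
        Pattern.compile("^https://([a-z0-9]*\\.)*cmrinteract.com/clintrial/viewform.asp")
    };

    // Assumption: a rule applies when the pattern is found in the URL.
    static boolean accepts(String url) {
        for (Pattern p : ACCEPT) {
            if (p.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }

    // Resolve a relative href against the page it was found on.
    static String resolve(String base, String href) {
        return URI.create(base).resolve(href).toString();
    }

    public static void main(String[] args) {
        String seed = "https://www.cmrinteract.com/clintrial/search.asp";
        String link = resolve(seed, "viewform.asp?UnId=12544");
        String image = resolve(seed, "images/viewbut.jpg");
        for (String url : new String[] {seed, link, image}) {
            System.out.println(accepts(url) + "  " + url);
        }
    }
}
```

This only exercises these two rules in isolation. It says nothing about any other rules in the same file, which would also need to be taken into account if they are evaluated first.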
