Hello,

I went a little deeper and found that while the source page is being fetched during the crawl process, it is not parsed! I can tell from the logs of a parse plugin that I wrote, which show no trace of the content of the source page.
Note that even though the source uses the https protocol, no credentials are needed.

If you have an idea, please let me know,

David

-----Original Message-----
From: POIRIER David [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 22 April 2008 10:58
To: [email protected]
Subject: Generator: 0 records selected for fetching, exiting ...

Hello,

I'm having issues with a source that I'm trying to fetch.

Config: Nutch 0.9, Intranet mode, https

I have modified my nutch-site.xml file to include the protocol-httpclient plugin; https should not be a problem, and indeed, when I check my logs I can see that the seed url is fetched:

*************************
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-cmrinteract/segments/20080422104050
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-cmrinteract/segments/20080422104050
Fetcher: threads: 10
fetching https://www.cmrinteract.com/clintrial/search.asp
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl-cmrinteract/crawldb
CrawlDb update: segments: [crawl-cmrinteract/segments/20080422104050]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
*************************

Unfortunately, the fetcher seems to be missing all the links on the page:

*************************
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-cmrinteract/segments/20080422104059
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
*************************

The links in the source page should not cause any problem. An example:

<a href="viewform.asp?UnId=12544"><img src="images/viewbut.jpg" border="0" height="20" width="80"></a>

And, finally, my configuration in the crawl-urlfilter.txt file is, I think, right:

+^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/search.asp
+^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/viewform.asp

I'm really stuck... if you have any idea, please let me know.

David
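
One way to sanity-check the two filter rules above outside of Nutch is a small Python sketch. It resolves the example link against the seed URL and applies the rules with first-match-wins semantics. This is only an approximation under stated assumptions: the `accepts` helper is hypothetical and not part of Nutch, Python's `re` stands in for the Java regex engine, a URL matching no rule is assumed to be rejected, and the `-[?*!@=]` rule in the last call is a hypothetical earlier entry, since the message does not show the rest of the file.

```python
import re
from urllib.parse import urljoin

# The two rules quoted in the message, in order: "+" accepts, "-" rejects.
RULES = [
    r"+^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/search.asp",
    r"+^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/viewform.asp",
]


def accepts(url, rules):
    """Hypothetical helper: the first rule whose pattern is found in url decides."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL matching no rule is rejected


# Resolve the relative href from the example link against the seed URL.
base = "https://www.cmrinteract.com/clintrial/search.asp"
link = urljoin(base, "viewform.asp?UnId=12544")

print(link)
print(accepts(link, RULES))

# Hypothetical: an exclusion rule placed before the two "+" rules wins instead.
print(accepts(link, ["-[?*!@=]"] + RULES))
```

With only the two quoted rules, the resolved link is accepted; with an earlier exclusion rule on query characters, the same link is rejected before the "+" rules are reached. So the order and the full content of the filter file matter, not just these two lines.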
