Hello, 

I'm having issues with a source that I'm trying to fetch.


Config: Nutch 0.9, Intranet mode, https

I have modified my nutch-site.xml file to include the
protocol-httpclient plugin, so https should not be a problem. Indeed,
when I check my logs I can see that the seed url is fetched:

*************************
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-cmrinteract/segments/20080422104050
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-cmrinteract/segments/20080422104050
Fetcher: threads: 10
fetching https://www.cmrinteract.com/clintrial/search.asp
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl-cmrinteract/crawldb
CrawlDb update: segments: [crawl-cmrinteract/segments/20080422104050]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
*************************

Unfortunately, the fetcher seems to be missing all the links on the
page:

*************************
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-cmrinteract/segments/20080422104059
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
*************************

The links in the source page should not cause any problems. An example:

<a href="viewform.asp?UnId=12544"><img src="images/viewbut.jpg"
border="0" height="20" width="80"></a>


And, finally, my configuration in the crawl-urlfilter.txt file is, I
think, right:

+^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/search.asp
+^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/viewform.asp
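
To sanity-check these two rules outside of Nutch, something like the following rough sketch could be used. It resolves the relative link from the page against the seed url and tests the result against the two patterns above. The class and method names (UrlFilterCheck, resolve, accepted) are made up for illustration. The sketch assumes a rule accepts a url when the pattern is found anywhere in it (Matcher.find()), which may not be exactly what Nutch does, and it does not model any other rules that might be in the file.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: checks a link against the two
// accept rules quoted above.
public class UrlFilterCheck {

    // The two '+' rules from crawl-urlfilter.txt, without the leading '+'.
    static final List<Pattern> ACCEPT = Arrays.asList(
        Pattern.compile("^https://([a-z0-9]*\\.)*cmrinteract.com/clintrial/search.asp"),
        Pattern.compile("^https://([a-z0-9]*\\.)*cmrinteract.com/clintrial/viewform.asp"));

    // Resolve a relative href against the url of the page it appears on.
    static String resolve(String base, String href) {
        return URI.create(base).resolve(href).toString();
    }

    // Assumption: a rule applies if the pattern is found anywhere in the url.
    static boolean accepted(String url) {
        for (Pattern p : ACCEPT) {
            if (p.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String seed = "https://www.cmrinteract.com/clintrial/search.asp";
        String link = resolve(seed, "viewform.asp?UnId=12544");
        System.out.println(link);
        System.out.println("accepted by the two rules: " + accepted(link));
    }
}
```

If the link is accepted here but still never gets generated, the two patterns themselves are probably not what is dropping it.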


I'm really stuck... if you have any idea, please let me know.

David
