Hello,

I dug a little deeper and found that although the source page is
fetched during the crawl, it is never parsed! I can tell from the logs
of a parse plugin that I wrote, which show no trace of the content of
the source page.

Note that even though the source uses the https protocol, no
credentials are needed.

If you have an idea, please let me know,


David






-----Original Message-----
From: POIRIER David [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, 22 April 2008 10:58
To: [email protected]
Subject: Generator: 0 records selected for fetching, exiting ...

Hello, 

I'm having issues with a source that I'm trying to fetch.


Config: Nutch 0.9, Intranet mode, https

I have modified my nutch-site.xml file to include the
protocol-httpclient plugin, so https should not be a problem. Indeed,
when I check my logs I can see that the seed url is fetched:

*************************
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-cmrinteract/segments/20080422104050
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-cmrinteract/segments/20080422104050
Fetcher: threads: 10
fetching https://www.cmrinteract.com/clintrial/search.asp
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl-cmrinteract/crawldb
CrawlDb update: segments: [crawl-cmrinteract/segments/20080422104050]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
*************************

Unfortunately, the fetcher seems to be missing all the links on the
page:

*************************
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-cmrinteract/segments/20080422104059
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
*************************

The links in the source page should not cause any problems. An example:

<a href="viewform.asp?UnId=12544"><img src="images/viewbut.jpg"
border="0" height="20" width="80"></a>


And, finally, my configuration in the crawl-urlfilter.txt file is, I
think, right:

+^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/search.asp
+^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/viewform.asp
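To double-check these two rules outside of Nutch, a minimal Java sketch
along the following lines could be used. It is only an approximation: the
class name UrlFilterCheck and the accepts() helper are made up for this
test, it assumes that a rule matches when the pattern is found anywhere in
the URL, and it ignores every other rule in crawl-urlfilter.txt. The
viewform URL is the example link above resolved against the seed page.

```java
import java.util.regex.Pattern;

public class UrlFilterCheck {
    // The two '+' rules from crawl-urlfilter.txt, without the leading '+'.
    static final Pattern[] ACCEPT = {
        Pattern.compile("^https://([a-z0-9]*\\.)*cmrinteract.com/clintrial/search.asp"),
        Pattern.compile("^https://([a-z0-9]*\\.)*cmrinteract.com/clintrial/viewform.asp")
    };

    // Assumption: a URL passes if any accept pattern is found in it.
    // Other rules in the real filter file are not modelled here.
    static boolean accepts(String url) {
        for (Pattern p : ACCEPT) {
            if (p.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://www.cmrinteract.com/clintrial/search.asp",
            "https://www.cmrinteract.com/clintrial/viewform.asp?UnId=12544",
            "https://www.cmrinteract.com/clintrial/images/viewbut.jpg"
        };
        // Print whether each URL would pass the two rules.
        for (String u : urls) {
            System.out.println(accepts(u) + "  " + u);
        }
    }
}
```

If both the seed url and the viewform url pass here, the two patterns
themselves are probably not what is dropping the links, although other
rules in the same file could still reject them.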


I'm really stuck... if you have any idea, please let me know.

David
