Dennis,

Thanks for your reply.

I did change the plugin.includes variable to eliminate the protocol-http
plugin and add the protocol-httpclient plugin instead.
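
For reference, the plugin.includes property in my nutch-site.xml now looks
roughly like this (the rest of the list is just what the default config ships
with; the only change is swapping protocol-http for protocol-httpclient):

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>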

The problem comes afterward, since the page is actually fetched... or at
least appears to be. I think something goes wrong between the fetch and
parse steps.
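
In case it helps narrow it down, the segment can be dumped to compare the raw
Content against the ParseData/ParseText, along the lines of (assuming the 0.9
readseg command takes a segment directory and an output directory, here an
arbitrary "segdump"):

bin/nutch readseg -dump crawl-cmrinteract/segments/20080422104050 segdump

If the dump shows the Content but an empty ParseText, that would confirm the
problem sits between fetching and parsing.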

If you think of something, Dennis, please let me know.

David

-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, 22 April 2008 16:04
To: [email protected]
Subject: Re: Generator: 0 records selected for fetching, exiting ...

In the plugin.includes conf variable, the protocol-http plugin is loaded by
default. I believe that the protocol-httpclient plugin needs to be loaded
instead to fetch https.

Dennis

POIRIER David wrote:
> Hello,
> 
> I went a little deeper and found that while the source page is being
> fetched during the crawl process, it is not parsed! I could see that by
> checking the logs of a parse plugin I wrote, which shows no trace of the
> content of the source file.
> 
> Take note that even though the source uses the https protocol, no
> credentials are needed. 
> 
> If you have an idea, please let me know,
> 
> 
> David
> 
> -----Original Message-----
> From: POIRIER David [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, 22 April 2008 10:58
> To: [email protected]
> Subject: Generator: 0 records selected for fetching, exiting ...
> 
> Hello, 
> 
> I'm having issues with a source that I'm trying to fetch.
> 
> 
> Config: Nutch 0.9, Intranet mode, https
> 
> I have modified my nutch-site.xml file to include the
> protocol-httpclient plugin; https should not be a problem, and indeed,
> when I check my logs I can see that the seed URL is fetched:
> 
> *************************
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl-cmrinteract/segments/20080422104050
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl-cmrinteract/segments/20080422104050
> Fetcher: threads: 10
> fetching https://www.cmrinteract.com/clintrial/search.asp
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl-cmrinteract/crawldb
> CrawlDb update: segments: [crawl-cmrinteract/segments/20080422104050]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> *************************
> 
> Unfortunately, the fetcher seems to be missing all the links on the
> page:
> 
> *************************
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl-cmrinteract/segments/20080422104059
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=1 - no more URLs to fetch.
> *************************
> 
> The links in the source page should not cause any problem; here is an example:
> 
> <a href="viewform.asp?UnId=12544"><img src="images/viewbut.jpg"
> border="0" height="20" width="80"></a>
> 
> 
> And, finally, my configuration in the crawl-urlfilter.txt file is, I
> think, right:
> 
> +^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/search.asp
> +^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/viewform.asp
> 
> 
> I'm really stuck... if you have any idea, please let me know.
> 
> David
