Dennis,
Thanks for your reply.
I did change the plugin.includes variable to eliminate the protocol-http
plugin and add the protocol-httpclient plugin instead.
The problem seems to occur afterward, since the page is actually fetched... or
at least appears to be. I think something is wrong between the fetch and parse
processes.
If you think of something, Dennis, please let me know.
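In case it helps narrow it down, my plan is to dump the segment and see whether
any parse output was actually written, along these lines (assuming readseg in
0.9 accepts these arguments; "segdump" is just an output directory I picked):
bin/nutch readseg -dump crawl-cmrinteract/segments/20080422104050 segdump
If the dump shows Content for the page but no ParseData/ParseText, the problem
really is between fetch and parse.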
David
-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 22 April 2008 16:04
To: [email protected]
Subject: Re: Generator: 0 records selected for fetching, exiting ...
In the plugin.includes conf variable, protocol-http is loaded by default.
I believe the protocol-httpclient plugin needs to be loaded instead to
handle https.
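Something along these lines in nutch-site.xml should do it; the rest of the
value here is just the usual default plugin list, so keep whatever else you
already have:
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>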
Dennis
POIRIER David wrote:
Hello,
I went a little deeper and found that while the source page is being fetched
during the crawl process, it is not parsed! I could see that by checking the
logs of a parse plugin that I made, where I see no trace of the content of
the source file.
Take note that even though the source uses the https protocol, no
credentials are needed.
If you have an idea, please let me know,
David
-----Original Message-----
From: POIRIER David [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 22 April 2008 10:58
To: [email protected]
Subject: Generator: 0 records selected for fetching, exiting ...
Hello,
I'm having issues with a source that I'm trying to fetch.
Config: Nutch 0.9, Intranet mode, https
I have modified my nutch-site.xml file to include the
protocol-httpclient plugin; https should not be a problem, and indeed,
when I check my logs I can see that the seed URL is fetched:
*************************
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-cmrinteract/segments/20080422104050
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-cmrinteract/segments/20080422104050
Fetcher: threads: 10
fetching https://www.cmrinteract.com/clintrial/search.asp
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl-cmrinteract/crawldb
CrawlDb update: segments: [crawl-cmrinteract/segments/20080422104050]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
*************************
Unfortunately, the crawl seems to miss all the links on the page; the next
generator round selects nothing:
*************************
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-cmrinteract/segments/20080422104059
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
*************************
The links in the source page should not cause any problems; here is an example:
<a href="viewform.asp?UnId=12544"><img src="images/viewbut.jpg"
border="0" height="20" width="80"></a>
And, finally, my configuration in the crawl-urlfilter.txt file is, I
think, right:
+^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/search.asp
+^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/viewform.asp
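For what it's worth, here is a quick standalone check of the viewform rule
against a resolved link. This is outside Nutch, so it only verifies the regex
itself, and I'm assuming the urlfilter-regex plugin applies its rules with
find() semantics:
import java.util.regex.Pattern;
public class UrlFilterCheck {
    public static void main(String[] args) {
        // the crawl-urlfilter.txt rule, minus the leading '+'
        String rule = "^https://([a-z0-9]*\\.)*cmrinteract.com/clintrial/viewform.asp";
        // a link from search.asp resolved to an absolute URL
        String url = "https://www.cmrinteract.com/clintrial/viewform.asp?UnId=12544";
        // prints true if the rule accepts the URL
        System.out.println(Pattern.compile(rule).matcher(url).find());
    }
}
This prints true, so if the filter rules are fine, it still looks like the
parse step is never producing any outlinks.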
I'm really stuck... if you have any idea, please let me know.
David