Dennis,

Thanks for your reply.
I did change the plugin.includes variable to remove the protocol-http
plugin and add the protocol-httpclient plugin instead. The problem seems
to occur afterward, since the page is actually fetched, or at least
appears to be. I think that something is going wrong between the fetch
and parse processes.

If you think of something, Dennis, please let me know.

David

-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, 22 April 2008 16:04
To: [email protected]
Subject: Re: Generator: 0 records selected for fetching, exiting ...

In the plugin.includes conf variable, the protocol-http plugin is loaded
by default. I believe that the protocol-httpclient plugin needs to be
loaded instead to handle https.

Dennis

POIRIER David wrote:
> Hello,
>
> I went a little deeper and found that while the source page is being
> fetched during the crawl process, it is not parsed! I could see that by
> checking the logs of a parse plugin that I made, where I see no trace of
> the content of the source file.
>
> Take note that even though the source uses the https protocol, no
> credentials are needed.
>
> If you have an idea, please let me know,
>
> David
>
> -----Original Message-----
> From: POIRIER David [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, 22 April 2008 10:58
> To: [email protected]
> Subject: Generator: 0 records selected for fetching, exiting ...
>
> Hello,
>
> I'm having issues with a source that I'm trying to fetch.
>
> Config: Nutch 0.9, Intranet mode, https
>
> I have modified my nutch-site.xml file to include the
> protocol-httpclient plugin; https should not be a problem, and indeed,
> when I check my logs I can see that the seed url is fetched:
>
> *************************
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl-cmrinteract/segments/20080422104050
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl-cmrinteract/segments/20080422104050
> Fetcher: threads: 10
> fetching https://www.cmrinteract.com/clintrial/search.asp
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl-cmrinteract/crawldb
> CrawlDb update: segments: [crawl-cmrinteract/segments/20080422104050]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> *************************
>
> Unfortunately, the fetcher seems to be missing all the links on the
> page:
>
> *************************
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl-cmrinteract/segments/20080422104059
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=1 - no more URLs to fetch.
> *************************
>
> The links in the source page should not cause any problem; an example:
>
> <a href="viewform.asp?UnId=12544"><img src="images/viewbut.jpg"
> border="0" height="20" width="80"></a>
>
> And, finally, my configuration in the crawl-urlfilter.txt file is, I
> think, right:
>
> +^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/search.asp
> +^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/viewform.asp
>
> I'm really stuck... if you have any idea, please let me know.
>
> David
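
For reference, the plugin.includes swap discussed above would look
roughly like this in conf/nutch-site.xml. This is a minimal sketch: the
plugin list shown is typical of Nutch 0.9 defaults but may differ
between releases, so copy the plugin.includes value from your own
nutch-default.xml and replace only the protocol-http entry.

  <!-- Sketch of a nutch-site.xml override; the value mirrors the usual
       Nutch 0.9 defaults with protocol-http swapped for
       protocol-httpclient. Check nutch-default.xml for the
       authoritative list in your release. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Load protocol-httpclient instead of protocol-http so
    that https URLs can be fetched.</description>
  </property>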
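
A side note on the crawl-urlfilter.txt rules quoted above: the dots in
"cmrinteract.com", "search.asp", and "viewform.asp" are unescaped, so
they match any character. A tighter version of the same two rules would
be:

  +^https://([a-z0-9]*\.)*cmrinteract\.com/clintrial/search\.asp
  +^https://([a-z0-9]*\.)*cmrinteract\.com/clintrial/viewform\.asp

Also worth checking, since the outlinks carry query strings
(viewform.asp?UnId=12544): the stock crawl-urlfilter.txt ships with a
"-[?*!@=]" rule that skips URLs containing such characters, and it
would exclude those links if it appears before the rules above.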
