Dennis,

It didn't work either, even though I correctly created the required prefix-urlfilter.txt file and configured it in nutch-default.xml:
**********************
<property>
  <name>urlfilter.prefix.file</name>
  <value>prefix-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing url prefixes
  used by urlfilter-prefix (PrefixURLFilter) plugin.</description>
</property>
**********************

prefix-urlfilter.txt file content:

**********************
# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-(file|ftp|mailto):.*

# skip image and other suffixes we can't yet parse
-.*\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)

# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]

# accept anything else
+.*
**********************

You can see that I commented out the -[?*!@=] line to make sure that the ? symbol is accepted. But when I try to fetch, nothing gets through:

**********************
Injector: starting
Injector: crawlDb: crawl-novartis/crawldb
Injector: urlDir: urls-novartis
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-novartis/segments/20080423085835
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
**********************

One question: I might be mistaken, but I have the impression that the url filter plugins are actually "ANDed" with the url filtering that the crawl command applies through the crawl-urlfilter.txt file. Is that right? To be honest, I don't fully understand how they interact...
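Thinking out loud about my own question: my working assumption -- sketched below in plain Java for illustration, not copied from the Nutch source -- is that every active url filter plugin is applied to each URL in turn, and that a filter returning null rejects the URL for good. If that is right, the plugins are effectively ANDed, and crawl-urlfilter.txt does not replace the prefix filter; a URL has to pass both:

**********************
// Hypothetical sketch of how the url filter plugins seem to combine.
// The URLFilter interface is redefined here only to keep the example
// self-contained; the real one lives in org.apache.nutch.net.
interface URLFilter {
  // Returns the (possibly rewritten) url, or null to reject it.
  String filter(String url);
}

class FilterChainSketch {
  // Apply every active filter in sequence; the first null stops the
  // chain, so a url must survive all filters -- an AND of the plugins.
  static String filter(String url, URLFilter[] filters) {
    for (URLFilter f : filters) {
      url = f.filter(url);
      if (url == null) {
        return null;
      }
    }
    return url;
  }
}
**********************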
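If that AND reading is correct, my prefix-urlfilter.txt content itself could be the problem: as far as I can tell, urlfilter-prefix treats each non-comment line as a literal URL prefix -- no '+'/'-' markers and no regular expressions -- so a regex-style file like the one above would match no real URL and everything would be rejected, which is exactly what I see. A minimal prefix-style file (example.com is only a placeholder for my real hosts) would instead look like:

**********************
# One literal prefix per line: a url is accepted if it starts with
# any of these prefixes, otherwise it is rejected.
http://www.example.com/
https://www.example.com/
**********************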
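As for the agent name that Dennis mentioned: my understanding is that it is set through the http.agent.name property in nutch-site.xml, something along these lines (the value is just a placeholder):

**********************
<property>
  <name>http.agent.name</name>
  <value>my-test-crawler</value>
  <description>HTTP 'User-Agent' request header.</description>
</property>
**********************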
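One last detail I noticed while re-reading the crawl-urlfilter.txt lines quoted at the bottom of this thread: the dots in cmrinteract.com and .asp are not escaped. An unescaped '.' matches any character, including a literal dot, so this should not be the cause of the problem, but the stricter form would be:

**********************
+^https://([a-z0-9]*\.)*cmrinteract\.com/clintrial/search\.asp
+^https://([a-z0-9]*\.)*cmrinteract\.com/clintrial/viewform\.asp
**********************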
Again, thank you,

David


-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 22 April 2008 19:23
To: [email protected]
Subject: Re: Generator: 0 records selected for fetching, exiting ...

Instead of using the regex filter on this try using the
prefix-urlfilter. Also make sure that your agent name is set in the
configuration.

Dennis

POIRIER David wrote:
> Dennis,
>
> Thanks for your reply.
>
> I did change the plugin.includes variable to eliminate the protocol-http
> plugin and add the protocol-httpclient plugin instead.
>
> The problem is afterward, since the page is actually fetched... or so it
> looks. I think that something is wrong between the fetch and parse
> processes.
>
> If you think of something, Dennis, please let me know.
>
> David
>
>
> -----Original Message-----
> From: Dennis Kubes [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, 22 April 2008 16:04
> To: [email protected]
> Subject: Re: Generator: 0 records selected for fetching, exiting ...
>
> In the plugin.includes conf variable the protocol-http plugin is loaded
> by default.
>
> I believe that the protocol-httpclient plugin needs to be loaded
> instead to fetch https.
>
> Dennis
>
> POIRIER David wrote:
>> Hello,
>>
>> I went a little deeper and found that while the source page is being
>> fetched during the crawl process, it is not parsed! I could see that by
>> checking the logs of a parse plugin that I made, where I see no trace of
>> the content of the source file.
>>
>> Take note that even though the source uses the https protocol, no
>> credentials are needed.
>>
>> If you have an idea, please let me know,
>>
>>
>> David
>>
>>
>> -----Original Message-----
>> From: POIRIER David [mailto:[EMAIL PROTECTED]
>> Sent: Tuesday, 22 April 2008 10:58
>> To: [email protected]
>> Subject: Generator: 0 records selected for fetching, exiting ...
>>
>> Hello,
>>
>> I'm having issues with a source that I'm trying to fetch.
>>
>> Config: Nutch 0.9, Intranet mode, https
>>
>> I have modified my nutch-site.xml file to include the
>> protocol-httpclient plugin; https should not be a problem, and indeed,
>> when I check my logs I can see that the seed url is fetched:
>>
>> *************************
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: starting
>> Generator: segment: crawl-cmrinteract/segments/20080422104050
>> Generator: filtering: false
>> Generator: topN: 2147483647
>> Generator: jobtracker is 'local', generating exactly one partition.
>> Generator: Partitioning selected urls by host, for politeness.
>> Generator: done.
>> Fetcher: starting
>> Fetcher: segment: crawl-cmrinteract/segments/20080422104050
>> Fetcher: threads: 10
>> fetching https://www.cmrinteract.com/clintrial/search.asp
>> Fetcher: done
>> CrawlDb update: starting
>> CrawlDb update: db: crawl-cmrinteract/crawldb
>> CrawlDb update: segments: [crawl-cmrinteract/segments/20080422104050]
>> CrawlDb update: additions allowed: true
>> CrawlDb update: URL normalizing: true
>> CrawlDb update: URL filtering: true
>> CrawlDb update: Merging segment data into db.
>> CrawlDb update: done
>> *************************
>>
>> Unfortunately, the fetcher seems to be missing all the links on the
>> page:
>>
>> *************************
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: starting
>> Generator: segment: crawl-cmrinteract/segments/20080422104059
>> Generator: filtering: false
>> Generator: topN: 2147483647
>> Generator: jobtracker is 'local', generating exactly one partition.
>> Generator: 0 records selected for fetching, exiting ...
>> Stopping at depth=1 - no more URLs to fetch.
>> *************************
>>
>> The links in the source page should not cause any problem; an example:
>>
>> <a href="viewform.asp?UnId=12544"><img src="images/viewbut.jpg"
>> border="0" height="20" width="80"></a>
>>
>> And, finally, my configuration in the crawl-urlfilter.txt file is, I
>> think, right:
>>
>> +^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/search.asp
>> +^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/viewform.asp
>>
>> I'm really stuck... if you have any idea, please let me know.
>>
>> David
