2008/4/23 POIRIER David <[EMAIL PROTECTED]>:
> Dennis,
>
> It didn't work either, even though I correctly created and configured
> the required prefix-urlfilter.txt file (in nutch-default.xml):
>
> **********************
> <property>
>   <name>urlfilter.prefix.file</name>
>   <value>prefix-urlfilter.txt</value>
>   <description>Name of file on CLASSPATH containing url prefixes
>   used by urlfilter-prefix (PrefixURLFilter) plugin.</description>
> </property>
> **********************
>
> prefix-urlfilter.txt file content:
>
> **********************
> # The default url filter.
> # Better for whole-internet crawling.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> # skip file: ftp: and mailto: urls
> -(file|ftp|mailto):.*
>
> # skip image and other suffixes we can't yet parse
> -.*\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)
>
> # skip URLs containing certain characters as probable queries, etc.
> # [EMAIL PROTECTED]
>
> # accept anything else
> +.*
> **********************

prefix-urlfilter.txt would look like this:

http://
ftp://
file://

It is a prefix filter, not a regex filter :)
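For reference, the urlfilter-prefix plugin reads plain URL prefixes, one
per line, and accepts a URL only if it starts with one of them. A minimal
sketch of a prefix-urlfilter.txt for the intranet crawl discussed further
down this thread (the cmrinteract.com URL is taken from the fetch log
below; treating '#' lines as comments is an assumption about the plugin's
file format):

**********************
# Hypothetical example: plain prefixes only,
# no '+'/'-' markers and no regular expressions.
# A URL is kept only if it starts with a listed prefix.
https://www.cmrinteract.com/clintrial/
**********************

Note that a prefix file which matches none of the seed URLs would itself
produce "Generator: 0 records selected for fetching", since both injection
and generation run every URL through the active filter plugins.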
> You can observe that I commented the [EMAIL PROTECTED] line to make sure that
> the ? symbol is accepted.
>
> But when I try to fetch, nothing gets through:
>
> **********************
> Injector: starting
> Injector: crawlDb: crawl-novartis/crawldb
> Injector: urlDir: urls-novartis
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl-novartis/segments/20080423085835
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=0 - no more URLs to fetch.
> No URLs to fetch - check your seed list and URL filters.
> **********************
>
> One question: I might be mistaken, but I have the impression that url
> filtering plugins are actually "ANDed" to the crawl command url
> filtering process, configured through the crawl-urlfilter.txt file. Is
> that right? To be honest, I don't fully understand how they interact...
>
> Again, thank you,
>
> David
>
>
> -----Original Message-----
> From: Dennis Kubes [mailto:[EMAIL PROTECTED]
> Sent: mardi, 22. avril 2008 19:23
> To: [email protected]
> Subject: Re: Generator: 0 records selected for fetching, exiting ...
>
> Instead of using the regex filter on this try using the
> prefix-urlfilter. Also make sure that your agent name is set in the
> configuration.
>
> Dennis
>
> POIRIER David wrote:
> > Dennis,
> >
> > Thanks for your reply.
> >
> > I did change the plugin.includes variable to eliminate the
> > protocol-http plugin and add the protocol-httpclient plugin instead.
> >
> > The problem is afterward, since the page is actually fetched... or
> > looks like it. I think that something is wrong between the fetch and
> > parse processes.
> >
> > If you think of something, Dennis, please let me know.
> >
> > David
> >
> >
> > -----Original Message-----
> > From: Dennis Kubes [mailto:[EMAIL PROTECTED]
> > Sent: mardi, 22. avril 2008 16:04
> > To: [email protected]
> > Subject: Re: Generator: 0 records selected for fetching, exiting ...
> >
> > In the plugin.includes conf variable the protocol-http plugin is
> > loaded by default. I believe that the protocol-httpclient plugin
> > needs to be loaded instead to parse https.
> >
> > Dennis
> >
> > POIRIER David wrote:
> >> Hello,
> >>
> >> I went a little deeper and found that while the source page is being
> >> fetched during the crawl process, it is not parsed! I could see that
> >> by checking the logs of a parse plugin that I made, where I see no
> >> trace of the content of the source file.
> >>
> >> Take note that even though the source uses the https protocol, no
> >> credentials are needed.
> >>
> >> If you have an idea, please let me know,
> >>
> >> David
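The plugin.includes change David describes above is normally an override
in conf/nutch-site.xml. A sketch of what it might look like, assuming the
stock Nutch 0.9 plugin list with protocol-http swapped for
protocol-httpclient and urlfilter-regex swapped for urlfilter-prefix (the
exact list in any given install may differ):

**********************
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-prefix|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Overrides the default plugin list: protocol-httpclient
  replaces protocol-http so https pages can be fetched, and
  urlfilter-prefix replaces urlfilter-regex as suggested earlier in
  this thread.</description>
</property>
**********************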
> >>
> >> -----Original Message-----
> >> From: POIRIER David [mailto:[EMAIL PROTECTED]
> >> Sent: mardi, 22. avril 2008 10:58
> >> To: [email protected]
> >> Subject: Generator: 0 records selected for fetching, exiting ...
> >>
> >> Hello,
> >>
> >> I'm having issues with a source that I'm trying to fetch.
> >>
> >> Config: Nutch 0.9, Intranet mode, https
> >>
> >> I have modified my nutch-site.xml file to include the
> >> protocol-httpclient plugin; https should not be a problem, and
> >> indeed, when I check my logs I can see that the seed url is fetched:
> >>
> >> *************************
> >> Generator: Selecting best-scoring urls due for fetch.
> >> Generator: starting
> >> Generator: segment: crawl-cmrinteract/segments/20080422104050
> >> Generator: filtering: false
> >> Generator: topN: 2147483647
> >> Generator: jobtracker is 'local', generating exactly one partition.
> >> Generator: Partitioning selected urls by host, for politeness.
> >> Generator: done.
> >> Fetcher: starting
> >> Fetcher: segment: crawl-cmrinteract/segments/20080422104050
> >> Fetcher: threads: 10
> >> fetching https://www.cmrinteract.com/clintrial/search.asp
> >> Fetcher: done
> >> CrawlDb update: starting
> >> CrawlDb update: db: crawl-cmrinteract/crawldb
> >> CrawlDb update: segments: [crawl-cmrinteract/segments/20080422104050]
> >> CrawlDb update: additions allowed: true
> >> CrawlDb update: URL normalizing: true
> >> CrawlDb update: URL filtering: true
> >> CrawlDb update: Merging segment data into db.
> >> CrawlDb update: done
> >> *************************
> >>
> >> Unfortunately, the fetcher seems to be missing all the links on the
> >> page:
> >>
> >> *************************
> >> Generator: Selecting best-scoring urls due for fetch.
> >> Generator: starting
> >> Generator: segment: crawl-cmrinteract/segments/20080422104059
> >> Generator: filtering: false
> >> Generator: topN: 2147483647
> >> Generator: jobtracker is 'local', generating exactly one partition.
> >> Generator: 0 records selected for fetching, exiting ...
> >> Stopping at depth=1 - no more URLs to fetch.
> >> *************************
> >>
> >> The links in the source page should not cause any problem; an
> >> example:
> >>
> >> <a href="viewform.asp?UnId=12544"><img src="images/viewbut.jpg"
> >> border="0" height="20" width="80"></a>
> >>
> >> And, finally, my configuration in the crawl-urlfilter.txt file is, I
> >> think, right:
> >>
> >> +^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/search.asp
> >> +^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/viewform.asp
> >>
> >> I'm really stuck... if you have any idea, please let me know.
> >>
> >> David
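On David's question about whether the filter plugins are "ANDed": in
Nutch, every activated URLFilter plugin is applied in sequence, and a URL
survives only if each filter accepts it, so the filters do effectively
combine with AND. A simplified sketch of that chaining logic, modeled
loosely on org.apache.nutch.net.URLFilters (not the actual source):

**********************
// Simplified sketch of Nutch's URL filter chaining; modeled loosely
// on org.apache.nutch.net.URLFilters, not the actual source.
public class UrlFilterChainSketch {

  /** A filter returns the (possibly rewritten) URL, or null to reject. */
  interface UrlFilter {
    String filter(String url);
  }

  /**
   * Runs every active filter in order. If ANY filter returns null, the
   * URL is dropped and later filters never see it: the regex filter
   * (crawl-urlfilter.txt) and the prefix filter are effectively ANDed.
   */
  static String filter(String url, UrlFilter[] filters) {
    for (UrlFilter f : filters) {
      url = f.filter(url);
      if (url == null) {
        return null; // rejected by this filter
      }
    }
    return url; // accepted by all filters
  }

  public static void main(String[] args) {
    UrlFilter regex =
        u -> u.matches("^https://([a-z0-9]*\\.)*cmrinteract\\.com/clintrial/.*") ? u : null;
    UrlFilter prefix =
        u -> u.startsWith("https://www.cmrinteract.com/clintrial/") ? u : null;
    UrlFilter[] chain = { regex, prefix };

    // Passes both filters, so it is kept:
    System.out.println(filter("https://www.cmrinteract.com/clintrial/search.asp", chain));
    // Rejected by the prefix filter even though the regex accepts it:
    System.out.println(filter("https://sub.cmrinteract.com/clintrial/viewform.asp", chain));
  }
}
**********************

So a URL has to clear crawl-urlfilter.txt (regex) and prefix-urlfilter.txt
and any other active filter; loosening just one of them is not enough.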

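Dennis's side note about the agent name refers to the http.agent.name
property; Nutch 0.9's fetcher raises an error and refuses to run when it
is left empty. A sketch of the nutch-site.xml override (the value shown
is a placeholder):

**********************
<property>
  <name>http.agent.name</name>
  <value>MyNutchCrawler</value>
  <description>Placeholder crawler name; the fetcher reports an error
  when this property is left empty.</description>
</property>
**********************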