2008/4/23 POIRIER David <[EMAIL PROTECTED]>:

> Dennis,
>
> It didn't work either, even though I correctly created and configured
> the required prefix-urlfilter.txt file (in nutch-default.xml):
>
> **********************
> <property>
>  <name>urlfilter.prefix.file</name>
>  <value>prefix-urlfilter.txt</value>
>  <description>Name of file on CLASSPATH containing url prefixes
>  used by urlfilter-prefix (PrefixURLFilter) plugin.</description>
> </property>
> **********************
>
>
>
> prefix-urlfilter.txt file content:
>
> **********************
> # The default url filter.
> # Better for whole-internet crawling.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
>
> # skip file: ftp: and mailto: urls
> -(file|ftp|mailto):.*
>
> # skip image and other suffixes we can't yet parse
> -.*\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)
>
> # skip URLs containing certain characters as probable queries, etc.
> # -[?*!@=]
>
> # accept anything else
> +.*
> **********************


prefix-urlfilter.txt would look like this:

http://
ftp://
file://


it is a prefix filter, not a regex filter :)
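
Each line is a literal string prefix matched against the start of the
url; there is no '+'/'-' syntax and no regex. One caveat: if your seed
urls use https (your cmrinteract crawl below does), the file also needs
an https:// line, otherwise every url is filtered out. A minimal sketch:

**********************
http://
https://
**********************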

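On your other question below: yes, as far as I understand it the active
URLFilter plugins are effectively ANDed. URLFilters.filter() passes the
url through every enabled filter in turn, and any filter returning null
rejects it; crawl-urlfilter.txt just configures the urlfilter-regex
plugin when you run the crawl command, so with both urlfilter-regex and
urlfilter-prefix enabled a url has to pass both. Roughly (a sketch of
the idea, not the exact Nutch source):

**********************
// Simplified chaining (sketch): a url survives only if every active
// filter returns it non-null; returning null means "reject".
public String filter(String url, URLFilter[] filters) {
  for (URLFilter f : filters) {
    if (url == null) {
      return null;        // already rejected by an earlier filter
    }
    url = f.filter(url);  // each plugin may rewrite or reject the url
  }
  return url;
}
**********************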

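Also worth checking: the links you need (e.g. viewform.asp?UnId=12544)
resolve to urls containing '?', and the stock crawl-urlfilter.txt drops
those with its -[?*!@=] rule. A sketch of a crawl-urlfilter.txt that
lets them through, based on the patterns you posted below (note the
escaped dots):

**********************
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# the stock "-[?*!@=]" line is removed so that
# viewform.asp?UnId=... style links survive

+^https://([a-z0-9]*\.)*cmrinteract\.com/clintrial/
# skip everything else
-.
**********************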
>
> You can see that I commented out the -[?*!@=] line to make sure that
> the ? symbol is accepted.
>
> But when I try to fetch, nothing gets through:
>
> **********************
> Injector: starting
> Injector: crawlDb: crawl-novartis/crawldb
> Injector: urlDir: urls-novartis
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl-novartis/segments/20080423085835
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=0 - no more URLs to fetch.
> No URLs to fetch - check your seed list and URL filters.
> **********************
>
> One question: I might be mistaken, but I have the impression that url
> filtering plugins are actually "ANDed" with the crawl command's url
> filtering process, configured through the crawl-urlfilter.txt file. Is
> that right? To be honest, I don't fully understand how they interact...
>
>
> Again, thank you,
>
> David
>
>
>
>
> -----Original Message-----
> From: Dennis Kubes [mailto:[EMAIL PROTECTED]
> Sent: mardi, 22. avril 2008 19:23
> To: [email protected]
> Subject: Re: Generator: 0 records selected for fetching, exiting ...
>
> Instead of using the regex filter on this try using the
> prefix-urlfilter.  Also make sure that your agent name is set in the
> configuration.
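>
> For example, something like this in nutch-site.xml (the value below is
> just a placeholder; use your own agent string):
>
> **********************
> <property>
>  <name>http.agent.name</name>
>  <value>my-nutch-crawler</value>
> </property>
> **********************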
>
> Dennis
>
> POIRIER David wrote:
> > Dennis,
> >
> > Thanks for your reply.
> >
> > I did change the plugin.includes variable to eliminate the
> > protocol-http plugin and add the protocol-httpclient plugin instead.
> >
> > The problem is what happens afterward, since the page does get
> > fetched... or appears to. I think that something is wrong between
> > the fetch and parse processes.
> >
> > If you think of something Dennis, please let me know.
> >
> > David
> >
> >
> >
> >
> >
> > -----Original Message-----
> > From: Dennis Kubes [mailto:[EMAIL PROTECTED]
> > Sent: mardi, 22. avril 2008 16:04
> > To: [email protected]
> > Subject: Re: Generator: 0 records selected for fetching, exiting ...
> >
> > In the plugin.includes conf variable, protocol-http is loaded by
> > default.
> >
> >   I believe that the protocol-httpclient plugin needs to be loaded
> > instead to fetch https.
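> >
> > For example, take your plugin.includes value and swap protocol-http
> > for protocol-httpclient; with the stock 0.9 list that would be
> > roughly:
> >
> > **********************
> > <property>
> >  <name>plugin.includes</name>
> >  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > </property>
> > **********************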
> >
> > Dennis
> >
> > POIRIER David wrote:
> >> Hello,
> >>
> >> I went a little deeper and found that while the source page is being
> >> fetched during the crawl process, it is not parsed! I could see that
> >> by checking the logs of a parse plugin that I made, where I see no
> >> trace of the content of the source file.
> >>
> >> Take note that even though the source uses the https protocol, no
> >> credentials are needed.
> >>
> >> If you have an idea, please let me know,
> >>
> >>
> >> David
> >>
> >>
> >>
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: POIRIER David [mailto:[EMAIL PROTECTED]
> >> Sent: mardi, 22. avril 2008 10:58
> >> To: [email protected]
> >> Subject: Generator: 0 records selected for fetching, exiting ...
> >>
> >> Hello,
> >>
> >> I'm having issues with a source that I'm trying to fetch.
> >>
> >>
> >> Config: Nutch 0.9, Intranet mode, https
> >>
> >> I have modified my nutch-site.xml file to include the
> >> protocol-httpclient plugin; https should not be a problem, and
> >> indeed, when I check my logs I can see that the seed url is fetched:
> >>
> >> *************************
> >> Generator: Selecting best-scoring urls due for fetch.
> >> Generator: starting
> >> Generator: segment: crawl-cmrinteract/segments/20080422104050
> >> Generator: filtering: false
> >> Generator: topN: 2147483647
> >> Generator: jobtracker is 'local', generating exactly one partition.
> >> Generator: Partitioning selected urls by host, for politeness.
> >> Generator: done.
> >> Fetcher: starting
> >> Fetcher: segment: crawl-cmrinteract/segments/20080422104050
> >> Fetcher: threads: 10
> >> fetching https://www.cmrinteract.com/clintrial/search.asp
> >> Fetcher: done
> >> CrawlDb update: starting
> >> CrawlDb update: db: crawl-cmrinteract/crawldb
> >> CrawlDb update: segments: [crawl-cmrinteract/segments/20080422104050]
> >> CrawlDb update: additions allowed: true
> >> CrawlDb update: URL normalizing: true
> >> CrawlDb update: URL filtering: true
> >> CrawlDb update: Merging segment data into db.
> >> CrawlDb update: done
> >> *************************
> >>
> >> Unfortunately, the fetcher seems to be missing all the links on the
> >> page:
> >>
> >> *************************
> >> Generator: Selecting best-scoring urls due for fetch.
> >> Generator: starting
> >> Generator: segment: crawl-cmrinteract/segments/20080422104059
> >> Generator: filtering: false
> >> Generator: topN: 2147483647
> >> Generator: jobtracker is 'local', generating exactly one partition.
> >> Generator: 0 records selected for fetching, exiting ...
> >> Stopping at depth=1 - no more URLs to fetch.
> >> *************************
> >>
> >> The links in the source page should not cause any problem; an
> >> example:
> >>
> >> <a href="viewform.asp?UnId=12544"><img src="images/viewbut.jpg"
> >> border="0" height="20" width="80"></a>
> >>
> >>
> >> And, finally, my configuration in the crawl-urlfilter.txt file is, I
> >> think, right:
> >>
> >> +^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/search.asp
> >> +^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/viewform.asp
> >>
> >>
> >> I'm really stuck... if you have any idea, please let me know.
> >>
> >> David
>
