Dennis,

It didn't work either, even though I correctly created and configured
the required prefix-urlfilter.txt file (referenced in nutch-default.xml):

**********************
<property>
  <name>urlfilter.prefix.file</name>
  <value>prefix-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing url prefixes
  used by urlfilter-prefix (PrefixURLFilter) plugin.</description>
</property>
**********************
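
I also double-checked that the prefix filter plugin itself is
activated. My plugin.includes in nutch-site.xml looks roughly like the
following; the exact plugin list is from my own setup, so take it as
illustrative rather than authoritative:

**********************
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-prefix|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
</property>
**********************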



prefix-urlfilter.txt file content:

**********************
# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-(file|ftp|mailto):.*

# skip image and other suffixes we can't yet parse
-.*\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)

# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]

# accept anything else
+.*
**********************

You can see that I commented out the -[?*!@=] line to make sure that
URLs containing the ? symbol are accepted.
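
One thing I am starting to wonder about: could it be that
prefix-urlfilter.txt does not take regular expressions at all? From
the little I have read of the PrefixURLFilter code, it seems to expect
plain URL prefixes, one per line, with no '+'/'-' markers, something
like this (my guess only; example.com is a placeholder):

**********************
# One literal URL prefix per line; URLs starting with
# none of these prefixes are rejected.
http://www.example.com/
https://www.example.com/
**********************

If that is right, my regex-style file above would reject every URL.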

But when I try to fetch, nothing gets through:

**********************
Injector: starting
Injector: crawlDb: crawl-novartis/crawldb
Injector: urlDir: urls-novartis
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-novartis/segments/20080423085835
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
**********************
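
To narrow this down I also want to test the filter chain directly.
Assuming the URLFilterChecker tool is available in my version (I have
not verified that it ships with 0.9), something like this should show
whether a seed URL survives all active filters; the URL below is just
a placeholder for my actual seed:

**********************
echo "http://www.example.com/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
**********************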

One question: I might be mistaken, but I have the impression that the
URL filtering plugins are effectively "ANDed" with the crawl command's
URL filtering, which is configured through the crawl-urlfilter.txt
file. Is that right? To be honest, I don't fully understand how they
interact...
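
My mental model, which may well be wrong, is something like the sketch
below. It is not actual Nutch code; the only part I am fairly sure of
is that a URLFilter's filter() method returns null to reject a URL:

**********************
// Simplified sketch of how I imagine the active url filters chain
// (URLFilter is org.apache.nutch.net.URLFilter):
String filter(String url, URLFilter[] activeFilters) {
  for (URLFilter f : activeFilters) {  // every enabled urlfilter-* plugin
    url = f.filter(url);               // a filter returns null to reject
    if (url == null) return null;      // one rejection drops the URL
  }
  return url;                          // kept only if all filters accept it
}
**********************

If that is accurate, a URL must be accepted by every active filter, so
a prefix filter that rejects everything would override whatever
crawl-urlfilter.txt accepts.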

 
Again, thank you,

David




-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED] 
Sent: mardi, 22. avril 2008 19:23
To: [email protected]
Subject: Re: Generator: 0 records selected for fetching, exiting ...

Instead of using the regex filter on this try using the 
prefix-urlfilter.  Also make sure that your agent name is set in the 
configuration.

Dennis

POIRIER David wrote:
> Dennis,
> 
> Thanks for your reply.
> 
> I did change the plugin.includes variable to eliminate the
> protocol-http plugin and add the protocol-httpclient plugin instead.
> 
> The problem is what happens afterward, since the page is actually
> fetched... or seems to be. I think that something is wrong between
> the fetch and parse processes.
> 
> If you think of something Dennis, please let me know.
> 
> David
> 
> 
> 
> 
> 
> -----Original Message-----
> From: Dennis Kubes [mailto:[EMAIL PROTECTED] 
> Sent: mardi, 22. avril 2008 16:04
> To: [email protected]
> Subject: Re: Generator: 0 records selected for fetching, exiting ...
> 
> In the plugin.includes conf variable the protocol-http plugin is
> loaded by default.
> 
> I believe that the protocol-httpclient plugin needs to be loaded
> instead to parse https.
> 
> Dennis
> 
> POIRIER David wrote:
>> Hello,
>>
>> I went a little deeper and found that while the source page is being
>> fetched during the crawl process, it is not parsed! I could see that
>> by checking the logs of a parse plugin that I made, where I see no
>> trace of the content of the source file.
>>
>> Take note that even though the source uses the https protocol, no
>> credentials are needed. 
>>
>> If you have an idea, please let me know,
>>
>>
>> David
>>
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: POIRIER David [mailto:[EMAIL PROTECTED] 
>> Sent: mardi, 22. avril 2008 10:58
>> To: [email protected]
>> Subject: Generator: 0 records selected for fetching, exiting ...
>>
>> Hello, 
>>
>> I'm having issues with a source that I'm trying to fetch.
>>
>>
>> Config: Nutch 0.9, Intranet mode, https
>>
>> I have modified my nutch-site.xml file to include the
>> protocol-httpclient plugin; https should not be a problem, and
>> indeed, when I check my logs I can see that the seed url is fetched:
>>
>> *************************
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: starting
>> Generator: segment: crawl-cmrinteract/segments/20080422104050
>> Generator: filtering: false
>> Generator: topN: 2147483647
>> Generator: jobtracker is 'local', generating exactly one partition.
>> Generator: Partitioning selected urls by host, for politeness.
>> Generator: done.
>> Fetcher: starting
>> Fetcher: segment: crawl-cmrinteract/segments/20080422104050
>> Fetcher: threads: 10
>> fetching https://www.cmrinteract.com/clintrial/search.asp
>> Fetcher: done
>> CrawlDb update: starting
>> CrawlDb update: db: crawl-cmrinteract/crawldb
>> CrawlDb update: segments: [crawl-cmrinteract/segments/20080422104050]
>> CrawlDb update: additions allowed: true
>> CrawlDb update: URL normalizing: true
>> CrawlDb update: URL filtering: true
>> CrawlDb update: Merging segment data into db.
>> CrawlDb update: done
>> *************************
>>
>> Unfortunately, the fetcher seems to be missing all the links on the
>> page:
>>
>> *************************
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: starting
>> Generator: segment: crawl-cmrinteract/segments/20080422104059
>> Generator: filtering: false
>> Generator: topN: 2147483647
>> Generator: jobtracker is 'local', generating exactly one partition.
>> Generator: 0 records selected for fetching, exiting ...
>> Stopping at depth=1 - no more URLs to fetch.
>> *************************
>>
>> The links in the source page should not cause any problem; an
>> example:
>>
>> <a href="viewform.asp?UnId=12544"><img src="images/viewbut.jpg"
>> border="0" height="20" width="80"></a>
>>
>>
>> And, finally, my configuration in the crawl-urlfilter.txt file is, I
>> think, right:
>>
>> +^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/search.asp
>> +^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/viewform.asp
>>
>>
>> I'm really stuck... if you have any idea, please let me know.
>>
>> David
