Dennis,

I changed the log4j config like you suggested. I also set the following
two parameters:

<property>
  <name>fetcher.verbose</name>
  <value>true</value>
  <description>If true, fetcher will log more verbosely.</description>
</property>
        
<property>
  <name>http.verbose</name>
  <value>true</value>
  <description>If true, HTTP will log more verbosely.</description>
</property>
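
For reference, the log4j.properties change boils down to raising the Nutch
loggers to DEBUG, roughly like this (I kept the appenders as they are
defined in the stock conf/log4j.properties):

# log everything coming from the Nutch classes
log4j.logger.org.apache.nutch=DEBUG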

I then restarted the crawling process and... got no error. Here is what I
found:

2008-04-25 14:27:49,675 WARN  regex.RegexURLNormalizer - can't find
rules for scope 'partition', using default
2008-04-25 14:27:54,222 WARN  regex.RegexURLNormalizer - can't find
rules for scope 'crawldb', using default
2008-04-25 14:27:54,754 WARN  regex.RegexURLNormalizer - can't find
rules for scope 'crawldb', using default
2008-04-25 14:27:58,582 WARN  crawl.Generator - Generator: 0 records
selected for fetching, exiting ...

By the way, I didn't see any change in the logs coming from the fetcher.

But beyond that... I got nothing. If you, or anybody else, have an idea,
please let me know.

Thank you,

David





-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, 23 April 2008 17:02
To: [email protected]
Subject: Re: Generator: 0 records selected for fetching, exiting ...

As I see it, the reasons for this not being fetched could be: all urls
being filtered out for some reason; the agent name not being set, which
causes fetching to be skipped; errors in the fetching process due to DNS
problems; errors in parsing for some unknown reason; or, since this is
https, errors in the https authentication or connection.

We have changed the filters, and that doesn't seem to help. You said the
agent name was there; if it weren't, there would be an error in the log
file stating it wasn't set. Fetching errors, parsing errors, and https
errors should all show up in the log file. My best advice from here is to
change log4j.properties to log everything and then see what errors are
occurring.
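
For the agent name, what matters is the http.agent.name property, normally
set in nutch-site.xml, something like this (the value below is just a
placeholder):

<property>
  <name>http.agent.name</name>
  <value>my-crawler</value>
  <description>HTTP 'User-Agent' request header.</description>
</property>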

Dennis

POIRIER David wrote:
> Dennis,
> 
> It didn't work either, even though I correctly created the required
> prefix-urlfilter.txt file and configured it in nutch-default.xml:
> 
> **********************
> <property>
>   <name>urlfilter.prefix.file</name>
>   <value>prefix-urlfilter.txt</value>
>   <description>Name of file on CLASSPATH containing url prefixes
>   used by urlfilter-prefix (PrefixURLFilter) plugin.</description>
> </property>
> **********************
> 
> 
> 
> prefix-urlfilter.txt file content:
> 
> **********************
> # The default url filter.
> # Better for whole-internet crawling.
> 
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
> 
> # skip file: ftp: and mailto: urls
> -(file|ftp|mailto):.*
> 
> # skip image and other suffixes we can't yet parse
> -.*\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)
> 
> # skip URLs containing certain characters as probable queries, etc.
> # [EMAIL PROTECTED]
> 
> # accept anything else
> +.*
> **********************
> 
> You can observe that I commented out the [EMAIL PROTECTED] line to make
> sure that the ? symbol is accepted.
> 
> But when I try to fetch, nothing gets through:
> 
> **********************
> Injector: starting
> Injector: crawlDb: crawl-novartis/crawldb
> Injector: urlDir: urls-novartis
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl-novartis/segments/20080423085835
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=0 - no more URLs to fetch.
> No URLs to fetch - check your seed list and URL filters.
> **********************
> 
> One question: I might be mistaken, but I have the impression that url
> filtering plugins are actually "ANDed" with the crawl command's url
> filtering process, configured through the crawl-urlfilter.txt file. Is
> that right? To be honest, I don't fully understand how they interact...

That is correct; any enabled filter plugins are run sequentially.
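For example, with both urlfilter-regex and urlfilter-prefix enabled in
plugin.includes, a url has to be accepted by crawl-urlfilter.txt and by
prefix-urlfilter.txt before the generator will select it; either file can
reject it.

Also, if I remember correctly, prefix-urlfilter.txt is not a regex file
like the one you pasted: PrefixURLFilter treats every non-comment line as a
plain url prefix and drops any url that doesn't start with one of them. So
it would just contain something like:

# allowed prefixes, one per line
https://www.cmrinteract.com/clintrial/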

Dennis
> 
>  
> Again, thank you,
> 
> David
> 
> 
> 
> 
> -----Original Message-----
> From: Dennis Kubes [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, 22 April 2008 19:23
> To: [email protected]
> Subject: Re: Generator: 0 records selected for fetching, exiting ...
> 
> Instead of using the regex filter on this, try using the
> prefix-urlfilter. Also make sure that your agent name is set in the
> configuration.
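> (Note that urlfilter-prefix also has to be added to the plugin.includes
> property -- it is not in the default list -- otherwise the prefix file is
> never read.)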
> 
> Dennis
> 
> POIRIER David wrote:
>> Dennis,
>>
>> Thanks for your reply.
>>
>> I did change the plugin.includes variable to eliminate the protocol-http
>> plugin and add the protocol-httpclient plugin instead.
>>
>> The problem must come afterward, since the page is actually fetched...
>> or at least it looks like it is. I think that something is wrong between
>> the fetch and parse processes.
>>
>> If you think of something, Dennis, please let me know.
>>
>> David
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: Dennis Kubes [mailto:[EMAIL PROTECTED] 
>> Sent: Tuesday, 22 April 2008 16:04
>> To: [email protected]
>> Subject: Re: Generator: 0 records selected for fetching, exiting ...
>>
>> In the plugin.includes conf variable, protocol-http is loaded by
>> default. I believe that the protocol-httpclient plugin needs to be
>> loaded instead to handle https.
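>>
>> In nutch-site.xml that means overriding plugin.includes, something like
>> this (swapping protocol-http for protocol-httpclient and keeping the
>> rest of the default value):
>>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>> </property>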
>>
>> Dennis
>>
>> POIRIER David wrote:
>>> Hello,
>>>
>>> I went a little deeper and found that while the source page is being
>>> fetched during the crawl process, it is not parsed! I could see that by
>>> checking the logs of a parse plugin that I made, where I see no trace of
>>> the content of the source file.
>>>
>>> Take note that even though the source uses the https protocol, no
>>> credentials are needed. 
>>>
>>> If you have an idea, please let me know,
>>>
>>>
>>> David
>>>
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: POIRIER David [mailto:[EMAIL PROTECTED] 
>>> Sent: Tuesday, 22 April 2008 10:58
>>> To: [email protected]
>>> Subject: Generator: 0 records selected for fetching, exiting ...
>>>
>>> Hello, 
>>>
>>> I'm having issues with a source that I'm trying to fetch.
>>>
>>>
>>> Config: Nutch 0.9, Intranet mode, https
>>>
>>> I have modified my nutch-site.xml file to include the
>>> protocol-httpclient plugin; https should not be a problem, and indeed,
>>> when I check my logs I can see that the seed url is fetched:
>>>
>>> *************************
>>> Generator: Selecting best-scoring urls due for fetch.
>>> Generator: starting
>>> Generator: segment: crawl-cmrinteract/segments/20080422104050
>>> Generator: filtering: false
>>> Generator: topN: 2147483647
>>> Generator: jobtracker is 'local', generating exactly one partition.
>>> Generator: Partitioning selected urls by host, for politeness.
>>> Generator: done.
>>> Fetcher: starting
>>> Fetcher: segment: crawl-cmrinteract/segments/20080422104050
>>> Fetcher: threads: 10
>>> fetching https://www.cmrinteract.com/clintrial/search.asp
>>> Fetcher: done
>>> CrawlDb update: starting
>>> CrawlDb update: db: crawl-cmrinteract/crawldb
>>> CrawlDb update: segments: [crawl-cmrinteract/segments/20080422104050]
>>> CrawlDb update: additions allowed: true
>>> CrawlDb update: URL normalizing: true
>>> CrawlDb update: URL filtering: true
>>> CrawlDb update: Merging segment data into db.
>>> CrawlDb update: done
>>> *************************
>>>
>>> Unfortunately, the fetcher seems to be missing all the links on the
>>> page:
>>>
>>> *************************
>>> Generator: Selecting best-scoring urls due for fetch.
>>> Generator: starting
>>> Generator: segment: crawl-cmrinteract/segments/20080422104059
>>> Generator: filtering: false
>>> Generator: topN: 2147483647
>>> Generator: jobtracker is 'local', generating exactly one partition.
>>> Generator: 0 records selected for fetching, exiting ...
>>> Stopping at depth=1 - no more URLs to fetch.
>>> *************************
>>>
>>> The links in the source page should not cause any problem; an example:
>>> <a href="viewform.asp?UnId=12544"><img src="images/viewbut.jpg"
>>> border="0" height="20" width="80"></a>
>>>
>>>
>>> And, finally, my configuration in the crawl-urlfilter.txt file is, I
>>> think, right:
>>>
>>> +^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/search.asp
>>> +^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/viewform.asp
>>>
>>>
>>> I'm really stuck... if you have any idea, please let me know.
>>>
>>> David
