As I see it, the reasons for this not being fetched could be: all URLs being filtered out for some reason; the agent name not being set, causing fetching to be skipped; errors in the fetching process due to DNS problems; errors in parsing for some unknown reason; or, since this is https, errors in the https authentication or connection.

We have changed the filters; that doesn't seem to help. You said the agent name was there; if it were not, there would be an error in the log file stating it wasn't set. Fetching errors, parsing errors, and https errors should all show up in the log file as well. My best advice from here is to change log4j.properties to log everything and then see what errors are occurring.
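
For example, raising the log levels in conf/log4j.properties along these lines should surface whatever is failing (a sketch only; keep the appender name your copy of the file already defines):

**********************
# Sketch: match the appender (e.g. DRFA) to whatever your
# conf/log4j.properties already defines.
log4j.rootLogger=DEBUG,DRFA
log4j.logger.org.apache.nutch=DEBUG
log4j.logger.org.apache.hadoop=DEBUG
**********************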

Dennis

POIRIER David wrote:
Dennis,

It didn't work either, even though I correctly created and configured the required prefix-urlfilter.txt file (referenced in nutch-default.xml):

**********************
<property>
  <name>urlfilter.prefix.file</name>
  <value>prefix-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing url prefixes
  used by urlfilter-prefix (PrefixURLFilter) plugin.</description>
</property>
**********************



prefix-urlfilter.txt file content:

**********************
# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-(file|ftp|mailto):.*

# skip image and other suffixes we can't yet parse
-.*\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)

# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]

# accept anything else
+.*
**********************

You can see that I commented out the -[?*!@=] line to make sure that the ? symbol is accepted.

But when I try to fetch, nothing gets through:

**********************
Injector: starting
Injector: crawlDb: crawl-novartis/crawldb
Injector: urlDir: urls-novartis
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-novartis/segments/20080423085835
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
**********************

One question: I might be mistaken, but I have the impression that the url filtering plugins are actually "ANDed" with the crawl command's url filtering process, which is configured through the crawl-urlfilter.txt file. Is that right? To be honest, I don't fully understand how they interact...

That is correct; any enabled filter plugins are run sequentially, so a URL has to pass every one of them.
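
Conceptually, each enabled filter gets a veto; something like this rough sketch of the chaining idea (not the actual Nutch source):

**********************
// Rough sketch of how enabled URL filter plugins chain together
// (assumes org.apache.nutch.net.URLFilter, whose filter(String)
// returns null to reject a URL). Any rejection is final, which is
// what makes the filters behave as if "ANDed".
public String filter(String url, URLFilter[] enabledFilters) {
  for (URLFilter f : enabledFilters) { // e.g. urlfilter-regex, urlfilter-prefix
    url = f.filter(url);
    if (url == null) {
      return null;                     // rejected by this filter
    }
  }
  return url;                          // accepted by every filter
}
**********************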

Dennis

Again, thank you,

David




-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, 22 April 2008 19:23
To: [email protected]
Subject: Re: Generator: 0 records selected for fetching, exiting ...

Instead of using the regex filter on this, try using the prefix-urlfilter. Also make sure that your agent name is set in the configuration.
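
The agent name goes in nutch-site.xml, something like this (the value shown is only an example name):

**********************
<property>
  <name>http.agent.name</name>
  <value>MyTestCrawler</value>
</property>
**********************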

Dennis

POIRIER David wrote:
Dennis,

Thanks for your reply.

I did change the plugin.includes variable to eliminate the protocol-http plugin and add the protocol-httpclient plugin instead.

The problem is afterward, since the page is actually fetched... or looks like it is. I think that something is wrong between the fetch and parse processes.

If you think of something Dennis, please let me know.

David





-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, 22 April 2008 16:04
To: [email protected]
Subject: Re: Generator: 0 records selected for fetching, exiting ...

In the plugin.includes conf variable, the protocol-http plugin is loaded by default. I believe that the protocol-httpclient plugin needs to be loaded instead to handle https.
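
Roughly like this in nutch-site.xml (a sketch; the rest of the list should mirror your existing plugin.includes value, with only protocol-http swapped out):

**********************
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
**********************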

Dennis

POIRIER David wrote:
Hello,

I went a little deeper and found that while the source page is being fetched during the crawl process, it is not parsed! I could see that by checking the logs of a parse plugin that I made, where I see no trace of the content of the source file.

Take note that even though the source uses the https protocol, no credentials are needed.

If you have an idea, please let me know,


David






-----Original Message-----
From: POIRIER David [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, 22 April 2008 10:58
To: [email protected]
Subject: Generator: 0 records selected for fetching, exiting ...

Hello,
I'm having issues with a source that I'm trying to fetch.


Config: Nutch 0.9, Intranet mode, https

I have modified my nutch-site.xml file to include the protocol-httpclient plugin; https should not be a problem, and indeed, when I check my logs I can see that the seed url is fetched:

*************************
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-cmrinteract/segments/20080422104050
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl-cmrinteract/segments/20080422104050
Fetcher: threads: 10
fetching https://www.cmrinteract.com/clintrial/search.asp
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl-cmrinteract/crawldb
CrawlDb update: segments: [crawl-cmrinteract/segments/20080422104050]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
*************************

Unfortunately, the fetcher seems to be missing all the links on the page:

*************************
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl-cmrinteract/segments/20080422104059
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
*************************

The links in the source page should not cause any problems; an example:

<a href="viewform.asp?UnId=12544"><img src="images/viewbut.jpg" border="0" height="20" width="80"></a>


And, finally, my configuration in the crawl-urlfilter.txt file is, I think, right:

+^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/search.asp
+^https://([a-z0-9]*\.)*cmrinteract.com/clintrial/viewform.asp


I'm really stuck... if you have any idea, please let me know.

David
