Ed,
I suspect the problem is with the Content-Type header that your server
is sending back. Using the command-line utility curl, this is what I
get for my server:
[EMAIL PROTECTED] conf]# curl -i http://www.aarp.org/ |less
HTTP/1.1 200 OK
Date: Thu, 15 Sep 2005 14:51:56 GMT
Server: Apache
Content-Length: 15949
Content-Type: text/html
What Content-Type are you getting back for planet.abc.com?
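In case it helps, here is a rough sketch of one way to pull just that header out of curl's output. The content_type function name is made up for illustration, and it assumes a POSIX shell with tr and awk available. The curl flags in the comment are standard ones: -s suppresses the progress meter, -D - writes the response headers to stdout, and -o /dev/null discards the body.

```shell
# Hypothetical helper (not part of curl or Nutch): reads a raw HTTP
# header block on stdin and prints the value of the Content-Type header.
content_type() {
  # Strip carriage returns, stop at the blank line that ends the headers,
  # and match the header name case-insensitively.
  tr -d '\r' | awk -F': *' '
    /^$/ { exit }
    tolower($1) == "content-type" { print $2; exit }
  '
}

# Example against a live server:
#   curl -s -D - -o /dev/null https://planet.abc.com/ | content_type
```

If this prints nothing, the server may not be sending a Content-Type at all, which would be consistent with the empty contentType in your log message, though that is only a guess.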
Jake.
-----Original Message-----
From: Edward Quick [mailto:[EMAIL PROTECTED]
Sent: Thursday, September 15, 2005 10:49 AM
To: [email protected]
Subject: can't parse https
Hi,
I'm trying to run a crawl on our DMZ intranet using the root URL
https://planet.abc.com/
However, Nutch says it can't parse https. I have loaded
protocol-httpclient and set crawl-urlfilter.txt to
+^https://planet.abc.com/
In nutch-site.xml I have:
<property>
<name>plugin.includes</name>
<value>protocol-(httpclient|http|file|ftp|file)|urlfilter-regex|parse-(text|html|js|msword|pdf|rss|ext)|index-(basic|more)|query-(basic|site|url|more)</value>
<description></description>
</property>
And this is what the crawl.log gives me:
050915 154033 Overall processing: Sorted 1 entries in 0.0090 seconds.
050915 154033 Overall processing: Sorted 0.0090 entries/second
050915 154033 FetchListTool completed
050915 154033 logging at INFO
050915 154034 fetching https://planet.abc.com/
050915 154034 http.proxy.host = null
050915 154034 http.proxy.port = 8080
050915 154034 http.timeout = 10000
050915 154034 http.content.limit = -1
050915 154034 http.agent = NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; [email protected])
050915 154034 http.auth.ntlm.username =
050915 154034 fetcher.server.delay = 1000
050915 154034 http.max.delays = 100
050915 154035 Configured Client
050915 154042 fetch okay, but can't parse https://planet.abc.com/, reason: failed(2,0): No external command defined for contentType:
Can anyone help me out please?
Thanks,
Ed.