RE: can't parse https

Vanderdray, Jake Thu, 15 Sep 2005 07:54:48 -0700

Ed,

        I suspect the problem is with the Content-Type header that your
server is responding with.  Using the command-line utility curl, this is
what I get for my server:


[EMAIL PROTECTED] conf]# curl -i http://www.aarp.org/ |less

  7 15949    7  1255    0     0  10286      0  0:00:01  0:00:00  0:00:01 10286
HTTP/1.1 200 OK
Date: Thu, 15 Sep 2005 14:51:56 GMT
Server: Apache
Content-Length: 15949
Content-Type: text/html

        What Content-Type are you getting back for planet.abc.com?
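
        If it helps, here is a rough Python sketch (not anything from
Nutch itself) for pulling the Content-Type out of saved "curl -i" output.
The function name content_type_of and the sample text are only
illustrative assumptions; the sample reuses the headers shown above. An
empty result would be consistent with the blank value after
"contentType:" in your log, though that is a guess until you check your
own server.

```python
# Sketch: extract the Content-Type header from raw `curl -i` style output.
# content_type_of is a hypothetical helper name, not part of Nutch or curl.
from email.parser import Parser


def content_type_of(raw_response: str) -> str:
    """Return the Content-Type header value, or '' if it is missing."""
    # Skip anything before the status line (e.g. curl's progress meter).
    start = raw_response.find("HTTP/")
    if start == -1:
        return ""
    # Split off the status line; what follows is RFC 822 style headers.
    _status, _, rest = raw_response[start:].partition("\n")
    headers = Parser().parsestr(rest, headersonly=True)
    return (headers.get("Content-Type") or "").strip()


# Sample taken from the curl output quoted in this message.
sample = """HTTP/1.1 200 OK
Date: Thu, 15 Sep 2005 14:51:56 GMT
Server: Apache
Content-Length: 15949
Content-Type: text/html

"""

print(content_type_of(sample))
```

        For an https server with a self-signed certificate, curl may
also need the -k option before it will show you any headers at all.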

Jake.

-----Original Message-----
From: Edward Quick [mailto:[EMAIL PROTECTED] 
Sent: Thursday, September 15, 2005 10:49 AM
To: [email protected]
Subject: can't parse https

Hi,

I'm trying to run a crawl on our DMZ intranet using the root url 
https://planet.abc.com/
However, Nutch says it can't parse https. I have loaded
protocol-httpclient and set crawl-urlfilter.txt to

+^https://planet.abc.com/

In nutch-site.xml I have:

<property>
  <name>plugin.includes</name>
  
  <value>protocol-(httpclient|http|file|ftp|file)|urlfilter-regex|parse-(text|html|js|msword|pdf|rss|ext)|index-(basic|more)|query-(basic|site|url|more)</value>
  <description></description>
</property>

And this is what the crawl.log gives me:

050915 154033 Overall processing: Sorted 1 entries in 0.0090 seconds.
050915 154033 Overall processing: Sorted 0.0090 entries/second
050915 154033 FetchListTool completed
050915 154033 logging at INFO
050915 154034 fetching https://planet.abc.com/
050915 154034 http.proxy.host = null
050915 154034 http.proxy.port = 8080
050915 154034 http.timeout = 10000
050915 154034 http.content.limit = -1
050915 154034 http.agent = NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; [email protected])
050915 154034 http.auth.ntlm.username =
050915 154034 fetcher.server.delay = 1000
050915 154034 http.max.delays = 100
050915 154035 Configured Client
050915 154042 fetch okay, but can't parse https://planet.abc.com/, reason: failed(2,0): No external command defined for contentType:


Can anyone help me out please?

Thanks,

Ed.
