Thanks Jake. I tried curl and got:

$ curl -i https://planet.abc.com
HTTP/1.1 302 302 Redirect
Server: Lotus-Domino/0
Date: Thu, 15 Sep 2005 16:35:37 GMT
Connection: close
Location: https://auth.abc.com/obrareq.cgi?t%3D0%20o%3D%20no%3D%20r%3D%20nr%3D
%20wu%3D%2F%20wh%3Dplanet.abc.com%20wo%3D1%20wa%3D15%20ws%3D20041005T1031537
0719%20rh%3Dhttps%3A%2F%2Fplanet.abc.com%20ru%3D%252F

HTTP/1.1 302 Found
Server: Lotus-Domino/0
Date: Thu, 15 Sep 2005 16:31:40 GMT
Location: general/aptrix/bani.nsf/content/planet+home
Connection: close
Content-Type: text/html
Content-Length: 329

<HTML><HEAD><TITLE>Redirection</TITLE></HEAD><BODY><H1>Redirection</H1>This document can be found <A HREF="general/aptrix/bani.nsf/content/planet+home">elsewhere.</A><P>You see this message because your browser does not support automatic redirection handling. <P><HR><ADDRESS><A HREF="/">Lotus-Domino 0</A></ADDRESS></BODY></HTML>$


As you can see, the content type is text/html, so I wonder if the problem is with the redirect. I should be able to use Nutch to crawl https, though, shouldn't I?
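
One way to look at the redirect question is to pull the status line, Content-Type, and Location out of each response block and resolve the Location against the requested URL. The Python sketch below is only an illustration and is not part of Nutch. The function names are made up, the headers are abbreviated from the curl output above (the long Location query string is shortened to "..."), and it assumes both responses were fetched relative to https://planet.abc.com/.

```python
from urllib.parse import urljoin

# Header blocks abbreviated from the curl output in this message.
RAW = """\
HTTP/1.1 302 302 Redirect
Server: Lotus-Domino/0
Connection: close
Location: https://auth.abc.com/obrareq.cgi?...

HTTP/1.1 302 Found
Server: Lotus-Domino/0
Location: general/aptrix/bani.nsf/content/planet+home
Connection: close
Content-Type: text/html
Content-Length: 329
"""


def parse_blocks(raw):
    """Split raw text on blank lines; return one dict per response."""
    responses = []
    for block in raw.strip().split("\n\n"):
        lines = block.splitlines()
        headers = {}
        for line in lines[1:]:
            # Split at the first colon only, so URLs in values stay intact.
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
        responses.append({"status": lines[0], "headers": headers})
    return responses


def summarize(raw, base_url):
    """Return (status, content type or None, absolute target or None) per response."""
    rows = []
    for resp in parse_blocks(raw):
        headers = resp["headers"]
        location = headers.get("location")
        # A relative Location is resolved against the URL that was requested.
        target = urljoin(base_url, location) if location else None
        rows.append((resp["status"], headers.get("content-type"), target))
    return rows


if __name__ == "__main__":
    for status, ctype, target in summarize(RAW, "https://planet.abc.com/"):
        print(status, "| Content-Type:", ctype, "| ->", target)
```

Read this way, the first 302 carries no Content-Type header at all, which might fit the empty value after "contentType:" in the crawl log, though that is only a guess. The second 302 uses a relative Location, so whatever follows it has to resolve it against the request URL.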

Ed.

Ed,

        I suspect the problem is with the ContentType that your server
is responding with.  Using the command line utility curl this is what I
get for my server:

[EMAIL PROTECTED] conf]# curl -i http://www.aarp.org/ |less

  7 15949    7  1255    0     0  10286      0  0:00:01  0:00:00  0:00:01 10286
HTTP/1.1 200 OK
Date: Thu, 15 Sep 2005 14:51:56 GMT
Server: Apache
Content-Length: 15949
Content-Type: text/html

        What Content-Type are you getting back for planet.abc.com?

Jake.

-----Original Message-----
From: Edward Quick [mailto:[EMAIL PROTECTED]
Sent: Thursday, September 15, 2005 10:49 AM
To: [email protected]
Subject: can't parse https

Hi,

I'm trying to run a crawl on our DMZ intranet using the root URL
https://planet.abc.com/
However, Nutch says it can't parse https. I have loaded protocol-httpclient
and set crawl-urlfilter.txt to

+^https://planet.abc.com/
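
For reference, a rough sketch of how a rule file like this is usually read: each line starts with + (accept) or - (reject) followed by a regular expression, the first rule whose pattern is found in the URL decides, and a URL that matches no rule is rejected. The Python below approximates that behaviour under those assumptions; it is not Nutch's code, and load_rules and accepts are made-up names.

```python
import re


def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First rule whose pattern is found in the URL wins; no match means reject."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


if __name__ == "__main__":
    rules = load_rules("+^https://planet.abc.com/")
    for url in ("https://planet.abc.com/",
                "http://planet.abc.com/",
                "https://auth.abc.com/obrareq.cgi"):
        print(url, "accepted" if accepts(rules, url) else "rejected")
```

Under these assumptions the single rule accepts only https URLs on planet.abc.com, so a URL on any other host would be filtered out.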

In nutch-site.xml I have:

<property>
  <name>plugin.includes</name>
  <value>protocol-(httpclient|http|file|ftp|file)|urlfilter-regex|parse-(text|html|js|msword|pdf|rss|ext)|index-(basic|more)|query-(basic|site|url|more)</value>
  <description></description>
</property>
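
As a quick sanity check on that value, the Python sketch below tests a few plugin IDs against it. It assumes the plugin.includes expression has to match a plugin's whole ID, which is an assumption about how Nutch applies it and is not confirmed in this thread. The regex is the value above with the mail line wraps removed, and is_included is a made-up helper.

```python
import re

# The plugin.includes value from nutch-site.xml, joined back into one regex.
PLUGIN_INCLUDES = (
    "protocol-(httpclient|http|file|ftp|file)|urlfilter-regex|"
    "parse-(text|html|js|msword|pdf|rss|ext)|index-(basic|more)|"
    "query-(basic|site|url|more)"
)


def is_included(plugin_id, pattern=PLUGIN_INCLUDES):
    """True if the whole plugin ID matches the include expression."""
    return re.fullmatch(pattern, plugin_id) is not None


if __name__ == "__main__":
    for plugin_id in ("protocol-httpclient", "parse-html", "parse-ext", "parse-zip"):
        print(plugin_id, is_included(plugin_id))
```

Under that assumption, protocol-httpclient, parse-html, and parse-ext are all enabled by this value. Since parse-ext is in the list, it may be worth checking whether the "No external command defined" message comes from that plugin being handed a response with no usable content type.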

And this is what the crawl.log gives me:

050915 154033 Overall processing: Sorted 1 entries in 0.0090 seconds.
050915 154033 Overall processing: Sorted 0.0090 entries/second
050915 154033 FetchListTool completed
050915 154033 logging at INFO
050915 154034 fetching https://planet.abc.com/
050915 154034 http.proxy.host = null
050915 154034 http.proxy.port = 8080
050915 154034 http.timeout = 10000
050915 154034 http.content.limit = -1
050915 154034 http.agent = NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; [email protected])
050915 154034 http.auth.ntlm.username =
050915 154034 fetcher.server.delay = 1000
050915 154034 http.max.delays = 100
050915 154035 Configured Client
050915 154042 fetch okay, but can't parse https://planet.abc.com/, reason: failed(2,0): No external command defined for contentType:


Can anyone help me out please?

Thanks,

Ed.



