Ed,

        By default, URLs that look like queries don't get crawled (URLs
that contain ?, *, !, @ or =).  You may need to go into
conf/crawl-urlfilter.txt and change this line:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

        At a minimum you'll need to take out the '?', but you might want
to just comment out the whole line and see how that works for you.
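
        For example (just a sketch, assuming the stock crawl-urlfilter.txt
that ships with Nutch 0.7), dropping only the '?' from that character
class would leave:

# skip URLs containing certain characters as probable queries, etc.
-[*!@=]

while putting a '#' in front of the whole -[...] line disables the filter
entirely and lets query-style URLs through.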

Good Luck,
Jake.

-----Original Message-----
From: Edward Quick [mailto:[EMAIL PROTECTED] 
Sent: Thursday, September 15, 2005 3:06 PM
To: [email protected]
Subject: RE: can't parse https

Thanks Jake. I tried curl and got:

$ curl -i https://planet.abc.com
HTTP/1.1 302 302 Redirect
Server: Lotus-Domino/0
Date: Thu, 15 Sep 2005 16:35:37 GMT
Connection: close
Location: https://auth.abc.com/obrareq.cgi?t%3D0%20o%3D%20no%3D%20r%3D%20nr%3D%20wu%3D%2F%20wh%3Dplanet.abc.com%20wo%3D1%20wa%3D15%20ws%3D20041005T10315370719%20rh%3Dhttps%3A%2F%2Fplanet.abc.com%20ru%3D%252F

HTTP/1.1 302 Found
Server: Lotus-Domino/0
Date: Thu, 15 Sep 2005 16:31:40 GMT
Location: general/aptrix/bani.nsf/content/planet+home
Connection: close
Content-Type: text/html
Content-Length: 329

<HTML><HEAD><TITLE>Redirection</TITLE></HEAD><BODY><H1>Redirection</H1>This
document can be found
<A HREF="general/aptrix/bani.nsf/content/planet+home">elsewhere.</A>
<P>You see this message because your browser does not support automatic
redirection handling.
<P><HR><ADDRESS><A HREF="/">Lotus-Domino 0</A></ADDRESS></BODY></HTML>$


As you can see, the content type is text/html, so I wonder if the problem
is with the redirect? I should be able to use Nutch to crawl https though,
shouldn't I?
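
(In case it helps: I believe curl can follow the whole redirect chain
itself with -L, e.g.

$ curl -iL https://planet.abc.com/

which should print the headers for every hop, including the Content-Type
of the final page sitting behind the auth redirect.)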

Ed.
>Ed,
>
>       I suspect the problem is with the ContentType that your server
>is responding with.  Using the command line utility curl this is what I
>get for my server:
>
>[EMAIL PROTECTED] conf]# curl -i http://www.aarp.org/ |less
>
>   7 15949    7  1255    0     0  10286      0  0:00:01  0:00:00  0:00:01 10286
>HTTP/1.1 200 OK
>Date: Thu, 15 Sep 2005 14:51:56 GMT
>Server: Apache
>Content-Length: 15949
>Content-Type: text/html
>
>       What Content-Type are you getting back for planet.abc.com?
>
>Jake.
>
>-----Original Message-----
>From: Edward Quick [mailto:[EMAIL PROTECTED]
>Sent: Thursday, September 15, 2005 10:49 AM
>To: [email protected]
>Subject: can't parse https
>
>Hi,
>
>I'm trying to run a crawl on our DMZ intranet using the root url
>https://planet.abc.com/
>However Nutch says it can't parse https. I have loaded
>protocol-httpclient,
>and set crawl-urlfilter.txt to
>
>+^https://planet.abc.com/
>
>In nutch-site.xml I have:
>
><property>
>   <name>plugin.includes</name>
>
><value>protocol-(httpclient|http|file|ftp|file)|urlfilter-regex|parse-(text|html|js|msword|pdf|rss|ext)|index-(basic|more)|query-(basic|site|url|more)</value>
>   <description></description>
></property>
>
>And this is what the crawl.log gives me:
>
>050915 154033 Overall processing: Sorted 1 entries in 0.0090 seconds.
>050915 154033 Overall processing: Sorted 0.0090 entries/second
>050915 154033 FetchListTool completed
>050915 154033 logging at INFO
>050915 154034 fetching https://planet.abc.com/
>050915 154034 http.proxy.host = null
>050915 154034 http.proxy.port = 8080
>050915 154034 http.timeout = 10000
>050915 154034 http.content.limit = -1
>050915 154034 http.agent = NutchCVS/0.7 (Nutch;
>http://lucene.apache.org/nutch/bot.html; [email protected])
>050915 154034 http.auth.ntlm.username =
>050915 154034 fetcher.server.delay = 1000
>050915 154034 http.max.delays = 100
>050915 154035 Configured Client
>050915 154042 fetch okay, but can't parse https://planet.abc.com/,
>reason: failed(2,0): No external command defined for contentType:
>
>
>Can anyone help me out please?
>
>Thanks,
>
>Ed.
>
>

