I'm using last Sunday's 3.2b4 build on FreeBSD 4.3-RELEASE.  I have plenty 
of memory and disk space.

My indexing run terminates abruptly after indexing only a few pages.  At 
first I thought I might have hit a limit; the page being indexed had 38431 
lines containing links.  I tried setting different values for 
max_doc_size, up to 4000000, but it made no difference.  I then reduced the 
page to just 50 links, and it still crapped out, until I removed one link 
in the middle of the list.  Yet that same link, indexed by itself, works fine.
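
In case it helps anyone reproduce this, the attribute goes in htdig.conf 
like so (the database_dir path and start_url here are placeholders, not my 
exact configuration; only the max_doc_size line matters):

```
# htdig.conf excerpt -- illustrative values, not my full config
database_dir:   /var/db/htdig
start_url:      http://www.citynews.com/adlist/
max_doc_size:   4000000
```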

I'm puzzled!  Below is the output when I run it with -vv.  The only 
difference I can see between this and a successful run is the line "+ size 
= 6614".
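
The invocation was roughly the following (the config file path is from 
memory and may differ on other systems):

```shell
# -vv = extra verbose, -i = initial dig, -c = config file to use
htdig -vv -i -c /usr/local/etc/htdig/htdig.conf
```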


ht://dig Start Time: Wed Oct 17 12:41:53 2001

New server: www.citynews.com, 80
  - Persistent connections: enabled
  - HEAD before GET: disabled
  - Timeout: 30
  - Connection space: 0
  - Max Documents: -1
  - TCP retries: 1
  - TCP wait time: 5
Trying to retrieve robots.txt file
Parsing robots.txt file using myname = htdig
Found 'user-agent' line: *
Found 'disallow' line: /cgi-bin/
Pattern: /cgi-bin/
pick: www.citynews.com, # servers = 1
0:2:0:http://www.citynews.com/adlist/:
title: CityNews Free Photo Classifieds and Chat for US and World Cities
META Description: free classifieds,photo classified ads,community 
calendar,and chat rooms for north america and world cities

url rejected: (level 1)http://www.citynews.com/css/citynews.css

url rejected: (level 1)http://www.citynews.com/advertising.html

    Rejected: item in exclude list
url rejected: (level 1)http://www.burstnet.com/ads/ad1847a-map.cgi

url rejected: (level 1)http://www.citynews.com/banners.html

    pushing http://kc.citynews.com/5596.html

New server: kc.citynews.com, 80
  - Persistent connections: enabled
  - HEAD before GET: disabled
  - Timeout: 30
  - Connection space: 0
  - Max Documents: -1
  - TCP retries: 1
  - TCP wait time: 5
Trying to retrieve robots.txt file
Parsing robots.txt file using myname = htdig
Found 'user-agent' line: *
Found 'disallow' line: /cgi-bin/
Pattern: /cgi-bin/
+
    pushing http://london.citynews.com/17220.html

New server: london.citynews.com, 80
  - Persistent connections: enabled
  - HEAD before GET: disabled
  - Timeout: 30
  - Connection space: 0
  - Max Documents: -1
  - TCP retries: 1
  - TCP wait time: 5
Trying to retrieve robots.txt file
Parsing robots.txt file using myname = htdig
Found 'user-agent' line: *
Found 'disallow' line: /cgi-bin/
Pattern: /cgi-bin/
+ size = 6614
pick: london.citynews.com, # servers = 3
1:4:1:http://london.citynews.com/17220.html:  size = 3847
pick: kc.citynews.com, # servers = 3
2:3:1:http://kc.citynews.com/5596.html:
title: Journal of Geocryology

    Rejected: item in exclude list
url rejected: (level 1)http://www.burstnet.com/ads/ad1847a-map.cgi

url rejected: (level 1)http://www.citynews.com/banners.html

url rejected: (level 1)http://www.citynews.com/about.html

url rejected: (level 1)http://www.recommend-it.com/p.e?677339

    Rejected: item in exclude list
url rejected: (level 1)http://kc.citynews.com/cgi-bin/pmail.cgi/5596/kc

url rejected: (level 1)http://kc.citynews.com/ads4.html

url rejected: (level 1)http://kc.citynews.com/
  size = 3845
pick: www.citynews.com, # servers = 3
pick: london.citynews.com, # servers = 3
pick: kc.citynews.com, # servers = 3
pick: www.citynews.com, # servers = 3
htdig: Run complete
htdig: 3 servers seen:
htdig:     kc.citynews.com:80 1 document
htdig:     london.citynews.com:80 1 document
htdig:     www.citynews.com:80 1 document

HTTP statistics
===============
  Persistent connections    : Yes
  HEAD call before GET      : No
  Connections opened        : 6
  Connections closed        : 5
  Changes of server         : 2
  HTTP Requests             : 6
  HTTP KBytes requested     : 2.28223
  HTTP Average request time : 0.166667 secs


_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
