I am trying to index a website using htdig and I am having a hard
time understanding why some of my links are being followed and others
aren't.  

The site that I am trying to index is http://www.law.upenn.edu/

Links on the front page are followed properly.  One of those links leads
to http://www.law.upenn.edu/departments/, which htdig "pushes" and then
requests.  htdig then fails to follow the links in that second document,
but I can't figure out why -- it doesn't seem to be rejecting them, just
silently ignoring them.
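
For comparison with what htdig reports, this is roughly how the links in
that page could be listed by hand.  It is only a minimal Python sketch: the
sample markup is made up for illustration (it is not copied from the real
page), and extract_links is a hypothetical helper, not part of htdig.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href>, honoring <base href> if present."""

    def __init__(self, page_url):
        super().__init__()
        self.base = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "base" and href:
            # A <base> tag changes how later relative links resolve.
            self.base = href
        elif tag == "a" and href:
            self.links.append(urljoin(self.base, href))


def extract_links(page_url, html):
    """Return the absolute URLs of all anchors found in html."""
    parser = LinkCollector(page_url)
    parser.feed(html)
    return parser.links


# Hypothetical markup, for illustration only.
SAMPLE = '<a href="registrar/">Registrar</a> <a href="/alumni/index.html">Alumni</a>'

with_slash = extract_links("http://www.law.upenn.edu/departments/", SAMPLE)
without_slash = extract_links("http://www.law.upenn.edu/departments", SAMPLE)
```

Note that a relative link resolves differently depending on whether the
page URL ends in a slash, so with_slash and without_slash differ for the
first link.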

I ran htdig with -vvv for more verbose output and have
posted two segments of the resulting log here:

http://faculty.law.upenn.edu/~mwsnyder/log1.txt
http://faculty.law.upenn.edu/~mwsnyder/log2.txt

I am running htdig-3.1.6.

These are the possibly relevant config options:

database_dir:           /usr/local/htdig/db
start_url:              http://www.law.upenn.edu/
limit_urls_to:          ${start_url}
exclude_urls:           /cgi-bin/ .cgi /bll/ulc/
bad_extensions:         .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
        .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css .pdf
max_head_length:        10000
max_doc_size:           200000
no_excerpt_show_top:    true
search_algorithm:       exact:1 synonyms:0.5 endings:0.1
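
To rule out the URL filters, a link can be checked against the patterns
above by hand.  The following is a rough Python sketch of my understanding
of the 3.1.x behaviour, not htdig's actual code: it assumes limit_urls_to
and exclude_urls are plain substring matches and bad_extensions is a
case-sensitive suffix test.  check_url is a hypothetical helper.

```python
# Values copied from the config options listed above.
START_URL = "http://www.law.upenn.edu/"
LIMIT_URLS_TO = [START_URL]
EXCLUDE_URLS = ["/cgi-bin/", ".cgi", "/bll/ulc/"]
BAD_EXTENSIONS = (
    ".wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif "
    ".jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css .pdf"
).split()


def check_url(url):
    """Return a rejection reason, or None if the URL passes every filter.

    Assumption: substring matching for limit_urls_to and exclude_urls,
    suffix matching for bad_extensions.
    """
    if not any(pattern in url for pattern in LIMIT_URLS_TO):
        return "not matched by limit_urls_to"
    for pattern in EXCLUDE_URLS:
        if pattern in url:
            return "matches exclude_urls pattern " + pattern
    for ext in BAD_EXTENSIONS:
        if url.endswith(ext):
            return "has bad extension " + ext
    return None


if __name__ == "__main__":
    for candidate in (
        "http://www.law.upenn.edu/departments/",
        "http://www.law.upenn.edu/bll/ulc/index.htm",
        "http://faculty.law.upenn.edu/~mwsnyder/log1.txt",
    ):
        print(candidate, "->", check_url(candidate) or "accepted")
```

If a link from the departments page passes this check and is still not
followed, the cause is presumably somewhere other than these three options.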


Can anyone tell me how to convince htdig to follow the links within
http://www.law.upenn.edu/departments ?  Thanks.

-- 
Matthew Snyder
University of Pennsylvania Law School

