Re: [htdig] 3.2.0b3 on BSDI, light at the end of tunnel;)

Joe R. Jah Mon, 19 Feb 2001 16:15:02 -0800
On Mon, 19 Feb 2001, Geoff Hutchison wrote:

> Date: Mon, 19 Feb 2001 18:25:23 -0500 (EST)
> From: Geoff Hutchison <[EMAIL PROTECTED]>
> To: Gilles Detillieux <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED],
>     "ht://Dig mailing list" <[EMAIL PROTECTED]>
> Subject: Re: [htdig] 3.2.0b3 on BSDI, light at the end of tunnel;)
> 
> On Mon, 19 Feb 2001, Gilles Detillieux wrote:
> 
> > But what did lines 62 and 63 look like before?  It's perfectly valid to
> > start a line with "#", as long as the previous line wasn't an incomplete
> > definition ending with a "\" at the end of the line.  This is what we
> > need to know.  Was the code choking on valid syntax or not???
> 
> Or more to the point, can we get a copy of the config file that was
> causing problems? It's one thing if you can index and quite another if we
> have a bug that you exposed that needs to get fixed.

I just checked the original htdig.conf from the source tree.  It does not
have the problem.  When I first tried a 3.2.0bx I copied the conf file
from the source tree to the conf folder; I then appended my 3.1.5 conf
file to it.  Then I commented out duplicates, without meticulously placing
"#"s at the start of every line;(
Anyway, my bad;((  I have attached the corrected conf file just in case.
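
For anyone who wants to check a conf file for the same slip, here is a
minimal Python sketch. It is not part of ht://Dig, and the function name
find_mixed_continuations is made up for illustration. It assumes a line is
a comment when its first non-blank character is "#", and that a trailing
"\" continues a definition onto the next line. It only flags places where
a continued line and the line after it disagree about being commented out;
it does not claim to reproduce how the real parser reads the file.

```python
def find_mixed_continuations(text):
    """Return (line_number, line) pairs for lines that follow a
    backslash-continued line but differ from it in comment status."""
    problems = []
    lines = text.splitlines()
    for i in range(len(lines) - 1):
        current, following = lines[i], lines[i + 1]
        # Lenient on purpose: trailing blanks after the "\" are ignored here,
        # although a real parser may treat them differently (assumption).
        if not current.rstrip().endswith("\\"):
            continue
        current_is_comment = current.lstrip().startswith("#")
        following_is_comment = following.lstrip().startswith("#")
        if current_is_comment != following_is_comment:
            # i is 0-based and points at the continued line, so the
            # offending follow-up line is number i + 2 when counting from 1.
            problems.append((i + 2, following))
    return problems


sample = "\n".join([
    "# bad_extensions: .wav .gz \\",
    "                 .jpg .jpeg",
    "start_url: http://www.htdig.org/",
])

for number, line in find_mixed_continuations(sample):
    print(f"line {number}: comment marker differs from previous line: {line!r}")
```

Run against the sample above, it reports the second line, which is the
half-commented case described in this thread.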

Regards,

Joe
-- 
     _/   _/_/_/       _/              ____________    __o
     _/   _/   _/      _/         ______________     _-\<,_
 _/  _/   _/_/_/   _/  _/                     ......(_)/ (_)
  _/_/ oe _/   _/.  _/_/ ah        [EMAIL PROTECTED]


#
# Example config file for ht://Dig.
#
# This configuration file is used by all the programs that make up ht://Dig.
# Please refer to the attribute reference manual for more details on what
# can be put into this file.  (http://www.htdig.org/confindex.html)
# Note that most attributes have very reasonable default values so you
# really only have to add attributes here if you want to change the defaults.
#
# What follows are some of the common attributes you might want to change.
#

#
# Specify where the database files need to go.  Make sure that there is
# plenty of free disk space available for the databases.  They can get
# pretty big.
#
# database_dir:         @DATABASE_DIR@

#
# This specifies the URL where the robot (htdig) will start.  You can specify
# multiple URLs here.  Just separate them by some whitespace.
# The example here will cause the ht://Dig homepage and related pages to be
# indexed.
# You could also index all the URLs in a file like so:
# start_url:           `${common_dir}/start.url`
#
# start_url:            http://www.htdig.org/

#
# This attribute limits the scope of the indexing process.  The default is to
# set it to the same as the start_url above.  This way only pages that are on
# the sites specified in the start_url attribute will be indexed and it will
# reject any URLs that go outside of those sites.
#
# Keep in mind that the value for this attribute is just a list of string
# patterns. As long as URLs contain at least one of the patterns it will be
# seen as part of the scope of the index.
#
# limit_urls_to:                ${start_url}

#
# If there are particular pages that you definitely do NOT want to index, you
# can use the exclude_urls attribute.  The value is a list of string patterns.
# If a URL matches any of the patterns, it will NOT be indexed.  This is
# useful to exclude things like virtual web trees or database accesses.  By
# default, all CGI URLs will be excluded.  (Note that the /cgi-bin/ convention
# may not work on your web server.  Check the  path prefix used on your web
# server.)
#
# exclude_urls:         /cgi-bin/ .cgi

#
# Since ht://Dig does not (and cannot) parse every document type, this 
# attribute is a list of strings (extensions) that will be ignored during 
# indexing. These are *only* checked at the end of a URL, whereas 
# exclude_url patterns are matched anywhere.
#
# Also keep in mind that while other attributes allow regex, these must be 
# actual strings.
#
# bad_extensions:               .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
#                 .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi

#
# The string htdig will send in every request to identify the robot.  Change
# this to your email address.
#
# maintainer:           [EMAIL PROTECTED]

#
# The excerpts that are displayed in long results rely on stored information
# in the index databases.  The compiled default only stores 512 characters of
# text from each document (this excludes any HTML markup...)  If you plan on
# using the excerpts you probably want to make this larger.  The only concern
# here is that more disk space is going to be needed to store the additional
# information.  Since disk space is cheap (! :-)) you might want to set this
# to a value so that a large percentage of the documents that you are going
# to be indexing are stored completely in the database.  At SDSU we found
# that by setting this value to about 50k the index would get 97% of all
# documents completely and only 3% was cut off at 50k.  You probably want to
# experiment with this value.
# Note that if you want to set this value low, you probably want to set the
# excerpt_show_top attribute to true so that the top excerpt_length characters
# of the document are always shown.
#
# max_head_length:      10000

#
# To limit network connections, ht://Dig will only pull up to a certain limit
# of bytes. This prevents the indexing from dying because the server keeps
# sending information. However, several FAQs happen because people have files
# bigger than the default limit of 100KB. This sets the default a bit higher.
# (see <http://www.htdig.org/FAQ.html> for more)
#
# max_doc_size:         200000

#
# Most people expect some sort of excerpt in results. By default, if the 
# search words aren't found in context in the stored excerpt, htsearch shows 
# the text defined in the no_excerpt_text attribute:
# (None of the search words were found in the top of this document.)
# This attribute instead will show the top of the excerpt.
#
# no_excerpt_show_top:  true

#
# Depending on your needs, you might want to enable some of the fuzzy search
# algorithms.  There are several to choose from and you can use them in any
# combination you feel comfortable with.  Each algorithm will get a weight
# assigned to it so that in combinations of algorithms, certain algorithms get
# preference over others.  Note that the weights only affect the ranking of
# the results, not the actual searching.
# The available algorithms are:
#       accents
#       exact
#       endings
#       metaphone
#       prefix
#       regex
#       soundex
#       speling [sic]
#       substring
#       synonyms
# By default only the "exact" algorithm is used with weight 1.
# Note that if you are going to use the endings, metaphone, soundex, accents,
# or synonyms algorithms, you will need to run htfuzzy to generate
# the databases they use.
#
# search_algorithm:     exact:1 synonyms:0.5 endings:0.1

#
# The following are the templates used in the builtin search results
# The default is to use compiled versions of these files, which produces
# slightly faster results. However, uncommenting these lines makes it
# very easy to change the format of search results.
# See <http://www.htdig.org/hts_templates.html> for more details.
#
# template_map: Long long ${common_dir}/long.html \
#               Short short ${common_dir}/short.html
# template_name: long

#
# The following are used to change the text for the page index.
# The defaults are just boring text numbers.  These images spice
# up the result pages quite a bit.  (Feel free to do whatever, though)
#
# next_page_text:               <img src="/Search/Images/buttonr.gif" border="0" align="middle" width="30" height="30" alt="next">
# no_next_page_text:
# prev_page_text:               <img src="/Search/Images/buttonl.gif" border="0" align="middle" width="30" height="30" alt="prev">
# no_prev_page_text:
# page_number_text:     '<img src="/Search/Images/button1.gif" border="0" align="middle" width="30" height="30" alt="1">' \
#                        '<img src="/Search/Images/button2.gif" border="0" align="middle" width="30" height="30" alt="2">' \
#                        '<img src="/Search/Images/button3.gif" border="0" align="middle" width="30" height="30" alt="3">' \
#                        '<img src="/Search/Images/button4.gif" border="0" align="middle" width="30" height="30" alt="4">' \
#                        '<img src="/Search/Images/button5.gif" border="0" align="middle" width="30" height="30" alt="5">' \
#                        '<img src="/Search/Images/button6.gif" border="0" align="middle" width="30" height="30" alt="6">' \
#                        '<img src="/Search/Images/button7.gif" border="0" align="middle" width="30" height="30" alt="7">' \
#                        '<img src="/Search/Images/button8.gif" border="0" align="middle" width="30" height="30" alt="8">' \
#                        '<img src="/Search/Images/button9.gif" border="0" align="middle" width="30" height="30" alt="9">' \
#                        '<img src="/Search/Images/button10.gif" border="0" align="middle" width="30" height="30" alt="10">'
#
# To make the current page stand out, we will put a border around the
# image for that page.
#
# no_page_number_text:  '<img src="/Search/Images/button1.gif" border="2" align="middle" width="30" height="30" alt="1">' \
#                        '<img src="/Search/Images/button2.gif" border="2" align="middle" width="30" height="30" alt="2">' \
#                        '<img src="/Search/Images/button3.gif" border="2" align="middle" width="30" height="30" alt="3">' \
#                        '<img src="/Search/Images/button4.gif" border="2" align="middle" width="30" height="30" alt="4">' \
#                        '<img src="/Search/Images/button5.gif" border="2" align="middle" width="30" height="30" alt="5">' \
#                        '<img src="/Search/Images/button6.gif" border="2" align="middle" width="30" height="30" alt="6">' \
#                        '<img src="/Search/Images/button7.gif" border="2" align="middle" width="30" height="30" alt="7">' \
#                        '<img src="/Search/Images/button8.gif" border="2" align="middle" width="30" height="30" alt="8">' \
#                        '<img src="/Search/Images/button9.gif" border="2" align="middle" width="30" height="30" alt="9">' \
#                        '<img src="/Search/Images/button10.gif" border="2" align="middle" width="30" height="30" alt="10">'

# local variables:
# mode: text
# eval: (if (eq window-system 'x) (progn (setq font-lock-keywords (list '("^#.*" . font-lock-keyword-face) '("^[a-zA-Z][^ :]+" . font-lock-function-name-face) '("[+$]*:" . font-lock-comment-face) )) (font-lock-mode)))
# end:

#
# Example config file for ht://Dig.
# Last modified 2-Sep-1996 by Andrew Scherpbier
#
# This configuration file is used by all the programs that make up ht://Dig.
# Please refer to the attribute reference manual for more details on what
# can be put into this file.  (http://htdig.sdsu.edu/configfile.html)
# Note that most attributes have very reasonable default values so you
# really only have to add attributes here if you want to change the defaults.
#
# What follows are some of the common attributes you might want to change.
#

#
# Specify where the database files need to go.  Make sure that there is
# plenty of free disk space available for the databases.  They can get
# pretty big.
#
database_dir:           /Search/db

#
# This specifies the URL where the robot (htdig) will start.  You can specify
# multiple URLs here.  Just separate them by some whitespace.
# The example here will cause the ht://Dig homepage and related pages to be
# indexed.
#
start_url:              http://www.ccsf.cc.ca.us/

#
# This attribute limits the scope of the indexing process.  The default is to
# set it to the same as the start_url above.  This way only pages that are on
# the sites specified in the start_url attribute will be indexed and it will
# reject any URLs that go outside of those sites.
#
# Keep in mind that the value for this attribute is just a list of string
# patterns. As long as URLs contain at least one of the patterns it will be
# seen as part of the scope of the index.
#
limit_urls_to:          ${start_url}

#
# Access certain URLs on the local filesystem.
# For example, local_urls: http://www.foo.com/=/usr/www/htdocs/
#
local_urls:             http://www.ccsf.cc.ca.us/Associated_Students/=/Organizations/Associated_Students/\
                        http://www.ccsf.cc.ca.us/Campuses/=/dptweb/Campuses/\
                        http://www.ccsf.cc.ca.us/Catalog/=/dptweb/Catalog/\
                        http://www.ccsf.cc.ca.us/Channel_52/=/Departments/Channel_52/\
                        http://www.ccsf.cc.ca.us/Continuing_Education/=/Services/Continuing_Education/

#
# If there are particular pages that you definitely do NOT want to index, you
# can use the exclude_urls attribute.  The value is a list of string patterns.
# If a URL matches any of the patterns, it will NOT be indexed.  This is
# useful to exclude things like virtual web trees or database accesses.  By
# default, all CGI URLs will be excluded.  (Note that the /cgi-bin/ convention
# may not work on your web server.  Check the  path prefix used on your web
# server.)
#
exclude_urls:   /cgi-bin/ /title3-cgi/ /Guardsman/ .shtml/ ?


#
# Max keywords,
# max_keywords:                 "-1"
#
max_keywords:           11


#
# This is a weight of "how important" a page is, based on the number of URLs pointing to it. It's actually multiplied
# by the ratio of the incoming URLs (backlinks) and outgoing URLs, to balance out pages with lots of links to pages
# that link back to them. This factor can be changed without changing the database in any way. The default may be
# a bit high.
#
backlink_factor:        0

#
# Since ht://Dig does not (and cannot) parse every document type, this
# attribute is a list of strings (extensions) that will be ignored during
# indexing. These are *only* checked at the end of a URL, whereas
# exclude_url patterns are matched anywhere.
#
bad_extensions:         .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
                        .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi

#
# This factor, like backlink_factor, can be changed without modifying the database. It gives higher rankings to newer
# documents and lower rankings to older documents. Before setting this factor, it's advised to make sure your
# servers are returning accurate dates (check the dates returned in the long format).
#
date_factor:            0

# Plain old "descriptions" are the text of a link pointing to a document. This factor gives weight to the words of
# these descriptions of the document. Not surprisingly, these can be pretty accurate summaries of a document's
# content. See also title_factor or text_factor. Changing this factor will require updating your database.
#description_factor:              "150"
#
description_factor:     1

# This is a factor which will be used to multiply the weight of words between
# <h1> and </h1> tags. It is used to assign the level of importance to certain
# headers. Setting a factor to 0 will cause words in this heading to be ignored.
# The number may be a floating point number. See also the title_factor and
# text_factor attributes.
heading_factor_1:       5
heading_factor_2:       4
heading_factor_3:       3
heading_factor_4:       0
heading_factor_5:       0
heading_factor_6:       0

# This is a factor which will be used to multiply the weight of words in the list of
# keywords of a document. The number may be a floating point number. See
# also the title_factor and text_factor attributes.
keywords_factor:        100

#
# This is a factor which will be used to multiply the weight of words in any META description tags in a document.
# The number may be a floating point number. See also the title_factor and text_factor attributes.
#meta_description_factor:               "50"
#
meta_description_factor:        20

# This is a factor which will be used to multiply the weight of words that are not
# in any special part of a document. Setting a factor to 0 will cause normal
# words to be ignored. The number may be a floating point number. See also
# the heading_factor_[1-6], title_factor, and keywords_factor attributes.
text_factor:            1

# This is a factor which will be used to multiply the weight of words in the title
# of a document. Setting a factor to 0 will cause words in the title to be
# ignored. The number may be a floating point number. See also the
# heading_factor_[1-6] attribute.
title_factor:           100

#
# Depending on your needs, you might want to enable some of the fuzzy search
# algorithms.  There are several to choose from and you can use them in any
# combination you feel comfortable with.  Each algorithm will get a weight
# assigned to it so that in combinations of algorithms, certain algorithms get
# preference over others.  Note that the weights only affect the ranking of
# the results, not the actual searching.
# The available algorithms are:
#       exact
#       endings
#       synonyms
#       soundex
#       metaphone
# By default only the "exact" algorithm is used with weight 1.
# Note that if you are going to use any of the algorithms other than "exact",
# you need to use the htfuzzy program to generate the databases that each
# algorithm requires.
#
#search_algorithm:              "exact:1"
#search_algorithm:       exact:1 synonyms:0.5 endings:0.1
search_algorithm:       exact:1 synonyms:.2 prefix:0.005 #endings:0.1

#
# The following are the templates used in the builtin search results
# The default is to use compiled versions of these files, which produces
# slightly faster results. However, uncommenting these lines makes it
# very easy to change the format of search results.
# See <http://www.htdig.org/hts_templates.html> for more details.
#
#template_map:                  "Long builtin-long builtin-long Short builtin-short builtin-short"
#template_name:                 "builtin-long"
# template_map: Long long ${common_dir}/long.html \
#               Short short ${common_dir}/short.html
# template_name: long

#
# The following are used to change the text for the page index.
# The defaults are just boring text numbers.  These images spice
# up the result pages quite a bit.  (Feel free to do whatever, though)
#
next_page_text:         <img src=/Pub/Search/Graphics/buttonr.gif border=0 align=middle width=30 height=30 alt=next>
no_next_page_text:
prev_page_text:         <img src=/Pub/Search/Graphics/buttonl.gif border=0 align=middle width=30 height=30 alt=prev>
no_prev_page_text:
page_number_text:       "<img src=/Pub/Search/Graphics/button1.gif border=0 align=middle width=30 height=30 alt=1>" \
                        "<img src=/Pub/Search/Graphics/button2.gif border=0 align=middle width=30 height=30 alt=2>" \
                        "<img src=/Pub/Search/Graphics/button3.gif border=0 align=middle width=30 height=30 alt=3>" \
                        "<img src=/Pub/Search/Graphics/button4.gif border=0 align=middle width=30 height=30 alt=4>" \
                        "<img src=/Pub/Search/Graphics/button5.gif border=0 align=middle width=30 height=30 alt=5>" \
                        "<img src=/Pub/Search/Graphics/button6.gif border=0 align=middle width=30 height=30 alt=6>" \
                        "<img src=/Pub/Search/Graphics/button7.gif border=0 align=middle width=30 height=30 alt=7>" \
                        "<img src=/Pub/Search/Graphics/button8.gif border=0 align=middle width=30 height=30 alt=8>" \
                        "<img src=/Pub/Search/Graphics/button9.gif border=0 align=middle width=30 height=30 alt=9>" \
                        "<img src=/Pub/Search/Graphics/button10.gif border=0 align=middle width=30 height=30 alt=10>"
#
# To make the current page stand out, we will put a border around the
# image for that page.
#
no_page_number_text:    "<img src=/Pub/Search/Graphics/button1.gif border=2 align=middle width=30 height=30 alt=1>" \
                        "<img src=/Pub/Search/Graphics/button2.gif border=2 align=middle width=30 height=30 alt=2>" \
                        "<img src=/Pub/Search/Graphics/button3.gif border=2 align=middle width=30 height=30 alt=3>" \
                        "<img src=/Pub/Search/Graphics/button4.gif border=2 align=middle width=30 height=30 alt=4>" \
                        "<img src=/Pub/Search/Graphics/button5.gif border=2 align=middle width=30 height=30 alt=5>" \
                        "<img src=/Pub/Search/Graphics/button6.gif border=2 align=middle width=30 height=30 alt=6>" \
                        "<img src=/Pub/Search/Graphics/button7.gif border=2 align=middle width=30 height=30 alt=7>" \
                        "<img src=/Pub/Search/Graphics/button8.gif border=2 align=middle width=30 height=30 alt=8>" \
                        "<img src=/Pub/Search/Graphics/button9.gif border=2 align=middle width=30 height=30 alt=9>" \
                        "<img src=/Pub/Search/Graphics/button10.gif border=2 align=middle width=30 height=30 alt=10>"
#
#       If set to true, numbers are considered words. This means that searches
#       can be done on number as well as regular words. All the same rules
#       apply to numbers as to words. See the description of valid_punctuation
#       for the rules used to determine what a word is.
#
allow_numbers:          true

#
#
#
#allow_virtual_hosts:   false

# This attribute is used to specify a list of content-type/parsers that are to be used to parse documents that cannot
# be parsed by any of the internal parsers. The list of external parsers is examined before the builtin parsers are
# checked, so this can be used to override the internal behavior without recompiling htdig.
# The external parsers are specified as pairs of strings. The first string of each pair is the content-type that the
# parser can handle while the second string of each pair is the path to the external parsing program. The parsing
# program will get the document to be parsed on its standard input and it is to write information for htdig on its
# standard output.
# example:
#     external_parsers: text/html /usr/local/bin/htmlparser application/ms-word /usr/local/bin/mswordparse
external_parsers:       application/msword->text/html /usr/local/bin/conv_doc.pl \
                        application/postscript->text/html /usr/local/bin/conv_doc.pl \
                        application/pdf->text/html /usr/local/bin/conv_doc.pl

#externalI_parsers:     application/msword /usr/local/bin/parse_doc.pl \
#                       application/postscript /usr/local/bin/parse_doc.pl \
#                       application/pdf /usr/local/bin/parse_doc.pl

#
# This specifies the email address that htnotify email messages get sent out from. The address is forged using
# /usr/lib/sendmail. Check htnotify/htnotify.cc for detail on how this is done.
htnotify_sender:        [EMAIL PROTECTED]

#
# This sets whether htsearch should use the syslog() to log search requests. If set, this will log requests with a
# default level of LOG_INFO and a facility of LOG_LOCAL5. For details on redirecting the log into a separate file or
# other actions, see the syslog.conf(5) man page. To set the level and facility used in logging, change LOG_LEVEL
# and LOG_FACILITY in the include/htconfig.h file before compiling.
# Log file path         /Search/conf/log
#
logging:                true

#
# The words in this list are used to search for keywords in HTML META tags. This list can contain any number of
# strings that each will be seen as the name for whatever keyword convention is used.
# The META tags have the following format: 
# <META name="somename" value="somevalue">
keywords_meta_tag_names:        keywords htdig-keywords

#
# The string htdig will send in every request to identify the robot.  Change
# this to your email address.
#
maintainer:             [EMAIL PROTECTED]

#
#       If this is set to a relatively small number, the matches will be shown in
#       pages instead of all at once.
#
matches_per_page:       10

#
# While gathering descriptions of URLs, htdig will only record those descriptions which are shorter than this
# length. This is used mostly to deal with broken HTML. (If a hyperlink is not terminated with a </a> the description
# will go on until the end of the document.)
#
max_description_length: 60

#
# The excerpts that are displayed in long results rely on stored information
# in the index databases.  The compiled default only stores 512 characters of
# text from each document (this excludes any HTML markup...)  If you plan on
# using the excerpts you probably want to make this larger.  The only concern
# here is that more disk space is going to be needed to store the additional
# information.  Since disk space is cheap (! :-)) you might want to set this
# to a value so that a large percentage of the documents that you are going
# to be indexing are stored completely in the database.  At SDSU we found
# that by setting this value to about 50k the index would get 97% of all
# documents completely and only 3% was cut off at 50k.  You probably want to
# experiment with this value.
# Note that if you want to set this value low, you probably want to set the
# excerpt_show_top attribute to true so that the top excerpt_length characters
# of the document are always shown.
#max_head_length:                       "512"
#
max_head_length:        500000

# Instead of limiting the indexing process by URL pattern, it can also be limited
# by the number of hops or clicks a document is removed from the starting
# URL. Unfortunately, this only works reliably when a complete index is
# created, not an update.
# The starting page will have hop count 0. 
max_hop_count:          999999

# The maximum number of extensions (if you set it to 2 it will only fetch abc, abca, abcb)
#
#max_prefix_matches:     1000

# If it's left blank, it will always try to expand the search words
#prefix_match_character:                "*"

#
# When stars are used to display the score of a match, this value determines the maximum number of stars that can
# be displayed.
max_stars:              5

#
#       This sets the minimum length of words that will be indexed. Words
#       shorter than this value will be silently ignored but still put into the excerpt.
#       Note that by making this value less than 3, a lot more words that are
#       very frequent will be indexed. It might be advisable to add some of these
#       to the bad_words list.
#
minimum_word_length:    3

#
# If no excerpt is available, this option will act the same as excerpt_show_top, that is, it will show the top of the
# document.
#
no_excerpt_show_top:    true
excerpt_show_top:       no

# The following line is the default used by PDF.cc if there is no pdf_converter
# in the config file
# pdf_converter: acroread -toPostScript -pairs %src %dest
# Using acroread that is not in the PATH
# pdf_converter: /usr/local/bin/acroread -toPostScript -pairs %src %dest
# Using pdftops that comes in the xpdf package
# pdf_converter: /usr/local/bin/pdftops %src %dest
pdf_parser:             /usr/contrib/bin/pdftops

max_doc_size:           1650000

# If TRUE, htmerge will remove any URLs which were marked as unreachable
# by htdig from the database. If FALSE, it will not do this. When htdig is run in
# initial mode, documents which were referred to but could not be accessed
# should probably be removed, and hence this option should then be set to
# TRUE, however, if htdig is run to update the database, this may cause
# documents on a server which is temporarily unavailable to be removed. This
# is probably NOT what was intended, so hence this option should be set to
# FALSE in that case.
remove_bad_urls:        true

remove_default_doc:     index.shtml index.html index.htm homepage.html homepage.htm home.html home.htm

#
# This directive tells the indexer that servers have several DNS aliases, which all point to the same machine and are
# NOT virtual hosts. This allows you to ensure pages are indexed only once on a given machine, despite the alias
# used in a URL.
server_aliases:         cloud.ccsf.cc.ca.us:80=www.ccsf.cc.ca.us:80=cloud.ccsf.org:80=www.ccsf.org:80

#
# If set to true, any META description tags will be used as excerpts by htsearch. Any documents that do not have
# META descriptions will retain their normal excerpts.
use_meta_description:   true

#compression_level:             0
#excerpt_length:                        300

#These characters are considered part of a word. In contrast to the characters in the valid_punctuation attribute, they are treated
#just like letter characters.  Note that the locale attribute is normally used to configure which characters constitute letter
#characters. example: extra_word_characters: _
extra_word_characters:          _

local_default_doc:              index.shtml index.html index.htm homepage.html homepage.htm home.html home.htm

#local_urls_only:               false
#Set this to access user directory URLs through the local filesystem. If you leave the "path" portion out, it will look up the
#user's home directory in /etc/passwd (or NIS or whatever). As with local_urls, if the files are not found, ht://Dig will try with
#HTTP. Again, note the example's format. To map http://www.my.org/~joe/foo/bar.html to /home/joe/www/foo/bar.html, try the example
#below. The fallback to HTTP can be disabled by setting the local_urls_only attribute to true. As of 3.1.5, you can provide multiple
#mappings of a given URL to different directories, and htdig will use the first mapping that works. Special characters can
#be embedded in these names using %xx hex encoding. For example, you can use %3D to embed an "=" sign in an URL pattern.
#example: local_user_urls: http://www.my.org/=/home/,/www/
local_user_urls:                http://www.ccsf.cc.ca.us/=/www/,/dptweb/

#max_descriptions:              5
#max_meta_description_length:   512
#minimum_prefix_length:         1
#valid_punctuation:             ".-_/!#$%^&'"



_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
Information: http://lists.sourceforge.net/lists/listinfo/htdig-general
FAQ: http://htdig.sourceforge.net/FAQ.html