Please keep the list address on replies: others may find the discussion useful, and I'm far from the only person who can answer questions.
I don't think you want to run the standard rundig script with "-a" the first time, because the resulting databases will all have the .work extension added to them and htsearch will not find them by default. But that's not your question.

There is a "url_rewrite_rules" attribute, introduced in 3.1.6, that allows regex rewriting of URLs on the fly while indexing. It is used for things like stripping off dynamic session IDs.

If you are not using the url_rewrite_rules attribute and you're seeing "applying regex" messages, we'd like to see the output from running "htdig -vvv" as well as "htdig -?" (which should give the htdig version). Are you sure you only have version 3.1.6?

Also, how did you install ht://Dig? Since you use Solaris, I'm assuming you compiled it yourself. Which compiler did you use?

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/

On Thu, 5 Sep 2002, Roger M Clarke wrote:

> Geoff
>
> We are using version 3.1.6 on Solaris 5.8 and this is a real problem.
>
> We are running the standard rundig script with alt=-a added, is it ok to add
> this option the first time rundig is run for an htdig installation?
>
> I am not intending to use URL rewrite rules, please explain what these are
> in htdig terms.
>
> thanks
>
> R
>
>
> The htdig.conf is as follows
>
> ====================================================================================
> #
> # Example config file for ht://Dig.
> #
> # This configuration file is used by all the programs that make up ht://Dig.
> # Please refer to the attribute reference manual for more details on what
> # can be put into this file.  (http://www.htdig.org/confindex.html)
> # Note that most attributes have very reasonable default values so you
> # really only have to add attributes here if you want to change the defaults.
> #
> # What follows are some of the common attributes you might want to change.
> #
> # apw 12/1/01
> image_url_prefix: /htdig/images
> pdf_parser: /soft/Acrobat4/bin/acroread -toPostScript -pairs
>
> #
> # Specify where the database files need to go.  Make sure that there is
> # plenty of free disk space available for the databases.  They can get
> # pretty big.
> #
> database_dir: /data/WWW2-htdig/db
>
> #
> # This specifies the URL where the robot (htdig) will start.  You can specify
> # multiple URLs here.  Just separate them by some whitespace.
> # The example here will cause the ht://Dig homepage and related pages to be
> # indexed.
> # You could also index all the URLs in a file like so:
> # start_url: `${common_dir}/start.url`
> #
> #start_url: http://www.htdig.org/
> start_url: http://www2.brookes.ac.uk/
>
> #
> # This attribute limits the scope of the indexing process.  The default is to
> # set it to the same as the start_url above.  This way only pages that are on
> # the sites specified in the start_url attribute will be indexed and it will
> # reject any URLs that go outside of those sites.
> #
> # Keep in mind that the value for this attribute is just a list of string
> # patterns.  As long as URLs contain at least one of the patterns it will be
> # seen as part of the scope of the index.
> #
> limit_urls_to: ${start_url}
>
> #
> # If there are particular pages that you definitely do NOT want to index, you
> # can use the exclude_urls attribute.  The value is a list of string patterns.
> # If a URL matches any of the patterns, it will NOT be indexed.  This is
> # useful to exclude things like virtual web trees or database accesses.  By
> # default, all CGI URLs will be excluded.  (Note that the /cgi-bin/ convention
> # may not work on your web server.  Check the path prefix used on your web
> # server.)
> #
> # Updated by Roger Clarke 21/05/02
> #
> exclude_urls: /cgi-bin/ .cgi /GsharpWE/ /bin/ /rjs/ /images/ /temp/ /hardship/ /hotmetal/ /errors/ /excite/ /htdig/ /noframes/ m0*.html test.html temp.html email.html footer.html header.html homeold.html oder.html Rules
>
>
> #
> # Since ht://Dig does not (and cannot) parse every document type, this
> # attribute is a list of strings (extensions) that will be ignored during
> # indexing.  These are *only* checked at the end of a URL, whereas
> # exclude_url patterns are matched anywhere.
> #
> bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
>                 .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi
>
> #
> # The string htdig will send in every request to identify the robot.  Change
> # this to your email address.
> #
> maintainer: [EMAIL PROTECTED]
>
> #
> # The excerpts that are displayed in long results rely on stored information
> # in the index databases.  The compiled default only stores 512 characters of
> # text from each document (this excludes any HTML markup...)  If you plan on
> # using the excerpts you probably want to make this larger.  The only concern
> # here is that more disk space is going to be needed to store the additional
> # information.  Since disk space is cheap (! :-)) you might want to set this
> # to a value so that a large percentage of the documents that you are going
> # to be indexing are stored completely in the database.  At SDSU we found
> # that by setting this value to about 50k the index would get 97% of all
> # documents completely and only 3% was cut off at 50k.  You probably want to
> # experiment with this value.
> # Note that if you want to set this value low, you probably want to set the
> # excerpt_show_top attribute to false so that the top excerpt_length characters
> # of the document are always shown.
> #
> max_head_length: 10000
>
> #
> # To limit network connections, ht://Dig will only pull up to a certain limit
> # of bytes.
> # This prevents the indexing from dying because the server keeps
> # sending information.  However, several FAQs happen because people have files
> # bigger than the default limit of 100KB.  This sets the default a bit higher.
> # (see <http://www.htdig.org/FAQ.html> for more)
> #
> # apw 12/1/01 changed to 5MB for pdf files
> #max_doc_size: 200000
> max_doc_size: 5000000
>
> #
> # Most people expect some sort of excerpt in results.  By default, if the
> # search words aren't found in context in the stored excerpt, htsearch shows
> # the text defined in the no_excerpt_text attribute:
> # (None of the search words were found in the top of this document.)
> # This attribute instead will show the top of the excerpt.
> #
> no_excerpt_show_top: true
>
> #
> # Depending on your needs, you might want to enable some of the fuzzy search
> # algorithms.  There are several to choose from and you can use them in any
> # combination you feel comfortable with.  Each algorithm will get a weight
> # assigned to it so that in combinations of algorithms, certain algorithms get
> # preference over others.  Note that the weights only affect the ranking of
> # the results, not the actual searching.
> # The available algorithms are:
> #     exact
> #     endings
> #     metaphone
> #     prefix
> #     soundex
> #     synonyms
> # By default only the "exact" algorithm is used with weight 1.
> # Note that if you are going to use the endings, metaphone, soundex,
> # or synonyms algorithms, you will need to run htfuzzy to generate
> # the databases they use.
> #
> search_algorithm: exact:1 synonyms:0.5 endings:0.1
>
> #
> # The following are the templates used in the builtin search results
> # The default is to use compiled versions of these files, which produces
> # slightly faster results.  However, uncommenting these lines makes it
> # very easy to change the format of search results.
> # See <http://www.htdig.org/hts_templates.html> for more details.
> #
> # template_map: Long long ${common_dir}/long.html \
> #               Short short ${common_dir}/short.html
> # template_name: long
>
> #
> # The following are used to change the text for the page index.
> # The defaults are just boring text numbers.  These images spice
> # up the result pages quite a bit.  (Feel free to do whatever, though)
> #
> next_page_text: <img src="/htdig/images/buttonr.gif" border="0" align="middle" width="30" height="30" alt="next">
> no_next_page_text:
> prev_page_text: <img src="/htdig/images/buttonl.gif" border="0" align="middle" width="30" height="30" alt="prev">
> no_prev_page_text:
> page_number_text: '<img src="/htdig/images/button1.gif" border="0" align="middle" width="30" height="30" alt="1">' \
>                   '<img src="/htdig/images/button2.gif" border="0" align="middle" width="30" height="30" alt="2">' \
>                   '<img src="/htdig/images/button3.gif" border="0" align="middle" width="30" height="30" alt="3">' \
>                   '<img src="/htdig/images/button4.gif" border="0" align="middle" width="30" height="30" alt="4">' \
>                   '<img src="/htdig/images/button5.gif" border="0" align="middle" width="30" height="30" alt="5">' \
>                   '<img src="/htdig/images/button6.gif" border="0" align="middle" width="30" height="30" alt="6">' \
>                   '<img src="/htdig/images/button7.gif" border="0" align="middle" width="30" height="30" alt="7">' \
>                   '<img src="/htdig/images/button8.gif" border="0" align="middle" width="30" height="30" alt="8">' \
>                   '<img src="/htdig/images/button9.gif" border="0" align="middle" width="30" height="30" alt="9">' \
>                   '<img src="/htdig/images/button10.gif" border="0" align="middle" width="30" height="30" alt="10">'
> #
> # To make the current page stand out, we will put a border around the
> # image for that page.
> #
> no_page_number_text: '<img src="/htdig/images/button1.gif" border="2" align="middle" width="30" height="30" alt="1">' \
>                      '<img src="/htdig/images/button2.gif" border="2" align="middle" width="30" height="30" alt="2">' \
>                      '<img src="/htdig/images/button3.gif" border="2" align="middle" width="30" height="30" alt="3">' \
>                      '<img src="/htdig/images/button4.gif" border="2" align="middle" width="30" height="30" alt="4">' \
>                      '<img src="/htdig/images/button5.gif" border="2" align="middle" width="30" height="30" alt="5">' \
>                      '<img src="/htdig/images/button6.gif" border="2" align="middle" width="30" height="30" alt="6">' \
>                      '<img src="/htdig/images/button7.gif" border="2" align="middle" width="30" height="30" alt="7">' \
>                      '<img src="/htdig/images/button8.gif" border="2" align="middle" width="30" height="30" alt="8">' \
>                      '<img src="/htdig/images/button9.gif" border="2" align="middle" width="30" height="30" alt="9">' \
>                      '<img src="/htdig/images/button10.gif" border="2" align="middle" width="30" height="30" alt="10">'
>
> # local variables:
> # mode: text
> # eval: (if (eq window-system 'x) (progn (setq font-lock-keywords (list '("^#.*" . font-lock-keyword-face) '("^[a-zA-Z][^ :]+" . font-lock-function-name-face) '("[+$]*:" . font-lock-comment-face) )) (font-lock-mode)))
> # end:
> ==================================================================================
>
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Geoff Hutchison
> Sent: 04 September 2002 17:33
> To: Roger M Clarke
> Cc: [EMAIL PROTECTED]
> Subject: Re: [htdig] rundig cannot get beyond "Applying regex '^.*[aeiou]y$' to aaa"
>
>
> > We have one version of htdig running against our internet site without
> > problem and very good it is certainly compared to Excite.
> >
> > We have now created another copy of htdig on our intranet (separate for
> > security purposes).
>
> OK, it would be *really* helpful to know which versions of ht://Dig
> you're using here.
>
> > The process it is repeating is "Applying regex '^.*[aeiou]y$' to aaa" it
> > looks like it cannot get past aaa.
>
> I'm not sure if this is a hypothetical example or a real one. Since you're
> apparently trying to use url_rewrite_rules, I'm guessing you're using
> 3.1.6 or a snapshot of 3.2.0b4. As I said, it'd really help to know which
> version you're using (and preferably which OS).
>
> It would also really help to know what your configuration looks like--in
> this case, the url_rewrite_rules if that's what you're using. And an
> actual error message--if you like, try changing hostnames to protect the
> innocent. (But if it's an actual bug, we'll need as much info as possible
> to reproduce the bug.)
>
> --
> -Geoff Hutchison
> Williams Students Online
> http://wso.williams.edu/

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
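
For anyone unfamiliar with the idea behind url_rewrite_rules mentioned in the reply, here is a minimal Python sketch of regex-based URL rewriting that strips a session ID from a URL before it would be indexed. It only illustrates the concept; it is not ht://Dig's implementation and does not show its configuration syntax. The "sessionid" parameter name, the rule list, and the rewrite_url helper are all made up for this example.

```python
import re

# Hypothetical rewrite rules: (pattern, replacement) pairs applied in order.
# The "sessionid" parameter name is an assumption made for this illustration.
REWRITE_RULES = [
    # Drop a sessionid parameter that is followed by other parameters,
    # keeping the "?" or "&" that introduced it.
    (re.compile(r"([?&])sessionid=[^&]*&"), r"\1"),
    # Drop a sessionid parameter that is the last (or only) parameter.
    (re.compile(r"[?&]sessionid=[^&]*$"), ""),
]


def rewrite_url(url):
    """Apply each rewrite rule in turn and return the rewritten URL."""
    for pattern, replacement in REWRITE_RULES:
        url = pattern.sub(replacement, url)
    return url


if __name__ == "__main__":
    for url in (
        "http://www.example.com/page.html?sessionid=abc123",
        "http://www.example.com/page.html?sessionid=abc123&id=7",
        "http://www.example.com/page.html?id=7&sessionid=abc123",
    ):
        print(url, "->", rewrite_url(url))
```

With rules like these, two URLs that differ only in their session ID rewrite to the same string, so an indexer applying them would see one document rather than many copies.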

