According to Robin Forster:
> I just tried configuring htsearch on my website and for some reason it
> will not spider my site. (http://www.rsforster.ottawa.on.ca/.
>
> If I include every page in the "start_url" line in the config file it
> will index all those pages but will not follow subdirectories.
>
> Also it never works if I specify "http://www.rsforster.ottaw.on.ca/" as
> a start URL. I think this is because my index is called "index.shtml"
> and includes SSI stuff. To get around this I created a file called
> top.html that has the same urls in it. But it still did not spider the
> local urls for me.
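Yes, when htdig reads files through the local filesystem it skips .shtml
files, since only the web server can process the SSI directives in them.
With local_urls_only set to true, it won't fall back to HTTP to fetch them.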
> I run the following commands after updating the config file:
>
> htdig -i -s
> htfuzzy soundex accents synonyms
> htmerge
htfuzzy synonyms only needs to be run once, like htfuzzy endings, because
it uses a static dictionary which doesn't change when you reindex. However,
htfuzzy soundex and htfuzzy accents both depend on the db.words.db database
built by htmerge, so you really should run these after htmerge, not before.
This is a side-issue, though.
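As a rough sketch of the order I mean (the function name "reindex" is just
made up for illustration, and this assumes the default config file, as in
your own commands):

```shell
reindex() {
    # Full reindex (-i) with statistics (-s), as in the original commands.
    htdig -i -s || return 1
    # htmerge builds the db.words.db database.
    htmerge || return 1
    # soundex and accents read db.words.db, so they run after htmerge.
    htfuzzy soundex accents
}
```

htfuzzy synonyms (and htfuzzy endings) can be run separately, once, and
left out of the regular reindexing.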
> Here is my config file (noticed how I put every top level html file in
> the start_url line):
...
> # The URL(s) where htdig will start. See also limit_urls_to above.
> start_url: http://www.rsforster.ottawa.on.ca/robin_main.html
> http://www.rsforster.ottawa.on.ca/stamps/stamps.html
> http://www.rsforster.ottawa.on.ca/tolkien/index.html
> http://www.rsforster.ottawa.on.ca/semiconductors.html
> http://www.rsforster.ottawa.on.ca/reef/index.html
I'm assuming that these are all on one line, and your mail program
folded the line. If not, you need a backslash at the end of each line
but the last one, to indicate continuation of the definition on the
following line.
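If the URLs really are on separate lines in the file, the definition would
need to look something like this (the indentation is only for readability):

```
start_url:	http://www.rsforster.ottawa.on.ca/robin_main.html \
		http://www.rsforster.ottawa.on.ca/stamps/stamps.html \
		http://www.rsforster.ottawa.on.ca/tolkien/index.html \
		http://www.rsforster.ottawa.on.ca/semiconductors.html \
		http://www.rsforster.ottawa.on.ca/reef/index.html
```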
> # This makes sure that we don't spider the web
> local_urls_only: true
>
> # These attributes allow indexing server via local filesystem rather
> than HTTP.
> local_urls: http://www.rsforster.ottawa.on.ca/=/home/httpd/html/
> local_user_urls:
> http://www.rsforster.ottawa.on.ca/=/home/,/public_html/
I don't see anything obviously wrong with your configuration, or with the
few pages on your site I looked at. You do seem to have proper HTML links,
so it's not a problem of the navigation depending on JavaScript.
I'd recommend following FAQ 5.25 - 5.27 (especially the last one) to see
if you can get some meaningful feedback from htdig about what's going
on when it attempts to spider.
I did notice you commented out the limit_urls_to definition. However, the
compiled-in default is the same as the line you commented out, so that
changes nothing. You should probably set it explicitly to the main URL of
your site, especially since you can't simply point the start_url at that
URL. I suspect that right now, htdig is limiting itself to the URLs listed
in start_url because you didn't override the default setting of
limit_urls_to. (See http://www.htdig.org/FAQ.html#q5.24) This is probably
the first thing to try, followed by the debugging steps from the FAQ
entries mentioned above.
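Something along these lines in your config file should do it, assuming you
want everything under your site's top-level URL to be eligible for indexing:

```
# Allow any URL under the site root, not just the ones in start_url.
limit_urls_to:	http://www.rsforster.ottawa.on.ca/
```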
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930