According to thorstuff:
> First of all I hope I am sending this to the "list" this time.

Goal!!!  :-)

> I am still trying to get htdig to index my documents but I am meeting 
> with no success.
> 
> I am fairly confident that at least the programs executed by rundig are 
> using htdig.conf because when I make changes to the start_url I see them 
> in the verbose messages that show up.
> 
> I tried the specific recommendation in FAQ 5.25 which explains how to 
> create a shell script that generates a file of urls. I did this, got a 
> list of all the html docs in the directory I wanted to index but when I 
> executed the script from the start_url directive it says all of the urls 
> list are not found and does not build a word list.
> 
> Sample output:
> ----------------------------------------------------------------------------------
> New server: web2.forefrontnet.com, 80
> New server: , 0
> Unknown host: 0/robots.txt
> 0:0:0:http://web2.forefrontnet.com/alhambra.html:  not found
> 1:1:0:http://web2.forefrontnet.com/comm/minaf2000-01-18.html:  not found
> 2:2:0:http://web2.forefrontnet.com/comm/minaf2000-01-31.html:  not found
> 3:3:0:http://web2.forefrontnet.com/comm/minaf2000-02-24.html:  not found
> 4:4:0:http://web2.forefrontnet.com/comm/minaf2000-03-27.html:  not found
> 5:5:0:http://web2.forefrontnet.com/comm/minaf2000-04-18.html:  not found
> 6:6:0:http://web2.forefrontnet.com/comm/minaf2000-04-24.html:  not found
> 7:7:0:http://web2.forefrontnet.com/comm/minaf2000-04-27.html:  not found
> 8:8:0:http://web2.forefrontnet.com/comm/minaf2000-05-23.html:  not found
> 9:9:0:http://web2.forefrontnet.com/comm/minaf2000-05-24.html:  not found

None of these seem like working URLs to me.  At least, I can't reach them
with my browser, so it's not too surprising that htdig can't either.
Given that you seem to be serving PHP pages on your site, this trick
of using find and sed to generate the URLs may not be the most workable
approach on your site.

What's more, there seems to be a server misconfiguration on your end,
because the 404 Not Found error page is returned with a Content-Type of
text/plain rather than text/html, so I end up looking at the raw HTML
code of the error message rather than having it rendered as HTML text.
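
If you want to check the status line and Content-Type yourself without
a browser getting in the way, something along these lines might help.
It's only a rough sketch: the function name show_status_and_type is made
up for illustration, and it assumes curl is installed on the machine you
test from.

```shell
#!/bin/sh
# Hypothetical helper: reads raw HTTP response headers on stdin and prints
# the status line and the Content-Type value.
show_status_and_type() {
    # Strip the carriage returns that end HTTP header lines, then pick out
    # the first line (the status) and any Content-Type header.
    tr -d '\r' | awk '
        NR == 1 { print "status: " $0 }
        tolower($0) ~ /^content-type:/ {
            sub(/^[^:]*:[ \t]*/, "")
            print "content-type: " $0
        }
    '
}

# Example use (needs network access, so it is left commented out):
#   curl -s -D - -o /dev/null http://web2.forefrontnet.com/alhambra.html |
#       show_status_and_type
```

A working HTML page should report a 200 status and text/html; an error
page would normally still be text/html, just with a 404 status.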

> Sample of text that was acted on:
> ----------------------------------------------------------------------------------
> http://web2.forefrontnet.com//alhambra.html
> http://web2.forefrontnet.com//comm/minaf2000-01-18.html
...

You shouldn't have double slashes after the host name in the URL,
so you likely didn't transcribe the sed expression quite right, but
that's a side issue.  As you can see above, htdig strips out the extra
slash anyway.  Without knowing where the documents actually are on your
site, and what the valid URLs would be for them, I can't really make
any concrete recommendations to you beyond suggesting you figure out
what the URLs ought to be.
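
For reference, here is a rough sketch of the find and sed approach that
avoids the double slash.  The function name make_url_list is my own, and
the document root and base URL in the example are only guesses (I'm
assuming /var/www/html from your later comment); it also assumes the
document root contains no "|" characters, since that is used as the sed
delimiter.

```shell
#!/bin/sh
# Hypothetical helper: print one URL per .html file under a document root.
# Usage: make_url_list DOCROOT BASEURL
make_url_list() {
    docroot=${1%/}    # drop one trailing slash, if present
    baseurl=${2%/}    # likewise, so exactly one slash joins host and path
    find "$docroot" -type f -name '*.html' -print |
        sed -e "s|^$docroot/|$baseurl/|" |
        sort
}

# Example (paths and output file name are assumptions, adjust for your site):
#   make_url_list /var/www/html http://web2.forefrontnet.com > /tmp/urls.txt
```

This only helps if each file really is served at the matching URL, so
it's worth trying one or two of the generated URLs in a browser before
handing the list to htdig.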

> I even removed the start_url directive entirely and it generated a large 
> wordlist but not from anything in my site past the initial directory 
> /var/www/html/ . Also the words it does find are not reachable from the 
> search html although results can be seen by using htsearch from the 
> command line.

I believe that if you remove start_url entirely from your config file,
it will default to http://www.htdig.org/, the compiled-in default value,
which won't do your site much good.

It's interesting that you don't get the same results when you run
htsearch from the command line as you do when running it from the
browser.  When I run
http://web2.forefrontnet.com/cgi-bin/htsearch?config=htdig&words=config
from here, I get the No match page.  This suggests to me that either I'm
running a different htsearch binary than you are from the command line,
or it's somehow using a different config file.

It would be useful to know if you have more than one htsearch binary
installed.  Try "locate htsearch" and "rpm -qf `locate htsearch`" to
see where the files are, and where they came from.
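
If locate turns up several copies, a small loop like the following might
make it easier to tell whether they are really the same binary.  This is
just a sketch: describe_copies is a name I made up, and it falls back to
a note if rpm isn't available on the machine.

```shell
#!/bin/sh
# Hypothetical helper: for each file given, print its checksum, its path,
# and the RPM package that owns it (tab separated).
describe_copies() {
    for f in "$@"; do
        [ -f "$f" ] || continue    # skip anything that is not a regular file
        sum=$(cksum < "$f" | awk '{ print $1 }')
        if command -v rpm >/dev/null 2>&1; then
            owner=$(rpm -qf "$f" 2>/dev/null) || owner="not owned by any package"
        else
            owner="rpm not available"
        fi
        printf '%s\t%s\t%s\n' "$sum" "$f" "$owner"
    done
}

# Example (commented out, since it depends on your locate database):
#   describe_copies $(locate htsearch)
```

Copies with the same checksum are the same binary; differing checksums
or owning packages would point to two separate installations.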

If I failed to mention it before, I should point out that when you install
the htdig316 RPM packages from htdig.org, you should first remove the
htdig and htdig-web packages (with rpm -e) in case they were installed
from the Red Hat distribution, as the two sets of packages will clash with
each other if you try to have both installed at the same time.  (I used a
different package name so that up2date wouldn't keep trying to get you to
"upgrade" from 3.1.6 to Red Hat's cruddy 3.2.0 beta snapshot packages.)

If it's the same binary being run in both cases, then I'm at a bit of a
loss as to why it seems to be using a different database in each case.
Maybe try a "locate htdig.conf db.wordlist" as well, to see if you have
multiple instances of these two files.  Finally, it's a bit of a long
shot, but make sure all the files in /var/lib/htdig are world-readable.
(Make sure you don't use a umask of 007 or 077 when you run rundig, or for
that matter when you create any files that are to be accessible from the
web server, since either one strips read permission for "other" users.)
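
A quick way to spot permission problems might be something like this
sketch.  The function name find_not_world_readable is hypothetical, and
/var/lib/htdig is only the database directory I'm assuming from the RPM
layout.

```shell
#!/bin/sh
# Hypothetical helper: list files that lack read permission for "other",
# and directories that lack read or search (execute) permission for "other".
find_not_world_readable() {
    find "$1" \( \( -type f ! -perm -004 \) -o \( -type d ! -perm -005 \) \) -print
}

# Example (commented out; adjust the directory for your installation):
#   umask 022                        # new files come out world-readable
#   find_not_world_readable /var/lib/htdig
```

If it prints nothing, the mode bits are not what's stopping the web
server from reading the database.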

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

