According to thorstuff:
> First of all I hope I am sending this to the "list" this time.

Goal!!! :-)

> I am still trying to get htdig to index my documents but I am meeting
> with no success.
>
> I am fairly confident that at least the programs executed by rundig are
> using htdig.conf because when I make changes to the start_url I see them
> in the verbose messages that show up.
>
> I tried the specific recommendation in FAQ 5.25 which explains how to
> create a shell script that generates a file of urls. I did this, got a
> list of all the html docs in the directory I wanted to index but when I
> executed the script from the start_url directive it says all of the urls
> list are not found and does not build a word list.
>
> Sample output:
> ----------------------------------------------------------------------------------
> New server: web2.forefrontnet.com, 80
> New server: , 0
> Unknown host: 0/robots.txt
> 0:0:0:http://web2.forefrontnet.com/alhambra.html: not found
> 1:1:0:http://web2.forefrontnet.com/comm/minaf2000-01-18.html: not found
> 2:2:0:http://web2.forefrontnet.com/comm/minaf2000-01-31.html: not found
> 3:3:0:http://web2.forefrontnet.com/comm/minaf2000-02-24.html: not found
> 4:4:0:http://web2.forefrontnet.com/comm/minaf2000-03-27.html: not found
> 5:5:0:http://web2.forefrontnet.com/comm/minaf2000-04-18.html: not found
> 6:6:0:http://web2.forefrontnet.com/comm/minaf2000-04-24.html: not found
> 7:7:0:http://web2.forefrontnet.com/comm/minaf2000-04-27.html: not found
> 8:8:0:http://web2.forefrontnet.com/comm/minaf2000-05-23.html: not found
> 9:9:0:http://web2.forefrontnet.com/comm/minaf2000-05-24.html: not found

None of these seem like working URLs to me. At least, I can't reach them
with my browser, so it's not too surprising that htdig can't either.
Given that you seem to be serving PHP pages on your site, this trick of
using find and sed to generate the URLs may not be the most workable
approach on your site.
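
For what it's worth, the idea behind the FAQ's find/sed trick can be
sketched roughly as follows. This is only an illustration, not the FAQ's
exact script: the document root /var/www/html is an assumption, the
function names paths_to_urls and list_urls are made up for this example,
and it only helps if the files under that directory really are served at
the matching URLs.

```shell
#!/bin/sh
# Sketch only: DOCROOT and BASE are assumptions, so override them as needed.
# BASE is assumed to contain no "|" or "&" characters (they are special to sed).
DOCROOT=${DOCROOT:-/var/www/html}
BASE=${BASE:-http://web2.forefrontnet.com}

# Read "./comm/page.html" style paths (as printed by find) on stdin and
# write URLs on stdout. The leading "./" is replaced by "$BASE/", and any
# trailing slash on BASE is dropped first, so exactly one slash follows
# the host name.
paths_to_urls() {
    sed -e "s|^\./|${BASE%/}/|"
}

# List the .html files under DOCROOT as URLs, one per line.
list_urls() {
    (cd "$DOCROOT" && find . -type f -name '*.html' -print | sort | paths_to_urls)
}

# Uncomment to print the list:
# list_urls
```

Before handing such a list to htdig, it's worth pasting one or two of the
generated URLs into a browser. If they come back as 404, the mapping from
directory to URL is wrong, and no amount of tweaking the script will help
until that is sorted out.
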
What's more, there seems to be a server misconfiguration on your end,
because the 404 Not Found error page is returned with a Content-Type of
text/plain rather than text/html, so I end up looking at the raw HTML
code of the error message rather than having it rendered as HTML text.

> Sample of text that was acted on:
> ----------------------------------------------------------------------------------
> http://web2.forefrontnet.com//alhambra.html
> http://web2.forefrontnet.com//comm/minaf2000-01-18.html
...

You shouldn't have double slashes after the host name in the URL, so you
likely didn't transcribe the sed expression quite right, but that's a
side issue. As you can see above, htdig strips out the extra slash
anyway. Without knowing where the documents actually are on your site,
and what the valid URLs would be for them, I can't really make any
concrete recommendations to you beyond suggesting you figure out what
the URLs ought to be.

> I even removed the start_url directive entirely and it generated a large
> wordlist but not from anything in my site past the initial directory
> /var/www/html/ . Also the words it does find are not reachable from the
> search html although results can be seen by using htsearch from the
> command line.

I believe that if you remove start_url entirely from your config file,
it will default to http://www.htdig.org/, the compiled-in default value,
which won't do your site much good.

It's interesting that you don't get the same results when you run
htsearch from the command line as you do when running it from the
browser. When I run
http://web2.forefrontnet.com/cgi-bin/htsearch?config=htdig&words=config
from here, I get the No match page. This suggests to me that either I'm
running a different htsearch binary than you are from the command line,
or it's somehow using a different config file. It would be useful to
know if you have more than one htsearch binary installed.
Try "locate htsearch" and "rpm -qf `locate htsearch`" to see where the
files are, and where they came from. If I failed to mention it before,
I should point out that when you install the htdig316 RPM packages from
htdig.org, you should first remove the htdig and htdig-web packages
(with rpm -e) in case they were installed from the Red Hat distribution,
as the two packages will clash with each other if you try to have both
on at the same time. (I used a different package name so that up2date
wouldn't keep trying to get you to "upgrade" from 3.1.6 to Red Hat's
cruddy 3.2.0 beta snapshot packages.)

If it's the same binary being run in both cases, then I'm at a bit of a
loss as to why it seems to be using a different database in each case.
Maybe try a "locate htdig.conf db.wordlist" as well, to see if you have
multiple instances of these two files.

Finally, it's a bit of a long shot, but make sure all the files in
/var/lib/htdig are world-readable. (Make sure you don't use a umask of
7 or 77 when you run rundig, or for that matter when you create any
files that are to be accessible from the web server.)

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

_______________________________________________
ht://Dig general mailing list: <[EMAIL PROTECTED]>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general
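
As a footnote to the permissions suggestion above, the check can be
sketched like this. The function name unreadable_files is made up for
illustration, and /var/lib/htdig is simply the database directory named
above. Adjust it if your databases live elsewhere.

```shell
#!/bin/sh
# Sketch only: DBDIR is an assumption taken from the path mentioned above.
DBDIR=${DBDIR:-/var/lib/htdig}

# Print every regular file under the given directory that lacks the
# world-read permission bit (the "other" r bit, octal 004).
unreadable_files() {
    find "$1" -type f ! -perm -004 -print
}

# Uncomment to run the check; no output means every file is world-readable:
# unreadable_files "$DBDIR"
```

Any file this prints would be unreadable to a web server running as an
unrelated user, which is one way htsearch could succeed from your shell
and fail from the browser. The directories leading to the files also need
to be searchable by the web server user.
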

