I think I got it, although it could be one of two things... Looking at the rundig -vvv output, I noticed that it didn't find robots.txt on any of the sites. I copied a robots.txt into each one.
I also moved the first url to the end of the list. I deleted the DB files and re-ran rundig. Now when I search I can find things in all sites.

Thanks for all your help.

Scott

-----Original Message-----
From: Johnson, S
Sent: Tuesday, November 11, 2003 9:42 AM
To: Jim Cole
Cc: [EMAIL PROTECTED]
Subject: RE: [htdig] Searching multiple sites

The index is being performed on the same server as the sites are being hosted... The sites all together are about 6 gig in size.

I was using vi for the editor... I went back into the config per your suggestion and verified that I have spaces between all the site names.

Your suggestion on the limit urls may have merit... I have the main site: mysite.k12.mn.us, then I have DNS shortcuts to the schools: school1.mysite.k12.mn.us, school2.mysite.k12.mn.us, etc... The directory structure is within the mainsite's URL. I'm thinking that if I remove the main page/main site url from the search that it may work? Still doesn't explain the DB size for the main page...

I'll work with these suggestions and post my results.

Thanks!

Scott

-----Original Message-----
From: Jim Cole [mailto:[EMAIL PROTECTED]
Sent: Friday, November 07, 2003 12:20 AM
To: Johnson, S
Cc: [EMAIL PROTECTED]
Subject: Re: [htdig] Searching multiple sites

On Thursday, November 6, 2003, at 09:24 AM, Johnson, S wrote:

> I performed the htdig which took a very long time (which I
> expected). After which I verified the DB files and they're around 15
> mb in size. So I'm thinking that it did gather all the search info on
> the sites.

Any idea how big the sites are? Are you using a slow connection to index the sites? I wouldn't expect it to take a "very long time" to index sites that resulted in a 15 MB database unless you are working with very limited bandwidth or in some other way throttling htdig.

> Now when I go to search I type in a term that should bring up hundred
> of terms but only get 1 hit.
> This hit happens to be on the first site
> I search which is only one page in length.

You might want to double check your configuration file and verify that there are no problems with the start_url attribute. If for example your editor automatically inserted some line breaks, it might be that htdig is only seeing the first URL.

Did you modify your limit_urls_to attribute? Or are you still using the default? If incorrectly modified, this attribute could be excluding some of the pages that you are trying to index. If on the other hand you are using the default, what do the URLs look like in your start_url attribute? The default limit_urls_to assumes that you are not explicitly providing the name of the start page. For example, it assumes something like http://server.tld/path/ rather than http://server.tld/path/index.html. If you are using start URLs of the latter form with the default limit_urls_to, htdig will exclude everything except for the initial page.

If the above items don't appear to be related to the problem you are encountering, you might want to take a look at http://www.htdig.org/FAQ.html#q5.25 and the other FAQs it references. These address a number of issues related to documents being missed during indexing.

> I re-read the FAQ and it talks about using restrict to search multiple
> sites. I looked at the description for this metatag and it didn't
> really explain what I needed to do to use this. I then looked at the
> search.html file that I got from htdig to test. I noticed a restrict
> line in there so I typed in the url I wanted to restrict the search on
> (http://mysite.school.k12.mn.us) and reloaded the form in my
> browser. It now doesn't find anything...

I would avoid messing with restrict until you resolve the more fundamental problem. It sounds like you are using it correctly, but at this point it is likely as not to just complicate the process of tracking down the missing pages.

> Does anyone have any suggestions on what I can do to fix this?

If none of the above helps, rerun the indexing process with -vvv and carefully examine the output. There are two things in particular to look for. First try to determine that htdig is actually seeing all of the URLs that you want indexed. Second look for messages stating that URLs are being rejected. Some rejected URLs are usually to be expected, but if any of the URLs that you are trying to grab show up as rejected, then the accompanying reason might point you to the source of the problem.

> How do I search on everything in the database?

Try using an asterisk (*) as the search term.

Jim

-------------------------------------------------------
This SF.Net email sponsored by: ApacheCon 2003,
16-19 November in Las Vegas. Learn firsthand the latest
developments in Apache, PHP, Perl, XML, Java, MySQL,
WebDAV, and more! http://www.apachecon.com/
_______________________________________________
ht://Dig general mailing list: <[EMAIL PROTECTED]>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general
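
For anyone reading this thread later, here is a rough Python sketch of the limit_urls_to point Jim makes above. It assumes the check is a plain case-insensitive substring match of each pattern against the URL, and that the default limit is simply the list of start URLs. That is a simplification for illustration, not htdig's actual code; the function name url_allowed and the server.tld URLs are made up.

```python
def url_allowed(url, limit_patterns):
    """Return True if any pattern occurs in the URL (case-insensitive).

    Simplified stand-in for a limit_urls_to style check; not htdig's code.
    """
    url = url.lower()
    return any(pattern.lower() in url for pattern in limit_patterns)


# Hypothetical start URLs. The limit is assumed to default to the start URLs.
dir_style = ["http://server.tld/path/"]
page_style = ["http://server.tld/path/index.html"]

# A page linked from the start page.
linked_page = "http://server.tld/path/about.html"

print(url_allowed(linked_page, dir_style))   # directory-style start URL
print(url_allowed(linked_page, page_style))  # explicit start page
```

Under these assumptions the linked page passes the directory-style limit but fails the explicit index.html one, which would match the "only one hit on the first page" symptom.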

