RE: [htdig] Searching multiple sites

Johnson, S Tue, 11 Nov 2003 09:57:15 -0800

I think I got it.  Although, it could be 1 of 2 things...

Looking at the rundig -vvv output, I noticed that it didn't find robots.txt in each of the 
sites.  I copied a robots.txt into each site.

I also moved the first url to the end of the list.

I deleted the DB files and re-ran rundig.  Now when I search I can find things in all 
sites.

Thanks for all your help.

 Scott

-----Original Message-----
From: Johnson, S 
Sent: Tuesday, November 11, 2003 9:42 AM
To: Jim Cole
Cc: [EMAIL PROTECTED]
Subject: RE: [htdig] Searching multiple sites

The index is being performed on the same server as the sites are being hosted...  The 
sites all together are about 6 gig in size.

I was using vi for the editor...  I went back into the config per your suggestion and 
verified that I have spaces between all the site names.

Your suggestion on the limit urls may have merit...

I have the main site: mysite.k12.mn.us, then I have DNS shortcuts to the schools: 
school1.mysite.k12.mn.us, school2.mysite.k12.mn.us, etc...  The directory structure is 
within the mainsite's URL.  I'm thinking that if I remove the main page/main site url 
from the search that it may work?  Still doesn't explain the DB size for the main 
page...

I'll work with these suggestions and post my results.  

Thanks!
  Scott

-----Original Message-----
From: Jim Cole [mailto:[EMAIL PROTECTED] 
Sent: Friday, November 07, 2003 12:20 AM
To: Johnson, S
Cc: [EMAIL PROTECTED]
Subject: Re: [htdig] Searching multiple sites

On Thursday, November 6, 2003, at 09:24 AM, Johnson, S wrote:

> I performed the htdig which took a very long time (which I 
> expected).  After which I verified the DB files and they're around 15 
> MB in size.  So I'm thinking that it did gather all the search info on 
> the sites.

Any idea how big the sites are? Are you using a slow connection to 
index the sites? I wouldn't expect it to take a "very long time" to 
index sites that resulted in a 15 MB database unless you are working 
with very limited bandwidth or in some other way throttling htdig.

> Now when I go to search I type in a term that should bring up hundreds 
> of hits but only get 1 hit.  This hit happens to be on the first site 
> I search which is only one page in length.

You might want to double check your configuration file and verify that 
there are no problems with the start_url attribute. If for example your 
editor automatically inserted some line breaks, it might be that htdig 
is only seeing the first URL.

Did you modify your limit_urls_to attribute? Or are you still using the 
default? If incorrectly modified, this attribute could be excluding 
some of the pages that you are trying to index. If on the other hand 
you are using the default, what do the URLs look like in your start_url 
attribute? The default limit_urls_to assumes that you are not 
explicitly providing the name of the start page. For example, it 
assumes something like http://server.tld/path/ rather than 
http://server.tld/path/index.html. If you are using start URLs of the 
latter form with the default limit_urls_to, htdig will exclude 
everything except for the initial page.
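
As a rough illustration of the two cases described above (a sketch only; 
the hostnames and paths are placeholders, not the poster's actual 
configuration), the relevant lines in the htdig configuration file might 
look something like this:

```
# Directory-style start URLs, separated by spaces. A trailing backslash
# continues the value on the next line; an unescaped line break would
# end the value early.
start_url:      http://server.tld/path/ \
                http://other.server.tld/

# Default setting: only URLs matching one of the start_url strings
# are followed.
limit_urls_to:  ${start_url}

# Problematic with the default limit_urls_to: naming the start page
# explicitly means only URLs containing this exact string match,
# so nothing beyond the initial page is indexed.
# start_url:    http://server.tld/path/index.html
```

After changing either attribute, the existing database would need to be 
rebuilt for the change to show up in searches.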

If the above items don't appear to be related to the problem you are 
encountering, you might want to take a look at 
http://www.htdig.org/FAQ.html#q5.25 and the other FAQs it references. 
These address a number of issues related to documents being missed 
during indexing.

> I re-read the FAQ and it talks about using restrict to search multiple 
> sites.  I looked at the description for this metatag and it didn't 
> really explain what I needed to do to use this.  I then looked at the 
> search.html file that I got from htdig to test.  I noticed a restrict 
> line in there so I typed in the url I wanted to restrict the search on 
> (http://mysite.school.k12.mn.us) and reloaded the form in my 
> browser.  It now doesn't find anything...

I would avoid messing with restrict until you resolve the more 
fundamental problem. It sounds like you are using it correctly, but at 
this point it is likely as not to just complicate the process of 
tracking down the missing pages.

> Does anyone have any suggestions on what I can do to fix this?

If none of the above helps, rerun the indexing process with -vvv and 
carefully examine the output. There are two things in particular to 
look for. First try to determine that htdig is actually seeing all of 
the URLs that you want indexed. Second look for messages stating that 
URLs are being rejected. Some rejected URLs are usually to be expected, 
but if any of the URLs that you are trying to grab show up as rejected, 
then the accompanying reason might point you to the source of the 
problem.

> How do I search on everything in the database?

Try using an asterisk (*) as the search term.

Jim

-------------------------------------------------------
This SF.Net email sponsored by: ApacheCon 2003,
16-19 November in Las Vegas. Learn firsthand the latest
developments in Apache, PHP, Perl, XML, Java, MySQL,
WebDAV, and more! http://www.apachecon.com/
_______________________________________________
ht://Dig general mailing list: <[EMAIL PROTECTED]>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general
