Re: [htdig] Duplicate pages

Gilles Detillieux Wed, 20 Sep 2000 08:49:06 -0700
According to [EMAIL PROTECTED]:
> The site I am indexing is a bit peculiar.  The following 
> is an example of the setup, where each page is exactly 
> the same.
> 
> www.domain.com/subdirectory/
> www.domain.com/subdirectory/index.html
> www.domain.com/Subdirectory/
> www.domain.com/Subdirectory/index.html
> 
> I assumed that in the case where there is no index.html 
> that it was just loading the index.html.  Here's the 
> problem.  htdig recognizes this as 4 different pages, 
> and indexes all of them.  I can see where it would think 
> it is 2 different because of the s and S.  Is there any 
> way to prevent the duplicates?

The remove_default_doc attribute should take care of the superfluous
"index.html" entries, but I'm not so sure about the extra Subdirectory
names.  You can't use exclude_urls for this, because it does a
case-insensitive match, so a pattern that excludes /Subdirectory/ would
exclude /subdirectory/ as well.
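
For the "index.html" part, a minimal sketch of the config entry might
look like the following.  It assumes the server's default document is
named index.html, as in the example URLs above; adjust the value if the
server uses something else.  It only collapses the trailing-slash and
index.html forms of a URL into one entry, and does nothing about the
differing case in the directory names.

```
# Fragment of the htdig configuration file (sketch only).
# Assumption: index.html is the default document on the server being indexed.
# URLs ending in this name are normalized to the bare directory URL,
# so /subdirectory/index.html and /subdirectory/ become one entry.
remove_default_doc:	index.html
```

The database would need to be rebuilt from scratch for the change to
affect pages that are already indexed.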

On my site, I make use of a few symbolic links for subdirectories, to
give an all-lowercase equivalent to some mixed case names, but I never
use these in URLs on my site, for this very reason.  I only use them to
support links from other sites, where other admins may be a tad sloppy
about getting the case right.  I realise this isn't a workable alternative
for you if you don't maintain control over the whole site you're indexing.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

------------------------------------
To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  <http://www.htdig.org/mail/menu.html>
FAQ:            <http://www.htdig.org/FAQ.html>