Re: [htdig] redundant results

Peter L. Peres Fri, 25 May 2001 14:06:34 -0700
On Fri, 25 May 2001, Geoff Hutchison wrote:

>On Fri, 25 May 2001 [EMAIL PROTECTED] wrote:
>
>> (www.example.com/dir/) or the default page name 
>> (www.example.com/dir/index.html).  Some search engines will treat 
>> this as two different pages, I don't remember about ht://Dig.  Other 
>
>There isn't a problem with this particular example--you can set a list of
>"default documents" that are stripped off the end of a URL like
>index.html.
>
>On the other hand, you're entirely correct that symlinks will appear as
>unique URLs. The only way to solve this is with "duplicate
>detection" which has been implemented in the 3.2 betas already.
>
>If, on the other hand, you have results with the *same* URL and you see it
>multiple times in the results, that's a bug and we'd like to get more
>information.

Geoff, remember that I had the same problem and fought with it in 3.1.5.
One way to end up with documents indexed more than once is to have soft
links in the directory tree (i.e. the same document is reachable through
several different paths). This is very common, since many sites are built
that way. Maybe 3.2 solves this problem.
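
To illustrate the soft link case, here is a minimal sketch (not part of
ht://Dig, and the function name find_aliased_files is made up for this
example) that walks a document root on a POSIX filesystem and reports every
file that can be reached through more than one path. It assumes the web
server maps URLs directly onto that directory tree.

```python
import os
from collections import defaultdict


def find_aliased_files(root):
    """Map each real file under root to the paths that reach it (2 or more)."""
    by_target = defaultdict(list)

    def walk(path, active):
        real = os.path.realpath(path)
        if real in active:
            # This directory is already open higher up the current path,
            # so a soft link has led us back into a loop: stop here.
            return
        active = active | {real}
        for entry in sorted(os.listdir(path)):
            full = os.path.join(path, entry)
            if os.path.isdir(full):        # follows soft links to directories
                walk(full, active)
            elif os.path.isfile(full):     # follows soft links to files
                by_target[os.path.realpath(full)].append(full)

    walk(root, frozenset())
    # Keep only the files that were reached by more than one path.
    return {target: paths for target, paths in by_target.items()
            if len(paths) > 1}
```

Each entry in the result is one real document together with all the paths
(and so, under the assumption above, all the URLs) under which an indexer
could pick it up.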

The other way is to have circular links in the HTML or in the directory
structure. The first kind can be detected using the cycl utility that I
wrote; the second kind is hard to find. Even if they exist, they can be
avoided using the prune_parent_dir patch for 3.1.5, which I also wrote.
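
As a rough illustration of what detecting circular links involves, the
sketch below runs a depth-first search over a link graph and returns one
cycle if it finds any. This is not the cycl utility and not the
prune_parent_dir patch; the function name find_cycle and the input format
(a dict mapping each page to the pages it links to) are assumptions made
for the example.

```python
def find_cycle(links):
    """Return one cycle as a list of pages, or None if there is none.

    links maps each page to an iterable of the pages it links to.
    """
    in_progress, done = 1, 2
    state = {}   # pages absent from this dict have not been visited yet
    path = []    # pages on the current depth-first path, in order

    def visit(page):
        state[page] = in_progress
        path.append(page)
        for target in links.get(page, ()):
            seen = state.get(target)
            if seen == in_progress:
                # target is already on the current path: that is a cycle.
                return path[path.index(target):] + [target]
            if seen is None:
                found = visit(target)
                if found:
                    return found
        path.pop()
        state[page] = done
        return None

    for page in links:
        if page not in state:
            found = visit(page)
            if found:
                return found
    return None
```

The search is recursive, so a very deep link chain would run into Python's
recursion limit; for a quick check of a small site that should not matter.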

I hope that the 3.2 betas have fixed the problem.

Peter

PS: Of course this is my opinion, but I am using it like this and it
helps.


_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html