According to Joe R. Jah:
> On Tue, 7 Dec 1999, Geoff Hutchison wrote:
> > Date: Tue, 7 Dec 1999 14:39:30 -0600
> > From: Geoff Hutchison <[EMAIL PROTECTED]>
> > To: "Joe R. Jah" <[EMAIL PROTECTED]>
> > Cc: htdig3-dev <[EMAIL PROTECTED]>
> > Subject: Re: [htdig3-dev] Re: htdig-3.1.4 prerelease
> >
> > At 12:05 PM -0800 12/7/99, Joe R. Jah wrote:
> > >everything worked except my old local duplicate suppressor patch:
> > >ftp://sol.ccsf.cc.ca.us/htdig-patches/3.0.8b2/Retriever.cc.0
> > >did not quite do its job.
> >
> > It would probably need some tinkering to work. We changed how local
> > documents are indexed slightly, so it would need to be "ported."
>
> All the tinkering I did was in Retriever::Need2Get(char *u)
>
> I applied the old patch and changed:
>
> String *local_filename = IsLocal(u);
> to
> String *local_filename = GetLocal(u);
>
> and added
>
> url.lowercase();
>
> which was missing in 3.1.4.
No, you don't want url.lowercase(); in Need2Get() anymore! It can
break things on case-sensitive servers, where upper- and lowercase
variants of a name can actually refer to separate files. In 3.1.4,
URLs are converted to lowercase when they're parsed (in URL.cc), but
only when case_sensitive is set to false.
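Roughly, that parse-time behavior amounts to something like the
following (a hypothetical sketch, not the actual URL.cc code - the
helper name is made up for illustration):

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Sketch of 3.1.4's parse-time normalization: the URL is lowercased
// only when case_sensitive is false. On a case-sensitive server,
// Page.HTML and page.html may be different files, so the URL must be
// left untouched when case_sensitive is true.
std::string normalize_url(const std::string &url, bool case_sensitive)
{
    if (case_sensitive)
        return url;                       // preserve the server's casing
    std::string out(url);
    std::transform(out.begin(), out.end(), out.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    return out;
}
```

Calling url.lowercase() unconditionally in Need2Get(), as the old
patch did, bypasses that case_sensitive check.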
> What other changes need to be made?
None that I can think of. Please try the patch I just posted, on an
unpatched 3.1.4 prerelease Retriever.cc.
> > >As you see database sizes do not vary too much, but the results pages
> > >point to the same URL MULTIPLE times in 3.1.4 case; baffling;-/?
> >
> > You mean something with exactly the same string? Can you give us an example?
>
> Here are some examples of the results URLs:
>
> http://www.ccsf.cc.ca.us/Resources/Title3/intech/f96_p4.shtml/f96_p6.shtml
> http://www.ccsf.cc.ca.us/Resources/Title3/intech/f96_p4.shtml/
> http://www.ccsf.cc.ca.us/Resources/Title3/intech/f96_p4.shtml
> http://www.ccsf.cc.ca.us/Resources/Title3/intech/f96_p4.shtml/f96_p4.shtml
> http://www.ccsf.cc.ca.us/Resources/Title3/intech/f96_p4.shtml/f96_p5.shtml
> http://www.ccsf.cc.ca.us/Resources/Title3/intech/f96_p4.shtml/int_fall96.shtml
> http://www.ccsf.cc.ca.us/Resources/Title3/intech/f96_p4.shtml/f96_p2.shtml
> http://www.ccsf.cc.ca.us/Resources/Title3/intech/f96_p4.shtml/f96_p3.shtml
>
> As you can see, all those point to the f96_p4.shtml file, but many
> results have extra garbage appended to the file name.
As you probably know, .shtml files are not fetched from the local
file system, so this is unrelated to the patch we've been discussing.
The problem of extra path information on SSI documents was actually
discussed at great length last week - there's not a lot we can do
about it in htdig, other than adding .shtml/ to exclude_urls, which
you may want to do.
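In the config, that could look something like this (a sketch - merge
the pattern into whatever exclude_urls line you already have; the
/cgi-bin/ entry is just a hypothetical placeholder):

```
exclude_urls:		/cgi-bin/ .shtml/
```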
As a quick recap, the SSI problem occurs when an href to an SSI
document has an extra slash (/) at the end. This makes the URL look
like a directory URL to htdig (or ANY web client), so any relative
hrefs in the document are interpreted as being under the document
itself, rather than under the directory which contains the document.
Try it yourself in a web browser to see what happens.
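Here's a minimal sketch of why that happens (a hypothetical helper,
not htdig's actual URL class): relative href resolution keeps
everything up to and including the base URL's last slash, then
appends the relative path.

```cpp
#include <cassert>
#include <string>

// Simplified relative href resolution: drop everything after the
// base URL's last '/', then append the relative path. A trailing
// slash on the base makes the whole document name part of the
// "directory", so relative links nest underneath it.
std::string resolve(const std::string &base, const std::string &rel)
{
    std::string::size_type slash = base.rfind('/');
    return base.substr(0, slash + 1) + rel;
}
```

With a correct base URL ending in f96_p4.shtml, a relative
f96_p5.shtml lands beside it; with a trailing slash it lands
underneath it, which is exactly the .shtml/f96_p5.shtml pattern in
the results above.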
The solution, of course, is to hunt down these defective hrefs and strip
off the trailing slash. It's also a good idea to use only absolute hrefs
within SSI documents, to limit the damage when a faulty link slips in.
The exclude_urls hack above is a good precaution for htdig, but it won't
solve the problem for other spiders or web clients. It's not a generally
known fact that SSI documents are much more like CGI programs than they
are like static HTML pages, and great care must be taken to keep them
from presenting infinite hierarchies to web clients.
I can't understand why you didn't run into this with htdig 3.1.3 - the
problem definitely was there then and in previous releases. Did you
add .shtml/ to exclude_urls in the config for 3.1.3, but not 3.1.4?
> > >That reminds me; has the _promised_ duplicate suppression feature been
> > >placed in 3.2.x yet?
> >
> > Alas no. Some of you may remember the post to the htdig list about 2
> > months ago from someone saying they were working on a number of
> > projects (including duplicate elimination). Alas, they seem to have
> > disappeared again. Hence, 3.2.0b1 will go out the door without it.
> >
> > However, that doesn't mean it's dead yet. ;-)
>
> That statement doesn't warm my heart very much;) I guess I'll have to
> live with the above tinkering for the foreseeable future. Would you
> consider porting this little patch with each future release, even
> though you wouldn't include it in the release? I am sure there are
> other users who would appreciate having at least local duplicate
> suppression.
It's not included in the releases because it's considered too much
of a hack, I assume. I think at one point it was added to the 3.2
source tree, but taken out again. I've ported the patch to 3.1.4,
and I suppose I can do likewise for 3.2.0b1 when it comes out,
although you should be able to do this yourself pretty easily. Just
apply the code to Retriever.cc, which you'll likely have to do
manually for 3.2 as Need2Get() has changed, then run
"diff -up Retriever.cc.orig Retriever.cc".
Are there any other old patches that you think were overlooked, which
should be considered for 3.2?
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.