According to Michael J. Fiorill:
> Hi guys,
> 
> I don't mean to be a pest, but I was wondering if any decision was made
> regarding eliminating and ?somedata tags from a URL when doing a
> file-system type scan.
> 
> Any info about this would be appreciated.
> 
> As always, thanks for your time!
> 
> Sincerely,
> Mike

Geoff and I seem to be in agreement that query strings should not be
stripped off URLs by default, whether indexing using local_urls or not.
URLs with query strings should always be passed to the HTTP server.

What you are asking for is really a separate feature - URL editing
(specifically stripping off query strings), which should be optional and
implemented independently of local file system handling.  The whole idea
behind local_urls handling is you should get the same data whether you go
through the HTTP server or the local file system, with the only difference
being the speed of access.  That would not be guaranteed in the general
case if htdig stripped off query strings just to get at the local file.
htdig must assume in general that query strings are significant.

URL rewriting is available right now as a patch for 3.1.5.  See
ftp://ftp.ccsf.org/htdig-patches/3.1.5/htdig-3.1.5.aarmstrong.tar.gz
It's also included in the 3.2.0b3 beta release.
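
To illustrate what I mean by optional query string stripping as a
separate URL editing step, here is a rough sketch in Python.  This is
not the patch's code or its configuration syntax; the function name
strip_query and the "enabled" switch are made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url, enabled=True):
    """Return url without its query string (hypothetical helper).

    "enabled" stands in for an explicit, off-by-default config option;
    when it is False the URL is returned untouched.
    """
    if not enabled:
        return url
    parts = urlsplit(url)
    # Rebuild the URL with an empty query component; the path is left
    # exactly as written, including any %xx escapes.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))
```

Because this only edits the URL, it applies the same way whether the
document is later fetched by HTTP or from the local file system.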

What Geoff and I discussed in the end was not a solution to your problem,
because AFAIK your problem is already solved (my apologies if we didn't
point you to this patch before), but rather we were discussing how to
avoid mistakenly grabbing a local file that seems to match a query string
pattern, but still allowing hex-encoded question marks through as part of
a file name.
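
Roughly, the idea is to test for a literal "?" before any hex-decoding
is done.  The Python sketch below only shows the order of operations;
it is not htdig's code, and the function name and the single
URL-prefix/directory-prefix pair are assumptions for the example.

```python
from urllib.parse import unquote

def local_path_for(url, url_prefix, dir_prefix):
    """Map a URL to a local file name, or return None to fall back to HTTP.

    Hypothetical helper: url_prefix/dir_prefix stand in for one
    local_urls mapping.  Does not sanitize ".." or handle fragments.
    """
    if not url.startswith(url_prefix):
        return None
    rest = url[len(url_prefix):]
    # Check BEFORE hex-decoding: an unencoded "?" starts a query string,
    # so the document must come from the HTTP server.
    if "?" in rest:
        return None
    # Only now decode %xx escapes, so "%3F" becomes a "?" in the file name.
    return dir_prefix + unquote(rest)
```

With this ordering, "index.html?1723" is never looked up on the local
file system, while "what%3Fnow.html" still maps to a file named
"what?now.html".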

> On Mon, 19 Feb 2001, Gilles Detillieux wrote:
> 
> > Date: Mon, 19 Feb 2001 14:05:19 -0600 (CST)
> > From: Gilles Detillieux <[EMAIL PROTECTED]>
> > To: Geoff Hutchison <[EMAIL PROTECTED]>
> > Cc: Gilles Detillieux <[EMAIL PROTECTED]>,
> >      Michael J.Fiorill <[EMAIL PROTECTED]>,
> >      [EMAIL PROTECTED]
> > Subject: Re: [htdig] Re: HTdig Change
> >
> > According to Geoff Hutchison:
> > > On Fri, 16 Feb 2001, Gilles Detillieux wrote:
> > > > I disagree with this.  I think htdig is making a safe assumption in
> > > > treating the query strings as significant.
> > >
> > > My argument would be that if a file has a query, it should really be
> > > treated through the HTTP server (for server parsing, etc.) *or* perhaps as
> > > an option, the query should be ignored for local filesystem indexing. I
> > > guess my point can be summed up that when looking for a file on the
> > > system, it should not be a file "index.html?1723" but "index.html" if the
> > > query is to be ignored.
> >
> > OK, I guess I misunderstood what you were saying.  Mike Fiorill wanted
> > the query string stripped off when indexing by the local file system,
> > and you seemed to be agreeing with that.  My point was that anything
> > with a query string should fall back to HTTP, unless some other option
> > explicitly requests stripping of query strings.  My concern is that
> > local_urls handling should not affect what document htdig gets, only
> > the method by which it gets it, and in the majority of cases a query
> > string has an effect on what document gets fetched, so it should not be
> > simply ignored.  That should be handled by a different config attribute.
> >
> > You're right, though, that htdig shouldn't be looking for a file name
> > like "index.html?1723" on the local filesystem.  It does this now, and
> > it's only because the lookup normally fails that it falls back to HTTP.
> > I think this is the case for all versions of htdig since local_urls was
> > added in the early 3.1.x betas.
> >
> > > > to the handling of local_urls, though, because there are cases where
> > > > users wanted query string stripping even for HTTP-based digging.  I think
> > >
> > > Yes, removing query strings is a separate matter, but as I said above, the
> > > > file code should never try to look up "index.html?1723." That's just
> > > not how the URL is to be parsed by the RFCs. If you have a legitimate
> > > question-mark in a filename, it has to be an encoded one.
> >
> > That's right.  But does it have to be SGML encoded or %xx hex encoded?
> > The way htdig works now, as of 3.1.4, is to decode any SGML encoding in
> > the entire URL before it breaks down the URL into its component parts.
> > So, if I'm not mistaken, when it pops an URL off the server queue,
> > it has no way of knowing if a "?" in the URL (or any other character
> > for that matter) had been SGML-encoded or not.  So, if an encoded question
> > mark can legitimately be part of a file name, maybe the code is working
> > correctly the way it is now.  The only problem is the small chance of a
> > "false positive" match if an unencoded question mark and query string
> > happen to match an existing file name on the local file system.  I think
> > > the only way to make certain the code can distinguish an unencoded "?"
> > from an SGML-encoded one would be to dissect the URL before SGML decoding.
> >
> > On the other hand, hex encoding would be easier, as that's normally
> > left up to the HTTP server.  The local_urls handling would be able to
> > > distinguish between an unencoded "?" and a "%3F" quite easily.  Up to
> > version 3.1.4, it didn't do any decoding of these, though, so they
> > would likely have failed and fallen back to HTTP.  As of version 3.1.5,
> > it decodes these for the whole URL, so if we were to add a test for a
> > query string, it should be before the hex-decoding.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
