According to Geoff Hutchison:
> On Fri, 16 Feb 2001, Gilles Detillieux wrote:
> > I disagree with this. I think htdig is making a safe assumption in
> > treating the query strings as significant.
>
> My argument would be that if a file has a query, it should really be
> treated through the HTTP server (for server parsing, etc.) *or* perhaps as
> an option, the query should be ignored for local filesystem indexing. I
> guess my point can be summed up that when looking for a file on the
> system, it should not be a file "index.html?1723" but "index.html" if the
> query is to be ignored.
OK, I guess I misunderstood what you were saying. Mike Fiorill wanted
the query string stripped off when indexing by the local file system,
and you seemed to be agreeing with that. My point was that anything
with a query string should fall back to HTTP, unless some other option
explicitly requests stripping of query strings. My concern is that
local_urls handling should not affect what document htdig gets, only
the method by which it gets it, and in the majority of cases a query
string has an effect on what document gets fetched, so it should not be
simply ignored. That should be handled by a different config attribute.
You're right, though, that htdig shouldn't be looking for a file name
like "index.html?1723" on the local filesystem. It does this now, and
it's only because the lookup normally fails that it falls back to HTTP.
I think this is the case for all versions of htdig since local_urls was
added in the early 3.1.x betas.
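
To make the behaviour I'm arguing for concrete, here is a rough sketch in Python. It is only an illustration, not htdig's actual C++ code: the function name and the prefix-to-directory mapping are hypothetical stand-ins for what local_urls does. The idea is that a URL carrying a query string is never looked up on the local filesystem, so the caller falls back to HTTP deliberately rather than by way of a failed lookup.

```python
from typing import Optional


def local_path_for(url: str, url_prefix: str, dir_prefix: str) -> Optional[str]:
    """Map a URL to a local file path, or return None to mean 'use HTTP'.

    url_prefix and dir_prefix are hypothetical stand-ins for one
    local_urls mapping (URL prefix -> directory prefix).
    """
    # Ignore any fragment, then treat the first literal "?" as the start
    # of a query string. A query may change which document the server
    # returns, so such URLs are never resolved locally.
    without_fragment = url.split("#", 1)[0]
    if "?" in without_fragment:
        return None
    # URLs outside the mapped prefix also go through HTTP.
    if not without_fragment.startswith(url_prefix):
        return None
    return dir_prefix + without_fragment[len(url_prefix):]
```

With a mapping of "http://www.example.com/" to "/var/www/html/", a plain "index.html" URL resolves to a local path, while "index.html?1723" returns None and is left to the HTTP code.
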
> > to the handling of local_urls, though, because there are cases where
> > users wanted query string stripping even for HTTP-based digging. I think
>
> Yes, removing query strings is a separate matter, but as I said above, the
> file code should never try to lookup for "index.html?1723." That's just
> not how the URL is to be parsed by the RFCs. If you have a legitimate
> question-mark in a filename, it has to be an encoded one.
That's right. But does it have to be SGML encoded or %xx hex encoded?
The way htdig works now, as of 3.1.4, is to decode any SGML encoding in
the entire URL before it breaks down the URL into its component parts.
So, if I'm not mistaken, when it pops a URL off the server queue,
it has no way of knowing if a "?" in the URL (or any other character
for that matter) had been SGML-encoded or not. So, if an encoded question
mark can legitimately be part of a file name, maybe the code is working
correctly the way it is now. The only problem is the small chance of a
"false positive" match if an unencoded question mark and query string
happen to match an existing file name on the local file system. I think
the only way to make certain the code can distinguish an unencoded "?"
from an SGML-encoded one would be to dissect the URL before SGML decoding.
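
A small Python sketch of the ordering problem, using the standard html.unescape as a stand-in for htdig's own SGML decoding (the function names are hypothetical, and "&#63;" is just one way an encoded question mark might appear). Once the whole URL has been decoded, the two inputs are identical, whereas splitting first keeps them apart.

```python
import html
from typing import Optional, Tuple


def decode_then_split(raw_url: str) -> Tuple[str, Optional[str]]:
    """Decode SGML entities in the whole URL, then split at '?'."""
    decoded = html.unescape(raw_url)
    path, sep, query = decoded.partition("?")
    return path, (query if sep else None)


def split_then_decode(raw_url: str) -> Tuple[str, Optional[str]]:
    """Split at a literal '?' first, then decode each part separately."""
    path, sep, query = raw_url.partition("?")
    return html.unescape(path), (html.unescape(query) if sep else None)


encoded = "index.html&#63;1723"  # "?" written as an SGML character reference
plain = "index.html?1723"        # literal "?" introducing a query string

# Decoding first: both inputs look like a path plus a query string.
same = decode_then_split(encoded) == decode_then_split(plain)

# Splitting first: the encoded "?" stays part of the file name.
as_filename = split_then_decode(encoded)
as_query = split_then_decode(plain)
```
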
On the other hand, hex encoding would be easier, as that's normally
left up to the HTTP server. The local_urls handling would be able to
distinguish between an unencoded "?" and a "%3F" quite easily. Up to
version 3.1.4, it didn't do any decoding of these, though, so they
would likely have failed and fallen back to HTTP. As of version 3.1.5,
it decodes these for the whole URL, so if we were to add a test for a
query string, it should be before the hex-decoding.
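
As a sketch of that ordering (again in Python with hypothetical names, using urllib.parse.unquote in place of htdig's own hex decoding): test for a literal "?" on the raw path, and only then decode %xx escapes to get the file name.

```python
from typing import Optional
from urllib.parse import unquote


def local_file_name(raw_path: str) -> Optional[str]:
    """Return the decoded file name, or None if the URL has a query string.

    The "?" test runs on the still-encoded path, so "%3F" is not mistaken
    for the start of a query string.
    """
    if "?" in raw_path:
        return None  # unencoded "?": leave this URL to HTTP
    return unquote(raw_path)  # "%3F" becomes a literal "?" in the name


def local_file_name_wrong_order(raw_path: str) -> Optional[str]:
    """Same test done after decoding, shown only to illustrate the problem."""
    decoded = unquote(raw_path)
    if "?" in decoded:
        return None  # an encoded "%3F" is now indistinguishable from "?"
    return decoded
```

Here local_file_name("what%3Fnow.html") yields the file name "what?now.html", while the wrong-order variant rejects it as if it carried a query string.
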
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930