According to Jim Cole:
> The patch I provided is apparently not a complete fix. Gilles shot it 
> down, pointing out that it misses at least one case. He offered an 
> alternative patch, but asked for feedback before committing it. I don't 
> think the person that initially reported the problem ever responded 
> regarding either patch. This information never ended up being appended 
> to the bug report; my bad I guess since I opened it. Sorry.
> 
> Gilles' response can be found at 
> http://sourceforge.net/mailarchive/message.php?msg_id=5470006.
> 
> I don't know whether it makes more sense to undo the patch and reopen 
> the report or accept the partial fix and open a new one for the missing 
> case. To be honest I never got around to taking the time to understand 
> the part of the problem that my initial patch missed.
...
> > Fix robot exclusions.  Jim Cole's patch with bug report #765726.

I think properly explaining it requires a bit of a history lesson.
Back in 3.1.x, htdig only checked against the disallow list during a
Server::push(), and it only checked the url.path() portion, not the full
URL.  It did this using StringMatch::Compare(), rather than a regex, so
it was an "anchored" match - i.e. the pattern had to match the path from
the start.  It worked, but the problem was that by the time htdig did the
Server::push(), it had already created a db.docdb entry for the URL, which
htmerge had to delete, which led to a lot of confusion because the message
from htmerge was ambiguous as to the reason for deleting the record.
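
That old behaviour amounts to a simple anchored prefix match of each
Disallow path against url.path() only.  A rough sketch of the idea, in
Python rather than the actual StringMatch C++ code, with made-up paths:

```python
# Illustration only (not htdig code): the 3.1.x-style check, where each
# Disallow path is matched against the URL's path component, anchored
# at the start of the path.
def is_disallowed(path, disallow_list):
    return any(path.startswith(d) for d in disallow_list)

print(is_disallowed("/stats/hits.html", ["/stats", "/cgi-bin"]))  # True
print(is_disallowed("/index.html", ["/stats", "/cgi-bin"]))       # False
```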

Enter 3.2, and in Dec, 1999, Geoff made some pretty radical changes to
robots.txt handling, including using regex (though I'm still not sure
why*, as robots.txt entries don't use regular expressions), and checking
for disallowed paths in Retriever::IsValidURL().  The latter change was to
check the URLs earlier on, before a db.docdb entry is needlessly created,
to avoid the problem above, and to give a meaningful error message
saying why the URL is rejected.  All that is great, but the change to
regex introduced this bug:  the match was now an unanchored match of the
whole URL, rather than an anchored match of the path only.  This gave
rise to the problem of false negatives, rejecting a URL if a disallowed
path matched any portion of the URL's path, or even the server name.
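
The effect is easy to reproduce with any regex library.  A hedged
sketch in Python (the hostnames and paths are invented for
illustration, not taken from the bug report):

```python
import re

# Sketch of the buggy 3.2 behaviour: an unanchored regex search of the
# Disallow path against the *whole* URL, instead of an anchored match
# of the path component only.
disallow = re.compile("/stats")

# Correctly rejected: the disallowed path really starts the URL's path.
print(bool(disallow.search("http://example.com/stats/hits.html")))   # True

# Falsely rejected: "/stats" matches inside "//stats.example.com",
# i.e. the server name, nowhere near the path.
print(bool(disallow.search("http://stats.example.com/index.html")))  # True
```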

The change also left the test in Server::push(), so that URLs are
essentially screened twice.  What's not clear to me is whether this
test is still needed, or whether the test in Retriever::IsValidURL()
is now sufficient.  Finally, the change to regex matching also
introduces another potential problem: characters in Disallow lines
could be mistaken for regex meta characters, as they're not escaped.
I don't know how likely this is, but it's a theoretical possibility.
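
For example (the Disallow entry here is made up purely to show the
escaping concern, not taken from any real robots.txt):

```python
import re

# A Disallow line containing a regex meta character, used unescaped.
disallow = "/tmp+junk"   # '+' is literal in robots.txt, meta in a regex

unescaped = re.compile(disallow)
# Fails to match the literal URL: "p+" means "one or more p's".
print(bool(unescaped.search("/tmp+junk/x.html")))   # False
# Matches a URL the site never disallowed.
print(bool(unescaped.search("/tmppppjunk.html")))   # True

# Escaping the pattern first restores literal matching.
escaped = re.compile(re.escape(disallow))
print(bool(escaped.search("/tmp+junk/x.html")))     # True
print(bool(escaped.search("/tmppppjunk.html")))     # False
```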

Jim, your patch fixes the problem with false negatives, by forcing an
anchored search of the path (by inserting ^ before each Disallowed path).
The inconsistency here is that the match is done on the path only in
the test in Retriever::IsValidURL() - the test in Server::push() still
tries the (now anchored) match on the whole URL.  If there was indeed a
reason to keep this old test in Server::push(), that test is now broken.
This patch also doesn't do anything about escaping regex meta characters
in Disallow patterns.
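
The mismatch is easy to see in miniature; this is a Python sketch with
hypothetical URLs, not the htdig code itself:

```python
import re

# Jim's patch in effect prepends '^' to each Disallow path, giving an
# anchored pattern.
anchored = re.compile("^" + "/stats")

# Retriever::IsValidURL() matches against the path only: works.
print(bool(anchored.search("/stats/hits.html")))                    # True

# Server::push() still matches against the whole URL: the '^' now
# anchors at "http...", so the disallowed path is never caught.
print(bool(anchored.search("http://example.com/stats/hits.html")))  # False
```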

I can see taking one of the following courses of action to remedy the
problem:

1) Recognize that regex is the wrong tool for the job, and go back to
using StringMatch for robots.txt handling as in 3.1.x.  However, keep
the test of the URL's path in Retriever::IsValidURL(), and make sure
Server::push() tests the path only and not the full URL.

2) Stick to using regex, and Jim's patch, but fix Server::push() to
test the path.  Also, it would be a good idea to add the meta character
escaping code from my patch.

3) Back out Jim's patch and apply mine, which addresses all the concerns
I brought up, and is the simplest fix to the existing code (pre-Jim's
patch).  My fix adds a pattern to the regex to skip over the protocol
and server name parts of the URL, so the match is effectively anchored to
the path component of the URL, even though the pattern matching is done
(consistently) on the whole URL.
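
The exact regex isn't quoted in this message, but the general shape of
fix no. 3 can be sketched like this (my skip-the-host pattern here is
an assumption about the approach, not the patch itself):

```python
import re

# Sketch of fix no. 3: prefix each (escaped) Disallow path with a
# pattern that consumes the "scheme://host" part, so a match against
# the whole URL is effectively anchored at the path component.
disallow = "/stats"
pattern = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://[^/]*" + re.escape(disallow))

print(bool(pattern.search("http://example.com/stats/hits.html")))   # True
print(bool(pattern.search("http://stats.example.com/index.html")))  # False
```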

Whichever of the three you choose, it's going to need testing.
Fix 1 would be the ideal one, I think, but I didn't (and still don't)
have time to do that, so I opted for the simple fix in the above
mentioned e-mail.  Given that fix 2 is now partially implemented,
maybe the best course of action would be to follow through and address
the two missing pieces.  My fix (no. 3) has the disadvantage that future
developers will be scratching their heads trying to figure out my regular
expression for skipping to the path portion of the whole URL.

As a separate design decision, it might be worth trying to determine
for sure whether the test in Server::push() is actually needed or not.
If it's redundant, it should either be tossed out for the sake of
efficiency, or kept for the sake of "defensive programming" if it doesn't
really add any inefficiency.  If it's kept, though, it should be kept in a
way that's consistent with the new test in Retriever::IsValidURL().  As it
stands now, as of this Sunday's CVS commit, it's confusing and misleading
to have a functional test of the path in Retriever::IsValidURL(), and
a non-functional test of the whole URL in Server::push().

* Pure speculation as to the motive for moving to regex from StringMatch:
maybe Geoff intended eventually to eliminate the StringMatch class
altogether from the ht://Dig code base.  I know it had caused us some
grief in the early 3.1.x development, but I thought we had gotten
past that.  Maybe there were other issues or potential problems still
to be faced.  Geoff, care to comment?

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)


_______________________________________________
ht://Dig Developer mailing list:
[EMAIL PROTECTED]
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-dev
