According to Jim Cole:

> The patch I provided is apparently not a complete fix. Gilles shot it
> down, pointing out that it misses at least one case. He offered an
> alternative patch, but asked for feedback before committing it. I don't
> think the person who initially reported the problem ever responded
> regarding either patch. This information never ended up being appended
> to the bug report; my bad, I guess, since I opened it. Sorry.
>
> Gilles' response can be found at
> http://sourceforge.net/mailarchive/message.php?msg_id=5470006.
>
> I don't know whether it makes more sense to undo the patch and reopen
> the report or accept the partial fix and open a new one for the missing
> case. To be honest, I never got around to taking the time to understand
> the part of the problem that my initial patch missed. ...
>
> > Fix robot exclusions. Jim Cole's patch with bug report #765726.
I think properly explaining it requires a bit of a history lesson.

Back in 3.1.x, htdig only checked against the disallow list during a Server::push(), and it only checked the url.path() portion, not the full URL. It did this using StringMatch::Compare() rather than a regex, so it was an "anchored" match - i.e., the pattern had to match the path from the start. It worked, but the problem was that by the time htdig did the Server::push(), it had already created a db.docdb entry for the URL, which htmerge had to delete. That led to a lot of confusion, because the message from htmerge was ambiguous as to the reason for deleting the record.

Enter 3.2: in December 1999, Geoff made some pretty radical changes to robots.txt handling, including using regex (though I'm still not sure why*, as robots.txt entries don't use regular expressions), and checking for disallowed paths in Retriever::IsValidURL(). The latter change was to check the URLs earlier on, before a db.docdb entry is needlessly created, to avoid the problem above, and to give a meaningful error message saying why the URL is rejected.

All that is great, but the change to regex introduced this bug: the match was now an unanchored match of the whole URL, rather than an anchored match of the path only. This gave rise to the problem of false negatives - rejecting a URL if a disallowed path matched any portion of the URL's path, or even the server name. The change also left the test in Server::push(), so that URLs are essentially screened twice. What's not clear to me is whether this test is still needed, or whether the test in Retriever::IsValidURL() is now sufficient.

Finally, the change to regex matching may also lead to another potential problem: characters in Disallow lines could be mistaken for regex meta characters, as they're not escaped. I don't know how likely this is, but it's a theoretical possibility.
Jim, your patch fixes the problem with false negatives by forcing an anchored search of the path (by inserting ^ before each Disallow path). The inconsistency here is that the match is done on the path only in the test in Retriever::IsValidURL() - the test in Server::push() still tries the (now anchored) match on the whole URL. If there was indeed a reason to keep this old test in Server::push(), that test is now broken. This patch also doesn't do anything about escaping regex meta characters in Disallow patterns.

I can see taking one of the following courses of action to remedy the problem:

1) Recognize that regex is the wrong tool for the job, and go back to using StringMatch for robots.txt handling as in 3.1.x. However, keep the test of the URL's path in Retriever::IsValidURL(), and make sure Server::push() tests the path only and not the full URL.

2) Stick with regex and Jim's patch, but fix Server::push() to test the path. Also, it would be a good idea to add the meta character escaping code from my patch.

3) Back out Jim's patch and apply mine, which addresses all the concerns I brought up, and is the simplest fix to the existing code (pre-Jim's patch). My fix adds a pattern to the regex to skip over the protocol and server name parts of the URL, so the match is effectively anchored to the path component of the URL, even though the pattern matching is done (consistently) on the whole URL.

Whichever of the 3 you choose, it's going to need testing in any case. Fix 1 would be the ideal one, I think, but I didn't (and still don't) have time to do that, so I opted for the simple fix in the above-mentioned e-mail. Given that fix 2 is now partially implemented, maybe the best course of action would be to follow through and address the 2 missing pieces. My fix (no. 3) has the disadvantage that future developers will be scratching their heads trying to figure out my regular expression for skipping to the path portion of the whole URL.
As a separate design decision, it might be worth trying to determine for sure whether the test in Server::push() is actually needed or not. If it's redundant, it should either be tossed out for the sake of efficiency, or kept for the sake of "defensive programming" if it doesn't really add any inefficiency. If it's kept, though, it should be kept in a way that's consistent with the new test in Retriever::IsValidURL(). As it stands now, as of this Sunday's CVS commit, it's confusing and misleading to have a functional test of the path in Retriever::IsValidURL(), and a non-functional test of the whole URL in Server::push().

* Pure speculation as to the motive for moving to regex from StringMatch: maybe Geoff intended eventually to eliminate the StringMatch class altogether from the ht://Dig code base. I know it had caused us some grief in the early 3.1.x development, but I thought we had gotten past that. Maybe there were other issues or potential problems still to be faced. Geoff, care to comment?

--
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

_______________________________________________
ht://Dig Developer mailing list: [EMAIL PROTECTED]
List information (subscribe/unsubscribe, etc.)
  https://lists.sourceforge.net/lists/listinfo/htdig-dev