On Sat, Oct 26, 2002 at 10:06:12PM -0500, Geoff Hutchison wrote:

> Instead, I'd suggest using the SourceForge bug tracker for ht://Dig
> http://sourceforge.net/tracker/?atid=104593&group_id=4593&func=browse
OK, I've tried to avoid it ;-) If this doesn't get resolved soon, I will
submit it.

> > I found it with wwwoffle cache indexing scripts. htdig 3.1.x worked
> > well but after upgrading to 3.2.0b4-072201 it broke. The cached
> > pages are under the "/search/index" directory and "/index" is
> > disallowed. You can see that 3.2.0b rejects "/search/index" in the
> > debug output:
>
> Yes. I can't see anything in particular that would have solved this
> in the meantime (which surprises me since I seem to remember this
> before). For my own benefit, could you confirm that it fails for you
> on the current snapshot?

Hm, I've got painfully slow and expensive dialup here (Czech Republic,
monopolistic phone operator ... you know), so I would like to avoid
downloading an extra 2 MB ...

Back to the topic - I've got the ht://Dig 3.2.0b4-072201 source code
here, and after a short look at the code I tried to fix it. See the
attachment and please review it, since I'm not too familiar with the
htdig code internals. This is just a quick hack, but it seems to be
working here. It has not been heavily tested, though.

By the way, I think that using regular expressions here is far too big a
hammer for this simple task, i.e. just testing whether one string is
equal to, or an extension of, another (see the examples after the patch
below). robots.txt is not defined to contain regular expressions, but
htdig handles Disallow lines as if they were regexps. Are you sure that
won't cause any problems if somebody puts some "weird" characters in one?

Thanks for your reply and have a nice day

--
Martin Mačok
http://underground.cz/
[EMAIL PROTECTED]
http://Xtrmntr.org/ORBman/
diff -urNp htdig-3.2.0b4-072201.orig/htdig/Retriever.cc htdig-3.2.0b4-072201/htdig/Retriever.cc
--- htdig-3.2.0b4-072201.orig/htdig/Retriever.cc  2001-07-08 09:14:01.000000000 +0200
+++ htdig-3.2.0b4-072201/htdig/Retriever.cc  2002-10-27 14:38:08.000000000 +0100
@@ -961,7 +961,7 @@ Retriever::IsValidURL(const String &u)
     //
     URL testURL((char*)url);
     Server *server = (Server *) servers[testURL.signature()];
-    if (server && server->IsDisallowed(url) != 0)
+    if (server && server->IsDisallowed(testURL.path()) != 0)
     {
         if (debug > 2)
             cout << endl << "   Rejected: forbidden by server robots.txt!";
Binary files htdig-3.2.0b4-072201.orig/htdig/.Retriever.cc.swp and htdig-3.2.0b4-072201/htdig/.Retriever.cc.swp differ
diff -urNp htdig-3.2.0b4-072201.orig/htdig/Server.cc htdig-3.2.0b4-072201/htdig/Server.cc
--- htdig-3.2.0b4-072201.orig/htdig/Server.cc  2001-05-20 09:13:51.000000000 +0200
+++ htdig-3.2.0b4-072201/htdig/Server.cc  2002-10-27 13:57:32.000000000 +0100
@@ -187,7 +187,7 @@ void Server::robotstxt(Document &doc)
     String  contents = doc.Contents();
     int     length;
     int     pay_attention = 0;
-    String  pattern;
+    String  pattern = "";
     String  myname = config->Find("server", _host.get(), "robotstxt_name");
     int     seen_myname = 0;
     char    *name, *rest;
@@ -277,9 +277,9 @@ void Server::robotstxt(Document &doc)
         if (*rest)
         {
             if (pattern.length())
-                pattern << '|' << rest;
+                pattern << "|^" << rest;
             else
-                pattern = rest;
+                pattern << '^' << rest;
         }
     }
     //
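
To make the failure mode concrete, here is a small standalone program
(this is not htdig code; it only assumes plain POSIX regcomp()/regexec()
matching, which is what an unanchored Disallow pattern amounts to). The
unanchored pattern "/index" matches anywhere inside "/search/index",
while the anchored "^/index" does not - which is why the patch both
prepends '^' to each Disallow value and matches against testURL.path()
rather than the full URL, so the anchor lines up with the start of the
path:

// demo.cc - unanchored vs. anchored matching of a Disallow value
#include <regex.h>
#include <cstdio>

static int matches(const char *pattern, const char *path)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;               // treat a bad pattern as "no match"
    int hit = (regexec(&re, path, 0, 0, 0) == 0);
    regfree(&re);
    return hit;
}

int main()
{
    const char *path = "/search/index";
    printf("\"/index\"  vs %s: %s\n", path,
           matches("/index", path)  ? "match (rejected)" : "no match");
    printf("\"^/index\" vs %s: %s\n", path,
           matches("^/index", path) ? "match (rejected)" : "no match");
    return 0;
}

Compiled with g++ and run, the first test reports a (wrong) match and
the second reports no match, i.e. "/search/index" is no longer rejected.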
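
And to illustrate the "too big a hammer" point: a minimal sketch of a
literal prefix test that could replace the regex entirely. Note that
is_disallowed and the 'disallowed' list are made-up names for
illustration, not htdig's actual Server interface:

// prefix_check.cc - sketch: Disallow values as literal path prefixes
#include <string>
#include <vector>

// 'disallowed' would hold the raw Disallow values collected from
// robots.txt for one server, e.g. "/index".
bool is_disallowed(const std::string &path,
                   const std::vector<std::string> &disallowed)
{
    for (std::vector<std::string>::size_type i = 0;
         i < disallowed.size(); ++i)
    {
        const std::string &prefix = disallowed[i];
        // true when path is equal to, or an extension of, the prefix
        if (!prefix.empty() &&
            path.compare(0, prefix.length(), prefix) == 0)
            return true;
    }
    return false;
}

With "/index" in the list, this returns true for "/index" and
"/index/foo" but false for "/search/index", and characters like '.',
'+' or '(' in a Disallow line can never be misread as regex operators.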