Earlier I wrote:

> According to Christopher Murtagh:
> > Easy thing to test. I'll give it a try later this week if I can,
> > perhaps tomorrow, and report back.
>
> Great. I'll try to get my fix to Regex.cc in by the end of the week too,
> so it would be great if you could give it a whirl. It would probably
> mean having to back out your own patch, though, or it wouldn't really
> get tested.
OK, the simple fix I had in mind was indeed very simple to implement, but
unfortunately ineffective. The problem is the way HtRegexList uses (or
abuses) the HtRegex class. It repeatedly calls HtRegex::set() with an
increasingly complex pattern, until it fails, to see how big a pattern it
can build before breaking it up. Of course, this completely defeats any
attempt to fix the problem at the level of HtRegex.

Here's the result of a trace print I added to HtRegex::set(), which shows
how htdig deals with these two simple patterns:

    exclude_urls:  /cgi-bin/ .cgi
    bad_querystr:  C=D C=M C=N C=S O=A O=D

For every URL htdig encountered within every document it parsed, it spat
out the following:

    compiling pattern: /cgi-bin/
    compiling pattern: /cgi-bin/|\.cgi
    compiling pattern: C=D
    compiling pattern: C=D|C=M
    compiling pattern: C=D|C=M|C=N
    compiling pattern: C=D|C=M|C=N|C=S
    compiling pattern: C=D|C=M|C=N|C=S|O=A
    compiling pattern: C=D|C=M|C=N|C=S|O=A|O=D

You can see how this would rapidly degenerate with lots of documents
containing lots of links, and with more complex patterns in those two
attributes! No wonder this was such a big problem.

I'll try again tomorrow with a similar fix in HtRegexList::setEscaped() to
see how much that helps matters. It's a bit more complicated there because
we're dealing with lists instead of strings, but it shouldn't be too nasty.
On the other hand, if these two attributes are the only ones that pose a
problem, maybe we should just deal with them in Retriever.cc and be done
with it! Rather than using a flag, as Chris and Lachlan's patches do, I was
thinking of saving the string returned by config->Find(&aUrl,
"exclude_urls") and comparing it to the string we had the last time
through.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

_______________________________________________
ht://Dig Developer mailing list: [EMAIL PROTECTED]
List information (subscribe/unsubscribe, etc.)
  https://lists.sourceforge.net/lists/listinfo/htdig-dev
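P.S. To make the cost concrete, here is a small stand-alone sketch of the probing behaviour the trace shows. This is not htdig code; compile_pattern() is a stand-in for HtRegex::set(), and probe_list() mimics the way HtRegexList grows the pattern one alternative at a time, recompiling at every step. For a list of n sub-patterns, every URL check pays for n compilations, so the total work is (number of URLs) x n.

```cpp
#include <string>
#include <vector>

// Stand-in for HtRegex::set(): in htdig this is where regcomp()
// would actually run; here we just count how often it's called.
static int g_compiles = 0;

bool compile_pattern(const std::string &pattern) {
    ++g_compiles;   // pretend to compile; count the work
    return true;    // assume the pattern never gets too big
}

// Mimics HtRegexList's probing: grow the pattern one alternative at
// a time, recompiling at each step to see if it still works.  Returns
// the running compile count so the cost is easy to inspect.
int probe_list(const std::vector<std::string> &parts) {
    std::string pattern;
    for (const std::string &p : parts) {
        std::string trial = pattern.empty() ? p : pattern + "|" + p;
        if (compile_pattern(trial))
            pattern = trial;   // keep growing; a failure would split the list
    }
    return g_compiles;
}
```

With the six bad_querystr parts above, every URL costs six compiles; a crawl touching a thousand URLs pays for six thousand regcomp() calls on that attribute alone.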
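P.P.S. Here's roughly what I have in mind for the save-and-compare fix, as a self-contained sketch. The class and method names (CachedExcludes, match) are made up for illustration and are not htdig's API; the point is just that we keep the pattern string from the last time through and only recompile when it actually changes. I'm using std::regex here purely to keep the sketch portable; htdig itself uses the POSIX regex routines.

```cpp
#include <regex>
#include <sstream>
#include <string>

// Illustrative sketch only: cache the compiled regex keyed on the raw
// pattern string, so repeated lookups with an unchanged exclude_urls
// value cost one compilation instead of one per URL.
class CachedExcludes {
public:
    // True if url contains any of the space-separated substrings in
    // `patterns` (mirroring how exclude_urls patterns are OR-ed).
    bool match(const std::string &patterns, const std::string &url) {
        if (patterns != last_patterns_) {   // only recompile on change
            last_patterns_ = patterns;
            regex_ = std::regex(joined(patterns));
            ++compile_count_;               // instrumentation for testing
        }
        return std::regex_search(url, regex_);
    }
    int compile_count() const { return compile_count_; }

private:
    // Turn "a b c" into "a|b|c", escaping regex metacharacters so each
    // token is matched as a literal substring, as setEscaped() does.
    static std::string joined(const std::string &patterns) {
        std::istringstream in(patterns);
        std::string tok, out;
        while (in >> tok) {
            if (!out.empty()) out += '|';
            for (char c : tok) {
                if (std::string("\\^$.|?*+()[]{}").find(c) != std::string::npos)
                    out += '\\';
                out += c;
            }
        }
        return out;
    }

    std::string last_patterns_;
    std::regex  regex_;
    int         compile_count_ = 0;
};
```

In Retriever.cc the `patterns` argument would be the string returned by config->Find(&aUrl, "exclude_urls"); as long as that string doesn't change between URLs, the regex is compiled exactly once.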