According to Christopher Murtagh:
> On Tue, 2004-04-20 at 17:45, Gilles Detillieux wrote:
> > Hi, Chris and other developers.  The problem with this fix is that
> > exclude_urls and bad_querystr can no longer be used in server blocks or
> > URL blocks, as they'll only be parsed once regardless of how they're used.
> > That's OK if you don't use them in blocks, but for the distributed code,
> > we need to find a more generalized solution.
>
> Right. Having just found the block documentation, I can indeed see this
> as a useful feature, and probably something that I would use if the
> performance hit wasn't so bad.
>
> One thing I could think of that could help performance quite
> considerably is to have an array of type *HtRegexList that could contain
> the parsed excludes list/badquery lists, etc. per block. Or perhaps a
> struct that contains all parsed config attributes per block and have an
> array of pointers to it. This way the config could be loaded and still
> only need to be parsed once. The only downside I could see is that this
> would mean htdig would have a slightly larger memory footprint, but I
> don't really see that as a big problem. We're probably talking about a
> couple k more; by today's standards, even a couple meg more wouldn't be
> a big deal.
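[As a rough illustration of the per-block cache idea quoted above: the class names, std::map keying, and std::regex stand-in below are invented for the sketch, not ht://Dig's actual HtRegexList API.]

```cpp
// Purely illustrative sketch of a per-block cache of pre-compiled patterns,
// standing in for the "array of *HtRegexList per block" idea.  All names
// here are hypothetical; this is not ht://Dig code.
#include <map>
#include <regex>
#include <string>
#include <vector>

// One entry per server/URL block: its patterns are compiled once, up front.
struct BlockPatterns {
    std::vector<std::regex> excludeUrls;   // compiled from exclude_urls
    std::vector<std::regex> badQueryStr;   // compiled from bad_querystr
};

class PatternCache {
public:
    // Compile and store the patterns for one block (done once at startup),
    // so the config only ever needs to be parsed a single time.
    void addBlock(const std::string& blockKey,
                  const std::vector<std::string>& excludes,
                  const std::vector<std::string>& badQueries) {
        BlockPatterns bp;
        for (const auto& e : excludes)   bp.excludeUrls.emplace_back(e);
        for (const auto& b : badQueries) bp.badQueryStr.emplace_back(b);
        cache_[blockKey] = std::move(bp);
    }

    // Per-URL check reuses the already-compiled patterns; nothing is
    // re-parsed, so the per-URL cost is just the lookup plus the matches.
    bool isExcluded(const std::string& blockKey, const std::string& url) const {
        auto it = cache_.find(blockKey);
        if (it == cache_.end())
            return false;                  // no block-specific rules
        for (const auto& re : it->second.excludeUrls)
            if (std::regex_search(url, re))
                return true;
        return false;
    }

private:
    std::map<std::string, BlockPatterns> cache_;
};
```

[The point of that layout is that the expensive step, compiling the patterns, happens once per block at configuration time; the extra memory is just the compiled patterns, which matches the "couple k more" estimate above.]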
There's an idea worth considering.  It's quite a bit more complicated than
the quick fix I had in mind, but probably much simpler than a full-blown
caching scheme.  It would also help the case where regex-based attributes
are used in URL or server blocks, which my proposed fix would only
marginally help.

> > 3) We may also need to determine if the repeated calls to config->Find()
> > at each URL are having an impact on performance as well.  E.g. what is
> > the performance cost of doing thousands of calls like this one?
> >
> >    tmpList.Create(config->Find(&aUrl, "exclude_urls"), " \t");
>
> Easy thing to test. I'll give it a try later this week if I can,
> perhaps tomorrow, and report back.

Great.  I'll try to get my fix to Regex.cc in by the end of the week too,
so it would be great if you could give it a whirl.  It would probably mean
having to back out your own patch, though, or it wouldn't really get tested.

Neal, I'd still like your opinion on whether making these HtRegexList
variables global will be a problem for libhtdig.  Looking at the code, I
see that "limits" and "limitsn", set by limit_urls_to and limit_normalized,
are already global, but they are defined in htdig.cc rather than
Retriever.cc.  Does this matter?  I imagine it just means making parallel
changes to libhtdig_htdig.cc, but right now that file doesn't even seem to
be making use of URL blocks, as it doesn't pass aUrl to
HtConfiguration::Find().  Is this an oversight, or am I missing something?

--
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
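[For the question of what thousands of config->Find() calls cost, a micro-benchmark shaped roughly like the following is one way to measure it.  LookupConfig is a simplified stand-in for HtConfiguration and the iteration count is arbitrary; treat it only as a sketch of the test, not as ht://Dig code.]

```cpp
// Rough timing sketch for the "thousands of config->Find() calls" question.
// LookupConfig is an invented stand-in; only the shape of the test is
// meant to be representative, not the real HtConfiguration API.
#include <chrono>
#include <cstddef>
#include <iostream>
#include <map>
#include <string>

struct LookupConfig {
    std::map<std::string, std::string> values;
    // Simplified stand-in for config->Find(&aUrl, "exclude_urls"):
    // the URL argument is ignored here, leaving just the keyed lookup.
    const std::string& Find(const std::string& key) const {
        static const std::string empty;
        auto it = values.find(key);
        return it == values.end() ? empty : it->second;
    }
};

int main() {
    LookupConfig config;
    config.values["exclude_urls"] = "/cgi-bin/ .cgi /tmp/";

    const int iterations = 10000;          // "thousands of calls"
    std::size_t total = 0;                 // keep the call from being optimized away
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        total += config.Find("exclude_urls").size();
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> elapsed = stop - start;
    std::cout << iterations << " lookups took " << elapsed.count()
              << " ms (checksum " << total << ")\n";
}
```

[Running the same loop once with the lookup inside and once with the result cached outside the loop would give a first estimate of how much of the per-URL time goes into the lookup itself.]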