On Thu, 22 Apr 2004, Gilles Detillieux wrote:

> Date: Thu, 22 Apr 2004 17:17:03 -0500 (CDT)
> From: Gilles Detillieux <[EMAIL PROTECTED]>
> To: "ht://Dig developers list" <[EMAIL PROTECTED]>
> Cc: Christopher Murtagh <[EMAIL PROTECTED]>
> Subject: Re: [htdig-dev] PATCH: Performance issue with exclude_urls
>
> According to me:
> > According to Christopher Murtagh:
> > > Then I re-compiled and ran with my normal excludes URL list. It didn't
> > > seem to have much of an impact on performance. This means that the
> > > performance hit is definitely in the HtRegexList::setEscaped method.
> >
> > Thanks, Chris. That's good to know! Maybe then we don't need to use
> > Lachlan's patch to the config parser to track whether server or URL
> > blocks were defined. I'll try a quick fix to Retriever.cc, but I'll
> > also try to find if there are other uses of HtRegexList that may need
> > attention.
>
> Well, there are some uses of it in htsearch related code that I'm not
> all that sure about, but let's just cross that bridge when we get to it.
> In any case, the optimizations tend to need to be done right where the
> HtRegexList object is created anyway, rather than buried in the class
> definition, because you need to make sure the object sticks around or
> any optimization will have no effect.
>
> So, here's my simple stab at fixing Retriever.cc, with no other files
> needing patches. It should be used instead of Chris's and Lachlan's
> patches from earlier this week, not put on top of them. I've built
> this patch over the current CVS code for Retriever.cc (as of Apr 7),
> but it does seem to apply to a vanilla 3.2.0b5 source as well.
>
> Please test this out and make sure it doesn't cause any problems, and
> that it helps! Apply using "patch -p0 < this-message-file".
I backed out both slightly-better.0 and exclude_perform.0 and applied
exclude_perform.1, this patch. I ran htdig on the same site as before
for profile; htdig ran ~43% faster than the first time ;)

Here is the profile:

ftp://ftp.ccsf.org/htdig-patches/3.2.0b5/htdig.gmon.exclude_perform.1.gz

Regards,

Joe
-- 
_/ _/_/_/ _/ ____________ __o _/ _/ _/ _/ ______________ _-\<,_ _/ _/ _/_/_/ _/ _/ ......(_)/ (_) _/_/ oe _/ _/. _/_/ ah [EMAIL PROTECTED]

> --- htdig/Retriever.cc.orig 2004-04-07 17:02:00.000000000 -0500
> +++ htdig/Retriever.cc 2004-04-22 16:45:52.000000000 -0500
> @@ -995,10 +995,21 @@ int Retriever::IsValidURL(const String &
>      // If the URL contains any of the patterns in the exclude list,
>      // mark it as invalid
>      //
> -    tmpList.Create(config->Find(&aUrl, "exclude_urls"), " \t");
> -    HtRegexList excludes;
> -    excludes.setEscaped(tmpList, config->Boolean("case_sensitive"));
> -    if (excludes.match(url, 0, 0) != 0)
> +    String exclude_urls = config->Find(&aUrl, "exclude_urls");
> +    static String *prevexcludes = 0;
> +    static HtRegexList *excludes = 0;
> +    if (!excludes || !prevexcludes || prevexcludes->compare(exclude_urls) != 0)
> +    {
> +        if (!excludes)
> +            excludes = new HtRegexList;
> +        if (prevexcludes)
> +            delete prevexcludes;
> +        prevexcludes = new String(exclude_urls);
> +        tmpList.Create(exclude_urls, " \t");
> +        excludes->setEscaped(tmpList, config->Boolean("case_sensitive"));
> +        tmpList.Destroy();
> +    }
> +    if (excludes->match(url, 0, 0) != 0)
>      {
>          if (debug > 2)
>              cout << endl << " Rejected: item in exclude list ";
> @@ -1009,12 +1020,22 @@ int Retriever::IsValidURL(const String &
>      // If the URL has a query string and it is in the bad query list
>      // mark it as invalid
>      //
> -    tmpList.Destroy();
> -    tmpList.Create(config->Find(&aUrl, "bad_querystr"), " \t");
> -    HtRegexList badquerystr;
> -    badquerystr.setEscaped(tmpList, config->Boolean("case_sensitive"));
> +    String bad_querystr = config->Find(&aUrl, "bad_querystr");
> +    static String *prevbadquerystr = 0;
> +    static HtRegexList *badquerystr = 0;
> +    if (!badquerystr || !prevbadquerystr || prevbadquerystr->compare(bad_querystr) != 0)
> +    {
> +        if (!badquerystr)
> +            badquerystr = new HtRegexList;
> +        if (prevbadquerystr)
> +            delete prevbadquerystr;
> +        prevbadquerystr = new String(bad_querystr);
> +        tmpList.Create(bad_querystr, " \t");
> +        badquerystr->setEscaped(tmpList, config->Boolean("case_sensitive"));
> +        tmpList.Destroy();
> +    }
>      char *ext = strrchr((char *) url, '?');
> -    if (ext && badquerystr.match(ext, 0, 0) != 0)
> +    if (ext && badquerystr->match(ext, 0, 0) != 0)
>      {
>          if (debug > 2)
>              cout << endl << " Rejected: item in bad query list ";
>
>
> -- 
> Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
> Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
> Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
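
For anyone reading this without the ht://Dig tree at hand, the caching idea in the patch can be sketched in standalone C++. This is only an illustration under stated assumptions, not the ht://Dig code: std::regex stands in for HtRegexList, the patterns are treated as plain regular expressions (the escaping done by setEscaped is not reproduced), and the names isExcluded and compileCount are hypothetical.

```cpp
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical counter, present only to show how often a rebuild happens.
static int compileCount = 0;

// Returns true if url matches any of the whitespace-separated patterns.
// The compiled patterns live in function-local statics, so they survive
// between calls and are rebuilt only when the pattern string changes.
bool isExcluded(const std::string &url, const std::string &patterns)
{
    static bool initialized = false;
    static std::string prevPatterns;
    static std::vector<std::regex> compiled;

    if (!initialized || prevPatterns != patterns)
    {
        compiled.clear();
        std::istringstream in(patterns);
        std::string pat;
        while (in >> pat)               // split on spaces and tabs
            compiled.emplace_back(pat); // compile each pattern once
        prevPatterns = patterns;
        initialized = true;
        ++compileCount;
    }

    for (const std::regex &re : compiled)
        if (std::regex_search(url, re))
            return true;
    return false;
}
```

Calling isExcluded repeatedly with the same pattern string compiles the list once; passing a different string triggers a single rebuild. That comparison against the previous string is what lets the patch keep working when exclude_urls differs between server or URL blocks.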