According to me:
> According to Christopher Murtagh:
> >  Then I re-compiled and ran with my normal excludes URL list. It didn't
> > seem to have much of an impact on performance. This means that the
> > performance hit is definitely in the HtRegexList::setEscaped method.
> 
> Thanks, Chris.  That's good to know!  Maybe then we don't need to use
> Lachlan's patch to the config parser to track whether server or URL
> blocks were defined.  I'll try a quick fix to Retriever.cc, but I'll
> also try to find if there are other uses of HtRegexList that may need
> attention.

Well, there are some uses of it in htsearch-related code that I'm not
all that sure about, but let's cross that bridge when we get to it.
In any case, the optimization needs to be done right where the
HtRegexList object is created, rather than buried in the class
definition, because the object has to stick around between calls or
any optimization will have no effect.
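
To illustrate the point about object lifetime, here's a minimal
standalone sketch of the idea, not the actual ht://Dig code.  It
assumes a hypothetical PatternList class standing in for HtRegexList
(built on std::regex, and without the escaping that setEscaped does),
and a hypothetical is_excluded() function with a rebuild_count counter
added only so the caching can be observed.  The pattern list is kept
in a static and recompiled only when the attribute string changes.

```cpp
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for HtRegexList: a list of compiled patterns.
// Unlike setEscaped(), each word is compiled as a plain regex here.
struct PatternList {
    std::vector<std::regex> patterns;

    // Split the spec on whitespace and compile each word (the costly step).
    void set(const std::string &spec) {
        patterns.clear();
        std::istringstream in(spec);
        std::string word;
        while (in >> word)
            patterns.emplace_back(word);
    }

    // True if any pattern matches somewhere in s.
    bool match(const std::string &s) const {
        for (const auto &re : patterns)
            if (std::regex_search(s, re))
                return true;
        return false;
    }
};

// Illustration only: counts how many times the list was recompiled.
static int rebuild_count = 0;

// Hypothetical helper: checks url against a whitespace-separated list.
bool is_excluded(const std::string &url, const std::string &exclude_spec) {
    // Statics survive between calls, so the compiled list is reused.
    static std::string prev_spec;
    static PatternList *excludes = nullptr;

    // Recompile only on the first call or when the spec has changed.
    if (!excludes || prev_spec != exclude_spec) {
        if (!excludes)
            excludes = new PatternList;
        excludes->set(exclude_spec);
        prev_spec = exclude_spec;
        ++rebuild_count;
    }
    return excludes->match(url);
}
```

Calling is_excluded() repeatedly with the same list string compiles the
patterns once; only a different string (as could happen with
per-server or per-URL blocks) triggers a recompile.  If the object
were a local variable instead, it would be rebuilt on every call no
matter how the class itself was tuned.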

So, here's my simple stab at fixing Retriever.cc, with no other files
needing patches.  It should be used instead of Chris's and Lachlan's
patches from earlier this week, not put on top of them.  I've built
this patch over the current CVS code for Retriever.cc (as of Apr 7),
but it does seem to apply to a vanilla 3.2.0b5 source as well.

Please test this out and make sure it doesn't cause any problems, and
that it helps!  Apply using "patch -p0 < this-message-file".

--- htdig/Retriever.cc.orig     2004-04-07 17:02:00.000000000 -0500
+++ htdig/Retriever.cc  2004-04-22 16:45:52.000000000 -0500
@@ -995,10 +995,21 @@ int Retriever::IsValidURL(const String &
        // If the URL contains any of the patterns in the exclude list,
        // mark it as invalid
        //
-       tmpList.Create(config->Find(&aUrl, "exclude_urls"), " \t");
-       HtRegexList excludes;
-       excludes.setEscaped(tmpList, config->Boolean("case_sensitive"));
-       if (excludes.match(url, 0, 0) != 0)
+       String exclude_urls = config->Find(&aUrl, "exclude_urls");
+       static String *prevexcludes = 0;
+       static HtRegexList *excludes = 0;
+       if (!excludes || !prevexcludes || prevexcludes->compare(exclude_urls) != 0)
+       {
+               if (!excludes)
+                       excludes = new HtRegexList;
+               if (prevexcludes)
+                       delete prevexcludes;
+               prevexcludes = new String(exclude_urls);
+               tmpList.Create(exclude_urls, " \t");
+               excludes->setEscaped(tmpList, config->Boolean("case_sensitive"));
+               tmpList.Destroy();
+       }
+       if (excludes->match(url, 0, 0) != 0)
        {
                if (debug > 2)
                        cout << endl << "   Rejected: item in exclude list ";
@@ -1009,12 +1020,22 @@ int Retriever::IsValidURL(const String &
        // If the URL has a query string and it is in the bad query list
        // mark it as invalid
        //
-       tmpList.Destroy();
-       tmpList.Create(config->Find(&aUrl, "bad_querystr"), " \t");
-       HtRegexList badquerystr;
-       badquerystr.setEscaped(tmpList, config->Boolean("case_sensitive"));
+       String bad_querystr = config->Find(&aUrl, "bad_querystr");
+       static String *prevbadquerystr = 0;
+       static HtRegexList *badquerystr = 0;
+       if (!badquerystr || !prevbadquerystr || prevbadquerystr->compare(bad_querystr) != 0)
+       {
+               if (!badquerystr)
+                       badquerystr = new HtRegexList;
+               if (prevbadquerystr)
+                       delete prevbadquerystr;
+               prevbadquerystr = new String(bad_querystr);
+               tmpList.Create(bad_querystr, " \t");
+               badquerystr->setEscaped(tmpList, config->Boolean("case_sensitive"));
+               tmpList.Destroy();
+       }
        char *ext = strrchr((char *) url, '?');
-       if (ext && badquerystr.match(ext, 0, 0) != 0)
+       if (ext && badquerystr->match(ext, 0, 0) != 0)
        {
                if (debug > 2)
                        cout << endl << "   Rejected: item in bad query list ";


-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

