On Thu, 22 Apr 2004, Gilles Detillieux wrote:

> Date: Thu, 22 Apr 2004 17:17:03 -0500 (CDT)
> From: Gilles Detillieux <[EMAIL PROTECTED]>
> To: "ht://Dig developers list" <[EMAIL PROTECTED]>
> Cc: Christopher Murtagh <[EMAIL PROTECTED]>
> Subject: Re: [htdig-dev] PATCH: Performance issue with exclude_urls
> 
> According to me:
> > According to Christopher Murtagh:
> > >  Then I re-compiled and ran with my normal excludes URL list. It didn't
> > > seem to have much of an impact on performance. This means that the
> > > performance hit is definitely in the HtRegexList::setEscaped method.
> > 
> > Thanks, Chris.  That's good to know!  Maybe then we don't need to use
> > Lachlan's patch to the config parser to track whether server or URL
> > blocks were defined.  I'll try a quick fix to Retriever.cc, but I'll
> > also try to find if there are other uses of HtRegexList that may need
> > attention.
> 
> Well, there are some uses of it in htsearch related code that I'm not
> all that sure about, but let's just cross that bridge when we get to it.
> In any case, the optimizations tend to need to be done right where the
> HtRegexList object is created anyway, rather than buried in the class
> definition, because you need to make sure the object sticks around or
> any optimization will have no effect.
> 
> So, here's my simple stab at fixing Retriever.cc, with no other files
> needing patches.  It should be used instead of Chris's and Lachlan's
> patches from earlier this week, not put on top of them.  I've built
> this patch over the current CVS code for Retriever.cc (as of Apr 7),
> but it does seem to apply to a vanilla 3.2.0b5 source as well.
> 
> Please test this out and make sure it doesn't cause any problems, and
> that it helps!  Apply using "patch -p0 < this-message-file".

I backed out both slightly-better.0 and exclude_perform.0 and applied this
patch as exclude_perform.1.  I then ran htdig on the same site as before to
collect a profile; htdig ran ~43% faster than the first time ;)  Here is
the profile:

 ftp://ftp.ccsf.org/htdig-patches/3.2.0b5/htdig.gmon.exclude_perform.1.gz
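
For anyone reading the patch cold, the idea behind it can be sketched in
isolation: keep the compiled pattern list alive between calls and rebuild
it only when the configured pattern string changes.  This is a minimal
sketch under stated assumptions, not ht://Dig code.  CachedPatternList and
its members are hypothetical names, std::regex stands in for HtRegexList,
and treating every pattern as an escaped literal substring is my
simplification.

```cpp
#include <cassert>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the cached HtRegexList in the patch: the
// compiled patterns persist across calls and are rebuilt only when the
// whitespace-separated pattern string differs from the previous one.
class CachedPatternList {
public:
    // Returns true if any pattern in `patterns` occurs in `text`.
    bool matches(const std::string &patterns, const std::string &text) {
        if (!valid_ || patterns != previous_)
            rebuild(patterns);              // the expensive step, now rare
        assert(valid_ && previous_ == patterns);
        for (const std::regex &re : compiled_)
            if (std::regex_search(text, re))
                return true;
        return false;
    }

    // Number of times the pattern list has been recompiled.
    int rebuilds() const { return rebuilds_; }

private:
    // Assumption: patterns are literal substrings, so every regex
    // metacharacter is backslash-escaped before compiling.
    static std::string escape(const std::string &word) {
        static const std::string meta = "^$\\.*+?()[]{}|";
        std::string out;
        for (char c : word) {
            if (meta.find(c) != std::string::npos)
                out += '\\';
            out += c;
        }
        return out;
    }

    void rebuild(const std::string &patterns) {
        compiled_.clear();
        std::istringstream in(patterns);    // splits on spaces and tabs
        std::string word;
        while (in >> word)
            compiled_.emplace_back(escape(word));
        previous_ = patterns;
        valid_ = true;
        ++rebuilds_;
    }

    std::vector<std::regex> compiled_;
    std::string previous_;
    bool valid_ = false;
    int rebuilds_ = 0;
};
```

Calling matches() repeatedly with the same pattern string compiles the
list once; only a different string (for example, from a server or URL
block with its own exclude_urls) triggers another rebuild, which
rebuilds() lets you confirm.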

Regards,

Joe
-- 
     _/   _/_/_/       _/              ____________    __o
     _/   _/   _/      _/         ______________     _-\<,_
 _/  _/   _/_/_/   _/  _/                     ......(_)/ (_)
  _/_/ oe _/   _/.  _/_/ ah        [EMAIL PROTECTED]


> --- htdig/Retriever.cc.orig   2004-04-07 17:02:00.000000000 -0500
> +++ htdig/Retriever.cc        2004-04-22 16:45:52.000000000 -0500
> @@ -995,10 +995,21 @@ int Retriever::IsValidURL(const String &
>       // If the URL contains any of the patterns in the exclude list,
>       // mark it as invalid
>       //
> -     tmpList.Create(config->Find(&aUrl, "exclude_urls"), " \t");
> -     HtRegexList excludes;
> -     excludes.setEscaped(tmpList, config->Boolean("case_sensitive"));
> -     if (excludes.match(url, 0, 0) != 0)
> +     String exclude_urls = config->Find(&aUrl, "exclude_urls");
> +     static String *prevexcludes = 0;
> +     static HtRegexList *excludes = 0;
> +     if (!excludes || !prevexcludes || prevexcludes->compare(exclude_urls) != 0)
> +     {
> +             if (!excludes)
> +                     excludes = new HtRegexList;
> +             if (prevexcludes)
> +                     delete prevexcludes;
> +             prevexcludes = new String(exclude_urls);
> +             tmpList.Create(exclude_urls, " \t");
> +             excludes->setEscaped(tmpList, config->Boolean("case_sensitive"));
> +             tmpList.Destroy();
> +     }
> +     if (excludes->match(url, 0, 0) != 0)
>       {
>               if (debug > 2)
>                       cout << endl << "   Rejected: item in exclude list ";
> @@ -1009,12 +1020,22 @@ int Retriever::IsValidURL(const String &
>       // If the URL has a query string and it is in the bad query list
>       // mark it as invalid
>       //
> -     tmpList.Destroy();
> -     tmpList.Create(config->Find(&aUrl, "bad_querystr"), " \t");
> -     HtRegexList badquerystr;
> -     badquerystr.setEscaped(tmpList, config->Boolean("case_sensitive"));
> +     String bad_querystr = config->Find(&aUrl, "bad_querystr");
> +     static String *prevbadquerystr = 0;
> +     static HtRegexList *badquerystr = 0;
> +     if (!badquerystr || !prevbadquerystr || prevbadquerystr->compare(bad_querystr) != 0)
> +     {
> +             if (!badquerystr)
> +                     badquerystr = new HtRegexList;
> +             if (prevbadquerystr)
> +                     delete prevbadquerystr;
> +             prevbadquerystr = new String(bad_querystr);
> +             tmpList.Create(bad_querystr, " \t");
> +             badquerystr->setEscaped(tmpList, config->Boolean("case_sensitive"));
> +             tmpList.Destroy();
> +     }
>       char *ext = strrchr((char *) url, '?');
> -     if (ext && badquerystr.match(ext, 0, 0) != 0)
> +     if (ext && badquerystr->match(ext, 0, 0) != 0)
>       {
>               if (debug > 2)
>                       cout << endl << "   Rejected: item in bad query list ";
> 
> 
> -- 
> Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
> Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
> Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)


