Hi, Chris and other developers.  The problem with this fix is that
exclude_urls and bad_querystr can no longer be used in server blocks or
URL blocks, as they'll only be parsed once regardless of how they're used.
That's OK if you don't use them in blocks, but for the distributed code,
we need a more general solution.  There may also be other regex-based
attributes that need the same optimization.

However, your fix and the discussion that led up to it did give me an
inspiration for a fairly simple fix to the Regex class to optimize all
uses of it in a general way.  Instead of just keeping a flag in the Regex
object that says whether the pattern has been compiled, why not have it
hold on to a copy of the pattern string it last compiled?  That way,
whenever the set() method is
called again, it can check to see if it's being asked to compile the same
pattern over again.  If it is, it knows it doesn't need to call regcomp()
again.  Presumably, it's regcomp() that's the real time killer here,
and not all the attribute and string handling done at a higher level.
The repeated string comparisons should be cheap in comparison to all the
regcomp() calls on those strings.  I'll see if I can find some time to
work out a patch for this.
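
The idea could be sketched roughly like this -- a hypothetical
CachedRegex stand-in, not htdig's actual Regex/HtRegex interface, with
a compile counter added just to make the caching visible:

```cpp
#include <regex.h>
#include <string>

// Sketch of the proposed optimization: remember the last pattern
// string so set() can skip regcomp() when asked to compile the same
// pattern again.  Names here are illustrative only.
class CachedRegex {
public:
    CachedRegex() : compiled(false), compileCount(0) {}
    ~CachedRegex() { if (compiled) regfree(&re); }

    // Returns true if the pattern is usable (fresh compile or cache hit).
    bool set(const std::string &pattern) {
        if (compiled && pattern == lastPattern)
            return true;                 // same pattern: skip regcomp()
        if (compiled) { regfree(&re); compiled = false; }
        if (regcomp(&re, pattern.c_str(), REG_EXTENDED) != 0)
            return false;
        lastPattern = pattern;
        compiled = true;
        ++compileCount;                  // counts real regcomp() calls
        return true;
    }

    bool match(const std::string &s) const {
        return compiled && regexec(&re, s.c_str(), 0, nullptr, 0) == 0;
    }

    int compiles() const { return compileCount; }

private:
    regex_t re;
    std::string lastPattern;
    bool compiled;
    int compileCount;
};
```

Calling set() thousands of times with an unchanged pattern then costs
one string comparison per call instead of one regcomp().  Of course,
the object has to persist between calls for the cached pattern to do
any good, which is exactly complication 1 below.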

Complications:
1) The variables that use the HtRegex and HtRegexList classes will need to
be global, or otherwise made persistent, so that they can take advantage
of the optimization.  Is this going to be a problem with libhtdig, Neal?
What is the best way to approach this issue?
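
One possible middle ground, sketched here under the assumption that a
function-local static is acceptable from libhtdig's point of view.
PatternCache is a stand-in for HtRegexList, and the real set() would
call regcomp(); here we just count compiles to show the caching:

```cpp
#include <string>

// Stand-in for HtRegexList with the last-pattern cache proposed above.
struct PatternCache {
    std::string lastPattern;
    int compiles = 0;

    void set(const std::string &pattern) {
        if (compiles > 0 && pattern == lastPattern)
            return;                      // cache hit: skip recompiling
        lastPattern = pattern;
        ++compiles;                      // stand-in for regcomp()
    }
};

// Function-local static: constructed once, persists for the life of
// the process, but adds no new extern globals -- possibly easier to
// manage from libhtdig than the globals in the quick fix below.
PatternCache &excludesCache() {
    static PatternCache cache;
    return cache;
}
```

With this shape, repeated calls from Retriever.cc would hit the same
persistent object without any new names in htdig.h.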

2) The savings from not calling regcomp(), while likely substantial, may
not be enough on their own to match the performance gain of the quick fix
below.  If so, we could also look into higher-level fixes, e.g. in
HtRegexList::setEscaped(), to save some of the string handling that takes
place there.  Chris, would you be willing to do some
comparative performance testing on various patches, if it comes to that?

3) We may also need to determine whether the repeated calls to
config->Find() at each URL are hurting performance.  E.g., what is
the performance cost of doing thousands of calls like this one?

     tmpList.Create(config->Find(&aUrl, "exclude_urls"), " \t");

We might need to do some more profiling as well.
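
To put a rough number on it before reaching for a profiler, a
micro-benchmark along these lines might help.  parseList() is just a
whitespace splitter of roughly comparable cost standing in for the
tmpList.Create(config->Find(...)) work; it is NOT htdig code:

```cpp
#include <chrono>
#include <sstream>
#include <string>
#include <vector>

// Stand-in for the per-URL attribute lookup and list-splitting work.
std::vector<std::string> parseList(const std::string &s) {
    std::vector<std::string> out;
    std::istringstream in(s);
    std::string word;
    while (in >> word)
        out.push_back(word);             // split on whitespace
    return out;
}

// Time n repeated parses, the way Retriever.cc repeats them per URL.
long microsFor(int n, const std::string &value) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i)
        parseList(value);
    auto t1 = std::chrono::steady_clock::now();
    return (long)std::chrono::duration_cast<std::chrono::microseconds>(
        t1 - t0).count();
}
```

Comparing microsFor(100000, value) against the cost of a single parse
would show whether the per-URL re-parsing alone is significant, quite
apart from the regcomp() calls.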

Thoughts?

According to Christopher Murtagh:
>  Ok, I've solved my problem, and can now have a list of working
> exclude_urls without the serious performance decrease. Here are the
> changes I made (sorry I'm not sending a proper diff file... need
> guidance on how to do that properly):
> 
> 
> htdig/htdig.h
> --------------------
> 
> added:
> 
> extern int exclude_checked;
> extern int badquerystr_checked;
> extern HtRegexList  excludes;
> extern HtRegexList  badquerystr;
> 
> 
> 
> htdig/htdig.cc
> ----------------------
> 
> added these as global variable definitions:
> 
> int exclude_checked = 0;
> int badquerystr_checked = 0;
> 
> HtRegexList     excludes;
> HtRegexList     badquerystr;
> 
> 
> htdig/Retriever.cc
> 
> added these conditionals and removed the previous tmplist creates and
> .setEscaped() calls:
> 
> if(!(exclude_checked)){
>     //only parse this once and store into global variable
>     tmpList.Destroy();
>     tmpList.Create(config->Find(&aUrl, "exclude_urls"), " \t");
>     excludes.setEscaped(tmpList, config->Boolean("case_sensitive"));
>     exclude_checked = 1;
> }
> 
> if(!(badquerystr_checked)){
>     //only parse this once and store into global variable
>     tmpList.Destroy();
>     tmpList.Create(config->Find(&aUrl, "bad_querystr"), " \t");
>     badquerystr.setEscaped(tmpList, config->Boolean("case_sensitive"));
>     badquerystr_checked = 1;
> }
> 
>  The difference in performance is night and day, and the excludes list
> is only parsed once per dig rather than at *every* URL found.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)


_______________________________________________
ht://Dig Developer mailing list:
[EMAIL PROTECTED]
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-dev
