Earlier I wrote:
> According to Christopher Murtagh:
> >  Easy thing to test. I'll give it a try later this week if I can,
> > perhaps tomorrow, and report back.
> 
> Great.  I'll try to get my fix to Regex.cc in by the end of the week too,
> so it would be great if you could give it a whirl.  It would probably
> mean having to back out your own patch, though, or it wouldn't really
> get tested.

OK, the simple fix I had in mind was indeed very simple to implement, but
unfortunately ineffective.  The problem is the way HtRegexList uses (or
abuses) the HtRegex class: it repeatedly calls HtRegex::set() with an
increasingly complex pattern until compilation fails, to see how large a
pattern it can build before it has to break the list up into separate
expressions.  Of course, this completely defeats any attempt to fix the
problem at the level of HtRegex.

Here's the result of a trace print I added to HtRegex::set() which shows
how htdig deals with these two simple patterns:

exclude_urls:   /cgi-bin/ .cgi
bad_querystr:   C=D C=M C=N C=S O=A O=D

For every URL htdig encountered within every document it parsed, it spat out
the following:

compiling pattern: /cgi-bin/
compiling pattern: /cgi-bin/|\.cgi
compiling pattern: C=D
compiling pattern: C=D|C=M
compiling pattern: C=D|C=M|C=N
compiling pattern: C=D|C=M|C=N|C=S
compiling pattern: C=D|C=M|C=N|C=S|O=A
compiling pattern: C=D|C=M|C=N|C=S|O=A|O=D

You can see how this would rapidly degenerate with lots of documents
containing lots of links, and with more complex patterns in those two
attributes!  No wonder this was such a big problem.

I'll try again tomorrow with a similar fix in HtRegexList::setEscaped()
to see how much that helps matters.  It's a bit more complicated there
because we're dealing with lists instead of strings, but it shouldn't
be too nasty.

On the other hand, if these two attributes are the only ones that pose
a problem, maybe we should just deal with them in Retriever.cc and be
done with it!  Rather than using a flag, as Chris and Lachlan's patches
do, I was thinking of saving the string returned by

    config->Find(&aUrl, "exclude_urls")

and comparing it to the string we had the last time through.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)


_______________________________________________
ht://Dig Developer mailing list:
[EMAIL PROTECTED]
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-dev
