On Thu, 27 Sep 2001, Gilles Detillieux wrote:

> Date: Thu, 27 Sep 2001 12:23:08 -0500 (CDT)
> From: Gilles Detillieux <[EMAIL PROTECTED]>
> To: Joe R. Jah <[EMAIL PROTECTED]>
> Cc: Gilles Detillieux <[EMAIL PROTECTED]>,
     Geoff Hutchison <[EMAIL PROTECTED]>,
     [EMAIL PROTECTED]
> Subject: Re: [htdig-dev] Re: URL Rewrite patch for 3.1.6 snapshots
> 
> Geoff Hutchison responded:
> > This tells me that on your system, the rx library is faster than the 
> > system library regex calls. On the other hand, many people cheered 
> > when we switched from the rx library to regex for the Endings fuzzy 
> > generation. (Remember the complaints that German endings took weeks 
> > to generate?)
> > 
> > This may explain some of the different performance reports with the 
> > 3.2 code as this uses regex calls heavily.
> > 
> > (Hmm. Maybe the configure test should try benchmarking...)
> 
> If I recall, though, when Joe did profiling on 3.2, there were millions
> of unexplained calls to regcomp(), when there really shouldn't have been
> more than one or a few per document.  We never got to the bottom of this.
> 
> In terms of execution speeds of regex code, I think most of the delays
> would be due to regcomp(), as regexec() is normally pretty quick.
> However, the url_rewrite_rules support in 3.1.6 should only result in at
> most a few calls to regcomp() at the very start, when the HtURLRewriter
> instance is first created.  It shouldn't be continually calling regcomp(),
> so I don't think it should add an appreciable delay to the digging
> process, assuming the code is working correctly.
> 
> I'd be very interested in seeing some profiling done with 3.1.6 on Joe's
> system.
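
To illustrate the compile-once behaviour Gilles describes, here is a
minimal sketch.  It is not the HtURLRewriter code: Python's re module
stands in for regcomp()/regexec(), and the UrlRewriter class, its
compile_calls counter, the rule and the example.com URLs are all
hypothetical.  The point is only that the number of compilations should
track the number of rules, not the number of URLs processed.

```python
import re


class UrlRewriter:
    """Hypothetical rewriter that compiles each rule once, at construction."""

    def __init__(self, rules):
        self.compile_calls = 0          # counts compilations, for checking
        self.rules = []
        for pattern, replacement in rules:
            self.rules.append((re.compile(pattern), replacement))
            self.compile_calls += 1

    def rewrite(self, url):
        # Only the already-compiled patterns are used here; nothing is
        # compiled per URL.
        for regex, replacement in self.rules:
            url = regex.sub(replacement, url)
        return url


# One made-up rule applied to 1000 made-up URLs.
rewriter = UrlRewriter([(r"^http://www\.example\.com/", "http://example.com/")])
urls = ["http://www.example.com/page%d.html" % i for i in range(1000)]
rewritten = [rewriter.rewrite(u) for u in urls]
print(rewriter.compile_calls, len(rewritten))
```

If a profile showed the compile count growing with the number of
documents instead, that would point at a pattern being rebuilt inside the
per-document loop.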

I finally got 3.2.0b4 to compile and run by turning profiling off; for
some reason, profiling caused it to segfault.  Anyway, it took 30 hours to
index my site, so I have placed it on the back burner for now.  I will try
to compile and run 3.1.6 with profiling, if it cooperates;)

> Also, looking at Joe's logs above, I notice a few important differences:
> 
> - the word counts and document counts are different.  Surprisingly,
> they're smaller for the slower dig, but it indicates you're not digging
> exactly the same site.

I am digging the same site.  Getting fewer documents from running the
latest snapshot hasn't escaped my attention; in fact that is the reason
why I have kept that old snapshot.  All subsequent snapshots result in
around 200 fewer documents.  I haven't figured out whether the old
snapshot let some duplicates through or the newer snapshots miss about
200 documents;-/

> - the snapshots are different.  Was there really a snapshot on Wed.,
> Aug 29, or is that supposed to be 082601?  In either case, there were
> code changes made to the HTML parser between the two snapshots, so
> we don't know if part of the problem is that the new parser is slower.
> It might be useful to compare the timings of the two parser versions.

Here is a more complete comparison.  The first two digs just show that a
couple of hundred documents do not make much of a difference in digging
time.  This is a dynamic site; it changes continuously;)  The second and
third digs show the reduction in document count from the 3.1.6-082901
snapshot to the 3.1.6-092301 snapshot.  The third and fourth digs show a
dramatic difference in digging duration, caused by nothing other than the
difference between the two patches:

________3.1.6-082901 + ssl.4 + Armstrong patch_________
htdig:    Start digging:   Tue Sep 25 19:43:14 PDT 2001
htmerge:  Start merging:   Tue Sep 25 20:09:57 PDT 2001   27 minutes
htmerge:  Total word count: 110412
htmerge:  Total documents: 7279
htmerge:  Total doc db size (in K): 117405
htnotify: Start notifying: Tue Sep 25 20:12:25 PDT 2001
htfuzzy:  Start fuzzying:  Tue Sep 25 20:12:33 PDT 2001
rundig:   end rundig:      Tue Sep 25 20:13:35 PDT 2001
________3.1.6-082901 + ssl.4 + Armstrong patch_________
htdig:    Start digging:   Thu Sep 27 16:37:23 PDT 2001
htmerge:  Start merging:   Thu Sep 27 17:05:20 PDT 2001   28 minutes
htmerge:  Total word count: 111521
htmerge:  Total documents: 7429
htmerge:  Total doc db size (in K): 118940
htnotify: Start notifying: Thu Sep 27 17:07:46 PDT 2001
htfuzzy:  Start fuzzying:  Thu Sep 27 17:07:54 PDT 2001
rundig:   end rundig:      Thu Sep 27 17:08:58 PDT 2001
________3.1.6-092301 + ssl.4 + Armstrong patch_________
htdig:    Start digging:   Thu Sep 27 18:55:58 PDT 2001
htmerge:  Start merging:   Thu Sep 27 19:23:02 PDT 2001   27 minutes
htmerge:  Total word count: 109506
htmerge:  Total documents: 7156
htmerge:  Total doc db size (in K): 117748
htnotify: Start notifying: Thu Sep 27 19:25:17 PDT 2001
htfuzzy:  Start fuzzying:  Thu Sep 27 19:25:24 PDT 2001
rundig:   end rundig:      Thu Sep 27 19:26:32 PDT 2001
_________3.1.6-092301 + ssl.4 + Geoff's patch__________
htdig:    Start digging:   Wed Sep 26 15:44:56 PDT 2001
htmerge:  Start merging:   Wed Sep 26 17:10:49 PDT 2001   86 minutes
htmerge:  Total word count: 107762
htmerge:  Total documents: 7095
htmerge:  Total doc db size (in K): 115092
htnotify: Start notifying: Wed Sep 26 17:13:02 PDT 2001
htfuzzy:  Start fuzzying:  Wed Sep 26 17:13:10 PDT 2001
rundig:   end rundig:      Wed Sep 26 17:14:16 PDT 2001
_______________________________________________________
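
The minute figure beside each "Start merging" line is the gap between the
"Start digging" and "Start merging" timestamps.  A minimal sketch for
recomputing it follows; the function names are made up for illustration,
and it assumes both timestamps are in the same time zone (the zone name
is simply ignored) and that month names are in English.

```python
from datetime import datetime


def parse_stamp(stamp):
    """Parse a stamp like 'Tue Sep 25 19:43:14 PDT 2001'.

    The weekday and zone name are dropped, so both stamps being compared
    are assumed to share one time zone.
    """
    _weekday, month, day, clock, _zone, year = stamp.split()
    return datetime.strptime(" ".join([month, day, clock, year]),
                             "%b %d %H:%M:%S %Y")


def dig_minutes(start_digging, start_merging):
    """Elapsed digging time, rounded to the nearest minute."""
    elapsed = parse_stamp(start_merging) - parse_stamp(start_digging)
    return round(elapsed.total_seconds() / 60)


# Timestamp pairs copied from the four digs in the log above.
digs = [
    ("Tue Sep 25 19:43:14 PDT 2001", "Tue Sep 25 20:09:57 PDT 2001"),
    ("Thu Sep 27 16:37:23 PDT 2001", "Thu Sep 27 17:05:20 PDT 2001"),
    ("Thu Sep 27 18:55:58 PDT 2001", "Thu Sep 27 19:23:02 PDT 2001"),
    ("Wed Sep 26 15:44:56 PDT 2001", "Wed Sep 26 17:10:49 PDT 2001"),
]
minutes = [dig_minutes(start, merge) for start, merge in digs]
print(minutes)
```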

Regards,

Joe
-- 
     _/   _/_/_/       _/              ____________    __o
     _/   _/   _/      _/         ______________     _-\<,_
 _/  _/   _/_/_/   _/  _/                     ......(_)/ (_)
  _/_/ oe _/   _/.  _/_/ ah        [EMAIL PROTECTED]


_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
