On Thu, 27 Sep 2001, Gilles Detillieux wrote:
> Date: Thu, 27 Sep 2001 12:23:08 -0500 (CDT)
> From: Gilles Detillieux <[EMAIL PROTECTED]>
> To: Joe R. Jah <[EMAIL PROTECTED]>
> Cc: Gilles Detillieux <[EMAIL PROTECTED]>,
>     Geoff Hutchison <[EMAIL PROTECTED]>,
>     [EMAIL PROTECTED]
> Subject: Re: [htdig-dev] Re: URL Rewrite patch for 3.1.6 snapshots
>
> Geoff Hutchison responded:
> > This tells me that on your system, the rx library is faster than the
> > system library regex calls. On the other hand, many people cheered
> > when we switched from the rx library to regex for the Endings fuzzy
> > generation. (Remember the complaints that German endings took weeks
> > to generate?)
> >
> > This may explain some of the different performance reports with the
> > 3.2 code as this uses regex calls heavily.
> >
> > (Hmm. Maybe the configure test should try benchmarking...)
>
> If I recall, though, when Joe did profiling on 3.2, there were millions
> of unexplained calls to regcomp(), when there really shouldn't have been
> more than one or a few per document. We never got to the bottom of this.
>
> In terms of execution speeds of regex code, I think most of the delays
> would be due to regcomp(), as regexec() is normally pretty quick.
> However, the url_rewrite_rules support in 3.1.6 should only result in at
> most a few calls to regcomp() at the very start, when the HtURLRewriter
> instance is first created. It shouldn't be continually calling regcomp(),
> so I don't think it should add an appreciable delay to the digging
> process, assuming the code is working correctly.
>
> I'd be very interested in seeing some profiling done with 3.1.6 on Joe's
> system.
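To illustrate the compile-once pattern Gilles describes, here is a minimal
sketch in Python rather than the actual C++ code. The class name UrlRewriter,
the example rule, and the URLs are all hypothetical; re.compile() stands in
for regcomp() and pattern.sub() for the regexec() step. Applying every rule
in order is also an assumption, not a statement about how HtURLRewriter
behaves.

```python
import re


class UrlRewriter:
    """Hypothetical stand-in for a rewriter that compiles its rules once."""

    def __init__(self, rules):
        # Compile each pattern exactly once, at construction time
        # (the analogue of calling regcomp() when the instance is created).
        self.rules = [(re.compile(pattern), replacement)
                      for pattern, replacement in rules]
        self.compile_count = len(self.rules)

    def rewrite(self, url):
        # Per-URL work only matches and substitutes
        # (the analogue of regexec()); nothing is recompiled here.
        for pattern, replacement in self.rules:
            url = pattern.sub(replacement, url)
        return url


# One made-up rule, applied to a thousand made-up URLs.
rewriter = UrlRewriter([(r"^http://www\.example\.com/", "http://example.com/")])
urls = ["http://www.example.com/page%d.html" % i for i in range(1000)]
rewritten = [rewriter.rewrite(u) for u in urls]

print(rewriter.compile_count, rewritten[0])
```

With this shape the number of compilations depends only on the number of
rules, not on the number of URLs seen during the dig.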
I finally compiled and ran 3.2.0b4 by turning profiling off; for some
reason profiling caused it to segfault. Anyway, it took 30 hours to index
my site, so I have placed it on the back burner for now. I will try to
compile and run 3.1.6 with profiling, if it cooperates;)
> Also, looking at Joe's logs above, I notice a few important differences:
>
> - the word counts and document counts are different. Surprisingly,
> they're smaller for the slower dig, but it indicates you're not digging
> exactly the same site.
I am digging the same site. Getting fewer documents from the latest
snapshot hasn't escaped my attention; in fact that is the reason why I
have kept that old snapshot. All subsequent snapshots yield about 200
fewer documents. I haven't figured out whether the old snapshot let some
duplicates through -or- the newer snapshots miss about 200
documents;-/
> - the snapshots are different. Was there really a snapshot on Wed.,
> Aug 29, or is that supposed to be 082601? In either case, there were
> code changes made to the HTML parser between the two snapshots, so
> we don't know if part of the problem is that the new parser is slower.
> It might be useful to compare the timings of the two parser versions.
Here is a more complete comparison; the first two digs just show that a
couple of hundred documents do not make much of a difference in digging
time. This is a dynamic site; it changes continuously;) The second and
third digs show the reduction in document count from the 3.1.6-082901 to
the 3.1.6-092301 snapshot. The third and fourth digs show a dramatic
difference in digging duration, caused by nothing other than the two patches:
________3.1.6-082901 + ssl.4 + Armstrong patch_________
htdig: Start digging: Tue Sep 25 19:43:14 PDT 2001
htmerge: Start merging: Tue Sep 25 20:09:57 PDT 2001 27 minutes
htmerge: Total word count: 110412
htmerge: Total documents: 7279
htmerge: Total doc db size (in K): 117405
htnotify: Start notifying: Tue Sep 25 20:12:25 PDT 2001
htfuzzy: Start fuzzying: Tue Sep 25 20:12:33 PDT 2001
rundig: end rundig: Tue Sep 25 20:13:35 PDT 2001
________3.1.6-082901 + ssl.4 + Armstrong patch_________
htdig: Start digging: Thu Sep 27 16:37:23 PDT 2001
htmerge: Start merging: Thu Sep 27 17:05:20 PDT 2001 28 minutes
htmerge: Total word count: 111521
htmerge: Total documents: 7429
htmerge: Total doc db size (in K): 118940
htnotify: Start notifying: Thu Sep 27 17:07:46 PDT 2001
htfuzzy: Start fuzzying: Thu Sep 27 17:07:54 PDT 2001
rundig: end rundig: Thu Sep 27 17:08:58 PDT 2001
________3.1.6-092301 + ssl.4 + Armstrong patch_________
htdig: Start digging: Thu Sep 27 18:55:58 PDT 2001
htmerge: Start merging: Thu Sep 27 19:23:02 PDT 2001 27 minutes
htmerge: Total word count: 109506
htmerge: Total documents: 7156
htmerge: Total doc db size (in K): 117748
htnotify: Start notifying: Thu Sep 27 19:25:17 PDT 2001
htfuzzy: Start fuzzying: Thu Sep 27 19:25:24 PDT 2001
rundig: end rundig: Thu Sep 27 19:26:32 PDT 2001
_________3.1.6-092301 + ssl.4 + Geoff's patch__________
htdig: Start digging: Wed Sep 26 15:44:56 PDT 2001
htmerge: Start merging: Wed Sep 26 17:10:49 PDT 2001 86 minutes
htmerge: Total word count: 107762
htmerge: Total documents: 7095
htmerge: Total doc db size (in K): 115092
htnotify: Start notifying: Wed Sep 26 17:13:02 PDT 2001
htfuzzy: Start fuzzying: Wed Sep 26 17:13:10 PDT 2001
rundig: end rundig: Wed Sep 26 17:14:16 PDT 2001
_______________________________________________________
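The digging time for each run is the gap between its "htdig: Start digging"
and "htmerge: Start merging" stamps. A minimal Python sketch to recompute
those gaps from the timestamps in the logs above; the helper name
elapsed_minutes and the short run labels are my own, and rounding to the
nearest minute is an assumption about how the figures were derived.

```python
from datetime import datetime


def elapsed_minutes(start, end):
    """Minutes between two rundig timestamps such as
    'Tue Sep 25 19:43:14 PDT 2001'."""
    fmt = "%a %b %d %H:%M:%S %Y"
    # Both stamps are in PDT, so the zone name can be dropped before parsing.
    t0 = datetime.strptime(start.replace(" PDT", ""), fmt)
    t1 = datetime.strptime(end.replace(" PDT", ""), fmt)
    return (t1 - t0).total_seconds() / 60


# (label, htdig start, htmerge start), copied from the four logs.
digs = [
    ("082901 + Armstrong (Sep 25)",
     "Tue Sep 25 19:43:14 PDT 2001", "Tue Sep 25 20:09:57 PDT 2001"),
    ("082901 + Armstrong (Sep 27)",
     "Thu Sep 27 16:37:23 PDT 2001", "Thu Sep 27 17:05:20 PDT 2001"),
    ("092301 + Armstrong",
     "Thu Sep 27 18:55:58 PDT 2001", "Thu Sep 27 19:23:02 PDT 2001"),
    ("092301 + Geoff's",
     "Wed Sep 26 15:44:56 PDT 2001", "Wed Sep 26 17:10:49 PDT 2001"),
]

durations = [round(elapsed_minutes(start, end)) for _, start, end in digs]
for (label, _, _), minutes in zip(digs, durations):
    print("%-30s %3d minutes" % (label, minutes))
```

For these stamps the rounded results are 27, 28, 27 and 86 minutes.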
Regards,
Joe
--
_/ _/_/_/ _/ ____________ __o
_/ _/ _/ _/ ______________ _-\<,_
_/ _/ _/_/_/ _/ _/ ......(_)/ (_)
_/_/ oe _/ _/. _/_/ ah [EMAIL PROTECTED]
_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev