On Thu, 27 Sep 2001, Gilles Detillieux wrote:

> Date: Thu, 27 Sep 2001 12:23:08 -0500 (CDT)
> From: Gilles Detillieux <[EMAIL PROTECTED]>
> To: Joe R. Jah <[EMAIL PROTECTED]>
> Cc: Gilles Detillieux <[EMAIL PROTECTED]>, Geoff Hutchison <[EMAIL PROTECTED]>, [EMAIL PROTECTED]
> Subject: Re: [htdig-dev] Re: URL Rewrite patch for 3.1.6 snapshots
>
> Geoff Hutchison responded:
> > This tells me that on your system, the rx library is faster than the
> > system library regex calls. On the other hand, many people cheered
> > when we switched from the rx library to regex for the Endings fuzzy
> > generation. (Remember the complaints that German endings took weeks
> > to generate?)
> >
> > This may explain some of the different performance reports with the
> > 3.2 code as this uses regex calls heavily.
> >
> > (Hmm. Maybe the configure test should try benchmarking...)
>
> If I recall, though, when Joe did profiling on 3.2, there were millions
> of unexplained calls to regcomp(), when there really shouldn't have been
> more than one or a few per document. We never got to the bottom of this.
>
> In terms of execution speeds of regex code, I think most of the delays
> would be due to regcomp(), as regexec() is normally pretty quick.
> However, the url_rewrite_rules support in 3.1.6 should only result in at
> most a few calls to regcomp() at the very start, when the HtURLRewriter
> instance is first created. It shouldn't be continually calling regcomp(),
> so I don't think it should add an appreciable delay to the digging
> process, assuming the code is working correctly.
>
> I'd be very interested in seeing some profiling done with 3.1.6 on Joe's
> system.
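
The compile-once behaviour Gilles describes can be illustrated with a minimal sketch. It is written in Python, so the re module stands in for the POSIX regcomp()/regexec() calls, and it is not ht://Dig code: the URLRewriter class, its compile_count counter, and the example.com rule are all hypothetical names invented for illustration.

```python
import re


class URLRewriter:
    """Hypothetical stand-in for the role HtURLRewriter plays (not ht://Dig code)."""

    def __init__(self, rules):
        # Compile every rule exactly once, when the rewriter is created.
        # compile_count exists only so the sketch can show how often this runs.
        self.compile_count = 0
        self.rules = []
        for pattern, replacement in rules:
            self.rules.append((re.compile(pattern), replacement))
            self.compile_count += 1

    def rewrite(self, url):
        # Per-URL work only executes the precompiled patterns;
        # nothing is recompiled here.
        for regex, replacement in self.rules:
            url = regex.sub(replacement, url)
        return url


# Assumed example rule: strip a leading "www." from one made-up host.
rewriter = URLRewriter([(r"^http://www\.example\.com/", "http://example.com/")])
urls = ["http://www.example.com/page%d.html" % i for i in range(1000)]
rewritten = [rewriter.rewrite(u) for u in urls]

print("compilations:", rewriter.compile_count)
print("first rewritten URL:", rewritten[0])
```

If a profile of code structured this way showed the compile step running once per URL instead of once per rule, that would point to the pattern being rebuilt inside the per-document path, which is the kind of behaviour the unexplained regcomp() counts suggest.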

I finally compiled and ran 3.2.0b4 by turning profiling off; for some
reason, profiling caused it to segfault. Anyway, it took 30 hours to
index my site, so I have placed it on the back burner for now. I will
try to compile and run 3.1.6 with profiling, if it cooperates ;)

> Also, looking at Joe's logs above, I notice a few important differences:
>
> - the word counts and document counts are different. Surprisingly,
> they're smaller for the slower dig, but it indicates you're not digging
> exactly the same site.

I am digging the same site. Getting fewer documents from the latest
snapshot hasn't escaped my attention; in fact, that is the reason I have
kept that old snapshot. All subsequent snapshots find around 200 fewer
documents. I haven't figured out whether the old snapshot let some
duplicates through -or- the newer snapshots miss about 200 documents ;-/

> - the snapshots are different. Was there really a snapshot on Wed.,
> Aug 29, or is that supposed to be 082601? In either case, there were
> code changes made to the HTML parser between the two snapshots, so
> we don't know if part of the problem is that the new parser is slower.
> It might be useful to compare the timings of the two parser versions.

Here is a more complete comparison. The first two digs just show that a
couple of hundred documents do not make much of a difference. This is a
dynamic site; it changes continuously ;) The second and third digs show
the reduction in document count from the 3.1.6-082901 snapshot to the
3.1.6-092301 snapshot.
The third and fourth digs show a dramatic difference in digging
duration, caused by nothing other than the two patches:

________3.1.6-082901 + ssl.4 + Armstrong patch_________
htdig: Start digging: Tue Sep 25 19:43:14 PDT 2001
htmerge: Start merging: Tue Sep 25 20:09:57 PDT 2001     27 minutes
htmerge: Total word count: 110412
htmerge: Total documents: 7279
htmerge: Total doc db size (in K): 117405
htnotify: Start notifying: Tue Sep 25 20:12:25 PDT 2001
htfuzzy: Start fuzzying: Tue Sep 25 20:12:33 PDT 2001
rundig: end rundig: Tue Sep 25 20:13:35 PDT 2001

________3.1.6-082901 + ssl.4 + Armstrong patch_________
htdig: Start digging: Thu Sep 27 16:37:23 PDT 2001
htmerge: Start merging: Thu Sep 27 17:05:20 PDT 2001     28 minutes
htmerge: Total word count: 111521
htmerge: Total documents: 7429
htmerge: Total doc db size (in K): 118940
htnotify: Start notifying: Thu Sep 27 17:07:46 PDT 2001
htfuzzy: Start fuzzying: Thu Sep 27 17:07:54 PDT 2001
rundig: end rundig: Thu Sep 27 17:08:58 PDT 2001

________3.1.6-092301 + ssl.4 + Armstrong patch_________
htdig: Start digging: Thu Sep 27 18:55:58 PDT 2001
htmerge: Start merging: Thu Sep 27 19:23:02 PDT 2001     27 minutes
htmerge: Total word count: 109506
htmerge: Total documents: 7156
htmerge: Total doc db size (in K): 117748
htnotify: Start notifying: Thu Sep 27 19:25:17 PDT 2001
htfuzzy: Start fuzzying: Thu Sep 27 19:25:24 PDT 2001
rundig: end rundig: Thu Sep 27 19:26:32 PDT 2001

_________3.1.6-092301 + ssl.4 + Geoff's patch__________
htdig: Start digging: Wed Sep 26 15:44:56 PDT 2001
htmerge: Start merging: Wed Sep 26 17:10:49 PDT 2001     86 minutes
htmerge: Total word count: 107762
htmerge: Total documents: 7095
htmerge: Total doc db size (in K): 115092
htnotify: Start notifying: Wed Sep 26 17:13:02 PDT 2001
htfuzzy: Start fuzzying: Wed Sep 26 17:13:10 PDT 2001
rundig: end rundig: Wed Sep 26 17:14:16 PDT 2001
_______________________________________________________

Regards,

Joe
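
The per-dig durations quoted beside the logs can be recomputed from the "Start digging" and "Start merging" timestamps. The following is a rough sketch, assuming the time between those two stamps is what the minute figures measure; the helper names parse_stamp and dig_minutes are made up for this illustration.

```python
from datetime import datetime

# Timestamps copied from the rundig logs: (label, start digging, start merging).
DIGS = [
    ("3.1.6-082901 + ssl.4 + Armstrong patch",
     "Tue Sep 25 19:43:14 PDT 2001", "Tue Sep 25 20:09:57 PDT 2001"),
    ("3.1.6-082901 + ssl.4 + Armstrong patch",
     "Thu Sep 27 16:37:23 PDT 2001", "Thu Sep 27 17:05:20 PDT 2001"),
    ("3.1.6-092301 + ssl.4 + Armstrong patch",
     "Thu Sep 27 18:55:58 PDT 2001", "Thu Sep 27 19:23:02 PDT 2001"),
    ("3.1.6-092301 + ssl.4 + Geoff's patch",
     "Wed Sep 26 15:44:56 PDT 2001", "Wed Sep 26 17:10:49 PDT 2001"),
]


def parse_stamp(stamp):
    # Drop the "PDT" token before parsing: every stamp is in the same zone,
    # and parsing zone abbreviations with %Z is platform-dependent.
    parts = stamp.split()
    del parts[4]
    return datetime.strptime(" ".join(parts), "%a %b %d %H:%M:%S %Y")


def dig_minutes(start, merge):
    # Elapsed time between the two stamps, rounded to the nearest minute.
    elapsed = parse_stamp(merge) - parse_stamp(start)
    return round(elapsed.total_seconds() / 60)


minutes = [dig_minutes(start, merge) for _, start, merge in DIGS]
for (label, _, _), m in zip(DIGS, minutes):
    print("%-42s %3d minutes" % (label, m))
```

Computing the figures from the stamps rather than by hand keeps the comparison reproducible when more digs are added to the list.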