Because of the size limit on this mailing list, the message was returned. I have placed the attachments on the patch site:
ftp://ftp.ccsf.org/htdig-patches/Bench/

On Fri, 28 Sep 2001, Gilles Detillieux wrote:

> Date: Fri, 28 Sep 2001 11:27:03 -0500 (CDT)
> From: Gilles Detillieux <[EMAIL PROTECTED]>
> To: Joe R. Jah <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED]
> Subject: Re: [htdig-dev] Re: URL Rewrite patch for 3.1.6 snapshots
>
> OK, let's get any 3.1.6 problems nailed down first, then when it's out we
> can hopefully figure out all the strangeness with 3.2.0b4.  Let me know
> what, if anything, comes of profiling in 3.1.5.

Attached are basic block profiles of htdig-3.1.6-092301 with the Armstrong
patch and with Geoff's version of the patch, bb.out-Andy.gz and
bb.out-Geoff.gz respectively. Those are huge files and differ very little
in most blocks, except in regex.d, where the numbers for Geoff's version
break the scale;) To save you time I have also attached the regex.d blocks
separately: bb.out.regex-Andy.gz and bb.out.regex-Geoff.gz.

> OK, I've done some more testing on my site, and didn't notice any problems
> on my site, with over 300 HTML documents.  The biggest difference I've
> noticed is the changes to the HTML parser made it about 15% slower, but
> it's much more robust so I think it's worth it.

Would you please elaborate on what you mean by "more robust"? What
specific problems with the Armstrong patch warrant a performance hit of
15% in a limited, sanitized environment on your system, and 400% in a
realistic environment on mine? I do not think testing in a sanitized
environment with a few hundred HTML documents is adequate to arrive at a
realistic conclusion.

> I can think of two changes that might account for getting less documents
> indexed in the post-Aug 29 snapshots:
>
>   * htlib/URL.cc (URL): Fixed to call normalizePath() even if URL
>   is relative but with absolute path.  Should fix bug #408586.
>
> and
>
>   * htdig/HTML.h, htdig/HTML.cc (HTML, parse, do_tag): Fixed buggy
>   handling of nested tags that independently turn off indexing, so
>   </script> doesn't cancel <meta name=robots ...> tag.  Add handling
>   of <noindex follow> tag.
>
> both on Aug. 31.  The URL class change will get rid of some double slashes
> that were previously missed, which can reduce the number of duplicates.
> The HTML class change may prevent the parser from following links in
> documents that have meta robots tags, i.e. that it wasn't supposed
> to follow.
>
> If you get a chance to run old and new snapshots of htdig with -vvv and
> compare the outputs, you may be able to track down the source of the
> different URLs that are parsed in both cases.  To do this in a meaningful
> way, though, you'll need to try a static site, or perhaps a snapshot of
> your site, so you don't get thrown off in your comparisons by updates
> to the site between digs.

Yes, I have kept that snapshot for just such a happy occasion;)

> If you don't have meta robots tags on your site, though, it's almost
> certainly going to be the URL class that accounts for the differences.
> A quick test would be to run htdig -t with an old snapshot, then grep for
> "http://.*//" in db.docs.

That grep gives me a great many hits, because it also matches lines where
multiple URLs are on the same line; "[^:]//" gives more accurate results:

	grep -c "http://.*//" db.docs = 537
	grep -c "[^:]//" db.docs = 88

There are still around 100 documents unaccounted for;-/

> I just answered my own question here.  The 082601 snapshot was the truncated
> one, so I take it that 082901 was a manual snapshot.

That was the very first 3.1.6 snapshot, right after you left for vacation.
I believe it was a manual snapshot.

> Yes, that last comparison is the one I wanted to see.  An almost 3-fold
> increase in indexing time is dramatic.  A comparison of profiling output
> for these two builds would really be informative.

Right you are;)

Regards,

Joe
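
For anyone repeating the two grep counts quoted above, here is a minimal sketch of why the patterns disagree. It assumes, as described in the message, that a single line can hold more than one URL. The sample lines and the helper name count_matching_lines are made up for illustration and are not taken from a real db.docs. Python's re module treats these two particular patterns the same way grep does, and the helper counts matching lines just as grep -c would.

```python
import re

# Hypothetical sample lines (not real db.docs content).
lines = [
    # Two clean URLs on one line: no doubled slash in either path.
    "http://example.org/a/index.html http://example.org/b/",
    # One URL with a genuine doubled slash in its path.
    "http://example.org/docs//page.html",
    # One clean URL.
    "http://example.org/clean.html",
]

# The pattern suggested in the quoted text: after the first "http://",
# ".*" can run on to the "//" of a second URL on the same line.
greedy = re.compile(r"http://.*//")

# The stricter pattern: "//" only counts when the character before it
# is not ":", so the "//" in "http://" itself is ignored.
strict = re.compile(r"[^:]//")


def count_matching_lines(pattern, text_lines):
    """Count lines containing at least one match, like grep -c."""
    return sum(1 for line in text_lines if pattern.search(line))


greedy_count = count_matching_lines(greedy, lines)
strict_count = count_matching_lines(strict, lines)
print(greedy_count, strict_count)
```

On these sample lines the first pattern also counts the line with two clean URLs, while the second counts only the line whose path really contains a doubled slash.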