Because of the size limitation of this mailing list the message was
returned.  I have placed the attachments on the patch site:

        ftp://ftp.ccsf.org/htdig-patches/Bench/

On Fri, 28 Sep 2001, Gilles Detillieux wrote:

> Date: Fri, 28 Sep 2001 11:27:03 -0500 (CDT)
> From: Gilles Detillieux <[EMAIL PROTECTED]>
> To: Joe R. Jah <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED]
> Subject: Re: [htdig-dev] Re: URL Rewrite patch for 3.1.6 snapshots
> 
> OK, let's get any 3.1.6 problems nailed down first, then when it's out we
> can hopefully figure out all the strangeness with 3.2.0b4.  Let me know
> what, if anything, comes of profiling in 3.1.5.

Attached are basic block profiles of htdig-3.1.6-092301 with the Armstrong
patch and with Geoff's version of the patch, bb.out-Andy.gz and
bb.out-Geoff.gz respectively.  Those are huge files and differ very little
in most blocks, except in regex.d, where the numbers for Geoff's version
are off the scale;)  To save you time I have also attached the regex.d
blocks: bb.out.regex-Andy.gz and bb.out.regex-Geoff.gz.

> OK, I've done some more testing on my site, and didn't notice any problems
> on my site, with over 300 HTML documents.  The biggest difference I've
> noticed is the changes to the HTML parser made it about 15% slower, but
> it's much more robust so I think it's worth it.

Would you please elaborate on what you mean by "more robust"?  What are
the specific problems with the Armstrong patch that warrant a performance
hit of 15% in a limited, sanitized environment on your system, and 400% in
a realistic environment on my system?  I do not think testing in a
sanitized environment with a few hundred HTML documents is adequate to
arrive at a realistic conclusion.

> I can think of two changes that might account for getting less documents
> indexed in the post-Aug 29 snapshots:
> 
>       * htlib/URL.cc (URL): Fixed to call normalizePath() even if URL
>       is relative but with absolute path. Should fix bug #408586.
> 
> and
> 
>       * htdig/HTML.h, htdig/HTML.cc (HTML, parse, do_tag): Fixed buggy
>       handling of nested tags that independently turn off indexing, so
>       </script> doesn't cancel <meta name=robots ...> tag. Add handling
>       of <noindex follow> tag.
> 
> both on Aug. 31.  The URL class change will get rid of some double slashes
> that were previously missed, which can reduce the number of duplicates.
> The HTML class change may prevent the parser from following links in
> documents that have meta robots tags, i.e. that it wasn't supposed
> to follow.
> 
> If you get a chance to run old and new snapshots of htdig with -vvv and
> compare the outputs, you may be able to track down the source of the
> different URLs that are parsed in both cases.  To do this in a meaningful
> way, though, you'll need to try a static site, or perhaps a snapshot of
> your site, so you don't get thrown off in your comparisons by updates
> to the site between digs.

Yes, I have kept that snapshot for a happy occasion like that;)
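
Once each run's URLs have been reduced to one URL per line, the
comparison itself could be sketched roughly as below.  The two lists here
are made-up sample data, not real htdig -vvv output, and the extraction
step is left out because it depends on the log format:

```shell
#!/bin/sh
# Hypothetical URL lists, one per line, standing in for what would be
# extracted from the old and the new snapshot's output.
old=$(mktemp); new=$(mktemp)
printf '%s\n' 'http://example.com/a.html' \
  'http://example.com/dir//c.html' \
  'http://example.com/dir/c.html' | sort -u > "$old"
printf '%s\n' 'http://example.com/a.html' \
  'http://example.com/dir/c.html' | sort -u > "$new"

# comm needs sorted input: -23 keeps lines found only in the first file,
# -13 keeps lines found only in the second.
only_old=$(comm -23 "$old" "$new")
only_new=$(comm -13 "$old" "$new")

echo "only in old: $only_old"
echo "only in new: $only_new"
rm -f "$old" "$new"
```

With this sample, the double-slash URL is the only one unique to the old
list, which is the kind of duplicate the URL class change should remove.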

> If you don't have meta robots tags on your site, though, it's almost
> certainly going to be the URL class that accounts for the differences.
> A quick test would be to run htdig -t with an old snapshot, then grep for
> "http://.*//" in db.docs.

That grep would give me a great many false hits wherever multiple URLs are
on the same line; "[^:]//" gives more accurate results:

        grep -c "http://.*//"   db.docs  =  537
        grep -c "[^:]//"        db.docs  =   88

There still are around 100 documents unaccounted for;-/
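
To illustrate why the two patterns count differently, here is a minimal
sketch on made-up sample lines.  The "u:" and "l:" prefixes and the file
contents are assumptions for the example, not real db.docs output:

```shell
#!/bin/sh
# Hypothetical sample lines; a real db.docs from "htdig -t" has more fields.
sample=$(mktemp)
printf '%s\n' \
  'u:http://example.com/a.html l:http://example.com/b.html' \
  'u:http://example.com/dir//c.html' \
  'u:http://example.com/d.html' > "$sample"

# Loose pattern: also matches a line holding two clean URLs, because the
# second "http://" supplies the trailing "//".
loose=$(grep -c 'http://.*//' "$sample")

# Strict pattern: only a "//" that is not preceded by ":", so the scheme
# separator in "http://" never counts.
strict=$(grep -c '[^:]//' "$sample")

echo "loose=$loose strict=$strict"
rm -f "$sample"
```

Only the second sample line contains a real double slash in its path; the
loose pattern also counts the first line, which merely holds two URLs.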

> I just answered my own question here.  The 082601 snapshot was the truncated
> one, so I take it that 082901 was a manual snapshot.

That was the very first 3.1.6 snapshot, right after you left for vacation.  
I believe it was a manual snapshot.

> Yes, that last comparison is the one I wanted to see.  An almost 3-fold
> increase in indexing time is dramatic.  A comparison of profiling output
> for these two builds would really be informative.

Right you are;)

Regards,

Joe
-- 
     _/   _/_/_/       _/              ____________    __o
     _/   _/   _/      _/         ______________     _-\<,_
 _/  _/   _/_/_/   _/  _/                     ......(_)/ (_)
  _/_/ oe _/   _/.  _/_/ ah        [EMAIL PROTECTED]


_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
