According to Joe R. Jah:
> On Thu, 27 Sep 2001, Gilles Detillieux wrote:
> > I'd be very interested in seeing some profiling done with 3.1.6 on Joe's
> > system.
>
> I finally compiled and ran 3.2.0b4 by stopping profiling. For some reason
> profiling caused it to segfault. Any way, it took 30 hours to index my
> site, so I have placed it on the back burner for now. I will try to
> compile and run 3.1.6 with profiling, if it cooperates;)
OK, let's get any 3.1.6 problems nailed down first, then when it's out we
can hopefully figure out all the strangeness with 3.2.0b4. Let me know
what, if anything, comes of profiling in 3.1.6.
> > Also, looking at Joe's logs above, I notice a few important differences:
> >
> > - the word counts and document counts are different. Surprisingly,
> > they're smaller for the slower dig, but it indicates you're not digging
> > exactly the same site.
>
> I am digging the same site. Getting fewer documents from running the
> latest snapshot hasn't escaped my attention; in fact that is the reason
> why I have kept that old snapshot. All subsequent snapshots result in
> around 200 fewer document counts. I haven't figured out if some
> duplicates escaped the old snapshot -or- newer snapshots miss about 200
> documents;-/
OK, I've done some more testing on my own site, which has over 300 HTML
documents, and didn't notice any problems. The biggest difference I've
noticed is that the changes to the HTML parser made it about 15% slower,
but it's much more robust, so I think it's worth it.
I can think of two changes that might account for getting fewer documents
indexed in the post-Aug 29 snapshots:
        * htlib/URL.cc (URL): Fixed to call normalizePath() even if URL
        is relative but with absolute path. Should fix bug #408586.
and
        * htdig/HTML.h, htdig/HTML.cc (HTML, parse, do_tag): Fixed buggy
        handling of nested tags that independently turn off indexing, so
        </script> doesn't cancel <meta name=robots ...> tag. Add handling
        of <noindex follow> tag.
Both changes were made on Aug. 31. The URL class change will get rid of
some double slashes that were previously missed, which can reduce the
number of duplicates. The HTML class change may stop the parser from
following links in documents that have meta robots tags, i.e. links that
it wasn't supposed to follow in the first place.
If you get a chance to run old and new snapshots of htdig with -vvv and
compare the outputs, you may be able to track down where the sets of
parsed URLs differ between the two. To do this in a meaningful way,
though, you'll need to dig a static site, or perhaps a snapshot of your
site, so you don't get thrown off in your comparisons by updates to the
site between digs.
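As a rough sketch of that comparison, assuming the output of each -vvv
run was saved to a file and that URLs appear literally in it, something
like the Python below would list the URLs seen by only one of the two
builds. The regular expression is a generic one, not htdig's actual log
format, and the function names are made up for illustration.

```python
import re

# Generic pattern for http URLs. This is an assumption for illustration,
# not a description of how htdig formats its -vvv output.
URL_RE = re.compile(r"http://[^\s\"'<>]+")

def urls_in_log(lines):
    """Collect every distinct URL mentioned in an iterable of log lines."""
    found = set()
    for line in lines:
        found.update(URL_RE.findall(line))
    return found

def compare_logs(old_lines, new_lines):
    """Return (only_in_old, only_in_new) as sorted lists of URLs."""
    old_urls = urls_in_log(old_lines)
    new_urls = urls_in_log(new_lines)
    return sorted(old_urls - new_urls), sorted(new_urls - old_urls)
```

To use it, open the two saved logs (say old.log and new.log, both
hypothetical names) and pass the file objects to compare_logs(). URLs
that show up only in the old list are the candidates for the missing
200 or so documents.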
If you don't have meta robots tags on your site, though, it's almost
certainly going to be the URL class that accounts for the differences.
A quick test would be to run htdig -t with an old snapshot, then grep for
"http://.*//" in db.docs.
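If you'd rather script that check, here is a minimal sketch in Python,
assuming the text dump contains the URLs literally. It looks for the same
thing as the grep pattern above, then collapses the extra slashes to show
which URLs would end up as duplicates of each other. The collapsing is
only a crude stand-in for what the URL class does, and the function names
are hypothetical.

```python
import re
from collections import defaultdict

# Same idea as grepping for "http://.*//": a second "//" after the scheme.
DOUBLE_SLASH_RE = re.compile(r"http://[^\s\"'<>]*//[^\s\"'<>]*")

def double_slash_urls(lines):
    """Return the sorted, distinct URLs that contain an extra '//'."""
    found = set()
    for line in lines:
        found.update(DOUBLE_SLASH_RE.findall(line))
    return sorted(found)

def collapse_slashes(url):
    """Collapse runs of '/' in everything after the 'http://' prefix."""
    scheme, rest = url[:7], url[7:]
    return scheme + re.sub(r"/{2,}", "/", rest)

def would_be_duplicates(urls):
    """Group URLs that become identical once extra slashes are collapsed."""
    groups = defaultdict(list)
    for url in urls:
        groups[collapse_slashes(url)].append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}
```

If would_be_duplicates() turns up roughly 200 groups for the old
snapshot's URLs, that would point pretty strongly at the URL class change.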
> > - the snapshots are different. Was there really a snapshot on Wed.,
> > Aug 29, or is that supposed to be 082601? In either case, there were
I just answered my own question here. The 082601 snapshot was the truncated
one, so I take it that 082901 was a manual snapshot.
> > code changes made to the HTML parser between the two snapshots, so
> > we don't know if part of the problem is that the new parser is slower.
> > It might be useful to compare the timings of the two parser versions.
>
> Here is a more complete comparison; the first two digs just show that a
> couple of hundred documents do not make much of a difference. This is a
> dynamic site; it changes continuously;) The second and third digs show
> the reduction in document count from 3.1.6-082901 to 3.1.6-092301
> snapshot. The third and fourth digs show a dramatic difference between
> digging duration caused by nothing other than the two patches:
Yes, that last comparison is the one I wanted to see. An almost 3-fold
increase in indexing time is dramatic. A comparison of profiling output
for these two builds would really be informative.
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930