According to Joe R. Jah:
> On Mon, 1 Oct 2001, Gilles Detillieux wrote:
> > Shouldn't you be using the C library's regex code?  Maybe the automatic
> > configure test isn't working correctly.  Try the manual solution as
> > for older htdig versions, and see if that clears up some of these weird
> > regex-related problems, in both 3.1.6 and 3.2.0b4 snapshots.  If that
> > helps, we'll need to work out a better test.
> 
> Yessss;)  That helped a lot indeed:
> _________________3.1.6-093001 + ssl.4__________________
> htdig:    Start digging:   Sun Sep 30 02:27:48 PDT 2001
> htmerge:  Start merging:   Sun Sep 30 03:56:51 PDT 2001  89 minutes;(
...
> ____________3.1.6-093001 + ssl.4 & FAQ#5.14____________
> htdig:    Start digging:   Mon Oct  1 16:15:46 PDT 2001
> htmerge:  Start merging:   Mon Oct  1 16:59:05 PDT 2001  44 minutes;)
...
> _______________________________________________________
> 
> In pre 4.2 versions of BSDi htdig would segfault without using FAQ#5.14.
> After upgrading to BSDi-4.2 that problem never occurred until this patch.
> Thank you very much Gilles; I have removed the files from
> htdig-patches/Bench folder because they were irrelevant.

The segfault doesn't occur in 4.2, but apparently there's still a serious
conflict happening here.

> > You misunderstand.  My tests above didn't involve Andy's or Geoff's code
> > for url_rewrite_rules at all.  The 15% difference was solely attributable
> > to the changes in htdig/HTML.cc, to use a different technique for
> > parsing tag attributes.  The old code used a StringMatch object to
> > search for certain attributes, like href, src, etc., but the search
> > could get thrown off by the existence of these words within attribute
> > value strings in tags.  The new code instead creates a Configuration
> > object for each tag, and uses the code for this class to Add all the
> > attributes in the tag to this object.  This greatly simplifies the
> > HTML parser, makes it easier to extend it to handle new tag attributes,
> > and makes it more reliable.  It should NOT make it much more than 15%
> > slower on ANY system, including yours.
> 
> Sorry; I thought you were still on the same subject.

Well, we were, but the subject has a few forks in it.  It started with
a discussion about how the two benchmarks you ran initially with two
3.1.6 snapshots differed greatly, and it split off into 3 possible reasons
for the difference:

1) HTML parser changes on Aug 31
2) changes to your web site
3) different implementations of url_rewrite_rules, using rx or regex

That's why I suggested testing the 3 changes in isolation.  The 15%
difference I observed was attributed solely to the HTML parser changes.
The changes to the web site, as you pointed out, have only a minor effect
on timings.  The 200-400% difference you observed seems to be due purely
to conflicts between the bundled regex code and your C/C++ libraries.
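
To make reason 1 concrete, here is a rough Python sketch of the two
parsing techniques.  It is not htdig code: the real parser is C++ and
uses the StringMatch and Configuration classes, and every function name
below is made up for illustration.  The first function looks for a
keyword anywhere in the tag, which is roughly how the old search could
be fooled.  The second collects every attribute into a dictionary first.

```python
import re

# One attribute: a name, optionally followed by = and a double-quoted,
# single-quoted, or bare value.
ATTR_RE = re.compile(
    r"""([A-Za-z_:][-\w:.]*)\s*(?:=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+)))?"""
)


def naive_find_href(tag_text):
    """Keyword search: take whatever follows the first 'href=' in the tag."""
    i = tag_text.lower().find("href=")
    if i < 0:
        return None
    rest = tag_text[i + len("href="):]
    return re.match(r"""["']?([^"'\s>]*)""", rest).group(1)


def parse_attributes(tag_text):
    """Collect all attributes of one tag into a dict, keyed by lowercased name."""
    body = re.sub(r"^<\s*\w+", "", tag_text).rstrip(">")
    attrs = {}
    for m in ATTR_RE.finditer(body):
        name = m.group(1).lower()
        # Exactly one of the three value groups is set, or none for a bare name.
        value = next((g for g in m.groups()[1:] if g is not None), "")
        attrs[name] = value
    return attrs


tag = '<a title="see href=old.html" href="new.html">'
print(naive_find_href(tag))           # picks up the text inside title
print(parse_attributes(tag)["href"])  # picks up the real href attribute
```

The dictionary version does more work per tag, which would fit with a
modest slowdown, but new attributes can be looked up without touching
the scanning code.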

> > The problems with regex handling are a completely separate issue, and are
> > not tied to the HTML parser in any way.  I do want to resolve this issue
> > too, if we can ever get to the bottom of it.
> 
> I believe it is resolved, thanks to you.  I will try FAQ#5.14 on 3.2
> snapshot as soon as I get a chance to clarify this point once and for all.
> Is it possible to set a test in the configure program to take care of
> FAQ#5.14 automatically?

That's the big question.  We can certainly add tests into configure.in to
try different things out.  That's what Geoff did, actually, but his test
was a very simple one which doesn't seem to catch the subtle conflict
on your BSDi 4.2 system.  The test just tries compiling and running
a simple C program that calls the regcomp() function from the bundled
htlib/regex.c code.  If this runs, that's the code that gets used to
build htdig, htfuzzy and htsearch.  Unfortunately, this code does run
OK on your system, so the bundled code gets picked.  I don't know what
test we could come up with that would fail on your system, so that
configure would avoid the bundled regex code there.
Maybe we just need to check for a BSD/BSDi system and stick to the C
library regex functions on these systems.
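
The decision logic I have in mind would look something like the Python
sketch below.  The real check would have to live in configure.in, so
this only shows the idea.  The host strings are in the style that
config.guess produces, and the pattern list and function name are
assumptions of mine, not anything that exists in htdig today.

```python
from fnmatch import fnmatchcase

# Hypothetical list of host patterns on which we would always use the
# C library's regex functions.  Only BSDi 4.2 is known from this thread
# to show the conflict; matching every BSD is a guess.
SYSTEM_REGEX_HOSTS = ("*bsd*",)


def use_bundled_regex(host, bundled_test_passed):
    """Return True if the build should use the bundled htlib/regex.c."""
    host = host.lower()
    if any(fnmatchcase(host, pattern) for pattern in SYSTEM_REGEX_HOSTS):
        # Stick to the C library on these systems, even if the simple
        # compile-and-run test of the bundled code succeeded.
        return False
    # Elsewhere, keep the current behaviour: trust the simple test.
    return bundled_test_passed


print(use_bundled_regex("i386-pc-bsdi4.2", True))
print(use_bundled_regex("i686-pc-linux-gnu", True))
```

The drawback is that this is a blacklist by platform name rather than a
test of actual behaviour, so it would need updating by hand if other
systems turn out to have the same conflict.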

> > > > If you get a chance to run old and new snapshots of htdig with -vvv and
> > > > compare the outputs, you may be able to track down the source of the
> > > > different URLs that are parsed in both cases.  To do this in a meaningful
> > > > way, though, you'll need to try a static site, or perhaps a snapshot of
> > > > your site, so you don't get thrown off in your comparisons by updates
> > > > to the site between digs.
> > > 
> > > Yes, I have kept that snapshot for a happy occasion like that;)
> > 
> > Keep me posted if you get a chance to run this test with both snapshots.
> > I can't think of any changes to 3.1.6 that would cause it to lose valid
> > URLs, but it would be good to confirm without a doubt that the lost URLs
> > on your system are all indeed URLs that should not have been indexed.
> 
> In the happy hour;)))

It might be best if you're sober when you do this test.  ;-)

> > You're right, I was forgetting that URLs can appear in the body text
> > of a document, and therefore in the excerpt field of db.docs.  This
> > does suggest that the change to URL.cc on Aug. 31 would account for
> > almost half of the missing URLs.  Presumably a grep of "[^:]//" in a
> > db.docs from a recent 3.1.6 snapshot wouldn't find any matches, unless
> > the double slashes are in URLs within the body text of documents.
> > 
> > So, I guess the next question is do you have any documents that have
> > meta robots tags followed by script tags?
> 
> Yes; most of the 88 documents in my previous post

OK, have a look at these documents and see how many URLs appear in links
after a <meta name=robots content="noindex,nofollow"> tag and after a
subsequent </script> tag.  Pre Aug. 31 code will erroneously follow links
after the </script> tag in this case.  If any of those URLs appear only
there, and nowhere else on your site, they should never have been indexed,
and post-Aug. 31 code won't touch them.  If this doesn't account for all
the other "missing" URLs, then we may have a problem elsewhere.  That's
what I'd like to nail down before 3.1.6 is released, if at all possible.
I've been unable to reproduce any such error on my site.
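
If it saves you some manual work, a small script along these lines
could list the candidate links in each document.  It is only a sketch
using Python's standard html.parser module, not htdig's own parser, so
its idea of a tag may differ from htdig's in odd cases.  The class and
function names are made up for this example.

```python
from html.parser import HTMLParser


class LinksAfterRobotsAndScript(HTMLParser):
    """Collect hrefs that follow a robots nofollow meta tag and a later </script>."""

    def __init__(self):
        super().__init__()
        self.seen_nofollow = False
        self.seen_script_end = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            if "nofollow" in (attrs.get("content") or "").lower():
                self.seen_nofollow = True
        elif tag == "a" and self.seen_nofollow and self.seen_script_end:
            href = attrs.get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        # Only a </script> that comes after the robots tag is of interest.
        if tag == "script" and self.seen_nofollow:
            self.seen_script_end = True


def links_after_robots_and_script(html):
    parser = LinksAfterRobotsAndScript()
    parser.feed(html)
    parser.close()
    return parser.links


sample = (
    '<html><head><meta name=robots content="noindex,nofollow">'
    "<script>var x = 1;</script></head>"
    '<body><a href="/only-here.html">link</a></body></html>'
)
print(links_after_robots_and_script(sample))
```

Any URL this reports that appears nowhere else on the site is one that
the older snapshot would have picked up and the newer one should skip.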

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev