According to Joe R. Jah:
> On Mon, 1 Oct 2001, Gilles Detillieux wrote:
> > Shouldn't you be using the C library's regex code? Maybe the automatic
> > configure test isn't working correctly. Try the manual solution as
> > for older htdig versions, and see if that clears up some of these weird
> > regex-related problems, in both 3.1.6 and 3.2.0b4 snapshots. If that
> > helps, we'll need to work out a better test.
>
> Yessss;) That helped a lot indeed:
> _________________3.1.6-093001 + ssl.4__________________
> htdig: Start digging: Sun Sep 30 02:27:48 PDT 2001
> htmerge: Start merging: Sun Sep 30 03:56:51 PDT 2001    89 minutes;( ...
> ____________3.1.6-093001 + ssl.4 & FAQ#5.14____________
> htdig: Start digging: Mon Oct 1 16:15:46 PDT 2001
> htmerge: Start merging: Mon Oct 1 16:59:05 PDT 2001    44 minutes;) ...
> _______________________________________________________
>
> In pre 4.2 versions of BSDi htdig would segfault without using FAQ#5.14.
> After upgrading to BSDi-4.2 that problem never occurred until this patch.
> Thank you very much Gilles; I have removed the files from
> htdig-patches/Bench folder because they were irrelevant.

The segfault doesn't occur in 4.2, but apparently there's still a serious
conflict happening here.

> > You misunderstand. My tests above didn't involve Andy's or Geoff's code
> > for url_rewrite_rules at all. The 15% difference was solely attributable
> > to the changes in htdig/HTML.cc, to use a different technique for
> > parsing tag attributes. The old code used a StringMatch object to
> > search for certain attributes, like href, src, etc., but the search
> > could get thrown off by the existence of these words within attribute
> > value strings in tags. The new code instead creates a Configuration
> > object for each tag, and uses the code for this class to Add all the
> > attributes in the tag to this object. This greatly simplifies the
> > HTML parser, makes it easier to extend it to handle new tag attributes,
> > and makes it more reliable. It should NOT make it much more than 15%
> > slower on ANY system, including yours.
>
> Sorry; I thought you were still on the same subject.

Well, we were, but the subject has a few forks in it. It started with
a discussion about how the two benchmarks you ran initially with two 3.1.6
snapshots differed greatly, and it split off into 3 possible reasons for
the difference:

1) HTML parser changes on Aug 31
2) changes to your web site
3) different implementations of url_rewrite_rules, using rx or regex

That's why I suggested testing the 3 changes in isolation. The 15%
difference I observed was attributed solely to the HTML parser changes.
The changes to the web site, as you pointed out, have only a minor effect
on timings. The 200-400% difference you observed seems to be due purely
to conflicts between the bundled regex code and your C/C++ libraries.

> > The problems with regex handling are a completely separate issue, and are
> > not tied to the HTML parser in any way. I do want to resolve this issue
> > too, if we can ever get to the bottom of it.
>
> I believe it is resolved, thanks to you.
> I will try FAQ#5.14 on 3.2
> snapshot as soon as I get a chance to clarify this point once and for all.
> Is it possible to set a test in the configure program to take care of
> FAQ#5.14 automatically?

That's the big question. We can certainly add tests into configure.in to
try different things out. That's what Geoff did, actually, but his test
was a very simple one which doesn't seem to catch the subtle conflict on
your BSDi 4.2 system. The test just tries compiling and running a simple
C program that calls the regcomp() function from the bundled htlib/regex.c
code. If this runs, that's the code that gets used to build htdig, htfuzzy
and htsearch. Unfortunately, this code does run OK on your system. I don't
know what test we could come up with that would fail on your system, so
that we wouldn't use the bundled regex code there. Maybe we just need to
check for a BSD/BSDi system and stick to the C library regex functions on
these systems.

> > > > If you get a chance to run old and new snapshots of htdig with -vvv and
> > > > compare the outputs, you may be able to track down the source of the
> > > > different URLs that are parsed in both cases. To do this in a meaningful
> > > > way, though, you'll need to try a static site, or perhaps a snapshot of
> > > > your site, so you don't get thrown off in your comparisons by updates
> > > > to the site between digs.
> > >
> > > Yes, I have kept that snapshot for a happy occasion like that;)
> >
> > Keep me posted if you get a chance to run this test with both snapshots.
> > I can't think of any changes to 3.1.6 that would cause it to lose valid
> > URLs, but it would be good to confirm without a doubt that the lost URLs
> > on your system are all indeed URLs that should not have been indexed.
>
> In the happy hour;)))

It might be best if you're sober when you do this test. ;-)

> > You're right, I was forgetting that URLs can appear in the body text
> > of a document, and therefore in the excerpt field of db.docs.
> > This does suggest that the change to URL.cc on Aug. 31 would account for
> > almost half of the missing URLs. Presumably a grep of "[^:]//" in a
> > db.docs from a recent 3.1.6 snapshot wouldn't find any matches, unless
> > the double slashes are in URLs within the body text of documents.
> >
> > So, I guess the next question is do you have any documents that have
> > meta robots tags followed by script tags?
>
> Yes; most of the 88 documents in my previous post

OK, have a look at these documents and see how many URLs appear in links
after a <meta name=robots content="noindex,nofollow"> tag and after a
subsequent </script> tag. Pre Aug. 31 code will erroneously follow links
after the </script> tag in this case. If any of those URLs appear only
there, and nowhere else on your site, they should never have been indexed,
and post-Aug. 31 code won't touch them. If this doesn't account for all
the other "missing" URLs, then we may have a problem elsewhere. That's what
I'd like to nail down before 3.1.6 is released, if at all possible. I've
been unable to reproduce any such error on my site.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
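
For reference, the grep for "[^:]//" suggested above can be sketched in
Python. This is only an illustration, not htdig code: the helper name
find_double_slashes and the sample URLs are hypothetical, and it assumes
the db.docs contents have already been dumped to plain text lines, which
the message does not describe.

```python
import re

# Same pattern as the grep in the message: a "//" preceded by any
# character other than ":", so the "//" in "http://" is not reported.
DOUBLE_SLASH = re.compile(r"[^:]//")


def find_double_slashes(lines):
    """Return the lines that contain a "//" not directly preceded by ":".

    Hypothetical helper for illustration; it works on any iterable of
    strings and knows nothing about the real db.docs format.
    """
    return [line for line in lines if DOUBLE_SLASH.search(line)]


# Made-up sample data, not taken from any real db.docs.
sample = [
    "http://www.example.com/docs/index.html",
    "http://www.example.com//docs/index.html",
    "http://www.example.com/a//b.html",
]

hits = find_double_slashes(sample)
for line in hits:
    print(line)
```

Only the second and third sample lines are reported. As with the grep, a
hit may still come from a URL quoted in a document's body text, so each
match needs a look before concluding that a stored URL has a stray double
slash.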