According to Joe R. Jah:
> On Wed, 3 Oct 2001, Gilles Detillieux wrote:
> > Date: Wed, 3 Oct 2001 09:51:03 -0500 (CDT)
> > From: Gilles Detillieux <[EMAIL PROTECTED]>
> > To: Joe R. Jah <[EMAIL PROTECTED]>
> > Cc: [EMAIL PROTECTED]
> > Subject: Re: [htdig-dev] Re: URL Rewrite patch for 3.1.6 snapshots
> > 
> > > > > > If you get a chance to run old and new snapshots of htdig with -vvv and
> > > > > > compare the outputs, you may be able to track down the source of the
> > > > > > different URLs that are parsed in both cases.  To do this in a meaningful
> > > > > > way, though, you'll need to try a static site, or perhaps a snapshot of
> > > > > > your site, so you don't get thrown off in your comparisons by updates
> > > > > > to the site between digs.
> > > > > 
> > > > > Yes, I have kept that snapshot for a happy occasion like that;)
> > > > 
> > > > Keep me posted if you get a chance to run this test with both snapshots.
> > > > I can't think of any changes to 3.1.6 that would cause it to lose valid
> > > > URLs, but it would be good to confirm without a doubt that the lost URLs
> > > > on your system are all indeed URLs that should not have been indexed.
> > > 
> > > In the happy hour;)))
> > 
> > It might be best if you're sober when you do this test.  ;-)
> 
> The happy hour turned into a couple of unhappy weeks:(
> 
> -r--r--r--  1 jjah  www    24621528 Oct  2 13:20 rundig_vvv.082901
> -r--r--r--  1 jjah  www    20266702 Oct  2 14:15 rundig_vvv.093001
> 
> I found 82 links from one document with META ROBOT: Noindex tag;)  I could
> not find an efficient way of hunting down the other 138 links that were
> unaccounted for in two 20 meg+ files; however, I must assume that they are
> some sort of duplicates;-/

Hmm.  Too bad we couldn't get something more definitive.  I'm fairly
confident that the changes to the HTML parser didn't break anything, but
I'd feel much more comfortable if we could explain the missing files you
discovered rather than just assuming it's OK.  If I recall, there were
88 URLs with doubled slashes that were eliminated in an earlier test,
but that still leaves around 50 URLs unaccounted for.
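
To keep the doubled-slash cases from being counted by hand again, something
like the following rough Python sketch could separate them from the rest of
a list of missing URLs. The function names are made up for illustration, and
it assumes a "doubled slash" means "//" appearing in the path part of the
URL, after the host name:

```python
from urllib.parse import urlsplit

def has_doubled_slash(url):
    """Return True if the path part of the URL contains '//'."""
    # urlsplit keeps the scheme's own '//' out of .path, so only
    # slashes after the host name are examined here.
    return "//" in urlsplit(url).path

def split_by_doubled_slash(urls):
    """Split URLs into (doubled-slash ones, everything else), both sorted."""
    doubled = sorted(u for u in urls if has_doubled_slash(u))
    other = sorted(u for u in urls if not has_doubled_slash(u))
    return doubled, other
```

Whatever ends up in the second list is what still needs an explanation.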

If there's any way you can take a snapshot of your site, or a few major
subdirectories, and duplicate them somewhere else where they won't get
modified, it would be a big help in getting conclusive results.  If you
index the exact same files with 3.1.5 and 3.1.6, you should be able to
diff the output of htdig -vvv from both, and pinpoint exactly where the
differences are happening.  I know this is asking a lot, but it would be
a shame to release 3.1.6 after all the work that's gone into it, only to
discover afterward that it introduced a serious bug.
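
Since a plain diff of two 20 MB logs may be hard to read, a rough Python
sketch like the one below might make the comparison more manageable. It does
not rely on the exact layout of the -vvv output, which I'm not assuming here:
it simply collects anything that looks like an http or https URL from each
log and reports the URLs that appear in only one of them. The function names
and the regular expression are my own assumptions, and the pattern may pick
up trailing punctuation that would need trimming:

```python
import re

# Loose pattern for anything URL-like; an assumption, not htdig's own parser.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def urls_in_log(path):
    """Collect the set of URL-like strings found anywhere in a log file."""
    urls = set()
    # Read line by line so a 20 MB+ log never has to sit in memory as one string.
    with open(path, errors="replace") as log:
        for line in log:
            urls.update(URL_RE.findall(line))
    return urls

def compare_logs(old_path, new_path):
    """Return (URLs only in the old log, URLs only in the new log), sorted."""
    old = urls_in_log(old_path)
    new = urls_in_log(new_path)
    return sorted(old - new), sorted(new - old)

# Example use with the two log files mentioned earlier in this thread:
#   only_old, only_new = compare_logs("rundig_vvv.082901", "rundig_vvv.093001")
#   print("\n".join(only_old))
```

The URLs that show up only in the old log are the candidates to look up in
the old -vvv output, to see which document linked to them.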

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
