aseek-devel  

Re: [aseek-devel] Patches to a couple of indexing problems...

Kir Kolyshkin
Mon, 17 Sep 2001 07:35:30 -0700

Matt, thanks for the patch. Could you redo it against the latest CVS and
resend it? It would also be great if you split it into several separate
patches, each dedicated to one problem, so that they are easy to review
and apply.

BTW, I have already committed some of the obvious pieces to CVS. In short,
items 3, 4 and 5 are not accepted yet.

PS: I will not be available on Tuesday and Wednesday, so you can send the
patches by Thursday.

Matt Sullivan wrote:
> 
> Kir / All,
> 
> Attached are a couple of patches to problems that I've discovered (mostly) in
> the indexer over the last few weeks.  Details are:
> 
> 1. Fixed mistaken indexing of robots.txt.  Some web sites output content (such
> as their home page) on what should be 404 errors but with status 200.  Since
> Content-Type and filename were both checked when indexing robots.txt it was
> possible that robots.txt files were actually indexed mistakenly in these
> instances (since mime type is not text/plain).  Patch forces status 404 if
> filename is robots.txt and content type is not text/plain.
> 
> 2. Minor fixes when HTTPS support not compiled in.  Was possible for
> (supposedly) HTTPS files to be indexed mistakenly even with HTTPS support not
> compiled in.  This would only occur when the exact same filepath was available
> via both HTTP and HTTPS protocols (in other words URL
> "https://someserver/somepath/" would be added and indexed over port 80
> mistakenly).
> 
> 3. Fixed broken redirect handling.  Although RFC 2616 states that the RHS of a
> "Location" header should be an absoluteURI, it is very common for
> partial/relative URIs to be used in a Location header.  ASPSeek would not
> follow partial URIs in redirects (both header and meta based).  Added
> whitespace stripping and qualifying of partial URIs to support redirects
> properly.  Redirects are now followed in exactly the same manner as HREFs in a
> document (including addition of robots.txt when follow & followoutside, note
> below).
> 
> 4. Fixed Tolower call in META handling that caused the URI to be lowercased, thus
> breaking the redirect in cases where the remote system was case sensitive in its
> URI handling.
> 
> 5. Fixed robots.txt URL parameter addition in HREF and Redirect code to
> correctly add robots.txt with referrer 0, hops count 0 and next index time
> 24 hours before now (robots.txt was not added at all in the redirect case, which
> caused problems with FollowOutside on).
> 
> 6. Also (not included in attached patch) newer autoconf needs addition of
> AC_EXEEXT macro in configure.in prior to AC_LANG_CPLUSPLUS etc.  Effect of
> missing macro is EXE extension mistakenly set to ".C".  Patch is:

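For illustration only, the check described in item 1 might look roughly like the sketch below. It is written in Python for brevity and is not the ASPSeek patch. The function name `effective_status` and its parameters are hypothetical, and stripping Content-Type parameters (such as a charset) before comparing is an assumption.

```python
def effective_status(url_path, status, content_type):
    """Return the status an indexer should act on for a fetched URL.

    Hypothetical helper: a "robots.txt" that is not served as text/plain
    is treated as missing (404), so a site's catch-all page is never
    indexed in its place.
    """
    # Last path component, e.g. "/robots.txt" -> "robots.txt".
    filename = url_path.rsplit("/", 1)[-1]
    # Drop parameters such as "; charset=iso-8859-1" before comparing.
    mime = content_type.split(";", 1)[0].strip().lower()
    if filename == "robots.txt" and mime != "text/plain":
        return 404
    return status
```

A home page returned with status 200 for /robots.txt would then be handled as a 404, while a genuine text/plain robots.txt keeps its original status.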

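Items 3 and 4 can be sketched in the same spirit: strip whitespace from the redirect target, qualify a partial URI against the URL of the document being fetched, and leave the case of the target untouched. The helper names `qualify_redirect` and `meta_refresh_target` are hypothetical, and the resolution is delegated to the standard library's `urljoin` rather than to anything in ASPSeek.

```python
import re
from urllib.parse import urljoin

# Match the "url=" keyword case-insensitively, but capture the value as written.
_META_URL = re.compile(r"url\s*=\s*(.+)", re.IGNORECASE)


def qualify_redirect(base_url, location):
    """Resolve a possibly partial redirect target against base_url."""
    # Strip surrounding whitespace (including a trailing CR/LF) first.
    return urljoin(base_url, location.strip())


def meta_refresh_target(base_url, content):
    """Extract the target from a META refresh "content" attribute value.

    Returns None when the attribute carries only a delay and no URL.
    The target is never lowercased, so case-sensitive servers still work.
    """
    match = _META_URL.search(content)
    if match is None:
        return None
    target = match.group(1).strip().strip("'\"")
    return qualify_redirect(base_url, target)
```

With this sketch, a header such as `Location: /Login` received while fetching `http://example.com/a/b` resolves to `http://example.com/Login`, and an absolute target passes through unchanged.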
-- 
[EMAIL PROTECTED]  ICQ 7551596  Phone +7 903 6722750
Reality always seems harsher in the early morning.
--