Hi, here is the patch. It fixes 3 serious bugs, reduces the number of blank lines and other garbage in the diff, and leaves the config alone now.

Peter
diff -rcN tmp/htdig-3.1.5/README.plp.txt htdig-3.1.5/README.plp.txt
*** tmp/htdig-3.1.5/README.plp.txt	Thu Jan  1 02:00:00 1970
--- htdig-3.1.5/README.plp.txt	Sun May 10 20:37:42 1998
***************
*** 0 ****
--- 1,121 ----
+ 
+ About the patch that allows htdig to index soft-linked directories without
+ indexing the parent directories.
+ 
+ Applies to: htdig stable 3.1.5
+ Related files: the.patch, htdig-3.1.5+prune_parent_dir_href-0.1.patch
+ Related URLs: http://www.htdig.org (htdig mailing list archive)
+ 
+ 1. Description:
+ 
+ 1.1 The problem:
+ 
+ When htdig indexes a page under a DocumentRoot noted as http://a/b/c, and
+ the page contains an href pointing to http://a/d/e where the latter page
+ is an Apache-generated directory index, htdig will try to index http://a/d
+ and all its descendants.
+ 
+ 1.2 The solution:
+ 
+ To avoid this, a mechanism is implemented in htdig that prevents it from
+ reaping and indexing any URLs that are direct parents of the currently
+ indexed document. For example:
+ 
+ If the document http://here/a/b/c is being indexed, then the following
+ URLs, when reaped from it, need not be added to the list of URLs to be
+ indexed:
+ 
+ http://
+ http://here
+ http://here/a
+ http://here/a/b
+ 
+ In particular the last one would appear as a 'previous directory' entry in
+ an Apache-generated directory index for http://here/a/b/c.
+ 
+ 2. Patch
+ 
+ 2.1 Patch description:
+ 
+ The patch modifies the htdig/Retriever class to add the required
+ functionality, and adds a new configuration option that turns the new
+ feature ON or OFF.
+ 
+ The feature is turned OFF by default, and it needs to be turned ON by an
+ entry in the config file used with htdig, using a line like:
+ 
+ prune_parent_dir_href: true
+ 
+ 2.2 Patch application:
+ 
+ Copy the patch to the htdig-3.1.5 directory and then apply it using the
+ command:
+ 
+ patch -p1 <htdig-3.1.5+prune_parent_dir_href-0.1.patch
+ 
+ then recompile and reinstall htdig (make; make install). Edit the config
+ file to turn on the new option, add a symbolic link under the DocumentRoot
+ (e.g. cd /usr/local/httpd/htdocs/misc; ln -s /usr/doc . on Suse systems),
+ and run htdig (rundig).
+ 
+ The patch is distributed with the filename 'the.patch-1' or the long name,
+ and it is available in the htdig mailing list archives at
+ http://www.htdig.org. It should be renamed to
+ htdig-3.1.5+prune_parent_dir_href-0.1.patch for archival purposes.
+ 
+ NOTE that if you upgrade Suse htdig to 3.1.5, you have to edit the Suse
+ image, search and cgi-bin directories in CONFIG before compiling, as they
+ are not standard.
+ 
+ 2.3 Patch problems:
+ 
+ Note that symbolic links under the DocumentRoot have security implications.
+ While normal web sites are rightly paranoid about serving files from
+ outside the DocumentRoot, open systems (Linux in particular) need not be:
+ on a stock Linux installation, visitors can only browse the contents of a
+ stock Linux installation, which is also available elsewhere (presumably
+ with larger bandwidth). FYI, directory browse access for Apache on Linux
+ is disabled by resetting a directory's world permissions (xxxxxx---).
+ 
+ The patch may prevent some sites that are not entered at the top from
+ being indexed properly. For example, if a site is entered at:
+ 
+ http://somewhere/pub/someone/start/here.html
+ 
+ then anything not below http://somewhere/pub/someone/start will be
+ omitted, even if it is linked to from here.html.
+ 
+ This is not a problem for most sites, which are entered at the top. If you
+ have funny sites, then you will need funny configurations. ;-)
+ 
+ 2.4 Patch function indication:
+ 
+ To see the patch working, run htdig with -v. The patch causes a bang
+ (ASCII '!') to be printed among the other progress characters for each URL
+ that was pruned by the patch. I did not try to see what happens when more
+ than one -v is used; in theory it should print bangs then too, but I
+ cannot tell with what text they will be mixed.
+ 
+ 3. Some statistics:
+ 
+ An i486/100MHz with 24MB RAM and EIDE disks (not UDMA) ran htdig -ilv with
+ the patch applied, at niceness 10, in about 13 hours, and htmerge -v in 2
+ hours. The doc db size reported was 310 MB with 36500 documents in it. The
+ machine was thoroughly usable during this time, for shell and compilation
+ use as well as (moderate) web server use. The kernel was the stock Suse
+ Linux 2.2.5.
+ 
+ This means that a 'legacy' machine can be employed as an intranet document
+ server and run htdig about twice a week from cron, without any problems.
+ 
+ New with 0.1: the optimized runtime of htdig is 30 to 40% faster on the
+ same machine. It pays to optimize this program! Select all the available
+ speedup options in the Make config for best results, and add your own!
+ 
+ 4. Who did this
+ 
+ Me, Peter Lorand Peres, [EMAIL PROTECTED], when I tried to index the
+ documentation (not only HTML) on my Suse 6.2 system in April/May 2000 and
+ failed, due to the looping problem described above.
+ 
diff -rcN tmp/htdig-3.1.5/htcommon/defaults.cc htdig-3.1.5/htcommon/defaults.cc
*** tmp/htdig-3.1.5/htcommon/defaults.cc	Fri Feb 25 04:29:10 2000
--- htdig-3.1.5/htcommon/defaults.cc	Thu May  4 23:14:48 2000
***************
*** 24,29 ****
--- 24,33 ----
      {"pdf_parser",	PDF_PARSER " -toPostScript"},
      {"version",		VERSION},
+ 
+ // plp
+     {"prune_parent_dir_href",	"false"},
+ 
      //
      // General defaults
      //
diff -rcN tmp/htdig-3.1.5/htdig/Retriever.cc htdig-3.1.5/htdig/Retriever.cc
*** tmp/htdig-3.1.5/htdig/Retriever.cc	Fri Feb 25 04:29:10 2000
--- htdig-3.1.5/htdig/Retriever.cc	Sun May 10 20:45:53 1998
***************
*** 20,29 ****
  #include <stdio.h>
  #include "HtWordType.h"
  
  static WordList	words;
  static int	noSignal;
  
- 
  //*****************************************************************************
  // Retriever::Retriever()
  //
--- 20,31 ----
  #include <stdio.h>
  #include "HtWordType.h"
  
+ // plp
+ #include <string.h>
+ 
  static WordList	words;
  static int	noSignal;
  
  //*****************************************************************************
  // Retriever::Retriever()
  //
***************
*** 34,39 ****
--- 36,44 ----
      currenthopcount = 0;
      max_hop_count = config.Value("max_hop_count", 999999);
  
+ // plp
+     gus.hop_count = 0;
+ 
      //
      // Initialize the weight factors for words in the different
      // HTML headers
***************
*** 276,295 ****
  	    // There may be no more documents, or the server
  	    // has passed the server_max_docs limit
  
! 	    //
! 	    // We have a URL to index, now.  We need to register the
! 	    // fact that we are not done yet by setting the 'more'
! 	    // variable.
! 	    //
! 	    more = 1;
! 
! 	    //
! 	    // Deal with the actual URL.
! 	    // We'll check with the server to see if we need to sleep()
! 	    // before parsing it.
! 	    //
! 	    server->delay();	// This will pause if needed and reset the time
! 
  	    parse_url(*ref);
  	    delete ref;
  	}
      }
--- 281,306 ----
  	    // There may be no more documents, or the server
  	    // has passed the server_max_docs limit
  
! // plp: store and preprocess new url for parent dir stripping
! 	    if (config.Boolean("prune_parent_dir_href", 0))
! 		store_url(ref->URL());
! 	    else
! 		gus.hop_count = 0;	// avoid chk config w every href
! 
! 	    //
! 	    // We have a URL to index, now.  We need to register the
! 	    // fact that we are not done yet by setting the 'more'
! 	    // variable.
! 	    //
! 	    more = 1;
! 
! 	    //
! 	    // Deal with the actual URL.
! 	    // We'll check with the server to see if we need to sleep()
! 	    // before parsing it.
! 	    //
! 	    server->delay();	// This will pause if needed and reset the time
! 
  	    parse_url(*ref);
  	    delete ref;
  	}
      }
***************
*** 1164,1169 ****
--- 1175,1188 ----
  
      url.normalize();
  
+ // plp: check whether it is a substring of the base URL
+     if((gus.hop_count > 0) && (url_is_parent_dir(url.get()) != 0)) {
+ //	cout << "got_href: pruning (is substr of base url) " << url.get() << "\n"; // debug
+ 	if(debug > 0)
+ 	    cout << "!";	// bang ! in the progress indicator characters
+ 	return;
+     }
+ 
      // If it is a backlink from the current document,
      // just update that field.  Writing to the database
      // is meaningless, as it will be overwritten.
***************
*** 1521,1523 ****
--- 1540,1607 ----
  	}
      }
  
+ // plp
+ // private function used to chop and store the url for substring comparison
+ void
+ Retriever::chop_url(ChoppedUrlStore &cus, char *c_url)
+ {
+     int	l;
+ 
+     cus.url_store[0] = '\0';
+     cus.hop_count = 0;
+     l = strlen(c_url);
+     if((l == 0) || (l >= MAX_CAN_URL_LEN)) {
+ 	if(debug > 0)
+ 	    cout << "chop_url: failed on len==0\n";
+ 	return;
+     }
+     strcpy(cus.url_store, c_url);
+     l = 0;
+     if((cus.url_store_chopped[l++] = strtok(cus.url_store, "/")) == NULL) {
+ 	cus.url_store[0] = '\0';
+ 	if(debug > 0)
+ 	    cout << "chop_url: failed on NULL with " << c_url << "\n";
+ 	return;
+     }
+     while((cus.url_store_chopped[l++] = strtok(NULL, "/")) != NULL) {
+ 	if(l >= MAX_CAN_URL_HOPS) {
+ 	    cus.url_store[0] = '\0';
+ 	    return;	// fail silently with a valid url, print a bang somewhere else
+ 	}
+     }
+     cus.hop_count = l - 1;
+     return;	// success
+ }
+ 
+ // call this function to store the base URL of a document being indexed,
+ // when starting to index it (in HTML::parse or ExternalParser::parse)
+ void
+ Retriever::store_url(char *c_url)
+ {
+     chop_url(gus, c_url);
+     return;
+ }
+ 
+ // call this function to decide if a reaped URL is a direct parent of
+ // the URL being indexed. call in Retriever::got_href()
+ int
+ Retriever::url_is_parent_dir(char *c_url)
+ {
+     int	j, k;
+     ChoppedUrlStore	cus;
+ 
+     if(gus.hop_count == 0)
+ 	return 0;
+     chop_url(cus, c_url);
+     if(cus.hop_count == 0)
+ 	return 0;
+     // seek a matching first part (gus == substr of cus)
+     j = k = 0;
+     while(strcmp(gus.url_store_chopped[j++], cus.url_store_chopped[k++]) == 0) {
+ 	if(k == cus.hop_count)
+ 	    return 1;	// substring !
+ 	if(j == gus.hop_count)
+ 	    break;	// not
+     }
+     return 0;	// not
+ }
diff -rcN tmp/htdig-3.1.5/htdig/Retriever.h htdig-3.1.5/htdig/Retriever.h
*** tmp/htdig-3.1.5/htdig/Retriever.h	Fri Feb 25 04:29:10 2000
--- htdig-3.1.5/htdig/Retriever.h	Sun May 10 20:06:31 1998
***************
*** 24,29 ****
--- 24,35 ----
      Retriever_Restart
  };
  
+ // plp 000503 - for prune_parent_href feature
+ // max length of URL, in chars, fail silently if exceeded
+ #define MAX_CAN_URL_LEN		256
+ // max no. of slashes in same + 1, fail silently if exceeded
+ #define MAX_CAN_URL_HOPS	32
+ 
  class Retriever
  {
  public:
***************
*** 64,79 ****
      // Allow for the indexing of protected sites by using a
      // username/password
      //
!     void	setUsernamePassword(char *credentials);
  
      //
      // Routines for dealing with local filesystem access
      //
      StringList *	GetLocal(char *url);
      StringList *	GetLocalUser(char *url, StringList *defaultdocs);
!     int			IsLocalURL(char *url);
! 
  private:
      //
      // A hash to keep track of what we've seen
      //
--- 70,102 ----
      // Allow for the indexing of protected sites by using a
      // username/password
      //
!     void	setUsernamePassword(char *credentials);
  
      //
      // Routines for dealing with local filesystem access
      //
      StringList *	GetLocal(char *url);
      StringList *	GetLocalUser(char *url, StringList *defaultdocs);
!     int			IsLocalURL(char *url);
! 
!     // plp 000503 - for prune_parent_href feature
!     void	store_url(char *c_url);
!     int		url_is_parent_dir(char *c_url);
! 
  private:
+ 
+     // plp 000503 - for prune_parent_href feature
+     typedef struct {
+ 	char	url_store[MAX_CAN_URL_LEN];
+ 	char	*url_store_chopped[MAX_CAN_URL_HOPS];
+ 	int	hop_count;	// the last valid index in url_store_chopped + 1, or zero
+     } ChoppedUrlStore;
+ 
+     ChoppedUrlStore	gus;	// Global chopped Url Store
+ 
+     void	chop_url(ChoppedUrlStore &cus, char *c_url);
+     // /plp
+ 
      //
      // A hash to keep track of what we've seen
      //
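
For anyone who wants to experiment with the pruning test outside of htdig, here is a minimal standalone sketch of the same segment-by-segment comparison that chop_url() and url_is_parent_dir() perform in the patch above. It is not part of the patch: it assumes C++11, uses std::string/std::vector instead of the fixed-size buffers, and the names split_url and is_parent_dir are invented for this illustration only.

// Standalone illustration only -- not part of the patch or of htdig.
// Mirrors the segment comparison done by chop_url()/url_is_parent_dir().
#include <iostream>
#include <string>
#include <vector>

// Split a URL on '/' into non-empty segments,
// e.g. "http://here/a/b/c" -> { "http:", "here", "a", "b", "c" }.
static std::vector<std::string> split_url(const std::string &url)
{
    std::vector<std::string> parts;
    std::string seg;
    for (char c : url) {
        if (c == '/') {
            if (!seg.empty())
                parts.push_back(seg);
            seg.clear();
        } else
            seg += c;
    }
    if (!seg.empty())
        parts.push_back(seg);
    return parts;
}

// True when every segment of 'candidate' matches the leading segments
// of 'base', i.e. the candidate is the base URL or one of its parents.
static bool is_parent_dir(const std::string &base, const std::string &candidate)
{
    std::vector<std::string> b = split_url(base);
    std::vector<std::string> c = split_url(candidate);
    if (c.empty() || c.size() > b.size())
        return false;
    for (std::size_t i = 0; i < c.size(); i++)
        if (b[i] != c[i])
            return false;
    return true;
}

int main()
{
    const std::string base = "http://here/a/b/c";   // document being indexed
    const char *hrefs[] = { "http://here", "http://here/a",
                            "http://here/a/b", "http://here/a/other" };
    // The first three are parents of the base URL and would be pruned
    // (a bang '!' with htdig -v); the last one would be kept.
    for (const char *href : hrefs)
        std::cout << href << " -> "
                  << (is_parent_dir(base, href) ? "prune" : "keep") << "\n";
    return 0;
}

Built with g++ -std=c++11, this should print "prune" for the first three URLs and "keep" for the last, which mirrors which hrefs the patched got_href() drops.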
