htdig: Skipping parts of a document

Marjolein Katsma Tue, 12 Jan 1999 01:58:16 -0500
Sometimes it is useful not to index parts of a document. Some examples:
- When search results use anchors to jump to the appropriate part of the
text (see a previous contribution), a match in a top-of-page menu is
hardly relevant.
- Content (perhaps produced by an external server) that changes faster than
the indexing cycle, for instance daily news headlines.
- Deleted text (<DEL></DEL> in HTML 4.0).

This patch lets you place start and end markers in the text so that
anything between them will not be indexed. Existing tags can also serve as
markers (for instance <DEL> and </DEL> in my last example!). The default
markers are <!--htdig_noindex--> and <!--/htdig_noindex-->; the
corresponding config parameters are noindex_start and noindex_end.
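
As an illustration only: assuming the usual "name: value" syntax of the
htdig configuration file, overriding the default markers so that HTML 4.0
deleted text is skipped might look like the fragment below. The choice of
<DEL> here is just the example from the text, not a recommended setting.

```
noindex_start:  <DEL>
noindex_end:    </DEL>
```

Since the HTML.cc patch compares markers with strncmp, the marker text
presumably has to match the document byte for byte, including case, so a
page that writes <del> in lower case would not be affected by this setting.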

Two patches:

(1)
defaults.cc - this patch is against release 3.1.0b4 and is a
*replacement* for the previously posted patch (just a correction to a
comment, though); it also contains a few changes needed for other features.

diff -c3p defaults.cc defaultsMK.cc
*** defaults.cc Tue Dec 22 18:53:12 1998
--- defaultsMK.cc       Mon Jan 11 10:41:35 1999
***************
*** 3,8 ****
--- 3,22 ----
  //
  // default values for the ht programs
  //
+ // Revision 1999-01-11 mkatsma
+ // Added options translate_amp, translate_lt_gt and translate_quot to enable
+ // configuration of whether or not entities for '&', '<', '>' and '"' will
+ // be translated. The default is true, leaving the normal operation of htdig
+ // unchanged.
+ //
+ // Revision 1999-01-10 mkatsma
+ // Implemented configurable 'no title' text (found on mail list archive)
+ //
+ // Revision 1999-01-06 mkatsma
+ // Added options noindex_start and noindex_end to enable NOT indexing
+ // some sections of code; useful to exclude such things as local page menus
+ // and server-generated code that changes faster than an indexing cycle.
+ //
  // $Log: defaults.cc,v $
  // Revision 1.24  1998/12/11 02:49:54  ghutchis
  // Added option for server_max_docs as a limit on the number of docs returned
*************** ConfigDefaults  defaults[] =
*** 168,173 ****
--- 182,190 ----
      {"no_excerpt_show_top",             "false"},
      {"no_next_page_text",             "[next]"},
      {"no_prev_page_text",             "[prev]"},
+     {"no_title_text",                 "[No title]"},  //mk19990110
+     {"noindex_start",                 "<!--htdig_noindex-->"},  //mk19990106
+     {"noindex_end",                   "<!--/htdig_noindex-->"},  //mk19990106
      {"nothing_found_file",            "${common_dir}/nomatch.html"},
      {"page_list_header",              "<hr noshade size=2>Pages:<br>"},
      {"prefix_match_character",                "*"},
*************** ConfigDefaults  defaults[] =
*** 195,200 ****
--- 212,220 ----
      {"text_factor",                   "1"},
      {"timeout",                               "30"},
      {"title_factor",                  "100"},
+     {"translate_amp",                 "false"},  //mk19990111
+     {"translate_lt_gt",               "false"},  //mk19990111
+     {"translate_quot",                "false"},  //mk19990111
      {"url_list",                      "${database_base}.urls"},
      {"use_star_image",                        "true"},
      {"use_meta_description",            "false"},


(2)
Patch to HTML.cc - this is relative to my previous version with the
modified comment-skipping algorithm (see my previous post):

javawoman: {10} % diff -c3p HTMLcommentMK.cc HTMLMK.cc
*** HTMLcommentMK.cc    Mon Jan 11 22:46:49 1999
--- HTMLMK.cc   Mon Jan 11 23:21:31 1999
***************
*** 3,8 ****
--- 3,12 ----
  //
  // Implementation of HTML
  //
+ // Revision 1999-01-09 mkatsma
+ // Added algorithm to skip text between configurable markers so it will
+ // not be indexed.
+ //
  // Revision 1999-01-07/1999-01-09 mkatsma
  // Modification of comment-filtering algorithm so it skips all legal SGML
  // comment declarations, including ones with whitespace after the last
*************** HTML::parse(Retriever &retriever, URL &b
*** 188,193 ****
--- 192,213 ----

      while (*position)
      {
+
+       //
+       // Filter out section marked to be ignored for indexing. 
+       // This can contain any HTML. 
+       //
+       char *skip_start = config["noindex_start"];
+       char *skip_end = config["noindex_end"];
+       if (strncmp((char *)position, skip_start, strlen(skip_start)) == 0)
+       {
+               q = (unsigned char*)strstr((char *)position, skip_end);
+               if (!q)
+                       *position = '\0';       // Rest of document will be skipped...
+               else
+                       position = q + strlen(skip_end);
+               continue;
+       }

        //  Improved algorithm 1999-01-07 Marjolein Katsma
        //      (with help from Gilles Detillieux)


Marjolein Katsma      [EMAIL PROTECTED]
Java Woman - http://javawoman.com/