htdig: Patch: support META elements for external parsers.

Hans-Peter Nilsson Wed, 13 Jan 1999 22:13:50 -0500
Here's an implementation of META elements for external parsers;
'm' was used for this.  Nothing really new; most was stolen from
HTML.cc (no, I could not find a good way to share that code
within limits).

Note that meta.html is not up-to-date (regardless of this).
I did not fix that; I see it as a bug that can be fixed during
the feature-freeze (schemes within schemes :-) 
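
To make the record layout concrete, here is a rough sketch of how an
external parser might build an 'm' record, and why a tokenizer that keeps
empty fields matters. The helper names make_meta_record and split_tabs are
hypothetical and are not part of htdig; split_tabs only stands in for the
empty-field behaviour that the patch relies on good_strtok for.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of htdig): build one 'm' record with the
// three tab-separated fields the patch reads: http-equiv, name, content.
// Any of the three may be empty.
std::string make_meta_record(const std::string &httpEquiv,
                             const std::string &name,
                             const std::string &content)
{
    return "m\t" + httpEquiv + "\t" + name + "\t" + content;
}

// Hypothetical helper (not part of htdig): split a record on tabs while
// keeping empty fields. Plain strtok() would collapse the two adjacent
// tabs in "m\t\tkeywords\t..." and lose the empty http-equiv field.
std::vector<std::string> split_tabs(const std::string &line)
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;)
    {
        std::string::size_type tab = line.find('\t', start);
        if (tab == std::string::npos)
        {
            fields.push_back(line.substr(start));
            break;
        }
        fields.push_back(line.substr(start, tab - start));
        start = tab + 1;
    }
    return fields;
}
```

With these, make_meta_record("", "keywords", "search indexing") yields a
record whose second field is empty, and split_tabs on that record still
returns four fields, so the name and content stay in their expected
positions.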


Thu Jan 14 03:16:15 1999  Hans-Peter Nilsson  <[EMAIL PROTECTED]>

        * htdig/ExternalParser.cc (parse): Added support for 'm': meta
        element.
        * htdoc/attrs.html: Document it.

Index: htdig/ExternalParser.cc
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htdig/ExternalParser.cc,v
retrieving revision 1.4
diff -p -c -r1.4 ExternalParser.cc
*** ExternalParser.cc   1998/12/06 18:46:59     1.4
--- ExternalParser.cc   1999/01/14 02:36:20
*************** static char RCSid[] = "$Id: ExternalPars
*** 30,35 ****
--- 30,36 ----
  #include <Dictionary.h>
  #include <ctype.h>
  #include <stdio.h>
+ #include <good_strtok.h>
  
  static Dictionary     *parsers = 0;
  extern String         configFile;
*************** ExternalParser::parse(Retriever &retriev
*** 153,158 ****
--- 154,162 ----
        return;
      }
  
+     unsigned int minimum_word_length
+       = config.Value("minimum_word_length", 3);
+ 
      String    line;
      char      *token1, *token2, *token3;
      URL               url;
*************** ExternalParser::parse(Retriever &retriev
*** 209,214 ****
--- 213,328 ----
                token1 = strtok(0, "\t");
                if (token1 != NULL)
                  retriever.got_image(token1);
+               else
+                 cerr<< "External parser error in line:"<<line<<"\n";
+               break;
+           case 'm':   // meta
+               // Using good_strtok means we can accept empty
+               // fields.
+               char *httpEquiv = good_strtok(token1+2, "\t");
+               char *name = good_strtok(0, "\t");
+               char *content = good_strtok(0, "\t");
+ 
+               if (httpEquiv != NULL && name != NULL && content != NULL)
+               {
+                 // It would be preferable if we could share
+                 // this part with HTML.cc, but it has other
+                 // chores too, and I do not see a point where to
+                 // split it up to get a common shared function
+                 // (or class).  Which should not stop anybody from
+                 // finding a better solution.
+                 // For now, there is duplicated code.
+                 StringMatch   keywordsMatch;
+                 String        keywordNames = config["keywords_meta_tag_names"];
+ 
+                 keywordNames.replace(' ', '|');
+                 keywordNames.remove(",\t\r\n");
+                 keywordsMatch.IgnoreCase();
+                 keywordsMatch.Pattern(keywordNames);
+     
+                 // <URL:http://www.w3.org/MarkUp/html-spec/html-spec_5.html#SEC5.2.5>
+                 // says that the "name" attribute defaults to
+                 // the http-equiv attribute if empty.
+                 if (*name == '\0')
+                   name = httpEquiv;
+ 
+                 if (*httpEquiv != '\0')
+                 {
+                   // <META HTTP-EQUIV=REFRESH case
+                   if (mystrcasecmp(httpEquiv, "refresh") == 0
+                       && *content != '\0')
+                   {
+                     char *q = mystrcasestr(content, "url=");
+                     if (q && *q)
+                     {
+                       q += 4; // skipping "URL="
+                       char *qq = q;
+                       while (*qq && (*qq != ';') && (*qq != '"') &&
+                              !isspace(*qq))qq++;
+                       *qq = 0;
+                       URL href(q, base);
+                       // I don't know why anyone would do this, but hey...
+                       retriever.got_href(href, "");
+                     }
+                   }
+                 }
+ 
+                 //
+                 // Now check for <meta name=...  content=...> tags that
+                 // fly with any reasonable DTD out there
+                 //
+                 if (*name != '\0' && *content != '\0')
+                 {
+                   if (keywordsMatch.CompareWord(name))
+                   {
+                     char      *w = strtok(content, " ,\t\r");
+                     while (w)
+                     {
+                       if (strlen(w) >= minimum_word_length)
+                         retriever.got_word(w, 1, 10);
+                       w = strtok(0, " ,\t\r");
+                     }
+                   }
+                   else if (mystrcasecmp(name, "htdig-email") == 0)
+                   {
+                     retriever.got_meta_email(content);
+                   }
+                   else if (mystrcasecmp(name, "htdig-notification-date") == 0)
+                   {
+                     retriever.got_meta_notification(content);
+                   }
+                   else if (mystrcasecmp(name, "htdig-email-subject") == 0)
+                   {
+                     retriever.got_meta_subject(content);
+                   }
+                   else if (mystrcasecmp(name, "description") == 0 
+                            && strlen(content) != 0)
+                   {
+                     //
+                     // We need to do two things. First grab the description
+                     //
+                     String meta_dsc = content;
+ 
+                     if (meta_dsc.length() > max_meta_description_length)
+                       meta_dsc = meta_dsc.sub(0, max_meta_description_length).get();
+                     if (debug > 1)
+                       cout << "META Description: " << content << endl;
+                     retriever.got_meta_dsc(meta_dsc);
+ 
+                     //
+                     // Now add the words to the word list
+                     // (slot 11 is the new slot for this)
+                     //
+                     char        *w = strtok(content, " \t\r");
+                     while (w)
+                     {
+                       if (strlen(w) >= minimum_word_length)
+                         retriever.got_word(w, 1, 11);
+                       w = strtok(0, " \t\r");
+                     }
+                   }
+                 }
+               }
                else
                  cerr<< "External parser error in line:"<<line<<"\n";
                break;
Index: htdoc/attrs.html
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htdoc/attrs.html,v
retrieving revision 1.15
diff -p -c -r1.15 attrs.html
*** attrs.html  1999/01/14 01:19:25     1.15
--- attrs.html  1999/01/14 02:36:25
***************
*** 1277,1284 ****
              The external parser is to write information for
              htdig on its standard output.<br>
               The output consists of records, each record terminated
!             with a newline. Each record is a series of non-empty tab
!             separated fields. The first field is a single character
              that specifies the record type. The rest of the fields
              are determined by the record type. 
              <table border="1">
--- 1277,1285 ----
              The external parser is to write information for
              htdig on its standard output.<br>
               The output consists of records, each record terminated
!             with a newline. Each record is a series of (unless
!             expressly allowed to be empty) non-empty tab-separated
!             fields. The first field is a single character
              that specifies the record type. The rest of the fields
              are determined by the record type. 
              <table border="1">
***************
*** 1467,1472 ****
--- 1468,1504 ----
                    the document.
                  </td>
                </tr>
+               <tr>
+                 <th rowspan="3" valign="top">
+                   m
+                 </th>
+                 <td valign="top">
+                   http-equiv
+                 </td>
+                 <td>
+                   The HTTP-EQUIV attribute of a <a
+                   href="meta.html"><i>META</i> tag</a>.
+                   May be empty.
+                 </td>
+               </tr>
+               <tr>
+                 <td valign="top">
+                   name
+                 </td>
+                 <td>
+                   The NAME attribute of this <i>META</i>
+                   tag.  May be empty.
+                 </td>
+               </tr>
+               <tr>
+                 <td valign="top">
+                   contents
+                 </td>
+                 <td>
+                   The CONTENT attribute of this <i>META</i> tag.
+                   May be empty.
+                 </td>
+               </tr>
              </table>
            </dd>
            <dt>

brgds, H-P