(Please excuse me if this has already been covered.)

HtDig 3.1.1 isn't parsing (slightly non-standard) comments correctly.

Extra dashes in a comment can confuse the current parser into
ignoring a lot of content.  For example, <!--comment----> is seen as
the start of an unterminated comment, so everything after it is skipped.

It seems a lot of web content doesn't strictly adhere to the 
"standard" for comments, so we should be a little careful here.

For example, both IE and Netscape treat a "<!--" comment as ending at
the first "-->", with no whitespace allowed between the "--" and the ">".
Perhaps htDig would be better off doing the same.

i.e.:

[modified snip from HTML.cc]
      if (strncmp((char *)position, "<!", 2) == 0)
        {
          //
          // Possible comment declaration (but could be a DTD declaration!)
          // A comment can contain other '<' and '>' characters, so
          // we have to skip complete comment declarations,
          // and of course DTD declarations as well.
          //
          position += 2;        // Get past declaration start

          // is it a comment?
          if (strncmp((char *)position, "--", 2) == 0)
            {
              // Found start of comment - now find the end
              q = (unsigned char*)strstr((char *)position, "-->");
              if (!q)
                {
                  // Rest of document seems to be a comment...
                  *position = '\0';  
                } 
              else 
                {
                  position = q + 3;
                }
            }
          else
            {
              // Not a comment declaration after all
              // but possibly DTD: get to the end
              q = (unsigned char*)strstr((char *)position, ">");
              if (q)
                {
                  position = q + 1;
                  // End of (whatever) declaration
                }
              else
                {
                  *position = '\0'; // Rest of document is DTD?
                }
            }
          continue;
        }
[snip]
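
To make the behaviour easier to try outside of HTML.cc, here is a
rough standalone sketch of the same rule. The function name
skip_declaration is my own invention and not part of htDig, and
unlike the snippet above it returns a pointer instead of writing a
'\0' into the buffer:

```cpp
#include <cstring>

// Hypothetical helper, not part of htDig: p points at "<!".
// Returns a pointer just past the declaration, using the lenient
// browser-style rule that a comment ends at the first "-->".
// If no terminator is found, returns a pointer to the end of the
// buffer (the rest of the document is treated as the declaration).
const char *skip_declaration(const char *p)
{
    p += 2;                                   // get past "<!"

    // A comment must be closed by "-->"; anything else by ">".
    const bool is_comment = std::strncmp(p, "--", 2) == 0;
    const char *terminator = is_comment ? "-->" : ">";

    const char *q = std::strstr(p, terminator);
    if (!q)
        return p + std::strlen(p);            // unterminated declaration
    return q + std::strlen(terminator);
}
```

With this rule, parsing resumes right after the ">" of
<!--comment---->, and a '>' inside a comment does not end it early.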


According to Marjolein Katsma: 
> Starting on my next project, I had to dig in HTML.cc, and found the
> following code to filter out comments: 

According to Gilles Detilleux:
> While this will catch *most* comments, it will see some perfectly legal 
> comments as illegal and skip the rest of the page. The best definition
> of comments is found in HTML 2.0 (unchanged in the actual DTD in later 
> versions, but never properly explained any more...): 
> 
> "To include comments in an HTML document, use a comment declaration. A 
> comment declaration consists of `<!' followed by zero or more comments 
> followed by `>'. Each comment starts with `--' and includes all text up
> to and including the next occurrence of `--'. In a comment declaration,
> white space is allowed after each comment, but not before the first 
> comment.  The entire comment declaration is ignored." 
> 
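
For comparison, a literal reading of the definition quoted above
might look roughly like the sketch below. The function name is
hypothetical and this is only my interpretation of the HTML 2.0
wording, not code from htDig. It shows why <!--comment----> trips a
strict parser: the first "--" after "comment" closes the comment, and
the next "--" opens a second one that never ends.

```cpp
#include <cctype>
#include <cstring>

// Hypothetical helper, not part of htDig: p points at "<!".
// Returns a pointer just past the closing '>' of a comment
// declaration as described in the HTML 2.0 text, or nullptr if the
// input is not a well-formed comment declaration.
const char *skip_strict_comment_declaration(const char *p)
{
    if (std::strncmp(p, "<!", 2) != 0)
        return nullptr;
    p += 2;

    // Zero or more comments, each running from "--" to the next "--".
    // No white space is allowed before the first comment.
    while (std::strncmp(p, "--", 2) == 0)
    {
        const char *end = std::strstr(p + 2, "--");
        if (!end)
            return nullptr;                   // comment never closed
        p = end + 2;

        // White space is allowed after each comment.
        while (std::isspace(static_cast<unsigned char>(*p)))
            ++p;
    }

    // The declaration must now be closed by '>'.
    return *p == '>' ? p + 1 : nullptr;
}
```

Under this reading "<!-- a -- -- b -->" and "<!-- x -- >" are both
accepted, while "<!--comment---->" is rejected unless another "--"
happens to turn up later in the document.
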
Matthew Edwards ([EMAIL PROTECTED]) |  The fuel of innovation and
Go2Net Inc.  999 Third Ave Suite 4700 |     progress is freedom.


