bug report -> HTML::Parser 3.02 non-strict comment parsing errors

Craig W. Shaver Wed, 12 Jan 2000 17:08:17 -0800

Hi,

I am having problems with the parser not recognizing the end of
comments.  It seems that the end-of-comment recognizer at lines 496
through 526 of hparser.c does not close a comment until it sees an even
number of '-' chars before the '\s*>' ending sequence, so an odd-length
run such as '--->' is missed.  Here is a small HTML example:

<html><body>
This is text 1
<!---spacer cell ---->
This is text 2
<!-- spacer cell -->
This is text 3
<!---spacer cell -->
This is text 4
<!---spacer cell ----->
This is text 5
<!-- spacer cell -->
This is text 6
<!-- spacer cell --->
This is text 7
<!-- spacer cell -->
</body></html>

I have a test parser that uses HTML::TokeParser to just drop comments,
and I get the following when I run this HTML through it:

Content-Length: 286
Content-Type: text/html
Last-Modified: Thu, 13 Jan 2000 01:02:25 GMT
Client-Date: Thu, 13 Jan 2000 01:04:37 GMT

response base = file:./comments.html
<html><body >
This is text 1

This is text 2

This is text 3

This is text 4

This is text 6

</body></html>

As you can see, the '5' text and the '7' text have been swallowed by
run-on comments: the comments ending in '----->' and '--->' are not
closed until the next '-->'.

I would try to provide a fix in the hparser.c code, but I do not know
enough about how the token buffer routines work to test for the end of
a comment with a lookahead.
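
For illustration only, here is a rough standalone sketch of the kind of
lookahead test that would avoid the problem.  It is not taken from
hparser.c and does not use its token buffer routines; the function name
find_comment_end and its (start, end) pointer interface are made up for
this example.  It assumes the lenient rule that a comment ends at any run
of two or more '-' chars followed by optional whitespace and '>'.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, NOT part of hparser.c.
 *
 * s   points just past the opening "<!--"
 * end points one past the last byte of the buffer
 *
 * Returns a pointer just past the '>' that closes the comment, or NULL
 * if the buffer holds no terminator.  A terminator is a run of two or
 * more '-', then optional whitespace, then '>'.  The parity of the run
 * does not matter.
 */
static const char *find_comment_end(const char *s, const char *end)
{
    assert(s != NULL && end != NULL && s <= end);

    while (s < end) {
        const char *p;
        size_t dashes = 0;

        /* Jump to the next '-' in the remaining buffer. */
        s = memchr(s, '-', (size_t)(end - s));
        if (s == NULL)
            return NULL;

        /* Count the whole run of dashes, odd or even. */
        p = s;
        while (p < end && *p == '-') {
            p++;
            dashes++;
        }

        if (dashes >= 2) {
            /* Look ahead over optional whitespace for the '>'. */
            const char *q = p;
            while (q < end && isspace((unsigned char)*q))
                q++;
            if (q < end && *q == '>')
                return q + 1;
        }

        /* Not a terminator; resume scanning after this run. */
        s = p;
    }
    return NULL;
}
```

With this rule the comment body "-spacer cell ----->" from the example
above ends at its own '>' instead of running on into the next comment.
A real fix would also have to handle a comment that is split across two
chunks of input, which this sketch ignores.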

Thanks,

BTW, this is a great tool and I love it!

-- 
[EMAIL PROTECTED] (408)543-6451
Craig Shaver, Productivity Group
POB 60458 Sunnyvale, CA  94088 (650)390-0654
http://www.progroup.com/ mailto:[EMAIL PROTECTED]
