http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4695





------- Additional Comments From [EMAIL PROTECTED]  2006-01-24 18:25 -------
I've been prodding this for a little bit.  As far as I can tell, HTML::Parser
(I'm using 3.48) treats "<br>" differently from "<br/>".  Specifically, "<br>"
gets turned into "\n", whereas "<br/>" turns into "".  For example:

<br/> gets:

This is a littletest with a fewwords perlineto keep itshort// Mene Tekal

<br> gets:

This
is a little
test
with a few
words per
line
to keep it
short
 
//
Mene
Tekal


So when HTML::html_text() looks for obfuscation with "<br/>", instead of seeing
the previous array element as "\n" (which is not considered obfuscation), it
sees text ("is a little", etc.), which is considered obfuscation.

This brings up: how do we deal with this, since the issue is in HTML::Parser?
Looking at the POD, there's a method to tweak parse() so that tags ending in
"/>" are handled as empty elements ($p->empty_element_tags()).  Enabling this
also seems to stop HTML::Parser from treating the trailing "/" as part of the
element name for other elements:

       $p->empty_element_tags
       $p->empty_element_tags( $bool )
           By default, empty element tags are not recognized as such and the
           "/" before ">" is just treated like a normal name character (unless
           "strict_names" is enabled).  Enabling this attribute makes
           "HTML::Parser" recognize these tags.

           Empty element tags look like start tags, but end with the character
           sequence "/>" instead of ">".  When recognized by "HTML::Parser"
           they cause an artificial end event in addition to the start event.
           The "text" for the artificial end event will be empty and the
           "tokenpos" array will be undefined even though the token array
           will have one element containing the tag name.


Adding "$self->empty_element_tags(1);" to the end of M::SA::HTML::new() seems
to fix the issue.  After the change, $self->{text} with "<br/>" shows the same
as "<br>" above.
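
For anyone who wants to see the difference outside of SpamAssassin, here is a
minimal sketch, assuming HTML::Parser 3.x is installed.  tag_names() is a
hypothetical helper written for this illustration, not anything from
M::SA::HTML; it only records the tag name HTML::Parser reports for each start
event, with and without empty_element_tags enabled.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the tag names HTML::Parser reports for the
# start events in $html.  If $empty_tags is true, empty_element_tags is
# enabled on the parser before parsing.
sub tag_names {
    my ($html, $empty_tags) = @_;
    my @names;
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ sub { push @names, $_[0] }, 'tagname' ],
    );
    $p->empty_element_tags(1) if $empty_tags;
    $p->parse($html);
    $p->eof;
    return @names;
}

# Same input, parsed with the option off and then on.
print "default:            ", join(',', tag_names('a<br/>b', 0)), "\n";
print "empty_element_tags: ", join(',', tag_names('a<br/>b', 1)), "\n";
```

Going by the POD above, the default run should report the "/" as part of the
tag name, while the second run should report plain "br", which is the name the
<br> handling presumably keys on.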

So I'll put up a patch in a minute, though I'm not sure whether setting the
option will have an effect on anything else.  If someone else with more
HTML::Parser experience could chime in, that'd be good. :)


