[habari-dev] InputFilter && HTMLTokenizer flaw

Sean Coates Sun, 22 Mar 2009 17:49:14 -0700

Hi all,

Looking for some opinions and epiphanies.
InputFilter has been giving me trouble tonight. First, it was dropping  
invalid tags and their contents if they happened to be within a stack.  
Moeffju fixed this in r3379, in InputFilter. This fixed part of the  
problem.


The other part of the problem is that the HTMLTokenizer *always*  
treats < as an opening tag.

Consider the following:

$ echo 'echo InputFilter::filter("foo <one two> bar");' | heval
foo &lt;one&gt; bar

"two" simply disappears because the HTML tokenizer treats "<one two>"  
as a tag. In many cases, especially in comment filtering, this is not  
the desired behaviour. I'd much prefer that to output:
foo &lt;one two&gt; bar

Here's a real-world example that has _extremely_ non-obvious output to  
someone who hasn't studied the HTMLTokenizer:

$ echo 'echo InputFilter::filter("foo <http://localhost/> bar");' | heval
foo &lt;http:&gt;/localhost/&gt; bar

Dumping this node indicates that the HTMLTokenizer is actually  
discarding bits of data that, while completely invalid in HTML, are  
important to some messages:
$ echo 'echo InputFilter::filter("<foo bar>");' | heval
array(4) {
   ["type"]=>
   int(1)
   ["name"]=>
   string(5) "#text"
   ["value"]=>
   string(5) "<foo>"
   ["attrs"]=>
   array(0) {
   }
}
&lt;foo&gt;
... Where did "bar" go? HTMLTokenizer thought it was an invalid  
attribute or bad markup and dropped it.

Here's a plausible real-world example:
$ echo 'echo InputFilter::filter("one<two but four>three");' | heval
one&lt;two&gt;three

... that's just wrong.
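
One possible direction, sketched here only to make the desired behaviour concrete: decide whether a "<...>" span is a tag by checking its element name against a whitelist, and escape everything else verbatim instead of letting the tokenizer parse it. The function name escape_unknown_tags() and the whitelist are hypothetical and not part of Habari; a real fix would presumably live in HTMLTokenizer or InputFilter and would still have to filter the attributes of the tags it lets through.

```php
<?php
// Hypothetical helper, NOT part of Habari: a sketch of "only treat < as a
// tag opener when a whitelisted element name follows it".
function escape_unknown_tags($input, array $allowed)
{
    // Split into "<...>" spans (odd indexes) and the text between them.
    $parts = preg_split('/(<[^<>]*>)/', $input, -1, PREG_SPLIT_DELIM_CAPTURE);
    $out = '';
    foreach ($parts as $part) {
        // A span counts as a tag only if it starts with "<name" or "</name",
        // the name ends at whitespace, "/" or ">", and the name is allowed.
        if (preg_match('#^</?([A-Za-z][A-Za-z0-9]*)(?=[\s/>])#', $part, $m)
            && in_array(strtolower($m[1]), $allowed, true)) {
            // Passed through untouched here; attribute filtering is out of
            // scope for this sketch.
            $out .= $part;
        } else {
            // Plain text or an unknown "tag": escape it, keeping all of it.
            // The final "false" avoids double-encoding existing entities.
            $out .= htmlspecialchars($part, ENT_NOQUOTES, 'UTF-8', false);
        }
    }
    return $out;
}

$allowed = array('em', 'strong', 'a'); // illustrative whitelist only

echo escape_unknown_tags('foo <one two> bar', $allowed), "\n";
// foo &lt;one two&gt; bar
echo escape_unknown_tags('foo <http://localhost/> bar', $allowed), "\n";
// foo &lt;http://localhost/&gt; bar
echo escape_unknown_tags('one<two but four>three', $allowed), "\n";
// one&lt;two but four&gt;three
```

Note that "<one two>" is syntactically a well-formed tag with a boolean attribute, so a purely syntactic check cannot tell it apart from markup; that is why the sketch keys on the element name.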

This affects ticket #65 (and #104).

If anyone has ideas, please speak up! (-:

S

