Hi all,

Looking for some opinions and epiphanies.
InputFilter has been giving me trouble tonight. First, it was dropping  
invalid tags and their contents if they happened to be within a stack.  
Moeffju fixed this in r3379 (in InputFilter), which solves part of the  
problem.

The other part of the problem is that the HTMLTokenizer *always*  
treats < as an opening tag.

Consider the following:

$ echo 'echo InputFilter::filter("foo <one two> bar");' | heval
foo &lt;one&gt; bar

"two" simply disappears because the HTML tokenizer treats "<one two>"  
as a tag. In many cases, especially in comment filtering, this is not  
the desired behaviour. I'd much prefer the output to be:
foo &lt;one two&gt; bar

Here's a real-world example that has _extremely_ non-obvious output to  
someone who hasn't studied the HTMLTokenizer:

$ echo 'echo InputFilter::filter("foo <http://localhost/> bar");' | heval
foo &lt;http:&gt;/localhost/&gt; bar

Dumping this node indicates that the HTMLTokenizer is actually  
discarding bits of data that, while completely invalid in HTML, are  
important to some messages:
$ echo 'echo InputFilter::filter("<foo bar>");' | heval
array(4) {
   ["type"]=>
   int(1)
   ["name"]=>
   string(5) "#text"
   ["value"]=>
   string(5) "<foo>"
   ["attrs"]=>
   array(0) {
   }
}
&lt;foo&gt;
... Where did "bar" go? HTMLTokenizer thought it was an invalid  
attribute or bad markup and dropped it.

Here's a plausible real-world example:
$ echo 'echo InputFilter::filter("one<two but four>three");' | heval
one&lt;two&gt;three

... that's just wrong.
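
One possible direction, as a rough sketch only: treat a "<...>" run as a tag only when it is well-formed and its name is on an allowlist, and escape everything else verbatim so nothing is dropped. The function name escape_unknown_tags and the allowlist below are made up for illustration; this is not how InputFilter or HTMLTokenizer currently work, and attribute filtering is left out entirely.

```php
<?php
// Hypothetical helper, NOT part of Habari: keeps well-formed tags whose
// names are on an allowlist and escapes every other "<...>" run in full.
function escape_unknown_tags(string $text, array $allowed = ['a', 'b', 'em', 'strong']): string
{
    // Split on anything shaped like "<...>", keeping those runs in the result.
    $parts = preg_split('/(<[^<>]*>)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    $out = '';
    foreach ($parts as $i => $part) {
        // With one capture group, odd indexes hold the captured "<...>" runs.
        $is_candidate = ($i % 2 === 1);
        if ($is_candidate
            && preg_match('/^<\/?([a-z][a-z0-9]*)(\s[^<>]*)?>$/i', $part, $m)
            && in_array(strtolower($m[1]), $allowed, true)) {
            // Allowed tag: passed through as-is. A real filter would still
            // need to clean its attributes; that step is omitted here.
            $out .= $part;
        } else {
            // Plain text or an unknown/invalid "tag": escape the whole run.
            $out .= htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
        }
    }
    return $out;
}
```

With this sketch, the three inputs above come out as "foo &lt;one two&gt; bar", "foo &lt;http://localhost/&gt; bar" and "one&lt;two but four&gt;three", while something like "<b>hi</b>" passes through unchanged.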

This affects ticket #65 (and #104).

If anyone has ideas, please speak up! (-:

S

