Hi all,
Looking for some opinions and epiphanies.
InputFilter has been giving me trouble tonight. First, it was dropping
invalid tags and their contents if they happened to be within a stack.
Moeffju fixed this in InputFilter in r3379, which solved part of the
problem.
The other part of the problem is that the HTMLTokenizer *always*
treats < as an opening tag.
Consider the following:
$ echo 'echo InputFilter::filter("foo <one two> bar");' | heval
foo <one> bar
"two" simply disappears because the HTML tokenizer treats "<one two>"
as a tag. In many cases, especially in comment filtering, this is not
the desired behaviour. I'd much prefer that to output:
foo <one two> bar
Here's a real-world example that has _extremely_ non-obvious output to
someone who hasn't studied the HTMLTokenizer:
$ echo 'echo InputFilter::filter("foo <http://localhost/> bar");' | heval
foo <http:>/localhost/> bar
Dumping the node for a simpler input shows that the HTMLTokenizer is
actually discarding data that, while completely invalid as HTML, is
important to some messages:
$ echo 'echo InputFilter::filter("<foo bar>");' | heval
array(4) {
  ["type"]=>
  int(1)
  ["name"]=>
  string(5) "#text"
  ["value"]=>
  string(5) "<foo>"
  ["attrs"]=>
  array(0) {
  }
}
<foo>
... Where did "bar" go? HTMLTokenizer thought it was an invalid
attribute or bad markup and dropped it.
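To catch this kind of silent loss in a test, something like the helper
below might do. It is only a sketch: dropped_words() is a name I made up,
not anything in Habari, and it simply lists the words that appear in the
input but not in the filtered output. The input/output pairs are the ones
from the examples above; in a real test the second argument would
presumably come from InputFilter::filter().

```php
<?php
// Hypothetical test helper, not part of Habari: return the words that are
// present in $input but absent from $output.
function dropped_words($input, $output)
{
    // Split on anything that is not a letter or digit, discarding empties.
    $words_in  = preg_split('/[^A-Za-z0-9]+/', $input, -1, PREG_SPLIT_NO_EMPTY);
    $words_out = preg_split('/[^A-Za-z0-9]+/', $output, -1, PREG_SPLIT_NO_EMPTY);

    // array_diff() keeps the original keys, so reindex the result.
    return array_values(array_diff($words_in, $words_out));
}

// Pairs copied from the examples in this message.
echo implode(', ', dropped_words('foo <one two> bar', 'foo <one> bar')), "\n";
echo implode(', ', dropped_words('<foo bar>', '<foo>')), "\n";
```

Note that this ignores how often a word occurs and where, so it only
detects words that vanish entirely. It would not flag the mangled URL
example, where every word survives but the markup is rearranged.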
Here's a plausible real-world example:
$ echo 'echo InputFilter::filter("one<two but four>three");' | heval
one<two>three
... that's just wrong.
This affects ticket #65 (and #104).
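To get the ball rolling, here is one idea, offered as a rough sketch rather
than a patch: run a pre-pass before the tokenizer that encodes any "<" that
does not begin a tag we actually care about. The function name
escape_stray_lt() and the allowed-tag list are invented for illustration;
nothing here is existing Habari API, and I have not tested how the real
HTMLTokenizer or InputFilter treats the resulting "&lt;".

```php
<?php
// Hypothetical pre-pass, not part of Habari: encode every "<" that is not
// the start of an opening or closing tag whose name is in $allowed_tags.
function escape_stray_lt($text, array $allowed_tags)
{
    return preg_replace_callback(
        // "<", optionally followed by "/" and a tag name that is itself
        // followed by whitespace, "/" or ">" (checked with a lookahead).
        '#<(/?[A-Za-z][A-Za-z0-9]*(?=[\s/>]))?#',
        function ($m) use ($allowed_tags) {
            $name = isset($m[1]) ? strtolower(ltrim($m[1], '/')) : '';
            if ($name !== '' && in_array($name, $allowed_tags, true)) {
                return $m[0]; // looks like an allowed tag: leave it alone
            }
            return '&lt;' . substr($m[0], 1); // stray "<": encode it
        },
        $text
    );
}

// Illustrative whitelist only; the real list would come from the filter.
$allowed = array('a', 'b', 'em', 'strong');

echo escape_stray_lt('foo <one two> bar', $allowed), "\n";
// foo &lt;one two> bar
echo escape_stray_lt('foo <http://localhost/> bar', $allowed), "\n";
// foo &lt;http://localhost/> bar
echo escape_stray_lt('one<two but four>three', $allowed), "\n";
// one&lt;two but four>three
echo escape_stray_lt('<b>bold</b> text', $allowed), "\n";
// <b>bold</b> text
```

The obvious downside is that the whitelist has to be known before
tokenizing, and anything not on it is shown as literal text instead of
being stripped.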
If anyone has ideas, please speak up! (-:
S