Edit report at https://bugs.php.net/bug.php?id=62032&edit=1
ID: 62032
Comment by: anon at anon dot anon
Reported by: iamcraigcampbell at gmail dot com
Summary: filter_var incorrectly strips characters from strings after "<"
Status: Open
Type: Bug
Package: Filter related
Operating System: Mac OS X
PHP Version: 5.4.3
Block user comment: N
Private report: N
New Comment:
Well I never heard of this "SANITIZE_STRING" filter before, but it looks just
as retarded as it sounds, and about as retarded as strip_tags. 99.99% of the
time, strip_tags just should not be used at all because it's horribly broken.
The real bugs are (1) that strip_tags exists, and (2) that PHP implies that any
kind of magical all-purpose "string sanitization" process could exist.
@iamcraigcampbell:
>Well I can understand stripping it if there is a closing > somewhere, but if
>it is a < that is not followed by a matching > then it should be allowed in
>the string and not stripped.
In that case:
(1) Unclosed tags will eat extra page content, breaking page layout.
(2) Pages consist of many echo statements. By your logic, "<script" is a
possibly legal string to echo, but if some later string contains a ">", we need
to implement a delayed-choice quantum eraser to make all the parallel universes
in which the earlier echo statement occurred cease to exist.
>I think it is more expected behavior to display this string as "This is NOT
>good!".
No. Display what users type. Don't delete text from their posts based on the
quirks of what just happens to be the underlying display format on a particular
day. Suppose your hypothetical forum also displays posts in another format,
e.g., it has a Flash or iPhone-based app, or it tweets posts, or a few years
from now we're all using a completely different markup language. Should it then
also strip HTML-like tags from all text in perpetuity from all media just
because HTML happened to be a relevant format to someone somewhere once upon a
time, or should user-submitted text throw integrity to the wind and change
depending on what kind of device someone is attempting to use to view it,
whether or not that device's markup was invented when the post was made? What
if someone is trying to use text that legitimately resembles an HTML tag (it
happens), or, more likely, they're trying to quote or talk about HTML -- no
filter can handle this. No no no no no. Display what they type and don't
confuse the poor souls. I.e., use htmlspecialchars() if outputting to HTML; or
if not, use whatever other escaping method is appropriate to the particular
output format that still preserves the integrity of the user-typed text in that
format, while making exception for the formatting markup that is legitimately
supported and documented to be supported by the forum, such as markdown or
bbcode syntax (and probably not HTML, since besides the fact that HTML is ugly
and over-complicated for most forum post needs, strip_tags with an allowed tags
parameter is the most dangerous of the lot and allows blatant abuse via
attributes).
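To illustrate the escape-on-output approach described above, here is a minimal sketch. The helper name escape_for_html and the sample post are made up for illustration; only htmlspecialchars() and its flags are real PHP. The stored text is left untouched and is escaped only at the moment it is written into an HTML page:

```php
<?php
// Hypothetical helper (not from this report): escape user text at output
// time, instead of deleting parts of it at input time.
function escape_for_html(string $text): string
{
    // ENT_QUOTES also escapes single quotes, so the result is safe inside
    // quoted attribute values; ENT_SUBSTITUTE replaces invalid UTF-8
    // sequences instead of returning an empty string.
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Sample post, assumed for illustration.
$post = 'This is <strong>NOT</strong> good, but 2<5 is true.';

// The browser shows exactly what the user typed, tags and all.
echo escape_for_html($post), "\n";
```

The same stored $post could be passed to a different escaping routine for a different output format, which is the point being made about keeping the original text intact.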
And don't get me started on entities.
tl;dr: no amount of wrapping it in flashy filter functions changes the fact
that strip_tags confuses countless souls, is brain-damaged, and ought to be
deprecated to death.
Previous Comments:
------------------------------------------------------------------------
[2012-05-15 15:06:26] iamcraigcampbell at gmail dot com
@pajoye I agree with you, but there is a use case that encoding will not solve.
Let's say there is a forum where users are posting. If the user posts:
"This is <strong>NOT</strong> good!" and the tags get encoded then that means
the HTML tags will be displayed in the forum as plain text. I think it is more
expected behavior to display this string as "This is NOT good!".
So one option would be encoding the < only if it is not followed by a > but
that is a lot of extra work to figure that out.
At the end of the day the point is that no matter how you look at it I still
think this is a bug.
$string = 'This is true: 2<5';
strip_tags($string); and filter_var($string, FILTER_SANITIZE_STRING);
Should not strip out <5 since that is not an HTML tag.
------------------------------------------------------------------------
[2012-05-15 14:51:09] aleksey dot v dot korzun at gmail dot com
How is stripping anything after a < with a space a valid operation? That seems
like a lazy man's html stripper.
Let's just blindly strip everything that can possibly be made into an html tag
of any sort. Not.
------------------------------------------------------------------------
[2012-05-15 14:49:02] [email protected]
> or < should be encoded then, see
http://www.php.net/manual/en/filter.filters.sanitize.php
btw, any option should be added using the option array or defaults, as is
already the case.
------------------------------------------------------------------------
[2012-05-15 14:45:27] iamcraigcampbell at gmail dot com
So in that case I think strip_tags and filter_var are both broken. In this
context:
"It is true that 5<10"
"It is true that 5 < 10"
Neither of these is an HTML tag, so the string should not be touched regardless
of whether there is a space or not.
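A short script along the following lines could be used to compare the two strings above. It is only a sketch: it calls strip_tags() alone, because FILTER_SANITIZE_STRING is deprecated as of PHP 8.1, and it prints the results rather than asserting what they should be. According to this report, the version without a space loses everything from the "<" onward, while the version with a space is left alone by strip_tags():

```php
<?php
// The two strings quoted in the comment above.
$withoutSpace = 'It is true that 5<10';
$withSpace    = 'It is true that 5 < 10';

// Print what strip_tags() returns for each, next to the original length,
// so any truncation is easy to spot.
foreach ([$withoutSpace, $withSpace] as $input) {
    $output = strip_tags($input);
    printf(
        "input  (%2d chars): %s\noutput (%2d chars): %s\n\n",
        strlen($input),
        $input,
        strlen($output),
        $output
    );
}
```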
------------------------------------------------------------------------
[2012-05-15 14:42:47] reeze dot xia at gmail dot com
PS: the reason why strip_tags() didn't strip it is that '<' is followed by a
space char, not that the ending '>' is missing; this is the key point.
Looking deep into the source code, the difference is a switch that controls
whether or not to treat '<' followed by one or more spaces as a tag.
------------------------------------------------------------------------
The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
https://bugs.php.net/bug.php?id=62032