On Sat, Oct 1, 2022, at 10:39 AM, Kamil Tekiela wrote:
> Hi Internals,
>
> For quite some time now, PHP's sanitize filters have "Rustled My Jimmies".
> These filters bother me because I can't really justify their existence. I
> can understand that a few of them are sensible and may come in handy, but I
> would like to talk about some of these in particular.
>
> In PHP 8.1, we have deprecated FILTER_SANITIZE_STRING which I deemed to be
> a priority due to its confusing name and behaviour. The rest is slightly
> less dangerous, but as was pointed out to me in a recent conversation with
> a PHP developer, these filters are all very confusing.
>
> I would like to have some opinions on the following filters. What do you
> think we should do with them? Deprecate? Fix? Provide better documentation?
>
> ---
>
> *FILTER_SANITIZE_ENCODED* - "URL-encode string, optionally strip or encode
> special characters."
> Now, what does that mean? PHP has two functions for URL encoding: urlencode
> used for encoding query-string parts, and rawurlencode used for encoding
> any other URL part (two different RFCs are followed by these functions).
> Which of these RFCs is applied in this filter? Furthermore, the description
> says that "special characters" can be stripped or encoded. Is one of these
> actions the default and the other can be selected by a flag or are both
> optional? What are these special characters? Are they special in the
> context of a URL? If so, why did we encode them first? If these are HTML
> special characters (there's no single definition of special HTML chars),
> then why does this filter encode them if the filter is for URL
> sanitization? What does backtick have to do with any of this
> (FILTER_FLAG_STRIP_BACKTICK)?
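
A minimal sketch of the ambiguity described above: it runs one input through urlencode(), rawurlencode() and FILTER_SANITIZE_ENCODED so the three results can be compared side by side. The sample input is an assumption chosen for illustration. The output of the filter is deliberately not stated here; run it on your PHP version to see which convention it follows.

```php
<?php
// Hypothetical sample input containing a space.
$input = 'a b';

$query = urlencode($input);                           // query-string style: space becomes "+"
$path  = rawurlencode($input);                        // path style: space becomes "%20"
$filt  = filter_var($input, FILTER_SANITIZE_ENCODED); // the filter under discussion

// Print all three so the differences are visible.
var_dump($query, $path, $filt);
```

Trying further inputs such as '~' or '`' in the same script is a quick way to check how closely the filter tracks either function.
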
>
> *FILTER_SANITIZE_ADD_SLASHES* - "Apply addslashes(). (Available as of PHP
> 7.3.0)"
> This filter was added as a replacement for the magic_quotes filter. According
> to the PHP documentation, addslashes is supposed to be used when injecting PHP
> variables into an eval'd string. Real-life usage has shown that this function
> is used in a lot of places that have nothing to do with PHP's eval. I am not sure
> if the sanitize filter is misused in a similar fashion, but judging from
> the fact that it was meant as a replacement for magic_quotes, my guess is
> that it's very likely still abused.
>
> *FILTER_SANITIZE_EMAIL* - "Remove all characters except letters, digits and
> !#$%&'*+-=?^_`{|}~@.[]."
> Which RFC does this adhere to? It strips slashes and quoted parts, doesn't
> allow IPv6 addresses and doesn't accept RFC 6530 email addresses. This
> filter is ok for simple usage, but it isn't true to any known specification
> AFAIK.
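
A short sketch of the stripping behaviour described above, assuming the allowed-character list quoted from the manual is accurate. The three addresses are hypothetical examples: one with a slash, one with a quoted local part, and one with an IPv6 literal.

```php
<?php
// Hypothetical addresses exercising characters outside the documented allow-list.
$inputs = [
    'a/b@example.com',          // slash in the local part
    '"a b"@example.com',        // quoted local part containing a space
    'user@[IPv6:2001:db8::1]',  // IPv6 address literal
];

$sanitized = [];
foreach ($inputs as $in) {
    // Characters not on the allow-list are silently removed.
    $sanitized[$in] = filter_var($in, FILTER_SANITIZE_EMAIL);
}

var_dump($sanitized);
```

In each case the result is a different address from the one supplied, and nothing signals that characters were dropped.
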
>
> *FILTER_SANITIZE_SPECIAL_CHARS* - "HTML-encode '"<>& and characters with
> ASCII value less than 32, optionally strip or encode other special
> characters."
> What's the intended purpose of this filter? "Special characters" are still
> not clearly defined, but at least the description is clearer than
> the one for FILTER_SANITIZE_ENCODED. Same question about backticks
> though: why? And why encode ASCII <32 chars?
>
> *FILTER_SANITIZE_FULL_SPECIAL_CHARS* - "Equivalent to calling
> htmlspecialchars() with ENT_QUOTES set. Encoding quotes can be disabled by
> setting FILTER_FLAG_NO_ENCODE_QUOTES. Like htmlspecialchars(), this filter
> is aware of the default_charset and if a sequence of bytes is detected that
> makes up an invalid character in the current character set then the entire
> string is rejected resulting in a 0-length string. When using this filter
> as a default filter, see the warning below about setting the default flags
> to 0."
> Not to be mistaken with FILTER_SANITIZE_SPECIAL_CHARS. As long as it's not
> used with filter_input(), it's the least problematic. We
> have htmlspecialchars() though, so how useful is this filter?
>
> *FILTER_UNSAFE_RAW* - What makes it unsafe? Why isn't this just
> called FILTER_RAW_STRING? If the value being filtered is something other
> than a string, what will this filter return? Integers, floats, booleans and
> nulls are converted to a string; arrays and objects make the filter fail.
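
A minimal sketch of the type handling described above: it feeds FILTER_UNSAFE_RAW one value of each type, with no extra flags, and collects the results. The sample values are assumptions picked for illustration.

```php
<?php
// Pass one value of each type through FILTER_UNSAFE_RAW with no flags.
$results = [
    'int'    => filter_var(123, FILTER_UNSAFE_RAW),
    'float'  => filter_var(1.5, FILTER_UNSAFE_RAW),
    'bool'   => filter_var(true, FILTER_UNSAFE_RAW),
    'null'   => filter_var(null, FILTER_UNSAFE_RAW),
    'array'  => filter_var([1, 2], FILTER_UNSAFE_RAW),        // no FILTER_REQUIRE_ARRAY / FILTER_FORCE_ARRAY
    'object' => filter_var(new stdClass(), FILTER_UNSAFE_RAW), // object without __toString()
];

// Scalars come back as strings; the array and the object come back as false.
var_dump($results);
```

Note that "fail" here means a return value of false, which a caller has to tell apart from a successfully filtered string by strict comparison.
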
>
> ---
>
> Let's quickly mention the filter flags.
>
> The FILTER_FLAG_STRIP_LOW flag will also remove tabs, carriage returns and
> newlines, as these all have ASCII codes below 32. When is this useful and
> expected?
>
> The FILTER_FLAG_ENCODE_LOW flag "encodes" ASCII <32 codes presumably into
> HTML entities, although that's not specified anywhere in the PHP manual.
> The word HTML does not appear on the
> https://www.php.net/manual/en/filter.filters.flags.php page. What do these
> characters look like when presented by HTML? When is it ever useful to use
> this flag?
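
A small sketch of the two "low" flags discussed above, applied through FILTER_UNSAFE_RAW to a hypothetical string containing a tab, a carriage return and a newline. The exact encoded form produced by FILTER_FLAG_ENCODE_LOW is left for the reader to inspect rather than asserted here, since the manual does not specify it.

```php
<?php
// Hypothetical input with a tab, a carriage return and a newline.
$input = "a\tb\r\nc";

// STRIP_LOW removes every byte with a value below 32, whitespace included.
$stripped = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW);

// ENCODE_LOW replaces those bytes with an encoded form instead of dropping them.
$encoded = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_ENCODE_LOW);

var_dump($stripped, $encoded);
```

With FILTER_FLAG_STRIP_LOW a multi-line text field collapses into a single run of characters, which is rarely what a form handler wants.
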
>
> FILTER_FLAG_ENCODE_AMP & FILTER_FLAG_STRIP_BACKTICK - why are these even a
> thing?
>
> Due to flags, FILTER_VALIDATE_EMAIL will happily validate email addresses
> that would be otherwise mangled by FILTER_SANITIZE_EMAIL.
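
A hedged sketch of the mismatch mentioned above. It assumes FILTER_FLAG_EMAIL_UNICODE (available since PHP 7.1) is the flag in question and uses a hypothetical address with a non-ASCII local part. Whether validation accepts the address is not asserted here; the point is only to compare it with what the sanitize filter returns.

```php
<?php
// Hypothetical address with a non-ASCII character ("\u{00F6}" is o-umlaut, UTF-8 encoded).
$address = "j\u{00F6}hn@example.com";

// Validation with the Unicode flag: returns the address on success, false on failure.
$validated = filter_var($address, FILTER_VALIDATE_EMAIL, FILTER_FLAG_EMAIL_UNICODE);

// Sanitization works byte by byte against an ASCII allow-list.
$sanitized = filter_var($address, FILTER_SANITIZE_EMAIL);

var_dump($validated, $sanitized);
```

If the validated value comes back as the original address while the sanitized one has lost a character, the two filters disagree about the same input.
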
>
> These are just the things I found confusing and strange about the sanitize
> filters. Let's try to put ourselves in the shoes of an average PHP
> developer trying to comprehend these filters. It's quite easy to shoot
> yourself in the foot if you try to use them. The PHP manual doesn't do a
> good job of explaining them, but that's probably because they are not easy
> to explain. I can't come up with good examples of when they should be used.
>
> Regards,
> Kamil

The filter extension has always been a stillborn mess. Its API is an absolute
disaster and, as you note, its functionality is unclear at best, misleading at
worst. Frankly it's worse than SPL.

I'd be entirely on board with jettisoning the entire thing, but barring that,
ripping out large swaths of it that are misleading suits me fine.

--Larry Garfield

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
