Re: [PHP-DEV] Sanitize filters

2022-10-11 Thread Derick Rethans
Hi all,

On Sat, 1 Oct 2022, Kamil Tekiela wrote:

> For quite some time now, PHP's sanitize filters have "Rustled My 
> Jimmies". These filters bother me because I can't really justify their 
> existence. I can understand that a few of them are sensible and may 
> come in handy, but I would like to talk about some of these in 
> particular.

I want to provide some context to why we have ext/filter, and why the 
filters that we currently have exist. At the time when we introduced 
ext/filter (which I mostly wrote), we were beholden to the scourge of 
"magic quotes".

In order for PHP to allow for a safer acceptance of input variables into 
a script, we added the ext/filter API to do so. The filters and 
sanitisers that we added were at that moment reasonable to add, and also 
likely to be used. We did punt on a few, and I am sure we made some 
'interesting' decisions.

For example, the e-mail validator was not designed to allow for what the 
full spec allowed, but instead for what we thought would be input by 
reasonable people.

The sanitising filters were added to get a rough, but reasonable filter 
to make data safe for specific contexts. 

Some of them were added so that people could easily upgrade, by for 
example setting the default filter to "magic_quotes" (or "add_slashes"). 
They're probably less useful *now*, but that doesn't detract from the 
fact that they might still be in use. 

I do believe we need to be better at promoting ext/filter's *good use*, 
of which there are plenty of cases. And evaluating how to *improve* 
(and not *remove*) filters and sanitisers would be useful too.

Do you have specific suggestions towards that?

cheers,
Derick

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php



Re: [PHP-DEV] Sanitize filters

2022-10-11 Thread Thomas Hruska

On 10/6/2022 1:19 AM, Rowan Tommins wrote:
> On 05/10/2022 22:35, David Gebler wrote:
> > There are multiple RFC standards for email address format but AFAIK
> > PHP's FILTER_SANITIZE_EMAIL doesn't conform to any of them.
>
> FILTER_SANITIZE_EMAIL is a very short list of characters which claims to
> be based on RFC 822 section 6:
> https://heap.space/xref/php-src/ext/filter/sanitizing_filters.c?r=4df3dd76#295
>
> FILTER_VALIDATE_EMAIL doesn't say exactly which standard it's attempting
> to adhere to; it's one of many long unreadable regexes I've seen online
> claiming to cover all possible addresses. (Actually, there are now two
> regexes there, because there's a different version to support
> FILTER_FLAG_EMAIL_UNICODE).
> https://heap.space/xref/php-src/ext/filter/logical_filters.c?r=d8fc05c0#651
>
> > The idea behind my suggestion for something like is_valid_email
> > (whatever it might be named) is as a step towards deprecating and
> > removing the entire existing filter API, which I think many of us
> > agree is a mess.
>
> You described FILTER_VALIDATE_EMAIL as "notorious for being next to
> useless"; that gives us two possibilities:
>
> a) A new function will be just as useless, because it will be based on
> the same implementation
>
> b) There is a better implementation out there, which we should start
> using in ext/filter right now


For (b), well, there is always the option of handling email addresses 
the way the IETF intended instead of using regexes.


For example, SMTP::MakeValidEmailAddress() from:

https://github.com/cubiclesoft/ultimate-email

Does three things quite differently from ext/filter:

1)  It uses a custom state engine to implement half of the relevant IETF 
EBNF grammars and then cheats for the other half.  The very complex 
specifications that the IETF (and W3C) produces should generally be 
implemented as custom state engines (finite state machines or FSMs) in 
software.  A custom state engine can correctly identify certain common 
input errors and both transparently and correctly fix those errors in 
very specific instances as it processes the input (e.g. gmail,com -> 
gmail.com happens often).  State engines can also accurately and 
correctly do things such as remove CFWS (comments and folding 
whitespace) from email addresses, which are not necessary components of 
an email address and CFWS causes all kinds of issues.  State engines, 
when done right, can even outperform all other functional 
implementations.  State engines can also read partial input and maintain 
their internal state while using few resources to process very large 
inputs (not particularly relevant in this case).  The current 
regex-based approach in ext/filter is obviously causing some problems 
that can probably be fixed by using a custom state engine.


Important caveat:  Custom state engines do run the risk of winding up in 
an infinite loop when forgetting to properly transition between states 
or forgetting to move pointers through the input, resulting in DoS 
issues.  Been there, done that - they are both very easy things to do.


2)  It parses email addresses in reverse:  Domain part first, local part 
second.  The EBNF grammars for the domain part are simpler and less 
contentious than the grammars for the local part.  Also, IIRC, the 
domain portion can't contain '@' while the local portion can - it's been 
a while since I looked at the specs though.


3)  It considers sanitization and validation as being the same function. 
 There is no separate SMTP::IsValidEmailAddress() in the library 
because there is no need for one.  If MakeValidEmailAddress() can't turn 
an input into a valid email address string, it returns an error.  If the 
returned email address is not the same as the one that was input, the 
original address can be viewed as technically "invalid."  One shared 
internal function for both FILTER_SANITIZE_EMAIL and 
FILTER_VALIDATE_EMAIL would produce consistent output/results.
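
To make points 2 and 3 a little more concrete, here is a minimal PHP sketch of one function acting as both sanitiser and validator. The names make_valid_email() and is_valid_email() and all of the rules in them are hypothetical and deliberately simplistic: this is not the code from the library above, and it is not a real state engine. It only illustrates the domain-first split and the "valid if sanitising changed nothing" idea.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns a cleaned-up address, or null when the input
 * cannot be turned into something address-shaped. Rules are illustrative only.
 */
function make_valid_email(string $input): ?string
{
    $input = trim($input);

    // Domain first: split at the LAST '@', on the assumption (see point 2)
    // that the domain cannot contain '@' while the local part can.
    $at = strrpos($input, '@');
    if ($at === false || $at === 0 || $at === strlen($input) - 1) {
        return null;
    }
    $local  = substr($input, 0, $at);
    $domain = substr($input, $at + 1);

    // Repair one common slip mentioned above: "gmail,com" -> "gmail.com".
    $domain = str_replace(',', '.', $domain);

    // Very rough domain check: at least two dot-separated labels made of
    // letters, digits and inner hyphens.
    $label = '[a-z0-9](?:[a-z0-9-]*[a-z0-9])?';
    if (preg_match('/^' . $label . '(?:\.' . $label . ')+$/iD', $domain) !== 1) {
        return null;
    }

    // Very rough local-part check: no whitespace or control characters.
    if (preg_match('/[\s\x00-\x1F\x7F]/', $local) === 1) {
        return null;
    }

    return $local . '@' . $domain;
}

/** Validation is simply "sanitising changed nothing". */
function is_valid_email(string $input): bool
{
    return make_valid_email($input) === $input;
}

$fixed = make_valid_email(' someone@gmail,com ');
```

With this sketch, the padded "gmail,com" input comes back as a repaired string, and is_valid_email() reports the original as invalid because the two differ.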



Other thoughts:  I'm aware that a regex is effectively defining a state 
engine as a compact string.  However, as evidenced by the two Perl CPAN 
regexes for email addresses currently in use, regexes are limited in 
utility/function and are somewhat inflexible, get more difficult to read 
and comprehend once they get longer than a few dozen bytes, and can't 
readily correct errors or other problems in complex input strings.  The 
~250 lines of userland code referenced above is also not perfect (e.g. 
extracting characters using substr() is rather inefficient) but it works 
well enough.  The userland code also performs a DNS MX record check by 
default, but that is its own complex can of worms and was probably not 
the best idea I've ever had.  However, the three main concepts are the 
important takeaways here, not the referenced userland code.



> My gut feel is that (a) is true, and there is no point considering what
> a new function would be called, because we don't know how to implement it.


Perhaps the above will help to at least provide some 

Re: [PHP-DEV] Sanitize filters

2022-10-08 Thread Yasuo Ohgaki
Kamil Tekiela wrote:

> These are just the things I found confusing and strange about the sanitize
> filters. Let's try to put ourselves in the shoes of an average PHP
> developer trying to comprehend these filters. It's quite easy to shoot
> yourself in the foot if you try to use them. The PHP manual doesn't do a
> good job of explaining them, but that's probably because they are not easy
> to explain. I can't come up with good examples of when they should be used.
>

I agree there are many confusing names/features/behaviors.
IMO, input validation and output sanitization should be 2 different
features.

https://wiki.sei.cmu.edu/confluence/display/seccode/Top+10+Secure+Coding+Practices

Input validation is the 1st secure coding principle, for input data
handling. Output sanitization is the 7th secure coding principle, for
output data handling. The filter module mixes these up.
(And input validation should validate, not sanitize, the input. Otherwise,
the web app is not OWASP Top 10 compliant; i.e., OWASP Top 10 A09:2021
requires detecting DAST attacks.)
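
As a rough illustration of keeping the two concerns apart, here is a small PHP sketch. The function names read_age() and html() are made up for this example, as is the 0 to 150 range; the filter_var() and htmlspecialchars() calls are the standard ones.

```php
<?php
declare(strict_types=1);

// Input side: validate and reject. The value is never silently "fixed".
function read_age(string $raw): int
{
    $age = filter_var($raw, FILTER_VALIDATE_INT, [
        'options' => ['min_range' => 0, 'max_range' => 150],
    ]);
    if ($age === false) {
        throw new InvalidArgumentException('age must be an integer between 0 and 150');
    }
    return $age;
}

// Output side: escape for the context being written to (HTML here).
function html(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

$age  = read_age('42');
$cell = '<td>' . html('Tom & "Jerry"') . '</td>';
```

A rejected input surfaces as an exception that can be logged, while escaping happens only where the value is written out.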

I wrote the input validation part years ago, if anyone is interested.
https://github.com/yohgaki/validate-php (Obsolete  C module. Do not use)
https://github.com/yohgaki/validate-php-scr (PHP library)

--
Yasuo Ohgaki
yohg...@ohgaki.net


Re: [PHP-DEV] Sanitize filters

2022-10-06 Thread Rowan Tommins

On 06/10/2022 14:44, Claude Pache wrote:
> While it may be difficult to validate an email according to some
> IETF RFC, the HTML standard has pragmatically adopted a pattern
> (used to validate `<input type="email">` fields) that is both readable
> and suitable for most practical purposes. See:
>
> https://html.spec.whatwg.org/multipage/input.html#valid-e-mail-address



Well, it would be a more clearly documented source than the current 
implementation, although the spec admits it's "wilfully" not following 
e-mail standards. I'd be happy to see it committed, maybe for PHP 8.3.


I note that it doesn't support internationalized addresses in their 
Unicode form, though, so it won't do for FILTER_FLAG_EMAIL_UNICODE.
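
For reference, a userland check along those lines could look like the PHP sketch below. The pattern is my transcription of the one in the linked section of the HTML standard, so please check it against the spec before relying on it; the constant and function names are made up for this example.

```php
<?php
declare(strict_types=1);

// Transcribed from the "valid e-mail address" section of the HTML standard.
// The D modifier stops "$" from matching before a trailing newline.
const HTML_EMAIL_PATTERN =
    '/^[a-zA-Z0-9.!#$%&\'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?'
    . '(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/D';

// Hypothetical helper name, for illustration only.
function html_email_is_valid(string $address): bool
{
    return preg_match(HTML_EMAIL_PATTERN, $address) === 1;
}

$plain   = html_email_is_valid('user+tag@example.co.uk');
$unicode = html_email_is_valid('jörg@example.com');
$noAt    = html_email_is_valid('no-at-sign');
```

The ASCII-only character classes are why the address with a non-ASCII local part is rejected here.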


Regards,

--
Rowan Tommins
[IMSoP]




Re: [PHP-DEV] Sanitize filters

2022-10-06 Thread Claude Pache


> On 6 Oct 2022, at 10:19, Rowan Tommins wrote:
> 
> You described FILTER_VALIDATE_EMAIL as "notorious for being next to useless"; 
> that gives us two possibilities:
> 
> a) A new function will be just as useless, because it will be based on the 
> same implementation
> b) There is a better implementation out there, which we should start using in 
> ext/filter right now
> 
> My gut feel is that (a) is true, and there is no point considering what a new 
> function would be called, because we don't know how to implement it.

Hi,

While it may be difficult to validate an email according to some IETF RFC, 
the HTML standard has pragmatically adopted a pattern (used to validate 
`<input type="email">` fields) that is both readable and suitable for most 
practical purposes. See:

https://html.spec.whatwg.org/multipage/input.html#valid-e-mail-address 


—Claude

Re: [PHP-DEV] Sanitize filters

2022-10-06 Thread Rowan Tommins

On 05/10/2022 22:35, David Gebler wrote:
> There are multiple RFC standards for email address format but AFAIK
> PHP's FILTER_SANITIZE_EMAIL doesn't conform to any of them.



FILTER_SANITIZE_EMAIL is a very short list of characters which claims to 
be based on RFC 822 section 6: 
https://heap.space/xref/php-src/ext/filter/sanitizing_filters.c?r=4df3dd76#295


FILTER_VALIDATE_EMAIL doesn't say exactly which standard it's attempting 
to adhere to; it's one of many long unreadable regexes I've seen online 
claiming to cover all possible addresses. (Actually, there are now two 
regexes there, because there's a different version to support 
FILTER_FLAG_EMAIL_UNICODE). 
https://heap.space/xref/php-src/ext/filter/logical_filters.c?r=d8fc05c0#651



> The idea behind my suggestion for something like is_valid_email
> (whatever it might be named) is as a step towards deprecating and
> removing the entire existing filter API, which I think many of us
> agree is a mess.



You described FILTER_VALIDATE_EMAIL as "notorious for being next to 
useless"; that gives us two possibilities:


a) A new function will be just as useless, because it will be based on 
the same implementation
b) There is a better implementation out there, which we should start 
using in ext/filter right now


My gut feel is that (a) is true, and there is no point considering what 
a new function would be called, because we don't know how to implement it.



Regards,

--
Rowan Tommins
[IMSoP]




Re: [PHP-DEV] Sanitize filters

2022-10-05 Thread David Gebler
On Tue, Oct 4, 2022 at 11:34 AM Rowan Tommins wrote:

> The "notorious" thing I know is that validating e-mail addresses is next
> to impossible because of multiple overlapping standards, and a huge
> number of esoteric variations that might or might not actually be
> deliverable in practice. If you think the implementation can be
> improved, that doesn't need a new is_valid_email() function, just a
> tested and documented patch to the existing one; if it can't be
> improved, then any new function will be just as useless
>

There are multiple RFC standards for email address format but AFAIK PHP's
FILTER_SANITIZE_EMAIL doesn't conform to any of them.

The idea behind my suggestion for something like is_valid_email (whatever
it might be named) is as a step towards deprecating and removing the entire
existing filter API, which I think many of us agree is a mess. As you said
below "it's trying to be everything to everyone, and ends up with a
bewildering set of options" - a rewrite or replacement which also tries to
be everything to everyone won't solve that problem, but getting rid of it
entirely will.

That said, the nature of PHP as a web-first language means it's reasonable
to include some individual, smaller, better APIs for certain validations or
sanitizations on types of data which are very commonly encountered in HTTP
requests. Examples include strings we expect or want to be valid integers,
decimals, email addresses and URLs. I think these features should remain,
but I'd happily see them even as a set of new, individual core functions if
it meant binning off filter_var and filter_input in PHP 9.

Regardless, look - I don't want to derail here - if most people are happy
with just deprecating some of the crappier and more confusing sanitize
filters and leave it at that, I say great, go for it, it's still an
improvement. I'm just saying if someone's going to take the time to look at
that problem space, why not go more than half the distance and reconsider
the fundamental approach of something we all know is pretty sucky anyway?

Just food for thought.


Re: [PHP-DEV] Sanitize filters

2022-10-04 Thread Claude Pache
Hi,

> FILTER_SANITIZE_ENCODED
> FILTER_SANITIZE_SPECIAL_CHARS

See https://www.php.net/manual/en/function.filter-input.php (Example #1) for an 
example of use. Apparently, “escaping” is considered part of “sanitizing”?

If you want to educate your users, you could consider deprecating them in favor 
of FILTER_DEFAULT followed by `urlencode()` or `htmlspecialchars()` respectively. 
Ditto for various other FILTER_SANITIZE_* filters.
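
A minimal sketch of what that replacement could look like in user code. filter_var() stands in here for filter_input(INPUT_GET, ...), which cannot be run outside a web request, and the search.php URL is made up for the example.

```php
<?php
declare(strict_types=1);

// FILTER_DEFAULT is FILTER_UNSAFE_RAW: the string is returned unchanged.
$raw = filter_var('Tom & "Jerry" <3', FILTER_DEFAULT);

// Escape at the point of use, for the context in question.
$forHtml  = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
$forQuery = 'search.php?q=' . urlencode($raw);

echo $forHtml, "\n";   // Tom &amp; &quot;Jerry&quot; &lt;3
echo $forQuery, "\n";  // search.php?q=Tom+%26+%22Jerry%22+%3C3
```

The raw value stays available for other contexts, and each escaping step is visible next to the output it protects.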

> FILTER_UNSAFE_RAW

My wild guess is that “unsafe” means that “it is dangerous to use the result in 
random contexts (i.e., without properly escaping it, because we assume that you 
don’t even know what “escape” means). Use FILTER_SANITIZE_ENCODED, 
FILTER_SANITIZE_SPECIAL_CHARS and/or FILTER_SANITIZE_MAGIC_QUOTES if you want 
to be safe” (for some nonstandard definition of “safe”). Of course, it should 
be renamed, because “safety” may be achieved by alternative means.

—Claude



Re: [PHP-DEV] Sanitize filters

2022-10-04 Thread Rowan Tommins

On 04/10/2022 01:38, David Gebler wrote:
> What about FILTER_VALIDATE_EMAIL which is notorious for being next to
> useless?
> [...]
> Seems to me like there could at the very least be a plausible case for some
> better [...] is_valid_email() etc. type functions in core
> to replace some of the filter API.



The "notorious" thing I know is that validating e-mail addresses is next 
to impossible because of multiple overlapping standards, and a huge 
number of esoteric variations that might or might not actually be 
deliverable in practice. If you think the implementation can be 
improved, that doesn't need a new is_valid_email() function, just a 
tested and documented patch to the existing one; if it can't be 
improved, then any new function will be just as useless.


In practice, the most common typos don't result in invalid e-mail 
addresses anyway, just incorrect ones - "gamil.com" instead of 
"gmail.com", and so on. For those, you don't need to Validate or 
Sanitize; you need to Escape and Verify: escape what you're given 
(context-dependent, so necessarily part of an SMTP or API client 
library), attempt to send an e-mail, and wait for the user to verify 
they've received it.




On 04/10/2022 02:29, Vasilii Shpilchin wrote:
> filter_input() is the only alternative to accessing superglobal arrays
> directly.
>
> [...]
>
> FILTER_SANITIZE_EMAIL - helps to clean up typical mess caused by
> copy-pasting an email.
> FILTER_SANITIZE_URL - similar thing but to URLs.
> FILTER_SANITIZE_NUMBER_FLOAT - nice since it provides a flag to control
> scientific notation


None of these sounds very useful to me, but I think that just confirms 
the biggest problem with the extension: it's trying to be everything to 
everyone, and ends up with a bewildering set of options as a result. I 
don't think any rewrite or replacement can ever avoid that problem, 
because it's inherent in the problem space.


I have a draft proposal I might share soon for some "strict cast" 
functions, but even simple cases like "string to integer" could have a 
dozen different implementations which would all be equally "valid" 
according to some use case or opinion, so it's a bit of a quagmire.
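
To show what I mean, here is a small PHP sketch comparing three existing ways of answering "is this string an integer?". It only uses built-ins; the list of inputs is arbitrary, and the cast result for "1e3" assumes PHP 7.1 or later.

```php
<?php
declare(strict_types=1);

$inputs = ['12', ' 12 ', '12abc', '012', '1e3'];

$results = [];
foreach ($inputs as $s) {
    $results[$s] = [
        // Lenient: converts a leading numeric prefix and ignores the rest.
        'cast'   => (int) $s,
        // Strict-ish: int on success, false on failure; trims whitespace.
        'filter' => filter_var($s, FILTER_VALIDATE_INT),
        // True only for non-empty strings made of the digits 0-9.
        'digits' => ctype_digit($s),
    ];
}

var_dump($results);
```

The three approaches already disagree on inputs such as " 12 ", "012" and "1e3", and each answer is defensible for some use case.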


Regards,

--
Rowan Tommins
[IMSoP]




Re: [PHP-DEV] Sanitize filters

2022-10-03 Thread Vasilii Shpilchin
I believe we are still discussing the sanitizing filters only. No
doubt the filter API in general should be kept in the core, as it provides
functional access to input variables through the filter_input() function.
filter_input() is the only alternative to accessing superglobal arrays
directly. I prefer to use it rather than userland helpers and facades,
which may all work differently from each other. I can't believe anyone would
want to go back to superglobal arrays when coding without a framework in
PHP 8.3. The set of sanitizing filters is not perfect; however,
some filters are great and useful:

FILTER_SANITIZE_EMAIL - helps to clean up the typical mess caused by
copy-pasting an email address.
FILTER_SANITIZE_URL - a similar thing, but for URLs.
FILTER_SANITIZE_NUMBER_FLOAT - nice, since it provides a flag to control
scientific notation. (Did you know that is_float("1e1") is false but
is_float(1e1) is true? You always get a string from input variables,
and there is no other way to handle this case without weird manipulations
on a string.)
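
A short PHP sketch of the string-versus-float behaviour, using only built-ins. It also shows is_numeric() and FILTER_VALIDATE_FLOAT next to is_float(), as two other built-in checks that look at the contents of a string.

```php
<?php
declare(strict_types=1);

$fromRequest = '1e1'; // request values always arrive as strings

$typeOfString = is_float($fromRequest); // checks the PHP type, not the contents
$typeOfFloat  = is_float(1e1);
$looksNumeric = is_numeric($fromRequest);
$validated    = filter_var($fromRequest, FILTER_VALIDATE_FLOAT);

var_dump($typeOfString, $typeOfFloat, $looksNumeric, $validated);
// bool(false), bool(true), bool(true), float(10)
```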

The purpose of some filters, like FILTER_SANITIZE_STRING, is difficult to
grasp, I agree, but the idea of solving common edge cases with built-in,
high-quality functionality is great. PHP is a language for the Web and should
take the web context into account.



On Mon, Oct 3, 2022 at 8:38 PM David Gebler  wrote:

> On Mon, Oct 3, 2022 at 11:29 AM Max Semenik  wrote:
>
> >
> > Is there a compelling need to have this in the core, as opposed to
> > Composer packages? The ecosystem has changed a lot since the original
> > function was introduced.
> >
>
> I don't know that there is, I suspect the answer is probably not and
> sanitization and validation is probably better left to userland code. The
> only argument I can offer as devil's advocate is that certain validations
> or transformations will be faster in core than in library scripts. I would
> wager the most common implementation of such userland libraries today are
> heavily reliant on preg_* functions so having some fast, low level baseline
> in core for common tasks in this category might still make sense.
>
> While we're on the topic, can I bring up FILTER_SANITIZE_NUMBER_FLOAT? Why
> is the default behaviour of FILTER_SANITIZE_NUMBER_FLOAT the same as
> FILTER_SANITIZE_NUMBER_INT unless you add extra flags to permit fractions?
> Why is the constant name FILTER_SANITIZE_NUMBER_FLOAT but its counterpart
> for validation is FILTER_VALIDATE_FLOAT (no NUMBER_)? Why does validating a
> float return a float but sanitizing a float return a string?
>
> What about FILTER_VALIDATE_EMAIL which is notorious for being next to
> useless?
>
> Seems to me like there could at the very least be a plausible case for some
> better to_float(), to_int(), is_valid_email() etc. type functions in core
> to replace some of the filter API.
>


Re: [PHP-DEV] Sanitize filters

2022-10-03 Thread David Gebler
On Mon, Oct 3, 2022 at 11:29 AM Max Semenik  wrote:

>
> Is there a compelling need to have this in the core, as opposed to
> Composer packages? The ecosystem has changed a lot since the original
> function was introduced.
>

I don't know that there is, I suspect the answer is probably not and
sanitization and validation is probably better left to userland code. The
only argument I can offer as devil's advocate is that certain validations
or transformations will be faster in core than in library scripts. I would
wager the most common implementation of such userland libraries today are
heavily reliant on preg_* functions so having some fast, low level baseline
in core for common tasks in this category might still make sense.

While we're on the topic, can I bring up FILTER_SANITIZE_NUMBER_FLOAT? Why
is the default behaviour of FILTER_SANITIZE_NUMBER_FLOAT the same as
FILTER_SANITIZE_NUMBER_INT unless you add extra flags to permit fractions?
Why is the constant name FILTER_SANITIZE_NUMBER_FLOAT but its counterpart
for validation is FILTER_VALIDATE_FLOAT (no NUMBER_)? Why does validating a
float return a float but sanitizing a float return a string?
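
For anyone who hasn't run into this, a quick PHP sketch of the behaviour I'm describing. It uses only the documented constants and flags; the sample input is arbitrary.

```php
<?php
declare(strict_types=1);

$input = '1.5e3 USD';

// Without flags, the "float" sanitiser keeps only digits, "+" and "-".
$default = filter_var($input, FILTER_SANITIZE_NUMBER_FLOAT);
$asInt   = filter_var($input, FILTER_SANITIZE_NUMBER_INT);

// Fractions and exponents have to be allowed explicitly.
$withFlags = filter_var(
    $input,
    FILTER_SANITIZE_NUMBER_FLOAT,
    FILTER_FLAG_ALLOW_FRACTION | FILTER_FLAG_ALLOW_SCIENTIFIC
);

// The validating counterpart returns a float, not a string.
$validated = filter_var('1.5e3', FILTER_VALIDATE_FLOAT);

var_dump($default, $asInt, $withFlags, $validated);
// string(3) "153", string(3) "153", string(5) "1.5e3", float(1500)
```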

What about FILTER_VALIDATE_EMAIL which is notorious for being next to
useless?

Seems to me like there could at the very least be a plausible case for some
better to_float(), to_int(), is_valid_email() etc. type functions in core
to replace some of the filter API.


Re: [PHP-DEV] Sanitize filters

2022-10-03 Thread juan carlos morales
My 2 cents on this.

We should keep what is web-related, IMO. It does not make any sense to
take things out that everyone will later write on their own, or end up
using a 3rd-party package for.

PHP should ship with what is web-related, ready to use.

The naming, the implementation code, etc. are a separate matter.

An RFC documenting each case would be very helpful, to centralize the
ideas on each case, instead of scrolling through the mailing list.

Cheers.




Re: [PHP-DEV] Sanitize filters

2022-10-03 Thread Rowan Tommins
On 3 October 2022 11:29:40 BST, Max Semenik  wrote:
>Mon, 3 Oct 2022, 03:18 David Gebler:
>
>> At a glance, I think all the examples mentioned in this thread have better
>> existing alternatives already in core and could just be deprecated then
>> removed. But it's worth asking, is that what we're talking about here, or
>> is there a suggestion of replacing the filter API with a more modern,
>> object API?
>>
>
>Is there a compelling need to have this in the core, as opposed to Composer
>packages? The ecosystem has changed a lot since the original function was
>introduced.


Quite the opposite, in my opinion - there are compelling reasons *not* to have 
this in core.

It turns out that making a universal validation and sanitisation library is 
really hard, and breaking changes and diverging needs are pretty much 
guaranteed. That's pretty much the worst case for something distributed with 
the language, and exactly what Composer excels at.

The only thing that does belong in core are narrowly targeted low-level 
functions that someone might use to build such a library. Certainly not some 
huge OO monster reimplementing the whole of ext/filter and making a whole bunch 
of new mistakes.

Regards,

-- 
Rowan Tommins
[IMSoP]




Re: [PHP-DEV] Sanitize filters

2022-10-03 Thread Max Semenik
Mon, 3 Oct 2022, 03:18 David Gebler:

> At a glance, I think all the examples mentioned in this thread have better
> existing alternatives already in core and could just be deprecated then
> removed. But it's worth asking, is that what we're talking about here, or
> is there a suggestion of replacing the filter API with a more modern,
> object API?
>

Is there a compelling need to have this in the core, as opposed to Composer
packages? The ecosystem has changed a lot since the original function was
introduced.



Re: [PHP-DEV] Sanitize filters

2022-10-02 Thread David Gebler
On Sun, Oct 2, 2022 at 4:10 PM Larry Garfield wrote:

> The filter extension has always been a stillborn mess.  Its API is an
> absolute disaster and, as you note, its functionality is unclear at best,
> misleading at worst.  Frankly it's worse than SPL.
>
> I'd be entirely on board with jettisoning the entire thing, but barring
> that, ripping out large swaths of it that are misleading suits me fine.
>
>
The whole thing is seriously grim. Take the documentation for
filter_var, for example, and look at what it says for the third parameter,
$options:

> Associative array of options or bitwise disjunction of flags. If filter
> accepts options, flags can be provided in "flags" field of array. For the
> "callback" filter, callable type should be passed.
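
Spelled out as code, that one parameter takes at least these shapes. This is a small PHP sketch with built-in filters only; the sample values are arbitrary.

```php
<?php
declare(strict_types=1);

// 1. A bare integer: a bitwise OR of flags.
$hex = filter_var('0x1A', FILTER_VALIDATE_INT, FILTER_FLAG_ALLOW_HEX);

// 2. An array with an "options" key holding filter-specific options.
$ranged = filter_var('5', FILTER_VALIDATE_INT, [
    'options' => ['min_range' => 1, 'max_range' => 10],
]);

// 3. An array carrying both "flags" and "options".
$both = filter_var('0x1A', FILTER_VALIDATE_INT, [
    'flags'   => FILTER_FLAG_ALLOW_HEX,
    'options' => ['max_range' => 100],
]);

// 4. For FILTER_CALLBACK, "options" is the callable itself.
$upper = filter_var('abc', FILTER_CALLBACK, ['options' => 'strtoupper']);

var_dump($hex, $ranged, $both, $upper);
// int(26), int(5), int(26), string(3) "ABC"
```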

At a glance, I think all the examples mentioned in this thread have better
existing alternatives already in core and could just be deprecated then
removed. But it's worth asking, is that what we're talking about here, or
is there a suggestion of replacing the filter API with a more modern,
object API?


Re: [PHP-DEV] Sanitize filters

2022-10-02 Thread Hans Henrik Bergan
FILTER_SANITIZE_EMAIL should burn. If you have a bad email address, I can't
imagine the correct solution is to remove characters until it becomes
valid, short of a trim().
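
A quick PHP sketch of why: the sanitiser quietly turns a mistyped address into a different one that then passes validation. The sample addresses are made up.

```php
<?php
declare(strict_types=1);

$typed = 'john(doe)@exa mple.com';

// Validation correctly refuses what the user typed.
$validRaw = filter_var($typed, FILTER_VALIDATE_EMAIL);

// Sanitising strips the parentheses and the space.
$sanitised = filter_var($typed, FILTER_SANITIZE_EMAIL);

// The result now validates, but nobody knows whether it is the
// address the user meant.
$validSanitised = filter_var($sanitised, FILTER_VALIDATE_EMAIL);

// By contrast, trim() only removes surrounding whitespace.
$trimmed = trim("  john@example.com\n");

var_dump($validRaw, $sanitised, $validSanitised, $trimmed);
```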

On Sun, Oct 2, 2022, 17:10 Larry Garfield  wrote:

> On Sat, Oct 1, 2022, at 10:39 AM, Kamil Tekiela wrote:
> > Hi Internals,
> >
> > For quite some time now, PHP's sanitize filters have "Rustled My
> Jimmies".
> > These filters bother me because I can't really justify their existence. I
> > can understand that a few of them are sensible and may come in handy,
> but I
> > would like to talk about some of these in particular.
> >
> > In PHP 8.1, we have deprecated FILTER_SANITIZE_STRING which I deemed to
> be
> > a priority due to its confusing name and behaviour. The rest is slightly
> > less dangerous, but as was pointed out to me in a recent conversation
> with
> > a PHP developer, these filters are all very confusing.
> >
> > I would like to have some opinions on the following filters. What do you
> > think we should do with them? Deprecate? Fix? Provide better
> documentation?
> >
> > ---
> >
> > *FILTER_SANITIZE_ENCODED *- "URL-encode string, optionally strip or
> encode
> > special characters."
> > Now, what does that mean? PHP has two functions for URL encoding:
> urlencode
> > used for encoding query-string parts, and rawurlencode used for encoding
> > any other URL part (two different RFCs are followed by these functions).
> > Which of these RFCs is applied in this filter? Furthermore, the
> description
> > says that "special characters" can be stripped or encoded. Is one of
> these
> > actions the default and the other can be selected by a flag or are both
> > optional? What are these special characters? Are they special in the
> > context of URL? If so, why did we encode them first? If these are HTML
> > special characters (there's no single definition of special HTML chars),
> > then why does this filter encode them if the filter is for URL
> > sanitization? What does backtick have to do with any of this
> > (FILTER_FLAG_STRIP_BACKTICK)?
> >
> > *FILTER_SANITIZE_ADD_SLASHES - "*Apply addslashes(). (Available as of PHP
> > 7.3.0)"
> > This filter was added as a replacement for magic_quotes filter. According
> > to PHP documentation, addslashes is supposed to be used when injecting
> PHP
> > variables into eval'd string. Real-life showed that this function is used
> > in a lot of places that have nothing to do with PHP's eval. I am not sure
> > if the sanitize filter is misused in a similar fashion, but judging from
> > the fact that it was meant as a replacement for magic_quotes, my guess is
> > that it's very likely still abused.
> >
> > *FILTER_SANITIZE_EMAIL *- "Remove all characters except letters, digits
> and
> > !#$%&'*+-=?^_`{|}~@.[]."
> > Which RFC does this adhere to? It strips slashes and quoted parts,
> doesn't
> > allow IPv6 addresses and doesn't accept RFC 6530 email addresses. This
> > filter is ok for simple usage, but it isn't true to any known
> specification
> > AFAIK.
> >
> > *FILTER_SANITIZE_SPECIAL_CHARS *- "HTML-encode '"<>& and characters with
> > ASCII value less than 32, optionally strip or encode other special
> > characters."
> > What's the intended purpose of this filter? "Special characters" are
> still
> > not clearly defined, but at least it's more clear than
> > the FILTER_SANITIZE_ENCODED description. Same question about backticks
> > though: why? Why encode ASCII <32 chars?
> >
> > *FILTER_SANITIZE_FULL_SPECIAL_CHARS *- "Equivalent to calling
> > htmlspecialchars() with ENT_QUOTES set. Encoding quotes can be disabled
> by
> > setting FILTER_FLAG_NO_ENCODE_QUOTES. Like htmlspecialchars(), this
> filter
> > is aware of the default_charset and if a sequence of bytes is detected
> that
> > makes up an invalid character in the current character set then the
> entire
> > string is rejected resulting in a 0-length string. When using this filter
> > as a default filter, see the warning below about setting the default
> flags
> > to 0."
> > Not to be mistaken with FILTER_SANITIZE_SPECIAL_CHARS. As long as it's
> not
> > used with filter_input(), it's the least problematic. We
> > have htmlspecialchars() though, so how useful is this filter?
> >
> > *FILTER_UNSAFE_RAW *- What makes it unsafe? Why isn't this just
> > called FILTER_RAW_STRING? If the value being filtered is something other
> > than a string, what will this filter return? Integers, floats, booleans
> and
> > nulls are converted to a string, Arrays and objects make the filter fail.
> >
> > ---
> >
> > Let's quickly mention the filter flags.
> >
> > The FILTER_FLAG_STRIP_LOW flag will also remove tabs, carriage returns
> and
> > newlines as these are all less than 32 ASCII codes. When is this useful
> and
> > expected?
> >
> > The FILTER_FLAG_ENCODE_LOW flag "encodes" ASCII <32 codes presumably into
> > HTML entities, although that's not specified anywhere in the PHP manual.
> > The word HTML does not appear on the
> > 

Re: [PHP-DEV] Sanitize filters

2022-10-02 Thread Larry Garfield
On Sat, Oct 1, 2022, at 10:39 AM, Kamil Tekiela wrote:
> Hi Internals,
>
> For quite some time now, PHP's sanitize filters have "Rustled My Jimmies".
> These filters bother me because I can't really justify their existence. I
> can understand that a few of them are sensible and may come in handy, but I
> would like to talk about some of these in particular.
>
> In PHP 8.1, we have deprecated FILTER_SANITIZE_STRING which I deemed to be
> a priority due to its confusing name and behaviour. The rest is slightly
> less dangerous, but as was pointed out to me in a recent conversation with
> a PHP developer, these filters are all very confusing.
>
> I would like to have some opinions on the following filters. What do you
> think we should do with them? Deprecate? Fix? Provide better documentation?
>
> ---
>
> *FILTER_SANITIZE_ENCODED *- "URL-encode string, optionally strip or encode
> special characters."
> Now, what does that mean? PHP has two functions for URL encoding: urlencode
> used for encoding query-string parts, and rawurlencode used for encoding
> any other URL part (two different RFCs are followed by these functions).
> Which of these RFCs is applied in this filter? Furthermore, the description
> says that "special characters" can be stripped or encoded. Is one of these
> actions the default and the other can be selected by a flag or are both
> optional? What are these special characters? Are they special in the
> context of URL? If so, why did we encode them first? If these are HTML
> special characters (there's no single definition of special HTML chars),
> then why does this filter encode them if the filter is for URL
> sanitization? What does backtick have to do with any of this
> (FILTER_FLAG_STRIP_BACKTICK)?
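
One way to see which convention the filter follows is to run the same input
through all three encoders. This is only a sketch with a made-up input string.
The values in the comments are for urlencode() and rawurlencode() only; the
filter's result is printed rather than assumed.

```php
<?php
// Illustrative input: a space and a tilde are where the two functions differ.
$input = 'a b~c';

$form = urlencode($input);     // a+b%7Ec  (space becomes "+", "~" is encoded)
$raw  = rawurlencode($input);  // a%20b~c  (space becomes %20, "~" is kept)

// Print the filter's result next to them to compare conventions.
$filtered = filter_var($input, FILTER_SANITIZE_ENCODED);

echo $form, PHP_EOL, $raw, PHP_EOL, $filtered, PHP_EOL;
```
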
>
> *FILTER_SANITIZE_ADD_SLASHES* - "Apply addslashes(). (Available as of PHP
> 7.3.0)"
> This filter was added as a replacement for magic_quotes filter. According
> to PHP documentation, addslashes is supposed to be used when injecting PHP
> variables into an eval'd string. Real life has shown that this function is used
> in a lot of places that have nothing to do with PHP's eval. I am not sure
> if the sanitize filter is misused in a similar fashion, but judging from
> the fact that it was meant as a replacement for magic_quotes, my guess is
> that it's very likely still abused.
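
A quick check of the "Apply addslashes()" description, assuming PHP 7.3 or
later where the constant exists. The input string is invented for the example.

```php
<?php
// Hypothetical input containing every character addslashes() escapes
// except NUL: a single quote, a double quote and a backslash.
$input = "O'Reilly \"quoted\" C:\\path";

$viaFilter   = filter_var($input, FILTER_SANITIZE_ADD_SLASHES);
$viaFunction = addslashes($input);

// Both calls should produce the same escaped string.
var_dump($viaFilter === $viaFunction);
```
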
>
> *FILTER_SANITIZE_EMAIL* - "Remove all characters except letters, digits and
> !#$%&'*+-=?^_`{|}~@.[]."
> Which RFC does this adhere to? It strips slashes and quoted parts, doesn't
> allow IPv6 addresses and doesn't accept RFC 6530 email addresses. This
> filter is ok for simple usage, but it isn't true to any known specification
> AFAIK.
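
To make the character stripping concrete, here is a small sketch. Both
addresses are hypothetical and were picked only because they contain
characters outside the allowed set quoted above.

```php
<?php
// Hypothetical inputs, chosen to hit characters outside the allowed set.
$inputs = [
    'john(.doe)@exa//mple.com',  // parentheses and slashes
    '"john doe"@example.com',    // quoted local part containing a space
];

$results = [];
foreach ($inputs as $input) {
    $results[$input] = filter_var($input, FILTER_SANITIZE_EMAIL);
    echo $input, ' => ', $results[$input], PHP_EOL;
}
// john(.doe)@exa//mple.com => john.doe@example.com
// "john doe"@example.com => johndoe@example.com
```

Note that the second result is a different mailbox from the one that was
typed, which is the kind of silent change being questioned here.
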
>
> *FILTER_SANITIZE_SPECIAL_CHARS* - "HTML-encode '"<>& and characters with
> ASCII value less than 32, optionally strip or encode other special
> characters."
> What's the intended purpose of this filter? "Special characters" are still
> not clearly defined, but at least it's more clear than
> the FILTER_SANITIZE_ENCODED description. Same question about backticks
> though: why? Why encode ASCII <32 chars?
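
For comparison with htmlspecialchars(), a short sketch of what this filter
emits when no flags are passed. The input string is invented for illustration.

```php
<?php
// Hypothetical HTML fragment containing <, >, " and &.
$input = '<a href="x">T&C</a>';

// The filter writes decimal numeric character references.
$numeric = filter_var($input, FILTER_SANITIZE_SPECIAL_CHARS);
echo $numeric, PHP_EOL;
// &#60;a href=&#34;x&#34;&#62;T&#38;C&#60;/a&#62;

// htmlspecialchars() writes named entities for the same characters.
$named = htmlspecialchars($input, ENT_QUOTES);
echo $named, PHP_EOL;
// &lt;a href=&quot;x&quot;&gt;T&amp;C&lt;/a&gt;
```
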
>
> *FILTER_SANITIZE_FULL_SPECIAL_CHARS* - "Equivalent to calling
> htmlspecialchars() with ENT_QUOTES set. Encoding quotes can be disabled by
> setting FILTER_FLAG_NO_ENCODE_QUOTES. Like htmlspecialchars(), this filter
> is aware of the default_charset and if a sequence of bytes is detected that
> makes up an invalid character in the current character set then the entire
> string is rejected resulting in a 0-length string. When using this filter
> as a default filter, see the warning below about setting the default flags
> to 0."
> Not to be mistaken with FILTER_SANITIZE_SPECIAL_CHARS. As long as it's not
> used with filter_input(), it's the least problematic. We
> have htmlspecialchars() though, so how useful is this filter?
>
> *FILTER_UNSAFE_RAW* - What makes it unsafe? Why isn't this just
> called FILTER_RAW_STRING? If the value being filtered is something other
> than a string, what will this filter return? Integers, floats, booleans and
> nulls are converted to a string; arrays and objects make the filter fail.
>
> ---
>
> Let's quickly mention the filter flags.
>
> The FILTER_FLAG_STRIP_LOW flag will also remove tabs, carriage returns and
> newlines as these all have ASCII codes below 32. When is this useful and
> expected?
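
A minimal sketch of the effect described, using FILTER_UNSAFE_RAW so that the
flag is the only thing changing the string. The two-line input is made up.

```php
<?php
// Hypothetical multi-line input: CR, LF and a tab separate the two parts.
$input = "line one\r\n\tline two";

// Every byte below ASCII 32 is dropped, including the line break and tab.
$stripped = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW);

var_dump($stripped); // string(16) "line oneline two"
```

The two lines are joined with nothing in between, so any layout carried by the
whitespace is lost.
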
>
> The FILTER_FLAG_ENCODE_LOW flag "encodes" ASCII <32 codes presumably into
> HTML entities, although that's not specified anywhere in the PHP manual.
> The word HTML does not appear on the
> https://www.php.net/manual/en/filter.filters.flags.php page. What do these
> characters look like when presented by HTML? When is it ever useful to use
> this flag?
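
Since the manual does not spell out the encoding, one way to find out is to
print it. This sketch assumes FILTER_UNSAFE_RAW as the base filter and an
invented input; the output is deliberately not stated here.

```php
<?php
// Hypothetical input with a tab and a newline between letters.
$input = "a\tb\nc";

// With ENCODE_LOW the control characters are replaced, not removed.
$encoded = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_ENCODE_LOW);

// Printing the result shows which representation the flag uses.
echo $encoded, PHP_EOL;

// Confirm that no raw control characters are left in the string.
var_dump(preg_match('/[\x00-\x1F]/', $encoded) === 0);
```
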
>
> FILTER_FLAG_ENCODE_AMP & FILTER_FLAG_STRIP_BACKTICK - why is this even a
> thing?
>
> Due to flags, FILTER_VALIDATE_EMAIL will happily validate email addresses
> that would be otherwise mangled by FILTER_SANITIZE_EMAIL.
>
> These are just the things I found confusing and strange about the 

Re: [PHP-DEV] Sanitize filters

2022-10-02 Thread Lokrain
Hello Vasilii,

It’s okay to have a different opinion, I hope.

You are missing an important point here: besides my comments, the current
way this is developed brings confusion.

It would be great if you shared your experience on this matter.

Regards,
Dimitar

On Sun, 2 Oct 2022 at 9:31, Vasilii Shpilchin wrote:

> All right, if you have been writing PHP for 25 years, you will have noticed
> that PHP has always been about high-level, web-focused functionality out of
> the box. This is one of the basic benefits of PHP over other general-purpose
> languages, where you can write everything you want but also have to write it
> yourself, since the language itself is very basic. I'm for PHP keeping
> built-in solutions for the most common problems in the context of the web.
> I say this having passed the ZCE exam and written PHP for just 15 years.

Re: [PHP-DEV] Sanitize filters

2022-10-02 Thread Vasilii Shpilchin
All right, if you have been writing PHP for 25 years, you will have noticed
that PHP has always been about high-level, web-focused functionality out of
the box. This is one of the basic benefits of PHP over other general-purpose
languages, where you can write everything you want but also have to write it
yourself, since the language itself is very basic. I'm for PHP keeping
built-in solutions for the most common problems in the context of the web.
I say this having passed the ZCE exam and written PHP for just 15 years.

On Sun, Oct 2, 2022, 2:19 AM Lokrain  wrote:

> Hello Kamil,
>
> I believe that PHP should not try to act as a “framework” that provides you
> with ready solutions for such cases.
>
> Being able to actually modify the default behaviour of some functions
> through the ini .. is even scarier.
>
> In 25 years of writing PHP I have never relied on this “magic” for security :)
>
> Regards,
> Dimitar

Re: [PHP-DEV] Sanitize filters

2022-10-02 Thread Lokrain
Hello Kamil,

I believe that PHP should not try to act as a “framework” that provides you
with ready solutions for such cases.

Being able to actually modify the default behaviour of some functions
through the ini .. is even scarier.

In 25 years of writing PHP I have never relied on this “magic” for security :)

Regards,
Dimitar

On Sat, 1 Oct 2022 at 18:39, Kamil Tekiela  wrote:

> Hi Internals,
>
> For quite some time now, PHP's sanitize filters have "Rustled My Jimmies".
> These filters bother me because I can't really justify their existence. I
> can understand that a few of them are sensible and may come in handy, but I
> would like to talk about some of these in particular.
>
> In PHP 8.1, we have deprecated FILTER_SANITIZE_STRING which I deemed to be
> a priority due to its confusing name and behaviour. The rest is slightly
> less dangerous, but as was pointed out to me in a recent conversation with
> a PHP developer, these filters are all very confusing.
>
> I would like to have some opinions on the following filters. What do you
> think we should do with them? Deprecate? Fix? Provide better documentation?
>
> ---
>
> *FILTER_SANITIZE_ENCODED* - "URL-encode string, optionally strip or encode
> special characters."
> Now, what does that mean? PHP has two functions for URL encoding: urlencode
> used for encoding query-string parts, and rawurlencode used for encoding
> any other URL part (two different RFCs are followed by these functions).
> Which of these RFCs is applied in this filter? Furthermore, the description
> says that "special characters" can be stripped or encoded. Is one of these
> actions the default and the other can be selected by a flag or are both
> optional? What are these special characters? Are they special in the
> context of URL? If so, why did we encode them first? If these are HTML
> special characters (there's no single definition of special HTML chars),
> then why does this filter encode them if the filter is for URL
> sanitization? What does backtick have to do with any of this
> (FILTER_FLAG_STRIP_BACKTICK)?
>
> *FILTER_SANITIZE_ADD_SLASHES* - "Apply addslashes(). (Available as of PHP
> 7.3.0)"
> This filter was added as a replacement for magic_quotes filter. According
> to PHP documentation, addslashes is supposed to be used when injecting PHP
> variables into an eval'd string. Real life has shown that this function is used
> in a lot of places that have nothing to do with PHP's eval. I am not sure
> if the sanitize filter is misused in a similar fashion, but judging from
> the fact that it was meant as a replacement for magic_quotes, my guess is
> that it's very likely still abused.
>
> *FILTER_SANITIZE_EMAIL* - "Remove all characters except letters, digits and
> !#$%&'*+-=?^_`{|}~@.[]."
> Which RFC does this adhere to? It strips slashes and quoted parts, doesn't
> allow IPv6 addresses and doesn't accept RFC 6530 email addresses. This
> filter is ok for simple usage, but it isn't true to any known specification
> AFAIK.
>
> *FILTER_SANITIZE_SPECIAL_CHARS* - "HTML-encode '"<>& and characters with
> ASCII value less than 32, optionally strip or encode other special
> characters."
> What's the intended purpose of this filter? "Special characters" are still
> not clearly defined, but at least it's more clear than
> the FILTER_SANITIZE_ENCODED description. Same question about backticks
> though: why? Why encode ASCII <32 chars?
>
> *FILTER_SANITIZE_FULL_SPECIAL_CHARS* - "Equivalent to calling
> htmlspecialchars() with ENT_QUOTES set. Encoding quotes can be disabled by
> setting FILTER_FLAG_NO_ENCODE_QUOTES. Like htmlspecialchars(), this filter
> is aware of the default_charset and if a sequence of bytes is detected that
> makes up an invalid character in the current character set then the entire
> string is rejected resulting in a 0-length string. When using this filter
> as a default filter, see the warning below about setting the default flags
> to 0."
> Not to be mistaken with FILTER_SANITIZE_SPECIAL_CHARS. As long as it's not
> used with filter_input(), it's the least problematic. We
> have htmlspecialchars() though, so how useful is this filter?
>
> *FILTER_UNSAFE_RAW* - What makes it unsafe? Why isn't this just
> called FILTER_RAW_STRING? If the value being filtered is something other
> than a string, what will this filter return? Integers, floats, booleans and
> nulls are converted to a string; arrays and objects make the filter fail.
>
> ---
>
> Let's quickly mention the filter flags.
>
> The FILTER_FLAG_STRIP_LOW flag will also remove tabs, carriage returns and
> newlines as these all have ASCII codes below 32. When is this useful and
> expected?
>
> The FILTER_FLAG_ENCODE_LOW flag "encodes" ASCII <32 codes presumably into
> HTML entities, although that's not specified anywhere in the PHP manual.
> The word HTML does not appear on the
> https://www.php.net/manual/en/filter.filters.flags.php page. What do these
> characters look like when presented by HTML? When is it ever