Hello Here,

I wanted to discuss something that bothered me for a while and I think
makes a good case for a small breaking change in our public interfaces for
Airflow 3 a slack conversation from today (where re2 was evidently crashing
due to some long regexp passed to it) triggered me to call for action.

I would like to propose that we get rid of regexp in all the places where
we have user controllable input where user can supply a regexp (for example
in REST APIs, APIs used by UI, CLI, operator APIs) and other places where
the user can provide their own regexp now. And that should allow us to get
rid of google-re2 as a dependency (google-re2 is a somewhat problematic
dependency to have).

*A bit of context:*

We've replaced stdlib re usage with re2 in June 2023 as a result of a
security report we got. In one (and likely few other API calls of ours) we
used parameter that we passed (root) as regex - and that parameter was
controllable by the user - and they could potentially cause ReDOS
https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS
issue (i.e. crash our webserver by providing a very small carefully crafted
regexp in root parameter). google-re2 is (theoretically) protecting
about this kind of issues b y limiting regexp supported and using c regexp
library written by google that is faster.

So as a result - as we were not sure if there are other cases (it was not
really obvious that "root" passed message is then parsed as regexp - we
used re2 everywhere in airflow, mandated use of re2 and added pre-commit
checking it. For those who have access to our security mailing list
https://lists.apache.org/thread/3mjt86r7djzdg5t22tby431t9cvg8drl is the
relevant thread.

*Problem*:

That caused a number of problems:
* google-re2 is binary package and it does not have binary wheel for newer
Python versions and some platforms - and it has to be compiled on those,
and it often fails, we had for example problem with installing airflow on
conda

* https://github.com/apache/airflow/discussions/32852
*
https://stackoverflow.com/questions/76701323/trying-to-download-apache-airflow-with-pip-but-error-pops-up-when-building-whee
* https://github.com/apache/airflow/issues/32849
*
https://erogluegemen.medium.com/how-to-install-apache-airflow-on-mac-df9bd5cf1ff8
*
https://stackoverflow.com/questions/77575842/unable-to-install-google-re2-on-macos-14-1-1

And the list goes on-and-on. Actually in my "Beach cleaning" review of
project - re2 is high on the list to "forego" - also because it's written
in C, and while it is developed in google, it might be that there are
similar regexp vulnerabilities as the original re vulnerabilities

*Proposed solution:*

I think a better approach would be to make sure that we never, ever parse a
string passed by a user as regexp. This means that we remove all the
regex-parsed user-controlled parameters in Airflow 3 (including REST API,
CLI, the APIs used by the UI, Task SDK if it's there, parameters of the
operators etc.). But also that means that we should add mechanisms
(automation/pre-commit. code review awareness with maintainers) - to
prevent adding such cases in the future.

Then we could just get rid of re2 and switch to stdlib re - if we control
all the regexp, it's safe to use.

I have not done an inventory of those usages, but there are a few places
where regexp can be provided from a client side (and maybe even fewer or
none) in the new UI. We have literally a few CLI commands in Airflow CLI
that take regex as input (dag and task command)  - and there are probably a
few more places we will have to look at. In all those cases - I THINK - we
should be able to use other, safer pattern types (`glob` for one is a good
alternative in many cases and it's also a simpler one to grasp even if less
flexible).

That means "breaking change" for all those usages, but one that is rather
unlikely to cause a great problem and if we provide globs or other patterns
as replacement, it should be quite easy to handle by our users.

So my recommendation would be to:
* remove and add protection so that we do not use regexp for any user input
* remove google-re2 as dependency

BTW. This s a pure embodiment of the famous saying "If you have a problem,
introduce a regexp, now you have two problems".


WDYT?

J.

Reply via email to