Hello Here, I wanted to discuss something that bothered me for a while and I think makes a good case for a small breaking change in our public interfaces for Airflow 3 a slack conversation from today (where re2 was evidently crashing due to some long regexp passed to it) triggered me to call for action.
I would like to propose that we get rid of regexp in all the places where we have user controllable input where user can supply a regexp (for example in REST APIs, APIs used by UI, CLI, operator APIs) and other places where the user can provide their own regexp now. And that should allow us to get rid of google-re2 as a dependency (google-re2 is a somewhat problematic dependency to have). *A bit of context:* We've replaced stdlib re usage with re2 in June 2023 as a result of a security report we got. In one (and likely few other API calls of ours) we used parameter that we passed (root) as regex - and that parameter was controllable by the user - and they could potentially cause ReDOS https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS issue (i.e. crash our webserver by providing a very small carefully crafted regexp in root parameter). google-re2 is (theoretically) protecting about this kind of issues b y limiting regexp supported and using c regexp library written by google that is faster. So as a result - as we were not sure if there are other cases (it was not really obvious that "root" passed message is then parsed as regexp - we used re2 everywhere in airflow, mandated use of re2 and added pre-commit checking it. For those who have access to our security mailing list https://lists.apache.org/thread/3mjt86r7djzdg5t22tby431t9cvg8drl is the relevant thread. *Problem*: That caused a number of problems: * google-re2 is binary package and it does not have binary wheel for newer Python versions and some platforms - and it has to be compiled on those, and it often fails, we had for example problem with installing airflow on conda * https://github.com/apache/airflow/discussions/32852 * https://stackoverflow.com/questions/76701323/trying-to-download-apache-airflow-with-pip-but-error-pops-up-when-building-whee * https://github.com/apache/airflow/issues/32849 * https://erogluegemen.medium.com/how-to-install-apache-airflow-on-mac-df9bd5cf1ff8 * https://stackoverflow.com/questions/77575842/unable-to-install-google-re2-on-macos-14-1-1 And the list goes on-and-on. Actually in my "Beach cleaning" review of project - re2 is high on the list to "forego" - also because it's written in C, and while it is developed in google, it might be that there are similar regexp vulnerabilities as the original re vulnerabilities *Proposed solution:* I think a better approach would be to make sure that we never, ever parse a string passed by a user as regexp. This means that we remove all the regex-parsed user-controlled parameters in Airflow 3 (including REST API, CLI, the APIs used by the UI, Task SDK if it's there, parameters of the operators etc.). But also that means that we should add mechanisms (automation/pre-commit. code review awareness with maintainers) - to prevent adding such cases in the future. Then we could just get rid of re2 and switch to stdlib re - if we control all the regexp, it's safe to use. I have not done an inventory of those usages, but there are a few places where regexp can be provided from a client side (and maybe even fewer or none) in the new UI. We have literally a few CLI commands in Airflow CLI that take regex as input (dag and task command) - and there are probably a few more places we will have to look at. In all those cases - I THINK - we should be able to use other, safer pattern types (`glob` for one is a good alternative in many cases and it's also a simpler one to grasp even if less flexible). That means "breaking change" for all those usages, but one that is rather unlikely to cause a great problem and if we provide globs or other patterns as replacement, it should be quite easy to handle by our users. So my recommendation would be to: * remove and add protection so that we do not use regexp for any user input * remove google-re2 as dependency BTW. This s a pure embodiment of the famous saying "If you have a problem, introduce a regexp, now you have two problems". WDYT? J.