And one comment as well. I **just** tried to add Python 3.13 support
(finally):

And of course - the binary wheels google-re2 released for Python 3.13 are
not usable on Debian Bookworm, so the build process fails:

5.974       Built crcmod==1.7
6.196   × Failed to build `google-re2==1.1.20240702`
6.196   ├─▶ The build backend returned an error
6.196   ╰─▶ Call to `setuptools.build_meta:__legacy__.build_wheel` failed (exit
6.196       status: 1)
6.196
6.196       [stdout]
6.196       running bdist_wheel
6.196       running build
6.196       running build_py
6.196       copying re2/__init__.py -> build/lib.linux-x86_64-cpython-313/re2
6.196       running build_ext
6.196       building '_re2' extension
6.196       g++ -fno-strict-overflow -Wsign-compare -DNDEBUG -g -O3
6.196       -Wall -fPIC -I/root/.cache/uv/builds-v0/.tmp7L9INs/include
6.196       -I/usr/local/include/python3.13 -c _re2.cc -o
6.196       build/temp.linux-x86_64-cpython-313/_re2.o -fvisibility=hidden
6.196
6.196       [stderr]
6.196       _re2.cc:15:10: fatal error: absl/strings/string_view.h: No such file
6.196       or directory
6.196          15 | #include "absl/strings/string_view.h"
6.196             |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
6.196       compilation terminated.
6.196       error: command '/usr/bin/g++' failed with exit code 1
6.196
6.196       hint: This usually indicates a problem with the package or the build
6.196       environment.
6.196   help: `google-re2` (v1.1.20240702) was included because `apache-airflow`
6.196         (v3.0.0.dev0) depends on `google-re2`
------
ERROR: failed to solve: process "/bin/bash -o pipefail -o errexit -o
nounset -o nolog -c bash /scripts/docker/install_airflow.sh" did not
complete successfully: exit code: 1

So - as far as I am concerned - google-re2 should go away.
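
To make the glob alternative from the quoted proposal below a bit more
concrete, here is a minimal sketch using only the stdlib `fnmatch` module.
The helper name `filter_dag_ids` and the sample DAG ids are hypothetical and
only for illustration - this is not existing Airflow code. Note that
`fnmatch` still translates the glob to a stdlib regex internally, but glob
syntax has no grouping or nested quantifiers and regex metacharacters are
treated as literals, so a user cannot smuggle in a pattern like `(a+)+$`.
Worst-case matching time would still be worth checking before relying on it.

```python
import fnmatch


def filter_dag_ids(dag_ids, pattern):
    """Return the ids matching a shell-style glob (hypothetical helper)."""
    # fnmatchcase() is case-sensitive on every platform, unlike fnmatch(),
    # which normalizes case depending on the operating system.
    return [dag_id for dag_id in dag_ids if fnmatch.fnmatchcase(dag_id, pattern)]


if __name__ == "__main__":
    ids = ["etl_daily", "etl_hourly", "ml_train", "report_weekly"]
    # Only *, ?, [seq] and [!seq] are special in a glob.
    print(filter_dag_ids(ids, "etl_*"))
    # Regex syntax is matched literally, so this finds nothing.
    print(filter_dag_ids(ids, "(a+)+$"))
```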

On Wed, Feb 19, 2025 at 10:26 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> Hello here,
>
> I wanted to discuss something that has bothered me for a while and that I
> think makes a good case for a small breaking change in our public
> interfaces for Airflow 3. A Slack conversation from today (where re2 was
> evidently crashing due to some long regexp passed to it) triggered me to
> call for action.
>
> I would like to propose that we get rid of regexps in all the places where
> user-controllable input can supply one (for example in REST APIs, APIs
> used by the UI, CLI, operator APIs). That should allow us to get rid of
> google-re2 as a dependency (google-re2 is a somewhat problematic
> dependency to have).
>
> *A bit of context:*
>
> We replaced stdlib re usage with re2 in June 2023 as a result of a
> security report we got. In one API call of ours (and likely a few others)
> we took a parameter (root) and used it as a regex - and that parameter was
> controllable by the user - so they could potentially cause a ReDoS
> https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS
> issue (i.e. crash our webserver by providing a very small, carefully
> crafted regexp in the root parameter). google-re2 (theoretically) protects
> against this kind of issue by limiting the regexp syntax supported and by
> using a C++ regexp library written by Google that is faster.
>
> As a result - since we were not sure whether there were other cases (it was
> not really obvious that the "root" parameter is then parsed as a regexp) -
> we used re2 everywhere in Airflow, mandated the use of re2 and added a
> pre-commit check for it. For those who have access to our security mailing
> list, https://lists.apache.org/thread/3mjt86r7djzdg5t22tby431t9cvg8drl is
> the relevant thread.
>
> *Problem*:
>
> That caused a number of problems:
> * google-re2 is a binary package and it does not have binary wheels for
> newer Python versions and some platforms, so it has to be compiled there,
> and that often fails. For example, we had problems with installing Airflow
> on conda:
>
> * https://github.com/apache/airflow/discussions/32852
> * https://stackoverflow.com/questions/76701323/trying-to-download-apache-airflow-with-pip-but-error-pops-up-when-building-whee
> * https://github.com/apache/airflow/issues/32849
> * https://erogluegemen.medium.com/how-to-install-apache-airflow-on-mac-df9bd5cf1ff8
> * https://stackoverflow.com/questions/77575842/unable-to-install-google-re2-on-macos-14-1-1
>
> And the list goes on and on. Actually, in my "Beach cleaning" review of the
> project, re2 is high on the list of things to "forego" - also because it's
> written in C++, and while it is developed at Google, it might have regexp
> vulnerabilities similar to the original re vulnerabilities.
>
> *Proposed solution:*
>
> I think a better approach would be to make sure that we never, ever parse
> a string passed by a user as a regexp. This means that we remove all the
> regex-parsed user-controlled parameters in Airflow 3 (including REST API,
> CLI, the APIs used by the UI, Task SDK if it's there, parameters of the
> operators etc.). It also means that we should add mechanisms
> (automation/pre-commit, code review awareness among maintainers) to
> prevent adding such cases in the future.
>
> Then we could just get rid of re2 and switch to stdlib re - if we control
> all the regexps, it's safe to use.
>
> I have not done an inventory of those usages, but there are a few places
> where a regexp can be provided from the client side (and maybe even fewer,
> or none, in the new UI). We have literally a few CLI commands in the
> Airflow CLI that take a regex as input (dag and task commands) - and there
> are probably a few more places we will have to look at. In all those cases
> - I THINK - we should be able to use other, safer pattern types (`glob`
> for one is a good alternative in many cases and it's also simpler to grasp,
> even if less flexible).
>
> That means "breaking change" for all those usages, but one that is rather
> unlikely to cause a great problem and if we provide globs or other patterns
> as replacement, it should be quite easy to handle by our users.
>
> So my recommendation would be to:
> * remove regexp parsing of user input and add protection so that we do
> not reintroduce it
> * remove google-re2 as a dependency
>
> BTW. This is a pure embodiment of the famous saying "If you have a problem,
> introduce a regexp, now you have two problems".
>
>
> WDYT?
>
> J.
