Thanks for bringing this up Jarek! I like the idea of removing regexp definitions from everywhere.
First of all, for the CLI part, when we go remote, agree. We can make these changes and include them as breaking since we will already make a bunch of breaking changes there. For backward compatibility, it will vary between local and remote. Local commands shouldn't be backward compatible and we should document that. The remote will go to the API, and since the public API will be backwards compatible, this should be backwards compatible, too. Please correct me if I'm wrong. I would love to have no backwards compatibility there to be more flexible. I am up for `Please use it Wisely` documentation (should be one of the must-haves for local_commands) and most likely update with glob in the next available iteration. I think we also have in structure.py too where it is for UI. Again we have root there used for task_ids_or_regex. It seems the task groups and tasks are already loaded from the files. Should we need to support regexp here? Maybe I am missing a use case here but the method is only used in the API and old UI where it will be gone soon. Other usages use the method with task_id rather than regexp. The least effort could be dropping the support and updating the downstream usages accordingly. What do you think? I am not sure how easy would be to cover all cases of introducing regexp. We may end up maintaining a step where it may miss lots of edge cases. I don't have any strong opinion on this though, we may simply have checks and that would help to increase awareness already. On Thu, Feb 20, 2025 at 12:41 AM Jarek Potiuk <ja...@potiuk.com> wrote: > > Challenge would be... who has capacity to make it? > > This seems to be not much work - assuming that we agree on how to approach > it, I can do it with the help of involved people to understand > the single case where I am not sure. I would love some feedback. > > I actually did an "inventory" of sorts in the Python 3.13 first-pass PR > https://github.com/apache/airflow/pull/46891 - and I can extract google-re > removal from it while fixing the dependency issues. so part of the work is > already done. > > And I have concrete proposals for the approach (with one unsure thing). > > *1. Internal regexps - just use re* > > We can just use re here. I already removed and reviewed re2 where only > "internal" regexps are used. > > *2. Very likely no changes needed.* > > We can assume that an admin person configuring it will not be "rogue", so > likely setting regexp in configuration does not have to be "protected". > > * airflow/metrics/validators.py -> metrics allow/block list from > configuration > * airflow/serialization/serde.py -> allowed_deserialization_classes_regexp > from configuration > * airflow/models/dagrun.py -> allowed_run_id_pattern > * task_sdk/src/airflow/sdk/execution_time/secrets_masker.py -> patterns > configured by secret adapters > > *3. Security model clarification maybe? Or breaking changes?* > > I think while we could add some breaking changes, it will be better to > clarify things in security mode. > > * airflow/cli/commands/remote_commands/dag_command.py : dag command, > mark_success_pattern > * airflow/models.dag.py : mark_success_pattern: dag/test dag command > * airflow/utils/cli.py: get_dags : dags command > > Those regexps are provided by potentially "low access" CLI (especially > when later we do remote CLI) - so we could likely change it to glob. But > instead we can also make a security model explanation and explain > that whoever has access to CLI and those specific commands should be > trusted not to abuse it and cause ReDoc. Also - we can worry about it when > we go remot, and we can also mention that the CLI command line is not > covered by "back-compatibility" guarantee (so that we can change it later). > > * airflow/utils/file.py: those are .*ariflowignore* rules. DAG authors > could potentially modify them, but DAG authors can still do "dos" pretty > freely with just TOP level code and task doing "stuff" - so while we could > consider dropping regexp, I see no big value in it and we can use "re". > > *4. Unsure (here I need some help) * > > * task_ids_or_regex (airflow/models/dag.py, sdk/definitions, grid.py -> > this is the one that was flagged by the security report (passed as root > parameter) - this is used by both UI and exposed via task_sdk. I am not > really sure - if we really need to use regex here, maybe glob (which is > less flexible and safer) can be used instead of regex even if it will be > somewhat breaking/new behaviour for the user? > > I think Brent, Ash, Amogh, Pierre - you've been touching that one > recently. Would love to hear what you think whether we can do something > with it? > > *5. Future protection* > > Looking at the "number of cases" above and how those regexes could be > passed between UI/ task SDK API etc - I am not sure if we can automate > "future" avoidance easily. I think we might simply start paying attention > and treat the "introduce regexp - you have two problems" saying more > seriously during reviews and possibly maybe - similar to the "internal-api" > label - flag and fail if a PR introduces a regexp :) ?. Not sure if it's > worth it though > > WDYT? Does the approach look good? Any comments? > > J. > > > > > On Wed, Feb 19, 2025 at 9:35 PM Jens Scheffler <j_scheff...@gmx.de.invalid > > > wrote: > > > Sounds good to me - I was not aware of user-facing interfaces... and if > > removing support then 3.0 is the best point. > > > > Challenge would be... who has capacity to make it? > > > > On 19.02.25 15:39, Jed Cunningham wrote: > > > Sounds good to me. > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org > > For additional commands, e-mail: dev-h...@airflow.apache.org > > > > > -- Bugra Ozturk