potiuk commented on code in PR #26509:
URL: https://github.com/apache/airflow/pull/26509#discussion_r980206186
##########
airflow/models/dag.py:
##########
@@ -1189,11 +1195,25 @@ def relative_fileloc(self) -> pathlib.Path:
         """File location of the importable dag 'file' relative to the
         configured DAGs folder."""
         path = pathlib.Path(self.fileloc)
         try:
-            return path.relative_to(settings.DAGS_FOLDER)
+            rel_path = path.relative_to(self.dag_processor_dags_folder or settings.DAGS_FOLDER)
+            if rel_path == pathlib.Path('.'):
+                return path
+            else:
+                return rel_path
         except ValueError:
             # Not relative to DAGS_FOLDER.
             return path
+
+    @property
+    def dag_processor_dags_folder(self):
Review Comment:
I read the thread carefully (sorry it took me some time, I wanted to make
sure I am not mixing things up). There is some misunderstanding of what
`--subdir` is, so I will try to paraphrase my understanding and see if we are
on the same page.
A bit of context, as I understand and remember it :).
The idea of the "--subdir" flag, and the way we implemented it, was
deliberately simple - i.e. to be able to do two things:
Phase 1) separate out the standalone DAG processor from the scheduler
(isolate code execution to a different machine that, for example, might be in
a different security zone).
Phase 2) be able to split your DAGS folder into separate subdirs - each
isolated from the others.
Case 1) You have all DAG file processors run from DAGS_FOLDER (no --subdir
flag). This is equivalent to what we currently have when multiple schedulers
all parse DAGs from the same folder - which is a bit redundant but acceptable.
(Running several of them at the same time was not the main purpose of
introducing the standalone DAG processor.)
Case 2) You have one (or more) separate DAG processors, each group of them
with its own, separate directory to process, and no "top level" DAG file
processor. This is equivalent to having several independent DAG file
processors and no "common" one at all. Eventually this might be connected to
teams. The main use-case here was to make it independent from the actual
"airflow" CORE_DAGS_FOLDER:
TEAM_A -> DAG_PROCESSOR_A1, DAG_PROCESSOR_A2 (--subdir /FOLDER_OF_TEAM_A)
TEAM_B -> DAG_PROCESSOR_B1, DAG_PROCESSOR_B2 (--subdir /FOLDER_OF_TEAM_B)
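The team mapping above could be deployed roughly like this - a sketch only,
with hypothetical example paths (the `airflow dag-processor` command and its
`--subdir` flag are the real CLI discussed in this thread):

```shell
# Hypothetical per-team deployment: each DAG processor group is started
# with its own directory, fully independent of CORE_DAGS_FOLDER.
# Team A's processors (DAG_PROCESSOR_A1, DAG_PROCESSOR_A2):
airflow dag-processor --subdir /FOLDER_OF_TEAM_A
# Team B's processors (DAG_PROCESSOR_B1, DAG_PROCESSOR_B2):
airflow dag-processor --subdir /FOLDER_OF_TEAM_B
```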
This means that the user might configure it differently depending on the
deployment choice they want. And yeah, the `--subdir` name might be misleading
in this context. I vaguely recall we had a discussion when we chose the
parameter name (we can probably find it to get more context - I can't even
remember what my preference was), but `--subdir` in this case is more about
the fact that this is not "all of" the DAGs but a subset of them. I think we
chose the name because we decided to reuse the existing `--subdir` parameter
from local task execution, which already has these "misleading" semantics - so
that `airflow dags test --subdir <ABSOLUTE_PATH>` is equivalent to
`dag-processor --subdir <ABSOLUTE_PATH>`.
Currently `--subdir` in local tasks has this behaviour:
1) If it is an absolute path, use it as the absolute location of the DAG.
2) If it contains DAGS_FOLDER (anywhere), replace DAGS_FOLDER with the
current DAGS_FOLDER - this was fixed by @ashb in 2.2.0 to handle the case
where the scheduler and runners use different locations
(https://github.com/apache/airflow/pull/16860) - it was broken in 2.0 and 2.1
but worked in 1.10.
3) It can also expand the "user" path (`~`) (god knows why, but likely
historical too).
This is the current `process_subdir()` method:
```python
def process_subdir(subdir: str | None):
    """Expands path to absolute by replacing 'DAGS_FOLDER', '~', '.', etc."""
    if subdir:
        if not settings.DAGS_FOLDER:
            raise ValueError("DAGS_FOLDER variable in settings should be filled.")
        subdir = subdir.replace('DAGS_FOLDER', settings.DAGS_FOLDER)
        subdir = os.path.abspath(os.path.expanduser(subdir))
    return subdir
```
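To make the three behaviours above concrete, here is a self-contained sketch
of the same expansion logic, with `settings.DAGS_FOLDER` replaced by a made-up
constant so it runs standalone (the folder path is an example, not an Airflow
default):

```python
import os

DAGS_FOLDER = "/opt/airflow/dags"  # stand-in for settings.DAGS_FOLDER


def process_subdir(subdir):
    """Expand to absolute by replacing 'DAGS_FOLDER', '~', '.', etc."""
    if subdir:
        if not DAGS_FOLDER:
            raise ValueError("DAGS_FOLDER variable in settings should be filled.")
        # Behaviour 2: the literal token DAGS_FOLDER is substituted anywhere.
        subdir = subdir.replace("DAGS_FOLDER", DAGS_FOLDER)
        # Behaviours 1 and 3: '~' is expanded and the path made absolute.
        subdir = os.path.abspath(os.path.expanduser(subdir))
    return subdir


print(process_subdir("/abs/path"))           # absolute path used as-is
print(process_subdir("DAGS_FOLDER/team_a"))  # token replaced with the folder
```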
Originally the idea was that users might want to have them independent of
the current dags-folder
(https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#dags-folder)
without relative paths. That's fine, nothing prevents us from doing so; I
believe we handle this perfectly well - we already handled the situation where
DagBag is passed an absolute path and it stores the absolute path of the DAG
it comes from. This is for example used in the case of "example_dags", and
this is what we wanted to get.
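For reference, this is roughly what the pre-PR `relative_fileloc` behaviour
looks like in isolation - a minimal sketch with an example folder path (not an
Airflow default), showing both the relative case and the absolute-path
fallback that "example_dags" relies on:

```python
import pathlib

DAGS_FOLDER = pathlib.Path("/opt/airflow/dags")  # example stand-in


def relative_fileloc(fileloc: str) -> pathlib.Path:
    """Return the DAG file path relative to DAGS_FOLDER, or absolute if outside it."""
    path = pathlib.Path(fileloc)
    try:
        return path.relative_to(DAGS_FOLDER)
    except ValueError:
        # Not relative to DAGS_FOLDER: keep the absolute path
        # (the "example_dags" case mentioned above).
        return path


print(relative_fileloc("/opt/airflow/dags/team_a/dag.py"))
print(relative_fileloc("/usr/lib/python/example_dags/x.py"))
```

Note that `path.relative_to()` yields `Path('.')` when the two paths are
equal, which is the edge case the diff in this PR guards against.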
I believe (please correct me if I misunderstood it, @dstandish) that this PR
is very similar to what @ashb's #16860 change was. But while @ashb wanted to
get the relative approach working again for the main DAGS folder, this one
brings it to the standalone DAG processor.
This is fine - we did not want to do it originally, as we had not thought it
might be useful, but when I think of it, it can have some uses - but only if
you create a deployment where all the subdirs served by all file processors
are subdirs of the "CORE" dags folder (and there is still no top-level
parser/processor).
Then you could likely synchronise all your sub-folders as a single volume or
git-sync and scan each sub-folder of it through a different group of DAG file
processors. Originally the idea was that each DAG file processor (group) would
have a separate git-sync or volume. But this is an interesting approach where
they are isolated for execution but not isolated for syncing. It makes it a
bit more difficult to maintain the isolation and syncing, because you have to
have a filesystem hierarchy that parallels the per-processor hierarchy, or git
submodules to run syncing from multiple repositories - but it's doable, and a
number of our users already do it (for example Jagex in their Airflow Summit
talk from London).
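The git-submodules variant mentioned above could be sketched like this - all
repository URLs and folder names here are hypothetical, purely to illustrate
one superproject syncing several per-team sub-folders in a single pass:

```shell
# Hypothetical layout: one superproject (the "CORE" dags folder) with one
# submodule per team; a single sync updates every team's sub-folder at once,
# while separate DAG processor groups still parse each sub-folder in isolation.
git submodule add https://example.com/team_a_dags.git team_a
git submodule add https://example.com/team_b_dags.git team_b
git submodule update --init --remote  # one sync for all team folders
```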
Are we on the same page here :) ?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]