potiuk commented on code in PR #26509:
URL: https://github.com/apache/airflow/pull/26509#discussion_r980206186
##########
airflow/models/dag.py:
##########
@@ -1189,11 +1195,25 @@ def relative_fileloc(self) -> pathlib.Path:
         """File location of the importable dag 'file' relative to the
         configured DAGs folder."""
         path = pathlib.Path(self.fileloc)
         try:
-            return path.relative_to(settings.DAGS_FOLDER)
+            rel_path = path.relative_to(self.dag_processor_dags_folder or settings.DAGS_FOLDER)
+            if rel_path == pathlib.Path('.'):
+                return path
+            else:
+                return rel_path
         except ValueError:
             # Not relative to DAGS_FOLDER.
             return path
+
+    @property
+    def dag_processor_dags_folder(self):
Review Comment:
I read the thread carefully (sorry it took me some time, I wanted to make
sure I am not mixing things up). There is some misunderstanding of what
`--subdir` is, so I will try to paraphrase my understanding and see if we are
on the same page.
A bit of context, as I understand and remember it :).
The idea of the "--subdir" flag, and the way we implemented it, was
deliberately simple - i.e. to be able to do two things:
Phase 1) separate out the standalone DAG processor from the scheduler
(isolate code execution to a different machine that, for example, might be in
a different security zone).
Phase 2) be able to split your DAGS folder into separate subdirs - each
isolated from the others.
Case 1) You have all DAG file processors run from DAGS_FOLDER (no --subdir
flag). This is equivalent to what we currently have when multiple schedulers
all parse DAGs from the same folder - which is a bit redundant but acceptable.
(Running several of them at the same time was not the main purpose of
introducing the standalone DAG processor.)
Case 2) You have one (or more) separate DAG processors, each group of them
with its own, separate directory to process, and no "top level" DAG file
processor. This is equivalent to having several independent DAG file
processors and no "common" one at all. Eventually this might be connected to
teams. The main use-case here was to make it independent from the actual
"airflow" CORE_DAGS_FOLDER:
TEAM_A -> DAG_PROCESSOR_A1, DAG_PROCESSOR_A2 (--subdir /FOLDER_OF_TEAM_A)
TEAM_B -> DAG_PROCESSOR_B1, DAG_PROCESSOR_B2 (--subdir /FOLDER_OF_TEAM_B)
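The team mapping above could be deployed roughly like this - a sketch only,
with hypothetical example paths (the `airflow dag-processor` command and its
`--subdir` flag are the real CLI discussed in this thread):

```shell
# Hypothetical per-team deployment: each DAG processor group is started
# with its own directory, fully independent of CORE_DAGS_FOLDER.
# Team A's processors (DAG_PROCESSOR_A1, DAG_PROCESSOR_A2):
airflow dag-processor --subdir /FOLDER_OF_TEAM_A
# Team B's processors (DAG_PROCESSOR_B1, DAG_PROCESSOR_B2):
airflow dag-processor --subdir /FOLDER_OF_TEAM_B
```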
This means that the user might configure it differently depending on the
deployment choice they want. And yeah, the `--subdir` name might be misleading
in this context. I vaguely recall we had a discussion when we chose the
parameter name (we can probably find it to get more context - I can't even
remember what my preference was), but `--subdir` in this case is more about
the fact that this is not "all of" the DAGs but a subset of them. I think we
chose the name because we decided to reuse the existing `--subdir` parameter
from local task execution, which already has these "misleading" semantics - so
that `airflow dags test --subdir <ABSOLUTE_PATH>` is equivalent to
`dag-processor --subdir <ABSOLUTE_PATH>`.
Currently `--subdir` in local tasks has this behaviour:
1) If it is an absolute path, use it as the absolute location of the DAG.
2) If it contains DAGS_FOLDER (anywhere), replace DAGS_FOLDER with the
current DAGS_FOLDER - this was fixed by @ashb in 2.2.0 to handle the case
where the scheduler and runners use different locations
(https://github.com/apache/airflow/pull/16860) - it was broken in 2.0 and 2.1
but worked in 1.10.
3) It can also expand the "user" path (`~`) (god knows why, but likely
historical too).
This is the current `process_subdir()` method:
```python
def process_subdir(subdir: str | None):
    """Expands path to absolute by replacing 'DAGS_FOLDER', '~', '.', etc."""
    if subdir:
        if not settings.DAGS_FOLDER:
            raise ValueError("DAGS_FOLDER variable in settings should be filled.")
        subdir = subdir.replace('DAGS_FOLDER', settings.DAGS_FOLDER)
        subdir = os.path.abspath(os.path.expanduser(subdir))
    return subdir
```
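To make the three behaviours above concrete, here is a self-contained sketch
of the same expansion logic, with `settings.DAGS_FOLDER` replaced by a made-up
constant so it runs standalone (the folder path is an example, not an Airflow
default):

```python
import os

DAGS_FOLDER = "/opt/airflow/dags"  # stand-in for settings.DAGS_FOLDER


def process_subdir(subdir):
    """Expand to absolute by replacing 'DAGS_FOLDER', '~', '.', etc."""
    if subdir:
        if not DAGS_FOLDER:
            raise ValueError("DAGS_FOLDER variable in settings should be filled.")
        # Behaviour 2: the literal token DAGS_FOLDER is substituted anywhere.
        subdir = subdir.replace("DAGS_FOLDER", DAGS_FOLDER)
        # Behaviours 1 and 3: '~' is expanded and the path made absolute.
        subdir = os.path.abspath(os.path.expanduser(subdir))
    return subdir


print(process_subdir("/abs/path"))           # absolute path used as-is
print(process_subdir("DAGS_FOLDER/team_a"))  # token replaced with the folder
```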
Originally the idea was that users might want to have them independent of
the current dags-folder
(https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#dags-folder)
without relative paths. That's fine, nothing prevents us from doing so; I
believe we handle this perfectly well - we already handled the situation where
DagBag is passed an absolute path and it stores the absolute path of the DAG
it comes from. This is for example used in the case of "example_dags", and
this is what we wanted to get.
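For reference, this is roughly what the pre-PR `relative_fileloc` behaviour
looks like in isolation - a minimal sketch with an example folder path (not an
Airflow default), showing both the relative case and the absolute-path
fallback that "example_dags" relies on:

```python
import pathlib

DAGS_FOLDER = pathlib.Path("/opt/airflow/dags")  # example stand-in


def relative_fileloc(fileloc: str) -> pathlib.Path:
    """Return the DAG file path relative to DAGS_FOLDER, or absolute if outside it."""
    path = pathlib.Path(fileloc)
    try:
        return path.relative_to(DAGS_FOLDER)
    except ValueError:
        # Not relative to DAGS_FOLDER: keep the absolute path
        # (the "example_dags" case mentioned above).
        return path


print(relative_fileloc("/opt/airflow/dags/team_a/dag.py"))
print(relative_fileloc("/usr/lib/python/example_dags/x.py"))
```

Note that `path.relative_to()` yields `Path('.')` when the two paths are
equal, which is the edge case the diff in this PR guards against.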
I believe (please correct me if I misunderstood it, @dstandish) that this PR
is very similar to what @ashb's #16860 change was. But while @ashb wanted to
get the relative approach working again for the main DAGS folder, this one
brings it to the standalone DAG processor.
This is fine - we did not want to do it originally, as we had not thought it
might be useful, but when I think of it, it can have some uses - but only if
you create a deployment where all the subdirs served by all file processors
are subdirs of the "CORE" dags folder (and there is still no top-level
parser/processor).
Then you could likely synchronise all your sub-folders as a single volume or
git-sync and scan each sub-folder of it through a different group of DAG file
processors. Originally the idea was that each DAG file processor (group) would
have a separate git-sync or volume. But this is an interesting approach where
they are isolated for execution but not isolated for syncing. It makes it a
bit more difficult to maintain the isolation and syncing, because you have to
have a filesystem hierarchy that parallels the per-processor hierarchy, or git
submodules to run syncing from multiple repositories - but it's doable, and a
number of our users already do it (for example Jagex in their Airflow Summit
talk from London).
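The git-submodules variant mentioned above could be sketched like this - all
repository URLs and folder names here are hypothetical, purely to illustrate
one superproject syncing several per-team sub-folders in a single pass:

```shell
# Hypothetical layout: one superproject (the "CORE" dags folder) with one
# submodule per team; a single sync updates every team's sub-folder at once,
# while separate DAG processor groups still parse each sub-folder in isolation.
git submodule add https://example.com/team_a_dags.git team_a
git submodule add https://example.com/team_b_dags.git team_b
git submodule update --init --remote  # one sync for all team folders
```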
Are we on the same page here :) ?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]