dstandish commented on code in PR #26509:
URL: https://github.com/apache/airflow/pull/26509#discussion_r980254962


##########
airflow/models/dag.py:
##########
@@ -1189,11 +1195,25 @@ def relative_fileloc(self) -> pathlib.Path:
         """File location of the importable dag 'file' relative to the 
configured DAGs folder."""
         path = pathlib.Path(self.fileloc)
         try:
-            return path.relative_to(settings.DAGS_FOLDER)
+            rel_path = path.relative_to(self.dag_processor_dags_folder or 
settings.DAGS_FOLDER)
+            if rel_path == pathlib.Path('.'):
+                return path
+            else:
+                return rel_path
         except ValueError:
             # Not relative to DAGS_FOLDER.
             return path
 
+    @property
+    def dag_processor_dags_folder(self):

Review Comment:
   > Are we on the same page here :) ?
   
   maybe sort of? :) i'm not sure what you're proposing though...
   
   taking it back down to where the rubber meets the road here...  what i'm 
trying to do here is fix a bug.  currently if the DAGS_FOLDER of 
scheduler/processor differs from that of k8s executor then all tasks will fail 
to run.  i think this issue is the same one you say worked in 1.10 but was 
broken in 2.0  and 2.1 -- it's broken again in 2.4 and that's what i'm trying 
to fix.
   
   this is easy to repro if you run scheduler in local virtualenv and run your 
k8s pods in docker desktop
   
   the cause is the recent optimization to not "process" dags in `airflow tasks 
run --local` but to read from ser dag.  the pod definition remains like this
   ```
     - args:
       - airflow
       - tasks
       - run
       - example_bash_operator
       - runme_0
       - manual__2022-09-23T20:04:52.306715+00:00
       - --local
       - --subdir
       - DAGS_FOLDER
   ```
   But when that gets converted to `tasks run --raw`  then the subdir becomes 
the full path (`fileloc`) of the dag _as it is located on the 
scheduler/processor`.  I think it breaks down because, our mechanism for 
handling different dags folders is `relative_fileloc`.  But currently the logic 
to calculate relative fileloc is based on the _current_ dags folder, which 
means, if the _current_ process has a different dags folder, it cannot be 
calculated!  So then it defaults to the full filepath as stored in serdag and 
therefore it fails to find the dag.
   
   So what do we do about it?  The relatively straightforward solution is to 
use the dags_folder of serializer instead of dags_folder of current process 
when calculating relative fileloc (or equivalently, you could actually store 
the relative fileloc when serializing).  Then the actual path can always be 
reassembled, provided that the dag _actually is_ relative to the dags_folder in 
all envs (and has the same relative path).
   
   So that's why I added dag_processor_dags_folder.
   
   But @mhenc is suggesting, I think, to just use `processor_subdir` instead.  
But the problem I see with that is, processor_subdir may not always equal 
dags_folder.  And in the case of multiple processors in subdirs of dags folder, 
then this would not work I think.  The only way it would work is if you had a 
1-1 relationship between dags_folder of worker and processor_subdir of 
processor.  But in that scenario, why not just set the DAGS_FOLDER of your 
processor to be the subdir?
   
   Fundamentally, I think that currently airflow assumes everything must be in 
the dags folder, and that dag processors, to the extent they have different 
paths (subdir) they are _subdirs_ of the dags folder.  This alone seems QED for 
the notion that processor_subdir !== dags_folder.  I think as we get closer to 
multitenancy, it seems likely we'll need to blow up the dags_folder concept; if 
it's no longer true that there is _one_ dags folder then the code and language 
need to reflect that.  There may be many dags folders.  "relative_fileloc" 
won't necessarily make sense as a concept (unless it's redefined to mean 
relative to the dags folder of this "sub-instance").  But for now, the two 
(processor_subdir) and dags folder seem to be two different things, and 
DAGS_FOLDER seems to have the best likelihood of giving a reliable answer for 
`relative_fileloc` so it seems worth storing separately.
   
   THAT SAID, perhaps we should make it a "private" attribute, so as not to 
increase the backcompat surface unnecessarily?
   
   WDYT?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to