[
https://issues.apache.org/jira/browse/AIRFLOW-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541666#comment-16541666
]
Ash Berlin-Taylor commented on AIRFLOW-1729:
--------------------------------------------
Closer to a fuller fix is this diff:
{code:python}
diff --git a/airflow/models.py b/airflow/models.py
index 089befef..e722b609 100755
--- a/airflow/models.py
+++ b/airflow/models.py
@@ -522,12 +522,27 @@ class DagBag(BaseDagBag, LoggingMixin):
         if os.path.isfile(dag_folder):
             self.process_file(dag_folder, only_if_updated=only_if_updated)
         elif os.path.isdir(dag_folder):
+            patterns_by_dir = {}
             for root, dirs, files in os.walk(dag_folder, followlinks=True):
-                patterns = []
+                patterns = patterns_by_dir.get(root, []).copy()
+                self.log.info("Root %s dirs %r patterns %r", root, dirs, patterns)
                 ignore_file = os.path.join(root, '.airflowignore')
                 if os.path.isfile(ignore_file):
+                    self.log.info("Loading %s", ignore_file)
                     with open(ignore_file, 'r') as f:
                         patterns += [p for p in f.read().split('\n') if p]
+                #dirs[:] = list[d for d in dirs if not any([re.findall(p, os.path.join(root, d)) for p in patterns])]
+
+                # If we can ignore any subdirs entirely we should - fewer paths
+                # to walk is better. We have to modify the ``dirs`` array in
+                # place for this to affect os.walk
+                dirs[:] = [d for d in dirs if not any(re.findall(p, os.path.join(root, d)) for p in patterns)]
+
+                # We want patterns defined in a parent folder's .airflowignore to
+                # apply to subdirs too
+                for d in dirs:
+                    patterns_by_dir[os.path.join(root, d)] = patterns
+
                 for f in files:
                     try:
                         filepath = os.path.join(root, f)
{code}
Reasons I haven't just opened a PR with that: we need to add tests for this so
it doesn't break again, and we should de-duplicate this code and the almost
identical code in airflow.utils.dag_processing.
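
For anyone who wants to try the idea outside Airflow, here is a minimal standalone sketch of the same walk: it prunes ignored directories in place and passes each directory's patterns down to its children. The helper name find_candidate_files and the ".py" suffix filter are assumptions made for this example, not Airflow code, and it uses re.search where the diff uses re.findall (the two agree when only truthiness is checked).

```python
import os
import re


def find_candidate_files(dag_folder, ignore_name=".airflowignore"):
    """Hypothetical helper: list .py files under dag_folder, honouring
    ignore patterns from the current directory and all of its parents."""
    patterns_by_dir = {}
    found = []
    for root, dirs, files in os.walk(dag_folder, followlinks=True):
        # Start from a copy of whatever the parent directory handed down.
        patterns = list(patterns_by_dir.get(root, []))
        ignore_file = os.path.join(root, ignore_name)
        if os.path.isfile(ignore_file):
            with open(ignore_file, "r") as f:
                patterns += [p for p in f.read().split("\n") if p]

        # Slice assignment mutates the list that os.walk recurses into,
        # so pruned directories are never visited at all.
        dirs[:] = [
            d for d in dirs
            if not any(re.search(p, os.path.join(root, d)) for p in patterns)
        ]

        # Remaining subdirectories inherit the accumulated patterns.
        for d in dirs:
            patterns_by_dir[os.path.join(root, d)] = patterns

        for name in files:
            path = os.path.join(root, name)
            if not name.endswith(".py"):
                continue
            if not any(re.search(p, path) for p in patterns):
                found.append(path)
    return found
```

Because patterns are matched with re.search against the joined path, a pattern can also match part of a parent directory's name, so patterns in a real .airflowignore should be specific enough to avoid that.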
> Ignore whole directories in .airflowignore
> ------------------------------------------
>
> Key: AIRFLOW-1729
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1729
> Project: Apache Airflow
> Issue Type: Improvement
> Components: core
> Affects Versions: Airflow 2.0
> Reporter: Cedric Hourcade
> Assignee: Kamil Sambor
> Priority: Minor
> Fix For: 2.0.0
>
>
> The .airflowignore file lets you prevent files from being scanned for DAGs.
> But even if we blacklist a full directory, {{os.walk}} will still descend
> into it, no matter how deep it is, and skip its files one by one, which can
> be an issue when you keep big .git or virtualenv directories around.
> I suggest adding something like:
> {code}
> dirs[:] = [d for d in dirs if not any([re.findall(p, os.path.join(root, d))
> for p in patterns])]
> {code}
> to prune the directories here:
> https://github.com/apache/incubator-airflow/blob/cfc2f73c445074e1e09d6ef6a056cd2b33a945da/airflow/utils/dag_processing.py#L208-L209
> and in {{list_py_file_paths}}
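
The pruning suggested in the description only works because of the slice assignment. As a small illustration (the visited_dirs helper and the temporary directory layout are made up for this example), the sketch below compares assigning to dirs[:] with rebinding dirs:

```python
import os
import tempfile


def visited_dirs(top, skip, in_place):
    """Return the directories os.walk visits under top, relative to top."""
    visited = []
    for root, dirs, _files in os.walk(top):
        visited.append(os.path.relpath(root, top))
        kept = [d for d in dirs if d != skip]
        if in_place:
            dirs[:] = kept  # mutates the list os.walk will recurse into
        else:
            dirs = kept  # only rebinds the local name; os.walk is unaffected
    return sorted(visited)


# Illustrative layout: a .git directory with a subdirectory, plus a dags folder.
top = tempfile.mkdtemp()
os.makedirs(os.path.join(top, ".git", "objects"))
os.makedirs(os.path.join(top, "dags"))

print(visited_dirs(top, ".git", in_place=True))
print(visited_dirs(top, ".git", in_place=False))
```

With in_place=True the walk never enters .git; with in_place=False it still descends into .git and everything below it, which is the cost the issue describes.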