[GitHub] [airflow] potiuk commented on a diff in pull request #28256: Include full path to Python files under zip path while clearing import errors.

2023-02-21 Thread via GitHub


potiuk commented on code in PR #28256:
URL: https://github.com/apache/airflow/pull/28256#discussion_r1113025147


##
airflow/dag_processing/manager.py:
##
@@ -782,7 +782,11 @@ def clear_nonexistent_import_errors(file_paths: list[str] | None, session=NEW_SE
         """
         query = session.query(errors.ImportError)
         if file_paths:
-            query = query.filter(~errors.ImportError.filename.in_(file_paths))
+            for file_path in file_paths:
+                if file_path.endswith(".zip"):
+                    query = query.filter(~(errors.ImportError.filename.startswith(file_path)))
+                else:
+                    query = query.filter(errors.ImportError.filename != file_path)

Review Comment:
   Simply speaking, what happens with the extra column is that we distribute the overhead connected with the path/DAG mapping to the time when the file is found and the entry gets created. By having this "source" information we effectively have a "cache" that (by using an exact-match index) we can query very, very efficiently.
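
   A minimal sketch of the idea, using a stand-in SQLAlchemy model rather than Airflow's real ImportError (the `filesystem_path` column is the hypothetical extra column being discussed here, not part of the actual schema):

from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class ImportErrorStub(Base):
    """Stand-in for airflow.models.errors.ImportError with the proposed column."""

    __tablename__ = "import_error_stub"
    id = Column(Integer, primary_key=True)
    filename = Column(String(1024))  # code loading target, e.g. /dags/foo.zip/bar.py
    filesystem_path = Column(String(1024), index=True)  # fs entry, e.g. /dags/foo.zip

def clear_nonexistent_import_errors_sketch(file_paths, session):
    # A single exact-match NOT IN against the indexed column; no per-path
    # startswith() filters are needed, even for files inside .zip archives.
    query = session.query(ImportErrorStub)
    if file_paths:
        query = query.filter(ImportErrorStub.filesystem_path.notin_(file_paths))
    query.delete(synchronize_session="fetch")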



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@airflow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [airflow] potiuk commented on a diff in pull request #28256: Include full path to Python files under zip path while clearing import errors.

2023-02-21 Thread via GitHub


potiuk commented on code in PR #28256:
URL: https://github.com/apache/airflow/pull/28256#discussion_r1113020259


##
airflow/dag_processing/manager.py:
##
@@ -782,7 +782,11 @@ def clear_nonexistent_import_errors(file_paths: list[str] | None, session=NEW_SE
         """
         query = session.query(errors.ImportError)
         if file_paths:
-            query = query.filter(~errors.ImportError.filename.in_(file_paths))
+            for file_path in file_paths:
+                if file_path.endswith(".zip"):
+                    query = query.filter(~(errors.ImportError.filename.startswith(file_path)))
+                else:
+                    query = query.filter(errors.ImportError.filename != file_path)

Review Comment:
   Yep. That's what I was afraid of. In the case where people have a lot of DAGs, the extra column seems to me the only possible solution to make this truly performant when there are many of them. The "IN" query takes advantage of the index speedup in this case. I am also afraid that when multiple filters are added, the SELECT query generated by SQLAlchemy might simply become too big to be efficiently parsed and passed to the database.
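
   For illustration, a stand-alone sketch (not Airflow code) of the two WHERE shapes under discussion: the per-path chain that the loop in this diff builds, versus the single NOT IN that the previous code produced, which can use the index on filename:

from sqlalchemy import and_, column

filename = column("filename")
paths = ["/dags/a.py", "/dags/b.py", "/dags/c.py"]

# What the loop in the diff builds: one AND-ed comparison per path.
chained = and_(*[filename != p for p in paths])
print(chained)  # filename != :filename_1 AND filename != :filename_2 AND ...

# What the previous code built: a single NOT IN over all the paths.
print(filename.notin_(paths))  # filename NOT IN (...)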






[GitHub] [airflow] potiuk commented on a diff in pull request #28256: Include full path to Python files under zip path while clearing import errors.

2023-02-20 Thread via GitHub


potiuk commented on code in PR #28256:
URL: https://github.com/apache/airflow/pull/28256#discussion_r1112190405


##
airflow/dag_processing/manager.py:
##
@@ -782,7 +782,11 @@ def clear_nonexistent_import_errors(file_paths: list[str] | None, session=NEW_SE
         """
         query = session.query(errors.ImportError)
         if file_paths:
-            query = query.filter(~errors.ImportError.filename.in_(file_paths))
+            for file_path in file_paths:
+                if file_path.endswith(".zip"):
+                    query = query.filter(~(errors.ImportError.filename.startswith(file_path)))
+                else:
+                    query = query.filter(errors.ImportError.filename != file_path)

Review Comment:
   Aren't we risking very long queries here @tirkarthi? Potentially there can be many of those files. I wonder if there are limits in SQLAlchemy / the conversion to SQL when there are thousands of AND conditions. Is it possible to test it on a fake DAG folder with, say, 10,000 DAG files (mysql/postgres)?
   
   For example this one - for SQL Server: https://stackoverflow.com/questions/1869753/maximum-size-for-a-sql-server-query-in-clause-is-there-a-better-approach suggests that IN (a,b,c) is nothing but chained OR conditions under the hood, and that such queries might easily fail due to stack size. It also suggests that nowadays stack sizes are deep, but I have a hunch that running a huge query like that might be terribly expensive:
   
   > Other than that, your query is limited by runtime conditions. It will usually run out of stack size because x IN (a,b,c) is nothing but x=a OR x=b OR x=c which creates an expression tree similar to x=a OR (x=b OR (x=c)), so it gets very deep with a large number of OR. SQL 7 would hit a SO [at about 10k values in the IN](http://support.microsoft.com/kb/288095), but nowadays stacks are much deeper (because of x64), so it can go pretty deep.
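
   One cheap way to probe that concern without a database (a sketch, assuming statement size is the thing to watch): build the chained filter for 10,000 fake paths and measure how large the compiled SQL gets before it is ever sent to mysql/postgres:

from sqlalchemy import and_, column

filename = column("filename")
paths = [f"/opt/airflow/dags/dag_{i:05d}.py" for i in range(10_000)]

# Compile the AND-chained filter with literal values, roughly as the
# database would receive it; the size grows linearly with the path count.
chained = and_(*[filename != p for p in paths])
sql = str(chained.compile(compile_kwargs={"literal_binds": True}))
print(f"{len(paths)} paths -> {len(sql)} characters of WHERE clause")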






[GitHub] [airflow] potiuk commented on a diff in pull request #28256: Include full path to Python files under zip path while clearing import errors.

2022-12-27 Thread GitBox


potiuk commented on code in PR #28256:
URL: https://github.com/apache/airflow/pull/28256#discussion_r1058096682


##
airflow/dag_processing/manager.py:
##
@@ -777,8 +777,9 @@ def clear_nonexistent_import_errors(self, session):
         :param session: session for ORM operations
         """
         query = session.query(errors.ImportError)
-        if self._file_paths:
-            query = query.filter(~errors.ImportError.filename.in_(self._file_paths))
+        files = list_py_file_paths(self._dag_directory, include_examples=False, include_zip_paths=True)

Review Comment:
   > I think there are two possible alternatives. One is to introduce a new attribute on DagFileProcessorManager that stores the "full" paths, so we can use it instead of `_file_paths` here. The other is to introduce a new column on ImportError that stores the filesystem path (i.e. the path to the zip file) so we can filter it against `_file_paths`.
   > 
   > The root issue here is that both `_file_paths` and `ImportError.filename` essentially have a double meaning—they both can represent the actual filesystem entry (path to an actual file) and a Python code loading target (path for the interpreter). Right now `_file_paths` is a list of filesystem entries, while `ImportError.filename` is a code target, and trying to compare them is fundamentally not a good idea.
   
   Agree. I think having filesystem_path in import errors is a good idea - and the column can likely be set automatically during migration, so it should be rather easy to do.
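
   A hedged sketch of what that migration could look like (table, column, and backfill details here are assumptions for illustration, not the actual Airflow migration): add the column, then derive the filesystem entry from the existing filename by cutting everything after a ".zip" component:

import sqlalchemy as sa
from alembic import op

def upgrade():
    op.add_column("import_error", sa.Column("filesystem_path", sa.String(1024)))

    import_error = sa.table(
        "import_error",
        sa.column("id", sa.Integer),
        sa.column("filename", sa.String),
        sa.column("filesystem_path", sa.String),
    )
    conn = op.get_bind()
    # Backfill: for zipped DAGs the filename looks like "/dags/foo.zip/bar.py",
    # so the filesystem entry is everything up to and including ".zip".
    for row in conn.execute(sa.select(import_error.c.id, import_error.c.filename)):
        fs_path = row.filename
        if ".zip/" in fs_path:
            fs_path = fs_path.split(".zip/", 1)[0] + ".zip"
        conn.execute(
            sa.update(import_error)
            .where(import_error.c.id == row.id)
            .values(filesystem_path=fs_path)
        )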






[GitHub] [airflow] potiuk commented on a diff in pull request #28256: Include full path to Python files under zip path while clearing import errors.

2022-12-27 Thread GitBox


potiuk commented on code in PR #28256:
URL: https://github.com/apache/airflow/pull/28256#discussion_r1057903720


##
airflow/dag_processing/manager.py:
##
@@ -777,8 +777,9 @@ def clear_nonexistent_import_errors(self, session):
         :param session: session for ORM operations
         """
         query = session.query(errors.ImportError)
-        if self._file_paths:
-            query = query.filter(~errors.ImportError.filename.in_(self._file_paths))
+        files = list_py_file_paths(self._dag_directory, include_examples=False, include_zip_paths=True)

Review Comment:
   Do we have an idea how to do it better? I have not looked at all the processing paths that are involved, but I am not sure how "expensive" this is - it seems to be done once per loop, and it is something we have to do periodically anyway (as we cannot really rely on an inotify kind of API, due to the various filesystems the DAG folder can be on).
   
   Maybe this check should happen independently, on another thread, in parallel to parsing? Any other ideas @ephraimbuddy?
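
   A very rough sketch of the separate-thread idea (names and wiring are assumptions, not actual manager code): run the stale-import-error cleanup periodically on a daemon thread, so the parsing loop never blocks on the directory walk:

import threading

def start_import_error_cleanup_thread(manager, interval: float = 30.0) -> threading.Event:
    """Periodically clear stale import errors off the main parsing loop."""
    stop = threading.Event()

    def _loop():
        # wait() doubles as the sleep; it returns True (ending the loop)
        # once the stop event is set on shutdown.
        while not stop.wait(interval):
            manager.clear_nonexistent_import_errors()

    threading.Thread(target=_loop, name="import-error-cleanup", daemon=True).start()
    return stop

   Calling stop.set() during shutdown would end the loop cleanly before the next scan.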


