rawwar opened a new pull request, #57238:
URL: https://github.com/apache/airflow/pull/57238
Performing an `in` membership check on a list is not optimal when there is a large number of files. So, we build a set from `self._file_queue` and do the membership checks against that instead.
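For context, the change is essentially a membership-check swap. Here is a minimal sketch of the pattern (a hypothetical helper for illustration, not the exact manager code), assuming the queue is a `deque` of `DagFileInfo` objects and that `DagFileInfo` is hashable (the benchmark's `set(queue)` below relies on the same assumption):
```
from collections import deque


def files_not_in_queue(files, file_queue: deque) -> list:
    """Return the entries of ``files`` that are not already queued.

    Hypothetical helper for illustration only; the PR applies the same
    idea inside the dag-processing manager.
    """
    # Before: `f not in file_queue` scans the deque for every file,
    # i.e. O(len(files) * len(file_queue)) comparisons in total.
    # After: build the set once (O(len(file_queue))), then each lookup
    # is O(1) on average.
    queue_set = set(file_queue)
    return [f for f in files if f not in queue_set]
```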
I ran a quick benchmark (just to get a sense of the gain):
```
import timeit
from collections import deque
from pathlib import Path
from airflow.dag_processing.manager import DagFileInfo
queue = deque(
    DagFileInfo(Path(f"dag_{i}.py"), f"bundle_{i%10}", f"{i%5}.0") for i in range(400)
)
files = [DagFileInfo(Path(f"dag_{i}.py"), f"bundle_{i%10}", f"{i%5}.0") for i in range(2000)]

def test_current():
    return [f for f in files if f not in queue]

def test_optimized():
    queue_set = set(queue)
    return [f for f in files if f not in queue_set]

iterations = 1000
current = timeit.timeit(test_current, number=iterations)
optimized = timeit.timeit(test_optimized, number=iterations)
print(f"Current: {current:.3f}s ({current/iterations*1000:.2f}ms/call)")
print(f"Optimized: {optimized:.3f}s ({optimized/iterations*1000:.2f}ms/call)")
print(f"Speedup: {current/optimized:.1f}x faster")
```
```
Current: 97.487s (97.49ms/call)
Optimized: 0.281s (0.28ms/call)
Speedup: 346.6x faster
```
I assumed there can be about 400 files in the queue while parsing about 2k files in total across dag bundles. I intentionally chose a high number of DAGs to make the performance gain visible.
I'm keeping this in draft while I check how large the queue can actually get when parsing 2k files.